Whether or not you use it, you are probably familiar with Twitter, the popular microblogging service. With over 200 million users, maintaining its infrastructure is no easy task. The service has been plagued by several outages recently, including one this week. A product with a die-hard user base can face severe backlash for even the briefest outage.
Each time it happens, you can count on seeing a wave of tweets about it.
As IT professionals, our first thought is, “What the heck caused the outage?” Until the final word from Twitter comes out, we decided to put our own assumptions to work.
This week’s 25-minute outage was supposedly caused by a glitch in a “routine change.”
Shouldn’t routine changes follow a standardized set of procedures, so that they are performed exactly the same way every time, with verification that the change succeeded in each environment?
So how could Twitter have prevented this outage, and potentially avoided several more in the future?
At companies of Twitter’s scale, large teams run IT operations. Critical system information usually lives in employees’ heads and becomes more scattered across the organization as team headcount increases.
While Twitter may have adopted DevOps philosophies, collaboration among the teams, especially on system knowledge, might still be broken. IT automation tools don’t help if the entire team cannot collaborate on configuration and put its system knowledge to work.
What Twitter really needs is a way to capture the entire IT team’s system knowledge, so that instructions are written in collaboration with everyone. Those instructions should be written once and reused many times. When every change runs the same shared, reviewed instructions, quality is bound to improve.
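To make the "write once, reuse many times" idea concrete, here is a minimal sketch in Python. It is purely illustrative and says nothing about how Twitter actually manages changes: the `Step` type, the `run_change` helper, the dictionary standing in for an environment, and the two example steps are all hypothetical names invented for this post. The point is simply that each instruction carries its own verification, and the same list of instructions runs unchanged against every environment.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption for illustration: an environment is just a dict of settings.
Environment = Dict[str, object]


@dataclass
class Step:
    """One instruction in a routine change, paired with its own check."""
    name: str
    apply: Callable[[Environment], None]
    verify: Callable[[Environment], bool]


def run_change(steps: List[Step], env: Environment) -> List[str]:
    """Apply steps in order, verify each one, and stop at the first failed check."""
    log: List[str] = []
    for step in steps:
        step.apply(env)
        if not step.verify(env):
            log.append(f"FAILED: {step.name}")
            break  # do not continue a change that could not be verified
        log.append(f"ok: {step.name}")
    return log


# Hypothetical routine change: raise a connection limit, then enable a flag.
def set_limit(env: Environment) -> None:
    env["max_connections"] = 500


def limit_ok(env: Environment) -> bool:
    return env.get("max_connections") == 500


def enable_flag(env: Environment) -> None:
    env["feature_enabled"] = True


def flag_ok(env: Environment) -> bool:
    return env.get("feature_enabled") is True


# Written once by the team, reused for every environment.
ROUTINE_CHANGE = [
    Step("raise connection limit", set_limit, limit_ok),
    Step("enable feature flag", enable_flag, flag_ok),
]

if __name__ == "__main__":
    for name in ("staging", "production"):
        env: Environment = {"name": name}
        print(name, run_change(ROUTINE_CHANGE, env))
```

Because the steps live in one shared list rather than in someone's head, anyone on the team can review or extend them, and a failed verification halts the change before the remaining steps run.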