Misconfigurations are a major cause of problems in all areas of IT, from Development through to Operations. They can be responsible for data loss, security breaches, even total service outages.
Even the most respected and best-known companies can be affected by problems caused by poor system configuration. 2012 saw high-profile service interruptions and security alerts affecting some of the biggest names in business today, and it is surprising that most of these heavily automated environments did not have configuration management practices in place that were good enough to catch the errors.
Skype was one of the big names hit by outages in 2012. The VoIP provider had been suffering from technical problems, some of them causing service interruptions, for some time. One outage in May forced the company to release a new version of its client software. In June, however, improperly configured systems resulted in widespread disconnections and frustrated users. Skype is largely a free service, so it is difficult to quantify the cost of an outage to the company, although some commercial clients complained of losing hundreds or thousands of dollars as a result.
Amazon is the leading provider of cloud computing through Amazon Web Services (AWS), which includes its Elastic Compute Cloud (EC2). The internet retail giant provides services to an impressive list of clients including Netflix, Pinterest, Instagram and Heroku, among others.
In both 2011 and 2012, Amazon was hit by outages severe enough to take down many of these well-known sites for some time. While some of the service interruptions were the result of natural events, Amazon admitted that others (such as the October 2012 outage) were attributable to service misconfigurations.
Exactly how much the outages cost is unclear, but for the companies that rely on the service day to day the total will certainly have run to tens, if not hundreds, of millions of dollars.
The sites that suffered downtime due to Amazon's problems also highlight the need for failsafes in cloud computing.
If their systems had been configured for distributed high availability, they could potentially have remained online.
Until August 2012, the Knight Capital Group had benefited significantly from the trend towards the digitization of stock market trading. All that changed for the market maker when a misconfigured software installation caused the company to rapidly buy and sell millions of shares in more than one hundred different stocks. For three quarters of an hour after the opening of the New York Stock Exchange, the rogue program disrupted trading for the entire market, chaotically affecting the market value of several well-known companies. Even after the software was brought under control, Knight Capital was left holding the bill for a significant number of overvalued shares, which it then had to sell back.
The disruption cost Knight Capital around $10 million per minute for a total of $440 million.
Microsoft's ambitious public cloud development and hosting platform, Windows Azure, suffered an outage in August 2012 that was traced back to a poorly implemented configuration. A safety mechanism designed to prevent cascading network failures by capping the number of connections was found to be at fault: a misconfiguration of that mechanism prevented users from connecting at all. The two-and-a-half-hour outage was a major source of embarrassment for Microsoft, which was trying to entice new customers to use its services. To Microsoft's credit, the software giant reached into its own pockets and compensated disgruntled users with credits.
Users of Gmail, Chrome and other Google services had an unpleasant surprise in early December 2012 when they were confronted with error messages, browser crashes and drastically slowed performance. Google tracked the issue back to a bad configuration in its load-balancing setup.
The fact that even the most powerful and technologically advanced companies can be hit by IT problems highlights the importance of properly configured systems in every area of IT. The outages again show the need for companies to include compliance checks and testing in their development and production acceptance processes. At UpGuard, our tools and services for automating the configuration testing element of the software development lifecycle help companies avoid misconfiguration errors by baking configuration validation in at every stage.
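To make the idea of configuration validation concrete, here is a minimal sketch in Python of a check that could run before a configuration reaches production. It is an illustration only, not UpGuard's tooling or the method used by any company mentioned above. The setting names (`max_connections`, `load_balancer_backends`), the bounds, and the `validate_config` function are all hypothetical, chosen to echo the connection-cap and load-balancing failures described earlier.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Rule:
    """One validation rule: a setting name, a check, and an error message."""
    key: str
    check: Callable[[Any], bool]
    message: str


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


# Hypothetical rules; the keys and limits are assumptions for illustration.
RULES: List[Rule] = [
    Rule(
        key="max_connections",
        check=lambda v: _is_int(v) and 1 <= v <= 100_000,
        message="must be an integer between 1 and 100000",
    ),
    Rule(
        key="load_balancer_backends",
        check=lambda v: isinstance(v, list) and len(v) >= 2,
        message="must list at least two backends",
    ),
]


def validate_config(config: Dict[str, Any], rules: List[Rule] = RULES) -> List[str]:
    """Return a list of human-readable errors; an empty list means the config passed."""
    errors: List[str] = []
    for rule in rules:
        if rule.key not in config:
            errors.append(f"{rule.key}: missing required setting")
        elif not rule.check(config[rule.key]):
            errors.append(f"{rule.key}: {rule.message}")
    return errors


if __name__ == "__main__":
    # A cap of zero would block every connection, so the check rejects it.
    candidate = {"max_connections": 0, "load_balancer_backends": ["lb-1", "lb-2"]}
    for error in validate_config(candidate):
        print(error)
```

A check like this could run as a step in a build pipeline and fail the build whenever the returned list is not empty, so that an out-of-range value is caught before deployment rather than during an outage.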