UpGuard Blog

Gmail outage reminds us that IT automation is not enough

Written by UpGuard | Jan 27, 2014 6:42:00 PM

So you do a bit of IT automation. Maybe you throw in some functional testing for that IT automation too. You have monitoring. You have a top-notch engineering team. You're doing enough then, right? Nothing could go wrong?

I mean, you may not be Google, but... oh, hmm, yeah.

Our reminder of the week that configuration management is hard, that it is prone to errors, and that those errors can have a massive impact comes from none other than the auteurs of automation, the commanders of complexity themselves, Google.

Yep, thanks to a bug in "an internal system that generates configuration", Gmail was down for anywhere from 20 to 50 minutes on Friday.

"The incorrect configuration was sent to live services..., caused users’ requests for their data to be ignored, and those services, in turn, generated errors," said VP of Engineering Ben Treynor.

I don't need to explain what up to 50 minutes of Gmail downtime means. Consider the event a public service announcement, though (a follow-up to Dropbox's outage the previous week). Configurations matter. You need to monitor your configurations for drift, and you should control them through executable policies, for compliance as well as for functionality.

You can have the best systems in the world, and the smartest people to run them, but it won't mean a thing if your configuration isn't solid.

The good news? There's at least one thing Google didn't have that you can have today: UpGuard (shameless plug). UpGuard simplifies configuration management and helps you activate DevOps.