"A subtle bug in the script caused the command to reinstall a small number of active machines. Unfortunately, some master-slave pairs were impacted which resulted in the site going down." - Akhil Gupta, Head of Infrastructure
It got us thinking here at UpGuard about what conclusions and lessons for DevOps we could draw from this all-too-common problem (thanks to @schappi for the prompt!).
3 DevOps Lessons Learned from the Dropbox Outage
1. You get what you design for. Sound familiar? It is a great point made in The Phoenix Project by my friend @RealGeneKim. Everyone knows how hard it is to scale a system or service fast (think Twitter, Pinterest, Facebook & Dropbox). Everyone is moving at warp speed, which invariably causes painful changes, and everyone jumping in to resolve problems makes matters even worse. In the case of the Dropbox outage, the fix around distributed state verification came a little too late. Perhaps more of us will practice system state comparisons and verification moving forward.
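To make "system state comparison" concrete, here is a minimal sketch in Python. It assumes each machine's state can be flattened into simple key/value pairs; the function name `diff_state` and the keys `role` and `mysql` are made up for illustration and are not taken from Dropbox's tooling or from UpGuard.

```python
from typing import Dict, Optional, Tuple

# Hypothetical representation: a machine's state as flat key/value pairs.
State = Dict[str, str]


def diff_state(expected: State, observed: State) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (expected_value, observed_value)} for every key that differs.

    A key missing on one side shows up with None on that side.
    """
    drift = {}
    for key in expected.keys() | observed.keys():
        want = expected.get(key)
        have = observed.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# What the upgrade script believes about a machine vs. what is actually running.
expected = {"role": "decommissioned", "mysql": "stopped"}
observed = {"role": "decommissioned", "mysql": "running"}

drift = diff_state(expected, observed)
if drift:
    print("State drift detected, do not proceed:", drift)
```

An empty result means the machine looks the way the script expects; anything else is a reason to stop and look before touching it.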
2. Any improvement not made at the constraint is an illusion. This is one of the most salient quotes from The Phoenix Project in my opinion. Engineering teams rely heavily on specialists whose system knowledge is often difficult to transfer. Everyone knows that this should be done before X is implemented, that they need to read the documentation (or the always familiar RTFM), but the reality is that most people don't have time. In the case of the Dropbox outage, having end-to-end visibility of their environments and systems (and having them documented for all to see) could have helped them avoid this issue from the get-go.
3. Hindsight is 20/20. Or sometimes it is more like 50/50! Nevertheless, having a basic pre-flight checklist, particularly in large distributed systems, could have told Dropbox administrators whether a system was ready to take on the update. From our experience, though, these sorts of checklists are incredibly difficult to orchestrate and integrate into the work cycle of companies. Of course, such a checklist is often not considered until an incident occurs. Do you have a well-documented pre-flight checklist? If not, you should.
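A pre-flight checklist does not have to be elaborate. The sketch below expresses one as a list of named checks that a machine must pass before a reinstall is allowed. Everything in it is an assumption made for illustration: the `Machine` fields, the two checks, and the hostnames are hypothetical, and a real checklist would pull these facts from live systems rather than from hard-coded values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Machine:
    # Hypothetical facts about a machine; in practice these would be
    # queried from monitoring or inventory systems.
    hostname: str
    is_active: bool
    has_recent_backup: bool


@dataclass
class Check:
    description: str
    passes: Callable[[Machine], bool]


# Illustrative checklist; the specific checks are assumptions.
PREFLIGHT: List[Check] = [
    Check("machine is not serving live traffic", lambda m: not m.is_active),
    Check("a recent backup exists", lambda m: m.has_recent_backup),
]


def failed_checks(machine: Machine, checklist: List[Check] = PREFLIGHT) -> List[str]:
    """Return the description of every check the machine fails."""
    return [c.description for c in checklist if not c.passes(machine)]


def safe_to_reinstall(machine: Machine) -> bool:
    """A machine is safe only if it passes every check on the list."""
    return not failed_checks(machine)


idle = Machine("db-spare-1", is_active=False, has_recent_backup=True)
active = Machine("db-master-1", is_active=True, has_recent_backup=True)

for machine in (idle, active):
    problems = failed_checks(machine)
    status = "OK" if not problems else "BLOCKED: " + "; ".join(problems)
    print(machine.hostname, status)
```

Reporting every failed check, rather than stopping at the first, gives whoever is running the update the full picture in one pass.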
Root cause determination is often the hardest thing to pin down and validate because the body of knowledge discussed in point 2 is so difficult to transfer. As a result, the specialist who knows something about the system is the poor sucker who gets woken up at 2 am. Kudos to the Dropbox team for identifying with precision the cause of the outage and for their willingness to take responsibility. Dropbox's Gupta commented that "when running infrastructure at large scale, the standard practice of running multiple slaves provides redundancy. However, should those slaves fail, the only option is to restore from backup." It appears that Dropbox is now checking the state of a running server during updates to see whether its data is in active use (a red flag that should have protected the production servers from this outage). Cool!
So someone asked us if the Dropbox outage could have been prevented if they were using UpGuard. My answer is an emphatic yes. We built UpGuard with exactly this scenario in mind. Unlike custom scripts or esoteric domain-specific languages, UpGuard gives the whole team one place to see the distributed state of an entire system. Feel free to try it out yourself.