Imagine this — you're rolling out a new version of your web app. Works great in the dev environment, and it's been signed off on in staging, so it gets rolled out to production. Things seem fine, so you call it a night.
Then the support requests begin flooding in. Something's broken somewhere, and it's not immediately obvious what. The performance monitors show the machines are running fine, so it can't be a resource issue. Ah well, better crack one of those neon-colored energy drinks: it's time to roll back, log into the machines, and dig through logs and config files for a potential cause. "How could this be happening?" you ask. "I mean... these machines are all configured the same, right?"
Often, that's wrong.
Configuration drift is a very real and increasingly common problem, especially in growing environments. In a way, you can call it the "hidden cost of complexity," and there are a number of causes behind it.
- Well-meaning team members could've updated something to a new version, installed a conflicting package or service, or applied a fix thought to be minor.
- Software or OS updates applied here but not there could've thrown everything out of whack.
- A tiny change in a far-flung config file could be the metaphorical butterfly that flapped its wings.
- Changing settings or firmware on a network device may affect some or all clients connected through it.
- A machine could've been compromised in a way that isn't obvious.
- Space aliens.
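However drift creeps in, the detection idea is the same: fingerprint each config file on each machine and flag any file that doesn't match across hosts. Here's a minimal sketch of that approach in Python; the hostnames, filenames, and snapshot data are hypothetical, and in practice the snapshots would come from an agent or remote scan rather than a hard-coded dict.

```python
import hashlib

# Hypothetical config snapshots from three supposedly identical web servers.
configs = {
    "web-01": {"nginx.conf": "worker_processes 4;", "app.env": "DEBUG=false"},
    "web-02": {"nginx.conf": "worker_processes 4;", "app.env": "DEBUG=false"},
    "web-03": {"nginx.conf": "worker_processes 8;", "app.env": "DEBUG=true"},
}

def fingerprint(text: str) -> str:
    """Hash a file's contents so versions can be compared without storing them."""
    return hashlib.sha256(text.encode()).hexdigest()[:12]

def find_drift(configs):
    """Return {filename: {fingerprint: [hosts]}} for files that differ across hosts."""
    drift = {}
    all_files = {name for snapshot in configs.values() for name in snapshot}
    for name in sorted(all_files):
        groups = {}
        for host, snapshot in configs.items():
            fp = fingerprint(snapshot.get(name, "<missing>"))
            groups.setdefault(fp, []).append(host)
        if len(groups) > 1:  # more than one distinct version means drift
            drift[name] = groups
    return drift

for filename, groups in find_drift(configs).items():
    print(f"{filename} has drifted:")
    for fp, hosts in groups.items():
        print(f"  {fp}: {', '.join(sorted(hosts))}")
```

A script like this immediately surfaces that web-03 disagrees with its siblings on both files, turning "these machines are all configured the same, right?" from an assumption into something you can actually check.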
And as wildly varied as the causes can be, the potential effects are even worse. We're talking downtime, failed infrastructure, loss of data, loss of business, and even loss of customer trust.
One reason configuration drift isn't more widely discussed in IT is precisely that wide variation in causes and effects: something with a thousand possible causes and a thousand possible effects is hard to pin down as a single phenomenon. It isn't as easy to define and fight as, say, viruses or hardware failure. Viruses are things we can point to and say, "These are bad, here's how they spread, and here's how you protect yourself." And we all know what hardware failure looks like and how to mitigate it when it happens.
Another reason for not discussing config drift is probably that—until recently—there hasn't been a single solution for preventing or dealing with it.
UpGuard directly combats configuration drift by continually scanning and monitoring your configs across practically every platform and device. It's a robust, collaborative platform with tools to graphically identify differences and potential hazards, and to alert you when something goes awry. Reports can be exported to PDF for auditing or compliance purposes, and configs you verify as good can be exported to Chef, Docker, Ansible, and Puppet for automation.
And when we say "collaborative," we mean it. We designed UpGuard from the ground up to be simple enough to be a valuable tool for every stakeholder. Nodes and their differences are represented graphically, in an easy-to-navigate interface that's useful no matter your background.