So I've finally gotten the go-ahead from higher-ups to join the twenty-first century and use cloud hosting. Now I need to prove that running in AWS is not just easier than maintaining our own farm, but more stable and secure. To do this, I need to be able to monitor each of my instances for configuration drift, ensure that they are properly provisioned, and maintain visibility into dependencies like load balancers and security groups. Fortunately, UpGuard provides all of this information, so if something does go wrong I can catch it before someone else does.
I have several hundred servers, so I will use UpGuard’s EC2 discovery option to start monitoring them en masse. UpGuard pulls a list of my instances along with metadata I can use to filter the list down and assign them all to the appropriate environments in a few clicks.
After scanning all the machines, UpGuard classifies them based on properties like operating system family and Linux distribution (in this case, I’m using Ubuntu). I can also see a visualization of the most recent changes on each node. Immediately, one thing in this group raises a red flag: Cluster Node 4 does not look like the others, suggesting it did not receive the same updates as its peers.
To get a more granular view I go to the baseline configuration scan. The most common types of items are all here: packages, ports, files, users. Because this node was discovered from AWS, I also see an AWS metadata section.
In here I see a few pieces of information that I regularly have to check, like instance type and security groups.
Looks like someone forgot they were provisioning at work and not deving at home. The runbook for deploying in this cluster specifies instances should be m3.mediums, but this server is a t2.micro. Insufficient resources in this cluster could easily have resulted in service failures under heavy traffic.
Before jumping over to the AWS console to fix that, I’m going to create a policy to ensure that I’m alerted if this happens in the future. I right-click to create a policy check on this item and then edit the policy to check for the desired value: m3.medium. I’ll also leave a somewhat helpful note about why this needs to be done correctly (reason: it is our job to do things correctly).
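Conceptually, a policy check like this is just an assertion on one field of the scanned metadata. Here is a minimal Python sketch of the idea, assuming the scan result is available as a plain dictionary. The key names, the `check_policy` helper, and the sample values are hypothetical illustrations, not UpGuard's actual data model or API.

```python
def check_policy(metadata, expected):
    """Return a (key, expected, actual) tuple for every value that fails the policy."""
    failures = []
    for key, want in expected.items():
        actual = metadata.get(key)
        if actual != want:
            failures.append((key, want, actual))
    return failures


# Hypothetical scan result for Cluster Node 4 (field names are assumptions).
node4 = {"instance_type": "t2.micro", "security_groups": ["sg-example"]}

# The runbook says instances in this cluster should be m3.medium.
policy = {"instance_type": "m3.medium"}

for key, want, actual in check_policy(node4, policy):
    print(f"FAIL {key}: expected {want}, found {actual}")
```

An empty list means the node passes, so the same helper could be run against every node in the cluster after each scan.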
I can also see which security group(s) this node belongs to. I’ll likely want to come back and add some policy checks for those as well, but first I have more basic questions about the integrity of these servers.
To understand how similar the nodes in this cluster are, I’ll perform a group diff. Thanks to the dynamic classification during the discovery process I can just go to that group and click “diff this group.”
While the nodes in this cluster are supposed to be identical, I can see that they are not. As it turns out, a number of files and packages are missing from only one server, Cluster Node 4, the same one that raised an eyebrow during our initial assessment. That’s another thing I’ll want to cover with policies using the same method as before.
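The idea behind a group diff can be sketched in a few lines of Python: treat whatever most nodes have as the expected state, then report what each node lacks. This is only an illustration of the concept under that majority-vote assumption; the `group_diff` helper and the package names are hypothetical, and nothing here reflects how UpGuard computes its diff.

```python
from collections import Counter


def group_diff(nodes):
    """Map each outlier node to the items most of its peers have but it lacks."""
    # Count how many nodes carry each item (sets avoid double counting).
    counts = Counter(item for items in nodes.values() for item in set(items))
    majority = {item for item, n in counts.items() if n > len(nodes) / 2}
    return {
        name: sorted(majority - set(items))
        for name, items in nodes.items()
        if majority - set(items)
    }


# Hypothetical package inventories for a four-node cluster.
cluster = {
    "Cluster Node 1": {"nginx", "openssl", "curl"},
    "Cluster Node 2": {"nginx", "openssl", "curl"},
    "Cluster Node 3": {"nginx", "openssl", "curl"},
    "Cluster Node 4": {"nginx"},
}

for name, missing in group_diff(cluster).items():
    print(f"{name} is missing: {', '.join(missing)}")
```

A majority vote works when one or two nodes have drifted; if half the cluster is wrong, an explicit policy is the safer reference point.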
Lastly, I can check out other resources in my AWS account. For items where I don’t yet have a specified correct value (that is, I don’t have a runbook telling me what should be in a policy), I can use the baseline scan to start constructing standards for things like a load balancer’s unhealthy threshold and health check interval.
I’ve also spent an unhealthy amount of time in recent memory verifying that we’ve spun up new instances for our customers in the correct zones, and that’s also something I can enforce through policy checks here.
For items where I don’t yet have clear directions on the desired state, I can use the change reporting to find the things that require more attention. For example, I can compare the current state of our account to previous scans to see which instances are being spun up and down, and when security groups are added, removed, or modified.
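Comparing two scans boils down to a keyed diff. The sketch below assumes each scan is a dictionary mapping a resource identifier to its settings; the `diff_scans` helper, the identifiers, and the rule format are all made up for illustration and are not UpGuard's report format.

```python
def diff_scans(previous, current):
    """Classify resources as added, removed, or modified between two scans."""
    prev_ids, curr_ids = set(previous), set(current)
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "modified": sorted(
            rid for rid in prev_ids & curr_ids if previous[rid] != current[rid]
        ),
    }


# Hypothetical security group snapshots from two consecutive scans.
last_week = {
    "sg-web": {"inbound": ["tcp/443"]},
    "sg-db": {"inbound": ["tcp/5432"]},
    "sg-old": {"inbound": ["tcp/22"]},
}
today = {
    "sg-web": {"inbound": ["tcp/443", "tcp/80"]},  # rule added
    "sg-db": {"inbound": ["tcp/5432"]},            # unchanged
    "sg-new": {"inbound": ["tcp/8080"]},           # new group
}

for change, ids in diff_scans(last_week, today).items():
    print(f"{change}: {ids}")
```

The same comparison applies to instances: key the scans by instance ID and the "added" and "removed" lists show what was spun up or down between scans.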
By baselining my Amazon account, monitoring for changes, and creating policies for load balancers and security groups, I can fully integrate my Amazon environment into the change management process.
Not only will I know that servers are correctly provisioned, but I can tell whether changes take them out of compliance. And when an incident occurs I’ll be able to tell whether it was because of a change on a server’s configuration or a change in Amazon itself. Running in the cloud certainly has its complexities, but now I’m confident that I have the visibility to manage it.
For another interesting AWS tale, see how we got tarpitted when testing AWS permissions programmatically.