Monitoring Database Nodes to Remediate Configuration Drift
May 3, 2017
6 minute read
Controlling database configuration drift is a tricky subject. It's a topic that comes up frequently for us here at UpGuard and customers are always keen to know how they can go about taking control and simplify their configuration management processes. We've all experienced at some time or another that issue that was the result of a database migration that didn't complete, a column that has mysteriously changed data type or an old version of a stored proc or view being restored to a new database.
It's a costly and time consuming procedure to track these issues down but the irony is that a complete application failure is probably your best case scenario as it points to the problem immediately instead of hiding away in the shadows causing chaos behind the scenes.
The worst bug is not the one that brings your site down, it is the bug that affects your data over time like a virus causing damage that is difficult to repair.
With that in mind, today I'm going to run through a simplified example of how we can use UpGuard to prevent and diagnose this type of issue.
You receive a call from the sales team informing you that the sales reports being exported from the production and QA systems are not matching up. They have manually trawled through the monthly sales figures and everything looks normal.
The replication from prod to QA seems to have run without error for as long as the logs show. You inspect the reports and indeed the numbers just don't add up.
Something isn't right here.
Luckily, UpGuard is on hand to diagnose the problem.
Step 1: Export The Database Schema
We're going to want to compare the two databases, so we'll need to see the schema. There's several ways you can achieve this and some may be more suitable to your needs than others but in the interest of creating a free and re-usable method that you can run your self, I'm going to use a PowerShell script to export mine. You can get a copy of my script from the gist below.
This PowerShell script will connect to a local database and export the schema for tables, stored procs, views, functions, triggers and db triggers to individual files on the file system.
We will tell UpGuard to include these files when it performs a system scan so that we can compare them between servers.
How you decide to schedule the creation of these files is up to you. Using the task scheduler would be the simplest method but a deployment hook would allow you to be more targeted. If you'd like more guidance on this get in touch via the comments.
Step 2: Include The Files In The Scan
Next, we want to tell UpGuard to include these schema files in the scan. These scans will run daily so you can compare these files over time, but for this example, we are going to run the scans manually.
Simply add the paths to these 6 files into the scan options for this server, and click scan.
NB: Adding the files individually to the scan options instead of using a syntax like *.sql to include them all at once tells UpGuard that we would like to include the entire contents of the file in the scan, and not just the MD5 checksum)
This step needs to be performed on the two servers you want to compare so that they both have these files included in their scans.
NB: You can also apply these settings at a group level in UpGuard
Step 3: Compare The Servers
Now, the fun begins. We head back to the node page for our production server and compare to our problem QA machine.
Immediately, we can see a number of potential areas to investigate. The green sections indicate items that exist on the production machine but not on QA and the orange items indicate items that exist on both, but differ.
Visual Difference of the Production and QA Servers
Let's switch to the table view for easier reading. We can click the "Diff Only" button to show only the items that have changed which allows us to zoom in and see only the important items.
Table View Diff - A = Prod, B = QA
Looking through this list, we can see a few things that stand out, but mostly things that we can disregard for this specific case.
We can be pretty sure a new hotfix, a service user and some other services aren't causing our mismatching number issue, although they do raise the question as to why these are different from our QA server. These servers are supposed to be identical, right? We can dig deeper into that later. For now, lets get back to our problem at hand.
Down the bottom of that list, we can see something of interest. The file "sales_stored_procs.sql" that was generated from our schema export PowerShell script is reported as being modified, meaning the stored procedures on these two machines are not the same.
Let's click "Files" from the wheel view and zoom in to show only the files:
Step 4: View The File Difference
We've detected a change between the two files that represent the stored procedures on these two machines. That's already enough for us to go off, but in many cases, these machines may have dozens of stored procs each containing business logic and number crunching equations that you really can't be bothered sorting through.
We really need to just show the exact lines that have changed. UpGuard has you covered here, simply click that "File Diff" button.
The problem is staring us right in the face about now.
The "GetMonthlySales" stored proc from the problem QA server (on the left) is not including the discount in the summed total. The proc from production on the right clearly is.
This would definitely cause our mismatched figures and we know exactly how to fix the problem now.
Nodes are scanned daily by default on UpGuard and as we had added these files to the scan earlier, we could have gone back in time and determined exactly when this stored procedure changed.
We've detected a small but critical bug in our SQL configuration in a matter of minutes. I know from experience that tracking down this type of issue without a tool like UpGuard can take hours, if not days. A typical production database will have dozens of tables, triggers, stored procs and views that your applications depend on heavily. Keeping track of them across environments is no mean feat.
Using this technique you can detect changes in all of these and react accordingly.
We've really only scratched the surface here in terms of UpGuard capability. From here we could move onto to creating policies that will perform queries against the database every 15 minutes and ensure an expected output. We could also create security policies to ensure that our SQL Server is CIS compliant and be alerted if we fall out of compliance.
With UpGuard in your toolbelt the possibilities are endless. We'll dig deeper into some of them in upcoming posts.