We received a lot of positive feedback on our last article, Controlling SQL Configuration Drift, so we thought it would be a good idea to continue the theme with an article about DNS configuration and some simple steps you can take to prevent configuration drift.
DNS configuration drift is another one of those tricky things to diagnose. DNS operates transparently, right under our noses, but many people don't completely understand how it works or how to troubleshoot it when something goes wrong.
With propagation times, failover, local vs. domain priority and record type confusion all contributing to the difficulty of drilling into a problem, an issue can quickly spiral out of control.
All of a sudden your nslookup and dig skills just aren't cutting it anymore.
This article is going to run through a simplified example of how we could diagnose a DNS problem and come out with a solution. For this example I'm going to be using a Windows DNS Server, although the exact same concept could be applied to BIND or any other platform.
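Here are the steps we're going to take: export the DNS zone to a text file, scan that file with UpGuard on a daily basis, and then use the resulting history to diagnose a problem.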
Exporting the DNS zone information to a text file is going to allow us to scan and analyze it with UpGuard. I've created a small, simple batch file you can run on a daily basis to export your DNS config.
[gist id="9925130" file="dns_export.bat"]
This saves a .txt file under the %windir%\System32\dns directory using the dnscmd command line tool. The file is named after your DNS zone with a .txt extension, e.g. yourdomain.com.txt.
The dnscmd tool will not overwrite an existing file, which is why this small batch script deletes the old file before creating the new one.
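For readers who prefer a script to a batch file, the same delete-then-export logic can be sketched in Python. This is a minimal sketch, not the contents of the gist: it assumes `dnscmd` is on the PATH of the DNS server and that exports land in %windir%\System32\dns as described above. The function names (`export_path`, `build_export_command`, `export_zone`) are made up for illustration.

```python
import ntpath
import os
import subprocess


def export_path(zone, windir=r"C:\Windows"):
    """Return where dnscmd is expected to write the export for `zone`."""
    # ntpath builds Windows-style paths regardless of where this runs.
    return ntpath.join(windir, "System32", "dns", zone + ".txt")


def build_export_command(zone):
    """Build the dnscmd call that exports `zone` to <zone>.txt."""
    return ["dnscmd", "/ZoneExport", zone, zone + ".txt"]


def export_zone(zone):
    """Delete any previous export, then run dnscmd (Windows DNS server only)."""
    target = export_path(zone, os.environ.get("windir", r"C:\Windows"))
    # dnscmd will not overwrite an existing file, so remove it first.
    if os.path.exists(target):
        os.remove(target)
    subprocess.run(build_export_command(zone), check=True)
    return target
```

Calling `export_zone("acmenet.com")` on the DNS server would then stand in for the batch file; the two helper functions are kept separate so the path and command can be checked without touching the server.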
Schedule this batch file to run daily on your DNS server using Task Scheduler, an UpGuard policy or your favorite scheduler.
Open up UpGuard and go to the node page for your DNS server. Click the "Scan" button and enter the path to the newly exported zone file in the "Scan Directories" input.
In this case you can see our zone export file is named "acmenet.com.txt"
Adding the full path into this box tells UpGuard that we want to scan the entire contents of this file for use with the file diff functionality.
Alright, now we're all set up. UpGuard is going to scan this zone file on a daily basis, giving us a history to work with, so let's run through a scenario and see how we could diagnose it with what we have put in place.
It's 6am and you get a call informing you that a change that went ahead last night was completed successfully, but people have noticed this morning that the data in one of your core business systems is not right. It appears to be out of date by several days. The DBA and app teams have been engaged and they are all playing the "it's not my problem" game, insisting that everything looks good from their end.
First we open up our environment dashboard.
This dashboard is a top down view of the environment that can indicate where changes have occurred since the day before. It visualizes our environment based on the daily scans that take place automatically.
Immediately, we can see there is a change on our DNS server, SRDNS01. The orange on the "Files" segment indicates a modification to one or more files.
We can click on this segment to drill down into that particular server.
We can see the file standing out here pretty clearly. We can click the orange segment to display more information, or we can switch to a table view for an easier-to-read, text-based table.
So, we know from this that our DNS Zone file is showing changes compared to the day before.
Let's take a look by clicking the "File Diff" button.
Now we're talking.
The first two orange lines are indicating that the Zone version has incremented by 1 from version 18 to version 19. So there has definitely been a change.
The third orange line is the icing on the cake.
It shows us that the DNS CNAME record for prod-appdb01 has had its target changed from prod-sqlsrv01 to test-sqlsrv01.
The left-hand side is the previous version of the file; the right-hand side is the current one.
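If you want to reproduce this kind of comparison by hand, two saved copies of the export can be compared with Python's standard `difflib` module. The snippet below is only a sketch of the idea, not how UpGuard computes its diff, and the record lines are made-up stand-ins for a real export, whose layout may differ.

```python
import difflib


def changed_lines(previous, current):
    """Return (removed, added) lines between two zone export texts."""
    removed, added = [], []
    for line in difflib.ndiff(previous.splitlines(), current.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])   # only in the previous version
        elif line.startswith("+ "):
            added.append(line[2:])     # only in the current version
    return removed, added


# Made-up excerpts standing in for yesterday's and today's export.
yesterday = "web01 A 10.0.0.5\nprod-appdb01 CNAME prod-sqlsrv01.acmenet.com.\n"
today = "web01 A 10.0.0.5\nprod-appdb01 CNAME test-sqlsrv01.acmenet.com.\n"

removed, added = changed_lines(yesterday, today)
for line in removed:
    print("previous:", line)
for line in added:
    print("current: ", line)
```

Unchanged records are dropped, so only the CNAME line that moved from prod-sqlsrv01 to test-sqlsrv01 is reported.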
Now we really have something to work with.
A quick chat with our app teams reveals that the production system database connection strings point to "prod-appdb01", which DNS is supposed to resolve to "prod-sqlsrv01". This is a common DNS technique that allows us to switch around IP addresses and host names on backend machines without needing to update application code.
The DB server "test-sqlsrv01" is a replica of production that is only replicated once a week. Our production application is now connected to the test database due to this DNS issue, which would explain why our system is operational but appears to be out of date.
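Because the application depends entirely on that alias, it can be worth checking the exported zone against the targets you expect. The following is a rough sketch under simplifying assumptions: each CNAME record sits on one whitespace-separated line that starts with the alias name, and only the short host name of the target is compared. The helper names `parse_cnames` and `find_drift` are hypothetical.

```python
def parse_cnames(zone_text):
    """Map alias -> target for every CNAME line in a zone export."""
    cnames = {}
    for raw in zone_text.splitlines():
        line = raw.split(";", 1)[0]  # drop trailing comments
        fields = line.split()
        if "CNAME" in fields:
            idx = fields.index("CNAME")
            # Need an alias before CNAME and a target after it.
            if idx >= 1 and idx + 1 < len(fields):
                target = fields[idx + 1].rstrip(".").lower()
                cnames[fields[0].lower()] = target
    return cnames


def find_drift(zone_text, expected):
    """Return {alias: (expected, actual)} for aliases that have drifted."""
    actual = parse_cnames(zone_text)
    drift = {}
    for alias, target in expected.items():
        found = actual.get(alias.lower())
        # Compare short host names only (assumption: one domain in play).
        if found is None or found.split(".")[0] != target.lower():
            drift[alias] = (target, found)
    return drift
```

In the scenario above, `find_drift(zone_text, {"prod-appdb01": "prod-sqlsrv01"})` would flag the alias as soon as the export shows it pointing at test-sqlsrv01, and return an empty dictionary when everything is as expected.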
In this simplified scenario, our app and database teams all claimed to be clear of the problem and were happy to point the finger at everyone else. The thing is, they were right: the fault lay with neither team. The application code was still pointing to the same database name as it had been previously, and the databases had not been touched.
In only a matter of minutes, we were able to diagnose this problem and arm ourselves with the detail needed to repair it.