Monitoring a traditional Microsoft SQL Server cluster, one that typically has 1 active and 1 passive node, is a challenge because at any point the services, storage, and SQL Server’s engine can be running on either (or any!) node at any time. Typically we have a “preferred node” however that we like SQL to run on, though we may have failovers from time to time, either manually for maintenance purposes or unexpected events. Because of this uncertainty, and because we know services only run on one node at a time, monitoring both (or all) nodes in the cluster always yields some items in an alert state.
In any Windows Failover cluster, SQL or otherwise, you have a cluster DNS name that follows along with the current active node. This name is for the management of the cluster and connecting to this DNS name ensures you (ideally) will always be connecting to SOMETHING. IE: we don’t need to know or guess which node is available. Naturally, with Uptime, we are aware what servers are up or down, but this “virtual name” provides us a degree of convenience. First let’s understand what Uptime expects with monitoring any host.
In a virtualized environment, aside from the hostname (DNS) we have an IP address (absolute) as well as a system name (absolute server property) and a domain name (if it is joined to a domain). If you’re in a virtualized environment (likely) you also have a VMWARE or Hyper-V UUID. This unique identifier is also critical to Uptime as that is how we consider a server to be unique from the others. If you clone a VM, or detach and reattach a VM, the UUID changes. It can also be changed manually. This change tells Uptime that a new server exists. Uptime cares about most of these items but there are some differences.
Because of this, in a Virtualized server environment, simply adding the cluster’s “virtual name” as an agentless WMI element will not suffice. WMI exposes the complete UUID, for example “4239eef3-ce1f-ebd2-a2eb-c367de4aea7f”. Just as it is seen in a VM guest element with no hooks into WMI at all. Agents however expose the UUID in a different fashion, and in fact, because the “virtual name” doesn’t exist at all in VMWare, only in DNS, we can get away with adding the “virtual name” as an agent bound element. FYI, an agent exposes this same UUID as “a2 eb c3 67 de 4a ea 7f”.
Now let’s add another layer of complexity. Microsoft SQL Server Always-On Availability Groups (quite the mouthful, I will call them AA/AG from now on!) In this setup you have a “traditional” failover cluster except SQL Server gets installed as usual, and the only things that “fails over” as part of the AA/AG (typically / generically speaking) is the cluster virtual name, and the DNS name to IP resolution of the availability group itself and some internal bits in MS-SQL that defines what node is the primary read / write node. The main difference in an AA/AG is that SQL Server is completely active on both (or more) nodes. This allows you to do a lot more and also have near instant failover with likely zero data loss. We can have a read-only secondary replica for reporting purposes or backups for instance, and we can run normal, non-failover databases on either node totally independent from the failover enabled databases. So, instead of “wasting a server” when it’s not failed over as active, you can take advantage of it.
Ok, story is over. Here is how to make this work.
At this point we have at least 3 servers added to Uptime. Next we need to monitor SQL. There are a number of ways of provisioning service monitors in Uptime. Slight differences exist for “traditional” SQL Cluster vs. AA/AGs.
For a traditional failover SQL cluster:
This lets us know that SQL is working and the response time we get tells us roughly how quickly it responds. You may need to adjust your response time settings a bit, but I like to keep them fairly low and save them so I can graph it later.
click the green install button if you don’t already have it installed (I do) then when it finishes successfully (watch the message!) you can hit back and add it. This is a custom one we’re going to do just for the cluster monitoring. You can do SO MUCH with this service monitor but I’m not even going to get off on that tangent here. Another post perhaps! For this example we’re going to query the SQL DB engine for the server name it is running on. This never lies. No matter what the virtual name is, this is going to report the actual server name the engine is on. See below:
Note I am only alerting on the Text Result field. Here you’ll want to enter YOUR preferred node’s server name. If you running on the preferred active now right now just save the monitor, test it and watch the output. Here you can see I’m actually failed over to the second node! Works!
Ok, now, for availability groups. The process is nearly identical actually, except we’re going to apply all of the monitors to all of the nodes. MSSQLSERVICE Service monitor, advanced, basic, windows performance and file system monitoring, but we will only do the SQL Query that queries the virtual name against either the CLUSTER’S virtual host name or the AVAILABILITY GROUP DNS name itself. This depends on your goals… If I want to watch when a database supported by the AA/AG functionality moves nodes itself this is how I would do it. Do not fret, you do not have to add a separate element for EVERY availability group! Just add an element via agent for the CLUSTER virtual node so we know we’ll (theoretically) always have SOMETHING to connect to. (I use caps a lot.. probably the coffee…) Then you’ll create a SQL Query service monitor to query the status of each AVAILABILITY GROUP individually. The query below will check if the listener is active for the availability group, just replace ‘uim-clust’ with your availability group’s listener name in the WHERE clause:
SET NOCOUNT ON;
-- uncomment following line to test
hags.primary_replica from sys.availability_group_listener_ip_addresses aglia
JOIN sys.availability_group_listeners agl
ON agl.listener_id = aglia.listener_id
JOIN [sys].[dm_hadr_availability_group_states] hags
ON hags.group_id = agl.group_id
--comment the following line to return all groups for testing where agl.dns_name = 'uim-clust';
--comment the following line to return all groups for testing
where agl.dns_name = 'uim-clust';
This query will return the SERVERNAME of the currently active primary replica.
As before, we can simply put a warning condition where if the text output does not match the node you want to be the primary replica, alert criteria is met. Since I want UPTIME-SB-SQL1 to be the primary at all times, I would put WARNING if it doesn’t match UPTIME-SB-SQL1.
And there you have it. This same service monitor can be created one for each availability group that runs on your cluster. Each of them could be used as regular service monitors in the cluster health application we created earlier. That way if any of your availability groups is not running on the node you expect, an alert condition is met. There are lots of other properties you might query against to create service monitors in Uptime with regards to clusters, availability groups, and numerous other things from performance to backup states, etc. I hope this guide has served to help you in your journey.
Getting stuck at step 3: Add the cluster “virtual name” into Uptime as an Agent system from Infrastructure > add system / network device. Using the virtual name as it appears in DNS. If the nodes are “SQL-NODE1” and “SQL-NODE2”, this would be something like “SQL-CLUSTER”. I receive an error attempting to add the SQL cluster name. "This element with UUID already exists within a virtual center in uptime." I discovered the devices using vcenter discovery. Verified WMI is working properly. Verified the virtual name is not listed anywhere within the infrastructure.One thing I discovered: I AM able to add the hosts specifying them as "virtual node". But I am not able to add the machines with type "agent".