“Hey, the database is down. What’s going on? Can you fix it? What’s your update? When can you tell me what’s wrong?” As DBAs, nearly all of us have probably heard this phone call or been in this conversation. When we’re supporting databases that need to be highly available, there is a typically a client or customer extremely interested in the system’s uptime and a manager or executive on our side extremely interested in why the highly available database whose licenses and hardware cost a lot of money is now unresponsive.
Given our role as DBAs, it’s our job to quickly resolve an issue and then provide an answer as to why the issue happened so we can prevent it from happening again. If we are supporting one (or more) Always On Availability Group(s) this is even more essential because those databases typically have strict uptime requirements and a lot of money invested in them to meet those SLAs. If we’re asked the reason for an unscheduled failover or more comprehensive AG failure and our answer is “I don’t know”, that is the type of answer that can result in an urgent need to update our resume because we’ve been escorted out of the building. It’s our job to know and communicate the reason behind the failure, whether the failure is in SQL Server, the cluster, or somewhere else in the infrastructure. Even though Always On Availability Groups are a SQL Server feature, they rely on network and cluster infrastructure where our troubleshooting may lead us. Because AGs are a SQL Server feature, the DBA typically has the ultimate responsibility to explain the failure and steps being taken to prevent it in the feature. Let’s address a key element in AG troubleshooting that’s often overlooked by DBAs: the cluster log.
Some former colleagues of mine relayed a story about one of their AGs experiencing several unscheduled failovers at odd hours. After “thorough” research by the DBA team, the reason for the failovers came back as “we have no idea, the SQL Server error log doesn’t tell us anything”. My response, as the story was relayed, was “what did the cluster log say?” This DBA team was not aware of what the cluster log was so, before you leave this blog, let’s discuss what the cluster log is and how to access it.
The cluster log is a log of every event in the cluster: errors, warnings, etc. It is a detailed timeline of everything every node in the cluster is doing. It can be generated via an administrative command prompt by running cluster log /g. If PowerShell is your preference, you can run the Get-ClusterLog cmdlet. Either way, the log files will end up in C:\%windir%\Cluster\Reports for your review. While there are a wide variety of methods to analyze these logs (review manually, search for particular error message, search for particular timestamp, etc.) these files are quite large and difficult to review efficiently. When I want to efficiently scan a cluster log for errors, I typically turn to 2016 IDERA ACE John Morehouse’s blog (https://sqlrus.com/2019/01/finding-cluster-log-errors/) about using a simple PowerShell script to pull the errors out of the cluster log so they can be reviewed more quickly when time is of the essence.
Now we all know that the cluster log exists, how to generate it, and how to review this often overlooked part of troubleshooting cluster issues that can trigger availability group failures and failovers. My hope is that this blog adds more tools to your troubleshooting arsenal and allows you to confidently respond to AG issues that may initially catch you off-guard.