Identifying Performance Problems Part 1

Every server, database, storage appliance, and network encounters some sort of performance issue. A major part of our job as database administrators is to properly monitor and fix those performance issues, whether in a traditional on-premises environment or in the cloud. This article covers ways of monitoring performance and how to establish a baseline. We will go over several important performance metrics every database administrator should know, and how to gather and interpret those metrics through several different options within SQL Server and the Windows operating system.

Now, which tool you use to accumulate these metrics depends on what problem you are trying to solve. The key is to gather the best metrics at the least cost. Cost encompasses both the overhead needed to collect the information and the time needed to analyze the data; both play an important role. In part one we will look at performance metrics you should become familiar with; in part two we will take a look at the performance tools available to you to capture these metrics.

Performance Metrics

To really understand what is happening with performance in your system, you need to establish a baseline of performance metrics. A baseline is created by gathering a collection of metric data that helps you understand the normal “state” of your application or server’s performance. This data, collected over a time range, allows you to identify changes from the normal state within those ranges and to easily see actionable anomalies. Without an established baseline and trend analysis, it is very easy to get caught up with every issue you encounter. To avoid this, it is important to know what is considered normal for your environment and therefore does not need intervention. As you become accustomed to what is normal, you will start to note thresholds and can act only on those things that exceed the threshold for a given metric.

One way to do this is simply to chart things like CPU utilization over time. There is nothing complicated about baselining; however, knowing what to capture and at what granularity will depend on how critical the performance of your database and application is. Notice I said over time. It is important to establish your normal for time ranges, as your hourly, daily, nightly, weekly, and even monthly workloads may vary. You need to know what each of those trends looks like ordinarily.
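As a sketch of what this kind of baselining can look like inside SQL Server, the script below snapshots a few counters from sys.dm_os_performance_counters into a table so they can be charted over time. The table name dbo.CounterBaseline and the choice of counters are assumptions for illustration, not a prescribed design:

```sql
-- Hypothetical baseline table; create once in a monitoring database.
CREATE TABLE dbo.CounterBaseline (
    collected_at   datetime2     NOT NULL DEFAULT SYSDATETIME(),
    object_name    nvarchar(128) NOT NULL,
    counter_name   nvarchar(128) NOT NULL,
    cntr_value     bigint        NOT NULL
);

-- Run this on a schedule (e.g., a SQL Agent job every few minutes)
-- to build the trend data a baseline is made of.
INSERT INTO dbo.CounterBaseline (object_name, counter_name, cntr_value)
SELECT RTRIM([object_name]),   -- DMV columns are fixed-width, so trim padding
       RTRIM([counter_name]),
       [cntr_value]
FROM sys.dm_os_performance_counters
WHERE RTRIM([counter_name]) IN (N'Page life expectancy',
                                N'Batch Requests/sec',
                                N'SQL Compilations/sec');
```

Charting the collected rows by collected_at is then enough to see your hourly, daily, and weekly normals.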

Now there is no need to memorize these. I like to use this list as a reference and tend to Google each of these when I need to recall exactly what they mean. But it is important to know this list and what type of metric it is. These should be your go-to metrics. 

CPU Utilization Metrics  

CPU utilization is one of the key metrics for troubleshooting any SQL Server instance’s performance-related issues. When a query is run, an execution plan is generated. Within the execution plan there are operators which direct the storage engine on how to retrieve data results. Each operator will obtain a worker thread, which in turn will be executed using one or more of the CPU cores the server has. Servers have a finite number of available cores, and if poorly performing queries consume all the available CPU, performance issues will be prevalent. Properly tuning database queries via code improvements, appropriate indexes, and optimal hardware configurations will ensure maximum CPU efficiency. The metrics below will help you measure CPU utilization across your server.

Processor(_Total)\% Processor Time 

This counter simply measures the CPU utilization of all of the processors on the server. It is a good indication of the overall workload and, when used in conjunction with other counters, can identify problems with query performance.
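You don't always need PerfMon to see recent CPU utilization; SQL Server keeps roughly the last few hours of one-minute samples in the scheduler monitor ring buffer. A common sketch of that query is below; note the XML shape has been stable across recent versions, but treat it as undocumented and verify on your build:

```sql
-- Recent CPU samples recorded by SQL Server itself (one row per minute).
-- system_idle_pct + sql_process_cpu_pct subtracted from 100 gives the
-- CPU used by everything else on the box.
SELECT TOP (10)
       record.value('(./Record/@id)[1]', 'int') AS record_id,
       record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int')
           AS system_idle_pct,
       record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int')
           AS sql_process_cpu_pct
FROM (
    SELECT CONVERT(xml, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      AND record LIKE N'%<SystemHealth>%'
) AS rb
ORDER BY record_id DESC;
```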

Processor(_Total)\% User Time 

This counter measures the percentage of time that the processor spends executing in user mode; for our purposes this would be SQL Server. User mode simply means non-kernel mode, or in other words non-idle time spent on user processes. High values could mean suboptimal SQL Server instance configurations, excessive querying, or things like schema issues. Values as high as 100% can indicate processor bottlenecks.

Processor(_Total)\% Interrupt Time 

This is the time the processor spends receiving and servicing hardware interrupts during sample intervals. It is an indirect indicator of the activity of devices such as disk drivers, data communication lines, network interface cards, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. The % Interrupt Time counter will not specifically identify the device that is causing a high % Interrupt Time; to identify the device, use Process Trace Information events to determine what is being dispatched most frequently. As a DBA it is important to know metrics about all the things that can affect your server's performance, even when they seem unrelated to the SQL Server instance itself; this is one of those metrics.

System\Processor Queue Length 

This number indicates how many threads are waiting for time on the processor. If it is consistently greater than zero, it indicates CPU pressure, meaning your workload would benefit from more CPUs.
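SQL Server exposes its own internal equivalent of the processor queue: the number of runnable tasks per scheduler. A quick, grounded way to check for CPU pressure from inside the instance is:

```sql
-- runnable_tasks_count = tasks that have a worker but are waiting for
-- CPU. Sustained non-zero values across schedulers suggest CPU pressure,
-- much like a non-zero System\Processor Queue Length.
SELECT scheduler_id,
       current_tasks_count,
       runnable_tasks_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';
```

As with the PerfMon counter, look at this over time rather than at a single snapshot.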

Disk Metrics Including Latency and Queues  

How fast SQL Server can read from or write to the underlying storage can be a crucial place for bottlenecks to occur. The amount of time a disk storage system spends completing I/O requests varies with the type of disk, specifically whether it is a solid-state drive or a traditional spindle disk. Solid-state drives offer better (lower) latency than traditional spindle disks. This metric is usually measured in milliseconds; however, for some higher-end systems microseconds may be more appropriate. Disk I/O latency should generally be under 20 milliseconds and, optimally, as low as possible within appropriate budget constraints.

PhysicalDisk(_Total)\Avg. Disk sec/Read and Avg. Disk sec/Write 

This is a good metric for how the storage subsystem is performing. Your latency values in most cases should not be above 20 ms, and with Premium Storage you should see values less than 10 ms. This is one of the most important latency-related metrics a DBA should know.
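Inside SQL Server, the same latency picture is available per database file from sys.dm_io_virtual_file_stats, which accumulates stall times since the instance started. A common sketch:

```sql
-- Average read/write latency per file since instance startup.
-- NULLIF guards against division by zero for files with no I/O yet.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_read_latency_ms DESC;
```

Because these numbers are cumulative averages since startup, compare deltas between two samples when you need to see what latency looks like right now.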

Avg. Disk Read Queue Length 

This is a good metric for determining how many requests are waiting on disk resources; it is the average number of read requests that were queued for the selected disk during the sample interval. A high disk read queue length will reveal disk bottlenecks within your system.

Memory Metrics 

With SQL Server, all data-related operations (reading or writing) happen within memory spaces called buffers. SQL Server buffers data in memory to avoid expensive trips to disk for frequently run queries, in a memory construct known as the buffer pool. The buffer pool in SQL Server is made up of 8-kilobyte pages. These pages store data as it is retrieved from disk, where it is also stored in 8-kilobyte pages grouped into a construct known as an extent, which consists of eight pages, or 64 kilobytes. When SQL Server does physically read data from the disk, it will typically read one extent at a time.

PLE, or page life expectancy, indicates how long these pages will remain within the buffer pool. The metric is displayed in seconds. It can potentially indicate memory pressure, because a low value (what counts as low is subjective, depending on the system configuration and associated workloads) usually means data is being churned through the available memory. This metric should not be considered by itself; correlate it with other metrics to definitively identify performance bottlenecks.

SQLServer:Buffer Manager\Page life expectancy 

Page life expectancy indicates how long SQL Server expects a page to live in memory. There is no proper value for this setting (older documentation refers to 300 seconds as proper, but that was written in a 32-bit era when servers had far less RAM). You should monitor this over time and evaluate sudden drops. A drop could indicate poor query patterns, external memory pressure (such as the server running a large SSIS package), or could just be normal system processing, like running a consistency check on a large database.
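PLE is easy to pull on demand from sys.dm_os_performance_counters. One detail worth knowing: the Buffer Manager object reports the instance-wide value, while the Buffer Node object breaks it down per NUMA node, which matters on multi-node servers where one node can be under pressure while the overall number looks fine:

```sql
-- Instance-wide PLE (Buffer Manager) plus per-NUMA-node PLE (Buffer Node).
SELECT RTRIM([object_name])  AS object_name,
       RTRIM(instance_name)  AS node,        -- NUMA node id for Buffer Node rows
       cntr_value            AS ple_seconds
FROM sys.dm_os_performance_counters
WHERE RTRIM(counter_name) = N'Page life expectancy';
```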

Paging File(_Total)\% Usage 

In a properly configured SQL Server instance, memory should not page to the paging file on disk. However, in some configurations you may have other services running that consume system memory and lead the operating system to page memory to disk, resulting in performance degradation.

Available Mbytes (Memory\Available Mbytes) 

Monitoring the available memory for Windows is also critical to maintaining a healthy SQL Server instance. This counter is used to ensure Windows has the proper available memory for use by applications and processes. If the amount of available memory becomes low, it may be an indication that more memory needs to be added to the server, that applications need to be moved off the server, or that you need to stop services, applications, or processes from running and consuming memory resources. This can also be a good indicator of memory leaks, which occur when an application or process fails to release discarded memory.
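SQL Server also surfaces the operating system's memory picture through a DMV, which is handy when you want available-memory numbers from the same place you gather your other baselines:

```sql
-- Windows memory state as seen by SQL Server. system_memory_state_desc
-- values such as 'Available physical memory is low' signal external
-- memory pressure on the instance.
SELECT total_physical_memory_kb     / 1024 AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;
```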

Miscellaneous SQL Server Related Metrics 

SQLServer:SQL Statistics\Batch Requests/sec 

This counter is good for evaluating how consistently busy a SQL Server instance is over time. Once again there is no good or bad value, but you can use this in conjunction with % Processor time to better understand your workload and baselines. 
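Note that despite the "/sec" in its name, this counter is cumulative since instance startup; to get an actual rate you take two samples and divide the delta by the elapsed time. A minimal sketch using a 10-second window:

```sql
-- Batch Requests/sec is a cumulative counter: sample it twice and
-- divide the delta by the elapsed seconds to get the real rate.
DECLARE @first bigint, @second bigint;

SELECT @first = cntr_value
FROM sys.dm_os_performance_counters
WHERE RTRIM(counter_name) = N'Batch Requests/sec';

WAITFOR DELAY '00:00:10';

SELECT @second = cntr_value
FROM sys.dm_os_performance_counters
WHERE RTRIM(counter_name) = N'Batch Requests/sec';

SELECT (@second - @first) / 10.0 AS batch_requests_per_sec;
```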

SQLServer:SQL Statistics\SQL Compilations/sec and SQL Re-Compilations/sec 

These counters are incremented when SQL Server has to compile or recompile an execution plan for a query, either because there is no existing plan in the plan cache or because a plan was invalidated by a change. High values can indicate T-SQL with RECOMPILE query hints, or be indicative of pressure on the plan cache caused by a large volume of ad hoc queries or by memory pressure.
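A useful way to interpret compilations is relative to batch requests. Because both counters are cumulative since startup, dividing one by the other gives a long-run average; a rough, commonly cited rule of thumb (an assumption to validate against your own baseline, not a hard limit) is that compilations staying under about 10% of batch requests is healthy:

```sql
-- Long-run average of compilations as a percentage of batch requests.
-- For a point-in-time view, apply the same math to sampled deltas.
SELECT
    MAX(CASE WHEN RTRIM(counter_name) = N'SQL Compilations/sec'
             THEN cntr_value END) * 100.0
  / NULLIF(MAX(CASE WHEN RTRIM(counter_name) = N'Batch Requests/sec'
               THEN cntr_value END), 0) AS compilations_pct_of_batches
FROM sys.dm_os_performance_counters
WHERE RTRIM(counter_name) IN (N'SQL Compilations/sec', N'Batch Requests/sec');
```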

Summary

Gathering performance metrics and establishing a baseline from them is a critical step in performance monitoring. It is an important first step to know and understand what needs to be collected and what each metric means. In part two we will dive into how to gather these metrics using the tools provided natively inside SQL Server and Windows Server.

