Saturday, November 23, 2024

Troubleshooting performance issues - 1

 Defining a Troubleshooting Methodology 

 

Fairly early on in any analysis, I'll take a look at the wait statistics, in the sys.dm_os_wait_stats Dynamic Management View (DMV), to identify any major resource waits in the system, at the operating system level. Let's say I identify very high PAGEIOLATCH_SH waits, which indicates that sessions are experiencing delays in obtaining a latch for a buffer page. This happens when lots of sessions, or maybe one session in particular, are requesting a lot of data pages that are not available in the buffer pool (and so physical I/O is needed to retrieve them). SQL Server must allocate a buffer page for each one, and place a latch on that page while it's retrieved from disk. The bottleneck here is disk I/O; 

 

The disk subsystem simply can't return pages quickly enough to satisfy all of the page requests, and so sessions are waiting for latches, and performance is suffering. However, this does not necessarily mean that a slow disk subsystem is the cause of the bottleneck; it may simply be the victim of excessive I/O caused by a problem elsewhere in the system. 

 

At this point, I'll want to validate this information by examining the virtual file stats, in sys.dm_io_virtual_file_stats. Specifically, I'll be looking for evidence of high latency associated with the read and write operations being performed by SQL Server. At the same time, I'll be drilling deeper into the problem, since the virtual file stats will tell me how much I/O activity is being performed by SQL Server, and how the I/O load is distributed across files and databases on the SQL Server instance.  

 

To corroborate this data further, I may also check the values of the Physical Disk\Avg. Disk Reads/sec and Physical Disk\Avg. Disk Writes/sec PerfMon counters. So, at this stage, let's say I've confirmed high latency associated with read and write operations, and found that a particular database is experiencing a very high level of mainly read-based I/O. 

 

My next step will be to investigate the execution statistics for queries against this database, which are maintained in sys.dm_exec_query_stats for the execution plans that are in the plan cache. I'll identify the queries that have the highest accumulated 

physical reads, and then review their associated execution plans, looking for any performance tuning opportunities, either by adding missing indexes to the database, or making changes to the SQL code, in order to optimize the way the database engine accesses the data. 

 

It may be that the code is optimized as far as it can be, but a commonly executed 

reporting query simply needs to read 6 GB of data from the database, for aggregation, 

as a part of its execution. If most of this data isn't found in the buffer cache, it will cause high physical I/O, and will account for the high PAGEIOLATCH_SH waits. At this point, we may need to look at our hardware configuration and see if the actual root of our problem is a lack of memory installed in the server. In order to verify this, I'll examine the PerfMon memory counters. 

 

If I see that the Page Life Expectancy is consistently fluctuating, and the system is experiencing non-zero values for Free List Stalls/sec, and high Lazy Writes/sec, then I can be fairly certain that the buffer pool for the instance is inadequately sized for the amount of data that is being used by the workload. This does not necessarily mean the server needs more memory; it may be that the queries are inefficient and are reading far more data than necessary. To identify the appropriate fix will require further and deeper analysis. This is just one of many possible examples. 

Wait Statistics: the Basis for Troubleshooting 

 

One of the first items that I check, when troubleshooting performance problems on a SQL Server, is the wait statistics, which are tracked by the SQLOS during normal operations of any SQL Server. 

 

The SQLOS is a pseudo-operating system that runs as a part of the SQL Server database engine and provides thread scheduling, memory management, and other functions 

to SQL Server. Normally, such services would, for any processes running inside the operating system, be provided by the operating system. The reason that SQL Server provides its own pseudo-operating system environment is that SQL Server knows how to 

schedule its tasks better than the Windows operating system does, and the cooperative 

scheduling that is implemented by the SQLOS allows for higher levels of concurrency 

than the preemptive scheduling provided by the Windows operating system. 

 

As an example of this, any time that SQL Server has to wait while executing an operation or statement, the time spent waiting is tracked by the SQLOS, as wait time. This data is exposed, for each instance of SQL Server installed on a server, in the sys.dm_os_wait_stats DMV. The cause and length of the various waits that SQL Server is experiencing can provide significant insight into the cause of the performance problems, as long as you understand exactly what the wait statistics are telling you, and know how to correlate the wait information with the additional troubleshooting information such as the PerfMon counters, and other DMVs. 

 

In general, when examining wait statistics, I focus on the top waits, according to wait_time_ms, and look out for high wait times associated with the following specific wait types 

CXPACKET 

Often indicates nothing more than that certain queries are executing with parallelism; CXPACKET waits in the server are not an immediate sign of problems, although they may be the symptom of another problem, associated with one of the other high value wait types in the instance. 

 

SOS_SCHEDULER_YIELD 

• The tasks executing in the system are yielding the scheduler, having exceeded their 

quantum, and are having to wait in the runnable queue for other tasks to execute. This may indicate that the server is under CPU pressure.  

 

THREADPOOL 

• A task had to wait to have a worker bound to it, in order to execute. This could 

be a sign of worker thread starvation, requiring an increase in the number of CPUs in the server, to handle a highly concurrent workload, or it can be a sign of blocking, resulting in a large number of parallel tasks consuming the worker threads for long periods. 

 

LCK_* 

• These wait types signify that blocking is occurring in the system and that sessions 

have had to wait to acquire a lock of a specific type, which was being held by another 

database session. This problem can be investigated further using the information in the sys.dm_db_index_operational_stats. 

 

PAGEIOLATCH_*, IO_COMPLETION, WRITELOG 

• These waits are commonly associated with disk I/O bottlenecks, though the root cause of the problem may be, and commonly is, a poorly performing query that is consuming excessive amounts of memory in the server. PAGEIOLATCH_* waits are specifically associated with delays in being able to read or write data from the database files. WRITELOG waits are related to issues with writing to log files. These waits should be evaluated in conjunction with the virtual file statistics as well as Physical Disk performance counters, to determine if the problem is specific to a single database, file, or disk, or is instance wide. 

 

PAGELATCH_* 

• Non-I/O waits for latches on data pages in the buffer pool. A lot of times PAGELATCH_* waits are associated with allocation contention issues. One of the best-known allocations issues associated with PAGELATCH_* waits occurs in tempdb when the a large number of objects are being created and destroyed in tempdb and the system experiences contention on the Shared Global Allocation Map (SGAM), Global Allocation Map (GAM), and Page Free Space (PFS) pages in the tempdb database. 

 

LATCH_* 

• These waits are associated with lightweight short-term synchronization objects that 

are used to protect access to internal caches, but not the buffer cache. These waits 

can indicate a range of problems, depending on the latch type. Determining the specific latch class that has the most accumulated wait time associated with it can be found by querying the sys.dm_os_latch_stats DMV. 

 

ASYNC_NETWORK_IO 

• This wait is often incorrectly attributed to a network bottleneck. In fact, the most 

common cause of this wait is a client application that is performing row-by-row processing of the data being streamed from SQL Server as a result set (client accepts 

one row, processes, accepts next row, and so on). Correcting this wait type generally 

requires changing the client-side code so that it reads the result set as fast as possible, and then performs processing. 

 

 

Conversely, if the primary wait types in the system are LCK_* waits, then you won't want to waste time looking at the disk I/O configuration, but instead focus on discovering what might be causing blocking inside the databases on the server. 

 

 

Virtual File Statistics 

A common trap in my experience, when using wait statistics as a primary source of troubleshooting data, is that most SQL Servers will demonstrate signs of what looks like a disk I/O bottleneck. Unfortunately, the wait statistics don't tell you what is causing the I/O to occur, and it's easy to misdiagnose the root cause. 

This is why an examination of the virtual file statistics, alongside the wait statistics, is almost always recommended. The virtual file statistics are exposed through the sys.dm_io_virtual_file_stats function which, when passed a file_id (and possibly database_id), will provide cumulative physical I/O statistics, the number of reads and writes on each data file, and the number of reads and writes on each log file, for the various databases in the instance, from which can be calculated the ratio of reads to writes. This also shows the number of I/O stalls and the stall time associated with the requests, which is the total amount of time sessions have waited for I/O to be completed on the file. 

 

If the data and log files are on a shared disk array in the server, and the calculated 

avg_total_latency is the same across all of the databases, and higher than what is acceptable for the specific workload, then the problem may be that the workload has 

outgrown the disk I/O subsystem. 

However, if the server hosts a database that is used for archiving data to slower storage, for year-on-year reporting, then it may be that having PAGEIOLATCH_* waits in the database is entirely normal, and the io_stall information for the specific database 

files may lead us to determine that the waits are most likely attributable to the archiving process. This highlights the fact that it helps to have a sound knowledge of the underlying configuration and type of workload for the server, while you're troubleshooting the problem. 

 

If a particular file is subject to very heavy read activity (for example a ratio of 10:1, or higher, for the read:write ratio), and is showing high average latency, then I may recommend a RAID change for the disk array, for example from RAID 10 to RAID 5, 

offering more spindles to share the read I/O. 

 

 

Performance Counters 

 

The two Access Methods counters provide information about the ways that tables 

are being accessed in the database. The most important one is the Full Scans/sec counter, which can give us an idea of the number of index and table scans that are occurring in the system. 

If the disk I/O subsystem is the bottleneck (which, remember, is most often caused by pressure placed on it by a problem elsewhere) and this counter is showing that there are scans occurring, it may be a sign that there are missing indexes, or inefficient code in the database. How many scans are problematic? It depends entirely on the size of the objects being scanned and the type of workload being run. In general, I want the number of Index Searches/sec to be higher than the number of Full Scans/sec by a factor of 800–1000. If the number of Full Scans/sec is too high, to determine if there are missing indexes in the database, resulting in excess I/O operations. 

 

The Buffer Manager and Memory Manager counters can be used, as a group, to identify if SQL Server is experiencing memory pressure. The values of the Page Life Expectancy, Free List Stalls/sec, and Lazy Writes/sec counters, when correlated, will validate or disprove the theory that the buffer cache is under memory pressure. 

A lot of online references will tell you that if the Page Life Expectancy (PLE) performance counter drops lower than 300, However, this guideline value for the PLE counter was set at a time when most SQL Servers only had 4 GB of RAM, and the data cache portion of the buffer pool was generally 1.6 GB. For a server with 32 GB allocated 

to the buffer pool, the PLE value should be at least (32/4)*300 = 2400. 

 

If the PLE is consistently below this value, and the server is experiencing high Lazy Writes/sec, which are page flushes from the buffer cache outside of the normal CHECKPOINT process, then the server is most likely experiencing data cache memory pressure, which will also increase the disk I/O being performed by the SQL Server. At this point, the Access Methods counters should be investigated to determine if excessive table or index scans are being performed on the SQL Server. 

The General Statistics\Processes Blocked, Locks\Lock Waits/sec, and Locks\Lock Wait Time (ms) counters provide information about blocking in the SQL Server instance, at the time of the data collection. If these counters return a value other than zero, over repeated collections of the data, then blocking is actively occurring in one of the databases. 

 

 

The higher the number of SQL Compilations/sec in relation to the Batch Requests/sec, the more likely the SQL Server is experiencing an ad hoc workload that is not making optimal using of plan caching. The higher the number of SQL Re-Compilations/sec in relation to the Batch Requests/sec, the more likely it is that there is an inefficiency in the code design that is forcing a recompile of the code being executed in the SQL Server. In either case, investigation of the Plan Cache, as detailed in the next section, should identify why the server has to consistently compile execution plans for the workload. 

 

The Memory Manager\Memory Grants Pending performance counter provides information about the number of processes waiting on a workspace memory grant in the instance. If this counter has a high value, SQL Server may benefit from additional memory, but there may be query inefficiencies in the instance that are causing excessive memory grant requirements, for example, large sorts or hashes that can be resolved by- tuning the indexing or queries being executed. 

 

 

Plan Cache Usage 

In my experience, the Plan Cache in SQL Server 2005 and 2008 is one of the most 

underused assets for troubleshooting performance problems in SQL Server. As a part of the normal execution of batches and queries, SQL Server tracks the accumulated execution information for each of the plans that is stored inside of the plan cache, up to the point where the plan is flushed from the cache as a result of DDL operations, memory pressure, or general cache maintenance. The execution information stored inside of the plan cache can be found in the sys.dm_exec_query_stats DMV. 

Additionally, the sys.dm_exec_query_plan() function can be cross-applied using 

the plan_handle column from the sys.dm_exec_query_stats DMV to get the execution plan that is stored in the plan cache. 

 

 

 

Investigating CPU Pressure 

 

Performance Monitor 

If your SQL Server system is experiencing excessively high CPU activity, the first tool for which you should reach is Performance Monitor (PerfMon). This Windows monitoring tool will confirm whether the excessive CPU usage is due to SQL Server activity, or is caused by other processes on the server, or the operating system itself.  

 

The primary PerfMon counters that are of value for monitoring CPU usage are listed 

below with brief explanations (quoted from MSDN): 

 

  • Processor/ %Privileged Time – percentage of time the processor spends on execution of Microsoft Windows kernel commands such as core operating system activity and device drivers. 

  • Processor/ %User Time – percentage of time the processor spends on executing user processes such as SQL Server. This includes I/O requests from SQL Server. 

  • Process (sqlservr.exe)/ %Processor Time – the sum of processor time on each processor for all threads of the process. 

 

Investigating CPU-related wait statistics 

Whenever a session has to wait before the requested work can continue, SQL Server records the reason for the wait (the resource that is being waited on) and the length of time waited. The sys.dm_os_wait_stats DMV exposes these wait statistics, aggregated 

across all sessions since the server last restarted or the wait statistics were cleared 

out with the DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR); command. This DMV can be used, among other things, to confirm CPU pressure and establish the most common wait types that are being experienced by a CPU-bound system. 

 

Signal wait time 

Along with a wait_type column, indicating the type of wait, the sys.dm_os_wait_ 

stats DMV returns several useful wait times, including: 

wait_time_ms – total amount of time that tasks have waited on this given wait type; 

this value includes the time in the signal_wait_time_ms column. The value increments 

from the moment a task stops execution, to wait for a resource, to the point it resumes execution. 

 

signal_wait_time_ms – the total amount of time tasks took to start executing after being signaled (i.e. after the resource it was waiting for became available); this is time spent on the runnable queue, and is pure CPU wait. 

If the signal wait time is a significant portion of the total wait time, it means that tasks are waiting a relatively long time to resume execution after the resources that they were waiting for became available. This can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU. 

 

SOS_SCHEDULER_YIELD waits 

The SQL scheduler is a cooperative multi-tasking scheduler. This means that it relies on the executing queries to voluntarily relinquish the CPU after a specific amount of running time. By contrast, the Windows scheduler is a pre-emptive multi-tasking scheduler, which means it removes tasks from the CPU after a specific amount of time. 

When a task voluntarily relinquishes the CPU and begins waiting to resume execution, the wait type assigned to the task is SOS_SCHEDULER_YIELD. The relinquishing task goes back onto the runnable queue and another task gets its allocated time on the CPU. If overall wait times are low, this type of wait is benign, simply indicating that the query spent longer than allowed on the CPU without having to wait for other resources (disk I/O, locks, latches, memory grants and so on). 

If queries show high wait times in sys.dm_exec_requests or sys.dm_os_waiting_ 

tasks for the SOS_SCHEDULER_YIELD wait type, it's an indication that the query is extremely CPU-intensive. If there are high wait times for this wait type overall on the server it can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU.  

 

CXPACKET waits occur during synchronization of the query processor exchange iterator 

between workers, for a query running in parallel across multiple processors. If the server hosts a data warehouse or reporting type of database that receives a low volume of queries but processes large amounts of data, parallelism can substantially reduce the time it takes to execute those queries. In contrast, however, if the server hosts an OLTP database that has a lot of small queries and transactions, then parallelism can kill throughput and negatively impact performance. 

 

 

Investigating CPU Pressure 

 

We'll discuss the three main tools used to measure CPU usage, and diagnose CPU pressure, in SQL Server: 

• Performance Monitor – a Windows monitoring tool for measuring the CPU usage by SQL Server and other processes running on the server. 

SQLTrace – a set of system stored procedures for real-time tracing events that are executing in SQL Server during periods of high CPU usage. 

• Dynamic Management Views – a collection of system objects that provide both snapshot and aggregate data regarding resource usage in SQL Server. 

 

The primary PerfMon counters that are of value for monitoring CPU usage are listed below with brief explanations (quoted from MSDN): 

  • Processor/ %Privileged Time – percentage of time the processor spends on execution of Microsoft Windows kernel commands such as core operating system activity and device drivers. 

  • Processor/ %User Time – percentage of time the processor spends on executing user processes such as SQL Server. This includes I/O requests from SQL Server. 

  • Process (sqlservr.exe)/ %Processor Time – the sum of processor time on each processor for all threads of the process. 

 

These three counters are sufficient to monitor the overall CPU usage, as well as the usage by SQL Server. However, there are several SQL Statistics counters (and one Plan Cache counter) that don't directly monitor CPU usage but do monitor events such as compilation and recompilation events, which can eat up a lot of CPU cycles, or are indications of problems that produce high CPU usage. 

  • SQLServer:SQL Statistics/Auto-Param Attempts/sec 

  • SQLServer:SQL Statistics/Failed Auto-params/sec 

  • SQLServer:SQL Statistics/Batch Requests/sec 

  • SQLServer:SQL Statistics/SQL Compilations/sec 

  • SQLServer:SQL Statistics/SQL Re-Compilations/sec 

  • SQLServer:Plan Cache/Cache hit Ratio 

 

 

Dynamic Management Views 

We can investigate high CPU usage by SQL Server by examining the CPU-related wait statistics, the scheduler details and the aggregated query performance statistics, as follows: 

  • Verifying the extent of CPU pressure via signal waits, using sys.dm_os_wait_stats. 

  • Diagnosing a CPU-bound system according to the types of wait observed, using sys.dm_os_wait_stats and sys.dm_os_schedulers. 

  • Identifying high-CPU cached plans, and associated queries, using sys.dm_exec_query_stats and sys.dm_exec_sql_text. 

  • Identifying currently waiting tasks, especially ones waiting on CPU-related wait types using sys.dm_os_waiting_tasks. 

  • Observing the resource usage of currently executing queries with sys.dm_exec_requests. 

 

Investigating CPU-related wait statistics 

Whenever a session has to wait before the requested work can continue, SQL Server 

records the reason for the wait (the resource that is being waited on) and the length of time waited. The sys.dm_os_wait_stats DMV exposes these wait statistics, aggregated 

across all sessions since the server last restarted or the wait statistics were cleared 

out with the DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR); command. This DMV can be used, among other things, to confirm CPU pressure and establish the most common wait types that are being experienced by a CPU-bound system. 

 

Signal wait time 

Along with a wait_type column, indicating the type of wait, the sys.dm_os_wait_stats DMV returns several useful wait times, including: 

• wait_time_ms – total amount of time that tasks have waited on this given wait type; 

this value includes the time in the signal_wait_time_ms column. The value increments from the moment a task stops execution, to wait for a resource, to the point it resumes execution. 

 

If the signal wait time is a significant portion of the total wait time, it means that tasks are waiting a relatively long time to resume execution after the resources that they were waiting for became available. This can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU. 

 

SELECT SUM(signal_wait_time_ms) AS TotalSignalWaitTime , 

( SUM(CAST(signal_wait_time_ms AS NUMERIC(20, 2))) 

/ SUM(CAST(wait_time_ms AS NUMERIC(20, 2))) * 100 ) 

AS PercentageSignalWaitsOfTotalTime 

FROM sys.dm_os_wait_stats 

 

 

We can also use the sys.dm_os_wait_stats DMV to find out which resource waits are the most common in our CPU-bound system, where we identify the top wait events, ordered according to the total amount of time processes have waited (wait_time_ms) on this event. It is important to ignore the benign waits, typically ones caused by system processes that are expected to be waiting most of the time. We're also subtracting out the signal_wait_time as that portion of the wait time is not waiting for the particular resource but waiting for time on the scheduler. Three interesting wait types to look out for, in regard to CPU pressure, are SOS_SCHEDULER_YIELD, CXPACKET and CMEMTHREAD. 

 

SOS_SCHEDULER_YIELD waits 

The SQL scheduler is a cooperative multi-tasking scheduler. This means that it relies on the executing queries to voluntarily relinquish the CPU after a specific amount of running time. By contrast, the Windows scheduler is a pre-emptive multi-tasking scheduler, which means it removes tasks from the CPU after a specific amount of time. 

When a task voluntarily relinquishes the CPU and begins waiting to resume execution, the wait type assigned to the task is SOS_SCHEDULER_YIELD. The relinquishing task goes back onto the runnable queue and another task gets its allocated time on the CPU. If overall wait times are low, this type of wait is benign, simply indicating that the query spent longer than allowed on the CPU without having to wait for other resources (disk I/O, locks, latches, memory grants and so on). 

If queries show high wait times in sys.dm_exec_requests or sys.dm_os_waiting_tasks for the SOS_SCHEDULER_YIELD wait type, it's an indication that the query is extremely CPU-intensive. If there are high wait times for this wait type overall on the server it can indicate either that there are lots of CPU-intensive queries, which may need optimizing, or that the server needs more CPU. Scheduler activity can be investigated further using the sys.dm_os_schedulers DMV. 

 

CXPACKET waits 

CXPACKET waits occur during synchronization of the query processor exchange iterator between workers, for a query running in parallel across multiple processors. If the server hosts a data warehouse or reporting type of database that receives a low volume of queries but processes large amounts of data, parallelism can substantially reduce the time it takes to execute those queries. In contrast, however, if the server hosts an OLTP database that has a lot of small queries and transactions, then parallelism can kill throughput and negatively impact performance. 

CMEMTHREAD waits 

CMEMTHREAD waits are waits for synchronized memory objects. Some memory objects can be accessed by multiple threads simultaneously, some cannot. When multiple threads are trying to access a memory object, typically a cache, which must be accessed by one 

thread at a time, the waiting threads get a CMEMTHREAD wait. In general, CMEMTHREAD waits are not common or long-lasting. 

 

Investigating scheduler queues 

The sys.dm_os_schedulers DMV can identify whether or not a SQL instance is CPU-bound. This DMV returns one row for each of the SQL Server schedulers and it lists the total number of tasks that are assigned to each scheduler, as well as the number that 

are runnable. 

A runnable task is one that is in the runnable queues, waiting for CPU time. Other tasks on the scheduler that are in the current_tasks_count but not the runnable_tasks_count are ones that are either sleeping or waiting for a resource (lock, latch, I/O, memory, and so on). 

 

SELECT scheduler_id , current_tasks_count , runnable_tasks_count 

FROM sys.dm_os_schedulers 

WHERE scheduler_id < 255 

 

A high number of runnable tasks, like a high signal wait time, indicates that there is not enough CPU for the current query load. The filter for schedulers below 255 removes, from the result set, the numerous hidden schedulers in SQL Server, which are used for backups, the Dedicated Administrator Connection (DAC), and so on, and are not of interest when investigating general CPU load. 

 

 

Identifying CPU-intensive queries 

In order to determine the worst-running queries in the plan cache of SQL Server, the DMVs sys.dm_exec_query_stats and sys.dm_exec_sql_text can be used. The sys.dm_exec_query_stats DMV provides aggregated statistics and returns one row for each query statement in the cached plan. Many of the columns are incremented counters, and provide information about how many times a query has been executed and the resources that were used. For example, the *_worker_time columns represent the time spent on the CPU, and the *_elapsed_time columns show the total execution time. 

 

It's important to note that some queries may never appear in the cache at all. 

Procedures marked WITH RECOMPILE and queries with the hint OPTION (RECOMPILE) are never cached. Also, the query stats for a query are cleared when the query recompiles for any reason, for example due to changing statistics or schema changes. As a result, queries that are subject to many recompilation events may also show a very low total for elapsed time, because that total is only for the current plan, which may not have been in cache very long. 

When querying the plan cache in order to investigate sub-optimal plans, note that some queries may not be listed. Execution plans are retained in cache until SQL Server decides that the plan has aged to a point where it should be removed to allow for new execution plans to be cached, or the cache is fully or partially cleared by DBCC commands, database state changes (restoring a database, detaching or taking a database offline, and so on) or certain server-wide configuration changes, or SQL Server is restarted. SQL Server will also remove execution plans from the cache if it finds that extra memory is required elsewhere in the system. 

No comments:

Post a Comment