Sunday, August 18, 2013

Statistics



               SQL Server tries to use statistics to “react intelligently” in its query optimization. Knowing number of
records, density of pages, histogram, or available indexes help the SQL Server optimizer “guess” more accurately how it can best retrieve data. The Query Optimizer uses them to estimate how many rows will be returned from a query plan.
               With no statistics to show how the data is distributed, the optimizer has no way it can compare the efficiency of different plans and so will be frequently forced to simply scan the table or index. Data is measured two different ways within a single set of statistics, by density and by distribution. Density is a ratio that shows just how many unique values there are within a given column, or set of columns. The formula is quite easy:

Density = 1 / Number of distinct values for column(s)

               A high density (low selectivity, few unique values) will be of less use to the optimizer because it might not be the most efficient way of getting at your data. For example, if you have a column that shows up as a bit, a true or false statement such as, has a customer signed up for you mailing list, then for a million rows, you’re only ever going to see one of two values. That means that using an index or statistics to try to find data within the table based on two values is going to result in scans where more selective data, such as an email address, will result in more efficient data access.

There are 3 types of statistics in a SQL Server database.

1)      Statistics created due to index creation. These statistics have the index name
2)      Statistics created by Optimizer (Column statistics). Starts with _WA_*
3)      User defined statistics which are created with CREATE STATISTICS command by the DBA

 Facts about statistics:

·         Index statistics are always updated with full scan when we rebuild the index (Only exception is in SQL 2012 partitioned index when the number of partitions >1000 it uses default sampling).
·         Column Statistics are not updated because of index rebuilds, they are only updated by either DB Auto update stats option or if we manually update statistics using UPDATE STATISTICS command.
·         If we do not specify the sampling rate while running the UPDATE STATISTICS command it will take the default sampling rate. See below how the default sampling rate algorithm works.
·         If you specify Update statistics command with just the table name it will update statistics for all the statistics including the index statistics on the table. So you may lose the full scan advantage which you get for index statistics if you update statistics with sampling rate after rebuilding indexes (Be very careful with this step during your maintenance windows)

               An index is a physically implemented structure in the database (you can read up more in BOL on clustered and non-clustered indexes) whereas statistics are a set of values that help the optimizer during the execution plan formation stages to decide whether to use an index or not. And it is not a 1-1 relationship between indexes and statistics i.e. all indexes have statistics but one can have statistics without an index. And these statistics that do not associate to an index can also help in the formation of the right execution plan Index helps the optimizer to find the data during the execution of the statements and statistics help the optimizer to determine which indexes to use.

Statistics essentially contain two pieces of information:

·         A histogram which contains a sampling of the data values from the index and the distribution of the values in the ranges of data.
·         Density groups (collections) over the column (or number of columns) of a table or an indexed view.
·
Density essentially reflects the uniqueness of the values in a particular column.

The density is just a fraction which represents how many records will be returned back when a query is made for a given value. So, for the primary key column index, the density will be 1. It however does not give us any information about the ranges i.e. the values that may appear more or less than the average N number of times. That information is provided by the histograms which are represented above.

Another thing of importance to note here is that in the cases of multiple columns in an index, the histogram is available only on the leading column of the index i.e. the first column. The density value will be available for each first column based subset, for example: If I create an index on say (LastName, FirstName), then I would have a density value for each of these:

a) LastName => the value for this would mean that for every LastName value, how many records get qualified?
b) LastName, FirstName => the value for this would mean that for every combination of the LastName and FirstName, how many records get qualified?
c) LastName, FirstName, EmployeeID => This includes the last column as the Primary Key column (a clustered
index) since what we created was a non-clustered index.

When the automatic update statistics option, AUTO_UPDATE_STATISTICS, is on, the query optimizer determines when statistics might be out-of-date and then updates them when they are used by a query. Statistics become
out-of-date after insert, update, delete, or merge operations change the data distribution in the table or indexed view. The query optimizer determines when statistics might be out-of-date by counting the number of data modifications since the last statistics update and comparing the number of modifications to a threshold.

The query optimizer checks for out-of-date statistics before compiling a query and before executing a cached query plan. Before compiling a query, the query optimizer uses the columns, tables, and indexed views in the query predicate to determine which statistics might be out-of-date. Before executing a cached query plan, the Database Engine verifies that the query plan references up-to-date statistics.

Consider updating statistics after performing maintenance procedures that change the distribution of data, such as truncating a table or performing a bulk insert of a large percentage of the rows. This can avoid future delays in query processing while queries wait for automatic statistics updates.  The query optimizer updates statistics when you rebuild an index on a table or view with ALTER INDEX REBUILD or DBCC DBREINDEX check out your “Actual Number of Rows” and “Estimated Number of Rows”.

If these numbers are (consistently) fairly close, then most likely your statistics are up-to-date and used by the optimizer for the query. If not, time for you to re-check your statistics create/update frequency.

No comments:

Post a Comment