Friday, October 2, 2015

A glance at SQL Server Statistics

Statistics are sometimes ignored when it comes to Performance Tuning. Statistics can play a big role in identifying recurring slowness.

Before executing any query, SQL Engine Optimizer should know in advance some information, like the number of records, page density, indexes etc. If updated information is available, the query optimizer will use best execution plan to execute the query.

Before go further on this, here is an example, an online store makes sure that they have enough stock in warehouse that is ready to ship to meet the demand. But their system is broken and online ordering system is unaware of stock quantities in warehouse, then probably online ordering system will order item directly from distributor which will take longer time for item delivery.

On the other hand if online store has outdated information that their warehouse has enough stock, the online order will be mistakenly sent to warehouse, which then be re-routed to the distributor.

Both of these scenarios have bad impacts because information was outdated somewhere.

This example may not fit 100% with SQL Server statistics but it is very close.

Let's take a look at SQL Server with same example. Store A has 10 warehouses in a state at different zip code locations. Store decided to merge all warehouses in one big warehouse to reduce overhead cost.

After the physical change, this information has been updated in database, by simply updating affecting zip codes to one zip code. Note that the index is built on Zip code. There is an pre-existing index on the zipcode column.

Order entry system complains that after merging the Zip codes, the system is little sluggish. The reason of this sluggishness lies behind mass updates. The reason behind this issue could most probably be outdated statistics. Let's take this example:

  • Create a table:
CREATE TABLE [dbo].[Warehouse_Inventory](
[ItemCode] [varchar](50) NOT NULL,
[ZipCode] [int] NOT NULL

  • Now insert some data:

declare @zip int, @inventory int
set @zip = 1
while (@zip < 200000)
set @inventory = @zip+22.5
insert into [dbo].[Warehouse_Inventory]
select 'Inventory ' + cast(@inventory as varchar(10)), @zip
select @zip = @zip+1

  • Create a non clustered index on zip code:

CREATE NONCLUSTERED INDEX [ClusteredIndex-myindex] ON [dbo].[Warehouse_Inventory]
[ZipCode] ASC

  •  Let's run our first query to get inventory from location 30001. Only 1 row should be returned.
select zipcode from [Warehouse_Inventory] where ZipCode = 30001 OPTION (RECOMPILE)

Query used the index as expected, let's look at the index statistics. See from the screenshot below the Estimated and Actual number of rows matches. What query optimizer estimated, same was returned. This sounds good.

Execution Plan after creating Index

Since the zip codes were updated, about 90% of the warehouses are moved to one zip code, 30001.  In table, we need to update those 90% of the zip codes.

update [dbo].[Warehouse_Inventory]
set zipcode = 30001
where zipcode < 180000

Without making any further changes, without rebuilding the indexes etc., running the same query again should return 179,000 rows. Let's look at the same index execution plan created by query optimizer.

Query optimizer estimated only 1 row to be returned, though there were 179,000 rows expected.

Now, let's fix the statistic by updating it manually.
update statistics dbo.warehouse_inventory.
After updating the statistics the number of rows estimated by query optimizer are 179,000. This is what we wanted to see in execution plan.

If query optimizer sends wrong estimate then probably wrong execution plan can be picked which can cause some performance sluggishness.

Here is the query that can be used to find out how many modifications were made to statistics. This can probably help in devising a custom solution for some of the indexes which are updated throughout the day in production application.

OBJECT_NAME([sp].[object_id]) AS 'ObjectName',
[sp].[stats_id] AS 'StatID',
[st].[name] AS 'StatName',
[sp].[last_updated] AS 'LastTimeStatUpdated',
[sp].[modification_counter] AS 'NumberOfModifications'
FROM [sys].[stats] AS [st]
OUTER APPLY sys.dm_db_stats_properties ([st].[object_id],[st].[stats_id]) AS [sp]
where OBJECT_NAME([sp].[object_id]) is not null
and OBJECT_NAME([sp].[object_id]) not like 'sys%'
and OBJECT_NAME([sp].[object_id]) in ('Warehouse_Inventory')

The example above is very simple and probably you may not see any major difference in performance, it was just to demonstrate how to proceed with Statistiscs.

There are some options that SQL Server provides at database level. These options are Auto Update Statistics and Auto Update Statistics Asynchronously.

Auto Update Statistics: If while executing the query, the statistics found outdated by a ratio then update the statistics before executing the query.

Auto Update Statistics Asynchronously: If while executing the query, the statistics found outdated by a ratio and statistics update request will be sent to engine in background.

One or both of these options can be switched on. Since all servers have different load, environment, usage, thus be little careful before opting these options. In General, OLTP may respond very well to ASYNC, though database warehouses have SYNC options go well.