We have a business critical server running MS SQL Server 2008 R2 Enterprise on Windows 2008 R2 Enterprise, in the recent past users have started complaining about performance issues, several transactions & reports started taking time to execute, IT manager has asked us to analyze and fix sudden degradation of performance.
We started with IOPS analysis on SAN Storage, IOPS report surprised all of us, as IOPS not even hitting 50% of max allowed threshold. Then I started tuning indexes, dropping unwanted indexes (we have weekly maintenance job for index rebuild/reorganize) but no luck no improvement in performance to further investigate the issue I ran profiler with various counters but could not manage to get single conclusive evidence. At last I turned to my most trusted & reliable friend GOOGLE, and I found loads of stuff on performance tuning which I had already tried, at last I found one good article on examine IO subsystem latencies from within SQL Server by Paul Randal. Analyzing IO subsystem latencies from SQL server through sys.dm_io_virtual_file_stats DMV
–virtual file latency
[num_of_writes] , [num_of_reads] , [ReadLatency] = CASE WHEN [num_of_reads] = 0 THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END, [WriteLatency] = CASE WHEN [num_of_writes] = 0 THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END, [Latency] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
–avg bytes per IOP
[AvgBPerRead] = CASE WHEN [num_of_reads] = 0 THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END, [AvgBPerWrite] = CASE WHEN [io_stall_write_ms] = 0 THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END, [AvgBPerTransfer] = CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0) THEN 0 ELSE (([num_of_bytes_read] + [num_of_bytes_written]) /([num_of_reads] + [num_of_writes])) END, LEFT ([mf].[physical_name], 1) AS [Drive], DB_NAME ([vfs].[database_id]) AS [DB], [mf].[physical_name] FROM sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs] JOIN sys.master_files AS [mf] ON [vfs].[database_id] = [mf].[database_id] AND [vfs].[file_id] = [mf].[file_id] ORDER BY LEFT([mf].[physical_name] , 1) ASC;
This query allows me quickly analyze where read and write hot spots are and also enables me to go to particular database files to see what’s going on and helps me to decide do I have to move these files to dedicated or faster RAID groups.
As you can see query result, one of tempdb files Write latency is bit high, similar issue with other database file groups in many cases read/write latency is more than 100 ms, in case of J drive read latency is 745 ms and write latency is 454 ms.
Conclusion: Question: what is the acceptable latency on production server? Well it depends on the criticality of the application it is running along with RAID groups and technology of the drives (SCSI or SATA or SSD).Keeping all these things in mind we have redesigned our new SAN storage, we have included SSD to host tempdb, RAID 10 to host write intensive file groups and RAID 5 for read intensive file groups along with 100 GB SSD fast cache at storage level. We will be deploying this new storage in a month and I will update you with new result post deployment to compare the performance improvement through new SAN implementation.