Article ID: 897284 - View products that this article applies to.
A database management system (DBMS), such as SQL Server, relies on the timeliness of file input and output (I/O) operations. Any one of the following items may create stuck or stalled I/O problems. These problems can quickly have an adverse affect on SQL Server responsiveness and performance:
Informational message explanation
This informational message indicates that the current load may be experiencing one of the following conditions:
Collapse this tableExpand this table
Stuck I/O and Stalled I/O
Stuck I/OStuck I/O is defined as an I/O request that does not complete. Frequently, stuck I/O is a stuck IRP. To resolve a stuck I/O condition, you typically must restart the computer or perform a similar action. A stuck I/O condition typically indicates one of the following:
Stalled I/OStalled I/O is defined as an I/O request that does complete. However, the I/O operation takes excessive time to complete. Stalled I/O behavior typically occurs because of one of the following reasons:
SQL Server stalled I/O and stuck I/O recording and reportingMicrosoft SQL Server Support handles many cases each year that involve stuck or stalled I/O problems. As outlined later, these I/O problems appear in different ways. I/O problems are some of the most difficult problems to diagnose and to debug. I/O problems require significant time and resources for debugging from Microsoft and from the customer. The reporting features that have been added to SQL Server 2000 SP4 and in later versions significantly reduce the time that is required to identify an I/O problem.
The reporting and the recording of I/O requests are designed on a per file basis. The detection and the reporting of stalled and stuck I/O requests are two separate actions.
RecordingThere are two locations in SQL Server where record action take place. The first location in which recording occurs, is where the I/O operation actually completes. If an I/O request takes more than 15 seconds to complete, a record operation occurs. The second location in which recording occurs, is when the lazy writer runs. When the lazy writer runs, the lazy writer checks all the pending data and all the pending log file I/O requests. If the 15-second threshold has been exceeded, a record operation occurs.
ReportingReporting occurs in intervals that are 5 minutes, or more, apart. Reporting occurs when the next I/O request is made on the file. If a record action has occurred and 5 minutes or more have passed since the last report occurred, the informational message that is mentioned in the "Summary" section is written to the SQL Server error log.
The 15-second threshold is not adjustable. However, you can disable stalled and stuck I/O detection by using trace flag 830 but we do not recommend that you do this.
To disable detection when SQL Server starts, use the -T830 startup parameter. Using the -T830 startup parameter will disable detection every time SQL Server is started. To disable detection for an instance of SQL Server that is currently running, use the following statement:
dbcc traceoff(830, -1)
This setting is only effective for the life of the SQL Server process.
Note An I/O request that becomes stalled or stuck is only reported one time. If the message reports that 10 I/O requests are stalled, those 10 reports will not occur again. If the next message reports that 15 I/O requests are stalled, that means that 15 new I/O requests have become stalled.
Tracking the I/O request packet (IRP)SQL Server uses the standard Microsoft Windows API calls to read and to write data. For example, SQL Server uses the following functions:
http://support.microsoft.comNote Kernel debugging can be an invasive process because kernel debugging can require you to stop the system to complete the debugging actions. We recommend that you see if there are any updates available for the following items:
System performance and query plan actionsOverall system performance can play a key role in I/O processing. You should take the general health of the system into consideration when you are investigating reports of stalled or stuck I/O operations. Excessive loads can cause the overall system to be slow. This includes I/O processing. The behavior of the system at the time the problem occurs can be a key factor in determining the root cause of the problem. For example, if CPU usage becomes high or if CPU usage remains high when the problem occurs, this behavior may indicate that a process on the system is using so much CPU that other processes are being adversely affected.
Performance countersTo monitor I/O performance, examine the following performance counters for specific I/O path information:
Be careful with the performance counters because SQL Server takes full advantage of asynchronous I/O capabilities that push the disk queue lengths heavily. Therefore, longer disk queue lengths alone do not indicate a problem.
In Windows System Monitor, you can review the counters "Physical Disk: Disk Bytes/sec" for each affected disk and compare the rate of activity against the counters "Process: IO Data Bytes/Sec" and “Process: IO Other Bytes/sec” for each process to identify whether a specific set of processes is generating excessive I/O requests. There are various other I/O related counters available in the Process object that reveals more granular information. If you determine that a SQL Server instance is responsible for excessive IO load on the server, review the next section on “Indexes and Parallelism”. For a detailed discussion on detecting and resolving I/O bottlenecks, review the section “I/O Bottlenecks” in the MSDN whitepaper Troubleshooting Performance Problems in SQL Server 2008
(http://msdn.microsoft.com/en-us/library/dd672789.aspx)or Troubleshooting Performance Problems in SQL Server 2005
Indexes and parallelismFrequently, bursts of I/O occur because an index is missing. This behavior can push the I/O path heavily. A pass with the Index Turning Wizard (ITW) may help resolve I/O pressure on the system. If a query benefits from an index instead of from a table scan or perhaps even uses a sort or hash, the system can gain the following advantages:
Practical examples from Microsoft SQL Server SupportThe following examples have been handled by Microsoft SQL Server Support and Platforms Escalation Support. These examples are intended to provide a frame of reference and help set your expectations about stalled and stuck I/O situations and about how a system may be affected or may respond. There is no specific hardware or set of drivers that pose any specific risk or increased risk over another. All systems are the same in this respect.
Example 1: A log write that is stuck for 45 secondsA SQL Server log file write attempt periodically became stuck for approximately 45 seconds. The log write was not finishing in a timely manner. This behavior created a blocking condition that led to 30-second client time-outs.
The application submitted a commit to SQL Server and the commit became stuck as a log write pending. This behavior caused the query to continue holding locks and to block incoming requests from other clients. Then, other clients started to time out. This compounded the problem because the application was not designed to roll back open transactions when a query time out occurred. This created hundreds of open transactions that were holding locks. Therefore, a severe blocking situation occurred.
For more information about transaction handling and blocking, click the following article number to view the article in the Microsoft Knowledge Base:
224453The application serviced a Web site using connection pooling. As more connections became blocked, the Web site created more connections. These connections became blocked and the cycle continued.
(http://support.microsoft.com/kb/224453/ )Understanding and resolving SQL Server 7.0 or 2000 blocking problems
After approximately 45 seconds, the log write completed. However, by this time, hundreds of connections were backed up. The blocking problems resulted in several minutes of recovery time for SQL Server and for the application. When combined with the application problems, the stalled I/O condition had a very negative affect on the system.
ResolutionThe problem was tracked to a stuck I/O request in a Host Bus Adapter (HBA) driver. The computer had multiple HBA cards with failover support. When one HBA was behind or was not communicating with the Storage Area Network (SAN), the "retry before failover" time-out value was configured to 45 seconds. When the time-out was exceeded, the I/O request was routed to the second HBA. The second HBA handled the request and quickly finished. To help prevent such stall conditions, the hardware manufacturer recommended a "retry before failover" setting of 5 seconds.
Example 2: Filter driver interventionMany antivirus software programs and backup products use I/O filter drivers. These I/O filter drivers become part of the I/O request stack, and they have access to the IRP request. Microsoft Product Support Services has seen various issues from bugs that create stuck I/O conditions or stalled I/O conditions in a filter driver implementation.
One such condition was a filter driver for backup processing that allowed a backup of the files that were open when the backup occurred. The system administrator had included the SQL Server data file directory in the file backup selections. When the backup occurred, the backup tried to gather the correct image of the file at the time the backup started. Doing this delayed I/O requests. The I/O requests were allowed to only complete one at a time as they were handled by the software.
When the backup started, SQL Server performance dropped dramatically because the I/Os of SQL Server were forced to complete one at a time. To compound the issue, the "one at a time" logic was such that the I/O operation could not be performed asynchronously. Therefore, when SQL Server expected to post an I/O request and to continue, the worker was stuck in the read or the write call until the I/O request finished. Processing tasks such as a SQL Server read ahead were effectively disabled by the actions of the filter driver. Additionally, another bug in the filter driver left the "one at a time" actions in process, even when the backup was finished. The only way to restore SQL Server performance was to close and then reopen the database, or to restart SQL Server so that the file handle was released and reacquired without the filter driver interaction.
ResolutionTo resolve this problem, the SQL Server data files were removed from the file backup process. The software manufacture also corrected the problem that left the file in "one at a time" mode.
Example 3: Hidden errorsMany higher end systems have multichannel I/O paths to handle load balancing or similar activities. Microsoft Product Support has found problems with the load balancing software where an I/O request fails but the software does not handle the error condition correctly. The software can attempt infinite retries. The I/O operation becomes stuck and SQL Server cannot finish the specified action. Much like the log write condition that was described earlier, many poor system behaviors can occur after such a condition wedges the system.
ResolutionTo resolve this problem, restarting SQL Server is often required. However, sometimes you must restart the operating system to restore processing. We also recommend that you obtain a software update from the I/O vendor.
Example 4: Remote storage, Mirroring, and Raid drivesMany systems use mirroring or take similar steps to prevent data loss. Some of the systems that use mirroring are software based and some are hardware based. The situation typically uncovered by Product Support Services for these systems is increased latency.
An increase in the overall I/O time occurs when the I/O must complete to the mirror before the I/O is considered complete. For remote mirror installations, network retries can become involved. When drive failures occur and the raid system is rebuilding, the I/O pattern can also be interrupted.
ResolutionStrict configuration settings are required to reduce latency to mirrors or to raid rebuild operations.
Example 5: CompressionMicrosoft does not support Microsoft SQL Server 7.0 or Microsoft SQL Server 2000 data files and log files on compressed drives. NTFS compression is not safe for SQL Server because NTFS compression breaks Write Ahead Logging (WAL) protocol. NTFS compression also requires increased processing for each I/O operation. Compression creates "one at a time" like behavior that causes severe performance issues to occur.
ResolutionTo resolve this problem, uncompress the data and the log files.
Additional data pointsThe wait type information that is provided in the sysprocesses system table can be helpful in diagnosing I/O bottlenecks. Buffer I/O latch wait types and writelog waits are key indicators to investigate I/O path performance.
For more information about wait types, click the following article number to view the article in the Microsoft Knowledge Base:
(http://support.microsoft.com/kb/822101/ )Description of the waittype and lastwaittype columns in the master.dbo.sysprocesses table in SQL Server 2000 and SQL Server 2005
(http://support.microsoft.com/kb/231619/ )How to use the SQLIOStress utility to stress a disk subsystem such as SQL Server
(http://support.microsoft.com/kb/826433/ )Additional SQL Server diagnostics added to detect unreported I/O problems
(http://support.microsoft.com/kb/230785/ )SQL Server 7.0, SQL Server 2000, and SQL Server 2005 logging and data storage algorithms extend data reliability
SQL Server requires systems to support ‘ guaranteed delivery to stable media ’ as outlined under the Microsoft SQL Server Always-On Storage Solution Review program. Fo For more information about the input and output requirements for the SQL Server database engine, click the following article number to view the article in the Microsoft Knowledge Base:
(http://support.microsoft.com/kb/967576/ )Microsoft SQL Server Database Engine Input/Output Requirements
Article ID: 897284 - Last Review: August 5, 2009 - Revision: 3.0