Knowing SQL Server: Disk corruption disappears

The situation is commonly as follows:

There is a regular SQL Agent job that runs DBCC CHECKDB

One morning the DBA finds that the job failed, reporting corruptions in one of the databases

The DBA runs DBCC CHECKDB on that database again, but this time there are no reported corruptions

This can lead the DBA to mistrust DBCC CHECKDB. Remember the SQL Server 2000 days where sometimes DBCC CHECKDB occasionally reported corruptions when there weren’t any? Those days are long gone now: if DBCC CHECKDB reports corruption, then at that time that it ran there was definitely corruption.

Think about what DBCC CHECKDB is doing: it reads and processes all the allocated pages in the database – all the pages that are part of tables and indexes at the time that DBCC CHECKDB runs. It doesn’t check all the pages in the data files; only those that are currently being used. The pages that are not currently allocated to an object cannot be checked as there’s no “page history” maintained. There’s really no way for DBCC CHECKDB to tell if they have ever been used before or not and since they’re not currently allocated there’s no valid page structure on them and no past to verify.

And if your database is still being accessed then the set of allocated pages can change after DBCC CHECKDB runs. A simple example of this occurring is:

Nonclustered index X of table Y has some corrupt pages in, which the DBCC CHECKDB (being run by a SQL Agent job) reports

Another SQL Agent job runs and performs index maintenance where it rebuilds index X (the rebuild operation always builds a new index and then drops the old index)

The DBA runs DBCC CHECKDB manually and there are no corruptions reported in the new index structure

Nonclustered index corruption is the best kind of corruption to have. The rebuild operation rewrote the index to a new set of pages and deallocated the pages that had corruption. When DBCC CHECKDB is run manually, those new pages are not corrupt and the old pages are not checked, as they are no longer in use.

These kind of ‘disappearing’ corruptions are a problem because it’s almost impossible to investigate them further. However, they could indicate a problem with your I/O subsystem. If you find that they’re occurring repeatedly, consider briefly preventing the process that causes the corrupt pages to be deallocated so you can investigate the corruption.

Another cause of disappearing corruptions can be transient I/O subsystem problems, where page reads sometimes fail outright and then succeed after that.

And yet one more cause could be that the database is mirrored or is part of an availability group and the page was fixed by automatic page repair before the second DBCC CHECKDB occurred. You can look in the msdb.dbo.suspect_pages table (more details here) for an entry for the broken page(s) with event_type of 4.

Bottom line: From SQL Server 2005 onward, if DBCC CHECKDB reports corruption, then at the time that it ran there definitely was corruption. Make sure you don’t just ignore the problem as next time the corruption occurs, you may not be so ‘lucky’ that it just seemed to disappear.

Thanks to Paul Randall

Knowing SQL Server

Saturday, November 23, 2024

Disk corruption disappears

No comments:

Post a Comment

Blog Archive