Real Life Block Corruption (Maybe)
Posted: September 4, 2009
What’s the worst thing that can happen to a database? I think most DBAs will agree that block corruption is a good candidate for the list. When DBAs debate the soundness of their backup policy, corrupted blocks are often used as test cases and rhetorical devices: “Keep just 3 days of backup? But what if a block is corrupted on Saturday and we don’t find out until Monday?”
Until this week, I only knew about block corruptions from my certification studies and from recovery practices (using dd to corrupt blocks is a common gambit).
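For readers who have never done the dd exercise, here is a minimal sketch of the idea in Python. It works on a throwaway scratch file, not a real datafile, and everything in it is my own assumption for illustration: the 8192-byte block size, the helper names (`block_digests`, `corrupt_block`, `changed_blocks`) and the use of SHA-256 digests as a stand-in for a real block check. It overwrites one block in place, roughly what dd does with `seek=` and `conv=notrunc`, and then shows that a per-block comparison pinpoints the damaged block.

```python
import hashlib
import os
import tempfile

BLOCK_SIZE = 8192  # assumed block size for this sketch only


def block_digests(path, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def corrupt_block(path, block_no, block_size=BLOCK_SIZE):
    """Overwrite a single block with zeros without truncating the file."""
    with open(path, "r+b") as f:
        f.seek(block_no * block_size)
        f.write(b"\x00" * block_size)


def changed_blocks(before, after):
    """Return the indexes of blocks whose digest differs between two scans."""
    return [i for i, (a, b) in enumerate(zip(before, after)) if a != b]


def demo():
    """Build a 16-block scratch file, damage block 5, report what changed."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(16):
                # Each block holds a distinct non-zero byte value.
                f.write(bytes([i + 1]) * BLOCK_SIZE)
        before = block_digests(path)
        corrupt_block(path, 5)
        after = block_digests(path)
        return changed_blocks(before, after)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

In the practice scenario the damage is this clean: one known block, found on the first scan. Keep that in mind for what follows.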
We had a block corruption this week. At least, we think we did – neither we nor Oracle Support are 100% certain. It was nothing like what the textbooks described.
On Saturday, our DB crashed. The error in the alert log indicated a corrupted block. We restarted the DB, and... did nothing. My manager sent me an email asking me to open a ticket with Oracle about this. I saw the email on Monday, failed to realize the importance of the problem (I suck!) and proceeded to work on other tasks.
On Tuesday the DB crashed again. This time it also spouted endless ORA-600 error messages once it started. We gave it another restart, and this time it started fine. I did open the ticket with Oracle. Priority 1. We ran a bunch of verifications – RMAN validation, DBV, and analyzing a bunch of tables and indexes.
RMAN and DBV did not detect any issues. A full export completed successfully. No one is actually certain this is a block corruption. The only strangeness was an index that appeared in DBA_INDEXES but did not exist when we tried to analyze it. We asked our sysadmins to check the machine, the OS and the connected storage.
On Wednesday the server crashed again. Again a corrupt block. Different file this time. Oracle Support found that one of the millions of ORA-600 and ORA-7445 errors we’d seen could be related to a SQL parsing bug and suggested a patch.
We’d had it. In an emergency 10-hour maintenance window, we used export/import to move all the schemas to a new DB server.
We hope this is the end of the problem, but we can’t really tell. Which is exactly how real DBA life is so different from textbook descriptions and recovery practices.