Real Life Block Corruption (Maybe)

What’s the worst thing that can happen to a database? I think most DBAs will agree that block corruption is a good candidate on the list. When DBAs debate the soundness of their backup policy, corrupted blocks are often used as test cases and rhetoric devices: “Keep just 3 days of backup? But what if a block is corrupted on Saturday and we don’t find out until Monday?”.

Until this week, I only knew about block corruptions from my certification studies and from recovery practices (using dd to corrupt blocks is a common gambit).

We had a block corruption this week. At least, we think we did – neither us, nor Oracle support are 100% certain. It was nothing like the text books described.

On Saturday, our DB crashed. The error in the alert log indicated a corrupted block. We restarted the DB, and…. did nothing. My manager sent me an email asking me to open a ticket to Oracle about this. I saw the email on Monday, failed to realize the importance of the problem (I suck!) and proceeded to work on other tasks.

On Tuesday the DB crashed again. This time it also sprouted endless Ora-600 [2662] error message once it started. We gave it another restart, this time it started fine. I did open the ticket to Oracle. Priority 1. We ran a bunch of verifications – RMAN validation, DBV, analyzing bunch of tables and indexes.

RMAN and DBV did not detect any issues. Full export completed successfully. No one is actually certain this is a block corruption. The only strangeness was an index that appeared in DBA_INDEXES but did not exist when we tried to run analyze. We asked our sys admins to check the machine, the OS and the connected storage.

On Wednesday the server crashed again. Again a corrupt block. Different file this time. Oracle supports found that one of the millions of ORA-600 and ORA-7445 errors we’ve seen could be related to a SQL parsing bug and suggested a patch.

We’ve had it. In an emergency 10 hour maintenance, we used export/import to move all the schemas to a new DB server.

We hope this is the end of the problem, but we can’t really tell. Which is exactly how real DBA life is so different from textbook descriptions and recovery practices.


11 Comments on “Real Life Block Corruption (Maybe)”

  1. Noons says:

    “real DBA life is so different from textbook descriptions”

    Bingo! Couldn’t agree more, Chen. This is why I keep sneering at all the db1.0-vs-dba2.0 scenarios and other such nonsense.

    I’d love to see what a dba2.0 would do in a situation like this, where there is not a single specific screen with an “OK” button that can handle the problem and they have to use knowledge and intuition derived from training to actually come up with a solution.

    And preferably one that does not cause the problem to show up again.

    I’d say the solution you folks implemented is likely the safest action and the one leading to a long-lasting, stable solution. Which is what a dba should strive for, instead of clicking on buttons without ever fixing the real problem.

    • alter system set db_ultra_safe=’DATA_AND_INDEX’

      and Oracle will automatically repair your blocks

      Come on DDL, do not tell me you use an Oracle Version that is older than you ?

      Enjoy automatic block repair :
      3.18 Automatic Block Repair
      Automatic block repair allows corrupt data blocks to be automatically repaired as soon as the corruption is detected. This feature reduces the amount of time that data is inaccessible due to block corruption. This reduces block recovery time by using up-to-date good blocks in real-time, as opposed to retrieving blocks from disk or tape backups, or from Flashback logs.

      • prodlife says:

        @Laurent
        I actually saw this new feature (I did lots of reading about corrupt blocks!), but considering the fact that no utility managed to find any corrupt block, we probably couldn’t have used this feature anyway.

  2. Cristian says:

    i agree with Noons, more, it’s never something like text books describes.

  3. Milan N. says:

    Thanks for sharing this story.
    Good thing you could afford 10-hour maintenance window.
    Otherwise I would like to rely on Data Guard environment in situation like yours …but that brings us to discussion about ‘real DBA life’ again …

  4. karlarao says:

    Same here🙂 not so similar but, I’m hitting a bug. I’ve got this recently installed 2node 11gR1 RAC on Windows 64-bit.

    After the installation, I enabled archivelog mode. Then did a backup. Then did a logical & physical corruption check (RESTORE DEVICE TYPE DISK VALIDATE CHECK LOGICAL DATABASE). oops.. I’ve got some errors on the alert log

    “krbr8b2 – Error validating file 2, block 66775: logical corruption”

    It was the SYSMAN.MGMT_DB_INIT_PARAMS_ECM_PK index

    Oracle Support says, the safest way to get rid of the problem is to drop and recreate the repository.. So I executed the action plan, and did the following again:

    – backup incremental level 0
    – restore validate check logical <– oops still, the same error
    – backup validate check logical <– no errors🙂
    – dbv on the file <– no failing pages on datafile 2🙂
    – executed the script on Note 472231.1 to get corrupted objects, it's still the same object
    – analyze table.. validate structure <– no errors🙂
    – analyze index.. validate structure <– no errors🙂

    Oracle Support confirmed it was a bug… and my SR is currently assigned to development for patch creation…

    Bug 7041254 – ORA-19661 during RMAN restore check logical of compressed backup / IOT dummy
    key
    Doc ID: 7041254.8

    – Karl Arao

  5. Matthieu says:

    We had the same issue here a couple of months back, RMAN validated fine, dbv ran fine, all checks were coming back clean, but we still had the application crashing on a corrupt block. Oracle then gave us a simple script that selects every row from the suspected table, and catches any exception, the row where the exception is raised is the culprit. It turned out we did have a corrupted block, and that RMAN and dbv didn’t picked it up…

    • karlarao says:

      Yeah and sometimes the block could be part of free space.
      So to manually format the block you can use this Metalink Note

      Note.336133.1 Ext/Pub How to Format Corrupted Block Not Part of Any Segment

      @Matthieu,
      Do you have a metalink note for that script you have?

  6. joel garry says:

    What’s the worst thing that can happen to a database?

    That would be undetected data errors accumulating over time. Useless backups would be a close second, but perhaps much more common than anyone would care to admit.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s