When routine maintenance goes bad

Yesterday we had a routine maintenance window to move a large customer's database to a newer storage system. We knew it was not going to be completely straightforward because there was a security device involved. The database files (data, redo logs, archives) are encrypted on the storage device, so if someone gets hold of the data files they are useless to them; the database accesses the storage through the security device, which decrypts the data for the database.

Yes, if we ever have performance issues there we’ll know what to blame.

Anyway, since we knew this was a complex setup, we staged and tested the entire move in advance. It worked. Twice. So we were pretty confident we could get everything done in about 10 minutes. Of course, we scheduled a three-hour maintenance window, starting at 12pm.

At 12:30 we were running on the new device. At 12:40 the DB crashed. A short verification revealed that the disks with the data were gone and that we were getting I/O errors when trying to access our redo logs and archives.

We got the device vendor's support on the phone, and the case escalated to tier 2, tier 3 and finally the development team. Slowly, the situation became painfully clear: all the data files were encrypted and the encryption key was corrupted. What's worse, our backups were encrypted in exactly the same way, so they were all lost as well. We double- and triple-checked this; we had no way to access our backups.

This is even worse than the ASM scenario. How can we recover from such a crash? Luckily, we discovered that the export files were kept unencrypted, so we could fall back on the last export, taken 24 hours earlier. A major loss of data, but it could have been so much worse.
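For reference, falling back on the export would have meant recreating an empty database and importing the last dump into it. A minimal sketch, assuming the exports were taken with Data Pump (expdp); the directory object and dump file name below are made-up placeholders:

    # Import the last nightly export into a freshly created database.
    # DAILY_EXP and full_prod.dmp are hypothetical names for illustration.
    impdp system@PROD directory=DAILY_EXP dumpfile=full_prod.dmp full=y logfile=imp_restore.log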

At 10pm our systems team, working with the device's development team, finally managed to recover the most essential files: the data files and redo logs. We could not recover the archived logs.

We started the database and it demanded media recovery, probably because it crashed when the disks disappeared. Since we had no archived logs, we ran recovery until cancel and immediately canceled the recovery. Then we could open the database with resetlogs, and this worked fine. Around 11pm we had a working database.
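In SQL*Plus, the sequence was essentially the standard incomplete-recovery routine (prompts and file paths omitted):

    -- The database would not open because it required media recovery
    STARTUP MOUNT;

    -- Begin incomplete recovery; with no archived logs to apply,
    -- answer the first log prompt with CANCEL to stop immediately
    RECOVER DATABASE UNTIL CANCEL;

    -- Open with RESETLOGS to create a fresh set of online redo logs
    ALTER DATABASE OPEN RESETLOGS;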

We did not realize how much risk the encryption added to our recovery plans, and indeed to any maintenance. A single point of failure that no one fully understood simply held our database hostage. That's as helpless as DBAs get.

What should we do better? Unfortunately, complete encryption was demanded by the customer, so there was nothing we could do about it. I suggested encrypting the backups separately, with a different method and a different key. If we really want to throw money at the issue, we can have a separate database with its own security box and use Oracle Data Guard to ship the data to the failover system. That would be somewhat safer.
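As a sketch of the first suggestion: one way to do it is RMAN's password-based backup encryption, where the backup key has nothing to do with the storage device's key, so losing the device (or its key) would not take the backups with it. The password below is obviously a placeholder:

    # In RMAN: take backups encrypted with a separate, password-only key
    SET ENCRYPTION ON IDENTIFIED BY 'separate_backup_password' ONLY;
    BACKUP DATABASE PLUS ARCHIVELOG;

    # Before a restore, supply the same password
    SET DECRYPTION IDENTIFIED BY 'separate_backup_password';
    RESTORE DATABASE;

The ONLY clause makes the encryption purely password-based, so restoring does not depend on any wallet or external key store.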
