Tom Kyte published his New Year’s resolutions, and his first one is “Practice restore every month”. According to Tom, practicing restore on a regular basis is the best way to avoid making mistakes on restore, which is something DBAs can’t afford. I agree, and I have a few tips for those of you who practice restores.
- Don’t do it on production. OK, this one is trivial. What I mean is that you can’t really practice restore on your private “fun and games” DB either, because it is likely to be very different from the real ones: it usually has less data, and it is probably not connected to the NetApp.
We practice our recovery on our staging environment, which is a copy of our production, but this means we have to schedule our practices for times when no one is using staging for their own tests. QA databases are also fun to practice on. At a previous workplace, QA trashed their DB so often that I practiced recovery on an almost weekly basis.
- You can design some of the internal procedures so you’ll have to recover all the time. For example, decide that you will create clones of existing DBs by copying files over, without shutting down the DB, and leave out a file or two for the challenge. If you can’t recover from what you made, make sure you understand why and verify that this situation can never ever occur in production. If you (like me) have to clone DBs on a somewhat regular basis, this gives you an opportunity to use the time to practice recovery without having to justify the practice to anyone.
- Get your manager to watch the practice. This has two great benefits. One is that you get to practice in the realistically stressful situation of having a manager looking over your shoulder. The other is that when you recover in an emergency, the manager will be somewhat familiar with what you do and will ask fewer questions when you are trying to think.
- By practicing recovery on realistic environments in advance, you will have a good idea of how long recovery should take. Try to finish each practice with a nice table saying something like “For each 20G of lost data files, add 30 minutes to recovery; for each 6 hours of redo to apply, add an hour to recovery”. Try to get the business stakeholders to sign off on these numbers. If they refuse, maybe it’s time to think of a new strategy, perhaps one involving database flashback or Data Guard. In any case, when you are in a real recovery situation, these numbers will allow you to give your business a reasonably accurate estimate of how long the downtime is going to last. This is always a huge win for the DBAs and reduces stress levels significantly.
- Recovery is one of those things that are awfully complicated when you memorize a bunch of rules. Do I need the instance up or down? Do I need a backup of the control file? Which redo logs do I need? Do I recover until cancel, until point in time, until now? It magically becomes very easy when you understand the idea behind those rules. Don’t memorize when you need the backup control file; instead, take a few minutes (hours, days) to understand the idea of SCNs and how they are used in recovery. So, when you practice, make sure you understand why you do each step that you do. Don’t just work off a checklist.
- Don’t throw away the checklist. Either have a bunch of checklists ready for every scenario (my method) or write a checklist and have someone review it before starting recovery (our senior DBA does that). Always work with a list. You don’t want to risk forgetting any step.
- Don’t forget to verify the production backups themselves on a very regular basis. Nothing is worse than discovering that for some reason your backup is useless. I try to always keep a recent full export on the side for such emergencies. I don’t think we ever had to use it, but we came very close once or twice, and it was comforting to know that if we could do nothing else, at least the export was there.
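The timing table suggested in the tip about estimating recovery time can be kept as a tiny calculator and refreshed after each practice. The following is a minimal Python sketch, not a measured model: the rates are the example figures quoted above (30 minutes per 20G of lost data files, one hour per 6 hours of redo), the names `RecoveryRates` and `estimate_recovery_minutes` are made up for illustration, and it assumes restore and redo apply happen one after the other and scale linearly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryRates:
    """Rates taken from practice restores (example figures, replace with your own)."""
    minutes_per_20g_datafiles: float = 30.0  # 30 minutes per 20G of lost data files
    minutes_per_6h_redo: float = 60.0        # 1 hour per 6 hours of redo to apply


def estimate_recovery_minutes(lost_gb: float, redo_hours: float,
                              rates: RecoveryRates = RecoveryRates()) -> float:
    """Linear downtime estimate in minutes: file restore time plus redo apply time."""
    if lost_gb < 0 or redo_hours < 0:
        raise ValueError("lost_gb and redo_hours must be non-negative")
    restore = (lost_gb / 20.0) * rates.minutes_per_20g_datafiles
    apply_redo = (redo_hours / 6.0) * rates.minutes_per_6h_redo
    return restore + apply_redo
```

For example, `estimate_recovery_minutes(100, 12)` adds 150 minutes of restore to 120 minutes of redo apply. Real recoveries are rarely this linear, so the value of the exercise is mostly in comparing the estimate with what each practice run actually took.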
Many of our application owners arrive from Microsoft SQL Server backgrounds, where they learned some very bad habits. For example, when using SQL Server’s query analyzer, every time you run a query it will automatically commit. When we started using Oracle, they were confused at first, but quickly learned to configure their Oracle tools (AquaStudio is rather popular for some reason) to emulate the same behavior. Usually it doesn’t really matter, until someone makes a mistake.
“Chen, how do I roll back a transaction after it was committed?” is not the most frequent question I hear, but it is somewhere in the top 10.
We keep our undo for about 30 minutes, so if they catch me fast enough, I’m able to save the day using the magical flashback query:
SELECT * FROM mytable AS OF TIMESTAMP TO_TIMESTAMP('06-aug-07 08:30:00','DD-MON-YY HH24:MI:SS')
Once the user verifies that I managed to get a good copy of his table, I usually save it as another table, so I won’t lose it once too much time has passed.
create table mytable_old as SELECT * FROM mytable AS OF TIMESTAMP TO_TIMESTAMP('06-aug-07 08:30:00','DD-MON-YY HH24:MI:SS')
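When this has to be typed under pressure, it is easy to get the timestamp literal and the format mask out of step. A minimal Python sketch of a helper that generates the statement is shown below. The function names, the numeric date format, and the 30-minute default are assumptions for illustration (the default mirrors the undo retention mentioned above). The window check is only a hint, since Oracle does not promise that undo survives for the whole retention period.

```python
from datetime import datetime, timedelta

# Oracle format mask and the matching strftime pattern (numeric, so no month-name language issues).
ORACLE_MASK = "YYYY-MM-DD HH24:MI:SS"
PY_PATTERN = "%Y-%m-%d %H:%M:%S"


def within_undo_window(target: datetime, now: datetime, retention_minutes: int = 30) -> bool:
    """True if `target` is not in the future and no older than the undo retention."""
    return timedelta(0) <= now - target <= timedelta(minutes=retention_minutes)


def flashback_copy_sql(table: str, copy_name: str, target: datetime) -> str:
    """Build a CREATE TABLE ... AS SELECT ... AS OF TIMESTAMP statement as a string."""
    for name in (table, copy_name):
        # Names are pasted into the statement, so accept plain identifiers only.
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    literal = target.strftime(PY_PATTERN)
    return (f"CREATE TABLE {copy_name} AS SELECT * FROM {table} "
            f"AS OF TIMESTAMP TO_TIMESTAMP('{literal}', '{ORACLE_MASK}')")
```

The helper only builds text; running it against the database is still a deliberate manual step. Schema-qualified names such as `hr.mytable` are rejected by the identifier check, which would need relaxing if you use them.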
And then I have to figure out how to use the old data to undo what the user did. Do I insert missing rows? Reverse a bad update? It all depends on the exact mistake, but if I have good data from flashback, I know that we can fix the damage. Now the user can go to her whiteboard and write 50 times: “I will not use auto commit”.
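Deciding between re-inserting rows and reversing an update usually starts with comparing the saved copy to the current table. The sketch below shows that comparison in Python on rows already fetched into dictionaries keyed by primary key. The function name `diff_rows` and the in-memory representation are assumptions for illustration. On a large table the same comparison would more sensibly be done in SQL.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def diff_rows(old: Dict[int, Row], current: Dict[int, Row]) -> Tuple[List[int], List[int], List[int]]:
    """Compare a flashback copy with the current table, both keyed by primary key.

    Returns (missing, changed, extra):
      missing: keys only in the old copy (candidates for re-insert)
      changed: keys in both whose column values differ (candidates for reversing an update)
      extra:   keys only in the current table (rows added after the snapshot time)
    """
    missing = sorted(k for k in old if k not in current)
    changed = sorted(k for k in old if k in current and old[k] != current[k])
    extra = sorted(k for k in current if k not in old)
    return missing, changed, extra
```

Rows in `extra` are not necessarily part of the mistake; they may be legitimate work done after the snapshot time, so they deserve a look before anything is deleted.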
I heard that in 11g flashback will use the redo logs and not just the undo, so theoretically it means that there won’t be any time constraints on resolving the issue. I’m looking forward to that.
Sometimes, DBAs make mistakes too. And our mistakes are even more painful because DROP and TRUNCATE can’t be rolled back. Oracle now has an option of keeping dropped tables for a while longer in a recycle bin before really dropping them, allowing the DBA to fix his mistakes. We don’t use the recycle bin for a silly reason: when we first started using 10g, this feature caught us by surprise and caused us to run out of space when least expected, so we turned it off and planned to deal with it later. Later hasn’t happened yet. So when we drop an important table by mistake, we need to restore our backups on a standby machine, roll the logs forward to the point just before the mistake and get the data from there. That can easily take two hours, while using flashback takes a few minutes.
Maybe it’s time to revisit our recycle bin policy.
One of the databases we need to support is a MySQL database, installed on a Windows server. The DBA team collectively has about 3 days of experience with MySQL. The system is technically production, but it is used only by internal users. It is a rather large database (200G) and gets tons of traffic because the application does lots of automatic data collection and reports. In addition, it is an ultra-sensitive application, which will crash if we stop the database for more than a minute.
But the worst thing is that the developers can’t keep their hands off the database. So every day contains a new surprise: a new version, a new storage system, the data files moved to a new directory, all our backup scripts deleted, etc.
I placed the MySQL data files on a NetApp LUN (connected to Windows via iSCSI) and I wrote a very nice script that locked the tables, flushed them to disk, took a NetApp snapshot and unlocked the tables. I was rather proud of this script because it was really a good use of NetApp and allowed us to take backups without upsetting our sensitive application.
Well, a few weeks ago we noticed that snapshots had stopped on the relevant volume. Further investigation revealed that the backup scripts were gone. Well, I’m used to that, and I immediately uploaded a new copy of the scripts.
Only today I noticed that at some point our developers changed MySQL storage from MyISAM to InnoDB. InnoDB tables are not affected by lock and flush commands, which means that we have no valid backups.
I looked into InnoDB backup possibilities a bit and there seem to be two solutions, neither of which is really compatible with NetApp snapshot capabilities, which means that they will be really slow and take up too much storage. I’m still checking our options with NetApp support, but we may need to resort to cold backups.
In the meantime my opinion of MySQL just got a bit more negative. It appears to give DBAs (or developers) two choices for storage:
1. MyISAM – which is quick and can be easily backed up, but doesn’t support transactions, ACID or point-in-time recovery, so it can barely be called a real DB.
2. InnoDB – which can be called a real DB, except that it is very difficult to back up.
I love Open Source, but I prefer my databases to be Oracle.
A poor soul landed on my blog after searching for “oracle db won’t come up after crash”. I can just imagine the lonely DBA sitting in front of the crashed DB without a clue how to proceed.
Bummer, isn’t it? The best solution is to find the nearest experienced DBA and get him to help you. But if you don’t have one around, here’s my take on the situation:
It can be that the same issue that caused the crash still exists. Lack of disk space and other storage issues are the first things that come to mind, but I’m sure there can be others.
Another possibility is that while the DB was up you made some changes to the SPFILE, and when it tries to start with the new SPFILE, the changes you made prevent it from starting. In this case, I’d try starting from a backup of the spfile or init.ora (you did back up these very important files, right?).
The most common possibility is that nothing prevents the DB from starting, but when it crashed it got into a slightly inconsistent state and now it needs some recovery. You can use O’Reilly’s “Recovering Oracle” page, which is very comprehensive and covers almost every scenario.
Most likely you have a damaged data file, and when you try to do “alter database open” Oracle will tell you exactly which data file is damaged. In this case I hope you have your archive log files ready, and you can do “Recover datafile '<datafile-name>';”. You can also try recovering to a known good time before the crash by doing “Recover database until time '2001-03-06:16:00:00'”.
There are tons of other options: recover until cancel, bring datafiles back from backup, bring the control file back from backup, etc. This is why I recommend looking up O’Reilly’s page in emergencies and reading Oracle’s Backup and Recovery book cover to cover while you are not handling emergencies.
Understanding Oracle’s concepts of datafiles, redo logs, log archives and control files goes a long way toward having calm and successful recoveries.
I also really recommend practicing recovery from backup on a development system every month or two, because if you are used to doing recovery it will be that much easier to do it when disaster strikes.
Yesterday we had a routine maintenance to move the database of a large customer to a newer storage system. We knew it was not going to be completely straightforward because there was a security device involved. The database files (data, redo logs, archives) are encrypted on the storage device, so if someone gets the data files, they are useless to him. The database accesses the storage through the security device, which decrypts the data for the database.
Yes, if we ever have performance issues there we’ll know what to blame.
Anyway, since we knew this was a complex setup, we staged and tested the entire move in advance. It worked. Twice. So we were pretty confident that we could get everything done in about 10 minutes. Of course, we scheduled 3 hours of maintenance, starting at 12pm.
At 12:30 we were running on the new device. At 12:40 the DB crashed. A quick check revealed that the disks with the data were gone and that we got I/O errors when trying to access our redo logs and archives.
We got support for the device on the phone, and they escalated to tier 2, tier 3 and finally the development team. Slowly, the situation became painfully clear: all the data files were encrypted and the encryption key was corrupted. What’s worse, our backups were encrypted in exactly the same way, so they were all lost as well. We double and triple checked this. We had no way to access our backups.
This is even worse than the ASM scenario. How can we recover from such a crash? Luckily we discovered that the export files were kept unencrypted, so we could use the last export from 24 hours ago. Major loss of data, but it could be so much worse.
At 10pm our system team with the device development team finally managed to recover the most essential files – data files and redo logs. We could not recover the archived logs.
We started the database and it demanded media recovery, probably because it crashed when the disks disappeared. Since we had no log archives, we ran recovery until cancel, and immediately canceled the recovery. Then we could open the database with resetlogs, and this worked fine. Around 11pm we had a working database.
We did not realize how much risk the encryption adds to our recovery plans, and indeed to any maintenance. A single point of failure, which no one fully understood, simply held our database hostage. That’s as helpless as DBAs get.
What should we do better? Unfortunately, complete encryption was demanded by the customer, so there was nothing we could do about it. I suggested encrypting the backups separately, with a different method and a different key. If we really want to throw money at the issue, we can have a separate DB with a separate security box, and use Oracle Data Guard to move the data to the failover system. That can be somewhat safe.