Save a DB – Practice Restore TodayPosted: January 9, 2009
Tom Kyte published his new year resolutions, and his first one is “Practice restore every month”. According to Tom, practicing restore on a regular basis is the best way to avoid making mistakes on restore, which is something DBAs can’t afford. I agree, and I have few tips for restore practicers among you.
- Don’t do it on production. OK, this one is trivial. What I mean is that you can’t really practice restore on your private “fun and games” DB, because it is likely to be very different than the real ones. Less data usually, and probably it is not connected to the Netapp either.
We practice our recovery on our staging environment which is a copy of our production, but this means we have to schedule our practices to times that no one is using the staging for their own tests. QA databases are also fun to practice on, in previous workplace QA trashed their DB so often that I practiced recovery on an almost weekly basis.
- You can design some of the internal procedures so you’ll have to recover all the time. For example, decide that you will create clones of existing DBs by copying files over, without shutting down the DB, and leave out a file or two for the challenge. If you can’t recover from what you made, make sure you understand why and verify that this situation can never ever occur in production. If you (like me) have to clone DBs on a somewhat regular basis, this gives you an opportunity to use the time to practice recovery without having to justify the practice to anyone.
- Get your manager to watch the practice. This has two great benefits – one is that you get to practice in realistically stressfull situation of having a manager looking over your shoulder. The other is that when you recover in an emergency, the manager will be somewhat familiar with what you do and will ask less questions when you are trying to think.
- By practicing recovery on realistic environments in advance you will have a good idea how long recovery should take. Try to finish each practice with a nice table saying something like “For each 20G of lost data files, add 30 minutes to recovery, for each 6 hours of redo to apply add an hour to recover”. Try to get the business stakeholders to sign off these numbers. If they refuse, maybe its time to think of new strategy. Maybe one involving database flashback or dataguard. In any case, when you are in a real recovery situation these numbers will allow you to give your business a reasonably accurate estimation of how long the downtime is going to last. This is always a huge win for the DBAs and reduces the stress levels significantly.
- Recovery is one of these things that are awfully complicated when you memorize a bunch of rules. Do I need the instance up or down? Do I need a backup of the control file? Which redo logs do I need? Do I recover until cancel, until point in time, until now? It magically becomes very easy when you understand the idea behind those rules. Don’t remember when you need the backup control file, instead take few minutes (hours, days) to understand the idea of SCNs and how they are used in recovery. So, when you practice, make sure you understand why you do each step that you do, don’t just work off a checklist.
- Don’t throw away the checklist. Either have a bunch of checklists ready for every scenario (my method) or write a checklist and have someone review it before starting recovery (senior DBA does that). Always work with a list. You don’t want to risk forgeting any step.
- Don’t forget to verify the production backups themselves on a very regular basis. Nothing is worse than discovering that for some reason your backup is useless. I try to always keep a recent full export on the side of such emergencies. I don’t think we ever had to use it, but we came very close once or twice and it was a comforting thought to know that if we can do nothing else, at least the export is there.