Collecting Requirements for Disaster Recovery

When an earthquake wipes out your datacenter, it may be too late to do anything about it. Obviously you need to plan for such disasters in advance. Every IT organization, big or small, needs such a plan. I hope your organization already has one, and that you test it on a regular basis. But sometimes you need to plan for disasters almost from scratch. Maybe because your business never had a disaster recovery plan, or maybe because availability requirements have suddenly changed and the previous plan is insufficient.

So, how do you start writing your disaster recovery plan?

If you are an Oracle DBA, you may be tempted to start by configuring Data Guard. If you are a sysadmin, you may be ordering additional machines and calling various ISPs. If you are a storage manager, you’ll probably pull out your vendor’s favorite remote mirroring solution. If you are in sales/marketing, you probably already promised 99.99999% availability.

Don’t do any of that. You start by asking questions. Here are the questions we thought of a bit too late this time around, but next time we’ll ask before we even begin to discuss the right technology:

  1. What is the acceptable time to recovery? Can we just ship the tapes somewhere, or do we need a hot standby?
  2. How much data loss is acceptable? Can we recover from last night’s backup, or do we need data from 5 minutes ago?
  3. How much performance degradation is acceptable during a disaster? For how long? Can I save a bit on the extra hardware?
  4. How much redo is generated per day? That is, what is the rate of data change that we need to support now?
  5. What is the expected data growth for this DB/App for the next year? How much will we need to scale our solution?
  6. How will clients access the system in case of disaster? Do we need to migrate IPs or can we use new ones?
  7. How often do we need to validate the DR site? Every quarter, every 6 months, once a year?
  8. When does the DR need to be in place?
  9. How much downtime is acceptable for returning to the main site? How far in advance do we need to schedule it?
  10. Who decides that it is now a disaster and that failover to the alternate site (or backups) should occur? What are the criteria for the decision?
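
Once you have numbers for a few of these questions, some back-of-the-envelope arithmetic already tells you a lot. Here is a rough sketch in Python, not a sizing tool: the function names, the peak factor of 3, and the sample figures are all my own assumptions for illustration. It turns a daily redo volume and a growth estimate (questions 4 and 5) into a network bandwidth figure, and turns an availability percentage into a yearly downtime budget (question 1).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_bandwidth_mbps(redo_gb_per_day, peak_factor=3.0, growth_per_year=0.0):
    """Estimate bandwidth (megabits/s) needed to ship redo to a DR site.

    redo_gb_per_day: measured daily redo volume, in gigabytes (10**9 bytes)
    peak_factor:     assumed ratio of peak to average redo rate (hypothetical default)
    growth_per_year: expected growth over the next year, e.g. 0.5 for 50%
    """
    projected_gb = redo_gb_per_day * (1 + growth_per_year)
    avg_bytes_per_sec = projected_gb * 10**9 / SECONDS_PER_DAY
    # Scale the average up to the assumed peak, then convert bytes to megabits.
    return avg_bytes_per_sec * peak_factor * 8 / 10**6


def allowed_downtime_seconds_per_year(availability_percent):
    """Convert an availability promise into a downtime budget for a 365-day year."""
    return (1 - availability_percent / 100) * 365 * SECONDS_PER_DAY


if __name__ == "__main__":
    # Made-up example: 86.4 GB of redo per day, today and after 50% growth.
    print(round(required_bandwidth_mbps(86.4), 1))
    print(round(required_bandwidth_mbps(86.4, growth_per_year=0.5), 1))
    # The 99.99999% from the sales pitch, expressed as seconds per year.
    print(round(allowed_downtime_seconds_per_year(99.99999), 2))
```

With these made-up inputs the bandwidth works out to about 24 megabits per second today and about 36 after the growth, and the seven-nines promise leaves roughly three seconds of downtime per year. Real sizing should use your measured peak redo rate rather than an assumed factor.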

In my experience, the fewer questions you ask, and the simpler the questions are, the more likely you are to get good answers. And with good answers, you can choose your technologies, implement, test, rinse, repeat.