Going through a large-scale, multi-location disaster recovery made me stop and think about all the different approaches that can be used to recover database servers.

Living with a datacenter in Hurricane Alley, we’ve been doing disaster preparedness (recovery) on a small scale for many years, but this year we’ve been working towards recovering all of our assets to an offsite colocation. That part of the decision is easy. The actual method used to do these recoveries is definitely up in the air, and I fully expect our processes to change for the better every time we redo our disaster testing (many times a year going forward).

In exploring the recovery process we quickly realized that our “hardware failure” recovery documents weren’t going to work effectively in a datacenter failure situation. So, it was time to design a new set of criteria for success. I thought I’d share our thought process and how we plan on tackling this always fun experience. As a side note, it’s worth mentioning that SQL replication is neither wanted nor allowed in our case.

1st thought: Bring up blank OS builds for the database servers, load SQL Server, and patch it to the correct level while the tape restores of the database backups are happening. Then recover the system databases and kick off the individual restores (which are scripted with the regular nightly backup jobs).

  • Benefits to DBA: clean, repeatable, documentable process that we are mostly in control of.
  • Drawbacks: Time-consuming, potential version-matching issues, recovering system databases is always “fun”
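
To make the ordering in the 1st thought concrete, here is a minimal sketch of how the restore statements could be generated so that system databases come before user databases. It is not our actual script: the `Backup` class, the file paths, and the `master`, `model`, `msdb` ordering are assumptions for illustration, and it assumes native SQL Server backups (LiteSpeed backups would need LiteSpeed's own restore procedures instead of `RESTORE DATABASE`).

```python
from dataclasses import dataclass

# Assumption: system databases are restored first, in this order.
# Note: restoring master requires the instance to be started in
# single-user mode, which this sketch does not handle.
SYSTEM_DB_ORDER = ["master", "model", "msdb"]


@dataclass
class Backup:
    database: str
    path: str  # hypothetical location of the backup file pulled from tape


def restore_order(backups):
    """Sort backups: system databases first, then user databases by name."""
    def key(b):
        name = b.database.lower()
        if name in SYSTEM_DB_ORDER:
            return (0, SYSTEM_DB_ORDER.index(name), name)
        return (1, 0, name)
    return sorted(backups, key=key)


def restore_statement(b):
    """Build a T-SQL RESTORE DATABASE statement for one native full backup."""
    db = b.database.replace("]", "]]")   # escape for a bracketed identifier
    path = b.path.replace("'", "''")     # escape for a T-SQL string literal
    return f"RESTORE DATABASE [{db}] FROM DISK = N'{path}' WITH REPLACE, RECOVERY;"


def build_plan(backups):
    """Return the restore statements in the order they should be run."""
    return [restore_statement(b) for b in restore_order(backups)]


if __name__ == "__main__":
    backups = [
        Backup("Sales", r"D:\restore\Sales.bak"),
        Backup("msdb", r"D:\restore\msdb.bak"),
        Backup("master", r"D:\restore\master.bak"),
    ]
    for statement in build_plan(backups):
        print(statement)
```

The same generator would cover the user-database step of a snapshot-based recovery by simply leaving the system databases out of the input list.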

2nd thought: Use a Windows snapshot to restore the OS, SQL binaries, and SQL system databases, then recover the user databases using the aforementioned scripts. This also buys us the nicety of having LiteSpeed already installed.

  • Benefits to DBA: Faster, system-level recovery done in a standard (for our systems group) method
  • Drawbacks: System/SQL recovery out of our (DBA) control

Since our systems engineers are already asking to go the snapshot route (because that’s common for other application servers), and we expect this method to take less overall time, we are planning on trying that first. Depending on how that test goes, we will likely keep option 1 as a backup plan, or potentially try it next time. That’s why we’re testing it: so that we can make sure we have it right.

As always, there’s more than one way to accomplish the same outcome, so my question is: how do you do off-site disaster recovery (testing)? Or maybe the better question is: do you do disaster recovery testing? If not, why?