Backup and Recovery

Invitation for T-SQL Tuesday #19 – Disasters & Recovery

Disasters

It’s the first week of June, and for those of us living along the Gulf and Atlantic coasts of the US, that brings the beginning of hurricane season.  It also means it’s time for this month’s installment of T-SQL Tuesday.

This Month’s Topic

Hurricane Ike dead ahead

There goes your weekend/month

Disaster recovery.  This topic is near and dear to me because I live on a barrier island that was the site of the deadliest natural disaster in US history and was more recently destroyed by the third costliest hurricane in history.  Needless to say, preparing for disasters is nearly instinctive to me, which might explain why I’m a DBA, but I digress.  Anything you’d like to blog about related to preparing for or recovering from a disaster is fair game: a great tip you use to keep backups and recoveries running smoothly, a horrific story of a recovery gone wrong, or anything else related to keeping your systems online during a calamity.  We want to hear it!

My street a month after Hurricane Ike

T-SQL Tuesday info

Originally an idea dreamed up by Adam Machanic (Blog|Twitter), T-SQL Tuesday has become a monthly blog party: the host picks a topic and encourages anyone to write a post on it, then a day or three later publishes a roundup post of all the different perspectives from the community.

Rules

  • Your post must be published between 00:00 GMT Tuesday June 14, 2011, and 00:00 GMT Wednesday June 15, 2011.
  • Your post must contain the T-SQL Tuesday logo from above, and the image should link back to this blog post.
  • Trackbacks should work, but if you don’t see one, please link to your post in the comments section below so everyone can see your work.

Nice to haves!

  • Include a reference to T-SQL Tuesday in the title of your post.
  • Tweet about your post using the hashtag #TSQL2sDay.
  • Consider hosting T-SQL Tuesday yourself. Adam Machanic keeps the list; if he let me do it, you’re bound to qualify!

Check back in a few days to see the roundup post of all the great stories your peers shared.

A tall tale of SQL database corruption

This corruption story begins like many others.  Somebody in a server room far, far away decided to make a change to a VMware guest machine, and that little change rippled through our poor server like a Lady Gaga meat dress through the VMAs.  Needless to say, it wasn’t pretty.  I may never know the full set of events, but it appeared that our guest server ran out of disk space on the OS and some form of recovery was done.

Shattered into a million pieces

What we started with was a SQL Server 2005 SP3 server where one of the drives was apparently corrupted, so two SQL instances wouldn’t start.  Both were failing with this message:

Error: 9003, Severity: 20, State: 1.
The log scan number (23:5736:37) passed to log scan in database ‘master’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

Using trace flag 3608 and the startup parameters -c -m, I set about doing a normal “disaster” recovery of our server.
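For reference, those startup options can be assembled into a command line. This is a minimal sketch, not the exact command used here: `startup_args` is a hypothetical helper and "INST1" is a made-up instance name. `-c` starts SQL Server outside the service control manager, `-m` is single-user mode, and trace flag 3608 recovers only master at startup; `-s` is only needed for a named instance.

```python
def startup_args(instance=None):
    """Build the sqlservr.exe arguments for a minimal, single-user start."""
    args = ["sqlservr.exe"]
    if instance:
        # -s selects a named instance; omit it for the default instance
        args += ["-s", instance]
    # -c: start outside the service control manager
    # -m: single-user mode
    # -T3608: recover only master at startup
    args += ["-c", "-m", "-T3608"]
    return args


# "INST1" is a made-up instance name for illustration
print(" ".join(startup_args("INST1")))
```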

After rebuilding the master database, everything came online successfully.  Then master was restored from the previous backup.  Once master was online, I started getting the very same error message, this time about the model database:

Error: 9003, Severity: 20, State: 1.

The LSN (11:999:1) passed to log scan in database ‘model’ is invalid

This would prove to be a trying error!  It took several iterations and quite some time to figure out exactly what was going on.

On this server, after initial setup, we had moved the system databases from the install drive to separate drives for log and data.  When master is rebuilt, the system databases wind up back in the default directories, but after master is restored from backup, it points the databases back to their original locations.
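The location mismatch is easier to see written down. Here is a sketch of the kind of `ALTER DATABASE ... MODIFY FILE` statements used to relocate a system database such as model. The drive paths and the helper name are hypothetical, and `modeldev`/`modellog` are model's default logical file names. Master stores these paths, which is why a restored master points back at them no matter where the rebuild placed the fresh files.

```python
def move_file_sql(database, logical_name, new_path):
    """Return the T-SQL that repoints one database file to a new path."""
    path = new_path.replace("'", "''")  # escape single quotes for T-SQL
    return (
        f"ALTER DATABASE [{database}] "
        f"MODIFY FILE (NAME = {logical_name}, FILENAME = N'{path}');"
    )


# Hypothetical target drives, for illustration only
moves = [
    ("model", "modeldev", r"E:\SQLData\model.mdf"),
    ("model", "modellog", r"F:\SQLLogs\modellog.ldf"),
]
for db, name, path in moves:
    print(move_file_sql(db, name, path))
```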

Once we got the server started, the log scan error message for model showed up, so I began what I thought would be a normal restore of the model database.  Unfortunately, model could not be restored.  During the restore command, I got alternating messages, starting with one indicating that the model database log file was corrupted:


Error: 3283, Severity: 16, State: 1.

The file “modellog” failed to initialize correctly. Examine the error logs for more detail

Error 3283 would be followed by:

The database ‘model’ is marked RESTORING and is in a state that does not allow recovery to be run.

After trying various combinations of deleting the existing model log and data files, copying in the newly created ones, and running restores, nothing was working.  I began to think the disks were actually having problems or the backup was bad.  After verifying both the backup and the disk configuration, I was left with only a Hail Mary: sp_detach_db.

After detaching model, I copied in the newly created model files (from the rebuild of master) and ran sp_attach_db on them.  Once the model database was attached, the instance started successfully!
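The detach and reattach step can be written out roughly as follows. This is a sketch, not the exact commands from the incident: the helper names and file paths are hypothetical, and it assumes the instance was started with trace flag 3608 so that model is not open. It only prints the T-SQL and does not connect to a server.

```python
def detach_sql(database):
    """Return the T-SQL that detaches a database."""
    return f"EXEC sp_detach_db @dbname = N'{database}';"


def attach_sql(database, data_file, log_file):
    """Return the T-SQL that attaches a database from its data and log files."""
    return (
        f"EXEC sp_attach_db @dbname = N'{database}', "
        f"@filename1 = N'{data_file}', @filename2 = N'{log_file}';"
    )


# Hypothetical paths: where the restored master expects model's files to live
print(detach_sql("model"))
# ...copy the freshly rebuilt model.mdf and modellog.ldf into place here...
print(attach_sql("model", r"E:\SQLData\model.mdf", r"F:\SQLLogs\modellog.ldf"))
```

sp_attach_db is what was used here; newer SQL Server documentation recommends CREATE DATABASE ... FOR ATTACH for the same job.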

After the instance started, model was restored from the same backup and the instance was restarted.  Finally, once the instance came back online, it was a standard restore of all the user databases.

I’m not sure what about the log scan error in model caused the errors I saw, but both instances behaved exactly the same.  I had to detach and reattach a blank model to make the other instance work as well.

After going through this, I went back and tried to reproduce the problems by intentionally corrupting model and its transaction log in various ways.  Every corruption I could cause in model behaved as I expected, and a simple RESTORE statement worked.  I’m still not sure why this happened, but hopefully it won’t happen again, and if it does, there won’t be so much testing needed to figure out how to get model online.

How do you do Disaster Recovery

Going through the process of a large-scale, multi-location disaster recovery made me stop and think about all the different approaches that can be used to recover database servers.

Living with a datacenter in hurricane alley, we’ve been doing disaster preparedness (recovery) on a small scale for many years, but this year we’ve been working towards recovering all of our assets to an offsite colocation.  That part of the decision is easy.  The actual method used to do these recoveries is definitely up in the air, and I fully expect our processes to change for the better every time we redo our disaster testing (many times a year going forward).

In exploring the recovery process, we quickly realized that our “hardware failure” recovery documents weren’t going to work effectively in a datacenter failure situation.  So it was time to design a new set of criteria for success.  I thought I’d share our thought process and how we plan on tackling this always-fun experience.  As a side note, it’s worth mentioning that SQL replication is neither wanted nor allowed in our case.

1st thought: Bring up blank OS builds for the database servers, load SQL Server, and patch it to the correct level while the tape restores of the database backups are happening.  Then recover the system databases and kick off the individual restores (which are scripted with the regular nightly backup jobs).

  • Benefits to DBA: clean, repeatable, documentable process that we are mostly in control of.
  • Drawbacks: time-consuming, potential version match issues, recovering system databases is always “fun”
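The first approach ends with scripted restores of the individual user databases. A minimal sketch of generating such a script is below. The database names, backup paths, and helper names are all hypothetical, and it emits native `RESTORE DATABASE` syntax rather than anything specific to a third-party backup tool.

```python
def restore_sql(database, backup_file):
    """Return a native full-restore statement for one user database."""
    path = backup_file.replace("'", "''")  # escape single quotes for T-SQL
    return (
        f"RESTORE DATABASE [{database}] "
        f"FROM DISK = N'{path}' WITH REPLACE, RECOVERY;"
    )


def build_restore_script(backups):
    """Build one script from a mapping of database name -> backup file path."""
    return "\n".join(restore_sql(db, backups[db]) for db in sorted(backups))


# Hypothetical databases and restore locations, for illustration only
backups = {
    "Sales": r"R:\Restore\Sales.bak",
    "HR": r"R:\Restore\HR.bak",
}
print(build_restore_script(backups))
```

Generating the script at backup time, as described above, means the restore commands already exist when the tapes come back.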

2nd thought: Use a Windows snapshot to restore the OS, SQL binaries, and SQL system databases, then recover the user databases using the aforementioned scripts.  This also buys us the nicety of having LiteSpeed already installed.

  • Benefits to DBA: faster, system-level recovery done in a standard (for our systems group) method
  • Drawbacks: system/SQL recovery out of our (DBA) control

Since our systems engineers are already asking to go the snapshot route (because that’s common for other application servers), and we expect this method to take less overall time, we are planning on trying that first.  Depending on how that test goes, we will likely keep option 1 as a backup plan or potentially try it next time.  That’s why we’re testing: so that we can make sure we have it right.

As always, there’s more than one way to accomplish the same outcome, so my question is: how do you do off-site disaster recovery (testing)?  Or maybe the better question is: do you do disaster recovery testing?  If not, why?
