Posts tagged Backups

Extending automation of SQL recoveries using Ola Hallengren’s backup solution

“Automation is inherently good.”  This might be one of the only phrases you can get all DBAs to agree on without that dreaded “it depends” that we DBAs like to throw around so much.

This month’s T-SQL Tuesday is about automation, and I thought I’d write about extending existing automation.  One of my favorite scripts for automation is Ola Hallengren’s Backup & Maintenance solution.  Ola’s scripts are a fantastic way to automate highly configurable backups and maintenance on your SQL Server instances.  If you’re not using them, you should seriously consider looking into why.

This solution serves as an outstanding base but, like anything else, it can be useful to tweak things a bit.  Extending the initial automation provided by his scripts is what this post is all about.

In particular, I’ve modified Ola’s scripts to generate the files needed to restore all of the databases that have been backed up with his solution.  The goal is to be able to easily restore the whole server in the case of a disaster, though you could just as easily pull out the commands for a single database and restore only that one.  This script is currently only written for LiteSpeed, since that’s what I use for backups.  However, it could easily be changed to support native backups or any of the other backup products that Ola’s scripts can be configured for.  Perhaps I’ll work on those in the future if it would be useful.

The idea is that every time you take a backup, the backup job creates a .sql file in the backup directory on the server filesystem that can be used to restore to the point of the backups that were just taken.

This solution includes three pieces: an additional stored procedure, an additional step in both of the backup jobs to execute that stored procedure, and lastly a step in the cleanup job to remove aged restore scripts from the filesystem.

A couple of notes of caution:

As with anything you find on the internet, please use at your own risk in a development/test system and proceed with caution.

This script makes several assumptions, including:

  • That you’ve installed Ola’s commands into the master database
  • That you’re using LiteSpeed
  • That logging to the commandlog table is enabled
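The assumptions above can be sanity-checked before installing anything.  This is only a minimal sketch: it assumes the standard CommandLog table from Ola’s solution lives in master and that LiteSpeed backups are logged with the command types the procedure below looks for.

```sql
-- Sketch: confirm CommandLog exists in master and holds LiteSpeed backup entries.
USE master;

IF OBJECT_ID(N'dbo.CommandLog', N'U') IS NULL
    RAISERROR(N'dbo.CommandLog was not found in master; logging to the table may be disabled.', 16, 1);
ELSE
    SELECT  CommandType,
            COUNT(*)       AS Entries,
            MAX(StartTime) AS LastLogged
    FROM    dbo.CommandLog
    WHERE   CommandType IN (N'xp_backup_database', N'xp_backup_log')
    GROUP BY CommandType;
```

If the query returns no rows, the generated restore script would come out empty apart from the warning header.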

The stored procedure is relatively simple and accepts a single parameter, @type.  “LOG” (the default) generates the script as of the last log backup taken; any other value (I happen to use “FULL”) generates the script based on the last full backup.

CREATE PROCEDURE [dbo].[GenerateRestoreScript] (@type NCHAR(30) = 'LOG')
AS 
DECLARE @ID INT
DECLARE @DB NVARCHAR(128)

SET NOCOUNT ON

SELECT  @ID = MAX(database_id)
FROM    sys.databases

IF @type = 'LOG' SET @type = 'xp_backup_log' ELSE SET @type = ''
--These lines are intentionally written to the generated script uncommented, as a precaution:
--the script will fail with an error until someone deliberately removes them.
        SELECT 'ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-ALERT-'
        SELECT 'PLEASE BACKUP THE TAIL OF THE LOG'
        SELECT 'OTHERWISE THIS COULD BECOME A RGE (GOOGLE THE ACRONYM!)'
        SELECT 'IF YOU ARE OK REPLACING THE DB AND LOSING DATA IN THE TAIL LOG RUN THIS SCRIPT WITHOUT THESE COMMENTS '
        SELECT 'RAISERROR(N''ARE YOU SURE YOU WANT TO DO THIS?'', 25,1) WITH Log;'      
        SELECT '--------'

WHILE @ID > 2
    BEGIN

        SELECT  @DB = NAME
        FROM    sys.databases
        WHERE   database_id = @ID

        SELECT  @ID = @ID - 1

        SELECT '----' + @DB + '-----------------------------------------------------'  

        SELECT 'EXECUTE ' 
        + REPLACE(Command, '_backup_', '_restore_')
        + ', @filenumber = 1, @with = N''' 
        + CASE WHEN rn <> 1 THEN 'NO' ELSE '' END 
        +  'RECOVERY'''
        + CASE WHEN CommandType = 'xp_backup_database' THEN ', @with = N''REPLACE'';' ELSE ';' END

         FROM 
        (
        SELECT 
            SUBSTRING(LEFT (Command, CHARINDEX(''', @with =',Command)),CHARINDEX('[master]',Command),LEN(Command)) AS Command
            , ROW_NUMBER() OVER (ORDER BY cl.ID DESC) AS rn
            , CommandType
        FROM    [master].[dbo].[CommandLog] cl
        WHERE   cl.DatabaseName = @DB
                AND (cl.CommandType = 'xp_backup_database' OR cl.CommandType = @type)

                AND cl.ID >= ( SELECT   MAX(ID)
                               FROM     CommandLog c
                               WHERE    CommandType IN ( 'xp_backup_database' )
                                        AND cl.DatabaseName = c.DatabaseName
                             )
        ) AS rntab

        ORDER BY rn DESC                     

    END
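Before wiring the procedure into the jobs, it may help to run it by hand and read the output.  A minimal sketch, assuming the procedure was created in master alongside Ola’s objects:

```sql
-- Sketch: run the procedure manually to inspect the generated restore commands.
USE master;
GO

-- Any value other than 'LOG' scripts from the last full backup only.
EXEC dbo.GenerateRestoreScript @type = N'FULL';
GO

-- Default (@type = 'LOG'): the last full backup plus the log backups logged after it.
EXEC dbo.GenerateRestoreScript;
GO
```

In Management Studio, switching to “Results to Text” gives output closer to what sqlcmd writes to the .sql file.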

To execute the stored procedure, add the following as an additional CmdExec job step in the full backup job.  Make sure to change the directory to wherever you want the .sql files stored (H:\SERVERNAME below).

sqlcmd -l 30 -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -y 0 -b -Q "EXEC [dbo].[GenerateRestoreScript] 'FULL'" -o"H:\SERVERNAME\DRFULL_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM))_RESTORE.sql" -w50000

Likewise, add the following as an additional CmdExec job step in the transaction log backup job.  Again, change the directory to wherever you want the .sql files stored (H:\SERVERNAME below).

sqlcmd -E -S $(ESCAPE_SQUOTE(SRVR)) -d master -y 0 -b -Q "EXEC [dbo].[GenerateRestoreScript]" -o"H:\SERVERNAME\DRLOG_$(ESCAPE_SQUOTE(STRTDT))_$(ESCAPE_SQUOTE(STRTTM))_RESTORE.sql" -w50000

Finally, add the following CmdExec job step to the output file cleanup job to remove old .sql files.  Make sure to change the directory to wherever the .sql files are stored (H:\SERVERNAME below).

Note: this configuration removes restore scripts older than 3 days, but exactly which files are kept depends on when the cleanup job is scheduled.

cmd /q /c "For /F "tokens=1 delims=" %v In ('ForFiles /P "H:\SERVERNAME" /m *RESTORE.sql /d -3 2^>^&1') do if EXIST "H:\SERVERNAME"\%v echo del "H:\SERVERNAME"\%v& del "H:\SERVERNAME"\%v"

I have these steps scripted into Ola’s original solution .sql so the folder names are set properly and job creation is completely automated.  I’ll leave that part of extending automation to you, dear reader, as homework.

Invitation for T-SQL Tuesday #19 – Disasters & Recovery

Disasters

It’s the first week of June and, for those of us living along the Gulf and Atlantic coasts of the US, that brings the beginning of hurricane season.  It also means it’s time for this month’s installment of T-SQL Tuesday.

This Month’s Topic

Hurricane Ike dead ahead

There goes your weekend/month

Disaster Recovery.  This topic is very near and dear to me, based on the fact that I live on a barrier island that was the site of the deadliest natural disaster in US history and was more recently destroyed by the third costliest hurricane in history.  Needless to say, preparing for disasters is nearly instinctive to me, which might explain why I’m a DBA, but I digress.  Anything you’d like to blog about related to preparing for or recovering from a disaster is fair game: a great tip you use to keep backups and recoveries running smoothly, a horrific story of a recovery gone wrong, or anything else related to keeping your systems online during calamity.  We want to hear it!

My street a month after Hurricane Ike

T-SQL Tuesday info

Originally an idea dreamed up by Adam Machanic, T-SQL Tuesday has become a monthly blog party where the host picks a topic and encourages anyone to write a post on that topic, then a day or three later produces a roundup post of all the different perspectives from the community.

Rules

  • Your post must be published between 00:00 GMT Tuesday June 14, 2011, and 00:00 GMT Wednesday June 15, 2011
  • Your post must contain the T-SQL Tuesday logo from above and the image should link back to this blog post.
  • Trackbacks should work, but if you don’t see one please link to your post in the comments section below so everyone can see your work

Nice to haves!

  • include a reference to T-SQL Tuesday in the title of your post
  • tweet about your post using the hash tag #TSQL2sDay
  • consider hosting T-SQL Tuesday yourself. Adam Machanic keeps the list; if he let me do it, you’re bound to qualify!

Check back in a few days to see the roundup post of all the great stories your peers shared.

A tall tale of SQL database corruption

This corruption story begins like many.  Somebody in a server room far, far away decided to make a change to a VMware guest machine, and that little change rippled through our poor server like a Lady Gaga meat dress through the VMAs.  Needless to say, it wasn’t pretty.  The full set of events may never be known to me, but it appeared as though our guest server ran out of disk space on the OS and some form of recovery was done.

Shattered into a million pieces

What we started with was a SQL Server 2005 SP3 server where one of the drives was apparently corrupted, so two SQL instances wouldn’t start.  They were both erroring with the message:

Error: 9003, Severity: 20, State: 1.
The log scan number (23:5736:37) passed to log scan in database ‘master’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

Using trace flag 3608 and startup parameters -c -m, I set about doing a normal “disaster” recovery of our server.

After rebuilding the master database, everything came online successfully.  Then master was recovered from the previous backup.  Once master was online, I started getting the very same error message about the model database:

Error: 9003, Severity: 20, State: 1.

The LSN (11:999:1) passed to log scan in database ‘model’ is invalid

This would prove to be a trying error!  It took several iterations and quite some time to figure out exactly what was going on.

On this server, after initial setup, we had moved the system databases from the install drive to separate drives for log and data.  When rebuilding master, the system databases wind up back in the default directories but, after recovering master, the databases are pointed back to the original locations.
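Once the instance is up, one way to see where the restored master expects the system database files to be is to query sys.master_files.  This is a sketch only; the database IDs used are the fixed IDs for master, model and msdb.

```sql
-- Sketch: list the file locations recorded in master for the system databases.
SELECT  DB_NAME(database_id) AS DatabaseName,
        type_desc,
        name                 AS LogicalName,
        physical_name
FROM    sys.master_files
WHERE   database_id IN (1, 3, 4)   -- 1 = master, 3 = model, 4 = msdb
ORDER BY database_id, file_id;
```

Comparing physical_name against what is actually on disk shows whether the freshly rebuilt files are sitting in the default directories while master points elsewhere.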

Once we got the server started, the log scan error message for model showed up, so I began what I thought would be a normal restore of the model database.  Unfortunately, there was no way for model to be restored.  During the restore command, I got alternating messages, the first saying that the model database log file was corrupted:


Error: 3283, Severity: 16, State: 1.

The file “modellog” failed to initialize correctly. Examine the error logs for more detail

Error 3283 would be followed by:

the database ‘model’ is marked RESTORING and is in a state that does not allow recovery to be run.

After trying various iterations of deleting the existing model log and database files, copying in the newly created ones and running restores, nothing was working.  I began to think the disks were actually having problems, or the backup was bad.  After verifying both the backup and the disk config, I was left with only a Hail Mary: sp_detach_db.

After detaching model, I copied in the newly created model files (from the rebuild of master) and ran sp_attach_db on them.  Once the model database was attached, the instance started successfully!
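The detach and reattach step might look roughly like the following.  This is a sketch rather than the exact commands that were run: the file paths are hypothetical placeholders, and it assumes the instance is still running with the startup options mentioned earlier so that model can be detached at all.

```sql
-- Sketch only: paths are hypothetical; use the locations your restored master expects.
EXEC sp_detach_db @dbname = N'model';

-- At the OS level, copy the freshly built model data and log files
-- (from the rebuild of master) into the expected locations, then:

EXEC sp_attach_db
     @dbname    = N'model',
     @filename1 = N'D:\SQLData\model.mdf',      -- hypothetical data file path
     @filename2 = N'L:\SQLLogs\modellog.ldf';   -- hypothetical log file path
```

With a blank model attached, the instance can start normally and model can then be restored from backup.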

After the instance started, model was restored from the same backup and the instance was restarted.  Finally, once the instance came online, it was a standard restore of all the user databases.

I’m not sure what about the log scan error in model caused the errors I saw, but both instances behaved exactly the same.  I had to detach and reattach a blank model to make the other instance work as well.

After going through this, I went back and tried to reproduce the problems by intentionally corrupting model and its transaction log in various ways.  Every corruption I could cause in model behaved as I expected, and a simple restore statement worked.  I’m still not sure WHY this happened but, hopefully, it won’t happen again, and if it does there won’t be so much testing to figure out how to get model online.

How do you do Disaster Recovery

Going through the process of a large scale multi-location disaster recovery made me stop and think about all the different incarnations that can be used to recover database servers. 

Living with a datacenter in hurricane alley, we’ve been doing disaster preparedness (recovery) on a small scale for many years, but this year we’ve been working towards recovering all of our assets to an offsite colocation.  That part of the decision is easy; the actual method used to do these recoveries is definitely up in the air, and I fully expect our processes to change for the better every time we redo our disaster testing (many times a year going forward).

In exploring the recovery process, we quickly realized that our “hardware failure” recovery documents weren’t going to work effectively in a datacenter failure situation.  So, it was time to design a new set of criteria for success.  I thought I’d share our thought process and how we plan on tackling this always fun experience.  It’s worth mentioning as a side note that no SQL replication is wanted or allowed for in our case.

1st thought:  Bring up blank OS builds for the database servers, load SQL Server, and patch it to the correct level while the tape restores of the database backups are happening.  Then recover the system databases and kick off the individual restores (which are scripted with the regular nightly backup jobs).

  • Benefits to DBA: clean, repeatable, documentable process that we are mostly in control of.
  • Drawbacks: Time consuming, potential version match issues, recovering system databases is always “fun”

2nd thought: Use a Windows snapshot to restore the OS, SQL binaries and SQL system databases, then recover the user databases using the aforementioned scripts.  This also buys us the nicety of having LiteSpeed already installed.

  • Benefits to DBA: Faster, System level recovery done in a standard (for our system group) method
  • Drawbacks: system/SQL recovery out of our (DBA) control

Since our systems engineers are already asking to go the snapshot route (because that’s common for other application servers), and we expect this method to take less overall time, we are planning on trying that first.  Depending on how that test goes, we will likely keep option 1 as a backup plan or potentially try it next time.  That’s why we’re testing: so that we can make sure we have it right.

As always, there’s more than one way to accomplish the same outcome, so my question is: how do you do off-site disaster recovery (testing)?  Or maybe the better question is: do you do disaster recovery testing?  If not, why?
