A tall tale of SQL database corruption
This corruption story begins like many. Somebody in a server room far far away decided to make a change to a VMware guest machine and that little change rippled through our poor server like a lady Gaga Meat Dress through the VMA’s. Needless to say, it wasnt pretty. The full set of events may never be known by me but it appeared as though our guest server ran out of disk space on the OS and some form of recovery was done.
What we started with was a sql 2005 sp3 server where 1 of the drives was apparently corrupted, So 2 SQL instances wouldnt start. They were both erroring with the message :
Error: 9003, Severity: 20, State: 1.
The log scan number (23:5736:37) passed to log scan in database ‘master’ is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.
Using trace flag 3608 and startup parameters -c -m I set about to do a normal “disaster” recovery of our server
After rebuilding the master database, everything came online successfully. Then master was recovered from the previous backup. Once master was online I started getting the very same error message about the model database
Error: 9003, Severity: 20, State: 1.
The LSN (11:999:1) passed to log scan in database ‘model’ is invalid
This would prove to be a trying error! it took about several iterations and quite a time to figure out exactly what was going on.
On this server after initial setup we had moved the system databases from the install drive to seperate drives for log and data. When rebuilding master, the system db’s wind up back in the default directories but, after recovering master, the databases are pointed back to the original locations.
Once we got the server started the log scan error message for model showed up so, I began what I thought would be a normal restore of the model database. Unfortunately, there was no way for model to be restored. During the restore command, I got alternating messages that the model database log file was corrupted
Error: 3283, Severity: 16, State: 1.
The file “modellog” failed to initialize correctly. Examine the error logs for more detail
The Error 3283 Would be followed by
the database ‘model’ is marked RESTORING and is in a state that does not allow recovery to be run.
After trying various iterations of deleting the existing model log & database files, copying in the newly created ones and running restores, nothing was working. I began to think the disks were actually having problems, or the backup was bad. After verifying both the backup and the disk config I was left with only a hail mary –> sp_detach_db
After detaching model, I copied in the newly created model files (from the rebuild of master) and ran sp_attach_db on them. Once the Model database was attached the instance started successfully!
After the instance started model was restored from the same backup and the instance restarted. Finally, once the instance came online, it was a standard restore of all the user databases.
Im not sure what about the logscan error in model caused the errors I saw, but, both instances behaved exactly the same. I had to detach and reattach a blank model to make the other instance work as well.
After going through this, I went back and tried to reproduce the problems by intentionally corrupting model and its transaction log in various ways. Every corruption I could cause in model behaved as I expected and a simple restore statement worked. Im still not sure WHY this happened but, hopefully it wont happen again and if it does there wont be so much testing to figure out how to get model online