Runaway System Cache Increase Kills SQL
Ran into this a while back, and we finally found a root cause so, I thought Id put it out here in hopes that it saves at least 1 person the amount of head bashing I had with it
Windows 2003 Enterprise R2 SP2 w/32GB RAM
SQL Server 2005 standard ed SP3 64bit active/passive cluster
We started seeing this glorious message in the SQL Server Error log.
A significant part of sql server process memory has been paged out. This may result in performance degradation. Duration XX seconds. Working set (KB) XXX, committed (KB) XXX, memory utilization 0%
The message varied slightly but the essence was always the same.
This error message can be too common on systems where SQL memory is misconfigured or where something is unduly pressuring SQL for memory. In this case a quick verification of the settings showed that everything was in order. The first 2 times this happened it was the middle of the night during backups(in the SLA window), so no one really noticed a performance degradation. We didn’t think much of it at the time but in hindsight, we should have.
Monday morning 8 AM, developer makes a bad update to the database, No problem I say, Litespeed can rollback the transaction, So I start to copy the full db backup+tran logs off the server (~25gb) this is the way we process litespeed recoveries through the log reader. About 3 minutes later the server became totally unresponsive, and the error about paging the SQL process memory was logged. At the time I didn’t put 2 and 2 together as this particular server runs a varied workload of about 1500 batches/sec and has anywhere from 1200-2500 connections open at a time, so It could have been anything! After some further digging I figured out that the file copies were causing the sql memory to get paged out. At the time I had never heard of a file copy causing an issue in SQL Server!
The experts weigh in
While looking at the issue 2 perfmon counters stuck out–> Memory\Cached Bytes and Memory\Avail MBytes. While file copies were happening the cache bytes counter would increase very quickly while the avail bytes counter would drop, once the available mbytes dropped to 0 sql server started to page memory out. After a bit of paging, the errors were logged that SQL had its memory paged out and SQL became unresponsive. Since this was a high priority system, I did what any good SQL Server DBA would do, I contacted a few people in my network who may have seen this before. Interestingly enough I got the exact same response from every one of them, “use lock pages in memory” and don’t use windows explorer to do huge file copies as this is a known “problem”.
Even though I trusted my sources of info, I had a hard time believing that file copies of sizes all the way down to 1GB would cause this sort of havoc without this being something Bing+Google would know about (different file sizes mattered, some sizes worked fine, some would cause the problem).
Workarounds not welcome
I had a valid workaround with lock pages in memory and not using explorer for file copies but, I don’t normally like workarounds such as this on systems as important as this one is to us. After a few server rebuilds we finally figured out that we could reproduce this issue on any win2k3 R2 ent ed 64bit server, this would be the clue we finally needed to make a breakthrough. After rebuilding the systems from scratch and loading no drivers except SAN we noticed that we couldn’t cause the error! So, after painstakingly adding each and every piece of our standard server build we realized that Symantec AV (10.1.9.9) was the cause. Yes, another file system filter driver was misbehaving.
In looking back through the change communication, we ID’d where a new version of AV was pushed out and we just didn’t hit the error soon enough after the installation to put 2 & 2 together. Since disabling AV wasn’t an option we started trying to find a setting that specifically caused the problem and happened across a change that could be made and allow AV to run and SQL to not get paged out. By unchecking the network scanning options, the windows cache no longer increases during a (network) file copy. problem solved!!
Some time in the future vendors are going to figure out how to write good file system filter drivers, or they are going to stop trying to use them! After fighting this issue for a few weeks (or was it months) I can only hope this happens sooner rather than later
Comments are closed.