Runaway System Cache Increase Kills SQL
Ran into this a while back, and we finally found a root cause so, I thought Id put it out here in hopes that it saves at least 1 person the amount of head bashing I had with it
Windows 2003 Enterprise R2 SP2 w/32GB RAM
SQL Server 2005 standard ed SP3 64bit active/passive cluster
We started seeing this glorious message in the SQL Server Error log.
A significant part of sql server process memory has been paged out. This may result in performance degradation. Duration XX seconds. Working set (KB) XXX, committed (KB) XXX, memory utilization 0%
The message varied slightly but the essence was always the same.
This error message can be too common on systems where SQL memory is misconfigured or where something is unduly pressuring SQL for memory. In this case a quick verification of the settings showed that everything was in order. The first 2 times this happened it was the middle of the night during backups(in the SLA window), so no one really noticed a performance degradation. We didn’t think much of it at the time but in hindsight, we should have.
Monday morning 8 AM, developer makes a bad update to the database, No problem I say, Litespeed can rollback the transaction, So I start to copy the full db backup+tran logs off the server (~25gb) this is the way we process litespeed recoveries through the log reader. About 3 minutes later the server became totally unresponsive, and the error about paging the SQL process memory was logged. At the time I didn’t put 2 and 2 together as this particular server runs a varied workload of about 1500 batches/sec and has anywhere from 1200-2500 connections open at a time, so It could have been anything! After some further digging I figured out that the file copies were causing the sql memory to get paged out. At the time I had never heard of a file copy causing an issue in SQL Server!
The experts weigh in
While looking at the issue 2 perfmon counters stuck out–> Memory\Cached Bytes and Memory\Avail MBytes. While file copies were happening the cache bytes counter would increase very quickly while the avail bytes counter would drop, once the available mbytes dropped to 0 sql server started to page memory out. After a bit of paging, the errors were logged that SQL had its memory paged out and SQL became unresponsive. Since this was a high priority system, I did what any good SQL Server DBA would do, I contacted a few people in my network who may have seen this before. Interestingly enough I got the exact same response from every one of them, “use lock pages in memory” and don’t use windows explorer to do huge file copies as this is a known “problem”.
Even though I trusted my sources of info, I had a hard time believing that file copies of sizes all the way down to 1GB would cause this sort of havoc without this being something Bing+Google would know about (different file sizes mattered, some sizes worked fine, some would cause the problem).
Workarounds not welcome
I had a valid workaround with lock pages in memory and not using explorer for file copies but, I don’t normally like workarounds such as this on systems as important as this one is to us. After a few server rebuilds we finally figured out that we could reproduce this issue on any win2k3 R2 ent ed 64bit server, this would be the clue we finally needed to make a breakthrough. After rebuilding the systems from scratch and loading no drivers except SAN we noticed that we couldn’t cause the error! So, after painstakingly adding each and every piece of our standard server build we realized that Symantec AV (10.1.9.9) was the cause. Yes, another file system filter driver was misbehaving.
In looking back through the change communication, we ID’d where a new version of AV was pushed out and we just didn’t hit the error soon enough after the installation to put 2 & 2 together. Since disabling AV wasn’t an option we started trying to find a setting that specifically caused the problem and happened across a change that could be made and allow AV to run and SQL to not get paged out. By unchecking the network scanning options, the windows cache no longer increases during a (network) file copy. problem solved!!
Some time in the future vendors are going to figure out how to write good file system filter drivers, or they are going to stop trying to use them! After fighting this issue for a few weeks (or was it months) I can only hope this happens sooner rather than later
|Print article||This entry was posted by Allen Kinsel on April 21, 2010 at 8:47 am, and is filed under Memory, Rants, SQL Server. Follow any responses to this post through RSS 2.0. You can leave a response or trackback from your own site.|
No trackbacks yet.
about 4 months ago - No comments
Came across a strange permission issue today that I thought I’d blog about since it is apparently somewhat common judging by the number of unanswered forum posts about it.
Adding/editing or Testing SSRS Datasources in Report Manager generates a permission error
A developer was trying to add a datasource and use “test connection” in the report manager web interface instead of BIDS which, incidentally
about 10 months ago - No comments
Earlier this week I was working through an issue that I’ve seen one too many times lately and thought Id write about it. For various reasons *cough* Legacy *cough* I often seem to run across columns in tables that are defined with what would appear to be the wrong data type or worse, multiple data
about 1 year ago - 40 comments
Its the first week of June and for those of us living along the Gulf and Atlantic coasts of the US, that brings the beginning of hurricane season. It also means its time for this months installment of T-SQL Tuesday.
This Months Topic
Disaster Recovery. This topic is very near and dear to me based on the fact that
about 2 years ago - 11 comments
After last years Summit we launched a feedback site http://feedback.sqlpass.org in hopes of gathering all of the feedback about the event in one place. The number one thing people have asked for on that site is for there to be a track of sessions in the 400-500 level range. The problems with the community desire
about 2 years ago - 2 comments
Occasionally I’m asked what the most important thing for a DBA to know is. It seems some people want to know what 1 thing to focus on in order to be successful long term as a DBA. I’ve pondered this question for quite a long time and I’ve given various answers over the years.
about 2 years ago - 5 comments
This months TSQL Tuesday is hosted by a good friend Pat right over at SQL Asylum
For this months entry I decided to keep it short and sweet, following in my Bits N Bytes theme.
The Meta Script
In the true sense of the word automation, this really doesn’t fit but, in the terms of quickly getting something
about 2 years ago - 4 comments
This corruption story begins like many. Somebody in a server room far far away decided to make a change to a VMware guest machine and that little change rippled through our poor server like a lady Gaga Meat Dress through the VMA’s. Needless to say, it wasnt pretty. The full set of events may never
about 2 years ago - 1 comment
The other day we were going through some old scripts and had a huge discussion about changing object ownership in SQL. Forever the only answers were to recreate the objects or use the stored procedures: sp_changeobjectowner or sp_changedbowner. Alternatively at times there were those of us who used some other undisclosed hack of system tables
about 2 years ago - No comments
For the new year I have decided to try something different with a series of small regular posts.
Dealing with our large team of developers and our DBA team as well, I often run across interesting discussions about small things that are easily overlooked. Often times these discussions end in a exclamation of “I knew that!!” Of
about 2 years ago - 1 comment
And no, I dont mean in my mothers eyes, thats a given.
I won the Ipad/SQL Monitor Contest that Redgate was sponsoring over on Grant Fritcheys Blog
Im thrilled to have won!! Ive been eyeing Ipads for quite some time but, the novelty & cost has kept me at arms length.
Im always in the market for DBA tools that can make