SQL Server, PASS, and other data mishaps
Posts tagged SQL Server
Shadows Rock, Filtering Platform not so much!
Aug 16th
RDP remote control (shadowing) of multiple sessions is a great way to let geographically separated teams work on the same server console. You can start it from Task Manager.
Today I had a new install of Windows Server 2008 that rejected every attempt at remote control. The error was “remote control failed”, and nothing was logged in the System or Application event logs. The Security event log held a single error: “The Windows Filtering Platform has blocked a bind to a local port”.
After plenty of fiddling, and after making sure there was no “firewall” or other reason for the filtering platform to be enabled, I came across a command I never knew existed: “shadow”.
Apparently whatever had the filtering platform angry and blocking access was OK with that simple command. In this case, opening a command window and running “shadow 3” (3 being the ID of the session to shadow) worked perfectly. I could once again see both terminals, and the Windows Filtering Platform let me actually work instead of impeding me at every turn.
The Windows Filtering Platform on Server 2008 and 2008 R2 has been the culprit more times than I can count lately when the “gremlins” inhabit our servers. If only there were a way to turn it off totally, but I guess it’s like Internet Explorer: it can’t be unbundled from the OS.
SQL Server and SSPI handshake failed error hell
Jun 17th
The infamous SSPI Failed error strikes again!
One of our SQL servers was generating these errors for “some” Windows logins but not all.
Error: 17806, Severity: 20, State: 2.
SSPI handshake failed with error code 0x8009030c while establishing a connection with integrated security; the connection has been closed. [CLIENT: 192.168.1.1]
Error: 18452, Severity: 14, State: 1.
Login failed for user ''. The user is not associated with a trusted SQL Server connection. [CLIENT: 192.168.1.1]
After exhausting all of the normal troubleshooting for this error (accounts locked or disabled, SQL service accounts, bad connection strings, SPNs, etc.), I spent the next few hours learning more about the way SQL Server handles authentication requests than I had ever wanted to know.
The Scenario –
A couple of individual Windows IDs started generating these errors while attempting connections; all other Windows logins were working properly. The failures first showed up in applications, but also occurred through sqlcmd. When logged in to the server locally with the offending IDs, connections to SQL Server succeeded.
The Troubleshooting process –
Check all the regular SSPI issues. I won’t bore you with the details, as they are easily searchable
- If possible and appropriate, a relatively easy way to rule out the “easy” authentication issues is to log in to the SQL Server machine locally with the offending ID, fire up sqlcmd, and connect with sqlcmd -S servername,port -E (specifying the port forces TCP/IP instead of LPC, which brings the network into the equation)
Verify whether the login is trying to use NTLM or Kerberos (there are many ways to do this, but the simplest is to see whether there are any other Kerberos connections on the machine)
- SELECT DISTINCT auth_scheme FROM sys.dm_exec_connections
- If Kerberos is in use, there are a few additional things to verify related to SPNs. Since only NTLM was in use on this server, I skipped that
Determine whether the accounts were excluded from connecting to the machine over the network by a group policy or some other AD setting
After all of these checked out OK, I tried to figure out what error code 0x8009030c meant. It turns out the description is fairly obvious: SEC_E_LOGON_DENIED. This description was so helpful that I thought about turning this server into a boat anchor but, luckily for my employer, the server room is located many miles away and has armed guards.
Since I knew we could log on locally to the SQL Server with the ID that SQL was rejecting with “logon denied”, something else was trying to make my life miserable.
We didn’t have logon failure security auditing turned on, so I had no way of getting a better error description from the event logs. As luck would have it, though, working around this proved instrumental in finding the root cause. To get a better error message, I found this handy KB article detailing the steps needed to put Netlogon into debug mode.
Say hello to my new best friend! — nltest.exe
After downloading nltest & using it to enable netlogon debugging on the SQL Server, I got this slightly better message in the netlogon.log file
06/15 14:15:39 [LOGON] SamLogon: Network logon of DOMAIN\USER from Laptop Entered
06/15 14:15:39 [CRITICAL] NlPrintRpcDebug: Couldn’t get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064)
06/15 14:15:39 [LOGON] SamLogon: Network logon of DOMAIN\USER from Laptop Returns 0xC0000064
The error code 0XC0000064 maps to “NO_SUCH_USER”
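Reading these codes by eye gets old quickly, so here is a minimal Python sketch that pulls the failing SamLogon results out of a netlogon.log and names the status code. It assumes the line format shown in the excerpt above (real logs can contain other variants, which this regex would simply skip), and the lookup table is a small, non-exhaustive subset of NTSTATUS values. The names `NTSTATUS_NAMES`, `RETURN_LINE` and `failed_logons` are made up for this illustration.

```python
import re

# Small subset of NTSTATUS logon codes; not exhaustive.
NTSTATUS_NAMES = {
    0x00000000: "STATUS_SUCCESS",
    0xC0000064: "STATUS_NO_SUCH_USER",
    0xC000006A: "STATUS_WRONG_PASSWORD",
    0xC000006D: "STATUS_LOGON_FAILURE",
    0xC000006F: "STATUS_INVALID_LOGON_HOURS",
    0xC0000070: "STATUS_INVALID_WORKSTATION",
    0xC0000071: "STATUS_PASSWORD_EXPIRED",
    0xC0000072: "STATUS_ACCOUNT_DISABLED",
    0xC0000193: "STATUS_ACCOUNT_EXPIRED",
    0xC0000234: "STATUS_ACCOUNT_LOCKED_OUT",
}

# Assumed format, taken from the excerpt above:
#   ... SamLogon: Network logon of <account> from <client> Returns 0x...
RETURN_LINE = re.compile(
    r"SamLogon: .*? logon of (?P<account>\S+) from (?P<client>\S+) "
    r"Returns (?P<code>0x[0-9A-Fa-f]+)"
)


def failed_logons(lines):
    """Yield (account, client, code, name) for every non-zero SamLogon result."""
    for line in lines:
        match = RETURN_LINE.search(line)
        if not match:
            continue  # "Entered" lines and unrelated entries are ignored
        code = int(match.group("code"), 16)
        if code != 0:
            name = NTSTATUS_NAMES.get(code, "UNKNOWN")
            yield match.group("account"), match.group("client"), code, name


sample = [
    r"06/15 14:15:39 [LOGON] SamLogon: Network logon of DOMAIN\USER from Laptop Entered",
    r"06/15 14:15:39 [LOGON] SamLogon: Network logon of DOMAIN\USER from Laptop Returns 0xC0000064",
]
for account, client, code, name in failed_logons(sample):
    print(f"{account} from {client}: 0x{code:08X} {name}")
```

To run it against a real file, pass an open file object (for example `open(path_to_netlogon_log)`) instead of `sample`.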
Since I was currently logged in to the server with the ID that was returning no such user, something else was obviously wrong, and luckily at this point I knew it wasn’t SQL.
Running “set log” on the server (which prints the LOGONSERVER environment variable) revealed that a local DC (call it DC1) was servicing the local logon request.
After asking our AD guys about DC1 and its synchronization status, as well as whether the user actually existed there, everything still looked OK.
After looking around a bit more, I discovered this gem of an nltest command, which determines which DC will handle a logon request:
C:\>nltest /whowill:Domain Account
[16:32:45] Mail message 0 sent successfully (\MAILSLOT\NET\GETDC579)
[16:32:45] Response 0: DC2 D:Domain A:Account (Act found)
The command completed successfully
Even though this command returned “Act found”, the response came from DC2. (I don’t exactly understand why the same account would authenticate against two different DCs depending on whether it is a local desktop login or a SQL login, but apparently it can.)
After I asked the AD guys about DC2, the light bulbs went on for them: that server sits behind a different set of firewalls, in a totally different location. While DC2 would return a ping, its console wouldn’t allow logons for some reason. After a quick reboot of DC2 and some magic AD pixie dust (I am not an AD admin, if that wasn’t totally obvious from my newfound friend nltest), the Windows IDs that were having trouble started authenticating against DC3 and our SSPI errors went away.
Interesting tidbit: during troubleshooting, I found that this particular SQL Server was authenticating accounts against at least 5 different DCs. Some of this might be expected since there are different domains at play, but I haven’t heard a final answer from the AD guys about whether it should work that way.
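If you want to see how logons are spread across DCs, one option is to save the /whowill output for several accounts and tally the responders. The Python sketch below only parses text in the response format shown above; it does not run nltest itself, and the names `RESPONSE_LINE` and `parse_whowill` are invented for this example.

```python
import re
from collections import Counter

# Assumed format, taken from the output above:
#   [hh:mm:ss] Response 0: <DC> D:<domain> A:<account> (<status>)
RESPONSE_LINE = re.compile(
    r"Response \d+: (?P<dc>\S+) D:(?P<domain>\S+) A:(?P<account>\S+) "
    r"\((?P<status>[^)]*)\)"
)


def parse_whowill(output):
    """Return one dict per 'Response' line found in saved nltest /whowill output."""
    results = []
    for line in output.splitlines():
        match = RESPONSE_LINE.search(line)
        if match:
            results.append(match.groupdict())
    return results


captured = r"""[16:32:45] Mail message 0 sent successfully (\MAILSLOT\NET\GETDC579)
[16:32:45] Response 0: DC2 D:Domain A:Account (Act found)
The command completed successfully"""

# Count how many responses each DC gave across everything captured so far.
tally = Counter(row["dc"] for row in parse_whowill(captured))
for dc, count in tally.items():
    print(f"{dc}: {count}")
```

Concatenating the output for a handful of accounts into `captured` makes it easy to spot a DC that nobody expected to be answering.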
The solution
Reboot the misbehaving DC. Of course there may be other ways to fix this, such as redirecting requests to a different DC without a reboot, but since it was misbehaving anyway and the AD experts wanted to reboot, we went with that. A reboot of SQL would likely have solved this problem too, but I hate reboot fixes for issues; they always seem to come back!
Special Houston Area SQL Server group meeting
May 17th
Want to learn SQL from a master (or even better, a pair of masters)? Have a free evening? Live within a reasonable drive of Houston? You won’t want to miss this presentation. Over the years I’ve had the opportunity to listen to hundreds of different SQL speakers, and two people who would make my short list of “don’t miss presenters” happen to be presenting at a HASSUG meeting this month.
If you’re not in Houston, I’d recommend using the LiveMeeting link!!
The following info is from the http://houston.sqlpass.org site
Special Evening Meeting in THE WOODLANDS!
When: Tuesday, May 18, 2010 – 6:30pm-8:30pm
Where: Woodforest National Bank
25231 Grogans Mill, Suite 550
The Woodlands, TX 77380
Topic: Essential Database Maintenance
Presenters: Kimberly Tripp & Paul Randal, SQLSkills
LiveMeeting Link for May 18 presentation –
https://www.livemeeting.com/cc/usergroups/join?id=HASSUG_WOODLANDS&role=attend
Online portion of meeting to begin at 7pm
Conference Call for audio – 1-888-320-3585 (passcode 76027128)
Allowing effective developer access to SQL Server
Apr 29th
When creating a new application, after going through the entire business analysis and requirements gathering process, you normally wind up with a datamodel that includes many tables and relationships. By this time, depending on the size of the datamodel/system, a considerable amount of time has been invested on all sides. We need a way of preserving this investment of time while still allowing developers to do their thing!
Deploy
Most shops have policies in place for what level of access developers can have in each environment. In many places I’ve seen, developers are allowed DBO access in development, and some lesser access in the higher environments (read only usually).
After you’ve deployed the datamodel to the physical database in a development environment, and before you grant the developer group dbo access, consider all of the time and effort that has been spent making the datamodel what it is. To let the developers do their jobs without letting them modify the actual table/schema layout, you can grant a combination of privileges:
Grant Alter Schema on the schemas where the developers will need to modify database objects (for instance stored procedures and functions)
Grant db_datareader: to allow read access
Grant db_datawriter: to allow write access
Grant Create Procedure, Function, Default, etc.: allow developers to do whatever you are comfortable with
Deny Create Table in the database: this restricts all table-based DDL
Optional** Deny Create View, Function, Default in the database: restrict any create/alter permissions as needed
Important** Alter Schema permissions will allow Alter of ANY object type in the schema that you haven’t explicitly used a Deny on
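The list above can be turned into a repeatable script. Here is a minimal Python sketch that prints the T-SQL for one developer role; the role name `DevRole`, the schema `dbo`, and the helper `developer_grants` are hypothetical, and it uses sp_addrolemember for the fixed database roles on the assumption that the target is SQL Server 2005/2008. One caveat: as far as I know, DENY CREATE TABLE only blocks creating new tables, and ALTER on the schema may still permit altering or dropping existing ones, so test with a real developer login before relying on it.

```python
import re

# Only plain identifiers are accepted, so the generated T-SQL needs no escaping.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check(name):
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def developer_grants(role, schemas):
    """Return T-SQL statements for the grant/deny mix described above."""
    role = _check(role)
    statements = [
        f"GRANT ALTER ON SCHEMA::[{_check(schema)}] TO [{role}];"
        for schema in schemas
    ]
    statements += [
        # Read and write access through the fixed database roles.
        f"EXEC sp_addrolemember N'db_datareader', N'{role}';",
        f"EXEC sp_addrolemember N'db_datawriter', N'{role}';",
        # Object types developers may create; extend or trim to taste.
        f"GRANT CREATE PROCEDURE, CREATE FUNCTION TO [{role}];",
        # Keep new tables out of the modeled database.
        f"DENY CREATE TABLE TO [{role}];",
    ]
    return statements


# 'DevRole' and 'dbo' are placeholder names for illustration.
for statement in developer_grants("DevRole", ["dbo"]):
    print(statement)
```

Run the printed statements in the development database after the role exists, and add further DENY lines (for example on CREATE VIEW) if you want a tighter lockdown.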
Principle of least privilege
This method has proven effective for letting developers write stored procs, functions and views while keeping the actual datamodel (usually tables and relationships) in pristine shape. You could also mix and match your own grants and denies on certain object types for whatever configuration you need, without granting the almighty DBO. Yes, you might say that I’m a paranoid DBA who restricts permissions even in DEV! Of course, my great developers would never change a modeled database, thereby forcing my hand into figuring out this lockdown of privileges.
Runaway System Cache Increase Kills SQL
Apr 21st
Ran into this a while back, and we finally found a root cause, so I thought I’d put it out here in hopes that it saves at least one person the amount of head bashing I had with it.
Environment
Windows 2003 Enterprise R2 SP2 w/32GB RAM
SQL Server 2005 Standard Edition SP3 64-bit active/passive cluster
We started seeing this glorious message in the SQL Server Error log.
A significant part of sql server process memory has been paged out. This may result in performance degradation. Duration XX seconds. Working set (KB) XXX, committed (KB) XXX, memory utilization 0%
The message varied slightly but the essence was always the same.
This error message is all too common on systems where SQL memory is misconfigured or where something is unduly pressuring SQL for memory. In this case a quick check of the settings showed that everything was in order. The first two times this happened it was the middle of the night during backups (in the SLA window), so no one really noticed a performance degradation. We didn’t think much of it at the time but, in hindsight, we should have.
The Failure
Monday morning, 8 AM: a developer makes a bad update to the database. No problem, I say, LiteSpeed can roll back the transaction. So I start to copy the full DB backup plus tran logs (~25 GB) off the server, which is how we process LiteSpeed recoveries through the log reader. About 3 minutes later the server became totally unresponsive, and the error about paging the SQL process memory was logged. At the time I didn’t put 2 and 2 together, as this particular server runs a varied workload of about 1500 batches/sec and has anywhere from 1200 to 2500 connections open at a time, so it could have been anything! After some further digging I figured out that the file copies were causing the SQL memory to get paged out. Until then I had never heard of a file copy causing an issue in SQL Server!
The experts weigh in
While looking at the issue, two perfmon counters stuck out: Memory\Cache Bytes and Memory\Available MBytes. While file copies were happening, the cache bytes counter would increase very quickly while the available MBytes counter would drop. Once available MBytes dropped to 0, SQL Server started to page memory out. After a bit of paging, the errors were logged that SQL had its memory paged out, and SQL became unresponsive. Since this was a high priority system, I did what any good SQL Server DBA would do: I contacted a few people in my network who might have seen this before. Interestingly enough, I got the exact same response from every one of them: use “lock pages in memory”, and don’t use Windows Explorer to do huge file copies, as this is a known “problem”.
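The pattern in those two counters is simple enough to check mechanically once the samples are exported. A minimal Python sketch, assuming you already have the readings as (cache bytes, available MB) pairs in time order; the function name `find_cache_pressure`, the 100 MB threshold, and the sample numbers are all made up for illustration and are not measurements from this server.

```python
def find_cache_pressure(samples, low_avail_mb=100):
    """Return indexes of samples where the system cache grew since the
    previous sample while available memory sat below low_avail_mb.

    samples: list of (cache_bytes, available_mbytes) tuples in time order.
    """
    hits = []
    for i in range(1, len(samples)):
        prev_cache, _ = samples[i - 1]
        cache, avail_mb = samples[i]
        if cache > prev_cache and avail_mb < low_avail_mb:
            hits.append(i)
    return hits


# Hypothetical readings taken during a large file copy.
samples = [
    (2_000_000_000, 9000),
    (8_000_000_000, 4000),
    (14_000_000_000, 600),
    (18_000_000_000, 40),
    (19_000_000_000, 0),
]
print(find_cache_pressure(samples))
```

Any index this returns marks a moment worth lining up against the SQL Server error log timestamps.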
Even though I trusted my sources of info, I had a hard time believing that file copies of sizes all the way down to 1GB would cause this sort of havoc without this being something Bing+Google would know about (different file sizes mattered, some sizes worked fine, some would cause the problem).
Workarounds not welcome
I had a valid workaround with lock pages in memory and not using Explorer for file copies, but I don’t normally like workarounds such as this on systems as important as this one is to us. After a few server rebuilds we figured out that we could reproduce this issue on any Win2k3 R2 Enterprise Edition 64-bit server, and this was the clue we needed to make a breakthrough. After rebuilding the systems from scratch and loading no drivers except SAN, we noticed that we couldn’t cause the error! So, after painstakingly adding each and every piece of our standard server build, we realized that Symantec AV (10.1.9.9) was the cause. Yes, another file system filter driver was misbehaving.
Looking back through the change communication, we identified when a new version of AV was pushed out; we just didn’t hit the error soon enough after the installation to put 2 and 2 together. Since disabling AV wasn’t an option, we started looking for the specific setting that caused the problem, and happened across a change that lets AV run without SQL getting paged out. With the network scanning options unchecked, the Windows cache no longer increases during a (network) file copy. Problem solved!!
Some time in the future, vendors are going to figure out how to write good file system filter drivers, or they are going to stop trying to use them! After fighting this issue for a few weeks (or was it months?), I can only hope this happens sooner rather than later.
