Sunday, February 15, 2015

Spinlocks and lock hash...

Spinlock - just the name sounds like something meaningful and powerful in SQL Server. "Spinlock contention" - if you could dig into issues like that, surely you'd be a god among geeks in the SQL Server world, right?

What is a spinlock? And how does it apply to you? Further - when should you care?

 The Too Long/Didn't Read Synopsis: Spinlocks can be a cause of contention, but it's actually relatively rare - before spinlock contention is clearly an issue, you have a lot of other possibilities to eliminate.

But I have observed some worrisome LOCK_HASH spinlock cases that suggested that it might be wise to dig into locking behavior if you see a lot of heavy activity on the Lock_Hash spinlock, either through KB 2926217 if you have lots of database locks, or by reviewing overall locking if you have large numbers (thousands and thousands) of individual locks of any type. Caveat emptor, test prior to use in production, package sold by weight, not volume, and contents may have settled....

And on to the body of the post:

I'm going be be forthright with you. I know about programming. For many years, I knew that the big bucks went to developers, and I wanted to be one of them. Alas, I majored in mathematics, not Computer Science. And where I went to school, Pascal was viewed as an up and coming language, and sure, while K&R floated around, that wasn't going to get me anywhere (it got some people places - but not me. Wrong kind of learning for me). I later studied Java and C++, but I kept missing some key that let me go from being able to read a lot of source code to being able to write good applications, even simple ones.

I know about programming - but I'm not a programmer. I can't actually tell you, with certainty, what a spinlock "is". But, as we all know, the best way to gather information on the internet is not to ask a question... it's to post something wrong. And I know what I have to post is probably at least 80% right, so let's take a whack at it, and see who helps me update this blog post!

Key thing about a spinlock is this: it's pure-D user mode. Moving between User Mode and System Mode is expensive - it's not a context switch, by the way, that's something else - so if you can use a spinlock to check for contention on a resource, you can avoid hitting system mode.

"Check for contention?" Okay, let's back up. You know in a multiprocessor system, you have to have isolation. Only one thing can change something at the same time. And you generally don't want anyone even reading something while it's being changed, or you have problems - e.g., a race condition (depending on who wins the race, you get a different value, right? but we generally want consistency).

A spinlock is a way of isolating a resource. And as near as I can tell, it goes like this: suppose you check something that's commonly used, but used for a very short time. So: if you can't have it this very instant, the odds are you can have it if you wait just a tiny bit longer. You'll probably get it faster than the OS can tell you to "sleep" and wake you up when it's available. So, a spinlock plays kid-in-the-back-seat, only instead of "are we there yet?... are we there yet?" it asks "can I have it now? Can I have it now?" over and over.

That's a greatly simplified version, of course. The spinlock request starts with a request for access; if it gets it, great! If not, it's a "collision". The spinlock request then runs a looping procedure (thus, the "spin") and then requests access. If it gets it, great. If it doesn't, it keeps spinning and requesting for a bit, and then it "backs off" - and here's where it goes to sleep.

One of the ways SQL Server maintains great scalability is by staying in User Mode almost exclusively, and using cooperative multitasking - tasks voluntarily yield on the CPU, to let other tasks have a turn, knowing full well that the other tasks will (probably - almost certainly) yield in turn.

(What is scalability? Well - let's say you have 4 CPUs running at 20% and processing a thousand tasks a second. If you throw two thousand tasks a second, and CPU goes up to 45-50%, you have a scalable system. If you jump to 80% - you probably don't. This is super, super simplified, of course, and it assumes that only CPU matters. But the idea of scalability is, your system grows in a relatively stable way - double capacity means a bit more than double resources. (Obviously, it might mean exactly double the resources, but that's far too clean for most real-world systems, which is why I suggest 2x transactions = a bit more than 2x CPU.))

So: where appropriate, SQL Server uses spinlocks. It uses a lot of different spinlocks. You can check spinlock counters using DBCC SQLPERF(SPINLOCKSTATS) and sys.dm_os_spinlock_stats. ("Hey, 'Weirdo - why don't you have links to the Microsoft documentation about these valuable commands?" Because I can't find a link to them! Neither Bing nor Google brings up an MSDN page from Books Online. But a useful resource is here: Diagnosing and Resolving Spinlock Contention on SQL Server. Note: instantaneous values are useless - at the very least, you need to run these twice and compare changes over time.)

When is spinlock contention an issue? Wow.

With certainty, spinlock contention is an issue when you see CPU shoot unexpectedly higher, as transactions climb, without the expected increase in throughput, and you see huge, huge increases in spins (and maybe back-off events) in one (most likely) or more (less likely, but possible) spinlock counters.

So: if our hypothetical server shot up to 80% when we tried to move from 1000 transactions per second, to 2000 transactions per second (instead of the 45-50% we'd expect) - we might be seeing spinlock contention. Spinlocks take up CPU, and, in fact, can "waste" CPU if they spin too long when (if we had an omniscient scheduler) they'd be better off sleeping.

With greater certainty, if your CPU is spiking to the point that CPU usage is actually slowing you down, and the cause is excessive spins on spinlocks, your problem is spinlock contention. (Someone out there just said "well, duh - too much CPU wasted in spinlocks means not enough CPU for other tasks." Good! That's precisely the point I was making, but I try to avoid first use of "well, duh"!)

But can spinlocks cause slowdowns before we reach that point? Yes, probably. One key point is that we have to rule out a bunch before we should settle on spinlock contention as the root cause of your problems. Odds are, if you look for another problem, you'll find one, and get more "bang for your buck" from that search for slowdowns.

Okay: but I evaluate performance (and other stuff) on SQL Server for a living. And I've seen some interesting numbers. And I think - just a feeling, mind you - that we might be able to work on this idea a bit.

One of the places where I see a lot of spinlocks, with high spins/collision and lots of backing off, is in Lock Hash.

"Lock hash? Does that mean you fry up lox with potatoes and onion...?"

Hashing in IT is a different process. A "hash" is a way of encoding data in a set of bits so that only "close" data has the same hash value. If you hash "1313 Mockingbird Lane NE" and "1331 Mockingbird Lane NE" you might get the same hash value - but if you hash "1314 Mockingbird Lane NE" you (should) get different one.

With a good hash algorithm, you can quickly eliminate an overwhelming majority of your possible matches for data. In fact, if you construct a "good" hash table of your data, the expected time to search that table is expected to be based upon a constant. You should be able to search for one value in a thousand roughly as fast as you can search for one value in a million. Note that a binary search algorithm is very fast - based upon the Log-base-2 of the data. And a B-tree search can be proportionately faster, but not much - it's still based upon the logarithm of the number of values. But the hash, for a large enough pool of values, always wins.

SQL stores its locks in a hash - it wants to be able to search for incompatible locks in a very swift manner, no matter how many locks it has.

I've evaluated a few interesting performance situations, and I've seen a few cases where I see some very high results for the Spins Per Collision on the Lock_Hash spinlock, and a great many Back Off events. Only rarely do I see something that obviously calls for http://support.microsoft.com/kb/2926217 - but I've sometimes recommended testing it anyway - note you need trace 1236 turned on for it to work! Now, note that KB article, and read it very carefully.

It calls out database locks, and database locks only - on 16+ core servers, table-level locks are partitioned already, and this trace flag partitions database locks. What else have I seen?

Well - I've seen queries taking out hundreds and thousand of row or key locks, and seen huge, huge values on spins per collision growing over time, and a scary number of back off events.

What do I do when I see this? Well - I sometimes suggest people consider testing a change in their locking behavior. If a big query is going to block hundreds of other queries, even if only briefly, and it's also going to take thousands and thousand of row or index locks, we might want to hint it to take just page or table locks. Hey - it's already going to block like a ... uh, let's be polite and say "big blocky blocking thing", but page or table locks are easier to find and more efficient to create (but might take longer to obtain!).

You can also set up an index so it doesn't allow row locks, or page locks (depending on your needs). This is, obviously, a trickier thing - a hint affects one query, this affects all queries. Then again, if you find that the majority of your queries are going to affect multiple rows on a page anyway, you might just find that you're not really losing much in concurrency.

You can also use Snapshot Isolation - your read queries won't (generally) be locking the table - they'll be using a copy of the relevant values (the version store) in TempDB.

Paul Randal sets up another interesting situation here: SOS_SCHEDULER_YIELD waits and the LOCK_HASH spinlock.

So: While I would have a very hard time proving that Lock_Hash spinlocks are the cause of (or even "a contributing factor to") a performance problem, if I start to see high spins/collision and back offs, I try to nudge people toward thinking about locking behavior. For other spinlocks? Remember: odds are, your root cause is something else. But check out the link to the white paper on resolving spinlock contention - it has some very good tips and techniques.

No comments:

Post a Comment