Wednesday, March 11, 2015

CXPacket waits - they're not evil... they're just misunderstood!

So, everyone seems to know about poor, put-upon CXPacket waits. "That's the parallelism wait!" people will say. "You've set Max Degree of Parallelism too high!" Well... maybe. Or maybe you've set Cost Threshold for Parallelism too low. Or... maybe nothing's wrong at all.

It's true, CXPacket is the parallelism wait, but parallelism is a powerful tool - it lets SQL Server break certain query operations down into multiple pieces, so that they can run faster. The goal isn't to eliminate CXPacket waits - it's to find a balance where the parallelism behind those waits gives you more benefit than the waits cost you. (And as you'll see, you don't always lose anything!)

The Too Long, Didn't Read Summary:
Don't assume you're smarter than http://support.microsoft.com/kb/2806535; if you are, prove your genius with a test of your application.
CXPacket waits that come in at a reasonably steady rate, and stay below your number of cores divided by Max Degree of Parallelism, times 1000 ms/second, generally aren't a bad thing. Spiky waits might be, and even steady ones could be, but it'll be hard to prove. Consider other performance troubleshooting first.
If you get much higher than that Cores/MaxDOP * 1000 ms/second value, yeah, you need to figure out how to reduce parallelism. This can be done by dropping Max Degree of Parallelism, increasing Cost Threshold for Parallelism, or maybe by tuning your queries(!!!).

On to the main discussion:
Since CXPacket is the parallelism wait, I want to talk about parallelism for a moment. Here's the lowdown on how SQL decides if it should use a parallel plan. Once your query is nicely pulled apart into component bits, SQL Server starts to optimize it. First it looks for trivial plans, where there's only one way to do something, like "update one row in a table with a clustered index". If it finds one, it produces it, and stops. If it didn't find a trivial plan, it starts digging into other optimizations, trying to balance the anticipated cost of the query against the time it's spent optimizing. It doesn't try to find a perfect plan (there might be billions of possible plans!); it tries to find one close enough to perfection that it would waste time to look for a still-better one. If it finds that good-enough plan, it stops and delivers it. But if it doesn't, at the start of the next step it makes a decision: should it try a parallel query plan or not?
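If you want to see which of those outcomes your own queries got, the plan cache will tell you. Here's a rough sketch (it assumes you have VIEW SERVER STATE permission): pull some cached plans, open the plan XML, and check the StatementOptmLevel attribute (TRIVIAL vs. FULL), and look for Parallelism operators.

-- A sketch: grab the most-reused cached plans so you can open the plan XML and
-- check StatementOptmLevel (TRIVIAL vs. FULL) and look for Parallelism operators.
SELECT TOP (20)
    cp.usecounts,
    cp.objtype,
    st.text       AS query_text,
    qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
ORDER BY cp.usecounts DESC;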

If the cost of the query is higher than the Cost Threshold for Parallelism (from sp_configure), and if parallelism is allowed for this query, it will consider both parallel and serial plans. (Parallelism can be limited by Max Degree of Parallelism, query hints, and Resource Governor - and possibly other ways that I can't remember!)
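For example, here's what a per-query limit looks like - a MAXDOP hint (the table and columns are just stand-ins; use your own):

-- Hypothetical example: force this one query to stay serial, no matter what the
-- server-wide Max Degree of Parallelism is set to.
SELECT CustomerID, SUM(TotalDue) AS TotalSpent
FROM dbo.Orders            -- stand-in table name
GROUP BY CustomerID
OPTION (MAXDOP 1);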

What's the cost of a query? Well - it's an estimated time to run the query, in seconds, based upon a standard set of hardware. The default Cost Threshold for Parallelism is 5 - meaning a query that would have taken 5 seconds on a single-processor server with a single SCSI disk - or that's what I was told, a few years back. As always, I know that sometimes the best way to learn something on the internet is not to post a question, but to post something wrong. If that information is correct, it means that today, that "cost" is much smaller - a query that took 5 seconds on that standardized server of 16 years ago would probably run 20-30 times faster today, unless it was disk bound (and even then it might be much faster, given SSDs, SANs, and even plain old spinning platters that are nevertheless much faster than back then). So I don't think anyone should be shy about bumping up Cost Threshold for Parallelism. Let the relatively cheap queries run serially; they'll still be fast.
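If you do decide to bump it, it's a quick sp_configure change - the 50 below is just an example starting point, not a recommendation for your workload:

-- A sketch: raise Cost Threshold for Parallelism. 50 is only an example value.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;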

What about Max Degree of Parallelism? The official Microsoft recommendations are here: http://support.microsoft.com/kb/2806535. They come down to this (there's a sketch of setting it just after the list):

1) If you're using Hyperthreading, ignore it. Cut your number of logical processors in half, down to just your physical cores.
2) Max DOP of 0, or N, if you have N processor cores and N is less than 8. (This assumes you have only one NUMA node - which is likely, but not certain!)
3) Max DOP of 8, even if you have more than 8 processors - unless you've tested it, and find that more than 8 works better for you. Except:
4) No more than the number of processors in one NUMA node.
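Here's the promised sketch of applying that starting point - say, on a 32-core, single-NUMA-node box. The 8 is the KB's starting maximum, not a magic number; test before you settle on a value.

-- A sketch: set Max Degree of Parallelism to the KB 2806535 starting point.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
RECONFIGURE;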

The basic idea goes like this: splitting query operations has diminishing returns, and comes with more overhead and greater use of resources. Split an operation into too many pieces, and you can actually waste time dealing with the extra overhead. More importantly, when an operation gets split into many pieces, each piece has to use a worker thread. If too many of your queries are using parallel operations, with too many worker threads, you can run into a worker thread shortage. That would be bad. Think Egon-level bad.
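If you want to keep an eye on that, two quick checks (a sketch - both need VIEW SERVER STATE) show how many worker threads the server will allow and how many exist right now:

-- How many worker threads the server will allow...
SELECT max_workers_count FROM sys.dm_os_sys_info;
-- ...and how many currently exist.
SELECT COUNT(*) AS current_workers FROM sys.dm_os_workers;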

Experience shows that you usually get the most bang for your buck with up to about 8 threads working on a query operation at once. After that, you can start to end up in situations where you're losing more than you gain. Not always - but frequently. So frequently that 8 really is a pretty good starting maximum; it isn't a hard cap, but it's considered a very good practice. As for NUMA - NUMA is "NonUniform Memory Access": each CPU core has some memory that's "near" (faster to access) and some that's "far" (slower to access). So Microsoft recommends keeping a query's workers within one NUMA node, rather than guaranteeing you'll need both near and far memory in the same query. I've seen people disagree with this - people who think the extra boost of a few more processors working the query will overcome the disadvantage of mixing memory. They might be right! But test your application before deciding you know best - all the best instincts in the world don't beat a good load test.
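To see what your NUMA layout actually looks like - how many nodes, and how many schedulers (CPUs) each one owns - a sketch like this against sys.dm_os_nodes will do:

-- A sketch: NUMA nodes and the schedulers each one owns.
SELECT node_id, node_state_desc, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc = N'ONLINE';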

Okay: enough introductory stuff: on to CXPacket waits!

Once a parallel operation starts, the main thread running the query spins off its child tasks. I've seen it said that it always spins off MaxDOP tasks - if your MaxDOP is 8, every parallel operation splits into 8 parts. I've also seen a claim that SQL will determine on the fly how many child tasks to spin up for the compiled plan. I can't say which is the truth, but I can say that every time I've seen a parallel query in sysprocesses, or sys.dm_exec_requests, or similar reports, it has always had MaxDOP child tasks. Regardless - the parent task now goes into a wait on CXPacket.
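You can check this for yourself while a parallel query is running - a rough sketch that counts the tasks attached to each active request (the parent plus its children):

-- A sketch: requests currently using more than one task (i.e., running in parallel).
SELECT r.session_id,
       r.wait_type,
       COUNT(t.task_address) AS task_count
FROM sys.dm_exec_requests AS r
JOIN sys.dm_os_tasks AS t
  ON t.session_id = r.session_id
 AND t.request_id = r.request_id
GROUP BY r.session_id, r.wait_type
HAVING COUNT(t.task_address) > 1;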

Here's where we see something interesting. If you have 32 processors and Max DOP set to 8, and you have 4000 ms/second of waits in CXPacket, more or less consistently, what does that mean? Well, it might well mean that you constantly have 4 different parallel queries running in perfect harmony. And 4000 ms/second is a pretty substantial wait - if you saw this much in average waits on PAGEIOLATCH, or blocking, you'd have every right to be concerned. But you don't quite yet have reason to be concerned here. If you're using a tool like SQL Nexus, you might see that you have big bursts of CXPacket waits, instead of relatively steady waits - and that's often a sign of a problem. But if you graphed your CXPacket waits out, and they were nice and stable over time, and just about 4000 ms/second, you're probably hitting the sweet spot for parallelism. So that's one reason CXPacket isn't evil. CXPacket waits prove that you're using parallel queries - but they don't tell you whether you're gaining more than you lose. Up to this limit, we have reason to suspect we're probably gaining more than we lose. We're not certain, but we don't have strong evidence that CXPacket waits are a problem - so let's look for other things to optimize first.
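How do you get that ms/second number? sys.dm_os_wait_stats only gives you totals since the last restart (or reset), so a sketch like this - snapshot, wait, snapshot, divide - gives you the rate over an interval:

-- A sketch: measure CXPacket waits in ms/second over a 60-second window.
DECLARE @before bigint, @after bigint;
SELECT @before = wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type = N'CXPACKET';
WAITFOR DELAY '00:01:00';     -- sample interval: 60 seconds; pick your own
SELECT @after = wait_time_ms FROM sys.dm_os_wait_stats WHERE wait_type = N'CXPACKET';
SELECT (@after - @before) / 60.0 AS cxpacket_ms_per_second;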

Here's another way CXPacket isn't evil. Same situation: 32 processors, Max DOP of 8, about 4000 ms/second of CXPacket waits. Let's say you drop Max DOP to 4, and your CXPacket waits go up to 6000 ms/second. Good thing or bad thing? Well, remember how we decided that 4000 ms/second of waits wasn't bad when we had 32 processors and a MaxDOP of 8? If we constantly had 4 parallel queries, taking up 8 cores each, we'd have 4000 ms/second of CXPacket waits. Here, we could have up to 8 queries, running on 4 processors each - we could have as much as 8000 ms/second of CXPacket waits, if parallel queries were constantly running. But instead, we only have 6000 ms/second. We may well have better overall use of parallelism - even though our poor, put-upon CXPacket wait looks worse!

This is, obviously, a contrived example. Real life is rarely ever as clean as this example. But this does show you that CXPacket can often be, in fact, just misunderstood. Ah, but it can be evil, too!

There's another way to rack up CXPacket waits. When the main worker splits off the child tasks, the main worker goes into CXPacket wait. And then, if a child task finishes before the others, it also goes into CXPacket wait. When this happens a lot, you're tying up a lot more worker threads - remember, don't cross the streams, and don't run out of worker threads! - and you're not getting anywhere near the performance benefit from them that you should. The best that can be said of this situation is that the CXPacket waits probably aren't as bad as they look - i.e., they're not as bad as if they were lock waits, or latch waits, etc., because the parent thread always sucks up CXPacket waits equal to the duration of the parallel operation. Nevertheless, when your CXPacket waits are above that hypothetical optimum level - number of cores divided by Max DOP, times 1000 ms/second - it's definitely time to do some parallelism reduction. If your MaxDOP is higher than recommended, drop it a bit - and if you have more than 8 cores and haven't set it yet, set it! Consider dropping it even if you're spot-on the recommendations; they're only one-size-fits-most. Consider an increase in Cost Threshold for Parallelism - don't let your cheap queries suck up lots of extra resources to save a few ms. But more importantly, look at some parallel queries and see if there's a better way to speed them up - a narrow covering index on a wide table can drop the query cost below the Cost Threshold, especially if it saves you from having to do a post-scan sort; and if you're doing a parallel scan, you should definitely consider an index to see if you can eliminate it.
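As a hypothetical example (every name here is invented), a narrow covering index like this can turn a parallel scan-plus-sort of a wide table into a cheap, serial operation:

-- Hypothetical: a narrow covering index so a frequent query no longer needs a
-- parallel scan of the wide table, or a post-scan sort on OrderDate.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
ON dbo.Orders (CustomerID, OrderDate)
INCLUDE (TotalDue);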

So: if there's a moral to this story, it's to understand your CXPacket waits - with understanding, and proper care, you can prevent them from turning evil.