Sunday, February 15, 2015

SQL Server and PerfMon counters - including some tricky ones!

First thing you'll learn about me: I'll talk. A lot.

So: let's try something out, which might become a tradition for this blog, inspired by those "crazy kids and their intertubes" - the TL/DR synopsis.

The Too Long/Didn't Read Synopsis:
Good Perfmon Counters for SQL Diagnostics and forensics:
Logical (or Physical) Disk: Seconds Per Transfer - under 10ms is good; 10-20ms is usually adequate; 20-50ms is usually painful (at least occasionally); 50+ms is usually a serious bottleneck. (This applies to MDF and LDF disks - it might also apply to the ERRORLOG disk - ERRORLOG writes synchronously! - and it definitely applies to pagefile disks. Others - you be the judge.)
Bytes Per Second: How much traffic is the disk seeing? How much should it handle? It's reasonable to see more latency at the upper end of capacity. If a disk rated for 80 meg/second only slows down at 80 meg/second, we're not surprised. But if it's slowing down at 30 meg/second - or, heaven help us, *1* meg/second (yes, it happens!) - worry.
% Idle Time: A good disk is 90+% idle, almost always. If not, something odd is going on. (It could be getting hammered - or it could be misconfigured, running an outdated driver, or on its last legs.)

Memory: Available Megabytes - higher than 1000 and stable - stable is important! - for x64. 800 and stable is more clearly acceptable than bouncing between 1000 and 2500 (though both can be healthy); more isn't bad, and 5% of total RAM as a minimum for "big RAM" boxes seems like a reasonable safe harbor. (So: never dropping below 6gb free on your 128gb server is reasonable.)
Memory: Pages Per Second - if Available RAM is changing rapidly or is obviously low, pay attention to paging. Otherwise, Memory Mapped Files can make this look scary-high when it's actually normal.
(OS geeks/gurus have a lot more memory stuff to look at - but we're DBAs!)

Processor: %Total CPU: Less than 80% sustained - you're probably doing fine, but most healthy SQL Servers just tend to have 30% or less, most of the time. If you're growing over time, that's relatively normal - if you're spiking, worry. Sometimes looking at individual cores is useful - especially if MaxDOP is 1. But do you really have time to investigate 32 (or lord have mercy, 256?) cores' behavior?
%User time: SQL runs almost - almost - exclusively in user mode. This is the cap on (most) SQL CPU usage, except for Filestream calls.
(Again: OS gurus have more to look at - but DBAs can usually stop here.)

SQL Server:Buffer Manager:Page Life Expectancy:
Please see http://www.sqlskills.com/blogs/jonathan/finding-what-queries-in-the-plan-cache-use-a-specific-index/, section titled "What’s Wrong about Page Life Expectancy >= 300?".
I like to see about 50 seconds per gb in the buffer pool as an absolute minimum - a tad easier than "300 per 4gb" = 75 seconds per gb. Super-zippy disks can make lower values acceptable. See also: http://www.sqlskills.com/blogs/paul/page-life-expectancy-isnt-what-you-think/.
LazyWrites Per Second: This is buffer pool contention - SQL needed to call the lazy writer to dump some pages to make room for other pages. Some lazywrites, sometimes, is (are?) fine. If your average is meaningfully > 0, and not caused by a huge, huge spike, you've got buffer pool contention.
Buffer Cache Hit Ratio: I check this once in a while, but usually for shock value. It used to be meaningful, but buffer pools have gotten so huge these days, and read-ahead so good, I've never seen it much less than 99%. But if it is, it might be the final stone in the bucket.

SQL Server: Buffer Node: did you check Mr. Randal's advice above? Why the heck not? Paul Randal helped build the durn product! Listen to the man!

Sneaky Statistics - for forensics, mostly.

Process: SQLServr: Threads - if SQL suddenly uses a lot more threads, it's building up a backlog of connections/queries. WHY?
Process: SQLServr: Working set. Did it just drop? Bad news. Grow? Something just 'released' or SQL needs a boatload of memory. Remember: LOCKED PAGES IN THE BUFFER POOL AREN'T COUNTED!!!
SQLServer:SQL Statistics:Batch Requests/Sec: High? Low? (Dropped to 0 meaning SQL's locked?)
SQLServer:SQL Statistics:(re)Compilations/second: Compared to batch requests, if this is high, we'll see slowdowns and higher CPU (yep, compiling is often more expensive than running!)

SQLServer:Wait Statistics:Waits In Progress: I've had a hard time mapping the other wait stats to hard and fast SQLDiag collected waits - but "Waits In Progress" seems accurate. This can point to thread exhaustion, or simply give you a good starting point for digging into oddball waits you don't quite see in other areas.

And now: I'm allowed to ramble to my heart's content! I hope you find this entertaining or informative. If not - why'd you keep reading, silly-person?

PerfMon is one of the best tools for checking performance of almost anything you're working with on the Windows platforms. There might be better methods of finding answers - some "scorecard metrics" that you can pull up more quickly and easily - but for me, there's something special and magical about diving into perfmon counters. Look - disk usage shoots up. What else happens? OH! Paging! I bet our memory is moving - sure enough, look! Megabytes Available just plummeted from 1500 to 400 - this server is in big trouble!

It's not always that obvious, of course. But while Sherlock Holmes calls out the Baker Street Irregulars, or Dr. House plays cane-ball, I open up a set of perfmon counters and pull out my favorite bits of data to let me think about things.

What are my favorites?

Well - remember, this is for SQL Server. So it's all balanced in that direction.

First I'll pull up Logical Disk - logical disks are your C: drives, your D: drives, and so forth. They might also be your L:\SQLLogs\MSSQL\ drive, if you're using Mount Points. If so, you probably have a separate L: drive - and I won't even try to guess what counters will show where. Does a write to L:\SQLLogs\MSSQL only show up there? Or does it also show in L:? I don't think so - but trust your perfmon counters here - if they're showing latency, there's latency somewhere!

The drive letter is a key difference between Logical and Physical drives - physical drives might be 0, 1, 2, 3... with no (obvious) mapping to the drive letter or directory. Sometimes the letter mappings show, but not always. So I use logical. But checking both can be useful, because if physical and logical drives show different values, it can help isolate where a performance issue is occurring. But I never learned how to determine which counter means what - I have another team to do that for me! - so I just grab logical drives. (I believe that physical is pure hardware level - physical seconds per transfer measures from the last handoff from logical IO to physical IO. But I've been wrong before.)

My best metric here is "Seconds Per Transfer". A former colleague of mine, John Gose, explained that this is a great scorecard because SQL Server really doesn't care a whole bunch about "% Disk Time" (which can go > 100% if you have a RAID array), or disk queue length ("up to 1 per spindle is acceptable" - if you have a SAN backed by 250 spindles, shared by 14 applications, what disk queue is acceptable?). If SQL asks for a read or a write, and it occurs in under 10ms, SQL is probably very happy. I've seen some sources say that 7ms is better for a log file transfer (which will almost always be a sequential write) - but 10ms is generally good enough there, too, if you don't have poor wait statistics.

What if you don't have "10ms or less?" Well, that depends. The rule of thumb I heard from Microsoft CSS was "10ms or less is good; 10-20ms is bad enough that you'll notice; 20-50ms is definitely a bottleneck; 50+ms - we can stop looking for bottlenecks. This is the worst bottleneck we'll find." Ah - but that was back in the day of x86!

For most x86 servers, the biggest ordinary buffer pool you'd find was about 1500 meg; you could find more - even back in the days of Windows 2000, there were builds that could run with up to 32 gig. Well, databases were a lot smaller, sure, but data still grew, and someone always wanted "just one more database" on the server. It was pretty common for SQL Server to be disk-bound in this kind of situation. 50ms reads and writes probably put a huge bottleneck on your server in this case - you always needed to go to disk in the relatively near future.

With x64 servers, 32 gig for a server isn't the least bit surprising, and you can manage a 26 gig buffer pool on that. You can have a pretty darn big database before your truly hot data is bigger than 26 gigabytes. (Technically, 20-25 gigabytes - you always have some "stolen" pages used for other purposes.) 50ms disk reads and writes might not slow your server down all that much. Even 100ms latency might not kill you.

But 50+ms latency will do two things, for sure.

It'll wound the very soul of your DBA, if your DBA is good. (If you're good enough to recognize that this particular app is doing just fine with 50ms latency, you're also good enough to know how good you are, and how rare that is - so, to quote Boris Badenov, "sha up you mouth" about your unwounded soul!)

And it means that if you ever need those disks, your performance isn't just going to drop - it's going to go off a cliff.

So: seconds per transfer. Good statistic. By the way, remember that 10 ms is .010 seconds - count those decimal places carefully!
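
By the way, if you want to cross-check PerfMon's story from inside SQL Server, sys.dm_io_virtual_file_stats keeps cumulative stall times per database file. Here's a rough sketch - these are averages since the instance last started, so one nasty afternoon can hide inside weeks of quiet history; treat it as a sanity check, not a replacement for PerfMon:

-- Average read/write latency per database file, cumulative since SQL Server last started.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       CASE WHEN vfs.num_of_reads = 0 THEN 0
            ELSE 1.0 * vfs.io_stall_read_ms / vfs.num_of_reads END   AS avg_read_ms,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE 1.0 * vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
ORDER BY avg_read_ms DESC;

Anything consistently above 10ms average on an MDF or LDF file deserves the same scrutiny we just gave the PerfMon counter.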

The other side of seconds per transfer is Bytes Per Second. A decent laptop hard drive should handle 20 meg/second, no problems. A good desktop hard drive should handle 60 meg/second, no problems. Obviously, this is a rough guideline - it's not just bytes, there's number and types of operations, and how big they are, and are they random access or sequential. But if your laptop can't handle 20 megabytes per second, with latency generally lower than 10ms, fix your hard drive - similarly, for your desktop, and 60 meg/second, fix your drive. And if you're using a solid state drive, and are scoffing at me, quit gloating! Sure, I have and love solid state drives, but sometimes we have to use the work-supplied laptop - 'nuff said.

The point is, if you see bad latency, but you're running at the limits of what you can reasonably expect, well, we have to balance the latency against the demand.

If you're on a decent (non-SSD) hard drive, and your latency increases when you hit 60-80meg/second on your desktop - well, that's probably okay. You might need more, or better, disks, but we expect to see some latency increase at the top end.

Scary thing: I've looked at a lot of perfmon that shows average latency of 10ms or less... unless you restrict the capture to times when the disks are running 1 meg/second or greater - and suddenly, latency shoots up to 20ms or worse.

Yes, I said "one megabyte per second" - when I said it was scary, did you think I was joking? If you're looking at PerfMon, and you see latency and Bytes Per Second jump up, always, always, always, set your PerfMon to an appropriate zoom, and see if anything looks bad when you don't have minutes and hours of idle disks being averaged into your PerfMon counters!

What do I advise when I see poor disks? Well - my advice is to talk to your hardware support team if you have one, or your hardware vendor, if you don't. Go to them with a PerfMon capture, and demand that they explain the results to you. Don't let them say "we tested it, and it all showed no problems." A SAN or a RAID controller will usually show no problems unless there's an actual failure detected. But if you know you bought a RAID controller, or a SAN, that can handle 150 meg/second of traffic, and you're seeing poor latency at 25 meg/second, you know something's wrong.

It might be BIOS on the controller or SAN. It might be firmware. It might be a driver, or a filter.

(What's a filter? Well - anything that intercepts IO calls. It might not actually "filter" IO calls in the normal sense of letting some stuff through but blocking others. For example, some IO system drivers are "filters" even though they pass everything through. Anti-virus programs might be filters - and they do act as a traditional filter, blocking certain activity.)

Or it could be hardware - but if it is, your hardware support (or vendor) can probably see an alert, and there was no need to "demand" they explain your performance.

As I mentioned, in my day job, I work with SQL Support - and if they run into a non-obvious disk problem, they'll call up the Microsoft Platforms team. If you need help proving you have storage issues, Microsoft Platforms support can help. That's the big reason why I don't need to dig too deeply into disk performance counters. But I do like to look at % Idle Time, too.

A disk should almost always be 90% or more idle. If a disk is only 50% idle, and latency is still less than 10ms, and it's pushing about as many megabytes per second as you're paying for - okay, that's good. One case I worked, I saw a disk that was 10% idle, running around 200 meg/second, and still showing no more than 8ms latency - I called this out to the customer, because they deserved to know that whoever set up their IO did a hell of a great job on it.

But more frequently, you see latency that's 10-15ms, bytes per second up around 20 meg, and you think the disks are maybe okay - and then you see that they're only showing % Idle Time of 75, or 50, or worse(!!!). This is a final danger sign that might get you to dig in more deeply before real trouble starts.

So, there's your disks - a good basic check on them.

Seconds Per Transfer: .010 or better.
Bytes Per Second: use this to judge whether your latency is acceptable. Higher latency at the upper limits of your capacity is okay.
% Idle Time: A possible canary in your coal mine - it sometimes catches incipient problems before they're more clearly visible.

What's next? Memory.

If you're old-school, like me, you used to hear about "Pages Per Second". Unfortunately, while this matters, it's tricky. Windows can use memory mapped files - and reading from a memory mapped file is "paging". And the first time you need to use a memory mapped file, well, of course it's doing a fair amount of paging! That's a feature (not a bug)!

If your server is x64 (and dear lord, I hope it is!), you should instead look at Megabytes Available. Typically, for SQL, we want to see about 1000 megabytes available, and we want it stable. I'd rather see 900 meg available (Min 850, max 950) than 1200 available (min 1000, max 1400), but both are perfectly fine - and while stability is nice, if you need to launch large-ish processes from time to time, you just can't maintain perfect stability. And if you need to launch memory hungry applications, you also need more megabytes available to run them!

Anything much less than 900, though, and we start to panic. Anything less than 500? My friend, if you told me the Titanic sank because its x64 navigation system had less than 500 meg available, I'd say "doesn't surprise me a bit!" until I realized you meant the actual Titanic, and weren't speaking metaphorically. (64-bit servers didn't arrive on the scene until well after the Titanic was provisioned - and I know it didn't undergo a retrofitting of its nav system!)

Now: if you see a lot of paging, and a big shift in Megabytes available (or in Working Set in your Process counters) - you might be running into working set trim, and your server can lock up while it's happening. And if you see a fair amount of paging, and Megabytes Available shifting, you might want to see what's happening, to see if you can smooth it out a bit. But a fair amount of paging, and stable Megabytes Available is usually (not always!) innocuous.
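
If you'd rather glance at this from a query window than spin up PerfMon, sys.dm_os_sys_memory reports the same physical-memory picture SQL Server gets from Windows. A minimal sketch - and remember, it's a single point-in-time reading, so the "stable" part still needs PerfMon or repeated sampling:

-- Point-in-time view of physical memory as SQL Server sees it.
SELECT total_physical_memory_kb / 1024     AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc            -- e.g. "Available physical memory is high"
FROM sys.dm_os_sys_memory;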

What's next? Well, Processor is a good thing to check. Generally, for SQL Server, anything less than 80% sustained CPU usage is acceptable. SQL Server will use parallelism to speed up queries, and use more CPU to try to get better performance. Until you hit 80%, adding more or beefier CPUs probably won't help much. However, semi-paradoxically, most SQL Server instances tend to run on low CPU - 20% CPU or less is not at all uncommon, and I expect to see 30% or less on a healthy server. 40% isn't bad, or even a warning sign; it's just "surprisingly high" and I'll make a mental note. SQL just doesn't have a lot to do with its CPUs most of the time.

Typically, if I see really high CPU, I'll hear that the CPU is normally less than 50%, and suddenly shot up to 100% today - that's a hallmark of a really bad query plan, often one based on data that's purely contained in the buffer pool. (A huge nested loop join on data in the buffer pool can spin CPU like nobody's business.) Update stats, and see if it helps - an update with FULLSCAN isn't necessary, but it can be a valuable troubleshooting step, since it eliminates the chance that a random sampling caused the problem - and consider dumping your procedure cache, or the query plan for the problem-query if you know it.

Dumping your procedure cache is expensive - SQL queries now have to recompile. But think about it this way: if your CPU is pegged at 100%, does it really matter whether it's at 100% because queries are recompiling, or because the CPU is spinning endlessly on a bad plan? It's often - not always - worth a shot.
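
If you want those troubleshooting steps as actual commands, here's a hedged sketch - dbo.ProblemTable is a placeholder for whatever table your investigation points at, and the FREEPROCCACHE lines are deliberately commented out so you have to aim before you fire:

-- 1) Rule out a bad statistics sample on the suspect table (dbo.ProblemTable is a placeholder).
UPDATE STATISTICS dbo.ProblemTable WITH FULLSCAN;

-- 2) Find the heaviest plans by CPU and grab the plan_handle of the suspect.
SELECT TOP (10) qs.plan_handle, qs.total_worker_time, qs.execution_count, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;

-- 3) Evict just that plan...
-- DBCC FREEPROCCACHE (<plan_handle from step 2>);
-- ...or the blunt instrument: clear the whole cache and let everything recompile.
-- DBCC FREEPROCCACHE;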

Now we get into SQL Server statistics. If CPU, Memory, and disks are all healthy, we know SQL Server is resting on a good foundation, and we can dig into some specific counters that tell us more about SQL Server.

One of the biggest, most obvious, counter sets is the Buffer Manager counters - how healthy is the SQL Server buffer pool?

Buffer pool is something that confuses a lot of people. They think of it as the place SQL puts data for faster access than from the disks. It's a bit more than that. You see, SQL Server never directly accesses data on disk - it reads data into the buffer pool first, then it might make changes, (hopefully computes the checksum - you're set to CHECKSUM page verification, right?), and then writes it back if changed. So, for example, if you update statistics on a huge table, WITH FULLSCAN, that whole table's going to have to be read into the buffer pool before you finish.
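
Want to see what's actually sitting in the buffer pool right now? sys.dm_os_buffer_descriptors will tell you, page by page. A quick sketch - fair warning, it walks every page, so it can take a while against a very large buffer pool:

-- Buffer pool usage per database, and how much of it is dirty (changed but not yet written).
SELECT CASE WHEN database_id = 32767 THEN 'ResourceDb'
            ELSE DB_NAME(database_id) END                          AS database_name,
       COUNT(*) * 8 / 1024                                         AS cached_mb,
       SUM(CASE WHEN is_modified = 1 THEN 1 ELSE 0 END) * 8 / 1024 AS dirty_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY cached_mb DESC;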

The first good scorecard counter is Buffer Manager: Page Life Expectancy. This is SQL's estimate, in seconds, of how long a page that's accessed a single time will sit in the buffer pool before SQL decides it has a better use for that page. So a page life expectancy of 300 means that if you read a (well, "the") page from an 8kb table that you use once a day, that page will probably stay in the buffer pool for 5 minutes - 5 x 60 seconds.

Another way to think of this is, your entire buffer pool could be shuffled every 5 minutes. Now, back in the x86 days, you might have a 1 gig buffer pool; that's about 3 megs of reads per second, and maybe 1-2 megs of writes per second, if you have a busy database with lots of read/write activity.

Back in those days, Page Life Expectancy was judged to be healthy if you had 900, or worst case, 300. But in https://www.sqlskills.com/blogs/jonathan/finding-what-queries-in-the-plan-cache-use-a-specific-index/, Jonathan Kehayias makes a good case for basing it on buffer pool size. He recommends about 300 per four gigs of buffer pool.

As a person who digs into SQL Server performance for a living, I'll throw out another point. The real goal is to have SQL Server be able to flush dirty pages and pull in newly needed pages "at leisure". If you have a good SAN system capable of running 300 meg/second, then for you, 50 meg/second is pretty low speed. 50 meg per second is 3gb per minute - call it 2.5gb to account for both reads and dirty page writes. You could handle about 12.5gb in 5 minutes, never going above about 17% of your disk capacity. But I'll also point out that this requires knowing and trusting your capacities, making assumptions, and so forth.

The main point is, Page Life Expectancy is an idea about how quickly your buffer pool is churning, and this is partially determined by the speed of your disks. A damn good rule of thumb is what Mr. Kehayias suggests. If you're in that range, you know that if you have a problem with buffer pool, it's probably the disks, not the buffer pool size.
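
Here's that rule of thumb as a query - current PLE from the Buffer Manager counters against a target of roughly 300 seconds per 4 gigs of buffer pool. I'm using the "Database pages" counter as the buffer pool size, which is close enough for a scorecard:

-- Current Page Life Expectancy vs. a size-based target (~300 seconds per 4 GB of buffer pool).
SELECT ple.cntr_value                                    AS current_ple_seconds,
       dbp.cntr_value * 8 / 1024 / 1024                  AS buffer_pool_gb,
       (dbp.cntr_value * 8.0 / 1024 / 1024) / 4.0 * 300  AS target_ple_seconds
FROM sys.dm_os_performance_counters AS ple
JOIN sys.dm_os_performance_counters AS dbp
  ON dbp.object_name = ple.object_name
WHERE ple.object_name LIKE '%Buffer Manager%'
  AND ple.counter_name = 'Page life expectancy'
  AND dbp.counter_name = 'Database pages';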

One warning: yes, we're suggesting that the bigger the buffer pool, the higher the Page Life Expectancy we'd like to see.

Does that mean if you're at 8gb buffer pool, you might want to drop to 7gb so you don't need so much Page Life Expectancy? Stop - laugh. Of course not; you know that. Dropping the buffer pool will reduce page life expectancy far faster, and bumping up the buffer pool will increase PLE far faster, than these targets will grow.

(Did you laugh? You should. There's a very human tendency to get wrapped up in the benchmark and forget what it means. A good laugh now at how ridiculous it would be to forget "bigger buffer pool = better PLE" might help you avoid that trap some day when you're exhausted and stressed and trying to fix the (expletive deleted) server - or when a company officer demands to know if you're suggesting a smaller buffer pool. You'll recover, of course - but a good laugh might keep you from stumbling, and laughter is a necessity in this world.)

Me, I've noticed something interesting. There's a great tendency to see either small, tight buffer pools or huge ones (tens of thousands in Page Life Expectancy). This might mean there are two types of servers, under- and over-provisioned. It might also mean that the difference between "just enough" and "way plenty" is relatively small. So: don't be afraid to test out just a bit more RAM when you have the opportunity.

If you're using NUMA - Non-Uniform Memory Access, and the SQL Server log will list out your NUMA nodes if you are - check out Paul Randal's valuable advice: http://www.sqlskills.com/blogs/paul/page-life-expectancy-isnt-what-you-think/. If a NUMA node is tight on PLE, it's going to hurt. And this could happen if you have a lot of stolen pages - e.g., a large memory grant - on one node. You probably should take at least a cursory glance at the NUMA levels.
That's SQL Server:Buffer NODE: Page Life Expectancy.
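
If you'd rather not add counters to an existing PerfMon capture, here's a quick sketch of the same thing from the DMVs - instance_name is the NUMA node number:

-- Page Life Expectancy per NUMA node; one starving node can hide behind a healthy average.
SELECT instance_name AS numa_node,
       cntr_value    AS node_ple_seconds
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Node%'
  AND counter_name = 'Page life expectancy'
ORDER BY instance_name;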

Now: I mentioned that SQL doesn't work on disks, it works in the buffer pool. So if you do some work on massive amounts of data - say, UPDATE STATISTICS...WITH FULLSCAN on large tables, or a massive index rebuild, or a huge bulk insert - you'll see Page Life Expectancy drop. And then you'll see it start to grow. That's normal. Look at overall averages, obviously - but if it keeps dropping really low, that might be a problem, which leads us to our next counter.

Whatever your page life expectancy is, look for LazyWrites Per Second. The LazyWriter is usually the timekeeper in the buffer pool - when SQL has nothing better to do, it sweeps through the buffer pool, reducing TTL scores and flushing dirty pages - hence, "lazy" writer. But if SQL needs more free pages, it will call up the LazyWriter to be a bit more aggressive. If you see occasional LazyWrites, no big deal. If you see that your average LazyWrites Per Second is significantly higher than 0, that probably means you have some buffer pool contention, and might want to see what you can do about that. (Yes, I said "significantly higher than 0". If it was 1 or higher, it's probably gotten to be a bit of a problem already.)
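
One wrinkle if you check this through sys.dm_os_performance_counters instead of PerfMon: "Lazy writes/sec" is stored as a cumulative count, not a rate, so you have to sample it twice and do the division yourself. A sketch:

-- "Lazy writes/sec" is cumulative in the DMV; take two samples to get an actual rate.
DECLARE @first BIGINT, @second BIGINT;

SELECT @first = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%' AND counter_name = 'Lazy writes/sec';

WAITFOR DELAY '00:00:30';   -- sample interval: 30 seconds

SELECT @second = cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Manager%' AND counter_name = 'Lazy writes/sec';

SELECT (@second - @first) / 30.0 AS lazy_writes_per_second;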

If you look through all of these counters, and you see them all looking good, you've got a good, healthy server, most likely. But even good healthy servers can run into problems.

These next counters aren't always good at finding server health, but they can be very useful in one part of my job: figuring out what went wrong. I'll call this "forensics", though this might trouble some linguistics geeks.

A web application stops responding. Who's at fault? SQL Server (the web back-end) or the web server? Or you get a deadlocked-scheduler or non-yielding-scheduler warning in your SQL Server log. What happened?

In PerfMon, if you collect Process counters for all processes, there are some things you can look at for the SQL Server process (usually sqlservr.exe in the Process counters).

First is Threads. If SQL Server suddenly started using a lot more threads just before a lockup occurred, that gives you a big hint that something changed. SQL Server uses worker threads to service queries. A relatively small number of threads can handle a lot of queries, because most queries take just a few milliseconds. The worker thread accepts the query, runs the query, returns the results, and moves on to the next query. If SQL needs a lot more threads, it might mean a lot more queries are coming in, that they're all taking longer to run, or some combination of the two. Of course, SQL can also be sitting stable at its top number of worker threads and suddenly lock up, due to a change in query performance or in the number of queries - but I've often seen SQL jump in thread count before a problem, and this often gives us a time tick to investigate the logs on the front-end server.
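
You can watch the same thing from inside SQL Server - assuming it's still responsive enough to answer a query - with a quick sketch of current workers against the configured ceiling:

-- Current worker threads vs. the configured maximum; a steady climb toward the ceiling
-- usually means queries are piling up behind something.
SELECT SUM(s.current_workers_count) AS current_workers,
       MAX(i.max_workers_count)     AS max_workers
FROM sys.dm_os_schedulers AS s
CROSS JOIN sys.dm_os_sys_info AS i
WHERE s.status = 'VISIBLE ONLINE';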

Another good counter to review is Working Set. Working Set is how much RAM the process is using, but only counts RAM that's subject to paging. So: if you're using Lock Pages In Memory (so SQL Server's buffer pool is using locked memory), the RAM SQL is using is the working set, plus the buffer pool. If you're not using Lock Pages In Memory, working set (hopefully!) includes the entire buffer pool.

If Working Set changes - especially if it drops precipitously! - that can be valuable evidence that something is going on. Working Set Trim - where Windows decides that it's time to grab a bunch of Working Set memory for other purposes - is a real possibility. Also, if SQL Server working set drops, see what happens to the Total working set counter - does it mirror SQL? And here's a time where we should see what's happening in Memory: Available Megabytes and Pages Per Second.
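
For the SQL-side view of the same story, sys.dm_os_process_memory is handy - and unlike the Working Set counter, it does account for locked pages. A sketch:

-- SQL Server's own accounting of its memory, including pages locked out of the working set.
SELECT physical_memory_in_use_kb / 1024  AS physical_memory_in_use_mb,
       locked_page_allocations_kb / 1024 AS locked_pages_mb,
       large_page_allocations_kb / 1024  AS large_pages_mb,
       memory_utilization_percentage,
       process_physical_memory_low       -- 1 = Windows has signaled physical memory pressure
FROM sys.dm_os_process_memory;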

Note that these won't provide any hard and fast answers - these are just giving you ideas of what could be going on, and might help you figure out when things started to go south.

Another fascinating counter I found a use for recently was SQL Server:Wait Statistics:Waits in Progress.

This counter family seems super-valuable; checking SQL waits is one of the top methods for improving SQL Server performance. The trouble is, when I've tried to cross-check them against a SQLDiag report, I've often had trouble connecting them - even the seemingly obvious ones, like "Cumulative wait time (ms) per second". But I did find that "Waits In Progress" can be an awfully useful indicator of what might be happening. If you keep seeing more than a few waits for any wait type showing up, you can get a sense for contention. Page IO Latch waits continuously spiking? You probably know the answer to that - disks, or buffer pool, or some combination. Page Latches? Start digging into what kind of pages, what kinds of latches, and where. Lock waits? Blocking. And the neat thing is, you can see how many of these collect at any one time, and how far apart they're spread. Lots of page IO latch waits, clustered near a lot of lock waits? That suggests disk insufficiency (or a very tight buffer pool) causing extended blocking, whereas just the lock waits, without any others, suggests a common query being blocked for an extended time.
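
The DMV cousin of "Waits In Progress" is sys.dm_os_waiting_tasks - what every task is waiting on right this instant. A sketch you can run a few times in a row to get that same "how many, and how clustered" feel:

-- What's waiting right now, grouped by wait type; run it several times and watch the pattern.
SELECT wait_type,
       COUNT(*)              AS tasks_waiting,
       MAX(wait_duration_ms) AS longest_wait_ms,
       SUM(CASE WHEN blocking_session_id IS NOT NULL THEN 1 ELSE 0 END) AS blocked_by_another_session
FROM sys.dm_os_waiting_tasks
WHERE wait_type IS NOT NULL
GROUP BY wait_type
ORDER BY tasks_waiting DESC;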

None of these are as useful as a live capture of a SQLDiag report (or PSSDiag if you work with Microsoft SQL Support). Ah, but what if you can't get authorized to run a full SQLDiag (even without a trace or Xevents collection)? PerfMon is often easier to get permission for; Windows is always collecting PerfMon counters; you're just asking to write a set to disk. And they can give you some ideas on what to dig into more deeply - they can help give you an overall picture of the state of the system.
