Spirit of SQL : September 2015

Sunday, September 27, 2015

Brief intro to XPath, pursuant to Extended Events discussion

Okay, as I mentioned earlier, if you want to work with Extended Events in SQL Server, you'll be best served if you know enough XML manipulation to work with the output. And that means you'll want to be able to manipulate XML enough to find, and present, the data that matters to you. For example, if you pull the data from a ring buffer, you'll end up with a big XML document containing everything in the ring buffer; with a file, you can usually pull in smaller XML documents, one per event, but there may be a fair amount of information in each - and it will vary, depending on what you decided to capture.

So you'll need a way to identify, and obtain data from a mess of XML. XPath is the first part of being able to do this. Earlier, I mentioned XQuery - but it turns out that all examples I've seen use XPath alone - XPath has query-like properties, because it describes parts of the XML document. An XPath construction, by itself, points to data, and you need only use a few tools to extract what you need from what you've pointed to.

Terminology first. XML has "elements". These are like <AnElement AnAttribute="AttributesAreAlwaysInQuotes">Some Text</AnElement>. An element is everything - open tag, all attributes, all text, closing tag. XPath also talks about "nodes" - these can be an element, but they can also be an attribute, or the text (and some other things, but we don't need to worry about those). These can be part of the XPath expression. Although we won't use this much, it helps to realize that XPath uses familial terminology - a tag that encloses another is a "parent" tag, and parents, plus "grandparents", etc., are all "ancestors"; similarly, the tag enclosed by the parent is a "child". Fellow children of the same parent are siblings, and children, plus "grandchildren" and so forth, are descendants.

Okay: since XML has a root element, and all new tags must close within their parent tags, we can think of an XML document as kind of like a Unix directory. If you've never used Unix (or not knowingly), the top level directory of a Unix file system is /. You then have a subdirectory (which might be a disk mapped to that directory name). You could easily represent the Windows file system this way by defining / as the root, and then /c as the C: drive, /d for the D: drive, etc. There are a fair number of similarities. For example, XPath uses both absolute and relative paths. Here's some sample XML:

<Vehicle>
<Car Make="Chevy" Model = "Sprint">
<Car_Data>My first car...</Car_Data>
</Car>
</Vehicle>

An absolute path there would be /Vehicle/Car/Car_Data - or even /Vehicle/Car/Car_Data/text(). You can also use relative paths. If Vehicle is not the root of our XML document (maybe it's part of "Assets"), then we can think of a relative path: Vehicle/Car/Car_Data, or Car/Car_Data. For relative paths, especially, there's also another similarity to file systems: . (a single dot/period) refers to the current location/node, and .. (two dots) refers to the parent location/node. This is the basic "bread and butter" of a search through an XML document. But you quickly run into problems with this.
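If you'd like to poke at these paths before we get to SQL Server, here's a minimal sketch in Python. It is not the SQL Server approach this series is building toward; it assumes the standard library's xml.etree.ElementTree, which supports only a subset of XPath and always evaluates a path relative to the element you call it on. The Assets wrapper is the hypothetical parent mentioned above.

```python
import xml.etree.ElementTree as ET

# The sample car, wrapped in a hypothetical <Assets> root.
assets = ET.fromstring(
    "<Assets>"
    "<Vehicle>"
    '<Car Make="Chevy" Model="Sprint">'
    "<Car_Data>My first car...</Car_Data>"
    "</Car>"
    "</Vehicle>"
    "</Assets>"
)

# From the <Assets> root, walk down one step at a time.
from_root = assets.find("Vehicle/Car/Car_Data")

# Move the "current location" to <Vehicle>, and the relative path gets shorter.
vehicle = assets.find("Vehicle")
from_vehicle = vehicle.find("Car/Car_Data")

# "." is the current node; ".." steps back up to the parent.
same_node = vehicle.find("./Car/Car_Data")
parent = assets.find("Vehicle/Car/Car_Data/..")

print(from_root.text)  # My first car...
print(parent.tag)      # Car
```

All three lookups land on the same Car_Data element; only the starting point changes.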

First: look at my XML - what if I expand this document to a list of all cars I have ever owned? Then, /Vehicle/Car won't be unique, right?

This is actually one of the wonderful things about XPath. /Vehicle/Car can be non-existent, unique, or have hundreds of values. XPath is sort of like an auto-select; "this is the path to all Root-Level Vehicle, child Car, elements." This is really nice, and really slick, when you think about it - it's like a path, and a select, all in one! But a select without a where clause is like a banana split without the banana. It's nice, but there's something critical missing! Fortunately, XPath has all kinds of where clauses.

A simple one is an array-like quantifier. /Vehicle/Car[1] means the first Car that's a child of Vehicle. There are also some functions you can use in the square brackets: last() will get you the last item, so last()-1 will get you the penultimate, and last()-2 will get you the antepenultimate. These are really useful when you need to deterministically return a singular item (rather than a list). Remember, /Vehicle/Car is all Car values that are children of the root Vehicle - so /Vehicle/Car[1] means "the first (and in our case, only) Car". Another neat example for square brackets, one that's a good reminder that XPath is path, plus select, is how you can point to up to the first three cars that are siblings (same parent) with /Vehicle/Car[position()<=3] (the function name is lowercase, and XPath is case sensitive). Obviously, that last one won't allow you to return a single item/value! Still, if Microsoft had a common event for QUERY_PROCESSING health, and it included wait information, and you wanted the top 5 waits... oops, I'm getting ahead of myself.
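Here's a small sketch of those positional predicates, again in Python's xml.etree.ElementTree rather than SQL Server. Assumptions: every car after the Chevy Sprint is invented for illustration, and because ElementTree understands [1], [last()] and [last()-1] but not position()<=3, a Python slice stands in for that last one.

```python
import xml.etree.ElementTree as ET

# A longer list of cars; only the Chevy Sprint comes from the text above.
vehicle = ET.fromstring(
    "<Vehicle>"
    '<Car Make="Chevy" Model="Sprint"/>'
    '<Car Make="Ford" Model="Fiesta"/>'
    '<Car Make="Honda" Model="Civic"/>'
    '<Car Make="Toyota" Model="Corolla"/>'
    "</Vehicle>"
)

first = vehicle.find("Car[1]")              # XPath positions start at 1, not 0
last = vehicle.find("Car[last()]")
penultimate = vehicle.find("Car[last()-1]")

# ElementTree has no position()<=3, so slice the full match list instead.
first_three = vehicle.findall("Car")[:3]

print(first.get("Make"), last.get("Make"), penultimate.get("Make"))
print([car.get("Make") for car in first_three])
```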

Also, remember how we said that attributes, and text, are nodes as well? They are - but when you want an absolute path to an attribute, you need to preface the attribute name with @. So, the path to the Car Make would be /Vehicle/Car/@Make. Pause here for a moment - while this is a path, it's a path to an attribute value, while /Vehicle/Car is a path to an element. This will matter - the "return value" is a bit different, right? But we'll get to that later.

For the text data in an element, you need to point to the text() operator. So we'd want /Vehicle/Car/Car_Data/text() to point to "My first car...".
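To make the "return value is a bit different" point concrete, here's a hedged Python sketch. Python's xml.etree.ElementTree (my choice for illustration, not the SQL Server methods) doesn't accept @Make or text() as the final step of a path, so you select the element and then read the attribute or text off it. That makes the difference between an element, an attribute value and text easy to see.

```python
import xml.etree.ElementTree as ET

vehicle = ET.fromstring(
    "<Vehicle>"
    '<Car Make="Chevy" Model="Sprint">'
    "<Car_Data>My first car...</Car_Data>"
    "</Car>"
    "</Vehicle>"
)

car = vehicle.find("Car")                  # an element node: tag, attributes, children
make = car.get("Make")                     # stands in for /Vehicle/Car/@Make
text = vehicle.find("Car/Car_Data").text   # stands in for /Vehicle/Car/Car_Data/text()

print(type(car).__name__)  # Element
print(make)                # Chevy
print(text)                # My first car...
```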

When you start with a / and the root element, you're using an absolute path. Per all XPath information I've found, a relative path, like "Car/Car_Data" should be a way to find any Car_Data child of a Car node. However well this is documented, it doesn't seem to help in the Microsoft XML methods I'm going to be demonstrating later. But it's a good way to introduce another operator: //, a double slash - this is an operator that tells XPath that you don't care where something occurs, so long as it does occur, and this gives us a way to use those relative paths. So, //data will pull any data node, and //data/value will point to any value node that's a child of a data node - or, //Car/Car_Data (mind the capital D - XPath is case sensitive) will find the values we hoped for earlier.
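A quick sketch of the double slash, using Python's xml.etree.ElementTree as a stand-in for a real XPath engine. Assumptions: the DataCollection/event/data/value layout and every name and value in it are made up for illustration, not captured Extended Events output, and ElementTree wants a leading dot (.//data) because its paths are relative to the element you start from.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the layout is an assumption, not real event output.
doc = ET.fromstring(
    "<DataCollection>"
    "<batch>"
    '<event name="sample_event">'
    '<data name="duration"><value>12</value></data>'
    '<data name="kind"><value>example</value></data>'
    "</event>"
    "</batch>"
    "</DataCollection>"
)

# .//data: any data node, however deeply it is nested.
all_data = doc.findall(".//data")

# .//data/value: any value node that is a direct child of a data node.
values = [v.text for v in doc.findall(".//data/value")]

print([d.get("name") for d in all_data])
print(values)
```

Note that nothing in either path mentions the batch element; the double slash skips right over it.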

For additional searching fun, /event//data will find any data node that's a descendant of an event node. Or, /DataCollection//event//data will point to all data nodes that are descendants of an event node that are part of the DataCollection XML document, however deeply either the event, or the data, is nested. You can combine these with the array-quantifiers above, but be careful. /DataCollection//event//data[1] means the first data node in every set of /DataCollection//event//data siblings - if you have 10 events, each with data as a descendant, you'll get 10 values. If you want just the first value in the document, you want to use parentheses to clarify this. So, (/DataCollection//event//data)[1] means the first match in the document. Similarly, /DataCollection//event//data[position()<20] would point to the first 19 data siblings in each collection of data-siblings; (/DataCollection//event//data)[position()<20] grabs the first 19 in the entire document. So if you have three sets of 10 data nodes, the first gets all of the data nodes (up to 19 siblings of each group of 10) whereas the second doesn't even quite collect the first two (all 10 siblings in the first group, and only 9 slots remain for 9 of the 10 siblings in the second group).
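The "three sets of 10" arithmetic is easy to check with a sketch. This is Python's xml.etree.ElementTree again, under stated assumptions: the document is generated rather than real, the names like e0d0 are invented, and since ElementTree supports neither parentheses nor position()<20, Python slices emulate both (slicing each sibling group versus slicing the whole match list).

```python
import xml.etree.ElementTree as ET

# Build three events with ten data children each (names are invented).
doc = ET.Element("DataCollection")
for e in range(3):
    event = ET.SubElement(doc, "event", name=f"event{e}")
    for d in range(10):
        ET.SubElement(event, "data", name=f"e{e}d{d}")

# data[1]: the first data node within EACH sibling group, so one per event.
first_per_event = doc.findall(".//event//data[1]")

# Emulates (...)[1]: the first match in the whole document.
first_in_doc = doc.findall(".//event//data")[0]

# Emulates data[position()<20]: up to 19 from each group of siblings.
per_group = [d for ev in doc.findall(".//event") for d in ev.findall("data")[:19]]

# Emulates (...)[position()<20]: the first 19 matches in the whole document.
whole_doc = doc.findall(".//event//data")[:19]

print(len(first_per_event), len(per_group), len(whole_doc))  # 3 30 19
print(whole_doc[-1].get("name"))                             # e1d8
```

The last line shows where the whole-document version runs out of slots: nine nodes into the second event.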

Okay, stop for a moment, and think about what we've discussed. You can find all nodes of a particular absolute path, and you can find all occurrences of certain relative paths. You can trim these to the first n items - you actually have all the knowledge you need to create an XPath to any single or top-n group of items in a simple XML document whose structure you know beforehand. If XPath stopped there, you'd already be able to do some relatively useful things.

But XPath does more than that. Earlier, I said that the square bracket operator was array-like. It isn't. You can use the powerful selection criteria allowed within the square brackets for something as simple as position in a list. But you can put a great many different selection criteria inside those brackets. For example, if we wanted to be sure we looked only at Chevys in our Vehicle list, we could use the absolute path /Vehicle/Car[@Make="Chevy"].

That's kind of cool, isn't it? But let's continue to pretend we have a lot of cars here, okay? This is way cool - I promise. You can select /Vehicle/Car[Car_Data="My first car..."]. You can say "give me Cars, but only if they have a Car_Data child with a specific text value."
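Both kinds of predicate, on an attribute and on a child's text, happen to be in the XPath subset that Python's xml.etree.ElementTree understands, so here's a minimal sketch. The Ford and the second Chevy are invented so that the filters have something to exclude.

```python
import xml.etree.ElementTree as ET

# Only the Chevy Sprint comes from the text; the other two cars are made up.
vehicle = ET.fromstring(
    "<Vehicle>"
    '<Car Make="Chevy" Model="Sprint"><Car_Data>My first car...</Car_Data></Car>'
    '<Car Make="Ford" Model="Fiesta"><Car_Data>A later car</Car_Data></Car>'
    '<Car Make="Chevy" Model="Nova"><Car_Data>Another Chevy</Car_Data></Car>'
    "</Vehicle>"
)

# Filter on an attribute value.
chevys = vehicle.findall("Car[@Make='Chevy']")

# Filter on the text of a child element; the result is still the Car.
first_car = vehicle.findall("Car[Car_Data='My first car...']")

print([car.get("Model") for car in chevys])     # ['Sprint', 'Nova']
print([car.get("Model") for car in first_car])  # ['Sprint']
```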

But it gets even cooler. Let's redo our XML a bit:

<Vehicle>
<Car Make="Chevy" Model = "Sprint">
<Car_Data>My first car...<Contents>My_Keys</Contents>
</Car_Data>
</Car>
</Vehicle>

We could use /Vehicle/Car//Contents[.="My_Keys"] to find the Contents node that holds my keys - and, from there, the car they're in. Interesting note here: I have Contents tagged as it is because while HTML is mostly whitespace agnostic, XML is funny. In text, whitespace is preserved (but all newlines are converted to line feeds). The text of Contents starts the character after the > and continues until the next tag definition, or the closing tag of Contents. So I closed up Contents to avoid having whitespace matching, or line feed aliasing, issues.

Also, just to make sure it's clear, although the predicate in the brackets is often at the last item of the list, it doesn't have to be. /Vehicle/Car[@Make="Chevy"]//Contents[.="My_Keys"] is perfectly acceptable - but if I use that, I better hope I didn't leave my keys in a Ford! As I'm sure you can guess, when you do this, all predicates must be satisfied - it's an AND combination, not an OR.
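Here's a sketch of the keys hunt, including the whitespace trap. Assumptions: it uses Python's xml.etree.ElementTree (the [.='text'] form needs Python 3.7 or later), and the Ford, with its carelessly indented Contents, is invented to show what an exact text comparison does with stray whitespace.

```python
import xml.etree.ElementTree as ET

# The Chevy follows the text above; the Ford is a made-up counterexample.
vehicle = ET.fromstring(
    "<Vehicle>"
    '<Car Make="Chevy" Model="Sprint">'
    "<Car_Data>My first car...<Contents>My_Keys</Contents></Car_Data>"
    "</Car>"
    '<Car Make="Ford" Model="Fiesta">'
    "<Car_Data>A later car<Contents>\n  My_Keys\n</Contents></Car_Data>"
    "</Car>"
    "</Vehicle>"
)

# Any Contents below any Car whose text is exactly "My_Keys".
keys_anywhere = vehicle.findall("Car//Contents[.='My_Keys']")

# Two predicates on one path: both must hold (AND, not OR).
keys_in_chevy = vehicle.findall("Car[@Make='Chevy']//Contents[.='My_Keys']")
keys_in_ford = vehicle.findall("Car[@Make='Ford']//Contents[.='My_Keys']")

print(len(keys_anywhere), len(keys_in_chevy), len(keys_in_ford))  # 1 1 0
```

The Ford's Contents never matches, because its text is a line feed, two spaces, My_Keys, and another line feed, which is not the same string as My_Keys.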

Okay: now, if I've done my job well, after reading this, you have an idea how to read some simple XPath statements. You can follow an absolute path, or a relative path; you can make decisions based upon attributes or child values, or even descendant values. If I threw a blob of XML at you, and showed you how to extract information using SQL Server's XML methods, you could piece together what I'm doing. And that's my next step.

Wednesday, September 23, 2015

Very brief XML introduction - pursuant to Extended Events discussions

I haven't been blogging nearly as much as I hoped - and I hadn't hoped for much more than a single post a month, which means blogging has been pretty damn skimpy!

But I decided that the world needed a bit more information about Extended Events in SQL Server, and, since I was studying them for my own professional development, it was time to try to share them with others.

Problem: if you want to do much of anything useful with Extended Events, you need to know a bit about XML. Talk about a pain!

XML is one of those things, you know? On the face of it, it's stone-cold simple. Enclose text in tags, sometimes even nested tags. You can also have attributes within the tags. And it's extensible, because you can add new attributes, or new tags, and you won't (necessarily) break something that doesn't know to look for the new tags - just like with old web browsers, if you don't know what a tag means, ignore it!

Ah, but XML in SQL Server is a brand new beast. Sometimes you want to check a few things - are they there, or not? - and sometimes you want to extract a few values, from text, or attributes. Sometimes you want to pull out some XML, sometimes you need to poke around in different places... you can do all of this, and because you can do all of this, you can process Extended Events results a bit easier than otherwise.

Step one: we need to discuss a bit about XML so you understand enough to know how it works.

There are many, many places to learn about XML out on the web. I'm not going to try to re-invent the wheel, so I'm going to spell out the most important parts here:

XML is eXtensible Markup Language - the "markup" is done by tagging, putting opening and closing tags around a section of text.
There must be a root tag to the document - all other tags will open and close within this.
XML is case sensitive! Never forget this, it will save you a lot of grief if you remember. But you'll probably forget, as I do, far too often.
Tagging is atomic - you can't cross your tags. You can have <Car><Make>Ford<Model>Fiesta</Model></Make></Car>- your tags are all nested - but you can't have <Car><Make>Ford<Model>Fiesta</Make></Model></Car>. If Model starts in Make, it must end there too.
Tags must be closed - you can create a self-closing tag by adding a slash just before the closing angle bracket. We could have, for example, <Model Name="Fiesta" />
Tags can have text data - the text enclosed in the tags - or attributes, like the "Name=" in the previous point.
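If you want to see a parser enforce these rules, here's a minimal sketch in Python using the standard library's xml.etree.ElementTree. The helper name is_well_formed is my own invention for illustration; the only library behavior relied on is that ET.fromstring raises ET.ParseError when the text isn't well-formed XML.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = "<Car><Make>Ford<Model>Fiesta</Model></Make></Car>"
crossed = "<Car><Make>Ford<Model>Fiesta</Make></Model></Car>"
wrong_case = "<Car></car>"
self_closed = '<Car><Model Name="Fiesta" /></Car>'
never_closed = "<Car><Make>Ford</Car>"

print(is_well_formed(nested))        # True
print(is_well_formed(crossed))       # False: Model must end inside Make
print(is_well_formed(wrong_case))    # False: Car and car are different tags
print(is_well_formed(self_closed))   # True
print(is_well_formed(never_closed))  # False: Make is never closed
```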

To expand upon this:
<?xml version="1.0"?>
<Memorobilia>
<B5>
<Weapon>
<PPG type="resin_sculpture">
<Quantity>4</Quantity>
</PPG>
</Weapon>
<Souvenir>
<Bear type="Teddy">
<Desc>JS embroidered shirt</Desc>
<Quantity>2</Quantity>
</Bear>
<Londo_Mollari_Figure>
<Quantity>1</Quantity>
<Desc>Posable 'action' figure </Desc>
</Londo_Mollari_Figure>
</Souvenir>
</B5>
</Memorobilia>
Notice that we start and end with "Memorobilia", and that everything we've collected is categorized as B5 (for Babylon 5, a darn good science fiction TV series). We could have eliminated either tag - but by enclosing the B5 tags in Memorobilia tags, we have room to add other forms of memorobilia. Sorry: Memorobilia. XML is case sensitive! And if we want to add this to other collectibles that aren't exactly Memorobilia, we can enclose this entire structure within a new set of tags. That's one of the neat things about XML.

This isn't the best, most well designed XML schema, but it's enough to build a basic understanding. Notice, interestingly, that there's little actual text data in here! But that's perfectly reasonable - as with the car example above, we don't necessarily need text if we have the right categorizations.

You can also see that some of the data is included, not by tagging/categorization, but by attributes - we can mention that it's a "Teddy" bear by throwing in an attribute on the Bear (in case there's a posable bear, or a statue of a bear), while noting that the Londo figure, of all the Souvenir "children", has no attributes.
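As a small sanity check on that document, here's a hedged sketch that loads it with Python's xml.etree.ElementTree (my choice of tool for illustration, nothing to do with SQL Server yet) and reads back an attribute, some text, and a total of the Quantity values.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<Memorobilia>
<B5>
<Weapon>
<PPG type="resin_sculpture">
<Quantity>4</Quantity>
</PPG>
</Weapon>
<Souvenir>
<Bear type="Teddy">
<Desc>JS embroidered shirt</Desc>
<Quantity>2</Quantity>
</Bear>
<Londo_Mollari_Figure>
<Quantity>1</Quantity>
<Desc>Posable 'action' figure </Desc>
</Londo_Mollari_Figure>
</Souvenir>
</B5>
</Memorobilia>"""

root = ET.fromstring(xml_text)   # the root element, <Memorobilia>

bear = root.find("B5/Souvenir/Bear")
londo = root.find("B5/Souvenir/Londo_Mollari_Figure")

print(bear.get("type"))        # Teddy
print(bear.find("Desc").text)  # JS embroidered shirt
print(londo.attrib)            # {} - no attributes on this element

# Case matters: there is a B5 element, but no b5 element.
print(root.find("b5"))         # None

# Add up every Quantity in the document, wherever it is nested.
total = sum(int(q.text) for q in root.iter("Quantity"))
print(total)                   # 7
```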

I mentioned "children"; that will be the subject of my next post, when I'll discuss XPath (and maybe XQuery).

Interesting issue with log file shrinks...

So, I was looking at this fine blog entry from Lisa Gardner, one of Microsoft's PFEs. And I decided to do a bit of futzing around with it.

If you're too lazy to follow the link (or if you can't see where the link goes and won't follow a link blindly - GOOD FOR YOU!), there's an undocumented command called DBCC LOGINFO in SQL Server. It will show you how many VLFs you have, and show you a nominal status for each of them - are they free, or in use?

I hope you all know that you don't want too many VLFs - fewer than a hundred is good, and more than a hundred only if you absolutely, positively have no other reasonable choice. Too many VLFs can slow down recovery (which includes database startup/availability when SQL starts!), transactional replication, mirroring/Availability Groups, and probably more that I can't think of. For all that it's undocumented, it's a good command.

Well, as Lisa points out, when you're using an availability group, DBCC LOGINFO has an interesting quirk.

Normally, when you run a log backup, DBCC LOGINFO will show all of the truncated VLFs as having a status of 0 - not in use. Lisa pointed out that if the server is in an availability group, while the log might be truncated (i.e., the VLFs are available for re-use) DBCC LOGINFO might still show them as having a status of 2 (in use).

Let's step back a bit to make sure we're clear on this. The Transaction Log File of a database is like a whiteboard. Each VLF is its own discrete section of whiteboard. When you "truncate" the log, all VLFs that are freed up by the process are now erased - clean, and clear, and ready to accept new data. These VLFs are still there; the log file is still the same size. But now the log file has more empty space (in the form of VLFs) to use. (This is why I use the whiteboard metaphor - erasing a section of whiteboard doesn't make the whiteboard smaller, it makes the blank space bigger.)

DBCC LOGINFO normally tells you which VLFs are at least partially in use, and which ones are blank.

But when the database is in an availability group, it may not be honest. You may have space (re-usable VLFs) but DBCC LOGINFO might show them as in use, status of 2.

Well, I ran into an oddball problem. On an availability group, I tried to shrink a log file - and it wouldn't. (Yes, I know. It had been bloated by one-time operations.) It had been backed up - I even triggered the backup 2-3 times. And DBCC LOGINFO showed all VLFs in use. DBCC SQLPERF(LOGSPACE) (which shows all databases, log file sizes, and percentage of log files in use) showed nearly nothing, less than 1%, of the log file was in use. So, I'm comfortable that I ran into this DBCC LOGINFO display strangeness.

When this happens, the usual trick is to run the log backup once or twice to truncate just a bit more data. I did that - and I still couldn't shrink the log, and DBCC LOGINFO still showed all VLFs in use.

In this case, I figured out what was happening. My log backups were all occurring on the secondary replica of the database. As soon as I took a log backup on the primary replica, DBCC LOGINFO showed almost all VLFs clear, and I could shrink the log file back to its normal size.

What's going on here? I'm not sure, but it would make sense if DBCC SHRINKFILE uses DBCC LOGINFO to find out which VLFs are not in use. When you shrink a log file, it must shrink on VLF boundaries. If it's using DBCC LOGINFO to find inactive VLFs, it might not be finding any. And that might be causing the problem.
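To make that reasoning concrete, here's a toy model in Python. It is emphatically not SQL Server's actual algorithm, and shrink_log is a name I made up. It only encodes two assumptions: a shrink can release whole VLFs only, and only unused ones (status 0) from the end of the file. Feed it the status values the shrink believes, and you can see why a log that reports every VLF as status 2 won't budge.

```python
def shrink_log(vlf_statuses):
    """Toy model: drop status-0 VLFs from the tail, stopping at the first in-use one.

    Assumes whole-VLF granularity and always keeps at least one VLF.
    """
    kept = list(vlf_statuses)
    while len(kept) > 1 and kept[-1] == 0:
        kept.pop()
    return kept

# After a truncation that is reported correctly: the tail is free, so it shrinks.
print(shrink_log([2, 0, 0, 0]))   # [2]

# Every VLF reported as in use (status 2): nothing can be released.
print(shrink_log([2, 2, 2, 2]))   # [2, 2, 2, 2]

# An in-use VLF near the end pins everything before it.
print(shrink_log([0, 0, 2, 0]))   # [0, 0, 2]
```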

It looks like a backup on the primary replica might fix this. I also saw someone fix it with several log backups, and several checkpoints - the log backups were on the secondary replica, and, of course, the checkpoints were issued on the primary. Some combination of Secondary Replica log backups, and Primary replica checkpoints cleared the DBCC LOGINFO VLF status values - and allowed the file to shrink.

One warning: if you do choose to take a log backup on your primary replica, when you normally run log backups on your secondary, I very strongly recommend that you move or copy the log backup to the same location as your normal log backups. When you have to restore log file backups, you don't want to have to dig around in multiple locations - you want them all in the same place!