Sunday, September 27, 2015

Brief intro to XPath, pursuant to Extended Events discussion

Okay, as I mentioned earlier, if you want to work with Extended Events in SQL Server, you'll be best served if you know enough XML manipulation to work with the output. And that means you'll want to be able to manipulate XML enough to find, and present, the data that matters to you. For example, if you pull the data from a ring buffer, you'll end up with a big XML document containing everything in the ring buffer; with a file, you can usually pull in smaller XML documents, one per each event, but there may be a fair amount of information in each - and it will vary, depending on what you decided to capture.

So you'll need a way to identify, and obtain data from a mess of XML. XPath is the first part of being able to do this. Earlier, I mentioned XQuery - but it turns out that all examples I've seen use XPath alone - XPath has query-like properties, because it describes parts of the XML document. An XPath construction, by itself, points to data, and you need only use a few tools to extract what you need from what you've pointed to.

Terminology first. XML has "elements". These are like <AnElement AnAttribute="AttributesAreAlwaysInQuotes">Some Text</AnElement>. Everything - open tag, all attributes, all text, closing tag. XPath also talks about "nodes" - these can be an element, but they can also be an attribute, or the text (and some other things, but we don't need to worry about those). These can be part of the XPath expression. Although we won't use this much, it helps to realize that XPath uses familial terminology - a tag that encloses another is a "parent" tag, and parents, plus "grandparents", etc., are all "ancestors"; similarly, the tag enclosed by the parent is a "child". Fellow children of the same parent are siblings, and children, plus "grandchildren" and so forth, are descendents.

Okay: since XML has a root element, and every all new tags must close within their parent tags, we can think of an XML document as kind of like a Unix directory. If you've never used Unix (or not knowingly), the top level directory of a Unix file system is /. You then have a subdirectory (which might be a disk mapped to that directory name). You could easily represent the Windows file system this way by defining / as the root, and the /c as the C: drive, /d for the D: drive, etc.. There are a fair number of similarities. For example, XPath uses both absolute and relative paths. Here's some sample XML: <Vehicle> <Car Make="Chevy" Model = "Sprint"> <Car_Data>My first car...</Car_Data> </Car> </Vehicle>. An absolute path there would be /Vehicle/Car/Car_Data - or even /Vehicle/Car/Car_Data/text(). You can also use relative paths. If Vehicle is not the root of our XML document (maybe it's part of "Assets"), then we can think of a relative path: Vehicle/Car/Car_Data, or Car/Car_Data. For relative paths, especially, there's also another similarlity to file systems: . (a single dot/period) refers to the current location/node, and .. (two dots) refers to the parent location/node. This is the basic "bread and butter" of a search through an XML document. But you quickly run into problems with this.

First: look at my XML - what if I expand this document to a list of all cars I have ever owned? Then, /Vehicle/Car won't be unique, right?

This is actually one of the wonderful things about XPath. /Vehicle/Car can be non-existent, unique, or have hundreds of values. XPath is sort of like an auto-select; "this is the path to all Root-Level Vehicle, child Car, elements." This is really nice, and really slick, when you think about it - it's like a path, and a select, all in one! But a select without a where clause is like a banana split without the banana. It's nice, but there's something critical missing! Fortunately, XPath has all kinds of where clauses.

A simple one is an array-like quantifier. /Vehicle/Car[1] means the first Car that's a child of Vehicle. There are also some functions you can use in the square brackets: last() will get you the last item, so last()-1 will get you the penultimate, and last()-2 will get you the antepenultimate. These are really useful when you need to deterministically return a singular item (rather than a list). Remember, /Vehicle/Car is all Car values that are children of the root Vehicle - so /Vehicle/Car[1] means "the first (and in our case, only) Car". Another neat example for square brackets, one that's a good reminder that XPath is path, plus select, is how you can point to up to the first three cars that are siblings (same parent) with /Vehicle/Car[Position()<=3]. Obviously, that last one won't allow you to return a single item/value!. Still, if Microsoft had a common event for QUERY_PROCESSING health, and it included wait information, and you wanted the top 5 waits... oops, I'm getting ahead of myself.

Also, remember how we said that attributes, and text, are nodes as well? They are - but when you want an absolute path to an attribute, you need to preface the attribute name with @. So, the path to the Car Make would be /Vehicle/Car/@Make. Pause here for a moment - while this is a path, it's a path to an attribute value, while /Vehicle/Car is a path to an element. This will matter - the "return value" is a bit different, right? But we'll get to that later.

For the text data in an element, you need to point to the text() operator. So we'd want /Vehicle/Car/Car_Data/text() to point to "My first car...".

When you start with a / and the root element, you're using an absolute path. Per all XPath information I've found, a relative path, like "Car/Car_Data" should be a way to find any Car_Data child of a Car node. However well this is documented, it doesn't seem to help in the Microsoft XML methods I'm going to be demonstrating later. But it's a good way to introduce another operator: //, a double slash - this is an operator that tells XPath that you don't care where something occurs, so long as it does occur, and this gives us a way to use those relative paths. So, //data will pull any data node, and //data/value will point to any value node that's a child of a data node - or, //Car/Car_data will find the values we hoped for earlier.

For additional searching fun, /event//data will find any wait node that's a descendent of an event node. Or, /DataCollection//event//data will point to all data nodes that are descendents of an event node that are part of the DataCollection XML document, however deeply either the event, or the data, is nested. You can combine these with the array-quantifiers above, but be careful. /DataCollection//event//data[1] means the first wait event in every set of /DataCollection//event//data elements - if you have 10 events, each with data as a descendent, you'll get 10 values. If you want just the first value in the document, you want to use parentheses to clarify this. So, (/DataCollect//event//data)[1] means the first match in the document. Similarly, /DataCollection//event//data[position()<20] would point to the first 19 data siblings in each collection of data-siblings; (/DataCollection//event//data)[position()<20] grabs the first 19 in the entire document. So if you have three sets of 10 data nodes, the first gets all of the data nodes (up to 19 siblings of each group of 10) whereas the second doesn't even quite collect the first two (all 10 siblings in the first group, and only 9 slots remain for 9 of the 10 siblings in the second group).

Okay, stop for a moment, and think about what we've discussed. You can find all nodes of a particular absolute path, and you can find all occurrences of certain relative paths. You can trim these to the first n items - you actually have all the knowledge you need to create an XPath to any single or top-n group of items in a simple XML document whose structure you know beforehand. If XPath stopped there, you'd already be able to do some relatively useful things.

But XPath does more than that. Earlier, I said that the square bracket operator was array-like. It isn't. You can use the powerful selection criteria allowed within the square brackets for something as simple as position in a list. But you can put a great many different selection criteria inside those brackets. For example, if we wanted to be sure we looked only at Chevys in our Vehicle list, we could use the absolute path /Vehicle/Car[@Make="Chevy"].

That's kind of cool, isn't it? But let's continue to pretend we have a lot of cars here, okay? This is way cool - I promise. You can select /Vehicle/Car[Car_Data="My first car..."]. You can say "give me Cars, but only if they have a Car_Data child with a specific text value.

But it gets even cooler. let's redo our XML a bit:

<Vehicle>
<Car Make="Chevy" Model = "Sprint">
<Car_Data>My first car...<Contents>My_Keys</Contents>
</Car_Data>
</Car>
</Vehicle>

We could use /Vehicle/Car//Contents[.="My_Keys"] to find the car that has my keys in it. Interesting note here: I have Contents tagged as it is because while HTML is mostly whitespace agnostic, XML is funny. In text, whitespace is preserved (but all newlines are converted to line feeds). The text of Contents starts the character after the > and continues until the next tag definition, or the closing tag of Contents. So I closed up Contents to avoid having whitespace matching, or line feed aliasing, issues.

Also, just to make sure it's clear, although the predicate in the brackets is often at the last item of the list, it doesn't have to be. /Vehicle/Car[@Make="Chevy"]//Contents[.=My_Keys"] is perfectly acceptable - but if I use that, I better hope I didn't leave my keys in a Ford! As I'm sure you can guess, when you do this, all predicates must be satisfied - it's an AND combination, not an OR.

Okay: now, if I've done my job well, after reading this, you have an idea how to read some simple XPath statements. You can follow an absolute path, or a relative path; you can you can make decisions based upon attributes or child values, or even descendant values. If I threw a blob of XML at you, and showed you how to extract information using SQL Server's XML methods, you could piece together what I'm doing. And that's my next step.

No comments:

Post a Comment