Wednesday, May 13, 2015

Thoughts about checksums and changes

So, I'm looking at this server, and I'm thinking "we don't have to worry about data corruption causing X_Issue, if the Page Verify option is set to Checksum in database properties." (See https://technet.microsoft.com/en-us/library/ms190249%28v=sql.105%29.aspx, recovery options.) Trust me, you want checksum set - in return for a tiny, tiny, tiny amount of CPU during a write to disk, you get verification that the write went to disk, and a subsequent read came back from the disk, without any damage occurring to your precious data.
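
Setting it is a one-liner if you'd rather not click through the database properties dialog (the database name here is just a placeholder):

-- Turn checksum page verification on
ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;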

"Precious..." you say doing your best Gollum imitation? Hey, Gollum's got nothing on a DBA guarding a database! For data integrity, a good DBA will even outdo a sysadmin. Believe it.

Checksum's the default - you need to do a bit of work to get it turned off - or, you need a database that was birthed in SQL Server 2000 or earlier. Checksum was introduced in SQL Server 2005, and it's been the default ever since.
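
If you want to round up any stragglers - databases born back in the SQL Server 2000 days, or ones somebody switched off along the way - a query along these lines will find them:

-- Find databases not using CHECKSUM page verification
SELECT name, page_verify_option_desc
FROM sys.databases
WHERE page_verify_option_desc <> N'CHECKSUM';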

As I was pondering this, I thought about changing a database from page verify = NONE to checksum, and then I thought, hey, what if the database is already damaged? Oh my lord, what if we set checksum and suddenly we start flagging 824 errors! Will people think the checksum corrupted the database?

At that point, I realized I was tired. Really tired. Because that's one of the silliest thoughts that could come into my head - I know better! And if you know your checksums, you know better too - even if you don't realize it yet.

Checksums are computed at write time. If we had a thousand damaged pages already, and turned on checksum, how many 824 errors would be caused by that event alone? None. Even if SQL Server went out and computed a checksum on each and every page in the database (and it does not!!! - see the "myth" link below), it would be computing checksums over the already-damaged data. Checksums aren't magical. A checksum is simply designed so that if the data changes even a little, the correct checksum for that data changes too. So: you compute a checksum and send the data page to disk; later, when the page is read back into memory, you recompute the checksum. If the two values match, you're 99+% certain that the data hasn't been damaged in transit or at rest. If they don't match, you know something terrible has happened.
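
You can see the principle in miniature with T-SQL's CHECKSUM() function. To be clear, this is not the page-level checksum algorithm the storage engine uses - it's just an illustration that a small change in the data produces a different checksum value.

-- Illustration only - not the storage engine's page checksum algorithm
SELECT CHECKSUM('My precious data')  AS original_value,
       CHECKSUM('My precious data!') AS after_a_small_change;
-- The two values differ, which is the whole point: change the data, change the checksum.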

But if data's already damaged, and you compute a checksum, all you are doing is making sure that if it gets damaged again, you'll notice.

If you turn on checksum verification, only the dirty pages currently in the buffer pool automatically get checksums computed, as they're written out to disk. Hey, here's an interesting thought: what if we cut power right after we set the checksum option? Could we still say all the dirty pages in the buffer pool would get a checksum? Well, then it would be a race - but if you follow that Technet link above, you'll see that setting a database option runs a checkpoint on the database - so all the dirty pages should immediately get flushed, and hence checksummed, as long as the checkpoint completes. So, a new blog post, and the basis for a kinda-cool SQL Server trivia question. Not bad for a Wednesday night!
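
If you're curious, you can watch that checkpoint do its work by counting dirty pages before and after the option change. A rough sketch, with a placeholder database name:

-- Dirty pages in the buffer pool for this database, before...
SELECT COUNT(*) AS dirty_pages
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID(N'YourDatabase') AND is_modified = 1;

ALTER DATABASE YourDatabase SET PAGE_VERIFY CHECKSUM;  -- runs a checkpoint, per the Technet link above

-- ...and after - this should be at or near zero once the checkpoint completes
SELECT COUNT(*) AS dirty_pages
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID(N'YourDatabase') AND is_modified = 1;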

For a bonus trivia question, is it possible that turning on checksum page verification could reveal a bunch of damaged pages? Yes, but it requires a contrived situation to occur.

Stop and think a moment to see if you can get the answer - it's kind of cool once you figure it out.

Okay: if you turn off Checksum page verification, SQL stops checking checksums on reads, and stops computing them on writes. But remember, when you change the page verification option, SQL doesn't immediately add (or remove!) checksum information - a page only gains or loses its checksum once it's read into memory, changed, and written back to disk. For maximum horror value, let's back up this database, and restore it somewhere. (You do know that corruption in a database usually transmits perfectly via a backup, right?) Now, when someone restores this database, they might notice that checksum is turned off, and turn it back on. We probably still have pages carrying checksums computed back when the option was on - it'd be awfully strange if every single data page had been read and changed and written back to disk in the meantime! So, if some IO errors had damaged pages carrying those old checksums (technically, the stored checksum is still correct - it's the data that no longer matches it), SQL would resume computing and comparing checksum values, and could suddenly uncover a bunch of errors it would have caught all along if we hadn't turned off checksum in the first place. And yes, good luck explaining that turning on checksum didn't cause the corruption.
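
Here's the whole horror story as a script - purely a thought experiment, with made-up database, file, and path names, so please don't run it against anything you care about:

-- 1. Someone turns checksum off. Existing checksums stay on the pages; they're just ignored now.
ALTER DATABASE Sandbox SET PAGE_VERIFY NONE;

-- 2. Time passes; the IO subsystem quietly damages a few pages that already carried checksums.

-- 3. The corruption rides along in a backup...
BACKUP DATABASE Sandbox TO DISK = N'C:\Backups\Sandbox.bak';

-- 4. ...and gets restored somewhere else (the logical file names here are made up, too).
RESTORE DATABASE Sandbox_Copy
FROM DISK = N'C:\Backups\Sandbox.bak'
WITH MOVE N'Sandbox' TO N'C:\Data\Sandbox_Copy.mdf',
     MOVE N'Sandbox_log' TO N'C:\Data\Sandbox_Copy_log.ldf';

-- 5. A diligent DBA notices page verify is off on the restored copy, and turns it back on.
ALTER DATABASE Sandbox_Copy SET PAGE_VERIFY CHECKSUM;

-- 6. From now on, any read of a page whose old checksum no longer matches its data raises error 824 -
--    corruption that was there all along, "appearing" right after checksum was re-enabled.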

So: the moral of this story is, never turn off checksum page verification, unless you need to construct an absurd hypothetical. In fact, don't even turn it off then - make it a thought experiment.

For more on checksum, and the myths about it, please see this page from Paul Randal's "myth a day" series.

Thursday, May 7, 2015

Interesting changes in indexes...

Wow! While digging into another issue, I found some significant changes in index maintenance on the MSDN Alter Index Page for SQL 2012.

First: usually, if you rebuild an index, you get a new set of statistics on the indexed column(s) for free. It'd be silly to read every value, sort it all, and then not use that data to build a new statistics histogram. That's not true if you rebuild a partitioned index. Quoted from the above:

In SQL Server 2012, statistics are not created by scanning all the rows in the table when a partitioned index is created or rebuilt. Instead, the query optimizer uses the default sampling algorithm to generate statistics. To obtain statistics on partitioned indexes by scanning all the rows in the table, use CREATE STATISTICS or UPDATE STATISTICS with the FULLSCAN clause.
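
So if you rebuild a partitioned index and want the full-scan statistics you used to get for free, you have to ask for them yourself afterwards - something like this, with made-up table and index names:

-- The rebuild used default sampling; follow up with a full scan on that index's statistics
UPDATE STATISTICS dbo.BigPartitionedTable IX_BigPartitionedTable_OrderDate
WITH FULLSCAN;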


That's interesting, but the shocker to me was another tidbit.

When dealing with database integrity errors, I hold my breath while looking at the index ID for problems in an index. Why? Because index ID 0 or 1 is raw table data: index ID 0 is the heap; index ID 1 is the clustered index. (You only ever have one of the two - an index ID of 0 rules out an index ID of 1, and vice versa.) Index IDs of 2 and higher? They're "just" indexes that can be rebuilt without any risk of data loss.
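
You can see the numbering for yourself in sys.indexes (the table name here is a placeholder):

-- index_id 0 = HEAP, 1 = CLUSTERED, 2 and up = nonclustered (and other) indexes
SELECT index_id, name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.SomeTable')
ORDER BY index_id;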

Ah, but there's a tricky bit now:

In earlier versions of SQL Server, you could sometimes rebuild a nonclustered index to correct inconsistencies caused by hardware failures. In SQL Server 2008 and later, you may still be able to repair such inconsistencies between the index and the clustered index by rebuilding a nonclustered index offline. However, you cannot repair nonclustered index inconsistencies by rebuilding the index online, because the online rebuild mechanism will use the existing nonclustered index as the basis for the rebuild and thus persist the inconsistency. Rebuilding the index offline, by contrast, will force a scan of the clustered index (or heap) and so remove the inconsistency.


So: if you have a damaged index with an ID of 2 or greater, you have to rebuild it offline to repair it. But don't think about what you're losing - think about what you're gaining! The other option is putting your database in Single User Mode and running DBCC CHECKDB with REPAIR_REBUILD. So here, you get to take one index offline, instead of taking your whole database offline. It's a fair trade!
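
The offline rebuild itself is nothing exotic - with made-up names, it's just:

-- ONLINE = OFF is the default anyway, but spelling it out makes the intent clear:
-- force the rebuild to scan the clustered index (or heap) rather than the damaged nonclustered index
ALTER INDEX IX_SomeTable_SomeColumn ON dbo.SomeTable
REBUILD WITH (ONLINE = OFF);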

Most of the rest of index maintenance hasn't changed. If your indexes are under 10% fragmented, great - don't worry about it. Seriously - you can't always get rid of that last bit of fragmentation. For small tables, it might be literally impossible, and for larger ones, there are still probably better places to spend your energy. 10-30% fragmentation is a good time for a defragment/reorganize. If you see more than 30% fragmentation, a rebuild is your best option - but unless you're on Enterprise edition, you can't do an online rebuild, so you might want to hold off with a reorganize until you can get a maintenance window.
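
To see where your indexes land, sys.dm_db_index_physical_stats is the tool. A rough sketch, with placeholder table and index names:

-- Fragmentation for one table's indexes (LIMITED mode keeps the scan cheap)
SELECT i.name, ips.index_id, ips.avg_fragmentation_in_percent, ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.SomeTable'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id;

-- Under ~10%: leave it alone.  Roughly 10-30%: reorganize.  Over 30%: rebuild.
ALTER INDEX IX_SomeTable_SomeColumn ON dbo.SomeTable REORGANIZE;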

Oh, one other thing: most of the time you want to leave Fill Factor at the default of 0 (which behaves just like 100 - pack the pages full). A fill factor of 80 means leaving your pages only 80% full. This sounds like a reasonable idea - why pack them full, so that we get page splits as soon as we add a few rows? Why not leave a bit of space so they can fill up over time? It's not a terrible idea, but remember that pages that are only 80% full need 25% more disk space and buffer pool memory to hold the same data - and thus 25% more reads and writes to deal with it. If you're having lots of page splits that are slowing you down, sure, consider trying a lower fill factor on the busiest indexes. But most of the time, you're better off packing pages tight, to minimize disk and memory usage, and to reduce the number of reads and writes you need to perform.
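
And if you do decide a busy index genuinely needs the breathing room, fill factor is set per index at rebuild time (the names here are hypothetical):

-- Leaves each leaf-level page roughly 20% empty after the rebuild
ALTER INDEX IX_SomeTable_SomeColumn ON dbo.SomeTable
REBUILD WITH (FILLFACTOR = 80);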