Wednesday, May 13, 2015

Thoughts about checksums and changes

So, I'm looking at this server, and I'm thinking "we don't have to worry about data corruption causing X_Issue, if the Page Verify option is set to Checksum in database properties." (See, recovery options.) Trust me, you want checksum set - in return for a tiny, tiny, tiny amount of CPU during a write to disk, you get verification that the write went to disk, and a subsequent read came back from the disk, without any damage occurring to your precious data.

"Precious..." you say doing your best Gollum imitation? Hey, Gollum's got nothing on a DBA guarding a database! For data integrity, a good DBA will even outdo a sysadmin. Believe it.

Checksum's the default - you need to do a bit of work to get it turned off - or, you need a database that was birthed in SQL Server 2000 or earlier. Checksum was introduced in SQL Server 2005, and it's been the default ever since.

As I was pondering this, I thought about changing a database from page verify = NONE to checksum, and then I thought, hey, what if a database is damaged? Oh my lord, what if we set checksum and suddenly we start flagging error 824! Will people think the checksum corrupted the database?

At that point, I realized I was tired. Really tired. Because that's one of the silliest thoughts that could come into my head - I know better! And if you know your checksums, you know better too - even if you don't realize it yet.

Checksums are computed at writes. If we had a thousand damaged pages already, and turned on checksum, how many 824 errors would be caused by that event alone? None. Even if SQL Server went out and computed a checksum on each and every page in the database (and it does not!!! - see the "myth" link below), it would be running a checksum on the already damaged data. Checksums aren't magical. A checksum is simply set up so that if there are small differences in the data, the correct checksum for that data will be different. So: if you compute a checksum, and send a data page to disk, and then, later, recompute the checksum when it's read back into memory, if the checksums match, you're 99+% certain that the data hasn't been damaged in transit or at rest. If they don't match, you know something terrible has happened.

But if data's already damaged, and you compute a checksum, all you are doing is making sure that if it gets damaged again, you'll notice.

If you turn on checksum verification, only the dirty pages in the buffer pool will automatically obtain checksums computed. Hey, here's an interesting thought: what if we cut power right after we set the checksum option? Could we still say all dirty pages in the buffer pool would get a checksum? Well, then it would be a race - but if you follow that Technet link above, you'll see that setting an option runs a checkpoint on the database - so all the dirty pages should immediately get flushed, and hence, checksummed, as long as the checkpoint completes. So, a new blog post, and the basis for a kinda-cool SQL Server trivia question. Not bad for a Wednesday night!

For a bonus trivia question, is it possible that turning on checksum page verification could reveal a bunch of damaged pages? Yes, but it requires a contrived situation to occur.

Stop and think a moment to see if you can get the answer - it's kind of cool once you figure it out.

Okay: if you turn off Checksum page verification, SQL stops checking checksums, and stops computing them thereafter. But remember, when you change the page verification option, SQL doesn't immediately add (or remove!) checksum information until the page is read into memory, changed, then written back to disk. For maximum horror value, let's back up this database, and restore it somewhere. (You do know that corruption in a database usually transmits perfectly via a backup, right?) Now, when someone restores this database, they might notice that checksum is turned off, and turn it back on. We probably still have pages with checksums computed on them - it'd be awfully strange if every single data page had been read and changed and written back to disk! So, if there were some IO errors leading to some incorrect checksum values (technically, the checksum is correct, it's the data that's incorrect) SQL would resume computing and comparing checksum values, and could suddenly uncover a bunch of errors it would have detected if we hadn't turned off checksum in the first place. And yes, good luck explaining that turning on checksum didn't cause the corruption.

So: the moral of this story is, never turn off checksum page verification, unless you need to construct an absurd hypothetical. In fact, don't even turn it off then - make it a thought experiment.

For more on checksum, and the myths about it, please see this page from Paul Randal's "myth a day" series.

No comments:

Post a Comment