Coin database checks #2145

pull sipa wants to merge 3 commits into bitcoin:master from sipa:checkcoins changing 3 files +146 −42
  1. sipa commented at 3:30 PM on January 3, 2013: member

    -checklevel gets a new meaning: 0: verify blocks can be read from disk (like before) 1: verify (contextless) block validity (like before) 2: verify undo data can be read and matches checksums 3: verify coin database is consistent with the last few blocks (close to level 6 before) 4: verify all validity rules of the last few blocks (including signature checks)

    Level 3 is the new default, as it's reasonably fast. As levels 3 and 4 are implemented using an in-memory rollback of the database, they are limited to as many blocks as possible without exceeding the limits set by -dbcache. The default of -dbcache=25 allows for some 150-200 blocks to be rolled back.

    In case an error is found, the application quits with a message instructing the user to restart with -reindex. Better instructions, and automatic recovery (when possible) or automatic reindexing are left as future work.

    In addition, this also changes the on-disk format of undo data (adding a checksum), as the correctness of the coindb checks depends on having intact undo data.

  2. Make DisconnectBlock fault-tolerant 2cbd71da06
  3. Add checksums to undo data
    This should be compatible with older code that didn't write checksums.
    8539361e66
  4. BitcoinPullTester commented at 3:52 PM on January 3, 2013: none

    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/24e864547189c6b9347740f50d5fed656fc48816 for binaries and test log.

  5. BitcoinPullTester commented at 4:21 PM on January 3, 2013: none

    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/aa38398d4bd32c6b90175d08b65ecb324701a368 for binaries and test log.

  6. jgarzik commented at 4:54 PM on January 3, 2013: contributor

    ACK

  7. gmaxwell commented at 7:35 PM on January 3, 2013: contributor

    Some problems with making reindex automatic is that reindex is quite burdensome and intrusive so you may want to control when its run, and the other problem is that making it automatic may hide hardware or software errors that can also cause silent failures. A failure of this test is something which should never happen, even if the user is randomly powering off their machine.

  8. gmaxwell commented at 8:28 PM on January 3, 2013: contributor

    So, I see that setting dbcache really large doesn't check more than 2500 blocks by default. Thats good— wouldn't want a long startup time. But setting dbcache to 0 results in it checking very few. I'm not sure that this is a good idea.

    I'm also seeing some odd memory usage behavior. Starting with -dbcache=4000 -checkblocks=0 results in an RES of 2.3 GiB once the check is finished.

    I also found 2013-01-03 20:18:52 Verifying last 215017 blocks at level 2 2013-01-03 20:23:16 No coin database inconsistencies in last 31137 blocks (6770582 transactions)

    a little odd, since that it's actually doing is checking 215017 at level 0/1 and checking 31137 at 2.

    Should the undo data also get checked at level 1? Obviously if it's broken it will break the level 2 tests, but since we do the level 2 tests on fewer block, some specific undo tests might make sense? (e.g. we could count the txins in the block and the undo data)

  9. sipa commented at 2:17 AM on January 4, 2013: member

    @gmaxwell there is a trivial undo test possible: verify their checksums. However, how meaningful is it to check undo data without checking the coin state - there is no scenario in which you need undo data but not coin state.

    In way, the system is backwards. The coin state is by far the most important thing to verify, but unfortunately it is also almost the most expensive, and it already requires consistent blocks and undo data.

    Regarding only checking very few with low dbcache... what would you suggest?

    Also, the calculation for memory sizes based on transaction counts is only very approximately and spread over several types of caches (leveldb blktree, leveldb coindb, coins view cache). Doing the rollback pulls the transaction data in the global cache, and then changes to it in a memory-only cache on top of it, that is discarded. The transactions themself remain in the global cache though, afterwards.

  10. sipa commented at 2:22 AM on January 4, 2013: member

    @Diapolo such a flag would be possible, but why add it? The next startup the database will most likely still be inconsistent, so it would be detected again (and if it isn't, maybe it was something temporary...).

    For GUI users, I would like to see something like "Your database is corrupted. Do you want to rebuild it now, or exit?". For bitcoind I think the right approach is indeed quitting immediately with an error message instructing the user to restart with -reindex, so they can run the rebuild on their own terms (or perhaps copy a database from another instance or backup they have running).

  11. sipa commented at 2:23 AM on January 4, 2013: member

    @gmaxwell Suggestions for better reporting are welcome :)

  12. gmaxwell commented at 7:14 AM on January 4, 2013: contributor

    The reason I suggested a basic check of undo data is simply because unless dbcache is very large we test far fewer blocks at level 2 and there is currently no way to do even basic checking of all the undo data. This means that if dbcache is set small, we may not discover corrupted undo data which could be trivially discovered, until a reorg.

    For small dbcache values— we could give it a minimum value... or could make it always some reasonable number even if we go over our coincachesize limit?

    I really wish there were a way to randomly test the coins database as I expect the leveldb disk layout probably ends up with old and new data being exposed to different errors. Oh well.

    In any case, tested with fuzzed undo (with and without undo checksums) and coins data. It's quite sensitive to errors.

    After running it some with a corrupted leveldb and corrupted undos, after restoring the non-corrupted files, I'm getting

    2013-01-04 07:09:47 ERROR: DisconnectBlock() : block and undo data inconsistent 2013-01-04 07:09:47 ERROR: VerifyDB() : *** irrecoverable inconsistency in block data

    So either I screwed up, or the corruption was somehow propagated to other files. I note that many of the error messages are repeated, so I can't easily tell which test failed. If giving them distinct messages is too much work, perhaps just slipping in FILE and LINE would be helpful. (though this isn't an issue in the codebase unique to the checking pull)

  13. New database check routine
    -checklevel gets a new meaning:
    0: verify blocks can be read from disk (like before)
    1: verify (contextless) block validity (like before)
    2: verify undo files can be read and have good checksums
    3: verify coin database is consistent with the last few blocks
       (close to level 6 before)
    4: verify all validity rules of the last few blocks
    
    Level 3 is the new default, as it's reasonably fast. As level 3 and
    4 are implemented using an in-memory rollback of the database, they
    are limited to as many blocks as possible without exceeding the
    limits set by -dbcache. The default of -dbcache=25 allows for some
    150-200 blocks to be rolled back.
    
    In case an error is found, the application quits with a message
    instructing the user to restart with -reindex. Better instructions,
    and automatic recovery (when possible) or automatic reindexing are
    left as future work.
    1f355b66cd
  14. sipa commented at 2:03 PM on January 4, 2013: member

    @gmaxwell I've added a new level in between 1 and 2 (which verifies undo data). I've also changed the heuristic for determining how far to roll back a bit - it now aims for using +- 5-10 MB extra for validation, even with very small dbcache.

  15. BitcoinPullTester commented at 2:28 PM on January 4, 2013: none

    Automatic sanity-testing: PASSED, see http://jenkins.bluematt.me/pull-tester/1f355b66cd90f1bd96a862604a7216e8ba56f853 for binaries and test log.

  16. gmaxwell commented at 6:50 PM on January 4, 2013: contributor

    ACK ACK ACK

  17. sipa commented at 12:04 AM on January 6, 2013: member

    Well the reported messages are still confusing. It's not actually checking 2500 blocks at level 3...

  18. gmaxwell commented at 2:26 PM on January 11, 2013: contributor

    Because this increases our sensitivity to corruption so much, I'd rather pull now and get the messages just right in another pull.

  19. gmaxwell referenced this in commit 1f4b80a437 on Jan 11, 2013
  20. gmaxwell merged this on Jan 11, 2013
  21. gmaxwell closed this on Jan 11, 2013

  22. sipa deleted the branch on May 3, 2013
  23. laudney referenced this in commit ee9813b8e1 on Mar 19, 2014
  24. owlhooter referenced this in commit 9af9d57b41 on Oct 11, 2018
  25. DrahtBot locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-19 09:16 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me