Assertion when restarting after a crash with pruning #9001

tdaede opened this issue on October 24, 2016
  1. tdaede commented at 0:17 am on October 24, 2016: none
    1. Run bitcoind on testnet with prune=1000
    2. Kill or crash bitcoind while it is syncing (such as running out of memory on a VPS)
    3. Start bitcoind again

    Expected behaviour

    Bitcoind starts again, potentially resyncing from scratch.

    Actual behaviour

    bitcoind: chain.cpp:96: CBlockIndex* CBlockIndex::GetAncestor(int): Assertion `pindexWalk->pprev' failed.
    

    What version of bitcoin-core are you using?

    0.13.1rc2 binaries

  2. fanquake commented at 1:05 am on October 24, 2016: member
    What operating system are you using? How much memory/disk space is available?
  3. unsystemizer commented at 5:30 am on October 24, 2016: contributor

    Such as running out of memory on a VPS

    If that’s the case your filesystem may be corrupt, and of course data on it as well. You should check it with fsck and make sure it’s sound before bitcoind is suspected.

  4. jonasschnelli added the label Data corruption on Oct 24, 2016
  5. jonasschnelli commented at 6:11 am on October 24, 2016: contributor

    A sudden shutdown (crash/kill) of bitcoind may lead to database corruption (which results in a re-sync from scratch). I have often observed this kind of corruption on VPSes. Features like #8037 could be a relief in such cases…

    IMO applications with heavy database interaction like bitcoind (UTXO set interaction) tend to lose integrity in forced-shutdown (crash/kill) situations.

  6. tdaede commented at 6:41 am on October 24, 2016: none

    @fanquake, this is Fedora 24 with 1GB of RAM. That said, this is 100% reproducible for me with the given settings. It shouldn’t depend on available RAM. @unsystemizer, the error is repeatable, and I don’t think there is any situation where an OOM would lead to FS corruption.

    Note here that the issue isn’t the database corruption, or the requirement that the re-sync happen - that’s fine. The issue is that bitcoind hits an assert and exits immediately, rather than automatically starting a re-sync from scratch.

  7. jonasschnelli commented at 6:55 am on October 24, 2016: contributor
    @tdaede: the assertion you hit is very likely caused by a database corruption (block index).
  8. unsystemizer commented at 7:02 am on October 24, 2016: contributor

    @tdaede - when you say it’s repeatable, if you just restart bitcoind without fixing anything, I believe it’s repeatable. It could also be repeatable if you shut down the system while bitcoind is running and corrupt the data in the same or a similar way. It doesn’t mean it’s a problem with Bitcoin Core. As I said, you need to prove with fsck that the filesystem is sound. Even then it may be a problem with LevelDB or something that would have to be dealt with upstream.

    OOM can’t lead to FS corruption: I disagree. https://unix.stackexchange.com/questions/12699/do-journaling-filesystems-guarantee-against-corruption-after-a-power-failure

  9. luke-jr commented at 3:17 am on October 26, 2016: member
    @unsystemizer OOM isn’t a power failure. It cannot cause filesystem corruption unless there are very serious kernel bugs.
  10. unsystemizer commented at 3:34 am on October 26, 2016: contributor
    Or in the hypervisor or h/w drivers sitting below the VM or somewhere else. It should still be shown that the filesystem is not corrupt, I think.
  11. jnewbery commented at 11:03 pm on January 24, 2017: contributor

    I think this is almost certainly not filesystem corruption. I can reproduce this failure mode by manually removing a single block from my block index when flushing to disk, and then starting bitcoind again. I hit the assert with this backtrace:

    #0  CBlockIndex::GetAncestor (this=0x555556396400, height=8) at chain.cpp:105
    #1  0x000055555592b7e6 in CBlockIndex::BuildSkip (this=0x555556395ec0) at chain.cpp:123
    #2  0x000055555587de97 in LoadBlockIndexDB (chainparams=...) at validation.cpp:3526
    #3  0x00005555558800f9 in LoadBlockIndex (chainparams=...) at validation.cpp:3815
    #4  0x00005555555b1fb2 in AppInitMain (threadGroup=..., scheduler=...) at init.cpp:1428
    #5  0x0000555555583f8f in AppInit (argc=2, argv=0x7fffffffe568) at bitcoind.cpp:167
    #6  0x0000555555584668 in main (argc=2, argv=0x7fffffffe568) at bitcoind.cpp:196
    

    If I do anything that corrupts the block index in a more intrusive way (e.g. removing the pprev pointer or changing any of the other header fields), then we fail at a different point: the blockhash is no longer valid, so we fail CheckProofOfWork(). Critically, CBlockIndex::BuildSkip in LoadBlockIndexDB() is called after LoadBlockIndexGuts(), which loads all of the block indexes from disk. So it looks to me like this failure mode can probably only be hit if all the indexes on disk are valid, but one or more blocks are missing.

    @tdaede are you able to upload debug.log? You suspect that this may be something to do with pruning. I’ve had quite a close read of that code and I can’t see where it would cause us to lose block indexes or not flush them to disk. If you have a debug.log, it might help pin down where we’re losing the block index.
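
To illustrate the failure mode described here, the following is a minimal sketch of an ancestor walk over a simplified block index. It is not Bitcoin Core's code: `IndexEntry` and `WalkToHeight` are hypothetical names, and the real `CBlockIndex::GetAncestor` also uses skip pointers, which this sketch omits. It only shows how a single missing parent breaks the walk, and how the walk could report that instead of asserting.

```cpp
// Hypothetical, simplified stand-in for a block index entry
// (an assumption for illustration, not Bitcoin Core's CBlockIndex).
struct IndexEntry {
    int height;
    const IndexEntry* prev; // parent entry, or nullptr if it was never loaded
};

// Walk back from `start` to `target_height`, one parent at a time.
// Returns nullptr when the target height is out of range or when the chain
// is broken (an entry above the target has no parent), rather than asserting.
const IndexEntry* WalkToHeight(const IndexEntry* start, int target_height) {
    if (start == nullptr || target_height < 0 || target_height > start->height) {
        return nullptr;
    }
    const IndexEntry* walk = start;
    while (walk->height > target_height) {
        if (walk->prev == nullptr) {
            return nullptr; // the condition the pprev assertion guards against
        }
        walk = walk->prev;
    }
    return walk;
}
```

In this model, an entry at height 9 whose parent at height 8 has no parent of its own can still reach height 8, but any walk below that returns `nullptr`. That mirrors the case where every entry on disk is individually valid and only one is missing.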

    I’m also planning to open a PR which gives us slightly better diagnostics here by printing out the blockhash and height of the orphan block.
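
As a rough idea of what such diagnostics could look like, the sketch below scans a simplified index and produces one message per entry whose parent is absent, including the entry's hash and height. The record layout, the name `DescribeOrphans`, and the message format are all assumptions made for illustration; they are not taken from the planned PR or from Bitcoin Core.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical record: an entry's height and its parent's hash.
struct IndexRecord {
    int height;
    std::string prev_hash; // empty for the genesis entry
};

// Return one diagnostic line for every entry whose parent hash is not
// itself a key in `records`. Keys are the entries' own block hashes.
std::vector<std::string> DescribeOrphans(const std::map<std::string, IndexRecord>& records) {
    std::vector<std::string> messages;
    for (const auto& item : records) {
        const std::string& hash = item.first;
        const IndexRecord& rec = item.second;
        if (rec.prev_hash.empty()) {
            continue; // genesis has no parent to look up
        }
        if (records.find(rec.prev_hash) == records.end()) {
            messages.push_back("orphan block index entry: hash=" + hash +
                               " height=" + std::to_string(rec.height) +
                               " missing parent=" + rec.prev_hash);
        }
    }
    return messages;
}
```

Running a check of this kind before building skip pointers would name the broken entry up front, rather than leaving only an assertion failure deep in the ancestor walk.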

  12. tdaede commented at 3:37 am on January 25, 2017: none

    @jnewbery I had an offline conversation with @gmaxwell and apparently flushing is disabled during initial sync for speed, which means that the indexes can end up ahead of the block database. A simple fix would be to disable this optimization if it no longer gives significant speed gains.

    There is also no way to fetch the missing block when pruning, so you have to start over.

  13. jnewbery commented at 2:33 pm on January 25, 2017: contributor

    Thanks @tdaede. That makes sense, but I still don’t understand how the block index database can get into this bad state. This assert is only hit if there’s a block in the database with a parent which isn’t in the database. I don’t yet understand how not flushing during startup could get us into that state.

    I’ll try to get hold of @gmaxwell later today to try to understand this a bit better.

  14. maflcko commented at 9:00 am on August 31, 2023: member
    Is this still an issue with a recent version of Bitcoin Core?
  15. maflcko closed this on Aug 31, 2023

  16. bitcoin locked this on Aug 30, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-03 15:12 UTC
