index: Don’t commit state in BaseIndex::Rewind #33212

pull mzumsande wants to merge 2 commits into bitcoin:master from mzumsande:202508_index_nocommit changing 2 files +17 −7
  1. mzumsande commented at 8:52 pm on August 18, 2025: contributor

    The committed state of an index should never be ahead of the flushed chainstate. Otherwise, in the case of an unclean shutdown, the blocks necessary to revert from the prematurely committed state are not be available, which would corrupt the coinstatsindex in particular. Instead, the index state will be committed with the next ChainStateFlushed notification.

    Fixes #33208

  2. DrahtBot renamed this:
    index: Don't commit state in BaseIndex::Rewind
    index: Don't commit state in BaseIndex::Rewind
    on Aug 18, 2025
  3. DrahtBot added the label UTXO Db and Indexes on Aug 18, 2025
  4. DrahtBot commented at 8:52 pm on August 18, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33212.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK achow101, stickies-v
    Concept ACK fjahr

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  5. mzumsande force-pushed on Aug 18, 2025
  6. achow101 added this to the milestone 30.0 on Aug 18, 2025
  7. in test/functional/feature_coinstatsindex.py:335 in 4625b4b8f7 outdated
    330+        self.generatetoaddress(index_node, 1, getnewdestination()[2], sync_fun=self.no_op)
    331+        self.sync_index_node()
    332+        index_node.kill_process()
    333+        self.start_node(1, extra_args=self.extra_args[1])
    334+        self.sync_index_node()
    335+        assert_equal(index_node.getindexinfo()['coinstatsindex']['best_block_height'], start_height)
    


    hodlinator commented at 9:35 am on August 20, 2025:

    Right before the kill, an extra getblockcount()-call returns start_height + 2, same value for getindexinfo().best_block_height.

    After the kill, we jump back exactly 2 blocks to start_height. I guess this is due to the clean shutdown in restart_node() on line 319 in the preceding test code above having fully committed that state to disk.

    IIUC, could add a comment to that affect:

    0        # Indexes reset to the point we last committed them to disk - during the clean shutdown in the restart_node()- call above.
    1        assert_equal(index_node.getindexinfo()['coinstatsindex']['best_block_height'], start_height)
    

    And/or rename start_height to committed_height.

    I tested with/without the C++ change and got the expected result, failure was:

    0test_framework.test_node.FailedToStartError: [node 1] bitcoind exited with status 1 during initialization. Error: best block of coinstatsindex not found. Please rebuild the index.
    

    Which is the same error as in #33208.

    (Staying at Python review, I don’t have in-depth understanding of indexes/reorgs yet).


    mzumsande commented at 3:44 pm on August 21, 2025:

    After the kill, we jump back exactly 2 blocks to start_height. I guess this is due to the clean shutdown in restart_node() on line 319 in the preceding test code above having fully committed that state to disk.

    I think the previous gettxoutsetinfo (which flushes the chainstate) is the last time. But I added a restart, so that the added test doesn’t depend on the previous subtests. I also added a comment / did the rename.

  8. fanquake added the label Needs backport (29.x) on Aug 20, 2025
  9. stickies-v commented at 8:48 am on August 21, 2025: contributor

    Concept ACK

    With this change, we only call Commit() during Sync() and ChainStateFlushed(). The latter makes sense, but I’d like to look into the Sync() usage a bit more to understand if this fix is complete.

  10. fjahr commented at 12:37 pm on August 21, 2025: contributor

    Concept ACK

    Re my concerns here that this might not address the full issue I am now more confident that this may be all there is to it. This is anecdotal evidence and not based on statistics but vague memories but it seems that it happens more often on testnets than on mainnet and they couldn’t remember seeing it on signet ever. So this would make a lot of sense if the issues are always connected to reorgs.

  11. stickies-v approved
  12. stickies-v commented at 2:07 pm on August 21, 2025: contributor

    ACK 4625b4b8f772de1cd852003645c75bfa68c6735a

    Delaying the Commit() to a ChainStateFlushed() event seems like a safe way to (mostly?) fix the reported issue, with the only downside I can see potentially having to do a tiny bit of duplicated work during Init().

    It seems to me like it might not fully fix the issue. In BaseIndex::Sync(), we’ll Commit() ever SYNC_LOCATOR_WRITE_INTERVAL (30s). NextSyncBlock() will get us all the way up to the chaintip, which I think is not guaranteed to be flushed to disk. So it seems like even without reorgs, Init() can still trip up here with the issue reported in #33208?

    I’m not sure what the best way to address this is, and I think this PR is a solid improvement on its own, so my concern (if valid) is not blocking, it would be nice to get this in v30. One approach could be to set a std::optional<CBlockIndex> BaseIndex::last_flushed_block_index, which (if set) can be used by Commit() to only run if it’s not ahead of that index?

  13. DrahtBot requested review from fjahr on Aug 21, 2025
  14. fanquake requested review from furszy on Aug 21, 2025
  15. mzumsande commented at 2:54 pm on August 21, 2025: contributor

    It seems to me like it might not fully fix the issue. In BaseIndex::Sync(), we’ll Commit() ever SYNC_LOCATOR_WRITE_INTERVAL (30s). NextSyncBlock() will get us all the way up to the chaintip, which I think is not guaranteed to be flushed to disk. So it seems like even without reorgs, Init() can still trip up here with the issue reported in #33208?

    Good point, I think you are right! This happens when the chaintip advances during the time when Sync() is syncing. In particular, we Commit right before setting m_synced = true. I’d guess that this is probably not responsible for the majority of reported cases, because it would happen mostly during the initial sync of the index, but I agree it’s possible. #30611 should also help to make this more rare, because the unrelated crash would now need to happen within one hour instead of 24h.

    One approach could be to set a std::optional BaseIndex::last_flushed_block_index, which (if set) can be used by Commit() to only run if it’s not ahead of that index?

    Yes, this looks like a promising approach to me. I probably won’t have time to work on this until september, so if you have time and want to take a stab at it that’d be great!

  16. mzumsande force-pushed on Aug 21, 2025
  17. index: don't commit state in BaseIndex::Rewind
    The committed state of an index should never
    be ahead of the flushed chainstate. Otherwise, in the case
    of an unclean shutdown, the blocks necessary to revert
    from the prematurely committed state would not be
    available, which would corrupt the coinstatsindex in particular.
    Instead, the index state will be committed with the next
    ChainStateFlushed notification.
    01b95ac6f4
  18. test: index with an unclean restart after a reorg
    This test fails without the previous commit.
    a602f6fb7b
  19. mzumsande force-pushed on Aug 21, 2025
  20. mzumsande commented at 3:45 pm on August 21, 2025: contributor
    4625b4b to a602f6f: small changes to the test
  21. achow101 commented at 8:45 pm on August 21, 2025: member
    ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2
  22. DrahtBot requested review from stickies-v on Aug 21, 2025
  23. stickies-v commented at 9:05 am on August 22, 2025: contributor

    re-ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2

    CI failure (“The hosted runner lost communication with the server”) looks unrelated, would be good to restart?

  24. fanquake commented at 2:36 pm on August 22, 2025: member
    Seems like the 3rd or 4th re-run finally got lucky.
  25. fanquake merged this on Aug 22, 2025
  26. fanquake closed this on Aug 22, 2025

  27. in test/functional/feature_coinstatsindex.py:336 in a602f6fb7b
    331+        self.generatetoaddress(index_node, 1, getnewdestination()[2], sync_fun=self.no_op)
    332+        self.sync_index_node()
    333+        index_node.kill_process()
    334+        self.start_node(1, extra_args=self.extra_args[1])
    335+        self.sync_index_node()
    336+        # Because of the unclean shutdown above, indexes reset to the point we last committed them to disk.
    


    fjahr commented at 12:04 pm on August 23, 2025:
    nit: Would have been nice to give a why (e.g. something like “This ensures they are never ahead of the actual chain on restart since this would be identified as index corruption upon restart”) and maybe it could also have mentioned that this is a general issue for all indexes, not coinstats-specific.
  28. fjahr commented at 12:04 pm on August 23, 2025: contributor
    Post merge ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2
  29. fanquake removed the label Needs backport (29.x) on Aug 24, 2025
  30. fanquake commented at 9:48 pm on August 24, 2025: member
    Backported to 29.x in #33251.
  31. fjahr referenced this in commit 16b1710d97 on Aug 24, 2025
  32. fjahr referenced this in commit fcac8022d8 on Aug 24, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-09-02 12:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me