index: Don't commit state in BaseIndex::Rewind

mzumsande commented at 8:52 pm on August 18, 2025: contributor

The committed state of an index should never be ahead of the flushed chainstate. Otherwise, in the case of an unclean shutdown, the blocks necessary to revert from the prematurely committed state are not be available, which would corrupt the coinstatsindex in particular. Instead, the index state will be committed with the next ChainStateFlushed notification.

Fixes #33208

DrahtBot renamed this:
~~index: Don't commit state in BaseIndex::Rewind~~
index: Don't commit state in BaseIndex::Rewind
on Aug 18, 2025

DrahtBot added the label UTXO Db and Indexes on Aug 18, 2025

DrahtBot commented at 8:52 pm on August 18, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33212.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	achow101, stickies-v
Concept ACK	fjahr

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

mzumsande force-pushed on Aug 18, 2025

achow101 added this to the milestone 30.0 on Aug 18, 2025

in test/functional/feature_coinstatsindex.py:335 in 4625b4b8f7 outdated

330+        self.generatetoaddress(index_node, 1, getnewdestination()[2], sync_fun=self.no_op)
331+        self.sync_index_node()
332+        index_node.kill_process()
333+        self.start_node(1, extra_args=self.extra_args[1])
334+        self.sync_index_node()
335+        assert_equal(index_node.getindexinfo()['coinstatsindex']['best_block_height'], start_height)

hodlinator commented at 9:35 am on August 20, 2025:

Right before the kill, an extra getblockcount()-call returns start_height + 2, same value for getindexinfo().best_block_height.

After the kill, we jump back exactly 2 blocks to start_height. I guess this is due to the clean shutdown in restart_node() on line 319 in the preceding test code above having fully committed that state to disk.

IIUC, could add a comment to that affect:

0        # Indexes reset to the point we last committed them to disk - during the clean shutdown in the restart_node()- call above.
1        assert_equal(index_node.getindexinfo()['coinstatsindex']['best_block_height'], start_height)

And/or rename start_height to committed_height.

I tested with/without the C++ change and got the expected result, failure was:

0test_framework.test_node.FailedToStartError: [node 1] bitcoind exited with status 1 during initialization. Error: best block of coinstatsindex not found. Please rebuild the index.

Which is the same error as in #33208.

(Staying at Python review, I don’t have in-depth understanding of indexes/reorgs yet).

mzumsande commented at 3:44 pm on August 21, 2025:

After the kill, we jump back exactly 2 blocks to start_height. I guess this is due to the clean shutdown in restart_node() on line 319 in the preceding test code above having fully committed that state to disk.

I think the previous gettxoutsetinfo (which flushes the chainstate) is the last time. But I added a restart, so that the added test doesn’t depend on the previous subtests. I also added a comment / did the rename.

fanquake added the label Needs backport (29.x) on Aug 20, 2025

stickies-v commented at 8:48 am on August 21, 2025: contributor

Concept ACK

With this change, we only call Commit() during Sync() and ChainStateFlushed(). The latter makes sense, but I’d like to look into the Sync() usage a bit more to understand if this fix is complete.

fjahr commented at 12:37 pm on August 21, 2025: contributor

Concept ACK

Re my concerns here that this might not address the full issue I am now more confident that this may be all there is to it. This is anecdotal evidence and not based on statistics but vague memories but it seems that it happens more often on testnets than on mainnet and they couldn’t remember seeing it on signet ever. So this would make a lot of sense if the issues are always connected to reorgs.

stickies-v approved

stickies-v commented at 2:07 pm on August 21, 2025: contributor

ACK 4625b4b8f772de1cd852003645c75bfa68c6735a

Delaying the Commit() to a ChainStateFlushed() event seems like a safe way to (mostly?) fix the reported issue, with the only downside I can see potentially having to do a tiny bit of duplicated work during Init().

It seems to me like it might not fully fix the issue. In BaseIndex::Sync(), we’ll Commit() ever SYNC_LOCATOR_WRITE_INTERVAL (30s). NextSyncBlock() will get us all the way up to the chaintip, which I think is not guaranteed to be flushed to disk. So it seems like even without reorgs, Init() can still trip up here with the issue reported in #33208?

I’m not sure what the best way to address this is, and I think this PR is a solid improvement on its own, so my concern (if valid) is not blocking, it would be nice to get this in v30. One approach could be to set a std::optional<CBlockIndex> BaseIndex::last_flushed_block_index, which (if set) can be used by Commit() to only run if it’s not ahead of that index?

DrahtBot requested review from fjahr on Aug 21, 2025

fanquake requested review from furszy on Aug 21, 2025

mzumsande commented at 2:54 pm on August 21, 2025: contributor

It seems to me like it might not fully fix the issue. In BaseIndex::Sync(), we’ll Commit() ever SYNC_LOCATOR_WRITE_INTERVAL (30s). NextSyncBlock() will get us all the way up to the chaintip, which I think is not guaranteed to be flushed to disk. So it seems like even without reorgs, Init() can still trip up here with the issue reported in #33208?

Good point, I think you are right! This happens when the chaintip advances during the time when Sync() is syncing. In particular, we Commit right before setting m_synced = true. I’d guess that this is probably not responsible for the majority of reported cases, because it would happen mostly during the initial sync of the index, but I agree it’s possible. #30611 should also help to make this more rare, because the unrelated crash would now need to happen within one hour instead of 24h.

One approach could be to set a std::optional BaseIndex::last_flushed_block_index, which (if set) can be used by Commit() to only run if it’s not ahead of that index?

Yes, this looks like a promising approach to me. I probably won’t have time to work on this until september, so if you have time and want to take a stab at it that’d be great!

mzumsande force-pushed on Aug 21, 2025

index: don't commit state in BaseIndex::Rewind

The committed state of an index should never
be ahead of the flushed chainstate. Otherwise, in the case
of an unclean shutdown, the blocks necessary to revert
from the prematurely committed state would not be
available, which would corrupt the coinstatsindex in particular.
Instead, the index state will be committed with the next
ChainStateFlushed notification.

01b95ac6f4

test: index with an unclean restart after a reorg

This test fails without the previous commit.

a602f6fb7b

mzumsande force-pushed on Aug 21, 2025

mzumsande commented at 3:45 pm on August 21, 2025: contributor

4625b4b to a602f6f: small changes to the test

achow101 commented at 8:45 pm on August 21, 2025: member

ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2

DrahtBot requested review from stickies-v on Aug 21, 2025

stickies-v commented at 9:05 am on August 22, 2025: contributor

re-ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2

CI failure (“The hosted runner lost communication with the server”) looks unrelated, would be good to restart?

fanquake commented at 2:36 pm on August 22, 2025: member

Seems like the 3rd or 4th re-run finally got lucky.

fanquake merged this on Aug 22, 2025

fanquake closed this on Aug 22, 2025

in test/functional/feature_coinstatsindex.py:336 in a602f6fb7b

331+        self.generatetoaddress(index_node, 1, getnewdestination()[2], sync_fun=self.no_op)
332+        self.sync_index_node()
333+        index_node.kill_process()
334+        self.start_node(1, extra_args=self.extra_args[1])
335+        self.sync_index_node()
336+        # Because of the unclean shutdown above, indexes reset to the point we last committed them to disk.

fjahr commented at 12:04 pm on August 23, 2025:

nit: Would have been nice to give a why (e.g. something like “This ensures they are never ahead of the actual chain on restart since this would be identified as index corruption upon restart”) and maybe it could also have mentioned that this is a general issue for all indexes, not coinstats-specific.

fjahr commented at 12:04 pm on August 23, 2025: contributor

Post merge ACK a602f6fb7bf5f9e57299f4d6e246c82379fad8d2

fanquake removed the label Needs backport (29.x) on Aug 24, 2025

fanquake commented at 9:48 pm on August 24, 2025: member

Backported to 29.x in #33251.

fjahr referenced this in commit 16b1710d97 on Aug 24, 2025

fjahr referenced this in commit fcac8022d8 on Aug 24, 2025

fanquake referenced this in commit 162c009c1d on Aug 27, 2025

glozow referenced this in commit fd784f2774 on Sep 3, 2025

ajtowns referenced this in commit 7c9ecdcda5 on Sep 4, 2025

ajtowns referenced this in commit cea2a6cc3e on Sep 4, 2025

mzumsande deleted the branch on Sep 9, 2025

index: Don’t commit state in BaseIndex::Rewind #33212

Code Coverage & Benchmarks

Reviews