index: Improve BaseIndex::BlockUntilSyncedToCurrentChain reliability

ryanofsky commented at 1:21 pm on September 30, 2022: contributor

Since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb from PR #21726, index BlockUntilSyncedToCurrentChain behavior has been less reliable, and there has also been a race condition in the coinstatsindex_initial_sync unit test.

It seems better for BlockUntilSyncedToCurrentChain to actually wait for the last connected block to be fully processed, than to be able to return before prune locks are set, so this switches the order of m_best_block_index = block; and UpdatePruneLock statements in SetBestBlockIndex to make it more reliable.

Also since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb, there has been a race condition in the coinstatsindex_initial_sync test. Before that commit, the atomic index best block pointer m_best_block_index was updated as the last step of BaseIndex::BlockConnected, so BlockUntilSyncedToCurrentChain could safely be used in tests to wait for the last BlockConnected notification to finish before stopping and destroying the index. But after that commit, calling BlockUntilSyncedToCurrentChain is no longer sufficient, and there is a race between the test shutdown code, which destroys the index object, and the new code introduced in that commit, which calls AllowPrune() and GetName() on the index object. Reproducibility instructions for this are in #25365 (comment).
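The ordering argument above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `ToyIndex`, `PruneLocks`, `Block`, `BlockUntilSynced` and `RunOnce` are hypothetical stand-ins invented for illustration, not Bitcoin Core types or functions, and the busy-wait replaces the real synchronization logic. The point it tries to show is that when the atomic pointer is stored as the last step, a waiter that observes the new value can rely on the prune lock already being set and can destroy the index afterwards.

```cpp
#include <atomic>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for illustration; none of these are Bitcoin Core types.
struct Block { int height; };

struct PruneLocks {
    std::mutex mutex;
    int height_first{-1};
};

class ToyIndex {
public:
    explicit ToyIndex(PruneLocks& locks) : m_locks(locks) {}

    // Models the reordered SetBestBlockIndex: use members first, then
    // publish the atomic pointer as the very last step.
    void SetBestBlockIndex(const Block* block)
    {
        {
            std::lock_guard<std::mutex> lock(m_locks.mutex);
            m_locks.height_first = block->height; // still uses *this
        }
        m_best_block_index.store(block); // no access to *this after this line
    }

    // Models the waiter: returns once `tip` has been published.
    void BlockUntilSynced(const Block* tip) const
    {
        while (m_best_block_index.load() != tip) std::this_thread::yield();
    }

private:
    PruneLocks& m_locks;
    std::atomic<const Block*> m_best_block_index{nullptr};
};

// Runs one notification on a worker thread, waits for it, destroys the
// index, and returns the prune lock height that the waiter observed.
int RunOnce()
{
    PruneLocks locks;
    const Block tip{100};
    auto index = std::make_unique<ToyIndex>(locks);
    ToyIndex* raw = index.get();

    std::thread worker([raw, &tip] { raw->SetBestBlockIndex(&tip); });

    index->BlockUntilSynced(&tip);
    index.reset(); // the worker no longer touches the index at this point

    int seen;
    {
        std::lock_guard<std::mutex> lock(locks.mutex);
        seen = locks.height_first;
    }
    worker.join();
    return seen;
}
```

In this model `RunOnce()` returns 100. If the store were moved above the prune lock update, the waiter could return early and destroy the index while the worker was still using it, which is the shape of the race described above.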

This commit fixes the coinstatsindex_initial_sync race condition, although a separate change, #26188, will still be needed afterwards to silence TSAN false positives. So this partially addresses, but does not resolve, issue #25365, which reports the TSAN errors.

There is no known race condition outside of test code currently, because the bitcoind Shutdown function calls FlushBackgroundCallbacks not BlockUntilSyncedToCurrentChain to safely shut down.
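As a rough illustration of why draining the callback queue is the safer shutdown strategy, here is a toy model. It assumes a simple single-threaded queue; `ToyCallbackQueue`, `ToySubscriber` and `DemoShutdown` are hypothetical names made up for this sketch and do not reflect how Bitcoin Core's scheduler or validation interface are implemented.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical queue of pending notifications, for illustration only.
class ToyCallbackQueue {
public:
    void Add(std::function<void()> fn)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_pending.push_back(std::move(fn));
    }

    // Runs every queued callback on the calling thread until none remain.
    void Flush()
    {
        while (true) {
            std::function<void()> fn;
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                if (m_pending.empty()) return;
                fn = std::move(m_pending.front());
                m_pending.pop_front();
            }
            fn(); // executed outside the lock
        }
    }

private:
    std::mutex m_mutex;
    std::deque<std::function<void()>> m_pending;
};

// Hypothetical subscriber standing in for an index object.
class ToySubscriber {
public:
    explicit ToySubscriber(int& processed) : m_processed(processed) {}
    void BlockConnected() { ++m_processed; }

private:
    int& m_processed;
};

// Queues three notifications, flushes them, then destroys the subscriber.
// Returns how many notifications ran before destruction.
int DemoShutdown()
{
    ToyCallbackQueue queue;
    int processed = 0;
    auto index = std::make_unique<ToySubscriber>(processed);
    ToySubscriber* raw = index.get();

    for (int i = 0; i < 3; ++i) {
        queue.Add([raw] { raw->BlockConnected(); });
    }

    queue.Flush();  // every callback has fully returned after this
    index.reset();  // so no pending callback can reference the subscriber
    return processed;
}
```

In this model `DemoShutdown()` returns 3. The design point is that flushing waits for each callback to return completely, whereas waiting on a value published from inside a callback only guarantees that the callback reached the publishing step.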

Co-authored-by: vasild
Co-authored-by: MarcoFalke

ryanofsky marked this as ready for review on Sep 30, 2022

maflcko added this to the milestone 24.0 on Sep 30, 2022

DrahtBot added the label UTXO Db and Indexes on Sep 30, 2022

maflcko commented at 1:41 pm on September 30, 2022: member

Concept ACK dd2ef55a86b85a9f1dc8cd1a7a4a0fc7ed2d7da4

The description makes sense, but I have not reviewed in detail that bitcoind behaviour remains unchanged.

Tagged for backport, but happy to drop again if this seems too risky.

ryanofsky commented at 1:43 pm on September 30, 2022: contributor

Might be possible to write a unit test that ensures BlockUntilSyncedToCurrentChain doesn’t return until after prune locks are updated. But it seems a little tricky so this PR does not have a test for now.

ryanofsky commented at 1:53 pm on September 30, 2022: contributor

> Tagged for backport, but happy to drop again if this seems too risky.

Thanks, I think it’s probably good to backport. Since it’s just doing an atomic assignment one step later, it seems safe, and it’s hard to think of ways it could cause a problem.

DrahtBot commented at 10:34 pm on September 30, 2022: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#24230 (indexes: Stop using node internal types and locking cs_main, improve sync logic by ryanofsky)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

maflcko requested review from fjahr on Oct 4, 2022

fanquake commented at 2:35 pm on October 4, 2022: member

@mzumsande care to take a look here?

in src/index/base.cpp:427 in dd2ef55a86 outdated

         node::PruneLockInfo prune_lock;
         prune_lock.height_first = block->nHeight;
         WITH_LOCK(::cs_main, m_chainstate->m_blockman.UpdatePruneLock(GetName(), prune_lock));
     }
+
+    // Intentionally set m_best_block_index as the last step in this function,

mzumsande commented at 8:33 pm on October 4, 2022:

Maybe it could be helpful to additionally add a similar comment at the end of BaseIndex::BlockConnected, so that no one attempts to add references to *this after the call to SetBestBlockIndex there either, for the same reason.

ryanofsky commented at 2:57 pm on October 5, 2022:

re: #26215 (review)

Agree that would be helpful. Added comment

mzumsande commented at 8:50 pm on October 4, 2022: contributor

ACK dd2ef55a86b85a9f1dc8cd1a7a4a0fc7ed2d7da4

Took me a while to catch up with the entire discussion around #25365, but this makes sense to me. Aside from the issue with BlockUntilSyncedToCurrentChain that is fixed here, the order between updating prune locks and the best index shouldn’t matter, since the validation code which performs the pruning does not interact with m_best_block_index, so I think this is safe.

index: Improve BaseIndex::BlockUntilSyncedToCurrentChain reliability

Since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb from PR
https://github.com/bitcoin/bitcoin/pull/21726, index
`BlockUntilSyncedToCurrentChain` behavior has been less reliable, and there has
also been a race condition in the `coinstatsindex_initial_sync` unit test.

It seems better for `BlockUntilSyncedToCurrentChain` to actually wait for the
last connected block to be fully processed, than to be able to return before
prune locks are set, so this switches the order of `m_best_block_index =
block;` and `UpdatePruneLock` statements in `SetBestBlockIndex` to make it more
reliable.

Also since commit f08c9fb0c6a799e3cb75ca5f763a746471625beb, there has been a
race condition in the `coinstatsindex_initial_sync` test. Before that commit,
the atomic index best block pointer `m_best_block_index` was updated as the
last step of `BaseIndex::BlockConnected`, so `BlockUntilSyncedToCurrentChain`
could safely be used in tests to wait for the last `BlockConnected`
notification to be finished before stopping and destroying the index. But
after that commit, calling `BlockUntilSyncedToCurrentChain` is no longer
sufficient, and there is a race between the test shutdown code which destroys
the index object and the new code introduced in that commit calling
`AllowPrune()` and `GetName()` on the index object. Reproducibility
instructions for this are in
https://github.com/bitcoin/bitcoin/issues/25365#issuecomment-1259744133

This commit fixes the `coinstatsindex_initial_sync` race condition, although a
separate change, https://github.com/bitcoin/bitcoin/pull/26188, will still be
needed afterwards to silence TSAN false positives. So this partially addresses,
but does not resolve, the issue reporting TSAN errors,
https://github.com/bitcoin/bitcoin/issues/25365.

There is no known race condition outside of test code currently, because the
bitcoind `Shutdown` function calls `FlushBackgroundCallbacks` not
`BlockUntilSyncedToCurrentChain` to safely shut down.

Co-authored-by: Vasil Dimov <vd@FreeBSD.org>
Co-authored-by: MacroFake <falke.marco@gmail.com>

8891949bdc

ryanofsky force-pushed on Oct 5, 2022

ryanofsky commented at 3:09 pm on October 5, 2022: contributor

Thanks for the review!

Updated dd2ef55a86b85a9f1dc8cd1a7a4a0fc7ed2d7da4 -> 8891949bdcb25093d3a6703ae8228c3c3687d3a4 (pr/untilsync.1 -> pr/untilsync.2, compare) adding suggested comment.

re: #26215#pullrequestreview-1130577545

> Took me a while to catch up with the entire discussion around #25365

I hope the PR description was clear enough explaining the change and what motivated it, so it wasn’t actually necessary to read the old discussion thread, but you could let me know if I missed anything important in the explanation.

mzumsande commented at 6:23 pm on October 5, 2022: contributor

re-ACK 8891949bdcb25093d3a6703ae8228c3c3687d3a4

> I hope the PR description was clear enough explaining the change and what motivated it, so it wasn’t actually necessary to read the old discussion thread, but you could let me know if I missed anything important in the explanation.

The PR is clear, I don’t think anything is missing, I just decided I wanted to follow the discussion chronologically.

fanquake merged this on Oct 10, 2022

fanquake closed this on Oct 10, 2022

fanquake added the label Needs backport (24.x) on Oct 10, 2022

sidhujag referenced this in commit f4572f9ae4 on Oct 10, 2022

fanquake referenced this in commit 5ad82a09b4 on Oct 11, 2022

fanquake removed the label Needs backport (24.x) on Oct 11, 2022

fanquake commented at 1:21 am on October 11, 2022: member

Added to #26133 for backport to 24.x.

jonatack commented at 4:36 pm on October 11, 2022: contributor

Post-merge ACK 8891949bdcb25093d3a6703ae8228c3c3687d3a4

achow101 referenced this in commit 885366c67a on Oct 13, 2022

bitcoin locked this on Oct 13, 2023
