In commit “test: fix race in shutdown code in coinstatsindex_initial_sync” (3ae1fe44981cabb4bf5e714ccaac4659e8cb3f50)
I think it would be better to move the SyncWithValidationInterfaceQueue()
call up one line before the Stop()
call, and not an introduce an artificial pause between stopping the index and destroying the index. There should be no need for an extra pause between stopping and destroying. Adding the pause in the wrong place in the test could potentially mask real bugs, so better to add it in the right place one line up.
I also think the code comment “Make sure coin_stats_index is not used by the scheduler thread” and PR description “fix race in shutdown code” are a little misleading, because this is not really fixing a race condition. This is working around TSAN false positive errors that happen because TSAN doesn’t understand our BaseIndex::BlockUntilSyncedToCurrentChain
and BaseIndex::BlockConnected
functions.
TSAN alerts about a race condition because it doesn’t see the test thread which the destroys the object explicitly waiting for the scheduler thread which calls BlockConnected
. But this is because, as an optimization, our BlockUntilSyncedToCurrentChain
function skips calling SyncWithValidationInterfaceQueue
when m_best_block_index
matches the chain tip. This is a valid optimization as long as the BaseIndex::BlockConnected
implementation does not do any reads or writes to *this
after it updates m_best_block_index
. This was the case before f08c9fb0c6a799e3cb75ca5f763a746471625beb from #21726 did add a real race bug by making AllowPrune()
and GetName()
calls after updating m_best_block_index
. I think a fix for this could look like:
0--- a/src/index/base.cpp
1+++ b/src/index/base.cpp
2@@ -414,10 +414,18 @@ IndexSummary BaseIndex::GetSummary() const
3 void BaseIndex::SetBestBlockIndex(const CBlockIndex* block) {
4 assert(!node::fPruneMode || AllowPrune());
5
6- m_best_block_index = block;
7 if (AllowPrune() && block) {
8 node::PruneLockInfo prune_lock;
9 prune_lock.height_first = block->nHeight;
10 WITH_LOCK(::cs_main, m_chainstate->m_blockman.UpdatePruneLock(GetName(), prune_lock));
11 }
12+ // For thread safety and compatibility with m_best_block_index optimization
13+ // in the BlockUntilSyncedToCurrentChain(), update m_best_block_index as
14+ // the last step in this function, and avoid updating state or accessing
15+ // *this after m_best_block_index is set. If state is updated after
16+ // m_best_block_index, external threads won't be able to rely on
17+ // BlockUntilSyncedToCurrentChain() to be in sync with that state. And if
18+ // *this is accessed, external threads won't be able to use
19+ // BlockUntilSyncedToCurrentChain() to safely destroy the index object.
20+ m_best_block_index = block;
21 }
For this commit, I think a more accurate comment explaining the SyncWithValidationInterfaceQueue();
call would say something like “To avoid false positive TSAN errors, explicitly signal the test thread from the validationinterface thread so TSAN does not think there is a race between index shutdown code and BlockConnected notification code. The BlockUntilSyncedToCurrentChain() call above is sufficient to avoid these races in reality, but TSAN can’t detect this due to an optimization in that function.”