test: silence TSAN false positive in coinstatsindex_initial

vasild commented at 4:09 pm on September 27, 2022: contributor

Silence false positives from TSAN about unsynchronized calls to BaseIndex::~BaseIndex() and BaseIndex::SetBestBlockIndex(). They are synchronized, but beyond the comprehension of TSAN - by SyncWithValidationInterfaceQueue(), called from BaseIndex::BlockUntilSyncedToCurrentChain().

Fixes #25365

vasild commented at 4:12 pm on September 27, 2022: contributor

A call to SyncWithValidationInterfaceQueue() also fixes it. I am +0.1 on stopping the scheduler thread and flushing the queue - looks more robust to me.

DrahtBot added the label Tests on Sep 27, 2022

fanquake added this to the milestone 24.0 on Sep 27, 2022

glozow requested review from maflcko on Sep 28, 2022

in src/test/coinstatsindex_tests.cpp:88 in eaec74f3b9 outdated

79@@ -79,6 +80,13 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
80     // Shutdown sequence (c.f. Shutdown() in init.cpp)
81     coin_stats_index.Stop();
82 
83+    // Make sure coin_stats_index is not used by the scheduler thread after it
84+    // has been destroyed.
85+    if (m_node.scheduler) {
86+        m_node.scheduler->stop();
87+    }
88+    GetMainSignals().FlushBackgroundCallbacks();

maflcko commented at 6:46 pm on September 28, 2022:

I don’t really like duplicating the shutdown sequence, especially not in the same order. This is just bloating the tests with hard to understand and hard to maintain code.

The reason that the bug exits is a presumed silent merge conflict. I already fixed the same bug in fa176e253fb473767c61d4d8cd2d93e87d71a015, but the test here copied the unfixed code.

My preference would be to use the same fix in all unit tests and not use a different fix for each unit test.

vasild commented at 8:37 am on September 29, 2022:

Done.

maflcko commented at 6:47 pm on September 28, 2022: member

Concept ACK. This may fix the bug, but it could be better.

ryanofsky commented at 7:03 pm on September 28, 2022: contributor

A call to SyncWithValidationInterfaceQueue() also fixes it. I am +0.1 on stopping the scheduler thread and flushing the queue - looks more robust to me.

Hmm, I would think the opposite SyncWithValidationInterfaceQueue seems like the more robust approach to me. It forces all events that may have been posted before the index is stopped to be processed before the index destructor is called.

Stopping the scheduler seems more indirect and more error prone because it make additional assumptions that validationinterface uses the scheduler internally and that it is safe to stop the scheduler at all because nothing else depends on it.

Also agree with Marco’s suggestion that it would be good to stick to consistent approach across tests, which also favors SyncWithValidationInterfaceQueue

vasild force-pushed on Sep 29, 2022

vasild commented at 8:22 am on September 29, 2022: contributor

eaec74f3b9...40ef8e064d: @ryanofsky, @MarcoFalke fair enough! Switched to SyncWithValidationInterfaceQueue() :-)

in src/test/coinstatsindex_tests.cpp:86 in 40ef8e064d outdated

78@@ -79,6 +79,10 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
79     // Shutdown sequence (c.f. Shutdown() in init.cpp)
80     coin_stats_index.Stop();
81 
82+    // Make sure coin_stats_index is not used by the scheduler thread after it
83+    // has been destroyed.
84+    SyncWithValidationInterfaceQueue();
85+
86     // Rest of shutdown sequence and destructors happen in ~TestingSetup()

maflcko commented at 9:18 am on September 29, 2022:

Maybe remove this comment as well, or add it to the other places, to mirror fa176e253fb473767c61d4d8cd2d93e87d71a015?

vasild commented at 9:53 am on September 29, 2022:

Removed.

in src/test/coinstatsindex_tests.cpp:88 in 40ef8e064d outdated

78@@ -79,6 +79,10 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
79     // Shutdown sequence (c.f. Shutdown() in init.cpp)
80     coin_stats_index.Stop();

maflcko commented at 9:19 am on September 29, 2022:

Unrelated: I think the two Stop(), here and in fa176e253fb473767c61d4d8cd2d93e87d71a015 can be removed, as they are done by the destructor anyway.

vasild commented at 9:57 am on September 29, 2022:

It looks like that all calls to Stop() are done just before the objects are destroyed (except the call from the destructor itself). Thus Stop() can be removed and its (short) body moved inside BaseIndex::~BaseIndex(). If this is the case then this deserves a separate PR. Am I missing something?

ryanofsky commented at 3:46 pm on September 29, 2022:

The calls to Stop need to happen before the BaseIndex destructor runs because they are responsible for joining the sync thread, which can call virtual methods. By the time the BaseIndex destructor runs, whatever subclass inherited from BaseIndex is partially destroyed and the virtual method calls won’t work.

The call to Stop() in the destructor was never valid and I have a commit c499c8797e65c66318eb1faa2e36c7c3f3e2ba1a in #24230 which removes it and asserts that it was called previously.

maflcko approved

maflcko commented at 9:19 am on September 29, 2022: member

lgtm

fanquake added the label Needs backport (24.x) on Sep 29, 2022

vasild force-pushed on Sep 29, 2022

vasild commented at 9:53 am on September 29, 2022: contributor

40ef8e064d...3ae1fe4498: take suggestion to remove a comment

in src/test/coinstatsindex_tests.cpp:80 in 3ae1fe4498 outdated

78@@ -79,7 +79,9 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
79     // Shutdown sequence (c.f. Shutdown() in init.cpp)
80     coin_stats_index.Stop();
81 
82-    // Rest of shutdown sequence and destructors happen in ~TestingSetup()
83+    // Make sure coin_stats_index is not used by the scheduler thread after it
84+    // has been destroyed.

ryanofsky commented at 6:27 pm on September 29, 2022:

In commit “test: fix race in shutdown code in coinstatsindex_initial_sync” (3ae1fe44981cabb4bf5e714ccaac4659e8cb3f50)

I think it would be better to move the SyncWithValidationInterfaceQueue() call up one line before the Stop() call, and not an introduce an artificial pause between stopping the index and destroying the index. There should be no need for an extra pause between stopping and destroying. Adding the pause in the wrong place in the test could potentially mask real bugs, so better to add it in the right place one line up.

I also think the code comment “Make sure coin_stats_index is not used by the scheduler thread” and PR description “fix race in shutdown code” are a little misleading, because this is not really fixing a race condition. This is working around TSAN false positive errors that happen because TSAN doesn’t understand our BaseIndex::BlockUntilSyncedToCurrentChain and BaseIndex::BlockConnected functions.

TSAN alerts about a race condition because it doesn’t see the test thread which the destroys the object explicitly waiting for the scheduler thread which calls BlockConnected. But this is because, as an optimization, our BlockUntilSyncedToCurrentChain function skips calling SyncWithValidationInterfaceQueue when m_best_block_index matches the chain tip. This is a valid optimization as long as the BaseIndex::BlockConnected implementation does not do any reads or writes to *this after it updates m_best_block_index. This was the case before f08c9fb0c6a799e3cb75ca5f763a746471625beb from #21726 did add a real race bug by making AllowPrune() and GetName() calls after updating m_best_block_index. I think a fix for this could look like:

 0--- a/src/index/base.cpp
 1+++ b/src/index/base.cpp
 2@@ -414,10 +414,18 @@ IndexSummary BaseIndex::GetSummary() const
 3 void BaseIndex::SetBestBlockIndex(const CBlockIndex* block) {
 4     assert(!node::fPruneMode || AllowPrune());
 5 
 6-    m_best_block_index = block;
 7     if (AllowPrune() && block) {
 8         node::PruneLockInfo prune_lock;
 9         prune_lock.height_first = block->nHeight;
10         WITH_LOCK(::cs_main, m_chainstate->m_blockman.UpdatePruneLock(GetName(), prune_lock));
11     }
12+    // For thread safety and compatibility with m_best_block_index optimization
13+    // in the BlockUntilSyncedToCurrentChain(), update m_best_block_index as
14+    // the last step in this function, and avoid updating state or accessing
15+    // *this after m_best_block_index is set. If state is updated after
16+    // m_best_block_index, external threads won't be able to rely on
17+    // BlockUntilSyncedToCurrentChain() to be in sync with that state. And if
18+    // *this is accessed, external threads won't be able to use
19+    // BlockUntilSyncedToCurrentChain() to safely destroy the index object.
20+    m_best_block_index = block;
21 }

For this commit, I think a more accurate comment explaining the SyncWithValidationInterfaceQueue(); call would say something like “To avoid false positive TSAN errors, explicitly signal the test thread from the validationinterface thread so TSAN does not think there is a race between index shutdown code and BlockConnected notification code. The BlockUntilSyncedToCurrentChain() call above is sufficient to avoid these races in reality, but TSAN can’t detect this due to an optimization in that function.”

maflcko commented at 6:40 pm on September 29, 2022:

This is working around TSAN false positive errors

No, this is a real issue (use-after-free), see steps to reproduce without tsan (for example with valgrind): #25365 (comment)

ryanofsky commented at 6:57 pm on September 29, 2022:

This is working around TSAN false positive errors

No, this is a real issue (use-after-free), see steps to reproduce without tsan (for example with valgrind): #25365 (comment)

Yes I did see that comment, but it was not reproducing an existing race condition, it was adding a new race condition by calling this->GetName() after the m_best_block_index = block assignment. If the m_best_block_index = block; assignment was done last, there would be no race.

My comment above may be a little muddled because I noticed the real race condition caused by f08c9fb0c6a799e3cb75ca5f763a746471625beb from #21726 in the middle of writing it. I think we should fix the real race condition in non-test code making a change to SetBestBlockIndex like the one I suggested above. After we do that, the the current change 3ae1fe44981cabb4bf5e714ccaac4659e8cb3f50 in this PR will just be silencing TSAN errors, not avoiding a real race condition anymore.

vasild commented at 8:42 am on September 30, 2022:

it was not reproducing an existing race condition, it was adding a new race condition by calling this->GetName() after the m_best_block_index = block assignment

Ok, but see also #25365 (comment). I could reproduce use-after-free in Valgrind even without adding a call to GetName(), just a sleep suffices.

I got the impression that when BaseIndex::SetBestBlockIndex() executes this is already freed, but I do not remember if I checked to access a member before the m_best_block_index assignment.

Moving the assignment to the bottom may fix it. Anyway, this whole story looks overly complicated to me. It shouldn’t require multiple people discussing whether this is a bug or not and how to fix it exactly! I opened #26210 to explore an alternative.

ryanofsky commented at 1:37 pm on September 30, 2022:

Ok, but see also #25365 (comment). I could reproduce use-after-free in Valgrind even without adding a call to GetName(), just a sleep suffices.

Right. Your reproduction puts the sleep right before the UpdatePruneLock(GetName(), ...) call, so it does not need to add a new call to GetName() and does reveal an existing race condition in the test. Marco’s reproduction puts that sleep after that and adds another GetName() call, so it is really adding a new race condition.

I got the impression that when BaseIndex::SetBestBlockIndex() executes this is already freed, but I do not remember if I checked to access a member before the m_best_block_index assignment.

I don’t think this is possible because BlockUntilSyncedToCurrentChain would not have returned before the m_best_block_index assignment was done.

Moving the assignment to the bottom may fix it. Anyway, this whole story looks overly complicated to me. It shouldn’t require multiple people discussing whether this is a bug or not and how to fix it exactly! I opened #26210 to explore an alternative.

Yes I agree moving the assignment to the bottom is not the best long term solution for races generally. I just think moving the assignment is the most direct fix for the race condition introduced in #21726. Also I think it makes code more logical and seems like a reliability improvement. I opened a new PR #26215 which moves the assignment, and I think it would be the way to address the real race condition, and turn the TSAN true positives into false positives. After that, this PR would be a good way to fix remaining false positives (again with the suggestion that I think it would be better for test shutdown sequence to be 1-flush 2-stop 3-destroy not 1-stop 2-flush 3-destroy so test behavior is closer to bitcoind shutdown behavior).

maflcko commented at 10:46 am on October 4, 2022:

I this is a false positive, I wonder what the steps are to reproduce. After all tsan should be deterministic, assuming manual sleeps are added at the right places?

ryanofsky commented at 1:58 pm on October 4, 2022:

I this is a false positive, I wonder what the steps are to reproduce. After all tsan should be deterministic, assuming manual sleeps are added at the right places?

If the goal is to reproduce TSAN false positive errors that remain after #26215 fixes the race condition introduced by #21726, maybe a good way would be to put a 2 second sleep at the end of SetBestBlockIndex after it assigns m_best_block_index and 1 second sleep in the test before it calls BlockUntilSyncedToCurrentChain. This way the TSAN should see the scheduler thread reading the vtable ptr, and the test thread writing it, and no apparent synchronization between the two threads because BlockUntilSyncedToCurrentChain will return early without ever calling SyncWithValidationInterfaceQueue

ryanofsky commented at 6:35 pm on September 29, 2022: contributor

Reviewed 3ae1fe44981cabb4bf5e714ccaac4659e8cb3f50, but I think f08c9fb0c6a799e3cb75ca5f763a746471625beb from #21726 added a real race that this workaround doesn’t really address, and I made some suggestions below.

ryanofsky referenced this in commit dd2ef55a86 on Sep 30, 2022

vasild force-pushed on Oct 4, 2022

vasild commented at 10:40 am on October 4, 2022: contributor

3ae1fe4498...d2395c83c4: rebase + move SyncWithValidationInterfaceQueue() before coin_stats_index.Stop() as suggested above.

I think this should be good to go, unrelated to other PRs, as a minimal change that fixes the problem, to be included in 24.x.

in src/test/coinstatsindex_tests.cpp:88 in d2395c83c4 outdated

75@@ -76,10 +76,12 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
76 
77     BOOST_CHECK(block_index != new_block_index);
78 
79+    // Make sure coin_stats_index is not used by the scheduler thread after it
80+    // has been destroyed.
81+    SyncWithValidationInterfaceQueue();
82+
83     // Shutdown sequence (c.f. Shutdown() in init.cpp)
84     coin_stats_index.Stop();

maflcko commented at 10:44 am on October 4, 2022:

If you move this, it would be good to update the other places as well.

My preference would be to use the same fix in all unit tests and not use a different fix for each unit test.

vasild commented at 11:25 am on October 4, 2022:

That is just in txindex_tests, right? Moved.

vasild commented at 11:25 am on October 4, 2022: contributor

d2395c83c4..c4566600be: use same shutdown order also in txindex_tests.

in src/test/coinstatsindex_tests.cpp:80 in d2395c83c4 outdated

75@@ -76,10 +76,12 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
76 
77     BOOST_CHECK(block_index != new_block_index);
78 
79+    // Make sure coin_stats_index is not used by the scheduler thread after it
80+    // has been destroyed.

ryanofsky commented at 2:09 pm on October 4, 2022:

In commit “test: fix race in shutdown code in coinstatsindex_initial_sync” (d2395c83c406f96e3c8c3d638684823c7ada493c)

After #26215 is merged, this comment “Make sure coin_stats_index is not used by the scheduler thread after it has been destroyed” will be misleading, because the previous BlockUntilSyncedToCurrentChain above will already be waiting for the scheduler thread, and the only purpose of this SyncWithValidationInterfaceQueue call will be to silence TSAN errors.

vasild commented at 1:31 pm on October 10, 2022:

Reworded.

ryanofsky approved

ryanofsky commented at 2:27 pm on October 4, 2022: contributor

Code review ACK c4566600be3b3744d0bb7d0747efabd09b29abf3. But it would be better if #26215 were merged before this PR, because unlike this PR, #26215 is a direct fix for the race condition introduced in #21726. After #26215, this PR would be easier to understand as a followup that addresses TSAN false positive errors and makes test shutdown sequences consistent.

I don’t think it would be good to merge this PR before #26215, because it would only be fixing the race condition indirectly instead of directly, and it would be adding misleading comments about the purpose of the SyncWithValidationInterfaceQueue call following the BlockUntilSyncedToCurrentChain call. The new SyncWithValidationInterfaceQueue call shouldn’t be necessary for anything other than making TSAN happy.

ryanofsky referenced this in commit 8891949bdc on Oct 5, 2022

fanquake referenced this in commit 4175c332b9 on Oct 10, 2022

fanquake commented at 6:28 am on October 10, 2022: member

#26215 has been merged. @vasild can you rebase/follow up with @ryanofsky’s recent comments here.

vasild force-pushed on Oct 10, 2022

vasild commented at 1:31 pm on October 10, 2022: contributor

c4566600be...26bef2957f: rebase and adjust after #26215

Note - I cannot reproduce this with the steps from #26188 (review).

maflcko commented at 1:44 pm on October 10, 2022: member

Update title/OP?

vasild renamed this:
~~test: fix race in shutdown code in coinstatsindex_initial_sync~~
test: silence TSAN false positive in coinstatsindex_initial_sync
on Oct 10, 2022

vasild commented at 2:45 pm on October 10, 2022: contributor

Update title/OP?

Done, sorry.

maflcko commented at 3:45 pm on October 10, 2022: member

I can’t reproduce this either

maflcko removed the label Needs backport (24.x) on Oct 10, 2022

maflcko removed this from the milestone 24.0 on Oct 10, 2022

in src/test/coinstatsindex_tests.cpp:79 in 26bef2957f outdated

75@@ -76,10 +76,15 @@ BOOST_FIXTURE_TEST_CASE(coinstatsindex_initial_sync, TestChain100Setup)
76 
77     BOOST_CHECK(block_index != new_block_index);
78 
79+    // Silence false positives from TSAN about unsynchronized calls to

ryanofsky commented at 6:45 pm on October 10, 2022:

In commit “test: silence TSAN false positive in coinstatsindex_initial_sync” (b3e76f92e05da022009ad65a97f8f9f95b4373c5)

Since we don’t have simple reproduction steps anymore, and TSAN implementation details are a little opaque, maybe it is better to say what this is trying to do at a higher level like “// It is not safe to stop and destroy the index until it finishes handling the last BlockConnected notification. The BlockUntilSyncedToCurrentChain call above is sufficient to ensure this, but the SyncWithValidationInterfaceQueue call below is also needed to ensure TSAN always sees the test thread waiting for the notification thread, and avoid potential false positive reports.”

vasild commented at 7:48 am on October 11, 2022:

Changed the comments, thanks!

sidhujag referenced this in commit f4572f9ae4 on Oct 10, 2022

ryanofsky approved

ryanofsky commented at 7:10 pm on October 10, 2022: contributor

Code review ACK 26bef2957f3f24b505f133c691ad94832a69dd05. I do think this change improves the tests, even if we can’t reproduce the original TSAN errors anymore. The PR fixes a misleading comment in the txindex_initial_sync test, and makes shutdown order in that test more closely match bitcoind shutdown order. It also makes the coinstatsindex_initial_sync test consistent with the txindex_initial_sync test.

If we are sure there are no TSAN false positive errors here, a potentially simpler alternative to this PR is to remove the SyncWithValidationInterfaceQueue calls and comments from both tests. It’s possible TSAN is taking atomic reads and writes into account and can infer that the test thread is waiting for the notification thread because it is checking the atomic m_best_block_index variable. That could explain why sleeps suggested in #26188 (review) don’t cause errors, even if the sleeps do cause BlockUntilSyncedToCurrentChain to return early without calling SyncWithValidationInterfaceQueue.

fanquake referenced this in commit 5ad82a09b4 on Oct 11, 2022

test: silence TSAN false positive in coinstatsindex_initial_sync

Fixes https://github.com/bitcoin/bitcoin/issues/25365

6526dc3b78

test: move SyncWithValidationInterfaceQueue() before Stop() in txindex_tests

So that the call order is the same as in coinstatsindex_tests.

861cb3fadc

vasild force-pushed on Oct 11, 2022

vasild commented at 7:52 am on October 11, 2022: contributor

26bef2957f...861cb3fadc: reword the comments

… remove the SyncWithValidationInterfaceQueue calls and comments from both tests …

I am also ok with this, since it is a test-only and a false positive, to leave it “unfixed” until it happens again in CI, if ever. @MarcoFalke, what do you think?

maflcko commented at 8:00 am on October 11, 2022: member

review ACK 861cb3fadce88cfaee27088185a48f03fb9dafe7

I think anything is fine here. Either keep both or delete both.

ryanofsky approved

ryanofsky commented at 5:25 pm on October 12, 2022: contributor

Code review ACK 861cb3fadce88cfaee27088185a48f03fb9dafe7. Just comment change since last review.

Looks like this is ready to merge and issue #25365 is resolved.

fanquake merged this on Oct 13, 2022

fanquake closed this on Oct 13, 2022

vasild deleted the branch on Oct 13, 2022

adam2k referenced this in commit c8fca8a784 on Oct 19, 2022

sidhujag referenced this in commit 945a820485 on Oct 23, 2022

janus referenced this in commit f7bf2d1a4a on Jan 20, 2023

aguycalled referenced this in commit 2f61f6b4bb on Jan 25, 2023

bitcoin locked this on Oct 13, 2023

test: silence TSAN false positive in coinstatsindex_initial_sync #26188