index: race fix, lock cs_main while 'm_synced' is subject to change

furszy commented at 3:38 pm on April 13, 2024: member

Fixes #29831 and #29863. Thanks to Marko for the detailed description of the issue.

The race occurs because a block could be connected and its event signaled in-between reading the ’next block’ and setting m_synced during the index initial synchronization. This is because cs_main is not locked through the process of determining the final index sync state. To address the issue, the m_synced flag set has been moved under cs_main guard.

DrahtBot commented at 3:38 pm on April 13, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	ryanofsky, fjahr, achow101
Stale ACK	andrewtoth, Sjors

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

DrahtBot added the label UTXO Db and Indexes on Apr 13, 2024

maflcko commented at 10:31 am on April 14, 2024: member

Is this related to bbe82c116e72ca0638751e063bf564cd1fe5c4d5?

furszy commented at 1:36 pm on April 14, 2024: member

Is this related to bbe82c1?

No, it is related to 0faafb57f8298547949cbc0044ee9e925ed887ba. This PR partially reverts it and documents the flow so the regression does not happen again.

andrewtoth commented at 11:02 pm on April 15, 2024: contributor

Is this related to bbe82c1?

No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba? Compared to this partial revert, the happy path of the full revert is only holding the lock for an extra pointer comparison check to see if there is a reorg, which doesn’t seem like it is worth moving the lock for. Handling Rewind is more complex, but it will happen very rarely and shouldn’t really happen much in Sync unless we are very close to tip.

furszy commented at 10:04 am on April 16, 2024: member

Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba?

I didn’t do it because the revert is not clean. It conflicts with bbe82c116e72ca0638751e063bf564cd1fe5c4d5 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me). But np on doing it and squashing the commits down to one. One sec.

furszy force-pushed on Apr 16, 2024

andrewtoth commented at 1:37 pm on April 16, 2024: contributor

utACK cbcb2c82669deaad71e739c64b1baf687e76e604

I didn’t do it because the revert is not clean. It conflicts with https://github.com/bitcoin/bitcoin/commit/bbe82c116e72ca0638751e063bf564cd1fe5c4d5 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

0revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
1revert 0faafb57f8298547949cbc0044ee9e925ed887ba
2doc commit

This would be a cleaner git history, and bbe82c116e72ca0638751e063bf564cd1fe5c4d5 turned out to be a red herring so would make more sense to revert that then have people be confused thinking that actually solved anything.

Sjors commented at 2:48 pm on April 16, 2024: member

utACK cbcb2c82669deaad71e739c64b1baf687e76e604

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise. The other, much more frequently called Commit() in Sync() was already done without holding cs_main.

ryanofsky approved

ryanofsky commented at 3:28 pm on April 16, 2024: contributor

Code review ACK cbcb2c82669deaad71e739c64b1baf687e76e604. I think I have the opposite opinion as andrewtoth, and that original bugfix c395b26df37b3839a9c76abbd4d5fb48e79b3208 is more direct and simple than current cbcb2c82669deaad71e739c64b1baf687e76e604 since the original fix leaves the Rewind code alone.

re: #29867 (comment)

Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:
0revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
1revert 0faafb57f8298547949cbc0044ee9e925ed887ba
2doc commit

I don’t think this will work because don’t actually want to revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5. It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767. We can’t just go back to the state before 0faafb57f8298547949cbc0044ee9e925ed887ba because the bug in #29767 was still present before that commit, just less likely to happen. So better to make a direct change fixing the current bug, than trying to fix it by reverting previous changes.

re: #29867 (comment)

Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise.

Agree just locking cs_main while calling NextSyncBlock and setting m_synced = true like the code before 0faafb57f8298547949cbc0044ee9e925ed887ba is probably the easiest fix. I think the following fix which avoids locking cs_main while calling Commit would work too, however:

 0--- a/src/index/base.cpp
 1+++ b/src/index/base.cpp
 2@@ -160,12 +160,24 @@ void BaseIndex::Sync()
 3             }
 4 
 5             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
 6+            // If pindex_next is null, it means pindex is the chain tip, so
 7+            // commit data indexed so far.
 8             if (!pindex_next) {
 9                 SetBestBlockIndex(pindex);
10                 // No need to handle errors in Commit. See rationale above.
11                 Commit();
12-                m_synced = true;
13-                break;
14+
15+                // If pindex is still the chain tip after committing, exit the
16+                // sync loop. It is important for cs_main to locked while
17+                // setting m_synced = true, otherwise a new block could be
18+                // attached while m_synced is still false, and it would not be
19+                // indexed.
20+                LOCK(::cs_main);
21+                pindex_next = NextSyncBlock(pindex, m_chainstate->m_chain);
22+                if (!pindex_next) {
23+                    m_synced = true;
24+                    break;
25+                }
26             }
27             if (pindex_next->pprev != pindex && !Rewind(pindex, pindex_next->pprev)) {
28                 FatalErrorf("%s: Failed to rewind index %s to a previous chain tip", __func__, GetName());

Any of the fixes seem ok to me.

DrahtBot requested review from ryanofsky on Apr 16, 2024

Sjors commented at 3:43 pm on April 16, 2024: member

@ryanofsky’s suggestion looks good to me as well. Though maybe if the timing is really unlucky you’d get lots of Commit() calls?

andrewtoth commented at 4:08 pm on April 16, 2024: contributor

It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767.

Ahh, I see yes BlockConnected is not guarded by cs_main, so that could still be called even if we hold cs_main inside Sync. Thank you for clarifying this for me.

ryanofsky commented at 6:27 pm on April 16, 2024: contributor

Though maybe if the timing is really unlucky you’d get lots of Commit() calls?

That’s a good point. If Commit() is slow and it has to be called more than once, it could make index sync less efficient, even if it blocks validation code less.

There’s a choice about what to do when the Commit() call at the end of the sync is slow and a new block is connected while it is executing.

Currently, the new block may get ignored and the index could be corrupted.
Before 0faafb57f8298547949cbc0044ee9e925ed887ba from #28955, and after cbcb2c82669deaad71e739c64b1baf687e76e604 from this PR, cs_main is locked during Commit() so a new block can’t be connected.
The suggested diff would call Commit() again and potentially keep calling it until no new blocks were connected.

I think behaviors 2 or 3 both should be ok in practice, and 3 may be more appealing conceptually because ideally indexes should not hold cs_main and block validation code while writing their own data.

Another alternative could be to use approach 3 but only call Commit() exactly once at the end of the sync, skipping subsequent Commit() calls if more blocks are connected until the sync is completely finished and there is a ChainStateFlushed() notification. This would be similar to what happens with normal BlockConnected() notifications for new blocks.

I think any approach that fixes the bug is fine though, and the current approach does LGTM.

furszy commented at 8:14 pm on April 17, 2024: member

Little delayed but here again. I’m also okay with both options. I’m feeling a bit more inclined towards option 3, primarily because it allows us to explain what the code is doing more clearly.

Also, just to share some extra thoughts around this issue, another -bad- option to hold a minimal cs_main lock could be to keep the current flow in master and connect missing intermediate blocks inside BlockConnected only the first time it is called after setting m_synced. This does not require chain access nor any extra cs_main lock, as we could go backwards from the newly connected tip down to the index tip, reading and connecting the missing block(s). The obvious downside of this option is that the index could finish the initial sync process missing one or two blocks, only updating its tip on the next BlockConnected signal, which could take a while. So, this would change the meaning of m_synced.

index: race fix, lock cs_main while 'm_synced' is subject to change

This ensures that the index does not miss any 'new block' signals
occurring in-between reading the 'next block' and setting 'm_synced'.
Because, if this were to happen, the ignored blocks would never be
indexed, thus stalling the index forever.

65951e0418

furszy force-pushed on Apr 17, 2024

ryanofsky approved

ryanofsky commented at 12:28 pm on April 18, 2024: contributor

Code review ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1

DrahtBot requested review from Sjors on Apr 18, 2024

in src/index/base.cpp:164 in 65951e0418

159@@ -160,12 +160,24 @@ void BaseIndex::Sync()
160             }
161 
162             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
163+            // If pindex_next is null, it means pindex is the chain tip, so
164+            // commit data indexed so far.

fjahr commented at 8:39 pm on April 18, 2024:

nit: I had to read this three times to get it right, but feel free to ignore unless you have to retouch.

I would change it to [...], so commit the data that was indexed so far.

fjahr commented at 8:41 pm on April 18, 2024: contributor

Code review ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1

achow101 commented at 6:05 pm on April 25, 2024: member

ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1

achow101 merged this on Apr 25, 2024

achow101 closed this on Apr 25, 2024

furszy deleted the branch on Apr 25, 2024

ryanofsky commented at 7:02 pm on April 25, 2024: contributor

Note just to clarify issue history: The race condition fixed here was introduced by 0faafb57f8298547949cbc0044ee9e925ed887ba from #28955, which stopped locking cs_main while setting m_synced = true. It could cause indexes to incorrectly discard block-connected notifications.

This race condition is different from the other race condition recently fixed in #29776, which is a much older bug going back to 4368384f1d267b011e03a805f934f5c47e2ca1b2 from #14121, caused by incorrectly setting m_synced = true before the Commit() call, instead of after. That race condition could cause the sync and notification threads to both try to write index data at the same time, and possibly corrupt the state. That race condition was probably also more likely to happen after #28955, though, due to cs_main being released.

fanquake commented at 3:30 am on April 26, 2024: member

Seems like there’s at least something which could/should be backported here? Maybe only the other change, given the newly broken code has only existed in master?

ryanofsky commented at 12:36 pm on April 26, 2024: contributor

Yes in terms of backports, it could make sense to backport the #29776 fix to releases before #28955, since the race fixed by #29776 where two threads are trying to update index state at the same time at the end of a sync could happen before #28955, although it was probably much less likely to happen before #28955.

It would only make sense to backport this PR to releases after #28955, though, since this bug, which could cause blocks to be missing from the index, was introduced in #28955.

fanquake commented at 1:07 pm on April 26, 2024: member

I’ve pulled #29776 in for 27.x. It could go into 26.x as well cc @glozow.

glozow referenced this in commit 4fef05fd79 on Apr 29, 2024

glozow referenced this in commit 4388b7e59c on Apr 29, 2024

bitcoin locked this on Apr 26, 2025

index: race fix, lock cs_main while ’m_synced’ is subject to change #29867

Code Coverage

Reviews

Conflicts