index: race fix, lock cs_main while ’m_synced’ is subject to change #29867

pull furszy wants to merge 1 commits into bitcoin:master from furszy:2024_index_fix_race changing 1 files +14 −2
  1. furszy commented at 3:38 pm on April 13, 2024: member

    Fixes #29831 and #29863. Thanks to Marko for the detailed description of the issue.

    The race occurs because a block could be connected and its event signaled in-between reading the ’next block’ and setting m_synced during the index initial synchronization. This is because cs_main is not locked through the process of determining the final index sync state. To address the issue, the m_synced flag set has been moved under cs_main guard.

  2. DrahtBot commented at 3:38 pm on April 13, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK ryanofsky, fjahr, achow101
    Stale ACK andrewtoth, Sjors

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label UTXO Db and Indexes on Apr 13, 2024
  4. maflcko commented at 10:31 am on April 14, 2024: member
    Is this related to bbe82c116e72ca0638751e063bf564cd1fe5c4d5?
  5. furszy commented at 1:36 pm on April 14, 2024: member

    Is this related to bbe82c1?

    No, it is related to 0faafb57f8298547949cbc0044ee9e925ed887ba. This PR partially reverts it and documents the flow so the regression does not happen again.

  6. andrewtoth commented at 11:02 pm on April 15, 2024: contributor

    Is this related to bbe82c1?

    No, it is related to 0faafb5. This PR partially reverts it and documents the flow so the regression does not happen again.

    Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba? Compared to this partial revert, the happy path of the full revert is only holding the lock for an extra pointer comparison check to see if there is a reorg, which doesn’t seem like it is worth moving the lock for. Handling Rewind is more complex, but it will happen very rarely and shouldn’t really happen much in Sync unless we are very close to tip.

  7. furszy commented at 10:04 am on April 16, 2024: member

    Is it worth it to partially revert? Why not just git revert 0faafb57f8298547949cbc0044ee9e925ed887ba?

    I didn’t do it because the revert is not clean. It conflicts with bbe82c116e72ca0638751e063bf564cd1fe5c4d5 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me). But np on doing it and squashing the commits down to one. One sec.

  8. furszy force-pushed on Apr 16, 2024
  9. andrewtoth commented at 1:37 pm on April 16, 2024: contributor

    utACK cbcb2c82669deaad71e739c64b1baf687e76e604

    I didn’t do it because the revert is not clean. It conflicts with https://github.com/bitcoin/bitcoin/commit/bbe82c116e72ca0638751e063bf564cd1fe5c4d5 and it would require an extra commit for the added doc (which, based on the regression, is a must have for me).

    Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

    0revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
    1revert 0faafb57f8298547949cbc0044ee9e925ed887ba
    2doc commit
    

    This would be a cleaner git history, and bbe82c116e72ca0638751e063bf564cd1fe5c4d5 turned out to be a red herring so would make more sense to revert that then have people be confused thinking that actually solved anything.

  10. Sjors commented at 2:48 pm on April 16, 2024: member

    utACK cbcb2c82669deaad71e739c64b1baf687e76e604

    Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise. The other, much more frequently called Commit() in Sync() was already done without holding cs_main.

  11. ryanofsky approved
  12. ryanofsky commented at 3:28 pm on April 16, 2024: contributor

    Code review ACK cbcb2c82669deaad71e739c64b1baf687e76e604. I think I have the opposite opinion as andrewtoth, and that original bugfix c395b26df37b3839a9c76abbd4d5fb48e79b3208 is more direct and simple than current cbcb2c82669deaad71e739c64b1baf687e76e604 since the original fix leaves the Rewind code alone.

    re: #29867 (comment)

    Not a blocker, but I think it would be cleaner if this PR was the following 3 commits:

    0revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5
    1revert 0faafb57f8298547949cbc0044ee9e925ed887ba
    2doc commit
    

    I don’t think this will work because don’t actually want to revert bbe82c116e72ca0638751e063bf564cd1fe5c4d5. It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767. We can’t just go back to the state before 0faafb57f8298547949cbc0044ee9e925ed887ba because the bug in #29767 was still present before that commit, just less likely to happen. So better to make a direct change fixing the current bug, than trying to fix it by reverting previous changes.

    re: #29867 (comment)

    Holding on to cs_main in this particular spot, when index catches up to the tip, is not a big deal performance wise.

    Agree just locking cs_main while calling NextSyncBlock and setting m_synced = true like the code before 0faafb57f8298547949cbc0044ee9e925ed887ba is probably the easiest fix. I think the following fix which avoids locking cs_main while calling Commit would work too, however:

     0--- a/src/index/base.cpp
     1+++ b/src/index/base.cpp
     2@@ -160,12 +160,24 @@ void BaseIndex::Sync()
     3             }
     4 
     5             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
     6+            // If pindex_next is null, it means pindex is the chain tip, so
     7+            // commit data indexed so far.
     8             if (!pindex_next) {
     9                 SetBestBlockIndex(pindex);
    10                 // No need to handle errors in Commit. See rationale above.
    11                 Commit();
    12-                m_synced = true;
    13-                break;
    14+
    15+                // If pindex is still the chain tip after committing, exit the
    16+                // sync loop. It is important for cs_main to locked while
    17+                // setting m_synced = true, otherwise a new block could be
    18+                // attached while m_synced is still false, and it would not be
    19+                // indexed.
    20+                LOCK(::cs_main);
    21+                pindex_next = NextSyncBlock(pindex, m_chainstate->m_chain);
    22+                if (!pindex_next) {
    23+                    m_synced = true;
    24+                    break;
    25+                }
    26             }
    27             if (pindex_next->pprev != pindex && !Rewind(pindex, pindex_next->pprev)) {
    28                 FatalErrorf("%s: Failed to rewind index %s to a previous chain tip", __func__, GetName());
    

    Any of the fixes seem ok to me.

  13. DrahtBot requested review from ryanofsky on Apr 16, 2024
  14. Sjors commented at 3:43 pm on April 16, 2024: member
    @ryanofsky’s suggestion looks good to me as well. Though maybe if the timing is really unlucky you’d get lots of Commit() calls?
  15. andrewtoth commented at 4:08 pm on April 16, 2024: contributor

    It is still important to set m_synced = true; after calling Commit to avoid the bug reported #29767.

    Ahh, I see yes BlockConnected is not guarded by cs_main, so that could still be called even if we hold cs_main inside Sync. Thank you for clarifying this for me.

  16. ryanofsky commented at 6:27 pm on April 16, 2024: contributor

    Though maybe if the timing is really unlucky you’d get lots of Commit() calls?

    That’s a good point. If Commit() is slow and it has to be called more than once, it could make index sync less efficient, even if it blocks validation code less.

    There’s a choice about what to do when the Commit() call at the end of the sync is slow and a new block is connected while it is executing.

    1. Currently, the new block may get ignored and the index could be corrupted.
    2. Before 0faafb57f8298547949cbc0044ee9e925ed887ba from #28955, and after cbcb2c82669deaad71e739c64b1baf687e76e604 from this PR, cs_main is locked during Commit() so a new block can’t be connected.
    3. The suggested diff would call Commit() again and potentially keep calling it until no new blocks were connected.

    I think behaviors 2 or 3 both should be ok in practice, and 3 may be more appealing conceptually because ideally indexes should not hold cs_main and block validation code while writing their own data.

    Another alternative could be to use approach 3 but only call Commit() exactly once at the end of the sync, skipping subsequent Commit() calls if more blocks are connected until the sync is completely finished and there is a ChainStateFlushed() notification. This would be similar to what happens with normal BlockConnected() notifications for new blocks.

    I think any approach that fixes the bug is fine though, and the current approach does LGTM.

  17. furszy commented at 8:14 pm on April 17, 2024: member

    Little delayed but here again. I’m also okay with both options. I’m feeling a bit more inclined towards option 3, primarily because it allows us to explain what the code is doing more clearly.

    Also, just to share some extra thoughts around this issue, another -bad- option to hold a minimal cs_main lock could be to keep the current flow in master and connect missing intermediate blocks inside BlockConnected only the first time it is called after setting m_synced. This does not require chain access nor any extra cs_main lock, as we could go backwards from the newly connected tip down to the index tip, reading and connecting the missing block(s). The obvious downside of this option is that the index could finish the initial sync process missing one or two blocks, only updating its tip on the next BlockConnected signal, which could take a while. So, this would change the meaning of m_synced.

  18. index: race fix, lock cs_main while 'm_synced' is subject to change
    This ensures that the index does not miss any 'new block' signals
    occurring in-between reading the 'next block' and setting 'm_synced'.
    Because, if this were to happen, the ignored blocks would never be
    indexed, thus stalling the index forever.
    65951e0418
  19. furszy force-pushed on Apr 17, 2024
  20. ryanofsky approved
  21. ryanofsky commented at 12:28 pm on April 18, 2024: contributor
    Code review ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1
  22. DrahtBot requested review from Sjors on Apr 18, 2024
  23. in src/index/base.cpp:164 in 65951e0418
    159@@ -160,12 +160,24 @@ void BaseIndex::Sync()
    160             }
    161 
    162             const CBlockIndex* pindex_next = WITH_LOCK(cs_main, return NextSyncBlock(pindex, m_chainstate->m_chain));
    163+            // If pindex_next is null, it means pindex is the chain tip, so
    164+            // commit data indexed so far.
    


    fjahr commented at 8:39 pm on April 18, 2024:

    nit: I had to read this three times to get it right, but feel free to ignore unless you have to retouch.

    I would change it to [...], so commit the data that was indexed so far.

  24. fjahr commented at 8:41 pm on April 18, 2024: contributor
    Code review ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1
  25. achow101 commented at 6:05 pm on April 25, 2024: member
    ACK 65951e0418c53cbbf30b9ee85e24ccaf729088a1
  26. achow101 merged this on Apr 25, 2024
  27. achow101 closed this on Apr 25, 2024

  28. furszy deleted the branch on Apr 25, 2024
  29. ryanofsky commented at 7:02 pm on April 25, 2024: contributor

    Note just to clarify issue history: The race condition fixed here was introduced by 0faafb57f8298547949cbc0044ee9e925ed887ba from #28955, which stopped locking cs_main while setting m_synced = true. It could cause indexes to incorrectly discard block-connected notifications.

    This race condition is different from the other race condition recently fixed in #29776, which is a much older bug going back to 4368384f1d267b011e03a805f934f5c47e2ca1b2 from #14121, caused by incorrectly setting m_synced = true before the Commit() call, instead of after. That race condition could cause the sync and notification threads to both try to write index data at the same time, and possibly corrupt the state. That race condition was probably also more likely to happen after #28955, though, due to cs_main being released.

  30. fanquake commented at 3:30 am on April 26, 2024: member
    Seems like there’s at least something which could/should be backported here? Maybe only the other change, given the newly broken code has only existed in master?
  31. ryanofsky commented at 12:36 pm on April 26, 2024: contributor

    Yes in terms of backports, it could make sense to backport the #29776 fix to releases before #28955, since the race fixed by #29776 where two threads are trying to update index state at the same time at the end of a sync could happen before #28955, although it was probably much less likely to happen before #28955.

    It would only make sense to backport this PR to releases after #28955, though, since this bug, which could cause blocks to be missing from the index, was introduced in #28955.

  32. fanquake commented at 1:07 pm on April 26, 2024: member
    I’ve pulled #29776 in for 27.x. It could go into 26.x as well cc @glozow.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me