Close minor startup race between main and scheduler threads #22577

pull LarryRuane wants to merge 1 commits into bitcoin:master from LarryRuane:2021-07-28-startup-race changing 5 files +21 −11
  1. LarryRuane commented at 10:31 PM on July 28, 2021: contributor

    This is a low-priority bug fix. The scheduler thread runs CheckForStaleTipAndEvictPeers() every 45 seconds (EXTRA_PEER_CHECK_INTERVAL). If its first run happens before the active chain is set up (CChain::SetTip()), bitcoind will assert:

    (...)
    2021-07-28T22:16:49Z init message: Loading block index…
    bitcoind: validation.cpp:4968: CChainState& ChainstateManager::ActiveChainstate() const: Assertion `m_active_chainstate' failed.
    Aborted (core dumped)
    

    I ran into this while using the debugger to investigate an unrelated problem. Single-stepping through threads with a debugger can cause the relative thread execution timing to be very different than usual. I don't think any automated tests are needed for this PR. I'll give reproduction steps in the next PR comment.

  2. LarryRuane commented at 10:33 PM on July 28, 2021: contributor

    The easiest way to reproduce this problem is to apply this patch, and then start bitcoind normally. This delays the main thread from loading the block index for 60 seconds.

    --- a/src/init.cpp
    +++ b/src/init.cpp
    @@ -1343,6 +1343,7 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
             bilingual_str strLoadError;
     
             uiInterface.InitMessage(_("Loading block index…").translated);
    +        UninterruptibleSleep(std::chrono::milliseconds{60000});
     
             do {
                 const int64_t load_block_index_start_time = GetTimeMillis();
    
  3. DrahtBot added the label P2P on Jul 28, 2021
  4. DrahtBot added the label Validation on Jul 28, 2021
  5. DrahtBot commented at 3:33 AM on July 29, 2021: member

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #20295 (rpc: getblockfrompeer by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  6. Crypt-iQ commented at 6:01 PM on July 29, 2021: contributor

    I have also received this failure when running the functional tests under ASAN in parallel.

  7. tryphe approved
  8. tryphe commented at 11:34 PM on July 29, 2021: contributor

    tested ACK c0f913de644bf5516bcfae0cc8a87c449ba80596 (in conjunction with sleep() in #22577 (comment))

    Can confirm the assertion happens with bitcoind/bitcoin-qt with the sleep in place:

    bitcoin-qt: validation.cpp:4992: CChainState& ChainstateManager::ActiveChainstate() const: Assertion `m_active_chainstate' failed.`
    

    With patch + sleep in place, things run as expected.

  9. kristapsk approved
  10. kristapsk commented at 12:15 AM on July 30, 2021: contributor

    ACK c0f913de644bf5516bcfae0cc8a87c449ba80596. Was able to reproduce the issue and this solves it.

  11. lsilva01 approved
  12. lsilva01 commented at 5:03 AM on July 30, 2021: contributor
  13. MarcoFalke commented at 9:34 AM on July 30, 2021: member

    Wouldn't it make more sense to delay scheduling until later? Identical to how Connman does it?

  14. LarryRuane force-pushed on Jul 30, 2021
  15. Close minor startup race between main and scheduler threads
    Don't schedule class PeerManagerImpl's background tasks from its
    constructor, but instead do that from a separate method,
    StartScheduledTasks(), that can be called later at the end of startup,
    after other things, such as the active chain, are initialzed.
    703b1e612a
  16. LarryRuane force-pushed on Jul 30, 2021
  17. LarryRuane commented at 10:42 PM on July 30, 2021: contributor

    @MarcoFalke

    Wouldn't it make more sense to delay scheduling until later? Identical to how Connman does it?

    Good idea, thanks! After more investigation, I think the root problem is that the peer manager's background tasks are started from PeerManagerImpl() constructor. The new version separates the starting of these background tasks from the constructor. It's done from a new method, StartScheduledTasks(), which is called at the end of AppInitMain() (startup). It should always be safe to delay (within reason) anything that runs in the background.

    Force-pushed (previous version).

  18. tryphe commented at 1:23 AM on July 31, 2021: contributor

    Can we reproduce the crash with the new commit, or should it never occur now? I expected to see some removed code. Cheers.

  19. LarryRuane commented at 2:40 AM on July 31, 2021: contributor

    You can reproduce the crash by running master plus the patch. This PR branch (either the first version or the current version) plus the patch should not reproduce the crash.

    The reason there's no removed code (removing the first version) is because I did a force-push.

  20. in src/net_processing.cpp:275 in 703b1e612a
     270 | @@ -271,7 +271,7 @@ class PeerManagerImpl final : public PeerManager
     271 |  {
     272 |  public:
     273 |      PeerManagerImpl(const CChainParams& chainparams, CConnman& connman, CAddrMan& addrman,
     274 | -                    BanMan* banman, CScheduler& scheduler, ChainstateManager& chainman,
     275 | +                    BanMan* banman, ChainstateManager& chainman,
     276 |                      CTxMemPool& pool, bool ignore_incoming_txs);
    


    tryphe commented at 3:34 AM on July 31, 2021:

    Nit: CTxMemPool& pool, can be moved to the line above to fill in the extra space that has been created.

    There are four other lines in the commit that could use the same spacing fix: net_processing.cpp:1401 net_processing.cpp:1408 net_processing.h:41 test/util/setup_common.cpp:201

    It doesn't really matter too much but could save a future nit PR, although makes this code review slightly harder on the eyes


    LarryRuane commented at 1:45 AM on August 2, 2021:

    Thank you, I thought about doing that but decided it would be better to minimize the diff. Also, clang-format-diff.py didn't suggest doing that.

  21. MarcoFalke commented at 6:11 AM on July 31, 2021: member

    cr ACK 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942

  22. tryphe commented at 5:12 AM on August 2, 2021: contributor

    tested ACK 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942

    Approach ACK, prefer over the original changes conceptually. I find myself asking whether replacing a pointer check(https://github.com/bitcoin/bitcoin/commit/c0f913de644bf5516bcfae0cc8a87c449ba80596) with a whole new async task is worth it. Code-wise I think this adds a little abstraction, but it seems like the right direction, however.

    Sgtm.

    Downside: can't insert a wrench in this code to duplicate the original bad behavior, but that doesn't seem necessary.

  23. MarcoFalke removed the label P2P on Aug 2, 2021
  24. MarcoFalke removed the label Validation on Aug 2, 2021
  25. DrahtBot added the label P2P on Aug 2, 2021
  26. LarryRuane commented at 5:54 PM on August 2, 2021: contributor

    @tryphe

    ... a whole new async task ...

    It's not a new async task; all this PR does is move where the two peer-related async tasks are started to a later time during system initialization.

    Downside: can't insert a wrench in this code to duplicate the original bad behavior ...

    I'm unclear on what you mean, and same with your earlier comment:

    ... I'm wondering if there's a creative way to reproduce it against the new commit ...

    The whole point of the PR is that even with the reproduction patch (the extra delay) that causes the failure on master, it shouldn't be possible for that patch, or any patch, to reproduce the problem with this PR's new (or even with its old) commit. Maybe I'm missing something?

  27. 0xB10C commented at 9:45 PM on August 2, 2021: member

    ACK 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942

    Tested that master crashes with the patch from #22577 (comment) and that 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942 does not crash anymore.

  28. MarcoFalke commented at 7:42 AM on August 3, 2021: member
    ... I'm wondering if there's a creative way to reproduce it against the new commit ...

    It should be possible to reproduce again by moving the node.peerman->StartScheduledTasks(*node.scheduler); to right after the make_unique.

  29. glozow commented at 9:40 AM on August 4, 2021: member

    code review ACK 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942 - it makes sense to me to start peerman's background tasks here, after chainstate->LoadChainTip() and node.connman->Start() have been called.

    Also used a sleep (https://github.com/bitcoin/bitcoin/pull/22577#issuecomment-888664132) to verify the race exists on master but not on this branch since peerman->StartScheduledTasks() isn't called until later

  30. LarryRuane commented at 2:51 PM on August 4, 2021: contributor

    @tryphe, Marco's comment helped me understand what you're asking for. You're looking for a patch to apply to the latest code that will cause the failure.

    <details> <summary>Here's the patch that Marco suggested that should reproduce the problem (apply to this PR)</summary>

    --- a/src/init.cpp
    +++ b/src/init.cpp
    @@ -1181,6 +1181,7 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
         assert(!node.peerman);
         node.peerman = PeerManager::make(chainparams, *node.connman, *node.addrman, node.banman.get(),
                                          chainman, *node.mempool, ignores_incoming_txs);
    +    if (node.peerman) node.peerman->StartScheduledTasks(*node.scheduler);
         RegisterValidationInterface(node.peerman.get());
     
         // sanitize comments per BIP-0014, format user agent and check total size
    @@ -1343,6 +1344,8 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
             bilingual_str strLoadError;
     
             uiInterface.InitMessage(_("Loading block index…").translated);
    +        // This delay must be > EXTRA_PEER_CHECK_INTERVAL (45 seconds) to reproduce the failure
    +        UninterruptibleSleep(std::chrono::seconds{60});
     
             do {
                 const int64_t load_block_index_start_time = GetTimeMillis();
    @@ -1789,8 +1792,6 @@ bool AppInitMain(NodeContext& node, interfaces::BlockAndHeaderTipInfo* tip_info)
             banman->DumpBanlist();
         }, DUMP_BANS_INTERVAL);
     
    -    if (node.peerman) node.peerman->StartScheduledTasks(*node.scheduler);
    -
     #if HAVE_SYSTEM
         StartupNotify(args);
     #endif
    

    </details>

  31. DrahtBot commented at 3:42 PM on August 4, 2021: member

    <!--cf906140f33d8803c4a75a2196329ecb-->

    🐙 This pull request conflicts with the target branch and needs rebase.

    <sub>Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".</sub>

  32. DrahtBot added the label Needs rebase on Aug 4, 2021
  33. MarcoFalke merged this on Aug 4, 2021
  34. MarcoFalke closed this on Aug 4, 2021

  35. sidhujag referenced this in commit 157542f679 on Aug 4, 2021
  36. theStack commented at 2:51 PM on August 13, 2021: member

    post-merge ACK 703b1e612a4bd4521e20ae21eb8fb7c19f4ef942

    Nice find! Could reproduce the issue fixed with the patch proposed in #22577 (comment)

    assertion "m_active_chainstate" failed: file "validation.cpp", line 4968, function "ActiveChainstate"
    Abort trap (core dumped)
    

    After moving back the StartScheduledTasks call to its original place in the PR but keeping the uninterrupted sleep after showing the "Loading block index...", the problem doesn't occur anymore. 👌

  37. luke-jr referenced this in commit 05e84aa550 on Oct 10, 2021
  38. luke-jr referenced this in commit 38eb87367f on Dec 14, 2021
  39. DrahtBot locked this on Aug 18, 2022
  40. LarryRuane deleted the branch on Feb 24, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-27 15:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me