init: change shutdown order of load block thread and scheduler #30435

pull mzumsande wants to merge 1 commits into bitcoin:master from mzumsande:202407_shutdown_order changing 1 files +3 −2
  1. mzumsande commented at 5:21 AM on July 12, 2024: contributor

    This avoids situations during a reindex, in which the shutdown doesn't finish since LimitValidationInterfaceQueue() is called by the load block thread when the scheduler is already stopped, in which case it would block indefinitely. This can lead to intermittent failures in feature_reindex.py (#30424), which I could locally reproduce with

    diff --git a/src/validation.cpp b/src/validation.cpp
    index 74f0e4975c..be1706fdaf 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -3446,6 +3446,7 @@ static void LimitValidationInterfaceQueue(ValidationSignals& signals) LOCKS_EXCL
         AssertLockNotHeld(cs_main);
     
         if (signals.CallbacksPending() > 10) {
    +        std::this_thread::sleep_for(std::chrono::milliseconds(50));
             signals.SyncWithValidationInterfaceQueue();
         }
     }
    

    It has also been reported by users running reindex-chainstate (#23234).

    I thought for a bit about potential downsides of changing this order, but couldn't find any.

    Fixes #30424 Fixes #23234

  2. DrahtBot commented at 5:21 AM on July 12, 2024: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK maflcko, hebasto, tdb3, BrandonOdiwuor
    Stale ACK TheCharlatan

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #29432 (Stratum v2 Template Provider (take 3) by Sjors)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. TheCharlatan approved
  4. TheCharlatan commented at 6:45 AM on July 12, 2024: contributor

    ACK e427fed82f7931ae6f09a4939e0fcd6cb235ef0d

  5. in src/init.cpp:301 in e427fed82f outdated
     296 | @@ -297,9 +297,10 @@ void Shutdown(NodeContext& node)
     297 |      StopTorControl();
     298 |  
     299 |      // After everything has been shut down, but before things get flushed, stop the
     300 | -    // scheduler and load block thread.
     301 | -    if (node.scheduler) node.scheduler->stop();
     302 | +    // load block thread and the scheduler. Load block is stopped first, since it can
     303 | +    // call LimitValidationInterfaceQueue() which would block indefinitely without the scheduler.
    


    maflcko commented at 6:59 AM on July 12, 2024:

    I don't think this deadlock is limited to LimitValidationInterfaceQueue, but affects any thread that may call SyncWithValidationInterfaceQueue.

    So before the scheduler is stopped, all threads in the context of rpc, p2p, indexes must have been stopped. Also chainmans' restart_indexes and ABC must not be called afterwards. In the code below this line, chainman should only flush the chainstate, which should be fine, as it does not call SyncWithValidationInterfaceQueue.

    (IPC must also be stopped, e.g. src/node/interfaces.cpp- void waitForNotificationsIfTipChanged calls SyncWithValidationInterfaceQueue, but I think IPC stuff is unrelated to the changes here and can be done in a follow-up)


    mzumsande commented at 3:51 PM on July 12, 2024:

    True. I changed the wording of the comment / commit message, referring to SyncWithValidationInterfaceQueue instead of LimitValidationInterfaceQueue.

  6. maflcko commented at 7:00 AM on July 12, 2024: member

    lgtm ACK e427fed82f7931ae6f09a4939e0fcd6cb235ef0d

    But I think the comment can be improved to clarify that this is done in symmetry with the other threads (rpc, p2p, ...)

  7. maflcko commented at 7:09 AM on July 12, 2024: member

    I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

  8. hebasto approved
  9. hebasto commented at 9:49 AM on July 12, 2024: member

    ACK e427fed82f7931ae6f09a4939e0fcd6cb235ef0d, the change looks correct and it indeed fixes the issue.

  10. fanquake added the label Needs backport (27.x) on Jul 12, 2024
  11. init: change shutdown order of load block thread and scheduler
    This avoids situations during a reindex in which shutdown
    doesn't finish since SyncWithValidationInterfaceQueue is
    called by the load block thread when the scheduler is already stopped.
    5fd4836019
  12. mzumsande force-pushed on Jul 12, 2024
  13. mzumsande commented at 3:53 PM on July 12, 2024: contributor

    e427fed to 5fd4836: reworded comment

    I guess it should be backported to 27.x? (My understanding is that this existed "forever", since 0.16.x, because SyncWithValidationInterfaceQueue never had a boost interruption point, or other interrupt check)

    yes, this has existed for a long time, #23234 was opened in 2021.

  14. maflcko commented at 3:53 PM on July 12, 2024: member

    review ACK 5fd48360198d2ac49e43b24cc1469557b03567b8

  15. DrahtBot requested review from hebasto on Jul 12, 2024
  16. DrahtBot requested review from TheCharlatan on Jul 12, 2024
  17. furszy commented at 7:28 PM on July 12, 2024: member

    q: what about disallowing the blocking-wait after stopping the scheduler? maybe only on debug mode. e.g. implementing an isActive method in the task runner and calling it prior to creating the promise. It would help us catch these type of errors (if we still have them).

  18. mzumsande commented at 8:41 PM on July 12, 2024: contributor

    It would help us catch these type of errors (if we still have them).

    Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.

    An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later FlushBackgroundCallbacks() call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment) . Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

  19. hebasto approved
  20. hebasto commented at 9:24 PM on July 12, 2024: member

    re-ACK 5fd48360198d2ac49e43b24cc1469557b03567b8.

  21. tdb3 approved
  22. tdb3 commented at 10:45 PM on July 13, 2024: contributor

    ACK 5fd48360198d2ac49e43b24cc1469557b03567b8 Nice work. Had to use a higher sleep than 50ms to reproduce the error on my local machine (used 150ms).

  23. furszy commented at 2:45 PM on July 15, 2024: member

    It would help us catch these type of errors (if we still have them).

    Do you mean throwing an assert instead of blocking indefinitely? That might be more convenient than hanging indefinitely, but I'm not sure if this really makes much of a difference in practice, because both ways should be easily recognizable both by users and tests.

    Yes, check if the scheduler is processing events and if not: log something + throw an error. I'm unsure regular users can detect and report which thread hanged here.

    An alternative approach would be to allow it - just return instead of waiting forever, if we are in Shutdown mode and rely on a later FlushBackgroundCallbacks() call from the shutdown thread cleaning everything up. Something like this has been proposed in #23234 (comment) . Not sure if I prefer it to this approach - even though it would make the Shutdown code less brittle, it doesn't seem ideal to me if callers cannot be sure that SyncWithValidationInterfaceQueue always does what it's supposed to do.

    I'm not fan of that approach neither. Would prefer to actually do what the function is suppose to be doing (wait for an empty events queue) and actively process events in the caller thread - if we decide to go down this route -:

    void ValidationSignals::SyncWithValidationInterfaceQueue()
    {
        AssertLockNotHeld(cs_main);
        if (m_internals->m_task_runner->CanProcessEvents()) {
            // Block until the validation queue drains
            std::promise<void> promise;
            CallFunctionInValidationInterfaceQueue([&promise] {
                promise.set_value();
            });
            promise.get_future().wait();
        } else {
            // Process all remaining events in this thread
            FlushBackgroundCallbacks();
        }
    }
    
  24. BrandonOdiwuor approved
  25. BrandonOdiwuor commented at 4:03 PM on July 16, 2024: contributor

    Code Review ACK 5fd48360198d2ac49e43b24cc1469557b03567b8

  26. fanquake merged this on Jul 16, 2024
  27. fanquake closed this on Jul 16, 2024

  28. mzumsande deleted the branch on Jul 16, 2024
  29. fanquake referenced this in commit 67ca13aa8d on Jul 17, 2024
  30. fanquake referenced this in commit 05192ba84c on Jul 17, 2024
  31. fanquake removed the label Needs backport (27.x) on Jul 17, 2024
  32. fanquake commented at 10:32 AM on July 17, 2024: member

    Backported to 27.x in #30467.

  33. fanquake referenced this in commit 0cbdc6b380 on Jul 24, 2024
  34. luke-jr referenced this in commit 26c4e2097d on Aug 1, 2024
  35. tomt1664 referenced this in commit d01f87c014 on Apr 1, 2025
  36. tomt1664 referenced this in commit 6cc58f98ac on Apr 21, 2025
  37. tomt1664 referenced this in commit 43e3228d2c on Apr 29, 2025
  38. tomt1664 referenced this in commit 7eeafbbf11 on Jun 16, 2025
  39. tomt1664 referenced this in commit 8614ff46c9 on Jun 24, 2025
  40. bitcoin locked this on Jul 18, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-17 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me