fixes for race conditions on disconnects #249

pull ryanofsky wants to merge 8 commits into bitcoin-core:master from ryanofsky:pr/disraces changing 4 files +252 −120
  1. ryanofsky commented at 3:50 am on March 11, 2026: collaborator
  2. DrahtBot commented at 3:50 am on March 11, 2026: none

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK Sjors
    Stale ACK ismaelsadeeq

    If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    • acquires the it -> acquires it [Typo/misplaced “the” makes the sentence harder to read.]

    2026-03-26 13:27:54

  3. ryanofsky commented at 3:06 pm on March 11, 2026: collaborator
    Closing this, replaced by #250!
  4. ryanofsky closed this on Mar 11, 2026

  5. ryanofsky reopened this on Mar 12, 2026

  6. ryanofsky force-pushed on Mar 12, 2026
  7. ryanofsky commented at 5:56 pm on March 12, 2026: collaborator

    Updated 9536b6334ee3a5701d0f9adf813f58a8e6fdef96 -> 884c8469bc175f2282be996d9a6d410f7181649c (pr/disraces.1 -> pr/disraces.2, compare) incorporating changes from #250 and making a lot of test improvements: the tests are now more reliable and better documented, and some bugs are fixed. Also rearranged commits so tests are added before fixes (otherwise it was difficult to revert the fixes because of merge conflicts).

    Updated 884c8469bc175f2282be996d9a6d410f7181649c -> 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c (pr/disraces.2 -> pr/disraces.3, compare) to fix iwyu error https://github.com/bitcoin-core/libmultiprocess/actions/runs/23016144578

  8. ryanofsky force-pushed on Mar 12, 2026
  9. Sjors commented at 10:17 am on March 13, 2026: member

    ACK 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c

    I checked that the fixes themselves are still the same as when I last looked (https://github.com/bitcoin-core/libmultiprocess/pull/250#issuecomment-4045668979). Since the tests here precede their fixes, it was also easy to confirm that each test actually caught something and the fix made it go away. I also lightly checked that they failed for the right reasons.

    The improved test code looks good to me as well.

    CI passed on Bitcoin Core. Our new TSan Bitcoin Core job passed too on #257, but it might be good to manually restart it a couple of times.

    I’ll let the TSan job run locally for a while to see if it finds anything.

  10. in test/mp/test/test.cpp:341 in 2fb97e8cca
    336+    // Regression test for bitcoin/bitcoin#34711, bitcoin/bitcoin#34756
    337+    // where worker thread is destroyed before it starts.
    338+    //
    339+    // The test works by using the `makethread` hook to start a disconnect as
    340+    // soon as ProxyServer<ThreadMap>::makeThread is called, and using the
    341+    // `makethread_created` hook to sleep 100ms after the thread is created but
    


    Sjors commented at 10:18 am on March 13, 2026:
    In 88cacd4239fc3fd147b22c2ad3091c5123c3285f test: worker thread destroyed before it is initialized: nit, it’s only waiting 10ms

    ryanofsky commented at 5:11 pm on March 25, 2026:

    In 88cacd4 test: worker thread destroyed before it is initialized: nit, it’s only waiting 10ms

    Thanks, fixed

  11. Sjors commented at 3:38 pm on March 13, 2026: member

    I’ll let the TSan job run locally for a while to see if it finds anything.

    It’s been running for well over an hour now without hitting anything.

  12. ismaelsadeeq commented at 1:21 pm on March 17, 2026: member
    Concept ACK
  13. in src/mp/proxy.cpp:422 in 88cacd4239
    418-        g_thread_context.thread_name = ThreadName(m_connection.m_loop->m_exe_name) + " (from " + from + ")";
    419+        EventLoop& loop{*m_connection.m_loop};
    420+        g_thread_context.thread_name = ThreadName(loop.m_exe_name) + " (from " + from + ")";
    421         g_thread_context.waiter = std::make_unique<Waiter>();
    422         thread_context.set_value(&g_thread_context);
    423+        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
    


    ismaelsadeeq commented at 9:40 am on March 18, 2026:

    In “test: worker thread destroyed before it is initialized” 88cacd4239fc3fd147b22c2ad3091c5123c3285f

    Why not initialize the loop at the top of the method and capture it in the thread? That would make the first if statement less verbose and more readable.

    0    EventLoop& loop{*m_connection.m_loop};
    1    if (loop.testing_hook_makethread) loop.testing_hook_makethread();
    2    const std::string from = context.getParams().getName();
    3    std::promise<ThreadContext*> thread_context;
    4    std::thread thread([&thread_context, &loop, from, this]() {
    5        g_thread_context.thread_name = ThreadName(loop.m_exe_name) + " (from " + from + ")";
    6        g_thread_context.waiter = std::make_unique<Waiter>();
    7        thread_context.set_value(&g_thread_context);
    8        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
    

    ryanofsky commented at 5:27 pm on March 25, 2026:

    re: #249 (review)

    Why not initialize the loop at the top of the method and capture it in the thread? That would make the first if statement less verbose and more readable.

    Nice suggestion. Applied this.

  14. in test/mp/test/test.cpp:64 in 88cacd4239 outdated
    63@@ -63,6 +64,7 @@ class TestSetup
    64 {
    


    ismaelsadeeq commented at 9:50 am on March 18, 2026:

    In “test: worker thread destroyed before it is initialized” 88cacd4239fc3fd147b22c2ad3091c5123c3285f

    Comment can be adjusted to reflect the disconnect_later addition

    0- * Provides client_disconnect and server_disconnect lambdas that can be used to
    1- * trigger disconnects and test handling of broken and closed connections.
    2+ * Provides disconnection lambdas that can be used to trigger
    3+ * disconnects and test handling of broken and closed connections.
    4  *
    

    ryanofsky commented at 5:32 pm on March 25, 2026:

    re: #249 (review)

    Comment can be adjusted to reflect the disconnect_later addition

    Good catch, applied update

  15. sedited approved
  16. sedited commented at 10:06 am on March 18, 2026: collaborator

    lgtm

    Changes make sense to me, and I’m not seeing the same failures anymore after testing this out a bit, though as Sjors mentioned, hitting them can be flaky.

  17. in src/mp/proxy.cpp:414 in f09731e242


    ismaelsadeeq commented at 12:32 pm on March 18, 2026:

    In “race fix: m_on_cancel called after request finishes” f09731e242f30d92873fb332419f162ece253328

    IIUC, the reason there is a race is that we do not lock before setting the thread context on the thread context promise. If a disconnect occurs in the interval between setting the thread context and acquiring the lock, the client drops the connection and ProxyServer is destroyed; the worker thread then tries to acquire the lock and access the waiter, which has already been destroyed, causing a SIGSEGV. Acquiring the lock before calling set_value prevents this by ensuring the worker thread holds the mutex before the main thread can proceed. The ProxyServer destructor must acquire the same mutex to signal shutdown, so it blocks until the worker thread is safely inside its wait loop, guaranteeing the waiter is never destroyed while the worker thread is still trying to access it.

    0g_thread_context.waiter = std::make_unique<Waiter>();
    1thread_context.set_value(&g_thread_context);  //  signals main thread
    2//  race window opens here
    3Lock lock(g_thread_context.waiter->m_mutex);  //  worker tries to lock
    

    When set_value is called, the main thread unblocks and proceeds:

    0Main thread                          Worker thread
    1────────────────────────────────     ────────────────────────────────
    2thread_context.get_future().get()    set_value() → main thread wakes
    3creates ProxyServer<Thread>          
    4client drops reference               
    5~ProxyServer<Thread> runs            
    6waiter set to nullptr                
    7waiter destroyed                     Lock(g_thread_context.waiter->m_mutex)
    8                                     ← waiter is gone → SIGSEGV
    

    It will be nice to mention this in the commit message.
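    The handoff described above can be sketched in isolation. This is a minimal standalone illustration with hypothetical stand-in types (not the library's actual Waiter/ProxyServer API): the worker locks the waiter mutex before fulfilling the promise, so the "destructor" path, which must take the same mutex, cannot tear the waiter down until the worker is inside wait().

    ```cpp
    #include <condition_variable>
    #include <cstdio>
    #include <future>
    #include <memory>
    #include <mutex>
    #include <thread>

    // Hypothetical stand-in for the library's Waiter type.
    struct Waiter {
        std::mutex m_mutex;
        std::condition_variable m_cv;
    };

    int main()
    {
        auto waiter = std::make_unique<Waiter>();
        std::promise<void> thread_context;
        bool shutdown = false;

        std::thread worker([&] {
            // Fix: lock BEFORE set_value, so once the main thread is
            // unblocked it cannot destroy the waiter out from under us.
            std::unique_lock lock(waiter->m_mutex);
            thread_context.set_value();
            waiter->m_cv.wait(lock, [&] { return shutdown; });
        });

        thread_context.get_future().get();
        {
            // "Destructor" path: taking the same mutex blocks until the
            // worker is safely inside wait(), closing the race window.
            std::lock_guard lock(waiter->m_mutex);
            shutdown = true;
        }
        waiter->m_cv.notify_one();
        worker.join();
        waiter.reset(); // safe: the worker has exited its wait
        std::puts("ok");
        return 0;
    }
    ```
    
    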


    ryanofsky commented at 5:45 pm on March 25, 2026:

    re: #249 (review)

    IIUC, the reason why there is a race is that we do not lock before setting the thread context to the thread context promise.

    Yes your description is exactly right. I tried to describe the failure in some detail in the regression test but now added more information to the commit message as well including a link to your comment.

  18. in include/mp/type-context.h:128 in e69b6bf3f4
    124@@ -129,6 +125,8 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
    125                             server_context.request_canceled = true;
    126                         };
    127                         // Update requests_threads map if not canceled.
    128+                        const auto& params = call_context.getParams();
    


    ismaelsadeeq commented at 12:53 pm on March 18, 2026:

    In e69b6bf3f4ef6ba45606cf85fdb44cc4607c021e race fix: getParams() called after request cancel

    Perhaps add a comment here (also trying to test my understanding of the fix)

    0// It is safe to call getParams() here because this code runs inside sync(),
    1// which executes on the event loop thread. Even if cancellation is initiated
    2// by the client at this point, the params structs will only be freed after
    3// sync() returns and the event loop resumes processing the cancellation.
    

    ryanofsky commented at 6:01 pm on March 25, 2026:

    re: #249 (review)

    In e69b6bf race fix: getParams() called after request cancel

    Thanks! Added a similar comment. Your comment is mostly right but I would want to mention that it is safe to call getParams() not just because the call is executing on the event loop thread, but also because cancel_monitor.m_canceled is false.

  19. ismaelsadeeq commented at 12:56 pm on March 18, 2026: member

    Code review ACK 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c

    • 88cacd4239fc3fd147b22c2ad3091c5123c3285f “test: worker thread destroyed before it is initialized”
    • f09731e242f30d92873fb332419f162ece253328 “race fix: m_on_cancel called after request finishes”
    • 75c5425173fb517755252ff20b4af8d1342972db “test: getParams() called after request cancel”
    • e69b6bf3f4ef6ba45606cf85fdb44cc4607c021e “race fix: getParams() called after request cancel”
    • 846a43aafb439e31aa4954e4be3e0920c2466877 “test: m_on_cancel called after request finishes”
    • 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c “race fix: m_on_cancel called after request finishes”

    The current approach of test hooks in the event loop with branching is a bit awkward imo. A potentially better approach could be to have virtual notification methods in the event loop that are triggered at points in the code that tests are interested in. In those code paths, the event loop simply invokes the notifications unconditionally. By default, the implementations of those methods are cheap no-ops: no branching, no null checks. In tests, we then subclass EventLoop and override those methods to inject test-specific behaviour, keeping test concerns entirely out of production code.

      0diff --git a/include/mp/proxy-io.h b/include/mp/proxy-io.h
      1index 3594708..5174fe8 100644
      2--- a/include/mp/proxy-io.h
      3+++ b/include/mp/proxy-io.h
      4@@ -251,7 +251,7 @@ public:
      5                 LogFn{[old_callback = std::move(old_callback)](LogMessage log_data) {old_callback(log_data.level == Log::Raise, std::move(log_data.message));}},
      6                 context){}
      7
      8-    ~EventLoop();
      9+    virtual ~EventLoop();
     10
     11     //! Run event loop. Does not return until shutdown. This should only be
     12     //! called once from the m_thread_id thread. This will block until
     13@@ -341,21 +341,22 @@ public:
     14     //! External context pointer.
     15     void* m_context;
     16
     17-    //! Hook called when ProxyServer<ThreadMap>::makeThread() is called.
     18-    std::function<void()> testing_hook_makethread;
     19+    //! Virtual method called on the event loop thread when ProxyServer<ThreadMap>::makeThread()
     20+    //! is called to create a new remote worker thread.
     21+    virtual void onThreadCreate() { /* Nothing to notify by default. */ }
     22
     23-    //! Hook called on the worker thread inside makeThread(), after the thread
     24-    //! context is set up and thread_context promise is fulfilled, but before it
     25-    //! starts waiting for requests.
     26-    std::function<void()> testing_hook_makethread_created;
     27+    //! Virtual method called on a new worker thread after the thread context
     28+    //! is set up and the thread_context promise is fulfilled, but before the
     29+    //! thread starts waiting for requests.
     30+    virtual void onThreadCreated() { /* Nothing to notify by default. */ }
     31
     32-    //! Hook called on the worker thread when it starts to execute an async
     33-    //! request. Used by tests to control timing or inject behavior at this
     34-    //! point in execution.
     35-    std::function<void()> testing_hook_async_request_start;
     36+    //! Virtual method called on a worker thread when it starts to execute an
     37+    //! asynchronous request.
     38+    virtual void onAsyncRequestStart() { /* Nothing to notify by default. */ }
     39
     40-    //! Hook called on the worker thread just before returning results.
     41-    std::function<void()> testing_hook_async_request_done;
     42+    //! Virtual method called on a worker thread after an asynchronous request
     43+    //! finishes executing, but before the response is sent.
     44+    virtual void onAsyncRequestDone() { /* Nothing to notify by default. */ }
     45 };
     46
     47 //! Single element task queue used to handle recursive capnp calls. (If the
     48diff --git a/include/mp/type-context.h b/include/mp/type-context.h
     49index 9c7f21b..12218c3 100644
     50--- a/include/mp/type-context.h
     51+++ b/include/mp/type-context.h
     52@@ -73,7 +73,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     53     auto invoke = [self = kj::mv(self), call_context = kj::mv(server_context.call_context), &server, req, fn, args...](CancelMonitor& cancel_monitor) mutable {
     54                 MP_LOG(*server.m_context.loop, Log::Debug) << "IPC server executing request #" << req;
     55                 EventLoop& loop = *server.m_context.loop;
     56-                if (loop.testing_hook_async_request_start) loop.testing_hook_async_request_start();
     57+                loop.onAsyncRequestStart();
     58                 ServerContext server_context{server, call_context, req};
     59                 {
     60                     // Before invoking the function, store a reference to the
     61@@ -192,7 +192,7 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
     62                     }
     63                     // End of scope: if KJ_DEFER was reached, it runs here
     64                 }
     65-                if (loop.testing_hook_async_request_done) loop.testing_hook_async_request_done();
     66+                loop.onAsyncRequestDone();
     67                 return call_context;
     68             };
     69
     70diff --git a/src/mp/proxy.cpp b/src/mp/proxy.cpp
     71index f36e19f..0c9d146 100644
     72--- a/src/mp/proxy.cpp
     73+++ b/src/mp/proxy.cpp
     74@@ -411,7 +411,7 @@ ProxyServer<ThreadMap>::ProxyServer(Connection& connection) : m_connection(conne
     75
     76 kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
     77 {
     78-    if (m_connection.m_loop->testing_hook_makethread) m_connection.m_loop->testing_hook_makethread();
     79+    m_connection.m_loop->onThreadCreate();
     80     const std::string from = context.getParams().getName();
     81     std::promise<ThreadContext*> thread_context;
     82     std::thread thread([&thread_context, from, this]() {
     83@@ -420,7 +420,7 @@ kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
     84         g_thread_context.waiter = std::make_unique<Waiter>();
     85         Lock lock(g_thread_context.waiter->m_mutex);
     86         thread_context.set_value(&g_thread_context);
     87-        if (loop.testing_hook_makethread_created) loop.testing_hook_makethread_created();
     88+        loop.onThreadCreated();
     89         // Wait for shutdown signal from ProxyServer<Thread> destructor (signal
     90         // is just waiter getting set to null.)
     91         g_thread_context.waiter->wait(lock, [] { return !g_thread_context.waiter; });
     92diff --git a/test/mp/test/test.cpp b/test/mp/test/test.cpp
     93index 4f71a55..d3dd6d8 100644
     94--- a/test/mp/test/test.cpp
     95+++ b/test/mp/test/test.cpp
     96@@ -60,6 +60,24 @@ static_assert(std::is_integral_v<decltype(kMP_MINOR_VERSION)>, "MP_MINOR_VERSION
     97  * client_disconnect manually, but false allows testing more ProxyClient
     98  * behavior and the "IPC client method called after disconnect" code path.
     99  */
    100+
    101+//! EventLoop subclass used by tests to override virtual methods and inject
    102+//! test-specific behavior.
    103+class TestEventLoop : public EventLoop
    104+{
    105+public:
    106+    using EventLoop::EventLoop;
    107+    std::function<void()> on_thread_create;
    108+    std::function<void()> on_thread_created;
    109+    std::function<void()> on_async_request_start;
    110+    std::function<void()> on_async_request_done;
    111+
    112+    void onThreadCreate() override { if (on_thread_create) on_thread_create(); }
    113+    void onThreadCreated() override { if (on_thread_created) on_thread_created(); }
    114+    void onAsyncRequestStart() override { if (on_async_request_start) on_async_request_start(); }
    115+    void onAsyncRequestDone() override { if (on_async_request_done) on_async_request_done(); }
    116+};
    117+
    118 class TestSetup
    119 {
    120 public:
    121@@ -69,17 +87,19 @@ public:
    122     std::promise<std::unique_ptr<ProxyClient<messages::FooInterface>>> client_promise;
    123     std::unique_ptr<ProxyClient<messages::FooInterface>> client;
    124     ProxyServer<messages::FooInterface>* server{nullptr};
    125+    TestEventLoop* test_loop{nullptr};
    126     //! Thread variable should be after other struct members so the thread does
    127     //! not start until the other members are initialized.
    128     std::thread thread;
    129
    130     TestSetup(bool client_owns_connection = true)
    131-        : thread{[&] {
    132-              EventLoop loop("mptest", [](mp::LogMessage log) {
    133+        : thread{[&, client_owns_connection] {
    134+              TestEventLoop loop("mptest", [](mp::LogMessage log) {
    135                   // Info logs are not printed by default, but will be shown with `mptest --verbose`
    136                   KJ_LOG(INFO, log.level, log.message);
    137                   if (log.level == mp::Log::Raise) throw std::runtime_error(log.message);
    138               });
    139+              test_loop = &loop;
    140               auto pipe = loop.m_io_context.provider->newTwoWayPipe();
    141
    142               auto server_connection =
    143@@ -336,31 +356,31 @@ KJ_TEST("Worker thread destroyed before it is initialized")
    144     // Regression test for bitcoin/bitcoin#34711, bitcoin/bitcoin#34756
    145     // where worker thread is destroyed before it starts.
    146     //
    147-    // The test works by using the `makethread` hook to start a disconnect as
    148-    // soon as ProxyServer<ThreadMap>::makeThread is called, and using the
    149-    // `makethread_created` hook to sleep 100ms after the thread is created but
    150-    // before it starts waiting, so without the bugfix,
    151+    // The test works by overriding the `onThreadCreate` method to start a
    152+    // disconnect as soon as ProxyServer<ThreadMap>::makeThread is called, and
    153+    // overriding the `onThreadCreated` method to sleep after the thread is
    154+    // created but before it starts waiting, so without the bugfix,
    155     // ProxyServer<Thread>::~ProxyServer would run and destroy the waiter,
    156     // causing a SIGSEGV in the worker thread after the sleep.
    157-    TestSetup setup;
    158-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
    159-    foo->initThreadMap();
    160-    setup.server->m_impl->m_fn = [] {};
    161-
    162-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
    163-    loop.testing_hook_makethread = [&] {
    164-        setup.server_disconnect_later();
    165-    };
    166-    loop.testing_hook_makethread_created = [&] {
    167-        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    168-    };
    169-
    170     bool disconnected{false};
    171-    try {
    172-        foo->callFnAsync();
    173-    } catch (const std::runtime_error& e) {
    174-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    175-        disconnected = true;
    176+    {
    177+        TestSetup setup;
    178+        setup.test_loop->on_thread_create = [&] {
    179+            setup.server_disconnect_later();
    180+        };
    181+        setup.test_loop->on_thread_created = [&] {
    182+            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    183+        };
    184+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
    185+        foo->initThreadMap();
    186+        setup.server->m_impl->m_fn = [] {};
    187+
    188+        try {
    189+            foo->callFnAsync();
    190+        } catch (const std::runtime_error& e) {
    191+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    192+            disconnected = true;
    193+        }
    194     }
    195     KJ_EXPECT(disconnected);
    196 }
    197@@ -370,29 +390,31 @@ KJ_TEST("Calling async IPC method, with server disconnect racing the call")
    198     // Regression test for bitcoin/bitcoin#34777 heap-use-after-free where
    199     // an async request is canceled before it starts to execute.
    200     //
    201-    // Use testing_hook_async_request_start to trigger a disconnect from the
    202+    // Override onAsyncRequestStart to trigger a disconnect from the
    203     // worker thread as soon as it begins to execute an async request. Without
    204     // the bugfix, the worker thread would trigger a SIGSEGV after this by
    205     // calling call_context.getParams().
    206-    TestSetup setup;
    207-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
    208-    foo->initThreadMap();
    209-    setup.server->m_impl->m_fn = [] {};
    210-
    211-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
    212-    loop.testing_hook_async_request_start = [&] {
    213-        setup.server_disconnect();
    214-        // Sleep is neccessary to let the event loop fully clean up after the
    215-        // disconnect and trigger the SIGSEGV.
    216-        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    217-    };
    218-
    219-    try {
    220-        foo->callFnAsync();
    221-        KJ_EXPECT(false);
    222-    } catch (const std::runtime_error& e) {
    223-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    224+    bool disconnected{false};
    225+    {
    226+        TestSetup setup;
    227+        setup.test_loop->on_async_request_start = [&] {
    228+            setup.server_disconnect();
    229+            // Sleep is neccessary to let the event loop fully clean up after the
    230+            // disconnect and trigger the SIGSEGV.
    231+            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    232+        };
    233+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
    234+        foo->initThreadMap();
    235+        setup.server->m_impl->m_fn = [] {};
    236+
    237+        try {
    238+            foo->callFnAsync();
    239+        } catch (const std::runtime_error& e) {
    240+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    241+            disconnected = true;
    242+        }
    243     }
    244+    KJ_EXPECT(disconnected);
    245 }
    246
    247 KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
    248@@ -401,27 +423,29 @@ KJ_TEST("Calling async IPC method, with server disconnect after cleanup")
    249     // an async request is canceled after it finishes executing but before the
    250     // response is sent.
    251     //
    252-    // Use testing_hook_async_request_done to trigger a disconnect from the
    253+    // Override onAsyncRequestDone to trigger a disconnect from the
    254     // worker thread after it execute an async requests but before it returns.
    255     // Without the bugfix, the m_on_cancel callback would be called at this
    256     // point accessing the cancel_mutex stack variable that had gone out of
    257     // scope.
    258-    TestSetup setup;
    259-    ProxyClient<messages::FooInterface>* foo = setup.client.get();
    260-    foo->initThreadMap();
    261-    setup.server->m_impl->m_fn = [] {};
    262-
    263-    EventLoop& loop = *setup.server->m_context.connection->m_loop;
    264-    loop.testing_hook_async_request_done = [&] {
    265-        setup.server_disconnect();
    266-    };
    267-
    268-    try {
    269-        foo->callFnAsync();
    270-        KJ_EXPECT(false);
    271-    } catch (const std::runtime_error& e) {
    272-        KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    273+    bool disconnected{false};
    274+    {
    275+        TestSetup setup;
    276+        setup.test_loop->on_async_request_done = [&] {
    277+            setup.server_disconnect();
    278+        };
    279+        ProxyClient<messages::FooInterface>* foo = setup.client.get();
    280+        foo->initThreadMap();
    281+        setup.server->m_impl->m_fn = [] {};
    282+
    283+        try {
    284+            foo->callFnAsync();
    285+        } catch (const std::runtime_error& e) {
    286+            KJ_EXPECT(std::string_view{e.what()} == "IPC client method call interrupted by disconnect.");
    287+            disconnected = true;
    288+        }
    289     }
    290+    KJ_EXPECT(disconnected);
    291 }
    292
    293 KJ_TEST("Make simultaneous IPC calls on single remote thread")
    
  20. in include/mp/type-context.h:164 in 2fb97e8cca
    159+                            // cancellation happened. So we do not need to be
    160+                            // notified of cancellations after this point. Also
    161+                            // we do not want to be notified because
    162+                            // cancel_mutex and server_context could be out of
    163+                            // scope when it happens.
    164+                            cancel_monitor.m_on_cancel = nullptr;
    


    ismaelsadeeq commented at 1:05 pm on March 18, 2026:

    Gemini wrote a mermaid diagram

    Before 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c

     0sequenceDiagram
     1    participant EL as Event Loop Thread
     2    participant W as Worker Thread
     3    participant CM as CancelMonitor (m_on_cancel)
     4    Note over W: --- PHASE: EXECUTION ---
     5    W->>W: fn.invoke(server_context, cancel_mutex)
     6    Note over W: invocation finished
     7    W->>W: [Pop Stack Frame]
     8    Note right of W: server_context & cancel_mutex are now FREED
     9    Note over EL, W: --- THE RACE WINDOW ---
    10    Note right of EL: Client Disconnects NOW
    11    EL->>CM: Trigger ~CancelProbe()
    12    CM->>CM: Execute m_on_cancel()
    13    CM-->>W: Access freed server_context / cancel_mutex
    14    Note over CM: CRASH: Use-After-Free
    15    W->>EL: m_loop->sync() (Too late!)
    

    After 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c

     0sequenceDiagram
     1    participant EL as Event Loop Thread
     2    participant W as Worker Thread
     3    participant CM as CancelMonitor (m_on_cancel)
     4    Note over W: --- PHASE: EXECUTION ---
     5    W->>W: fn.invoke(server_context, cancel_mutex)
     6    Note over W: --- PHASE: CLEANUP ---
     7    W->>EL: m_loop->sync() [START]
     8    Note right of EL: SYNC BARRIER (On Event Loop)
     9    EL->>CM: m_on_cancel = nullptr
    10    Note right of EL: DISARMED: Callback is gone
    11    W->>W: [Pop Stack Frame]
    12    Note right of W: server_context & cancel_mutex are now FREED
    13    Note over EL, W: --- THE DISCONNECT WINDOW ---
    14    EL->>EL: Trigger ~CancelProbe()
    15    EL->>EL: Check m_on_cancel (is NULL)
    16    Note right of EL: SAFE: No callback to trigger
    17    EL-->>W: sync() returns [END]
    18    W->>W: Worker exits safely
    

    ryanofsky commented at 6:28 pm on March 25, 2026:

    re: #249 (review)

    Gemini wrote a mermaid diagram

    Yes, this is right. In the following code:

    https://github.com/bitcoin-core/libmultiprocess/blob/db8f76ad290a18a1b2811232b94cb894fd5dfca5/include/mp/proxy-io.h#L744-L749

    There is a brief window after fn() returns but before cancel_monitor_ptr = nullptr runs where the m_on_cancel callback previously could fire and access stack variables in fn() that were out of scope.
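    The disarm pattern described here can be reduced to a minimal standalone sketch, using a hypothetical CancelMonitor stand-in (not the library's type): clearing m_on_cancel before the stack frame holding the captured state is popped means a late cancellation finds nothing to call.

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <functional>

    // Hypothetical stand-in for the library's cancel-monitor machinery.
    struct CancelMonitor {
        std::function<void()> m_on_cancel;
    };

    int main()
    {
        CancelMonitor monitor;
        int calls = 0;
        {
            int stack_state = 0; // stands in for server_context/cancel_mutex
            monitor.m_on_cancel = [&] { ++stack_state; ++calls; };
            // ... request executes ...
            // Fix: disarm the callback (in the real code, on the event loop
            // thread under sync()) before stack_state goes out of scope.
            monitor.m_on_cancel = nullptr;
        }
        // A cancellation arriving after the frame is gone is now safe:
        // the callback is disarmed, so the freed stack is never touched.
        if (monitor.m_on_cancel) monitor.m_on_cancel();
        assert(calls == 0);
        std::puts("ok");
        return 0;
    }
    ```
    
    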

  21. Sjors commented at 2:07 pm on March 24, 2026: member
    @ryanofsky should we add a “Bitcoin Core v31” milestone or tag? I think only this PR is needed.
  22. ismaelsadeeq commented at 3:22 pm on March 25, 2026: member

    rfm?

    cc @Eunovo

  23. ryanofsky commented at 4:30 pm on March 25, 2026: collaborator

    @ryanofsky should we add a “Bitcoin Core v31” milestone or tag? I think only this PR is needed.

    Hmm, I was thinking this should not be part of v31 because the bugs this fixes only happen if clients disconnect uncleanly or take the odd step of immediately destroying a worker thread after creating it. Also, my understanding is that even those things will not trigger the bugs without unlucky timing, requiring tests to be run hundreds of times or more, and sometimes also needing asan or tsan, or an overloaded system with lots of preemption.

    On the other hand, the fixes in this PR are all pretty simple, just moving some assignments around, so they do seem like fairly safe changes.

    Overall I’d lean toward not merging this into v31 unless there is a specific reason to do that. But I have no strong opinion, I’ve been a little out of the loop, and there’s actually more time than I thought before 31.0 (https://github.com/bitcoin/bitcoin/issues/33607), so it could be a reasonable target.

    In any case, I’ll make some minor updates here and respond to all the review comments and also refresh https://github.com/bitcoin/bitcoin/pull/34804.

  24. Sjors commented at 6:04 pm on March 25, 2026: member

    I agree the fixes aren’t v31 release blocking, but I do think they’re worth including if possible.

    Unclean disconnects are something SRI developers and users are quite likely to encounter because they stack a bunch of applications: node -> template provider -> job declarator -> translator. They can be shut down in any order, or just crash, and that’s not always going to be graceful. We then get bug reports for issues we already know about.

  25. ryanofsky referenced this in commit 5eeb4e6aed on Mar 25, 2026
  26. ryanofsky referenced this in commit 415081af17 on Mar 25, 2026
  27. ryanofsky force-pushed on Mar 25, 2026
  28. ryanofsky referenced this in commit e27077fe5f on Mar 25, 2026
  29. ryanofsky force-pushed on Mar 25, 2026
  30. ryanofsky referenced this in commit c6d60fd646 on Mar 25, 2026
  31. ryanofsky commented at 9:08 pm on March 25, 2026: collaborator

    Updated 2fb97e8cca9feb1df70cf29b2a9895bea2c4c49c -> 25eeb8ef9ee514dad9d044ba7987ec9f22416611 (pr/disraces.3 -> pr/disraces.4, compare) with cleanups and additional refactorings

    Updated 25eeb8ef9ee514dad9d044ba7987ec9f22416611 -> 81c21698d3a67b2b8a0f8d7d65ac2928be939348 (pr/disraces.4 -> pr/disraces.5, compare) to fix unused capture warnings https://github.com/bitcoin-core/libmultiprocess/actions/runs/23562943350/job/68607284028?pr=249


    re: #249#pullrequestreview-3966409562

    A potentially better approach could be to have virtual notification methods in the event loop

    Agree loop.onAsyncRequestStart(); is less awkward than if (loop.testing_hook_async_request_start) loop.testing_hook_async_request_start().

    I think the main downside of using virtual methods is needing to define a TestEventLoop class with boilerplate overriding each virtual method and connecting it to a corresponding std::function. Virtual methods can also make the code more fragile. For example, if someone added real code to the EventLoop::onThreadCreated method, not knowing the method was only intended to be used for testing, this could lead to failures or tests appearing to pass when non-test code was broken. Not using virtual methods also avoids the performance cost of virtual calls when no hooks are defined.

    There could be other ways of making hook call sites less awkward, too. Could define a CallHook helper like CallHook(loop.testing_hook_async_request_start); that skips calling the hook if the function pointer is null.
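    A minimal sketch of such a helper (hypothetical code following the comment above, not part of the library) could look like:

    ```cpp
    #include <functional>
    #include <iostream>
    #include <utility>

    // Hypothetical CallHook helper: invoke a testing hook only if one was
    // installed, so call sites collapse to a single line with no null check.
    template <typename Fn, typename... Args>
    void CallHook(const Fn& hook, Args&&... args)
    {
        if (hook) hook(std::forward<Args>(args)...);
    }

    struct EventLoop {
        // Null by default; tests assign a std::function to observe events.
        std::function<void()> testing_hook_async_request_start;
    };

    int main()
    {
        EventLoop loop;
        CallHook(loop.testing_hook_async_request_start); // no-op: hook unset
        int calls = 0;
        loop.testing_hook_async_request_start = [&] { ++calls; };
        CallHook(loop.testing_hook_async_request_start); // fires once now
        std::cout << calls << "\n";
        return 0;
    }
    ```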

    re: #249 (comment)

    it was also easy to confirm that each test actually caught something

    Hopefully the tests don’t seem too complicated. They just add pauses and disconnects to the call sequence to trigger the reported bugs. It is unclear if the tests will have much long term value for preventing new bugs, but they do add more coverage and can at least prevent the fixes from being accidentally reverted.

  32. Sjors commented at 7:51 am on March 26, 2026: member

    This hit another AssertionError: [node 0] Expected message(s) ['canceled while executing'] not found in log in the TSan job: https://github.com/bitcoin-core/libmultiprocess/actions/runs/23564112620/job/68621668478?pr=249#step:21:1991

    local timeout_factor=10 so this was run before #263 landed.


    e69b6bf3f4ef6ba45606cf85fdb44cc4607c021e -> 8003d822cd: the commit is now limited to just the fix while the new 7265b492c60a366222d6ecafb11db1148028ed66 does a bigger cleanup.

    I checked again that the test (fails) -> fix (passes) pattern still works.

  33. ryanofsky referenced this in commit 1c6505bdff on Mar 26, 2026
  34. ryanofsky force-pushed on Mar 26, 2026
  35. ryanofsky referenced this in commit cb57b3a6ec on Mar 26, 2026
  36. ryanofsky referenced this in commit 32bc4ca644 on Mar 26, 2026
  37. ryanofsky referenced this in commit c90ae7025c on Mar 26, 2026
  38. ryanofsky commented at 9:51 am on March 26, 2026: collaborator

    Rebased 81c21698d3a67b2b8a0f8d7d65ac2928be939348 -> 396473f8d5480805c18849513d06cd559bb1cb8f (pr/disraces.5 -> pr/disraces.6, compare) with no changes after #263, to trigger a CI run and see if interface_ipc.py is still hanging waiting for ‘canceled while executing’ messages (https://github.com/bitcoin-core/libmultiprocess/actions/runs/23564112620/job/68621668478?pr=249#step:21:1991)

    Rebased 396473f8d5480805c18849513d06cd559bb1cb8f -> ff0eed1bf183627c89007e71d631f039bd61bfb5 (pr/disraces.6 -> pr/disraces.7, compare) with no changes after #264, to trigger a CI run and see if “exit code 143” errors are fixed (https://github.com/bitcoin-core/libmultiprocess/actions/runs/23587830246)

  39. test: worker thread destroyed before it is initialized
    Add test for race condition in makeThread that can currently trigger segfaults
    as reported:
    
    https://github.com/bitcoin/bitcoin/issues/34711
    https://github.com/bitcoin/bitcoin/issues/34756
    
    The test currently crashes and will be fixed in the next commit.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    
    git-bisect-skip: yes
    a1d643348f
  40. race fix: worker thread destroyed before it is initialized
    This fixes a race condition in makeThread that can currently trigger segfaults
    as reported:
    
    https://github.com/bitcoin/bitcoin/issues/34711
    https://github.com/bitcoin/bitcoin/issues/34756
    
    The problem is a segfault in ProxyServer<ThreadMap>::makeThread calling
    `Lock lock(g_thread_context.waiter->m_mutex);` that happens because the
    waiter pointer is null. The waiter pointer can be null if the worker
    thread is destroyed immediately after it is created, because
    `~ProxyServer<Thread>` sets it to null.
    
    The fix works by moving the lock line above the
    `thread_context.set_value()` line so the worker thread can't be
    destroyed before it is fully initialized.
    
    A more detailed description of the bug and fix can be found in
    https://github.com/bitcoin-core/libmultiprocess/pull/249#discussion_r2953134565
    
    The bug can be reproduced by running the unit test added in the previous commit
    or by calling makeThread and immediately disconnecting or destroying the
    returned thread. The bug is not new and has existed since makeThread was
    implemented, but it was found due to a new functional test in bitcoin core and
    with antithesis testing (see details in linked issues).
    
    The fix was originally posted in
    https://github.com/bitcoin/bitcoin/issues/34711#issuecomment-3986380070
    f11ec29ed2
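    The ordering the fix relies on can be sketched in isolation (hypothetical names: `ThreadContext` and the promise stand in for the real `thread_context.set_value()` handshake). Taking the mutex before publishing the context means teardown, which needs the same mutex, cannot run until initialization under the lock completes:

    ```cpp
    #include <cassert>
    #include <future>
    #include <iostream>
    #include <mutex>
    #include <thread>

    // Standalone sketch of the ordering fix, not the actual makeThread code.
    // The worker must take the waiter's mutex BEFORE publishing its context;
    // otherwise the client could destroy the waiter (nulling the pointer)
    // between set_value() and the lock.
    struct ThreadContext {
        std::mutex* waiter_mutex = nullptr; // nulled on destruction in the real bug
    };

    int main()
    {
        std::mutex mutex;
        ThreadContext context{&mutex};
        std::promise<ThreadContext*> promise;

        std::thread worker([&] {
            // Fix: lock first, so teardown cannot interleave with setup.
            std::lock_guard<std::mutex> lock(*context.waiter_mutex);
            promise.set_value(&context);
            // ... initialization that relies on waiter_mutex staying valid ...
        });

        ThreadContext* published = promise.get_future().get();
        {
            // Client-side teardown blocks on the same mutex, so it cannot
            // run while the worker still holds it.
            std::lock_guard<std::mutex> lock(mutex);
            published->waiter_mutex = nullptr; // simulate ~ProxyServer<Thread>
        }
        worker.join();
        assert(published->waiter_mutex == nullptr);
        std::cout << "ok\n";
        return 0;
    }
    ```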
  41. test: getParams() called after request cancel
    Add test for disconnect race condition in the mp.Context PassField() overload
    that can currently trigger segfaults as reported in
    https://github.com/bitcoin/bitcoin/issues/34777.
    
    The test currently crashes and will be fixed in the next commit.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    
    git-bisect-skip: yes
    4a60c39f24
  42. race fix: getParams() called after request cancel
    This fixes a race condition in the mp.Context PassField() overload (used to
    execute async requests) that can currently trigger segfaults when it calls
    call_context.getParams() after a disconnect, as reported in
    https://github.com/bitcoin/bitcoin/issues/34777.
    
    The bug can be reproduced by running the unit test added in the previous commit
    and was also seen in antithesis (see details in linked issue), but should be
    unlikely to happen normally because PassField checks for cancellation and
    returns early before actually using the getParams() result.
    
    This bug was introduced in commit
    0174450ca2e95a4bd1f22e4fd38d83b1d432ac1f, which started to cancel
    requests on disconnects. Before that commit, requests were not canceled
    and would continue to execute (Cap'n Proto would just discard the
    responses) so it was ok to call getParams().
    
    This fix was originally posted in
    https://github.com/bitcoin/bitcoin/issues/34777#issuecomment-4024285314
    f5509a31fc
  43. test: m_on_cancel called after request finishes
    Add test for a disconnect race condition in the mp.Context PassField() overload
    reported in https://github.com/bitcoin/bitcoin/issues/34782.
    
    The test currently crashes with AddressSanitizer, but will be fixed in the next
    commit. It's also possible to reproduce the bug without AddressSanitizer by
    adding an assert:
    
    ```diff
    --- a/include/mp/type-context.h
    +++ b/include/mp/type-context.h
    @@ -101,2 +101,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                         server_context.cancel_lock = &cancel_lock;
    +                    KJ_DEFER(server_context.cancel_lock = nullptr);
                         server.m_context.loop->sync([&] {
    @@ -111,2 +112,3 @@ auto PassField(Priority<1>, TypeList<>, ServerContext& server_context, const Fn&
                                 MP_LOG(*server.m_context.loop, Log::Info) << "IPC server request #" << req << " canceled while executing.";
    +                            assert(server_context.cancel_lock);
                                 // Lock cancel_mutex here to block the event loop
    ```
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    
    git-bisect-skip: yes
    1643d05ba0
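    For readers unfamiliar with it, `KJ_DEFER` in the diff above runs a statement when the enclosing scope exits. A minimal scope-guard equivalent (hypothetical sketch, not the kj implementation) shows why it is a good fit for clearing `cancel_lock`:

    ```cpp
    #include <cassert>
    #include <functional>
    #include <iostream>

    // Minimal scope-guard sketch of what KJ_DEFER provides: run a callable
    // when the enclosing scope exits, even on early return or exception.
    struct Defer {
        std::function<void()> fn;
        ~Defer() { if (fn) fn(); }
    };

    int main()
    {
        int lock_storage = 0;
        int* cancel_lock = nullptr;
        {
            cancel_lock = &lock_storage;
            Defer defer{[&] { cancel_lock = nullptr; }};
            // ... request executes while cancel_lock is set ...
            assert(cancel_lock == &lock_storage);
        } // guard fires here, clearing the pointer before it can dangle
        assert(cancel_lock == nullptr);
        std::cout << "ok\n";
        return 0;
    }
    ```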
  44. race fix: m_on_cancel called after request finishes
    This fixes a race condition in the mp.Context PassField() overload (used to
    execute async requests) that can currently trigger segfaults when a
    cancellation happens after the request executes but before it returns, as
    reported in https://github.com/bitcoin/bitcoin/issues/34782.
    
    The bug can be reproduced by running the unit test added in the previous commit
    and was also seen in antithesis (see details in linked issue), but should be
    unlikely to happen normally because the cancellation would have to happen in a
    very short window for there to be a problem.
    
    This bug was introduced in commit 0174450ca2e95a4bd1f22e4fd38d83b1d432ac1f,
    which started to cancel requests on disconnects. Before that commit a
    cancellation callback was not present.
    
    This fix was originally posted in
    https://github.com/bitcoin/bitcoin/issues/34782#issuecomment-4033169085
    and there is a sequence diagram explaining the bug in
    https://github.com/bitcoin-core/libmultiprocess/pull/249#discussion_r2953306851
    1dbc59a4aa
  45. refactor: Move type-context.h getParams() call closer to use
    This change has no effect on behavior; it just narrows the scope of the
    params variable to avoid potential bugs if a cancellation happens and
    makes the params invalid.
    ff1d8ba172
  46. refactor: Use loop variable in type-context.h
    Replace `server.m_context.loop` references with `loop` in Context
    PassField implmentation after a `loop` variable was introduced in a
    recent commit. Also adjust PassField scopes and indentation without
    changing behavior.
    
    This commit is easiest to review ignoring whitespace.
    ff0eed1bf1
  47. ryanofsky referenced this in commit 73c88d69ce on Mar 26, 2026
  48. ryanofsky referenced this in commit 2b7d3fa83e on Mar 26, 2026
  49. ryanofsky force-pushed on Mar 26, 2026
  50. ryanofsky referenced this in commit 7970e4670e on Mar 26, 2026
  51. ryanofsky referenced this in commit 54319b7e4a on Mar 26, 2026
  52. ryanofsky force-pushed on Mar 26, 2026
  53. ryanofsky referenced this in commit a9fcf879d3 on Mar 26, 2026
  54. ryanofsky referenced this in commit ca3c58c6b9 on Mar 26, 2026
  55. Sjors commented at 1:37 pm on March 26, 2026: member
    ACK ff0eed1bf183627c89007e71d631f039bd61bfb5
  56. DrahtBot requested review from ismaelsadeeq on Mar 26, 2026
  57. ryanofsky merged this on Mar 27, 2026
  58. ryanofsky closed this on Mar 27, 2026


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/libmultiprocess. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-03-29 21:30 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me