Races between ~ProxyClient<Thread> and ~Connection destructors #189

ryanofsky opened this issue on July 10, 2025
  1. ryanofsky commented at 5:44 pm on July 10, 2025: collaborator

    #186 fixed several race conditions that happen during the “disconnecting and blocking” unit test, but a new one was reported in https://github.com/Sjors/bitcoin/pull/90#issuecomment-3024942087 with details in https://github.com/Sjors/bitcoin/pull/90#issuecomment-3029059790, and I want to write more notes about it here. There is also a branch with previously attempted fixes.

    The problem is not too hard to reproduce locally: if the mptest binary is run in a loop in bash, the TSan error happens within a few minutes. With the fixes attempted so far, running the mptest binary in a loop results in a deadlock after about 20 minutes for me.

    Background

    If a Connection object is destroyed while a ProxyClient<Thread> object is still using it (due to a sudden disconnect), the ProxyClient<Thread>::m_disconnect_cb callback will be called from the ~Connection destructor, deleting the ProxyClient<Thread> object from whichever ConnThreads map it is in.

    If a ProxyClient<Thread> object is destroyed before its Connection object (the more normal case), the ProxyClient<Thread> destructor will unregister m_disconnect_cb so when the Connection is later destroyed it will not try to delete the ProxyClient<Thread> object.

    If the two destructors are called in sequence in either order, everything is fine. But if the destructors are called from different threads in a race, numerous problems occur.
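
    To make the interaction concrete, here is a minimal, self-contained sketch of the registration pattern (simplified names and types, not the actual libmultiprocess classes): the connection owns a list of cleanup callbacks, the thread client registers one when it is created and removes it in its destructor, and either destructor is safe as long as the other has already finished or has not started.

        // Simplified model of the Connection / ProxyClient<Thread> handshake.
        // Names and structure are illustrative, not the real libmultiprocess code.
        #include <functional>
        #include <list>

        struct Conn
        {
            std::list<std::function<void()>> cleanup_fns;

            // Registered by the "ProxyClient<Thread>" side on construction.
            std::list<std::function<void()>>::iterator addCleanup(std::function<void()> fn)
            {
                return cleanup_fns.emplace(cleanup_fns.begin(), std::move(fn));
            }

            // Called by the "ProxyClient<Thread>" destructor in the normal case,
            // so the connection will not run the callback later.
            void removeCleanup(std::list<std::function<void()>>::iterator it)
            {
                cleanup_fns.erase(it);
            }

            // ~Connection: run any callbacks that were never unregistered
            // (the sudden-disconnect case).
            ~Conn()
            {
                for (auto& fn : cleanup_fns) fn();
            }
        };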

    Problem 1: invalid m_disconnect_cb iterator erase

    If ~Connection runs first and is about to call the m_disconnect_cb callback, but the ~ProxyClient<Thread> destructor is then called in another thread, the destructor will currently (as of a11e6905c238dc35a8bbef995190296bc6329d49) try to delete m_disconnect_cb using an iterator which is no longer valid. 79c09e44420a79bb7ac56960fe94a16849f04bc3 fixes this by changing the ~ProxyClient<Thread> destructor to avoid deleting m_disconnect_cb once it has been invalidated.
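
    A minimal sketch of the shape of that guard, with simplified made-up types rather than the actual code from 79c09e4: the disconnect path clears a “still registered” flag before it invalidates the entry, and the client destructor only erases its entry while the flag is still set.

        // Illustrative guard against the stale-iterator erase (not real code).
        #include <functional>
        #include <list>
        #include <mutex>

        struct ToyThreadClient
        {
            std::mutex m;                                          // guards the fields below
            std::list<std::function<void()>>* cleanups = nullptr;  // list owned by the connection
            std::list<std::function<void()>>::iterator cleanup_it; // this client's entry
            bool registered = false; // cleared by the disconnect path before it
                                     // invalidates cleanup_it

            ~ToyThreadClient()
            {
                std::lock_guard<std::mutex> lock(m);
                if (registered) {                 // skip the erase once the iterator is invalid
                    cleanups->erase(cleanup_it);
                    registered = false;
                }
            }
        };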

    Problem 2: ProxyClient<Thread> use-after-free

    The second problem occurs if ~Connection runs first and invokes the m_disconnect_cb callback, and the ~ProxyClient<Thread> destructor runs and finishes before m_disconnect_cb has a chance to do anything: when the callback resumes, the ProxyClient<Thread> object will be gone and the callback’s attempt to access and erase it will be invalid. 79c09e4 fixes this by making ~ProxyClient<Thread> set a destroyed bit when it runs and having the m_disconnect_cb callback just return early if the bit is set.
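
    A sketch of the destroyed bit in the same spirit, again with made-up types. The shared_ptr here exists only to keep the sketch self-contained; in the real code the flag and the mutex guarding it live in the thread_local ThreadContext, which is exactly what makes Problem 3 below possible.

        // Illustrative "destroyed bit": state shared between the client and its
        // disconnect callback, so the callback can detect that the client
        // destructor already ran and return early.
        #include <memory>
        #include <mutex>

        struct SharedBit
        {
            std::mutex m;
            bool destroyed = false;
        };

        struct ToyThreadClient
        {
            std::shared_ptr<SharedBit> bit = std::make_shared<SharedBit>();

            // What the registered disconnect callback does with the shared state:
            static void onDisconnect(const std::shared_ptr<SharedBit>& state)
            {
                std::lock_guard<std::mutex> lock(state->m);
                if (state->destroyed) return; // client already gone: do nothing
                // ... otherwise erase the client from its ConnThreads map ...
            }

            ~ToyThreadClient()
            {
                std::lock_guard<std::mutex> lock(bit->m);
                bit->destroyed = true;
            }
        };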

    Problem 3: ThreadContext use-after-free

    Just fixing the race between ~Connection and ~ProxyClient<Thread> is actually not enough, because after the ProxyClient<Thread> object is destroyed, the thread owning it can exit, destroying the thread_local ThreadContext struct that contained the ProxyClient<Thread> object, including the ThreadContext::waiter.m_mutex mutex the ProxyClient<Thread> code uses for synchronization. If this happens while an m_disconnect_cb callback is still in progress, that callback will fail when it tries to lock the mutex and the mutex no longer exists. (Commit 85df929b4e63d9e4a9271aaa7a3ba6458104a14b was a broken attempt to fix this, and commit 30431bc0c6fa125eb438cb49856f0144af2ea8c5 has another description of the problem.)
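
    A stripped-down illustration of the lifetime hazard, using a plain std::thread and a block-scope thread_local rather than the real ThreadContext machinery:

        // Minimal shape of the Problem 3 hazard (illustrative): another thread
        // holds a pointer to a thread_local object owned by a thread that has
        // since exited, so the mutex inside it no longer exists.
        #include <mutex>
        #include <thread>

        struct ThreadCtx
        {
            std::mutex waiter_mutex; // stands in for ThreadContext::waiter.m_mutex
        };

        int main()
        {
            ThreadCtx* ctx = nullptr;

            std::thread client([&] {
                thread_local ThreadCtx tc; // destroyed when this thread exits
                ctx = &tc;                 // pointer handed to the "event loop" side
            });                            // thread exits here, destroying tc

            client.join();

            // "Event loop" side: tc is already gone, so locking through ctx would
            // be a use-after-free -- the failure mode TSan reports:
            // std::lock_guard<std::mutex> lock(ctx->waiter_mutex);
            (void)ctx;
            return 0;
        }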

    Possible Fixes

    Adding EventLoop::sync call

    The most direct fix for Problem 3 is probably to have the thread that destroyed the ProxyClient<Thread> object, and is about to exit and destroy ThreadContext::waiter.m_mutex, call EventLoop::sync to force m_disconnect_cb to finish running before that happens. This would work because m_disconnect_cb always runs on the event loop thread. It would require reverting 85df929b4e63d9e4a9271aaa7a3ba6458104a14b to avoid reintroducing Problem 2 (so the m_disconnect_cb callback checks the destroyed bit after it acquires ThreadContext::waiter.m_mutex instead of before).
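
    The idea, sketched against a toy event loop rather than the real EventLoop class: sync() just posts a no-op and waits for it, which cannot complete until every callback queued ahead of it (including a pending m_disconnect_cb) has finished running on the loop thread.

        // Toy single-threaded work queue illustrating the sync() property
        // the fix relies on (not the real EventLoop implementation).
        #include <condition_variable>
        #include <deque>
        #include <functional>
        #include <future>
        #include <mutex>

        struct ToyLoop
        {
            std::mutex m;
            std::condition_variable cv;
            std::deque<std::function<void()>> q;

            // Queue work for the loop thread.
            void post(std::function<void()> fn)
            {
                { std::lock_guard<std::mutex> l(m); q.push_back(std::move(fn)); }
                cv.notify_one();
            }

            // Block the caller until everything queued before this call has run
            // on the loop thread.
            void sync()
            {
                std::promise<void> done;
                post([&] { done.set_value(); });
                done.get_future().wait();
            }

            // Loop thread body; runs forever in this sketch.
            void run()
            {
                for (;;) {
                    std::function<void()> fn;
                    {
                        std::unique_lock<std::mutex> l(m);
                        cv.wait(l, [&] { return !q.empty(); });
                        fn = std::move(q.front());
                        q.pop_front();
                    }
                    fn();
                }
            }
        };

    With something like this, the thread that destroyed the ProxyClient<Thread> would call sync() right before returning from the code whose thread_local ThreadContext (and waiter mutex) is about to be torn down.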

    Calling removeSyncCleanup from EventLoop thread

    If removeSyncCleanup(m_disconnect_cb) could be called from the EventLoop thread, I think all the previous fixes could be reverted and there would be no problem here, because m_disconnect_cb would only be accessed from one thread and couldn’t be removed while it might be running. This would somehow have to be done without ThreadContext::waiter.m_mutex being held (to avoid deadlocks), so I’m not sure if it is possible.

    Switching mutexes

    I am thinking that cleaner fixes for all the problems above could be possible by switching ProxyClient<Thread> and related code to use EventLoop::m_mutex for some locking instead of ThreadContext::waiter.m_mutex. This could simplify things by avoiding the need to juggle two mutexes and to release and re-acquire them to keep a consistent lock order. I was even thinking of switching to EventLoop::m_mutex exclusively for all locking, but I don’t think I can do this because a single thread can make IPC calls over multiple connections, and those connections may not all use the same event loop.

    Refactoring ThreadContext

    The ThreadContext callback_threads and request_threads maps, which are indexed by Connection*, could be replaced with per-Connection maps indexed by thread id. That could allow not using ThreadContext::waiter.m_mutex at all and using EventLoop::m_mutex everywhere.
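
    Roughly, the re-indexing would look like this (types heavily simplified, names made up):

        // Sketch of flipping the map indexing (illustrative types only).
        #include <cstdint>
        #include <map>

        struct ThreadClientStub {}; // stands in for ProxyClient<Thread>
        struct ConnState;           // stands in for per-Connection state

        // Current shape (simplified): each thread's ThreadContext owns maps keyed
        // by connection, so connection teardown has to reach into other threads'
        // thread_local state, guarded by ThreadContext::waiter.m_mutex.
        struct ThreadContextNow
        {
            std::map<ConnState*, ThreadClientStub> callback_threads;
            std::map<ConnState*, ThreadClientStub> request_threads;
        };

        // Proposed shape (simplified): the per-connection state owns maps keyed
        // by a thread id, and is only ever touched under EventLoop::m_mutex.
        struct ConnState
        {
            std::map<uint64_t, ThreadClientStub> callback_threads; // key: thread id
            std::map<uint64_t, ThreadClientStub> request_threads;
        };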

    Summary

    There is a bug and I am not sure how to fix it yet.

  2. ryanofsky commented at 1:33 pm on August 23, 2025: collaborator

    I think the second fix mentioned above, “Calling removeSyncCleanup from EventLoop thread”, should be possible by adding a ProxyClient<Thread>::close method that calls EventLoop::sync to unregister m_disconnect_cb from the event loop thread. This could be called from type-context.h before the ProxyClient<Thread> object is destroyed, and before waiter->m_mutex is locked, to avoid deadlocks.
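
    A sketch of that shape, with toy types and a hypothetical close() name; the stand-in syncOnLoop just runs the function inline, where the real EventLoop::sync would block until it has run on the event loop thread.

        // Illustrative close()-before-destroy pattern (not the real code).
        #include <functional>
        #include <list>

        struct ToyConn
        {
            std::list<std::function<void()>> cleanups;

            // Stand-in for EventLoop::sync: runs fn on the event loop thread and
            // waits for it. Here it runs fn inline to keep the sketch short.
            void syncOnLoop(const std::function<void()>& fn) { fn(); }
        };

        struct ToyThreadClient
        {
            ToyConn* conn = nullptr;
            std::list<std::function<void()>>::iterator cleanup_it;
            bool registered = false;

            // Hypothetical close(): remove this client's cleanup entry from the
            // event loop thread, before the destructor runs and before any
            // waiter mutex is taken.
            void close()
            {
                if (!registered) return;
                conn->syncOnLoop([&] {
                    conn->cleanups.erase(cleanup_it); // only touched on the loop thread
                    registered = false;
                });
            }

            ~ToyThreadClient() { close(); } // destructor is now trivial to reason about
        };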

    This fix would make the ProxyClient<Thread> class behave more similarly to generic ProxyClient objects, which d8011c83608e63db0caba2d6e1ef354537f25a63 from #186 fixed in a similar way by unregistering their disconnect callbacks from the event loop thread instead of on their own threads.

    Then the removeSyncCleanup method could have an assert ensuring it is only called on the event loop thread, which would make everything easier to think about and avoid the need for any of the previously attempted fixes above.

    I was also trying to think about whether it’s possible to improve the “disconnecting and blocking” unit test to make this bug happen more reliably. Currently in that test, the server thread executing the IPC call and triggering the client disconnect just exits right away, so the request_threads ProxyClient<Thread> object it owns is usually destroyed long before the server receives and processes the client disconnect, and before the server Connection object is destroyed. But occasionally the server thread executing the IPC call can get delayed for whatever reason and not actually destroy its request_threads ProxyClient<Thread> object until after the server Connection object is gone. When this happened there was previously a bug, fixed by c6f7fdf1735015d069b173dd5d511ab148b678e7 from #186, and I started writing some code in #186 (comment) to make the test more deterministic by using promises instead of threads, so both of these cases are tested deterministically instead of only one case being tested randomly depending on thread scheduling.

    I think if I can finish the test improvement which does that, I might also be able to write a test case that reproduces the bug here reliably, by having the server request_threads ProxyClient<Thread> object be destroyed while the server Connection is being destroyed, instead of before or after. This could be done by having the test add its own addSyncCleanup callback to make the server thread executing the IPC call return while the server Connection object is being destroyed. Doing this might make the test too complicated, but the test improvement from #186 (comment) should at least be done; that alone might make this bug easier to reproduce.
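
    For reference, the promise-based sequencing could look roughly like this (generic std::promise plumbing, not the actual mptest code): the test decides exactly when the blocked server call is allowed to return relative to connection teardown, instead of leaving the ordering to thread scheduling.

        // Sketch of deterministic ordering between a blocked call and teardown.
        #include <future>
        #include <thread>

        int main()
        {
            std::promise<void> release_call; // test -> server call: "you may return now"
            std::promise<void> call_blocked; // server call -> test: "I am inside the call"

            std::thread server([&] {
                call_blocked.set_value();         // signal the call is in flight
                release_call.get_future().wait(); // block until the test says go
                // ... the IPC call would return here, destroying request_threads ...
            });

            call_blocked.get_future().wait();     // test: call is definitely blocked
            // ... test: start (or finish) tearing down the connection here ...
            release_call.set_value();             // test: now let the call return
            server.join();
            return 0;
        }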

  3. ryanofsky referenced this in commit 2b05ed71af on Sep 2, 2025
  4. ryanofsky referenced this in commit cd6d95aef8 on Sep 3, 2025
  5. ryanofsky referenced this in commit 9fe7f0c841 on Sep 5, 2025
  6. ryanofsky referenced this in commit 7a8a99c50f on Sep 5, 2025
  7. Sjors referenced this in commit 9fe5d03058 on Sep 10, 2025
  8. Sjors referenced this in commit 5b65dae936 on Sep 10, 2025
  9. ryanofsky referenced this in commit b6321bb99c on Sep 10, 2025
  10. ryanofsky referenced this in commit d86823eed7 on Sep 10, 2025
  11. ryanofsky referenced this in commit 4a269b21b8 on Sep 11, 2025
  12. ryanofsky referenced this in commit a0952c2d0f on Sep 17, 2025
  13. ryanofsky referenced this in commit 47d79db8a5 on Sep 17, 2025
  14. theuni referenced this in commit 86167c57d6 on Sep 26, 2025

