bugfix: Do not lock EventLoop::mutex after EventLoop is done

ryanofsky commented at 4:24 am on February 10, 2025: collaborator

Currently, there are two cases where code may attempt to lock EventLoop::mutex after the EventLoop::loop() method has finished running. This is not a problem by itself, but is a problem if there is external code which deletes the EventLoop object immediately after the method returns, which is the case in Bitcoin Core. Both cases of locking after the loop method returns happen due to uses of the Unlock() utility function which unlocks a mutex temporarily, runs a callback function and relocks it.

The first case happens in EventLoop::removeClient when the EventLoop::done() condition is reached and it calls Unlock() in order to be able to call write(post_fd, ...) without blocking and wake the EventLoop thread. In this case, since EventLoop execution is done there is not really any point to using Unlock() and relocking, so the code is changed to use a simple lock.unlock() call and not try to relock.

The second case happens in EventLoop::post where Unlock() is used in a similar way, and depending on thread scheduling (if the thread is delayed for a long time before relocking) can result in failing to relock EventLoop::m_mutex after calling write(). In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new addClient/removeClient calls are added to this code, so the EventLoop::loop() function will never exit while the post() function is waiting.

These two changes are being labeled as a bugfix even though there is not technically a bug here in the libmultiprocess code or API. The EventLoop object itself was safe before these changes as long as outside code waited for EventLoop methods to finish executing before deleting the EventLoop object. Originally the multiprocess code added in https://github.com/bitcoin/bitcoin/pull/10102 did this, but later as more features were added for binding and connecting to unix sockets, and unit tests were added which would immediately delete the EventLoop object after EventLoop::loop() returned, it meant this code could now start failing to relock after unlocking.

A previous attempt was made to fix this bug in https://github.com/bitcoin/bitcoin/pull/31815 outside of libmultiprocess code. But it only addressed the first case of a failing Unlock() in EventLoop::removeClient(), not the second case in EventLoop::post().

This first case described above was not observed but is theoretically possible. The second case happens intermittently on macos and was reported https://github.com/chaincodelabs/libmultiprocess/issues/154

bugfix: Do not lock EventLoop::mutex after EventLoop is done

Currently, there are two cases where code may attempt to lock
`EventLoop::mutex` after the `EventLoop::loop()` method has finished running.
This is not a problem by itself, but is a problem if there is external code
which deletes the `EventLoop` object immediately after the method returns,
which is the case in Bitcoin Core. Both cases of locking after the loop method
returns happen due to uses of the `Unlock()` utility function which unlocks a
mutex temporarily, runs a callback function and relocks it.

The first case happens in `EventLoop::removeClient` when the
`EventLoop::done()` condition is reached and it calls `Unlock()` in order to be
able to call `write(post_fd, ...)` without blocking and wake the EventLoop
thread. In this case, since `EventLoop` execution is done there is not really
any point to using `Unlock()` and relocking, so the code is changed to use a
simple `lock.unlock()` call and not try to relock.

The second case happens in `EventLoop::post` where `Unlock()` is used in a
similar way, and depending on thread scheduling (if the thread is delayed for a
long time before relocking) can result in failing to relock
`EventLoop::m_mutex` after calling `write()`. In this case, since relocking the
mutex is actually necessary for the code that follows, the fix is different:
new `addClient`/`removeClient` calls are added to this code, so the
`EventLoop::loop()` function will never exit while the `post()` function is
waiting.

These two changes are being labeled as a bugfix even though there is not
technically a bug here in the libmultiprocess code or API. The `EventLoop`
object itself was safe before these changes as long as outside code waited for
`EventLoop` methods to finish executing before deleting the `EventLoop` object.
Originally the multiprocess code added in
https://github.com/bitcoin/bitcoin/pull/10102 did this, but later as more
features were added for binding and connecting to unix sockets, and unit tests
were added which would immediately delete the `EventLoop` object after
`EventLoop::loop()` returned, it meant this code could now start failing to
relock after unlocking.

A previous attempt was made to fix this bug in
https://github.com/bitcoin/bitcoin/pull/31815 outside of libmultiprocess code.
But it only addressed the first case of a failing `Unlock()` in
`EventLoop::removeClient()`, not the second case in `EventLoop::post()`.

This first case described above was not observed but is theoretically possible.
The second case happens intermittently on macos and was reported
https://github.com/chaincodelabs/libmultiprocess/issues/154

48d01bcca7

ryanofsky merged this on Feb 10, 2025

ryanofsky closed this on Feb 10, 2025

ryanofsky referenced this in commit 3a95817ece on Feb 10, 2025

ryanofsky commented at 5:28 am on February 10, 2025: collaborator

#160 contains some followups to try to make this code a little safer and easier to maintain.

Sjors referenced this in commit 6aabfcb615 on Feb 10, 2025

Sjors referenced this in commit 1746618e08 on Feb 13, 2025

ryanofsky referenced this in commit 83e40d3b52 on Feb 24, 2025

ryanofsky referenced this in commit 8619f03ec2 on Feb 24, 2025

ryanofsky referenced this in commit cbb7b41c20 on Feb 24, 2025

fanquake referenced this in commit 01f7715766 on Feb 25, 2025

fanquake referenced this in commit ba0a4391ff on Feb 25, 2025

ryanofsky referenced this in commit 258a617c1e on Jun 19, 2025

janus referenced this in commit 86cb86b050 on Sep 1, 2025

bitcoin-core locked this on Feb 10, 2026

bugfix: Do not lock EventLoop::mutex after EventLoop is done #159