ci: intermittent issue in rpc_misc.py node0 stderr terminate called after throwing an instance of 'kj::ExceptionImpl' [15:12:14.943] what(): mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe #31151

maflcko opened this issue on October 24, 2024
  1. maflcko commented at 3:27 pm on October 24, 2024: member

    https://cirrus-ci.com/task/5442110629871616?logs=ci#L3430

     [15:12:14.943]  node0 2024-10-24T15:12:14.578291Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write block and undo data to disk started
     [15:12:14.943]  node0 2024-10-24T15:12:14.578357Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write block and undo data to disk completed (0.06ms)
     [15:12:14.943]  node0 2024-10-24T15:12:14.578365Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write block index to disk started
     [15:12:14.943]  node0 2024-10-24T15:12:14.578381Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write block index to disk completed (0.01ms)
     [15:12:14.943]  node0 2024-10-24T15:12:14.578390Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write coins cache to disk (0 coins, 262kB) started
     [15:12:14.943]  node0 2024-10-24T15:12:14.578406Z [shutoff] [src/txdb.cpp:146] [BatchWrite] [coindb] Writing final batch of 0.00 MiB
     [15:12:14.943]  node0 2024-10-24T15:12:14.578614Z [shutoff] [src/txdb.cpp:148] [BatchWrite] [coindb] Committed 0 changed transaction outputs (out of 0) to coin database...
     [15:12:14.943]  node0 2024-10-24T15:12:14.578640Z [shutoff] [src/logging/timer.h:58] [Log] [bench] FlushStateToDisk: write coins cache to disk (0 coins, 262kB) completed (0.24ms)
     [15:12:14.943]  node0 2024-10-24T15:12:14.578654Z [shutoff] [src/validationinterface.cpp:245] [ChainStateFlushed] [validation] Enqueuing ChainStateFlushed: block hash=08088a81df63d1c115663bb817c7ce81057d3e9e57dc706c85e4f3878390e828
     [15:12:14.943]  node0 2024-10-24T15:12:14.585972Z [shutoff] [src/init.cpp:398] [Shutdown] Shutdown: done
     [15:12:14.943]  node0 2024-10-24T15:12:14.586061Z [capnp-loop] [src/ipc/capnp/protocol.cpp:35] [IpcLogFn] [ipc] {bitcoin-node-21814/b-capnp-loop-21841} EventLoop::loop done, cancelling event listeners.
     [15:12:14.943]  node0 2024-10-24T15:12:14.586078Z [capnp-loop] [src/ipc/capnp/protocol.cpp:35] [IpcLogFn] [ipc] {bitcoin-node-21814/b-capnp-loop-21841} EventLoop::loop bye.
     [15:12:14.943]  test  2024-10-24T15:12:14.621000Z TestFramework (ERROR): Assertion failed
     [15:12:14.943]                                    Traceback (most recent call last):
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_framework.py", line 132, in main
     [15:12:14.943]                                        self.run_test()
     [15:12:14.943]                                      File "/ci_container_base/ci/scratch/build-i686-pc-linux-gnu/test/functional/rpc_misc.py", line 83, in run_test
     [15:12:14.943]                                        self.restart_node(0, ["-txindex", "-blockfilterindex", "-coinstatsindex"])
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_framework.py", line 595, in restart_node
     [15:12:14.943]                                        self.stop_node(i)
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_framework.py", line 581, in stop_node
     [15:12:14.943]                                        self.nodes[i].stop_node(expected_stderr, wait=wait)
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_node.py", line 410, in stop_node
     [15:12:14.943]                                        self.wait_until_stopped(expected_stderr=expected_stderr)
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_node.py", line 445, in wait_until_stopped
     [15:12:14.943]                                        self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_node.py", line 843, in wait_until
     [15:12:14.943]                                        return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
     [15:12:14.943]                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/util.py", line 292, in wait_until_helper_internal
     [15:12:14.943]                                        if predicate():
     [15:12:14.943]                                           ^^^^^^^^^^^
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_node.py", line 445, in <lambda>
     [15:12:14.943]                                        self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
     [15:12:14.943]                                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     [15:12:14.943]                                      File "/ci_container_base/test/functional/test_framework/test_node.py", line 430, in is_node_stopped
     [15:12:14.943]                                        raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
     [15:12:14.943]                                    AssertionError: Unexpected stderr terminate called after throwing an instance of 'kj::ExceptionImpl'
     [15:12:14.943]                                      what():  mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe
     [15:12:14.943]                                    stack: 657ed083 657ed86e f7ac7d20 f7713156 f77a7e07 !=
     [15:12:14.943]  test  2024-10-24T15:12:14.623000Z TestFramework (DEBUG): Closing down network thread
     [15:12:14.943]  test  2024-10-24T15:12:14.673000Z TestFramework (INFO): Stopping nodes
     [15:12:14.943]  test  2024-10-24T15:12:14.673000Z TestFramework.node0 (DEBUG): Stopping node
     [15:12:14.943]
     [15:12:14.943]  node0 stderr terminate called after throwing an instance of 'kj::ExceptionImpl'
     [15:12:14.943]   what():  mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe
     [15:12:14.943] stack: 657ed083 657ed86e f7ac7d20 f7713156 f77a7e07
    
  2. maflcko added the label interfaces on Oct 24, 2024
  3. maflcko added the label CI failed on Oct 24, 2024
  4. fanquake commented at 3:29 pm on October 24, 2024: member
  5. ryanofsky commented at 1:05 pm on October 29, 2024: contributor

    Assuming this is a new error, it is possible it is caused by #31105, which was merged a week ago and changed some parts of the IPC shutdown sequence.

    It seems the failure here is happening because some code is trying to use the IPC event loop after it is closed. Specifically, some code is calling mp::EventLoop::removeClient, which fails writing to the event loop’s local pipe at https://github.com/chaincodelabs/libmultiprocess/blob/abe254b9734f2e2b220d1456de195532d6e6ac1e/src/mp/proxy.cpp#L242

  6. ryanofsky commented at 1:35 pm on October 29, 2024: contributor
    Looking more closely at the changes in #31105, I don’t see anything that directly relates to this code, so it seems possible this is an older bug or a different bug. Looking up the “Broken pipe” error, it also seems like it can only be triggered if something closes the EventLoop::m_wait_fd file descriptor before the write(m_post_fd, ...) call happens. But I don’t see a code path where that would happen. I may need to reproduce this locally to debug. I would be curious to know if this happens again.
  7. ryanofsky commented at 9:38 pm on January 22, 2025: contributor

    I’ve been looking into this issue since it was reported again in https://github.com/chaincodelabs/libmultiprocess/issues/128#issuecomment-2604033275 and was able to get it to happen by removing other code in rpc_misc.py, so it is only calling node.echoipc and self.stop_node(0), and then running the test in a loop. If I do that, I tend to see the crash after 20-30 minutes of looping the test. I also tried implementing a change to dump IPC debug logs from the child bitcoin-node process that is spawned during the test, but unfortunately this change made the problem disappear when the test was run in a loop overnight, so I don’t have great information about what leads up to and causes the child process to crash.

    @@ -32,6 +33,9 @@ namespace capnp {
     namespace {
     void IpcLogFn(bool raise, std::string message)
     {
    +    std::ofstream f("/tmp/echoclient.txt", std::ios::app);
    +    f << message << std::endl;
    +    f.close();
         LogDebug(BCLog::IPC, "%s\n", message);
         if (raise) throw Exception(message);
     }
    

    The one useful thing I could get was a stack trace from the core dump, which looks like:

     #0  0x00007f96ad7b2a9c in __pthread_kill_implementation () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
     #1  0x00007f96ad760576 in raise () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
     #2  0x00007f96ad748935 in abort () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
     #3  0x00007f96adaacc2b in __gnu_cxx::__verbose_terminate_handler() [clone .cold] () from /nix/store/ybjcla5bhj8g1y84998pn4a2drfxybkv-gcc-13.3.0-lib/lib/libstdc++.so.6
     #4  0x00007f96adabc20a in __cxxabiv1::__terminate(void (*)()) () from /nix/store/ybjcla5bhj8g1y84998pn4a2drfxybkv-gcc-13.3.0-lib/lib/libstdc++.so.6
     #5  0x00007f96adabc275 in std::terminate() () from /nix/store/ybjcla5bhj8g1y84998pn4a2drfxybkv-gcc-13.3.0-lib/lib/libstdc++.so.6
     #6  0x00007f96adabc4c7 in __cxa_throw () from /nix/store/ybjcla5bhj8g1y84998pn4a2drfxybkv-gcc-13.3.0-lib/lib/libstdc++.so.6
     #7  0x000055d3346d6275 in kj::ExceptionCallback::RootExceptionCallback::onFatalException(kj::Exception&&) ()
     #8  0x000055d3357661f5 in kj::throwFatalException(kj::Exception&&, unsigned int) ()
     #9  0x000055d335761de3 in kj::_::Debug::Fault::fatal() ()
     #10 0x000055d3355edbe9 in mp::EventLoop::removeClient(std::unique_lock<std::mutex>&)::$_0::operator()() const (this=<optimized out>) at mp/src/mp/proxy.cpp:244
     #11 mp::Unlock<std::unique_lock<std::mutex>, mp::EventLoop::removeClient(std::unique_lock<std::mutex>&)::$_0>(std::unique_lock<std::mutex>&, mp::EventLoop::removeClient(std::unique_lock<std::mutex>&)::$_0&&) (lock=..., callback=...) at mp/include/mp/util.h:157
     #12 mp::EventLoop::removeClient (this=this@entry=0x55d365fc2528, lock=...) at mp/src/mp/proxy.cpp:242
     #13 0x000055d3355f0033 in mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0::operator()() const (this=<optimized out>) at mp/src/mp/proxy.cpp:262
     #14 std::__invoke_impl<void, mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0>(std::__invoke_other, mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0&&) (__f=...)
         at /nix/store/4krab2h0hd4wvxxmscxrw21pl77j4i7j-gcc-13.3.0/include/c++/13.3.0/bits/invoke.h:61
     #15 std::__invoke<mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0>(mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0&&) (__fn=...)
         at /nix/store/4krab2h0hd4wvxxmscxrw21pl77j4i7j-gcc-13.3.0/include/c++/13.3.0/bits/invoke.h:96
     #16 std::thread::_Invoker<std::tuple<mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0> >::_M_invoke<0ul> (this=<optimized out>)
         at /nix/store/4krab2h0hd4wvxxmscxrw21pl77j4i7j-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:292
     #17 std::thread::_Invoker<std::tuple<mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0> >::operator() (this=<optimized out>)
         at /nix/store/4krab2h0hd4wvxxmscxrw21pl77j4i7j-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:299
     #18 std::thread::_State_impl<std::thread::_Invoker<std::tuple<mp::EventLoop::startAsyncThread(std::unique_lock<std::mutex>&)::$_0> > >::_M_run (this=<optimized out>)
         at /nix/store/4krab2h0hd4wvxxmscxrw21pl77j4i7j-gcc-13.3.0/include/c++/13.3.0/bits/std_thread.h:244
     #19 0x00007f96adae86d3 in execute_native_thread_routine () from /nix/store/ybjcla5bhj8g1y84998pn4a2drfxybkv-gcc-13.3.0-lib/lib/libstdc++.so.6
     #20 0x00007f96ad7b0d02 in start_thread () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
     #21 0x00007f96ad8303ac in __clone3 () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
    

    The stack trace shows the crash is happening in EventLoop::startAsyncThread when it is calling EventLoop::removeClient:

    https://github.com/chaincodelabs/libmultiprocess/blob/3b2617b3e59f55ae3501c82a62c87f393077779f/src/mp/proxy.cpp#L260

    which is failing when it tries to write to EventLoop::m_post_fd:

    https://github.com/chaincodelabs/libmultiprocess/blob/3b2617b3e59f55ae3501c82a62c87f393077779f/src/mp/proxy.cpp#L242

    but as mentioned earlier in #31151 (comment), it’s not clear why that write would ever fail, because the pipe should be open for reading at that point. Since I couldn’t easily add more log prints, I looked through the surrounding code for reference-counting bugs, to see if there was any shutdown sequence that might lead to stopping the event loop early and closing the pipe, but I couldn’t find anything. Looking at the EventLoop thread I did see some cases where locks for EventLoop::m_mutex are missing, so I will try to add those and see if that helps. If it doesn’t, I might try adding delays in different places to see if I can get the bug to happen more reliably. I also need to try building the libmultiprocess code with -fsanitize=thread (regardless of this bug), since there might be other issues it could catch.

  8. ryanofsky commented at 12:28 pm on January 23, 2025: contributor

    I think I figured out how the wait_stream = nullptr; statement in EventLoop::loop can happen before the write(m_post_fd, ...) call in EventLoop::removeClient (called by the thread in EventLoop::startAsyncThread), which is what causes the error about trying to write to a closed pipe.

    The issue is that the event loop can exit early, after it has detected the loop ending condition m_num_clients == 0 && m_async_fns.empty(), but before the write(m_post_fd, ...) call is made to notify the loop that the ending condition has been reached.

    Specifically, this can happen in the ~Connection destructor when it calls startAsyncThread and removeClient. Depending on the order in which the destructor thread and the async thread run, it is possible that the removeClient call in the destructor runs first, detects m_num_clients == 0, and calls write(m_post_fd, ...), but (importantly) the event loop does not read from the other end of the pipe right away. Then the async thread calls addClient and removeClient as it executes the m_async_fns queue, and (also importantly) the final removeClient call it makes is interrupted after it runs m_num_clients -= 1 but before it runs write(m_post_fd, ...). When this happens, the EventLoop::loop thread can wake up, read from the other end of the pipe because of the write(m_post_fd, ...) call that happened earlier, and exit the loop before the second write(m_post_fd, ...) has a chance to run, resulting in a write error.

    That’s the sequence of events I think is happening in this case, but other sequences could have the same effect. For example, if the m_async_fns list has multiple entries (which could happen with multiple connections) and the async thread makes multiple removeClient and write(m_post_fd, ...) calls, and earlier writes are not read right away, and the thread is interrupted before the final write(m_post_fd, ...) call, the loop can exit, causing the final write to fail trying to write to a closed pipe.
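    This failure mode can be reproduced deterministically with a plain OS pipe. The following is an illustrative Python sketch of the interleaving, not the actual libmultiprocess code; r and w stand in for the event loop’s m_wait_fd and m_post_fd:

    ```python
    import errno
    import os

    # Deterministic model of the race: two "removeClient" wakeup writes are
    # supposed to happen, but the loop thread wakes on the first byte, sees
    # the exit condition already met, and closes its read end too early.
    r, w = os.pipe()

    os.write(w, b"\x00")   # destructor thread: first removeClient() wakeup
    # ...async thread runs its final removeClient(), decrements the client
    # count, but is interrupted here, before its own write(m_post_fd, ...)...

    os.read(r, 1)          # loop thread wakes on the FIRST byte, sees the
    os.close(r)            # exit condition met, and closes the read end
                           # (the equivalent of wait_stream = nullptr)

    try:
        os.write(w, b"\x00")   # async thread resumes its delayed second write
    except BrokenPipeError as e:
        err = e.errno
        print("write failed:", errno.errorcode[err])   # write failed: EPIPE
    ```

    EPIPE from this write is exactly the “disconnected: write(m_post_fd, &buffer, 1): Broken pipe” error in the CI log (in the C++ process the write raises kj::ExceptionImpl rather than being ignored).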

    Fundamentally, the problem here is that multiple write(m_post_fd, ...) calls are being made at all. If the removeClient check if (m_num_clients == 0) were changed to if (m_num_clients == 0 && m_async_fns.empty()), I think this problem would not happen.

    A second issue with this code is that the if (m_num_clients == 0 && m_async_fns.empty()) loop-exit check in EventLoop::loop is unsafe because it is made without holding an m_mutex lock. This could cause a race resulting in the same error reported here, if the loop exits early based on stale values.

    Using -fsanitize=thread helped uncover both of these issues. It uncovered the second issue right away, but after I fixed that one, it only reported the first issue if I ran a test calling node.echoipc and self.stop_node in a loop thousands of times, which had to run for around ~30 minutes before a race would be triggered and TSAN would report something. TSAN output was also a bit confusing because it referred to the write() call as a read and the close() call as a write, but that does make sense given TSAN’s file descriptor race detection, since conceptually a write() call does not mutate a file descriptor but a close() does. I also wasted a lot of time trying to suppress that error report with AnnotateHappensBefore and AnnotateHappensAfter, adding a HappensBefore call after the pipe write call and a HappensAfter call after the pipe read call, but this did not suppress it. Eventually I moved the HappensAfter call right before the close() call and this did suppress the error. Then at some point after that I realized that multiple writes could trigger a real race condition, and that the TSAN error maybe should not be suppressed.

    I think the next step will be to make a libmultiprocess PR adding an EventLoop::done method that checks the m_num_clients == 0 && m_async_fns.empty() condition consistently, so there should only ever be a single write triggering a shutdown and no way for the event loop to exit before the write happens. This should also fix the missing m_mutex locking problem. Separately, another PR might add TSAN HappensBefore/HappensAfter annotations if TSAN detects errors, since TSAN doesn’t actually have a way of knowing that the pipe write must happen before the code that runs after the pipe read.
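    The single-predicate idea can be sketched in miniature as follows. This is a hypothetical toy model for illustration (ToyEventLoop, done(), add_client, and remove_client are made-up names, not the libmultiprocess API): both counters are checked under one mutex, and the shutdown wakeup byte is written at most once, under the same lock as the check, so the loop cannot observe the exit condition before the byte is in the pipe:

    ```python
    import os
    import threading

    class ToyEventLoop:
        """Toy model of a done()-style shutdown predicate (illustrative only)."""
        def __init__(self):
            self.mutex = threading.Lock()
            self.num_clients = 0
            self.async_fns = []
            self.wait_fd, self.post_fd = os.pipe()  # ~ m_wait_fd / m_post_fd

        def _done(self):
            # Single, consistent exit predicate; caller must hold self.mutex.
            return self.num_clients == 0 and not self.async_fns

        def add_client(self):
            with self.mutex:
                self.num_clients += 1

        def remove_client(self):
            with self.mutex:
                self.num_clients -= 1
                if self._done():
                    # At most one shutdown wakeup is ever written, and it is
                    # written under the same lock as the check, so the loop
                    # cannot exit (and close the pipe) before this byte lands.
                    os.write(self.post_fd, b"\x00")

    loop = ToyEventLoop()
    loop.add_client()
    loop.add_client()
    loop.remove_client()   # done() is false here: no wakeup written
    loop.remove_client()   # done() becomes true: the single wakeup is written
    print(os.read(loop.wait_fd, 8))   # b'\x00' -- exactly one byte
    ```

    Compare with the buggy pattern above, where an unconditional-on-queue check could emit a wakeup per removeClient call; here only the final state transition produces a write.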

    Maybe there are CI improvements that could be made too. After #30975 it would make sense to build libmultiprocess with -fsanitize=thread in the tsan build. And it would be nice if core dumps could be enabled on CI and GDB were run on any core dumps produced, to output a backtrace.

  9. ryanofsky referenced this in commit 5a6c9e0f71 on Jan 23, 2025
  10. maflcko commented at 2:05 pm on January 23, 2025: member

    Maybe there are CI improvements that could be made too. After #30975 it would make sense to build libmultiprocess with -fsanitize=thread in the tsan build

    Would it hurt to just build all dependencies with any sanitizer flags in the CI?

  11. ryanofsky commented at 3:32 pm on January 23, 2025: contributor

    Would it hurt to just build all dependencies with any sanitizer flags in the CI?

    I don’t know. I think it could reveal bugs we would want to know about, or it could cause breakages if packages don’t support being built with sanitizers. It would seem like a nice thing to do by default, but I’m not sure how much effort it might take.

  12. Sjors referenced this in commit 0f48801af8 on Jan 24, 2025
  13. ryanofsky referenced this in commit 0e4f88d3f9 on Jan 24, 2025
  14. Sjors referenced this in commit 44949fc308 on Jan 27, 2025
  15. ryanofsky closed this on Jan 27, 2025

  16. ryanofsky referenced this in commit 72f6669b50 on Jan 27, 2025
  17. ryanofsky renamed this:
    ci: intermittent issue in rpc_misc.py node0 stderr terminate called after throwing an instance of 'kj::ExceptionImpl' [15:12:14.943] what(): mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe [15:12:14.943] stack: 657ed083 657ed86e f7ac7d20 f7713156 f77a7e07
    ci: intermittent issue in rpc_misc.py node0 stderr terminate called after throwing an instance of 'kj::ExceptionImpl' [15:12:14.943] what(): mp/proxy.cpp:242: disconnected: write(m_post_fd, &buffer, 1): Broken pipe
    on Feb 3, 2025


This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 15:12 UTC
