Expected behavior
Functional tests that complete successfully without attaching a C++ debugger should also complete successfully when a C++ debugger is attached.
Actual behavior
Functional tests that automatically stop a node (the stopped instance is referred to as node0.a) and then start it again (the new instance is referred to as node0.b), e.g. feature_reindex.py and mempool_persist.py, fail when lldb is attached to the bitcoind process of node0.a. The test framework raises AssertionError: [node 0] Error: no RPC connection, and the stderr output of node0.b reads:
node0 stderr Error: Cannot obtain a lock on data directory /var/folders/sn/cvk2394n1y582qrt04llpzyw0000gn/T/bitcoin_func_test_x8dfwjci/node0/regtest. Bitcoin Core is probably already running.
Empirically, it seems that node0.b is started before the lock held by the node0.a process has been released by the filesystem.
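The suspected failure mode can be illustrated in isolation. The following is a minimal Python sketch, not Bitcoin Core's actual locking code: it uses flock() from the standard fcntl module as a stand-in, on a hypothetical .lock file in a temporary directory, and shows that a second attempt to take an exclusive lock fails for as long as the first holder has not released it. The names try_lock and demo are illustrative assumptions, and the sketch is Unix only.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Try to take an exclusive, non-blocking flock() lock on path.

    Returns the open file object on success (keep it open to hold the lock),
    or None if the lock is already held through another open file.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f


def demo():
    with tempfile.TemporaryDirectory() as datadir:
        lock_path = os.path.join(datadir, ".lock")  # hypothetical lock file
        first = try_lock(lock_path)   # plays the role of node0.a
        second = try_lock(lock_path)  # node0.b started too early: refused
        first.close()                 # node0.a finally releases the lock
        third = try_lock(lock_path)   # a later start succeeds
        result = (first is not None, second is None, third is not None)
        third.close()
        return result
```

If the reported behavior matches this picture, node0.b corresponds to the second call: it arrives while the previous holder still owns the lock and is refused immediately instead of waiting.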
Full stacktrace of the AssertionError:
TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
  File "./test/functional/test_framework/test_framework.py", line 537, in start_nodes
    node.wait_for_rpc_connection()
  File "./test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection
    raise FailedToStartError(self._node_msg(
test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test/functional/test_framework/test_framework.py", line 132, in main
    self.run_test()
  File "./test/functional/feature_reindex.py", line 36, in run_test
    self.reindex(False)
  File "./test/functional/feature_reindex.py", line 30, in reindex
    self.start_nodes(extra_args)
  File "./test/functional/test_framework/test_framework.py", line 540, in start_nodes
    self.stop_nodes()
  File "./test/functional/test_framework/test_framework.py", line 555, in stop_nodes
    node.stop_node(wait=wait, wait_until_stopped=False)
  File "./test/functional/test_framework/test_node.py", line 335, in stop_node
    self.stop(wait=wait)
  File "./test/functional/test_framework/test_node.py", line 183, in __getattr__
    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
AssertionError: [node 0] Error: no RPC connection
To reproduce
The issue does not seem to be reproducible on all platforms, e.g. @LarryRuane reported he could not reproduce it with either lldb or gdb on his Linux setup. However, given the nature of how this can be fixed (see Ideas to fix), it doesn’t look like an issue that would be exclusive to my setup. It would be great if people could try to reproduce it and report back with their system information to pin this down further.
1. git checkout master
2. In ./test/functional/feature_reindex.py, add a pdb breakpoint before running the first test. The placement of the breakpoint is important: the bug is only reproducible when the C++ debugger is attached (step 4) before a node stop/start cycle.
...
    def run_test(self):
        import pdb; pdb.set_trace()  # Added this line
        self.reindex(False)
...
3. Run ./test/functional/feature_reindex.py
4. Attach your C++ debugger, e.g. PATH=/usr/bin /usr/bin/lldb -p $(pgrep bitcoind)
5. continue your C++ debugger (you don’t need to set any breakpoints)
6. continue pdb
7. test_framework should raise AssertionError: [node 0] Error: no RPC connection
Note: when skipping steps 4 and 5, the test should still run fine. It’s attaching the C++ debugger that seems to interfere with releasing the filesystem lock in time.
Ideas to fix
I’ve tried some simple fixes to resolve the issue, ordered by increasing complexity. I’ve added commit SHAs that contain an example implementation.
- Add a sleep timer: c44a0eba
- Remove .lock file: a619c6750
- Explicitly remove the lock: 96c862a92
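As a variation on the sleep-timer idea, the wait could poll the lock instead of sleeping for a fixed time. The sketch below is not taken from any of the commits above and is not part of the test framework. It assumes the data directory lock is an advisory POSIX lock on the .lock file that can be probed with fcntl.lockf from another process. The helper names lock_is_free and wait_for_lock_release, the timeout and the poll interval are all hypothetical. Unix only.

```python
import fcntl
import time


def lock_is_free(lock_path):
    """Return True if this process can take a POSIX lock on lock_path.

    Note: opening in append mode creates the file if it does not exist.
    """
    with open(lock_path, "a") as f:
        try:
            fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Another process still holds a conflicting lock.
            return False
        # Leaving the with-block closes the file, which drops the probe lock.
        return True


def wait_for_lock_release(lock_path, timeout=10.0, poll_interval=0.05):
    """Poll until the lock can be taken; return False if timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if lock_is_free(lock_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A helper like this could be called with the path of the node's .lock file between stopping and restarting a node. Unlike a fixed sleep, it returns as soon as the lock is free and reports a timeout explicitly. Whether probing the lock from the test process is reliable on every platform is exactly the kind of question that would need the platform reports requested above.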
If this seems like an issue worth fixing, I’d be happy to try to implement the fix in a PR. I will probably need some guidance when it comes to filesystem locks and how it affects different platforms, and as mentioned earlier I’d need some reports on which systems (mostly OS and C++ debugger) this issue does or does not affect.
System information
The issue can be reproduced on the current master branch (0bd7ca9).
I can reproduce the issue on:
- macOS 12.0.1, Apple M1 Pro with lldb version lldb-1300.0.32.4
- macOS 11.6, x86_64 with lldb version lldb-1103.0.22.10