Expected behavior
Functional tests that complete successfully without attaching a C++ debugger should also complete successfully when a C++ debugger is attached.
Actual behavior
Functional tests that automatically stop a node (referred to as node0.a) and then start it again (referred to as node0.b), e.g. feature_reindex.py and mempool_persist.py, fail when lldb is attached to the bitcoind process of node0.a. The test framework raises AssertionError: [node 0] Error: no RPC connection, and the stderr output of node0.b reads: Error: Cannot obtain a lock on data directory /var/folders/sn/cvk2394n1y582qrt04llpzyw0000gn/T/bitcoin_func_test_x8dfwjci/node0/regtest. Bitcoin Core is probably already running.
Empirically, it seems that node0.b is started before the filesystem has released the data directory lock held by the node0.a process.
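The suspected race can be illustrated outside of Bitcoin Core. The sketch below assumes the data directory lock behaves like a POSIX fcntl advisory lock on a .lock file; that is an assumption made for illustration, not something confirmed in this report. The names CHILD, try_lock_in_child and demo are hypothetical. One process holds the lock (playing node0.a) while a second process (playing node0.b) tries to take it, first too early and then after the release.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Code run by the second process: try the lock without blocking and
# print the outcome. (Hypothetical stand-in for node0.b starting up.)
CHILD = """
import fcntl, sys
with open(sys.argv[1], "a") as f:
    try:
        fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        print("acquired")
    except OSError:
        print("cannot obtain lock")
"""


def try_lock_in_child(lock_path):
    """Run a separate process that attempts the lock; return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, lock_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def demo():
    """Return the child's outcome while the lock is held and after release."""
    with tempfile.TemporaryDirectory() as datadir:
        lock_path = os.path.join(datadir, ".lock")
        with open(lock_path, "a") as holder:  # plays the role of node0.a
            fcntl.lockf(holder, fcntl.LOCK_EX | fcntl.LOCK_NB)
            # node0.b is started while node0.a still holds the lock
            while_held = try_lock_in_child(lock_path)
        # Closing the file released the lock, so a later start succeeds.
        after_release = try_lock_in_child(lock_path)
    return while_held, after_release
```

A separate process is used on purpose: fcntl locks are owned per process, so a second attempt from within the same process would not conflict. This sketch only runs on POSIX systems.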
Full stacktrace of the AssertionError:
TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
  File "./test/functional/test_framework/test_framework.py", line 537, in start_nodes
    node.wait_for_rpc_connection()
  File "./test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection
    raise FailedToStartError(self._node_msg(
test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test/functional/test_framework/test_framework.py", line 132, in main
    self.run_test()
  File "./test/functional/feature_reindex.py", line 36, in run_test
    self.reindex(False)
  File "./test/functional/feature_reindex.py", line 30, in reindex
    self.start_nodes(extra_args)
  File "./test/functional/test_framework/test_framework.py", line 540, in start_nodes
    self.stop_nodes()
  File "./test/functional/test_framework/test_framework.py", line 555, in stop_nodes
    node.stop_node(wait=wait, wait_until_stopped=False)
  File "./test/functional/test_framework/test_node.py", line 335, in stop_node
    self.stop(wait=wait)
  File "./test/functional/test_framework/test_node.py", line 183, in __getattr__
    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
AssertionError: [node 0] Error: no RPC connection
To reproduce
It seems the issue is not reproducible on all platforms; e.g. @LarryRuane reported he could not reproduce it with either lldb or gdb on his Linux setup. However, given how this can be fixed (see "Ideas to fix"), it doesn't look like an issue that would be exclusive to my setup. It would be great if people could try to reproduce it and report back with their system information to pin this down further.
1. git checkout master
2. In ./test/functional/feature_reindex.py, add a pdb breakpoint before running the first test. The placement of the breakpoint is important: the bug is only reproducible when the C++ debugger is attached (step 4) before a node stop/start cycle.

   ...
   def run_test(self):
       import pdb; pdb.set_trace()  # Added this line
       self.reindex(False)
   ...

3. Run ./test/functional/feature_reindex.py
4. Attach your C++ debugger, e.g. PATH=/usr/bin /usr/bin/lldb -p $(pgrep bitcoind)
5. continue your C++ debugger (you don't need to set any breakpoints)
6. continue pdb
7. test_framework should raise AssertionError: [node 0] Error: no RPC connection
Note: when skipping steps 4 and 5, the test should still run fine. It's attaching the C++ debugger that seems to interfere with releasing the filesystem lock in time.
Ideas to fix
I've tried some simple fixes to resolve the issue, listed here in order of increasing complexity. Each comes with the SHA of a commit that contains an example implementation.
- Add a sleep timer: c44a0eba
- Remove .lock file: a619c6750
- Explicitly remove the lock: 96c862a92
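As a variation on the sleep-timer idea, the test framework could poll until the lock is free instead of sleeping for a fixed time. The following is a minimal sketch, not the implementation from any of the commits above. It assumes the lock is a POSIX fcntl advisory lock on a .lock file in the data directory; the helper name wait_for_lock_release and its parameters are hypothetical.

```python
import fcntl
import time


def wait_for_lock_release(lock_path, timeout=10.0, poll_interval=0.1):
    """Return True once an exclusive lock on lock_path can be taken.

    Returns False if the lock is still held by another process when
    the timeout expires. (Hypothetical helper, for illustration only.)
    """
    deadline = time.monotonic() + timeout
    while True:
        # "a" opens for writing (needed for an exclusive lock) and
        # creates the file if it does not exist yet.
        with open(lock_path, "a") as lock_file:
            try:
                fcntl.lockf(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                pass  # still held by another process, retry below
            else:
                # Release the probe lock right away so the node that is
                # about to start can take it.
                fcntl.lockf(lock_file, fcntl.LOCK_UN)
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller would run this on the node's .lock file between stopping node0.a and starting node0.b, and fail the test with a clear message if it returns False. Because fcntl locks are owned per process, the probe only works if the calling process does not hold the lock itself. Windows would need a different mechanism.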
If this seems like an issue worth fixing, I'd be happy to try to implement the fix in a PR. I will probably need some guidance when it comes to filesystem locks and how they behave on different platforms, and as mentioned earlier I'd need some reports on which systems (mostly OS and C++ debugger) this issue does or does not affect.
System information
The issue can be replicated with current master branch 0bd7ca9
I can reproduce the issue on:
- macOS 12.0.1, Apple M1 Pro with lldb version lldb-1300.0.32.4
- macOS 11.6, x86_64 with lldb version lldb-1103.0.22.10