test: C++ debugger interferes with functional tests that quickly stop/start nodes #23651

issue stickies-v openend this issue on December 2, 2021
  1. stickies-v commented at 2:22 pm on December 2, 2021: contributor

    Expected behavior

    Functional tests that complete successfully without attaching a C++ debugger should also complete successfully when a C++ debugger is attached.

    Actual behavior

    Functional tests that automatically stop (referred to as node0.a) and then start a node (referred to as node0.b) (e.g. feature_reindex.py, mempool_persist.py) fail when lldb is attached to the bitcoind process of node0.a. The test framework raises AssertionError: [node 0] Error: no RPC connection, with the stderr output of node0.b indicating that node0 stderr Error: Cannot obtain a lock on data directory /var/folders/sn/cvk2394n1y582qrt04llpzyw0000gn/T/bitcoin_func_test_x8dfwjci/node0/regtest. Bitcoin Core is probably already running.

    Empirically, it seems that node0.b is started before the lock of the process of node0.a is properly released by the filesystem.

    Full stacktrace of the AssertionError:

     0TestFramework (ERROR): Assertion failed
     1Traceback (most recent call last):
     2  File "./test/functional/test_framework/test_framework.py", line 537, in start_nodes
     3    node.wait_for_rpc_connection()
     4  File "./test/functional/test_framework/test_node.py", line 224, in wait_for_rpc_connection
     5    raise FailedToStartError(self._node_msg(
     6test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization
     7
     8During handling of the above exception, another exception occurred:
     9
    10Traceback (most recent call last):
    11  File "./test/functional/test_framework/test_framework.py", line 132, in main
    12    self.run_test()
    13  File "./test/functional/feature_reindex.py", line 36, in run_test
    14    self.reindex(False)
    15  File "./test/functional/feature_reindex.py", line 30, in reindex
    16    self.start_nodes(extra_args)
    17  File "./test/functional/test_framework/test_framework.py", line 540, in start_nodes
    18    self.stop_nodes()
    19  File "./test/functional/test_framework/test_framework.py", line 555, in stop_nodes
    20    node.stop_node(wait=wait, wait_until_stopped=False)
    21  File "./test/functional/test_framework/test_node.py", line 335, in stop_node
    22    self.stop(wait=wait)
    23  File "./test/functional/test_framework/test_node.py", line 183, in __getattr__
    24    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    25AssertionError: [node 0] Error: no RPC connection
    

    To reproduce It seems the issue is not reproducible on all platforms, e.g. @LarryRuane reported he could not reproduce with either lldb or gdb on his linux setup. However, given the nature of how this can be fixed (see further), it doesn’t look like an issue that would be exclusive to my setup. Would be great if people can try to reproduce and report back with their system information to further pin this down.

    1. git checkout master
    2. in ./test/functional/feature_reindex.py, add a pdb breakpoint before running the first test. The placement of the breakpoint is important, the bug is only reproducible when the C++ debugger is attached (step 4) before a node stop/start cycle.
    0...
    1    def run_test(self):
    2        import pdb; pdb.set_trace()  # Added this line
    3        self.reindex(False)
    4...
    
    1. run ./test/functional/feature_reindex.py
    2. attach your C++ debugger, e.g. PATH=/usr/bin /usr/bin/lldb -p $(pgrep bitcoind)
    3. continue your C++ debugger (you don’t need to set any breakpoints)
    4. continue pdb
    5. test_framework should raise AssertionError: [node 0] Error: no RPC connection

    Note: when skipping steps 4 and 5, the test should still run fine. It’s attaching the C++ debugger that seems to interfere with releasing the filesystem lock in time.

    Ideas to fix I’ve tried some simple fixes to resolve the issue, ordered by increasing complexity. I’ve added commit shas that contain an example implementation.

    1. Add a sleep timer: c44a0eba
    2. Remove .lock file: a619c6750
    3. Explicitly remove the lock: 96c862a92

    If this seems like an issue worth fixing, I’d be happy to try to implement the fix in a PR. I will probably need some guidance when it comes to filesystem locks and how it affects different platforms, and as mentioned earlier I’d need some reports on which systems (mostly OS and C++ debugger) this issue does or does not affect.

    System information

    The issue can be replicated with current master branch 0bd7ca9

    I can reproduce the issue on:

    • macOS 12.0.1, Apple M1 Pro with lldb version lldb-1300.0.32.4
    • macOS 11.6, x86_64 with lldb version lldb-1103.0.22.10
  2. stickies-v added the label Bug on Dec 2, 2021
  3. maflcko commented at 2:28 pm on December 2, 2021: member
    Can you check if this is timeout related? That is, increase the --timeout-factor=100?
  4. stickies-v closed this on Dec 2, 2021

  5. stickies-v reopened this on Dec 2, 2021

  6. stickies-v commented at 2:38 pm on December 2, 2021: contributor

    Can you check if this is timeout related? That is, increase the --timeout-factor=100?

    I did try that, but it doesn’t resolve the issue: ./test/functional/feature_reindex.py --timeout-factor=100

  7. stickies-v commented at 11:28 pm on December 4, 2021: contributor
    I was able to replicate this issue on an x86_64 Intel macOS 11.6 (lldb-1103.0.22.10) too, so it doesn’t appear to be specific to Apple Silicon.
  8. kouloumos commented at 3:52 pm on August 18, 2022: contributor
    I was able to replicate this issue on an x86_64 Intel macOS 10.15 (lldb-1200.0.44.2). I was not able to replicate this issue on an x86_64 Intel NixOs 22.05 (gdb v12.1 & lldb v13.0.1)
  9. maflcko added the label macOS on Aug 18, 2022
  10. maflcko added the label Tests on Aug 18, 2022
  11. rserranon commented at 3:20 am on November 19, 2022: none

    I was able to reproduce the error AssertionError: [node 0] Error: no RPC connection on M1 MacOS 11.6 (lldb-1300.0.42.3)

    I also verified that by not attaching LLDB, the feature_reindex.py test, runs all tests successfully.

  12. stickies-v commented at 9:45 am on October 15, 2024: contributor

    I’m going to close this issue, if anyone is interested in fixing this please feel free to do so. The issue still persists for me, but it’s quite an edge case, I’ve not ran into anymore since reporting this almost three years ago.

    I explored opening a PR for it using the below diff that forces removal of the .lock file, but it doesn’t quite seem to fix things for me, and I don’t feel like spending more time on it at the moment. I’d be happy to review a PR that addresses this, though.

     0diff --git a/test/functional/test_framework/test_node.py b/test/functional/test_framework/test_node.py
     1index 60ca9269a5..fe6c16c373 100755
     2--- a/test/functional/test_framework/test_node.py
     3+++ b/test/functional/test_framework/test_node.py
     4@@ -408,6 +408,21 @@ class TestNode():
     5         if wait_until_stopped:
     6             self.wait_until_stopped(expected_stderr=expected_stderr)
     7 
     8+        # The .lock file is automatically deleted when the global
     9+        # `util::fs_helpers::dir_locks` is destructed on shutdown.
    10+        # However, on macOS, when a debugger is attached, the removal of
    11+        # this file can take longer than the execution of `stop_node()`,
    12+        # causing issues when `start_node()` is called immediately
    13+        # after. For that reason, manually ensure the .lock file is
    14+        # removed before returning in this function.
    15+        # See [#23651](/bitcoin-bitcoin/23651/)
    16+        lock_path = self.datadir_path / self.chain / ".lock"
    17+        try:
    18+            os.remove(lock_path)
    19+        except FileNotFoundError:
    20+            # in most cases, the .lock file should already be removed
    21+            pass
    22+
    23     def is_node_stopped(self, *, expected_stderr="", expected_ret_code=0):
    24         """Checks whether the node has stopped.
    25 
    
  13. stickies-v closed this on Oct 15, 2024

  14. stickies-v closed this on Oct 15, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-06 12:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me