test: Fix all races after a socket is closed gracefully #34425

pull maflcko wants to merge 2 commits into bitcoin:master from maflcko:2601-test-missing-sync changing 10 files +31 −6
  1. maflcko commented at 9:35 pm on January 27, 2026: member

    Currently, functional tests may intermittently fail, due to a lack of synchronization after a node connection socket was closed gracefully: If node A is connected to node B, and node B closes the connection, node A must wait for the connection to be closed before continuing the test. Otherwise, subsequent re-connections may not work while a stale connection is still alive.

    This can be reproduced locally via something like:

    0diff --git a/src/net.cpp b/src/net.cpp
    1index 6b79e913e8..32bd061500 100644
    2--- a/src/net.cpp
    3+++ b/src/net.cpp
    4@@ -2200,2 +2200,3 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
    5                 }
    6+            UninterruptibleSleep(599ms);
    7                 pnode->CloseSocketDisconnect();
    

    With this diff, the tests should fail on master, and pass after this fix.

    The fix involves simply calling self.wait_until(lambda: len(node_a.getpeerinfo()) == 0) after each disconnect, and before each re-connection.

  2. DrahtBot added the label Tests on Jan 27, 2026
  3. DrahtBot commented at 9:35 pm on January 27, 2026: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34425.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Stale ACK w0xlt, davidgumberg

    If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

    Conflicts

    No conflicts as of last run.

  4. in test/functional/rpc_setban.py:29 in fa652a5d63
    27@@ -28,17 +28,19 @@ def run_test(self):
    28 
    29         # Node 0 get banned by Node 1
    


    w0xlt commented at 9:52 pm on January 27, 2026:

    nit not related to PR changes

    0        # Node 0 gets banned by Node 1
    
  5. w0xlt commented at 10:00 pm on January 27, 2026: contributor
    ACK fa652a5d63d9e1855e73fa72d86c0b54f67aa0c7
  6. maflcko renamed this:
    test: Add missing sync on node0 in rpc_setban.py
    test: Fix all races after a socket is closed gracefully
    on Jan 28, 2026
  7. maflcko force-pushed on Jan 28, 2026
  8. DrahtBot added the label CI failed on Jan 28, 2026
  9. DrahtBot removed the label CI failed on Jan 28, 2026
  10. maflcko commented at 11:17 am on January 28, 2026: member
    Pushed more commits to fix all such races in all tests
  11. davidgumberg commented at 6:47 pm on January 29, 2026: contributor
    Maybe some of these commits can be squashed?
  12. maflcko force-pushed on Jan 29, 2026
  13. maflcko commented at 6:51 pm on January 29, 2026: member
    Sure, squashed all down into one commit. It touches a few files, but 16 lines modified should be trivial enough for a single commit.
  14. maflcko force-pushed on Jan 29, 2026
  15. DrahtBot added the label CI failed on Jan 29, 2026
  16. in test/functional/p2p_disconnect_ban.py:107 in fa759a637c outdated
    103@@ -104,6 +104,7 @@ def run_test(self):
    104         # Keep mocktime, to avoid ban expiry when restart takes longer than
    105         # time_remaining
    106         self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
    107+        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    


    davidgumberg commented at 7:13 pm on January 29, 2026:

    I don’t think wait_until is necessary here, since node 1 has banned node0 and is not connected, e.g.:

    0        assert(len(self.nodes[0].getpeerinfo()) == 0)
    1        self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
    2        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    

    passes


    maflcko commented at 7:24 pm on January 29, 2026:

    The were connected, see line 25:

    0        self.connect_nodes(0, 1)
    

    A sync is required after the disconnect and before the next connection.

    Your test diff is expected to pass in the same way that the test is passing on current master.

    Did you apply the cpp diff as well to trigger the error?


    davidgumberg commented at 8:22 pm on January 29, 2026:
    Ah I see, and confirmed that the suggested cpp diff causes an error, feel free to mark to resolved
  17. DrahtBot removed the label CI failed on Jan 29, 2026
  18. davidgumberg commented at 9:19 pm on January 29, 2026: contributor

    Concept ACK, but I think len(getpeerinfo()) is kind of a brittle way / hacky way to enforce that some node is or isn’t connected to another node.

    What do you think of adding a TestNode method that checks if a node is connected to another node by looking for the subver in getpeerinfo(), ie:

    0class TestNode():
    1    def is_connected_to(self, other):
    2        assert isinstance(other, TestNode)
    3        other_subver = other.getnetworkinfo()['subversion']
    4
    5        return any(peer["subver"] == other_subver for peer in self.getpeerinfo())
    

    And then doing:

    0- self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    1+ self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    

    alternatively, could do it using the with style

    0class TestNode(): [@contextlib](/bitcoin-bitcoin/contributor/contextlib/).contextmanager
    1    def wait_for_disconnected_from(self, other, timeout=5):
    2        """
    3        Wait until this node is disconnected from other.
    4        """
    5        yield
    6        self.wait_until(lambda: not self.is_connected_to(other), timeout=timeout)
    

    and then doing:

    0with self.nodes[0].wait_for_disconnected_from(self.nodes[1]):
    1    self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
    
  19. test: Add is_connected_to helper
    Needed in the next commit.
    
    Co-Authored-By: David Gumberg <davidzgumberg@gmail.com>
    fafa766c95
  20. maflcko force-pushed on Jan 30, 2026
  21. maflcko force-pushed on Jan 30, 2026
  22. DrahtBot added the label CI failed on Jan 30, 2026
  23. maflcko commented at 10:29 am on January 30, 2026: member

    Ah nice. I pushed the is_connected_to helper, because it makes the code more self-documenting and easier to understand.

    Not sure about the contextmanager, because connections/disconnections are often not the primary target of the test and using contexts for all of them could be confusing, so I’ll leave this as-is for now.

  24. DrahtBot removed the label CI failed on Jan 30, 2026
  25. maflcko requested review from w0xlt on Mar 4, 2026
  26. maflcko requested review from davidgumberg on Mar 4, 2026
  27. davidgumberg commented at 5:08 pm on March 4, 2026: contributor

    crACK fad3df9af6d787f66d3150882ed1

    Nice, closes a source of intermittent test failures.

  28. in test/functional/rpc_setban.py:46 in fad3df9af6 outdated
    43-            self.restart_node(1, [])
    44             self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
    45+        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    46 
    47         # However, node 0 should be able to reconnect if it has noban permission
    48         self.restart_node(1, ['-whitelist=127.0.0.1'])
    


    w0xlt commented at 6:35 pm on March 4, 2026:

    Looks like one is missing here.

    0        self.restart_node(1, ['-whitelist=127.0.0.1'])
    1        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    

    maflcko commented at 8:39 am on March 5, 2026:
    Can you explain this suggestion? The line you suggest already exists right above in line 43

    maflcko commented at 8:46 am on March 5, 2026:
    ok, I guess this is another bug in the test when running under --v2transport. I really don’t like this fragile (and nested) assert_debug_log with timeout-polling.

    maflcko commented at 8:55 am on March 5, 2026:
    thx, removed the pre-existing fragile and buggy timeout

    maflcko commented at 9:57 am on March 5, 2026:

    Ok, this was a bit tedious to fix. In the latest attempt, I’ve:

    • Removed the timeout from the outer/lower exit stack to check “dropped (banned)\n” on node 1, because the inner/top exit stack waits longer
    • The inner/top exit stack checks “Cleared nodestate for peer=2” for the v1 disconnect
    • It also checks for “Cleared nodestate for peer=3” for the v2 disconnect (if applicable)

    I don’t think a self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1])) can be used here, because it requires a connection to have existed before, so that it can wait for it to be dropped. However, we can’t know the 3rd connection existed, unless we wait for it in the debug log, in which case we can just wait for the 3rd disconnect in the debug log as well.

    So I’ve added a plain:

    • assert not self.nodes[0].is_connected_to(self.nodes[1]), which does not poll

    I hope this is correct now :melting_face:

  29. DrahtBot requested review from w0xlt on Mar 4, 2026
  30. DrahtBot added the label Needs rebase on Mar 4, 2026
  31. maflcko force-pushed on Mar 5, 2026
  32. maflcko commented at 8:56 am on March 5, 2026: member

    Force pushed to remove hunk which is no longer needed after it was removed due to other issues in #34728

    (did not rebase, though)

  33. test: Fix all races after a socket is closed gracefully
    This waits for the disconnect to fully happen before the next connect.
    
    Also, in rpc_setban.py:
    * Fix a typo.
    * Move the `self.restart_node(1, [])` out of the context manager,
      because nothing from the restart needs to be checked.
    faa25f83e8
  34. maflcko force-pushed on Mar 5, 2026
  35. DrahtBot added the label CI failed on Mar 5, 2026
  36. DrahtBot removed the label Needs rebase on Mar 5, 2026
  37. DrahtBot removed the label CI failed on Mar 5, 2026
  38. in test/functional/rpc_setban.py:50 in faa25f83e8
    51+            ],
    52+            timeout=8))
    53         with context:
    54-            self.restart_node(1, [])
    55             self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
    56+        assert not self.nodes[0].is_connected_to(self.nodes[1])
    


    rkrux commented at 11:21 am on March 5, 2026:
    Can these changes be put in a separate commit?
  39. rkrux commented at 11:23 am on March 5, 2026: contributor

    Onboard with the idea but feel that these calls are easy to miss while adding or updating functional tests. disconnect_nodes(self, a, b) does some checking like this internally and given that most of these occurrences are “non disconnect_nodes” cases where one node is restarted, can we consider adding this check at the start of connect_nodes(self, a, b, ...) itself so that we take the onus away from the developer?

    An example that I have tested:

     0diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
     1index 1f95756445..291b5ec197 100755
     2--- a/test/functional/test_framework/test_framework.py
     3+++ b/test/functional/test_framework/test_framework.py
     4@@ -569,6 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
     5         """
     6         from_connection = self.nodes[a]
     7         to_connection = self.nodes[b]
     8+        from_connection.wait_until(lambda: not from_connection.is_connected_to(to_connection))
     9         ip_port = "127.0.0.1:" + str(p2p_port(b))
    10 
    11         if peer_advertises_v2 is None:
    12diff --git a/test/functional/wallet_hd.py b/test/functional/wallet_hd.py
    13index ca8393044c..62ad88d2d7 100755
    14--- a/test/functional/wallet_hd.py
    15+++ b/test/functional/wallet_hd.py
    16@@ -86,7 +86,6 @@ class WalletHDTest(BitcoinTestFramework):
    17             assert_equal(hd_info_2["hdkeypath"], "m/84h/1h/0h/0/" + str(i))
    18             assert_equal(hd_info_2["hdmasterfingerprint"], hd_fingerprint)
    19         assert_equal(hd_add, hd_add_2)
    20-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    21         self.connect_nodes(0, 1)
    22         self.sync_all()
    23 
    24@@ -103,7 +102,6 @@ class WalletHDTest(BitcoinTestFramework):
    25             self.nodes[1].wallets_path / self.default_wallet_name / self.wallet_data_filename
    26         )
    27         self.start_node(1, extra_args=self.extra_args[1])
    28-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    29         self.connect_nodes(0, 1)
    30         self.sync_all()
    31         # Wallet automatically scans blocks older than key on startup
    32(END)
    

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-03-16 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me