test: Fix all races after a socket is closed gracefully #34425

pull maflcko wants to merge 4 commits into bitcoin:master from maflcko:2601-test-missing-sync changing 3 files +32 −24
  1. maflcko commented at 9:35 PM on January 27, 2026: member

    Currently, functional tests may intermittently fail, due to a lack of synchronization after a node connection socket was closed gracefully: If node A is connected to node B, and node B closes the connection, node A must wait for the connection to be closed before continuing the test. Otherwise, subsequent re-connections may not work while a stale connection is still alive.

    This can be reproduced locally via something like:

    diff --git a/src/net.cpp b/src/net.cpp
    index 6b79e913e8..32bd061500 100644
    --- a/src/net.cpp
    +++ b/src/net.cpp
    @@ -2200,2 +2200,3 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
                     }
    +            UninterruptibleSleep(599ms);
                     pnode->CloseSocketDisconnect();
    

    With this diff, the tests should fail on master, and pass after this fix.

    The fix is placed inside the connect_nodes helper.

  2. DrahtBot added the label Tests on Jan 27, 2026
  3. DrahtBot commented at 9:35 PM on January 27, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34425.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK rkrux, w0xlt, davidgumberg, sedited

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    No conflicts as of last run.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  4. in test/functional/rpc_setban.py:29 in fa652a5d63
      27 | @@ -28,17 +28,19 @@ def run_test(self):
      28 |  
      29 |          # Node 0 get banned by Node 1
    


    w0xlt commented at 9:52 PM on January 27, 2026:

    nit not related to PR changes

            # Node 0 gets banned by Node 1
    
  5. w0xlt commented at 10:00 PM on January 27, 2026: contributor

    ACK fa652a5d63d9e1855e73fa72d86c0b54f67aa0c7

  6. maflcko renamed this:
    test: Add missing sync on node0 in rpc_setban.py
    test: Fix all races after a socket is closed gracefully
    on Jan 28, 2026
  7. maflcko force-pushed on Jan 28, 2026
  8. DrahtBot added the label CI failed on Jan 28, 2026
  9. DrahtBot removed the label CI failed on Jan 28, 2026
  10. davidgumberg commented at 6:47 PM on January 29, 2026: contributor

    Maybe some of these commits can be squashed?

  11. maflcko force-pushed on Jan 29, 2026
  12. maflcko commented at 6:51 PM on January 29, 2026: member

    Sure, squashed all down into one commit. It touches a few files, but 16 lines modified should be trivial enough for a single commit.

  13. maflcko force-pushed on Jan 29, 2026
  14. DrahtBot added the label CI failed on Jan 29, 2026
  15. in test/functional/p2p_disconnect_ban.py:107 in fa759a637c outdated
     103 | @@ -104,6 +104,7 @@ def run_test(self):
     104 |          # Keep mocktime, to avoid ban expiry when restart takes longer than
     105 |          # time_remaining
     106 |          self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
     107 | +        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    


    davidgumberg commented at 7:13 PM on January 29, 2026:

    I don't think wait_until is necessary here, since node 1 has banned node0 and is not connected, e.g.:

            assert(len(self.nodes[0].getpeerinfo()) == 0)
            self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
            self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    

    passes


    maflcko commented at 7:24 PM on January 29, 2026:

    The were connected, see line 25:

            self.connect_nodes(0, 1)
    

    A sync is required after the disconnect and before the next connection.

    Your test diff is expected to pass in the same way that the test is passing on current master.

    Did you apply the cpp diff as well to trigger the error?


    davidgumberg commented at 8:22 PM on January 29, 2026:

    Ah I see, and confirmed that the suggested cpp diff causes an error, feel free to mark to resolved

  16. DrahtBot removed the label CI failed on Jan 29, 2026
  17. davidgumberg commented at 9:19 PM on January 29, 2026: contributor

    Concept ACK, but I think len(getpeerinfo()) is kind of a brittle way / hacky way to enforce that some node is or isn't connected to another node.

    What do you think of adding a TestNode method that checks if a node is connected to another node by looking for the subver in getpeerinfo(), ie:

    class TestNode():
        def is_connected_to(self, other):
            assert isinstance(other, TestNode)
            other_subver = other.getnetworkinfo()['subversion']
    
            return any(peer["subver"] == other_subver for peer in self.getpeerinfo())
    

    And then doing:

    - self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
    + self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    

    alternatively, could do it using the with style

    class TestNode(): [@contextlib](/bitcoin-bitcoin/contributor/contextlib/).contextmanager
        def wait_for_disconnected_from(self, other, timeout=5):
            """
            Wait until this node is disconnected from other.
            """
            yield
            self.wait_until(lambda: not self.is_connected_to(other), timeout=timeout)
    

    and then doing:

    with self.nodes[0].wait_for_disconnected_from(self.nodes[1]):
        self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
    
  18. maflcko force-pushed on Jan 30, 2026
  19. maflcko force-pushed on Jan 30, 2026
  20. DrahtBot added the label CI failed on Jan 30, 2026
  21. maflcko commented at 10:29 AM on January 30, 2026: member

    Ah nice. I pushed the is_connected_to helper, because it makes the code more self-documenting and easier to understand.

    Not sure about the contextmanager, because connections/disconnections are often not the primary target of the test and using contexts for all of them could be confusing, so I'll leave this as-is for now.

  22. DrahtBot removed the label CI failed on Jan 30, 2026
  23. maflcko requested review from w0xlt on Mar 4, 2026
  24. maflcko requested review from davidgumberg on Mar 4, 2026
  25. davidgumberg commented at 5:08 PM on March 4, 2026: contributor

    crACK fad3df9af6d787f66d3150882ed1

    Nice, closes a source of intermittent test failures.

  26. in test/functional/rpc_setban.py:46 in fad3df9af6 outdated
      43 | -            self.restart_node(1, [])
      44 |              self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
      45 | +        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
      46 |  
      47 |          # However, node 0 should be able to reconnect if it has noban permission
      48 |          self.restart_node(1, ['-whitelist=127.0.0.1'])
    


    w0xlt commented at 6:35 PM on March 4, 2026:

    Looks like one is missing here.

            self.restart_node(1, ['-whitelist=127.0.0.1'])
            self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
    

    maflcko commented at 8:39 AM on March 5, 2026:

    Can you explain this suggestion? The line you suggest already exists right above in line 43


    maflcko commented at 8:46 AM on March 5, 2026:

    ok, I guess this is another bug in the test when running under --v2transport. I really don't like this fragile (and nested) assert_debug_log with timeout-polling.


    maflcko commented at 8:55 AM on March 5, 2026:

    thx, removed the pre-existing fragile and buggy timeout


    maflcko commented at 9:57 AM on March 5, 2026:

    Ok, this was a bit tedious to fix. In the latest attempt, I've:

    • Removed the timeout from the outer/lower exit stack to check "dropped (banned)\n" on node 1, because the inner/top exit stack waits longer
    • The inner/top exit stack checks "Cleared nodestate for peer=2" for the v1 disconnect
    • It also checks for "Cleared nodestate for peer=3" for the v2 disconnect (if applicable)

    I don't think a self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1])) can be used here, because it requires a connection to have existed before, so that it can wait for it to be dropped. However, we can't know the 3rd connection existed, unless we wait for it in the debug log, in which case we can just wait for the 3rd disconnect in the debug log as well.

    So I've added a plain:

    • assert not self.nodes[0].is_connected_to(self.nodes[1]), which does not poll

    I hope this is correct now :melting_face:


    maflcko commented at 3:25 PM on March 27, 2026:

    (is now in a separate commit)

  27. DrahtBot requested review from w0xlt on Mar 4, 2026
  28. DrahtBot added the label Needs rebase on Mar 4, 2026
  29. maflcko force-pushed on Mar 5, 2026
  30. maflcko force-pushed on Mar 5, 2026
  31. DrahtBot added the label CI failed on Mar 5, 2026
  32. DrahtBot removed the label Needs rebase on Mar 5, 2026
  33. DrahtBot removed the label CI failed on Mar 5, 2026
  34. in test/functional/rpc_setban.py:50 in faa25f83e8 outdated
      51 | +            ],
      52 | +            timeout=8))
      53 |          with context:
      54 | -            self.restart_node(1, [])
      55 |              self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
      56 | +        assert not self.nodes[0].is_connected_to(self.nodes[1])
    


    rkrux commented at 11:21 AM on March 5, 2026:

    Can these changes be put in a separate commit?


    maflcko commented at 3:24 PM on March 27, 2026:

    Yes, done, thx

  35. rkrux commented at 11:23 AM on March 5, 2026: contributor

    Onboard with the idea but feel that these calls are easy to miss while adding or updating functional tests. disconnect_nodes(self, a, b) does some checking like this internally and given that most of these occurrences are "non disconnect_nodes" cases where one node is restarted, can we consider adding this check at the start of connect_nodes(self, a, b, ...) itself so that we take the onus away from the developer?

    An example that I have tested:

    diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
    index 1f95756445..291b5ec197 100755
    --- a/test/functional/test_framework/test_framework.py
    +++ b/test/functional/test_framework/test_framework.py
    @@ -569,6 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
             """
             from_connection = self.nodes[a]
             to_connection = self.nodes[b]
    +        from_connection.wait_until(lambda: not from_connection.is_connected_to(to_connection))
             ip_port = "127.0.0.1:" + str(p2p_port(b))
     
             if peer_advertises_v2 is None:
    diff --git a/test/functional/wallet_hd.py b/test/functional/wallet_hd.py
    index ca8393044c..62ad88d2d7 100755
    --- a/test/functional/wallet_hd.py
    +++ b/test/functional/wallet_hd.py
    @@ -86,7 +86,6 @@ class WalletHDTest(BitcoinTestFramework):
                 assert_equal(hd_info_2["hdkeypath"], "m/84h/1h/0h/0/" + str(i))
                 assert_equal(hd_info_2["hdmasterfingerprint"], hd_fingerprint)
             assert_equal(hd_add, hd_add_2)
    -        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
             self.connect_nodes(0, 1)
             self.sync_all()
     
    @@ -103,7 +102,6 @@ class WalletHDTest(BitcoinTestFramework):
                 self.nodes[1].wallets_path / self.default_wallet_name / self.wallet_data_filename
             )
             self.start_node(1, extra_args=self.extra_args[1])
    -        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
             self.connect_nodes(0, 1)
             self.sync_all()
             # Wallet automatically scans blocks older than key on startup
    (END)
    
  36. sedited commented at 4:37 PM on March 18, 2026: contributor

    The changes here lgtm, but I'm also curious if this could be programmed a bit more defensively, as suggested by rkrux here.

  37. DrahtBot added the label Needs rebase on Mar 19, 2026
  38. in test/functional/test_framework/test_node.py:900 in faa25f83e8 outdated
     896 | @@ -920,6 +897,11 @@ def disconnect_p2ps(self):
     897 |  
     898 |          self.wait_until(lambda: self.num_test_p2p_connections() == 0)
     899 |  
     900 | +    def is_connected_to(self, other):
    


    w0xlt commented at 10:12 PM on March 26, 2026:
        def is_connected_to(self, other):
         """Check if this node has an active connection to other."""
    

    maflcko commented at 3:24 PM on March 27, 2026:

    Thx, but I am confident that a function called is_connected_to(self, other) means that it returns whether it is connected to the other thing, and doesn't need a docstring

  39. test: Add is_connected_to helper
    Needed in the next commit.
    
    Co-Authored-By: David Gumberg <davidzgumberg@gmail.com>
    faa404e119
  40. test: Stricter checks in rpc_setban.py
    Make the checks stricter and easier to follow:
    * Fix a typo.
    * After the first ban from node 1 wait until node 0 "sees" the ban.
    * Move the restart_node out of the debug log context, to avoid bloat.
    * Removed the timeout from the outer/lower exit stack to check "dropped
      (banned)\n" on node 1, because the inner/top exit stack waits longer.
    * The inner/top exit stack checks for the both disconnections peer=2 and
      possibly peer=3 (for v2->v1 retry).
    * And finally, add a redundant assert to confirm once more that node 0
      is has "seen" the ban.
    fa21edddb2
  41. test: Fix all races after a socket is closed gracefully
    This waits for any disconnect (e.g. from a restart of one of the nodes)
    to fully happen before the next connect.
    
    Can be reviewed with the git option:
    
    --color-moved=dimmed-zebra
    fab2772647
  42. test: Remove unused, confusing and brittle connect_nodes.wait_for_connect
    The option is unused since the last removals in:
    * 4c40a923f003420193aa574745f70788bcf35265, and
    * 81bf3ebff7e7108bbfbf6fe4e122f4e52f278701
    
    It was brittle and lead to intermittent test issues. Generally, it is
    also confusing, because if a test wanted to connect nodes without
    checking their connection, it can use `addnode`, like the rpc_setban.py
    test.
    
    So fix all issues by removing it.
    fae807ed25
  43. in test/functional/p2p_permissions.py:93 in faa25f83e8


    w0xlt commented at 10:27 PM on March 26, 2026:

    The reconnect race this PR is trying to fix is also present in test/functional/p2p_permissions.py:{92,93}.

             for flag, permissions in [(["-whitelist=noban,out@127.0.0.1"], ["noban", "download"]), (["-whitelist=noban@127.0.0.1"], [])]:
                 self.restart_node(0, flag)
    +            self.wait_until(lambda: not self.nodes[1].is_connected_to(self.nodes[0]))
                 self.connect_nodes(0, 1)
    
    

    maflcko commented at 3:23 PM on March 27, 2026:

    I guess fixed now?

  44. maflcko force-pushed on Mar 27, 2026
  45. maflcko commented at 3:22 PM on March 27, 2026: member

    so that we take the onus away from the developer?

    Yeah, that sounds great. If the heavy lifting isn't done by the test framework, then it will just be annoying and tedious for developers to figure it out by themselves.

    Pushed your suggestion, but I had to rework it, because the tests allow two-way connections.

  46. DrahtBot removed the label Needs rebase on Mar 27, 2026
  47. maflcko requested review from w0xlt on Apr 3, 2026
  48. maflcko requested review from rkrux on Apr 14, 2026
  49. in test/functional/rpc_setban.py:47 in fa21edddb2
      47 | +                "Cleared nodestate for peer=2",
      48 | +                "Cleared nodestate for peer=3",
      49 | +            ] if self.options.v2transport else [
      50 | +                "Cleared nodestate for peer=2",  # Just one v1 disconnect to wait for
      51 | +            ],
      52 | +            timeout=8))
    


    rkrux commented at 12:32 PM on April 14, 2026:

    In fa21edddb2722f7188130e56b06985ee2ec3ec1c: curious what made you select 8?


    maflcko commented at 12:56 PM on April 14, 2026:

    I'd say any strictly positive value is fine here

  50. rkrux approved
  51. rkrux commented at 12:40 PM on April 14, 2026: contributor

    ACK fae807ed25561bab7148c5a0d7bd847314c79d88

    Thanks for incorporating the suggestion.

    Edit: The last sentence in the description can be updated.

  52. maflcko commented at 12:56 PM on April 14, 2026: member

    Edit: The last sentence in the description can be updated.

    Thx, done

  53. w0xlt commented at 2:39 PM on April 15, 2026: contributor

    ACK fae807ed25561bab7148c5a0d7bd847314c79d88

  54. maflcko removed review request from davidgumberg on Apr 16, 2026
  55. maflcko requested review from davidgumberg on Apr 16, 2026
  56. in test/functional/test_framework/test_framework.py:571 in fab2772647
     566 | +
     567 | +        def find_conn(node, peer_subversion, inbound):
     568 | +            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)
     569 | +
     570 | +        self.wait_until(lambda: not find_conn(from_connection, to_connection_subver, inbound=False))
     571 | +        self.wait_until(lambda: not find_conn(to_connection, from_connection_subver, inbound=True))
    


    davidgumberg commented at 7:05 PM on April 17, 2026:

    In commit https://github.com/bitcoin/bitcoin/pull/34425/changes/fab27726478d686dcd66d26d3b7b34aa323b8066 (test: Fix all races after a socket is closed gracefully)

    Very nice! I think this is also a win for other tests since now the test framework enforces that if you call connect_nodes, the nodes were not already connected.

  57. DrahtBot requested review from davidgumberg on Apr 17, 2026
  58. davidgumberg approved
  59. davidgumberg commented at 7:08 PM on April 17, 2026: contributor

    (re-ish) crACK fae807ed25561bab7148c5a0d7bd847314c79d88

    I re-reviewed the changes here entirely since I recall little since my last review, I think this is a good change that will make tests less brittle, and I like the changes suggested by @rkrux that have been implemented.

  60. sedited approved
  61. sedited commented at 9:05 AM on April 19, 2026: contributor

    ACK fae807ed25561bab7148c5a0d7bd847314c79d88

  62. sedited merged this on Apr 19, 2026
  63. sedited closed this on Apr 19, 2026

  64. maflcko deleted the branch on Apr 19, 2026
  65. fanquake commented at 10:45 AM on April 21, 2026: member

    Backported to 31.x in #35046.

  66. fanquake referenced this in commit 332d2a2ef2 on Apr 21, 2026
  67. fanquake referenced this in commit 402f347fd5 on Apr 21, 2026
  68. fanquake referenced this in commit bb536df042 on Apr 21, 2026
  69. fanquake referenced this in commit ab19213660 on Apr 21, 2026

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me