test: Fix all races after a socket is closed gracefully #34425

maflcko commented at 9:35 PM on January 27, 2026: member

Currently, functional tests may intermittently fail, due to a lack of synchronization after a node connection socket was closed gracefully: If node A is connected to node B, and node B closes the connection, node A must wait for the connection to be closed before continuing the test. Otherwise, subsequent re-connections may not work while a stale connection is still alive.

This can be reproduced locally via something like:

diff --git a/src/net.cpp b/src/net.cpp
index 6b79e913e8..32bd061500 100644
--- a/src/net.cpp
+++ b/src/net.cpp
@@ -2200,2 +2200,3 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
                 }
+            UninterruptibleSleep(599ms);
                 pnode->CloseSocketDisconnect();

With this diff, the tests should fail on master, and pass after this fix.

The fix is placed inside the connect_nodes helper.

DrahtBot added the label Tests on Jan 27, 2026

DrahtBot commented at 9:35 PM on January 27, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34425.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	rkrux, w0xlt, davidgumberg, sedited

If your review is incorrectly listed, please copy-paste <code></code> into the comment that the bot should ignore.

Conflicts

No conflicts as of last run.

in test/functional/rpc_setban.py:29 in fa652a5d63

  27 | @@ -28,17 +28,19 @@ def run_test(self):
  28 |  
  29 |          # Node 0 get banned by Node 1

w0xlt commented at 9:52 PM on January 27, 2026:

nit not related to PR changes

        # Node 0 gets banned by Node 1

w0xlt commented at 10:00 PM on January 27, 2026: contributor

ACK fa652a5d63d9e1855e73fa72d86c0b54f67aa0c7

maflcko renamed this:
~~test: Add missing sync on node0 in rpc_setban.py~~
test: Fix all races after a socket is closed gracefully
on Jan 28, 2026

maflcko force-pushed on Jan 28, 2026

DrahtBot added the label CI failed on Jan 28, 2026

DrahtBot removed the label CI failed on Jan 28, 2026

davidgumberg commented at 6:47 PM on January 29, 2026: contributor

Maybe some of these commits can be squashed?

maflcko force-pushed on Jan 29, 2026

maflcko commented at 6:51 PM on January 29, 2026: member

Sure, squashed all down into one commit. It touches a few files, but 16 lines modified should be trivial enough for a single commit.

maflcko force-pushed on Jan 29, 2026

DrahtBot added the label CI failed on Jan 29, 2026

in test/functional/p2p_disconnect_ban.py:107 in fa759a637c outdated

 103 | @@ -104,6 +104,7 @@ def run_test(self):
 104 |          # Keep mocktime, to avoid ban expiry when restart takes longer than
 105 |          # time_remaining
 106 |          self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
 107 | +        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)

davidgumberg commented at 7:13 PM on January 29, 2026:

I don't think wait_until is necessary here, since node 1 has banned node0 and is not connected, e.g.:

        assert(len(self.nodes[0].getpeerinfo()) == 0)
        self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)

passes

maflcko commented at 7:24 PM on January 29, 2026:

The were connected, see line 25:

        self.connect_nodes(0, 1)

A sync is required after the disconnect and before the next connection.

Your test diff is expected to pass in the same way that the test is passing on current master.

Did you apply the cpp diff as well to trigger the error?

davidgumberg commented at 8:22 PM on January 29, 2026:

Ah I see, and confirmed that the suggested cpp diff causes an error, feel free to mark to resolved

DrahtBot removed the label CI failed on Jan 29, 2026

davidgumberg commented at 9:19 PM on January 29, 2026: contributor

Concept ACK, but I think len(getpeerinfo()) is kind of a brittle way / hacky way to enforce that some node is or isn't connected to another node.

What do you think of adding a TestNode method that checks if a node is connected to another node by looking for the subver in getpeerinfo(), ie:

class TestNode():
    def is_connected_to(self, other):
        assert isinstance(other, TestNode)
        other_subver = other.getnetworkinfo()['subversion']

        return any(peer["subver"] == other_subver for peer in self.getpeerinfo())

And then doing:

- self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
+ self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))

alternatively, could do it using the with style

class TestNode(): [@contextlib](/bitcoin-bitcoin/contributor/contextlib/).contextmanager
    def wait_for_disconnected_from(self, other, timeout=5):
        """
        Wait until this node is disconnected from other.
        """
        yield
        self.wait_until(lambda: not self.is_connected_to(other), timeout=timeout)

and then doing:

with self.nodes[0].wait_for_disconnected_from(self.nodes[1]):
    self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])

maflcko force-pushed on Jan 30, 2026

DrahtBot added the label CI failed on Jan 30, 2026

maflcko commented at 10:29 AM on January 30, 2026: member

Ah nice. I pushed the is_connected_to helper, because it makes the code more self-documenting and easier to understand.

Not sure about the contextmanager, because connections/disconnections are often not the primary target of the test and using contexts for all of them could be confusing, so I'll leave this as-is for now.

DrahtBot removed the label CI failed on Jan 30, 2026

maflcko requested review from w0xlt on Mar 4, 2026

maflcko requested review from davidgumberg on Mar 4, 2026

davidgumberg commented at 5:08 PM on March 4, 2026: contributor

crACK fad3df9af6d787f66d3150882ed1

Nice, closes a source of intermittent test failures.

in test/functional/rpc_setban.py:46 in fad3df9af6 outdated

  43 | -            self.restart_node(1, [])
  44 |              self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
  45 | +        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
  46 |  
  47 |          # However, node 0 should be able to reconnect if it has noban permission
  48 |          self.restart_node(1, ['-whitelist=127.0.0.1'])

w0xlt commented at 6:35 PM on March 4, 2026:

Looks like one is missing here.

        self.restart_node(1, ['-whitelist=127.0.0.1'])
        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))

maflcko commented at 8:39 AM on March 5, 2026:

Can you explain this suggestion? The line you suggest already exists right above in line 43

maflcko commented at 8:46 AM on March 5, 2026:

ok, I guess this is another bug in the test when running under --v2transport. I really don't like this fragile (and nested) assert_debug_log with timeout-polling.

maflcko commented at 8:55 AM on March 5, 2026:

thx, removed the pre-existing fragile and buggy timeout

maflcko commented at 9:57 AM on March 5, 2026:

Ok, this was a bit tedious to fix. In the latest attempt, I've:

Removed the timeout from the outer/lower exit stack to check "dropped (banned)\n" on node 1, because the inner/top exit stack waits longer
The inner/top exit stack checks "Cleared nodestate for peer=2" for the v1 disconnect
It also checks for "Cleared nodestate for peer=3" for the v2 disconnect (if applicable)

I don't think a self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1])) can be used here, because it requires a connection to have existed before, so that it can wait for it to be dropped. However, we can't know the 3rd connection existed, unless we wait for it in the debug log, in which case we can just wait for the 3rd disconnect in the debug log as well.

So I've added a plain:

assert not self.nodes[0].is_connected_to(self.nodes[1]), which does not poll

I hope this is correct now :melting_face:

maflcko commented at 3:25 PM on March 27, 2026:

(is now in a separate commit)

DrahtBot requested review from w0xlt on Mar 4, 2026

DrahtBot added the label Needs rebase on Mar 4, 2026

maflcko force-pushed on Mar 5, 2026

DrahtBot added the label CI failed on Mar 5, 2026

DrahtBot removed the label Needs rebase on Mar 5, 2026

DrahtBot removed the label CI failed on Mar 5, 2026

in test/functional/rpc_setban.py:50 in faa25f83e8 outdated

  51 | +            ],
  52 | +            timeout=8))
  53 |          with context:
  54 | -            self.restart_node(1, [])
  55 |              self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
  56 | +        assert not self.nodes[0].is_connected_to(self.nodes[1])

rkrux commented at 11:21 AM on March 5, 2026:

Can these changes be put in a separate commit?

maflcko commented at 3:24 PM on March 27, 2026:

Yes, done, thx

rkrux commented at 11:23 AM on March 5, 2026: contributor

Onboard with the idea but feel that these calls are easy to miss while adding or updating functional tests. disconnect_nodes(self, a, b) does some checking like this internally and given that most of these occurrences are "non disconnect_nodes" cases where one node is restarted, can we consider adding this check at the start of connect_nodes(self, a, b, ...) itself so that we take the onus away from the developer?

An example that I have tested:

diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
index 1f95756445..291b5ec197 100755
--- a/test/functional/test_framework/test_framework.py
+++ b/test/functional/test_framework/test_framework.py
@@ -569,6 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         """
         from_connection = self.nodes[a]
         to_connection = self.nodes[b]
+        from_connection.wait_until(lambda: not from_connection.is_connected_to(to_connection))
         ip_port = "127.0.0.1:" + str(p2p_port(b))
 
         if peer_advertises_v2 is None:
diff --git a/test/functional/wallet_hd.py b/test/functional/wallet_hd.py
index ca8393044c..62ad88d2d7 100755
--- a/test/functional/wallet_hd.py
+++ b/test/functional/wallet_hd.py
@@ -86,7 +86,6 @@ class WalletHDTest(BitcoinTestFramework):
             assert_equal(hd_info_2["hdkeypath"], "m/84h/1h/0h/0/" + str(i))
             assert_equal(hd_info_2["hdmasterfingerprint"], hd_fingerprint)
         assert_equal(hd_add, hd_add_2)
-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
         self.connect_nodes(0, 1)
         self.sync_all()
 
@@ -103,7 +102,6 @@ class WalletHDTest(BitcoinTestFramework):
             self.nodes[1].wallets_path / self.default_wallet_name / self.wallet_data_filename
         )
         self.start_node(1, extra_args=self.extra_args[1])
-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
         self.connect_nodes(0, 1)
         self.sync_all()
         # Wallet automatically scans blocks older than key on startup
(END)

sedited commented at 4:37 PM on March 18, 2026: contributor

The changes here lgtm, but I'm also curious if this could be programmed a bit more defensively, as suggested by rkrux here.

DrahtBot added the label Needs rebase on Mar 19, 2026

in test/functional/test_framework/test_node.py:900 in faa25f83e8 outdated

 896 | @@ -920,6 +897,11 @@ def disconnect_p2ps(self):
 897 |  
 898 |          self.wait_until(lambda: self.num_test_p2p_connections() == 0)
 899 |  
 900 | +    def is_connected_to(self, other):

w0xlt commented at 10:12 PM on March 26, 2026:

    def is_connected_to(self, other):
     """Check if this node has an active connection to other."""

maflcko commented at 3:24 PM on March 27, 2026:

Thx, but I am confident that a function called is_connected_to(self, other) means that it returns whether it is connected to the other thing, and doesn't need a docstring

test: Add is_connected_to helper

Needed in the next commit.

Co-Authored-By: David Gumberg <davidzgumberg@gmail.com>

faa404e119

test: Stricter checks in rpc_setban.py

Make the checks stricter and easier to follow:
* Fix a typo.
* After the first ban from node 1 wait until node 0 "sees" the ban.
* Move the restart_node out of the debug log context, to avoid bloat.
* Removed the timeout from the outer/lower exit stack to check "dropped
  (banned)\n" on node 1, because the inner/top exit stack waits longer.
* The inner/top exit stack checks for the both disconnections peer=2 and
  possibly peer=3 (for v2->v1 retry).
* And finally, add a redundant assert to confirm once more that node 0
  is has "seen" the ban.

fa21edddb2

test: Fix all races after a socket is closed gracefully

This waits for any disconnect (e.g. from a restart of one of the nodes)
to fully happen before the next connect.

Can be reviewed with the git option:

--color-moved=dimmed-zebra

fab2772647

test: Remove unused, confusing and brittle connect_nodes.wait_for_connect

The option is unused since the last removals in:
* 4c40a923f003420193aa574745f70788bcf35265, and
* 81bf3ebff7e7108bbfbf6fe4e122f4e52f278701

It was brittle and lead to intermittent test issues. Generally, it is
also confusing, because if a test wanted to connect nodes without
checking their connection, it can use `addnode`, like the rpc_setban.py
test.

So fix all issues by removing it.

fae807ed25

in test/functional/p2p_permissions.py:93 in faa25f83e8

w0xlt commented at 10:27 PM on March 26, 2026:

The reconnect race this PR is trying to fix is also present in test/functional/p2p_permissions.py:{92,93}.

         for flag, permissions in [(["-whitelist=noban,out@127.0.0.1"], ["noban", "download"]), (["-whitelist=noban@127.0.0.1"], [])]:
             self.restart_node(0, flag)
+            self.wait_until(lambda: not self.nodes[1].is_connected_to(self.nodes[0]))
             self.connect_nodes(0, 1)

maflcko commented at 3:23 PM on March 27, 2026:

I guess fixed now?

maflcko force-pushed on Mar 27, 2026

maflcko commented at 3:22 PM on March 27, 2026: member

so that we take the onus away from the developer?

Yeah, that sounds great. If the heavy lifting isn't done by the test framework, then it will just be annoying and tedious for developers to figure it out by themselves.

Pushed your suggestion, but I had to rework it, because the tests allow two-way connections.

DrahtBot removed the label Needs rebase on Mar 27, 2026

maflcko requested review from w0xlt on Apr 3, 2026

maflcko requested review from rkrux on Apr 14, 2026

in test/functional/rpc_setban.py:47 in fa21edddb2

  47 | +                "Cleared nodestate for peer=2",
  48 | +                "Cleared nodestate for peer=3",
  49 | +            ] if self.options.v2transport else [
  50 | +                "Cleared nodestate for peer=2",  # Just one v1 disconnect to wait for
  51 | +            ],
  52 | +            timeout=8))

rkrux commented at 12:32 PM on April 14, 2026:

In fa21edddb2722f7188130e56b06985ee2ec3ec1c: curious what made you select 8?

maflcko commented at 12:56 PM on April 14, 2026:

I'd say any strictly positive value is fine here

rkrux approved

rkrux commented at 12:40 PM on April 14, 2026: contributor

ACK fae807ed25561bab7148c5a0d7bd847314c79d88

Thanks for incorporating the suggestion.

Edit: The last sentence in the description can be updated.

maflcko commented at 12:56 PM on April 14, 2026: member

Edit: The last sentence in the description can be updated.

Thx, done

w0xlt commented at 2:39 PM on April 15, 2026: contributor

ACK fae807ed25561bab7148c5a0d7bd847314c79d88

maflcko removed review request from davidgumberg on Apr 16, 2026

maflcko requested review from davidgumberg on Apr 16, 2026

in test/functional/test_framework/test_framework.py:571 in fab2772647

 566 | +
 567 | +        def find_conn(node, peer_subversion, inbound):
 568 | +            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)
 569 | +
 570 | +        self.wait_until(lambda: not find_conn(from_connection, to_connection_subver, inbound=False))
 571 | +        self.wait_until(lambda: not find_conn(to_connection, from_connection_subver, inbound=True))

davidgumberg commented at 7:05 PM on April 17, 2026:

In commit https://github.com/bitcoin/bitcoin/pull/34425/changes/fab27726478d686dcd66d26d3b7b34aa323b8066 (test: Fix all races after a socket is closed gracefully)

Very nice! I think this is also a win for other tests since now the test framework enforces that if you call connect_nodes, the nodes were not already connected.

DrahtBot requested review from davidgumberg on Apr 17, 2026

davidgumberg approved

davidgumberg commented at 7:08 PM on April 17, 2026: contributor

(re-ish) crACK fae807ed25561bab7148c5a0d7bd847314c79d88

I re-reviewed the changes here entirely since I recall little since my last review, I think this is a good change that will make tests less brittle, and I like the changes suggested by @rkrux that have been implemented.

sedited approved

sedited commented at 9:05 AM on April 19, 2026: contributor

ACK fae807ed25561bab7148c5a0d7bd847314c79d88

sedited merged this on Apr 19, 2026

sedited closed this on Apr 19, 2026

maflcko deleted the branch on Apr 19, 2026

fanquake commented at 10:45 AM on April 21, 2026: member

Backported to 31.x in #35046.

fanquake referenced this in commit 332d2a2ef2 on Apr 21, 2026

fanquake referenced this in commit 402f347fd5 on Apr 21, 2026

fanquake referenced this in commit bb536df042 on Apr 21, 2026

fanquake referenced this in commit ab19213660 on Apr 21, 2026