test: Fix all races after a socket is closed gracefully

maflcko commented at 9:35 pm on January 27, 2026: member

Currently, functional tests may intermittently fail, due to a lack of synchronization after a node connection socket was closed gracefully: If node A is connected to node B, and node B closes the connection, node A must wait for the connection to be closed before continuing the test. Otherwise, subsequent re-connections may not work while a stale connection is still alive.

This can be reproduced locally via something like:

0diff --git a/src/net.cpp b/src/net.cpp
1index 6b79e913e8..32bd061500 100644
2--- a/src/net.cpp
3+++ b/src/net.cpp
4@@ -2200,2 +2200,3 @@ void CConnman::SocketHandlerConnected(const std::vector<CNode*>& nodes,
5                 }
6+            UninterruptibleSleep(599ms);
7                 pnode->CloseSocketDisconnect();

With this diff, the tests should fail on master, and pass after this fix.

The fix involves simply calling self.wait_until(lambda: len(node_a.getpeerinfo()) == 0) after each disconnect, and before each re-connection.

DrahtBot added the label Tests on Jan 27, 2026

DrahtBot commented at 9:35 pm on January 27, 2026: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34425.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Stale ACK	w0xlt, davidgumberg

If your review is incorrectly listed, please copy-paste <!–meta-tag:bot-skip–> into the comment that the bot should ignore.

Conflicts

No conflicts as of last run.

in test/functional/rpc_setban.py:29 in fa652a5d63

27@@ -28,17 +28,19 @@ def run_test(self):
28 
29         # Node 0 get banned by Node 1

w0xlt commented at 9:52 pm on January 27, 2026:

nit not related to PR changes

0        # Node 0 gets banned by Node 1

w0xlt commented at 10:00 pm on January 27, 2026: contributor

ACK fa652a5d63d9e1855e73fa72d86c0b54f67aa0c7

maflcko renamed this:
~~test: Add missing sync on node0 in rpc_setban.py~~
test: Fix all races after a socket is closed gracefully
on Jan 28, 2026

maflcko force-pushed on Jan 28, 2026

DrahtBot added the label CI failed on Jan 28, 2026

DrahtBot removed the label CI failed on Jan 28, 2026

maflcko commented at 11:17 am on January 28, 2026: member

Pushed more commits to fix all such races in all tests

davidgumberg commented at 6:47 pm on January 29, 2026: contributor

Maybe some of these commits can be squashed?

maflcko force-pushed on Jan 29, 2026

maflcko commented at 6:51 pm on January 29, 2026: member

Sure, squashed all down into one commit. It touches a few files, but 16 lines modified should be trivial enough for a single commit.

maflcko force-pushed on Jan 29, 2026

DrahtBot added the label CI failed on Jan 29, 2026

in test/functional/p2p_disconnect_ban.py:107 in fa759a637c outdated

103@@ -104,6 +104,7 @@ def run_test(self):
104         # Keep mocktime, to avoid ban expiry when restart takes longer than
105         # time_remaining
106         self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
107+        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)

davidgumberg commented at 7:13 pm on January 29, 2026:

I don’t think wait_until is necessary here, since node 1 has banned node0 and is not connected, e.g.:

0        assert(len(self.nodes[0].getpeerinfo()) == 0)
1        self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])
2        self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)

passes

maflcko commented at 7:24 pm on January 29, 2026:

The were connected, see line 25:

0        self.connect_nodes(0, 1)

A sync is required after the disconnect and before the next connection.

Your test diff is expected to pass in the same way that the test is passing on current master.

Did you apply the cpp diff as well to trigger the error?

davidgumberg commented at 8:22 pm on January 29, 2026:

Ah I see, and confirmed that the suggested cpp diff causes an error, feel free to mark to resolved

DrahtBot removed the label CI failed on Jan 29, 2026

davidgumberg commented at 9:19 pm on January 29, 2026: contributor

Concept ACK, but I think len(getpeerinfo()) is kind of a brittle way / hacky way to enforce that some node is or isn’t connected to another node.

What do you think of adding a TestNode method that checks if a node is connected to another node by looking for the subver in getpeerinfo(), ie:

0class TestNode():
1    def is_connected_to(self, other):
2        assert isinstance(other, TestNode)
3        other_subver = other.getnetworkinfo()['subversion']
4
5        return any(peer["subver"] == other_subver for peer in self.getpeerinfo())

And then doing:

0- self.wait_until(lambda: len(self.nodes[0].getpeerinfo()) == 0)
1+ self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))

alternatively, could do it using the with style

0class TestNode(): [@contextlib](/bitcoin-bitcoin/contributor/contextlib/).contextmanager
1    def wait_for_disconnected_from(self, other, timeout=5):
2        """
3        Wait until this node is disconnected from other.
4        """
5        yield
6        self.wait_until(lambda: not self.is_connected_to(other), timeout=timeout)

and then doing:

0with self.nodes[0].wait_for_disconnected_from(self.nodes[1]):
1    self.restart_node(1, extra_args=[f"-mocktime={old_time+4}"])

test: Add is_connected_to helper

Needed in the next commit.

Co-Authored-By: David Gumberg <davidzgumberg@gmail.com>

fafa766c95

maflcko force-pushed on Jan 30, 2026

DrahtBot added the label CI failed on Jan 30, 2026

maflcko commented at 10:29 am on January 30, 2026: member

Ah nice. I pushed the is_connected_to helper, because it makes the code more self-documenting and easier to understand.

Not sure about the contextmanager, because connections/disconnections are often not the primary target of the test and using contexts for all of them could be confusing, so I’ll leave this as-is for now.

DrahtBot removed the label CI failed on Jan 30, 2026

maflcko requested review from w0xlt on Mar 4, 2026

maflcko requested review from davidgumberg on Mar 4, 2026

davidgumberg commented at 5:08 pm on March 4, 2026: contributor

crACK fad3df9af6d787f66d3150882ed1

Nice, closes a source of intermittent test failures.

in test/functional/rpc_setban.py:46 in fad3df9af6 outdated

43-            self.restart_node(1, [])
44             self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
45+        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
46 
47         # However, node 0 should be able to reconnect if it has noban permission
48         self.restart_node(1, ['-whitelist=127.0.0.1'])

w0xlt commented at 6:35 pm on March 4, 2026:

Looks like one is missing here.

0        self.restart_node(1, ['-whitelist=127.0.0.1'])
1        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))

maflcko commented at 8:39 am on March 5, 2026:

Can you explain this suggestion? The line you suggest already exists right above in line 43

maflcko commented at 8:46 am on March 5, 2026:

ok, I guess this is another bug in the test when running under --v2transport. I really don’t like this fragile (and nested) assert_debug_log with timeout-polling.

maflcko commented at 8:55 am on March 5, 2026:

thx, removed the pre-existing fragile and buggy timeout

maflcko commented at 9:57 am on March 5, 2026:

Ok, this was a bit tedious to fix. In the latest attempt, I’ve:

Removed the timeout from the outer/lower exit stack to check “dropped (banned)\n” on node 1, because the inner/top exit stack waits longer
The inner/top exit stack checks “Cleared nodestate for peer=2” for the v1 disconnect
It also checks for “Cleared nodestate for peer=3” for the v2 disconnect (if applicable)

I don’t think a self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1])) can be used here, because it requires a connection to have existed before, so that it can wait for it to be dropped. However, we can’t know the 3rd connection existed, unless we wait for it in the debug log, in which case we can just wait for the 3rd disconnect in the debug log as well.

So I’ve added a plain:

assert not self.nodes[0].is_connected_to(self.nodes[1]), which does not poll

I hope this is correct now :melting_face:

DrahtBot requested review from w0xlt on Mar 4, 2026

DrahtBot added the label Needs rebase on Mar 4, 2026

maflcko force-pushed on Mar 5, 2026

maflcko commented at 8:56 am on March 5, 2026: member

Force pushed to remove hunk which is no longer needed after it was removed due to other issues in #34728

(did not rebase, though)

test: Fix all races after a socket is closed gracefully

This waits for the disconnect to fully happen before the next connect.

Also, in rpc_setban.py:
* Fix a typo.
* Move the `self.restart_node(1, [])` out of the context manager,
  because nothing from the restart needs to be checked.

faa25f83e8

maflcko force-pushed on Mar 5, 2026

DrahtBot added the label CI failed on Mar 5, 2026

DrahtBot removed the label Needs rebase on Mar 5, 2026

DrahtBot removed the label CI failed on Mar 5, 2026

in test/functional/rpc_setban.py:50 in faa25f83e8

51+            ],
52+            timeout=8))
53         with context:
54-            self.restart_node(1, [])
55             self.nodes[0].addnode("127.0.0.1:" + str(p2p_port(1)), "onetry")
56+        assert not self.nodes[0].is_connected_to(self.nodes[1])

rkrux commented at 11:21 am on March 5, 2026:

Can these changes be put in a separate commit?

rkrux commented at 11:23 am on March 5, 2026: contributor

Onboard with the idea but feel that these calls are easy to miss while adding or updating functional tests. disconnect_nodes(self, a, b) does some checking like this internally and given that most of these occurrences are “non disconnect_nodes” cases where one node is restarted, can we consider adding this check at the start of connect_nodes(self, a, b, ...) itself so that we take the onus away from the developer?

An example that I have tested:

 0diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
 1index 1f95756445..291b5ec197 100755
 2--- a/test/functional/test_framework/test_framework.py
 3+++ b/test/functional/test_framework/test_framework.py
 4@@ -569,6 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
 5         """
 6         from_connection = self.nodes[a]
 7         to_connection = self.nodes[b]
 8+        from_connection.wait_until(lambda: not from_connection.is_connected_to(to_connection))
 9         ip_port = "127.0.0.1:" + str(p2p_port(b))
10 
11         if peer_advertises_v2 is None:
12diff --git a/test/functional/wallet_hd.py b/test/functional/wallet_hd.py
13index ca8393044c..62ad88d2d7 100755
14--- a/test/functional/wallet_hd.py
15+++ b/test/functional/wallet_hd.py
16@@ -86,7 +86,6 @@ class WalletHDTest(BitcoinTestFramework):
17             assert_equal(hd_info_2["hdkeypath"], "m/84h/1h/0h/0/" + str(i))
18             assert_equal(hd_info_2["hdmasterfingerprint"], hd_fingerprint)
19         assert_equal(hd_add, hd_add_2)
20-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
21         self.connect_nodes(0, 1)
22         self.sync_all()
23 
24@@ -103,7 +102,6 @@ class WalletHDTest(BitcoinTestFramework):
25             self.nodes[1].wallets_path / self.default_wallet_name / self.wallet_data_filename
26         )
27         self.start_node(1, extra_args=self.extra_args[1])
28-        self.wait_until(lambda: not self.nodes[0].is_connected_to(self.nodes[1]))
29         self.connect_nodes(0, 1)
30         self.sync_all()
31         # Wallet automatically scans blocks older than key on startup
32(END)

test: Fix all races after a socket is closed gracefully #34425

Code Coverage & Benchmarks

Reviews

Conflicts