test: improve robustness of connect

furszy commented at 8:58 pm on May 15, 2024: member

Decoupled from #27837 because this can help other too, found it investigating a CI failure https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200.

The connect_nodes function in the test framework relies on a stable number of peer connections to verify that the new connection between the nodes is successfully established. This approach is fragile, as any of the peers involved in the process can drop, lose, or create a connection at any step, causing subsequent wait_until checks to stall indefinitely even when the peers in question were connected successfully.

This commit improves the situation by using the nodes’ subversion and the connection direction (inbound/outbound) to identify the exact peer connection and perform the checks exclusively on it.

DrahtBot commented at 8:58 pm on May 15, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	maflcko, AngusP, stratospher, achow101
Stale ACK	rkrux

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#27837 (net: introduce block tracker to retry to download blocks after failure by furszy)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot added the label Tests on May 15, 2024

in test/functional/test_framework/test_framework.py:630 in 59ba873510 outdated

624@@ -627,19 +625,28 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
625         if not wait_for_connect:
626             return
627 
628+        # Use subversion as peer id
629+        from_connection_subver = from_connection.getnetworkinfo()['subversion']
630+        to_connection_subver = to_connection.getnetworkinfo()['subversion']

rkrux commented at 2:31 pm on May 16, 2024:

Looking at the sample output of subversion - "subversion": "\/Satoshi:25.1.0\/", it doesn’t seem unique enough because multiple nodes can be running the same version. Won’t this cause issues in find_conn later?

https://chainquery.com/bitcoin-cli/getnetworkinfo

furszy commented at 4:02 pm on May 16, 2024:

Looking at the sample output of subversion - "subversion": "\/Satoshi:25.1.0\/", it doesn’t seem unique enough because multiple nodes can be running the same version. Won’t this cause issues in find_conn later?

The test framework appends the node number to the user agent string. See test_node.

AngusP commented at 8:26 pm on May 18, 2024:

Worth adding a note to the comment?

0        # Use subversion as peer id. Test nodes have their node number appended to the UA

furszy commented at 10:07 pm on May 18, 2024:

Worth adding a note to the comment?

0        # Use subversion as peer id. Test nodes have their node number appended to the UA

sure. Pushed.

in test/functional/test_framework/test_framework.py:649 in 59ba873510 outdated

650         # The message bytes are counted before processing the message, so make
651         # sure it was fully processed by waiting for a ping.
652-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in from_connection.getpeerinfo()) == from_num_peers)
653-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in to_connection.getpeerinfo()) == to_num_peers)
654+        self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None and peer["bytesrecv_per_msg"].pop("pong", 0) >= 29)
655+        self.wait_until(lambda: (peer := find_conn(to_connection, from_connection_subver, inbound=True)) is not None and peer["bytesrecv_per_msg"].pop("pong", 0) >= 29)

rkrux commented at 2:35 pm on May 16, 2024:

find_conn(from_connection, to_connection_subver, inbound=False)

Might as well make these calls once and store in variable instead of finding 3 times? Unless I’m missing something that requires these calls to be made every time.

furszy commented at 4:05 pm on May 16, 2024:

Might as well make these calls once and store in variable instead of finding 3 times? Unless I’m missing something that requires these calls to be made every time.

The code waits until the data arrives. These requires polling for updates. We could couple the checks in this way:

0 self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None
1                                and peer['version'] != 0
2                                and peer['bytesrecv_per_msg'].pop('verack', 0) >= 21)

But I’m a ~0 here because it would make debugging harder.

rkrux commented at 9:06 am on May 18, 2024:

Hmm I see now what you mean.

rkrux approved

rkrux commented at 2:37 pm on May 16, 2024: contributor

tACK 59ba873

Make successful, so are all the functional tests.

Overall I agree with this change because it makes the nodes connection verification dependent more on connect_nodes() arguments a, b instead of the earlier approach that relied more on the state of the whole response of getpeerinfo(), which seemed brittle as mentioned in the PR description.

I can see connect_nodes() being used in numerous functional tests, thereby increasing robustness for all of them! @furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

furszy commented at 4:19 pm on May 16, 2024: member

@furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

Just one. One of the P2PInterface connections I create on the test side gets disconnected after advancing the node time prior to connecting the test nodes again. Need to expand the complete CI logs to find it. But the fragility is easy to reproduce. Just launch a thread that disconnects a node before calling connect_nodes().

in test/functional/test_framework/test_framework.py:643 in 59ba873510 outdated

642-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) >= 21 for peer in from_connection.getpeerinfo()) == from_num_peers)
643-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) >= 21 for peer in to_connection.getpeerinfo()) == to_num_peers)
644+        self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
645+        self.wait_until(lambda: find_conn(to_connection, from_connection_subver, inbound=True) is not None)
646+
647+        self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None and peer['bytesrecv_per_msg'].pop('verack', 0) >= 21)

maflcko commented at 3:09 pm on May 17, 2024:

Any reason to have a duplicate None check for peer at this point? Shouldn’t it be enough to have the assert in the second-to-previous line?

furszy commented at 4:09 pm on May 17, 2024:

Any reason to have a duplicate None check for peer at this point? Shouldn’t it be enough to have the assert in the second-to-previous line?

Not sure if I fully grasped the question but if the peer gets disconnected for some reason in-between these two checks, this wait_until call would throw an exception as the code would be trying to use the brackets operator on a None value to access the “bytesrecv_per_msg” value.

maflcko commented at 9:09 am on May 18, 2024:

Sure, but then the test will fail regardless. Might as well fail early, if the peer disconnects again, no?

furszy commented at 2:03 pm on May 18, 2024:

ah ok, yeah. Now we are sync. Pushed. Thanks.

furszy force-pushed on May 18, 2024

in test/functional/test_framework/test_framework.py:636 in e9cb1168d1 outdated

631+
632+        def find_conn(node, peer_subversion, inbound):
633+            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)
634+
635         # poll until version handshake complete to avoid race conditions
636         # with transaction relaying

AngusP commented at 8:33 pm on May 18, 2024:

This comment now doesn’t really make sense to me?

Maybe

0        # wait until peers have completed a version handshake and so match the
1        # expected connection subversion

furszy commented at 10:06 pm on May 18, 2024:

This comment now doesn’t really make sense to me?

Why? the procedure behavior didn’t change.

AngusP commented at 10:33 am on May 19, 2024:

I think just the way it read with the old impl vs the new, “waiting until we find a peer connection matching some predicate” rather than “wait until the peer has a non-zero version” . But on second thoughts you’re right it’s not functionally different so disregard my suggestion

furszy commented at 2:05 pm on May 19, 2024:

Ok. The “version handshake” wording refers to the complete initial negotiation round. (1) outbound send version, (2) inbound send version, (3) inbound send verack, (4) outbound send verack.

AngusP commented at 8:35 pm on May 18, 2024: contributor

Approach ACK, some nits/questions

Tested, functional tests all pass

furszy force-pushed on May 18, 2024

AngusP approved

AngusP commented at 10:28 am on May 19, 2024: contributor

tACK f4c588c98f163e9fa9083821ba36e1053f8c1496

Tested, functional tests all pass

DrahtBot requested review from rkrux on May 19, 2024

in test/functional/test_framework/test_framework.py:633 in f4c588c98f outdated

624@@ -627,19 +625,32 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
625         if not wait_for_connect:
626             return
627 
628+        # Use subversion as peer id. Test nodes have their node number appended to the user agent string
629+        from_connection_subver = from_connection.getnetworkinfo()['subversion']
630+        to_connection_subver = to_connection.getnetworkinfo()['subversion']
631+
632+        def find_conn(node, peer_subversion, inbound):
633+            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)

stratospher commented at 1:26 pm on May 19, 2024:

wondering if we need the inbound check because an inbound and an outbound connection to the same TestNode isn’t done. or did you keep it for assurance about the connection direction?

furszy commented at 2:00 pm on May 19, 2024:

wondering if we need the inbound check because an inbound and an outbound connection to the same TestNode isn’t done. or did you keep it for assurance about the connection direction?

Bidirectional connections are allowed and quite common. E.g. in p2p_disconnect_ban.py or in rpc_net.py.

stratospher commented at 1:35 pm on May 19, 2024: contributor

tested ACK f4c588c9.

very cool! i like how the logic depends only on the node we’re connecting to. reproduced the intermittent failure using this patch and verified that the new logic fixes it!

    self.nodes[0].add_p2p_connection(P2PInterface(), send_version=False, wait_for_verack=False)
    self.connect_nodes(0, 2)

also didn’t observe that much of a time increase (only around 70 s) when running all the functional tests in my local config (without wallets).

in test/functional/test_framework/test_framework.py:653 in f4c588c98f outdated

654         # The message bytes are counted before processing the message, so make
655         # sure it was fully processed by waiting for a ping.
656-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in from_connection.getpeerinfo()) == from_num_peers)
657-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in to_connection.getpeerinfo()) == to_num_peers)
658+        self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'pong', 29))
659+        self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'pong', 29))

maflcko commented at 6:28 am on May 21, 2024:

style-wise, I wonder if it makes sense to call wait_until and find_conn 6 times, when all logic could be put into a single function, which will be passed as a lambda to wait_until once.

I guess this would mean slightly cleaner code, but would make it harder to see which part of the single function returned False when a failure happens. Though, debug logs could be added to help with that, if needed.

What do you think?

furszy commented at 1:51 pm on May 21, 2024:

I guess this would mean slightly cleaner code, but would make it harder to see which part of the single function returned False when a failure happens. Though, debug logs could be added to help with that, if needed.

What do you think?

I don’t think that the slightly cleaner code worth the extra debug logs noise we will get. These will be outputted even when peers connect successfully, so we would see some “connect_nodes(): veracity msg hasn’t arrived yet” or “.. pong msg hasn’t arrived yet” lines regardless of the outcome. And this will be worst when peers take a bit of time to connect if we run valgrind or run it on a slow device.

Maybe we could expand wait_until so that the provided function returns a message instead of a bool. This way, wait_until can print the detailed message at the end if the timeout is reached.

maflcko commented at 3:11 pm on May 21, 2024:

extra debug logs noise we will get.

The debug logs could be added when needed. There are already two log lines, so a third (when added) shouldn’t matter too much, I’d say.

For example:

0 node1 2024-05-01T17:15:16.401946Z (mocktime: 2024-05-01T16:46:25Z) [httpworker.0] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
1 node1 2024-05-01T17:15:16.453092Z (mocktime: 2024-05-01T16:46:25Z) [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:47742

furszy commented at 2:05 pm on May 22, 2024:

The debug logs could be added when needed. There are already two log lines, so a third (when added) shouldn’t matter too much, I’d say.

For example:

0 node1 2024-05-01T17:15:16.401946Z (mocktime: 2024-05-01T16:46:25Z) [httpworker.0] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
1 node1 2024-05-01T17:15:16.453092Z (mocktime: 2024-05-01T16:46:25Z) [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:47742

Hmm ok, a “connect_nodes(), waiting for ” line could be useful then. If we wrap all checks for both peers inside a single function, then on every failure, the function will re-execute some already passed checks once more. And to overcome this, we should keep a state of the connection. Which means more code (which is not bad, just the diff will be bigger). Will leave it for a follow-up just so we can move forward here. Maybe someone wants to tackle it later.

maflcko commented at 2:07 pm on May 22, 2024:

Yeah, sounds good to close this thread for now.

in test/functional/test_framework/test_framework.py:628 in f4c588c98f outdated

624@@ -627,19 +625,32 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
625         if not wait_for_connect:
626             return
627 
628+        # Use subversion as peer id. Test nodes have their node number appended to the user agent string

maflcko commented at 6:29 am on May 21, 2024:

nit: Could add a comment to where the ID is set, to clarify that it is now required for connect_nodes to work?

furszy commented at 1:59 pm on May 21, 2024:

sure. Pushed.

maflcko commented at 6:30 am on May 21, 2024: member

This approach is fragile, as any of the peers involved in the process can drop, lose, or create a connection at any step, causing subsequent wait_until checks to stall indefinitely even when the peers in question were connected successfully.

I’d say the tests should avoid racy code and deterministically execute the test code line-by-line, but enforcing this in connect_nodes is the wrong approach. So Concept ACK.

test: improve robustness of connect_nodes()

The 'connect_nodes' function in the test framework relies
on a stable number of peer connections to verify the new
connection between the nodes is successfully established.
This approach is fragile, as any of the peers involved in
the process can drop, lose, or create a connection at any
step, causing subsequent 'wait_until' checks to stall
indefinitely even when the peers in question are connected
successfully.

This commit improves the situation by using the nodes' subversion
and the connection direction (inbound/outbound) to identify the
exact peer connection and perform the checks exclusively on it.

6629d1d0f8

furszy force-pushed on May 21, 2024

in test/functional/test_framework/test_node.py:109 in 6629d1d0f8

105@@ -106,7 +106,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
106             "-debugexclude=libevent",
107             "-debugexclude=leveldb",
108             "-debugexclude=rand",
109-            "-uacomment=testnode%d" % i,
110+            "-uacomment=testnode%d" % i,  # required for subversion uniqueness across peers

maflcko commented at 3:13 pm on May 21, 2024:

0            f"-uacomment=testnode{i}",  # required for subversion uniqueness across peers

style-nit, if you retouch

in test/functional/test_framework/test_framework.py:648 in 6629d1d0f8

647+        def check_bytesrecv(peer, msg_type, min_bytes_recv):
648+            assert peer is not None, "Error: peer disconnected"
649+            return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv
650+
651+        self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'verack', 21))
652+        self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'verack', 21))

maflcko commented at 3:20 pm on May 21, 2024:

nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

furszy commented at 1:58 pm on May 22, 2024:

nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

I removed the version check because veracks are direct responses to version messages at the protocol level. I’m not sure we should rely on the pong alone because that might change over time?

maflcko commented at 2:06 pm on May 22, 2024:

I’m not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

See the next line comment:

0        # The message bytes are counted before processing the message, so make
1        # sure it was fully processed by waiting for a ping.

furszy commented at 7:41 pm on May 22, 2024:

I’m not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

I don’t think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.

See the next line comment:

0        # The message bytes are counted before processing the message, so make
1        # sure it was fully processed by waiting for a ping.

What about including the flag indicating that the connection is ready (fSuccessfullyConnected) in the getpeerinfo response? It seems generally useful and would clean up all this code.

maflcko commented at 7:58 pm on May 22, 2024:

I’m not sure we should rely on the pong alone because that might change over time?

If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

I don’t think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.

Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.

furszy commented at 8:41 pm on May 22, 2024:

Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.

Ah ok. “goto the first message”.. now we are sync. Nice bikeshedding from my end.

So yeah, we could wait only for the pong or introduce a new field on the getpeerinfo output (fSuccessfullyConnected with another name like “handshake_completed”).

maflcko commented at 8:26 am on May 23, 2024:

I think you’ll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won’t know whether it is fine to merge this from your side or not.

furszy commented at 1:12 pm on May 23, 2024:

I think you’ll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won’t know whether it is fine to merge this from your side or not.

ok, yes. I was thinking on the second approach, but let’s move forward. This will let me un-draft #27837. Happy to review any follow-up doing this. Thanks Marko for the ping and this discussion.

maflcko commented at 11:03 am on June 9, 2024:

Done in #30252, also mentioning fSuccessfullyConnected. Feel free to NACK/ACK/nit.

maflcko approved

maflcko commented at 3:21 pm on May 21, 2024: member

utACK 6629d1d0f8285d1bf2d87341a856abe903f26c13

DrahtBot requested review from AngusP on May 21, 2024

DrahtBot requested review from stratospher on May 21, 2024

in test/functional/test_framework/test_framework.py:645 in 6629d1d0f8

644+        self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
645+        self.wait_until(lambda: find_conn(to_connection, from_connection_subver, inbound=True) is not None)
646+
647+        def check_bytesrecv(peer, msg_type, min_bytes_recv):
648+            assert peer is not None, "Error: peer disconnected"
649+            return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv

AngusP commented at 5:16 pm on May 21, 2024:

Style nit: Mixture of ' and " strings here and in a few places, though it was already mixed before – " seems to be the style in this file

AngusP approved

AngusP commented at 5:59 pm on May 21, 2024: contributor

re-ACK 6629d1d0f8285d1bf2d87341a856abe903f26c13

stratospher commented at 8:01 am on May 22, 2024: contributor

reACK 6629d1d.

achow101 commented at 8:02 pm on May 22, 2024: member

ACK 6629d1d0f8285d1bf2d87341a856abe903f26c13

achow101 merged this on May 23, 2024

achow101 closed this on May 23, 2024

furszy deleted the branch on May 23, 2024

UdjinM6 referenced this in commit 2c825b4b4a on Oct 3, 2024

kwvg referenced this in commit e458adb61c on Oct 3, 2024

PastaPastaPasta referenced this in commit 8cef87d81d on Oct 4, 2024

panleone referenced this in commit e7c21c7e58 on Nov 11, 2024

Fuzzbawls referenced this in commit bf9617c2e2 on Nov 15, 2024

Fabcien referenced this in commit b8637fc664 on Feb 11, 2025

roqqit referenced this in commit adea33234e on May 2, 2025

bitcoin locked this on Jun 9, 2025

test: improve robustness of connect_nodes() #30118

Code Coverage

Reviews

Conflicts