test: improve robustness of connect_nodes() #30118

pull furszy wants to merge 1 commits into bitcoin:master from furszy:2024_test_improve_connect_nodes changing 2 files +20 −9
  1. furszy commented at 8:58 pm on May 15, 2024: member

    Decoupled from #27837 because this can help other too, found it investigating a CI failure https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200.

    The connect_nodes function in the test framework relies on a stable number of peer connections to verify that the new connection between the nodes is successfully established. This approach is fragile, as any of the peers involved in the process can drop, lose, or create a connection at any step, causing subsequent wait_until checks to stall indefinitely even when the peers in question were connected successfully.

    This commit improves the situation by using the nodes’ subversion and the connection direction (inbound/outbound) to identify the exact peer connection and perform the checks exclusively on it.

  2. DrahtBot commented at 8:58 pm on May 15, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK maflcko, AngusP, stratospher, achow101
    Stale ACK rkrux

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #27837 (net: introduce block tracker to retry to download blocks after failure by furszy)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Tests on May 15, 2024
  4. in test/functional/test_framework/test_framework.py:630 in 59ba873510 outdated
    624@@ -627,19 +625,28 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
    625         if not wait_for_connect:
    626             return
    627 
    628+        # Use subversion as peer id
    629+        from_connection_subver = from_connection.getnetworkinfo()['subversion']
    630+        to_connection_subver = to_connection.getnetworkinfo()['subversion']
    


    rkrux commented at 2:31 pm on May 16, 2024:

    Looking at the sample output of subversion - "subversion": "\/Satoshi:25.1.0\/", it doesn’t seem unique enough because multiple nodes can be running the same version. Won’t this cause issues in find_conn later?

    https://chainquery.com/bitcoin-cli/getnetworkinfo


    furszy commented at 4:02 pm on May 16, 2024:

    Looking at the sample output of subversion - "subversion": "\/Satoshi:25.1.0\/", it doesn’t seem unique enough because multiple nodes can be running the same version. Won’t this cause issues in find_conn later?

    The test framework appends the node number to the user agent string. See test_node.


    AngusP commented at 8:26 pm on May 18, 2024:

    Worth adding a note to the comment?

    0        # Use subversion as peer id. Test nodes have their node number appended to the UA
    

    furszy commented at 10:07 pm on May 18, 2024:

    Worth adding a note to the comment?

    0        # Use subversion as peer id. Test nodes have their node number appended to the UA
    

    sure. Pushed.

  5. in test/functional/test_framework/test_framework.py:649 in 59ba873510 outdated
    650         # The message bytes are counted before processing the message, so make
    651         # sure it was fully processed by waiting for a ping.
    652-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in from_connection.getpeerinfo()) == from_num_peers)
    653-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in to_connection.getpeerinfo()) == to_num_peers)
    654+        self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None and peer["bytesrecv_per_msg"].pop("pong", 0) >= 29)
    655+        self.wait_until(lambda: (peer := find_conn(to_connection, from_connection_subver, inbound=True)) is not None and peer["bytesrecv_per_msg"].pop("pong", 0) >= 29)
    


    rkrux commented at 2:35 pm on May 16, 2024:

    find_conn(from_connection, to_connection_subver, inbound=False)

    Might as well make these calls once and store in variable instead of finding 3 times? Unless I’m missing something that requires these calls to be made every time.


    furszy commented at 4:05 pm on May 16, 2024:

    Might as well make these calls once and store in variable instead of finding 3 times? Unless I’m missing something that requires these calls to be made every time.

    The code waits until the data arrives. These requires polling for updates. We could couple the checks in this way:

    0 self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None
    1                                and peer['version'] != 0
    2                                and peer['bytesrecv_per_msg'].pop('verack', 0) >= 21)
    

    But I’m a ~0 here because it would make debugging harder.


    rkrux commented at 9:06 am on May 18, 2024:
    Hmm I see now what you mean.
  6. rkrux approved
  7. rkrux commented at 2:37 pm on May 16, 2024: none

    tACK 59ba873

    Make successful, so are all the functional tests.

    Overall I agree with this change because it makes the nodes connection verification dependent more on connect_nodes() arguments a, b instead of the earlier approach that relied more on the state of the whole response of getpeerinfo(), which seemed brittle as mentioned in the PR description.

    I can see connect_nodes() being used in numerous functional tests, thereby increasing robustness for all of them! @furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

  8. furszy commented at 4:19 pm on May 16, 2024: member

    @furszy Were you able to identify few connections that were dropped in logs of this CI run? https://cirrus-ci.com/task/5805115213348864?logs=ci#L3200

    Just one. One of the P2PInterface connections I create on the test side gets disconnected after advancing the node time prior to connecting the test nodes again. Need to expand the complete CI logs to find it. But the fragility is easy to reproduce. Just launch a thread that disconnects a node before calling connect_nodes().

  9. in test/functional/test_framework/test_framework.py:643 in 59ba873510 outdated
    642-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) >= 21 for peer in from_connection.getpeerinfo()) == from_num_peers)
    643-        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) >= 21 for peer in to_connection.getpeerinfo()) == to_num_peers)
    644+        self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
    645+        self.wait_until(lambda: find_conn(to_connection, from_connection_subver, inbound=True) is not None)
    646+
    647+        self.wait_until(lambda: (peer := find_conn(from_connection, to_connection_subver, inbound=False)) is not None and peer['bytesrecv_per_msg'].pop('verack', 0) >= 21)
    


    maflcko commented at 3:09 pm on May 17, 2024:
    Any reason to have a duplicate None check for peer at this point? Shouldn’t it be enough to have the assert in the second-to-previous line?

    furszy commented at 4:09 pm on May 17, 2024:

    Any reason to have a duplicate None check for peer at this point? Shouldn’t it be enough to have the assert in the second-to-previous line?

    Not sure if I fully grasped the question but if the peer gets disconnected for some reason in-between these two checks, this wait_until call would throw an exception as the code would be trying to use the brackets operator on a None value to access the “bytesrecv_per_msg” value.


    maflcko commented at 9:09 am on May 18, 2024:
    Sure, but then the test will fail regardless. Might as well fail early, if the peer disconnects again, no?

    furszy commented at 2:03 pm on May 18, 2024:
    ah ok, yeah. Now we are sync. Pushed. Thanks.
  10. furszy force-pushed on May 18, 2024
  11. in test/functional/test_framework/test_framework.py:636 in e9cb1168d1 outdated
    631+
    632+        def find_conn(node, peer_subversion, inbound):
    633+            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)
    634+
    635         # poll until version handshake complete to avoid race conditions
    636         # with transaction relaying
    


    AngusP commented at 8:33 pm on May 18, 2024:

    This comment now doesn’t really make sense to me?

    Maybe

    0        # wait until peers have completed a version handshake and so match the
    1        # expected connection subversion
    

    furszy commented at 10:06 pm on May 18, 2024:

    This comment now doesn’t really make sense to me?

    Why? the procedure behavior didn’t change.


    AngusP commented at 10:33 am on May 19, 2024:
    I think just the way it read with the old impl vs the new, “waiting until we find a peer connection matching some predicate” rather than “wait until the peer has a non-zero version” . But on second thoughts you’re right it’s not functionally different so disregard my suggestion

    furszy commented at 2:05 pm on May 19, 2024:
    Ok. The “version handshake” wording refers to the complete initial negotiation round. (1) outbound send version, (2) inbound send version, (3) inbound send verack, (4) outbound send verack.
  12. AngusP commented at 8:35 pm on May 18, 2024: contributor

    Approach ACK, some nits/questions

    Tested, functional tests all pass

  13. furszy force-pushed on May 18, 2024
  14. AngusP approved
  15. AngusP commented at 10:28 am on May 19, 2024: contributor

    tACK f4c588c98f163e9fa9083821ba36e1053f8c1496

    Tested, functional tests all pass

  16. DrahtBot requested review from rkrux on May 19, 2024
  17. in test/functional/test_framework/test_framework.py:633 in f4c588c98f outdated
    624@@ -627,19 +625,32 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
    625         if not wait_for_connect:
    626             return
    627 
    628+        # Use subversion as peer id. Test nodes have their node number appended to the user agent string
    629+        from_connection_subver = from_connection.getnetworkinfo()['subversion']
    630+        to_connection_subver = to_connection.getnetworkinfo()['subversion']
    631+
    632+        def find_conn(node, peer_subversion, inbound):
    633+            return next(filter(lambda peer: peer['subver'] == peer_subversion and peer['inbound'] == inbound, node.getpeerinfo()), None)
    


    stratospher commented at 1:26 pm on May 19, 2024:
    wondering if we need the inbound check because an inbound and an outbound connection to the same TestNode isn’t done. or did you keep it for assurance about the connection direction?

    furszy commented at 2:00 pm on May 19, 2024:

    wondering if we need the inbound check because an inbound and an outbound connection to the same TestNode isn’t done. or did you keep it for assurance about the connection direction?

    Bidirectional connections are allowed and quite common. E.g. in p2p_disconnect_ban.py or in rpc_net.py.

  18. stratospher commented at 1:35 pm on May 19, 2024: contributor

    tested ACK f4c588c9.

    very cool! i like how the logic depends only on the node we’re connecting to. reproduced the intermittent failure using this patch and verified that the new logic fixes it!

        self.nodes[0].add_p2p_connection(P2PInterface(), send_version=False, wait_for_verack=False)
        self.connect_nodes(0, 2)
    

    also didn’t observe that much of a time increase (only around 70 s) when running all the functional tests in my local config (without wallets).

  19. in test/functional/test_framework/test_framework.py:653 in f4c588c98f outdated
    654         # The message bytes are counted before processing the message, so make
    655         # sure it was fully processed by waiting for a ping.
    656-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in from_connection.getpeerinfo()) == from_num_peers)
    657-        self.wait_until(lambda: sum(peer["bytesrecv_per_msg"].pop("pong", 0) >= 29 for peer in to_connection.getpeerinfo()) == to_num_peers)
    658+        self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'pong', 29))
    659+        self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'pong', 29))
    


    maflcko commented at 6:28 am on May 21, 2024:

    style-wise, I wonder if it makes sense to call wait_until and find_conn 6 times, when all logic could be put into a single function, which will be passed as a lambda to wait_until once.

    I guess this would mean slightly cleaner code, but would make it harder to see which part of the single function returned False when a failure happens. Though, debug logs could be added to help with that, if needed.

    What do you think?


    furszy commented at 1:51 pm on May 21, 2024:

    I guess this would mean slightly cleaner code, but would make it harder to see which part of the single function returned False when a failure happens. Though, debug logs could be added to help with that, if needed.

    What do you think?

    I don’t think that the slightly cleaner code worth the extra debug logs noise we will get. These will be outputted even when peers connect successfully, so we would see some “connect_nodes(): veracity msg hasn’t arrived yet” or “.. pong msg hasn’t arrived yet” lines regardless of the outcome. And this will be worst when peers take a bit of time to connect if we run valgrind or run it on a slow device.

    Maybe we could expand wait_until so that the provided function returns a message instead of a bool. This way, wait_until can print the detailed message at the end if the timeout is reached.


    maflcko commented at 3:11 pm on May 21, 2024:

    extra debug logs noise we will get.

    The debug logs could be added when needed. There are already two log lines, so a third (when added) shouldn’t matter too much, I’d say.

    For example:

    0 node1 2024-05-01T17:15:16.401946Z (mocktime: 2024-05-01T16:46:25Z) [httpworker.0] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
    1 node1 2024-05-01T17:15:16.453092Z (mocktime: 2024-05-01T16:46:25Z) [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:47742 
    

    furszy commented at 2:05 pm on May 22, 2024:

    The debug logs could be added when needed. There are already two log lines, so a third (when added) shouldn’t matter too much, I’d say.

    For example:

    0 node1 2024-05-01T17:15:16.401946Z (mocktime: 2024-05-01T16:46:25Z) [httpworker.0] [rpc/request.cpp:187] [parse] [rpc] ThreadRPCServer method=getpeerinfo user=__cookie__ 
    1 node1 2024-05-01T17:15:16.453092Z (mocktime: 2024-05-01T16:46:25Z) [http] [httpserver.cpp:306] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:47742 
    

    Hmm ok, a “connect_nodes(), waiting for ” line could be useful then. If we wrap all checks for both peers inside a single function, then on every failure, the function will re-execute some already passed checks once more. And to overcome this, we should keep a state of the connection. Which means more code (which is not bad, just the diff will be bigger). Will leave it for a follow-up just so we can move forward here. Maybe someone wants to tackle it later.


    maflcko commented at 2:07 pm on May 22, 2024:
    Yeah, sounds good to close this thread for now.
  20. in test/functional/test_framework/test_framework.py:628 in f4c588c98f outdated
    624@@ -627,19 +625,32 @@ def connect_nodes(self, a, b, *, peer_advertises_v2=None, wait_for_connect: bool
    625         if not wait_for_connect:
    626             return
    627 
    628+        # Use subversion as peer id. Test nodes have their node number appended to the user agent string
    


    maflcko commented at 6:29 am on May 21, 2024:
    nit: Could add a comment to where the ID is set, to clarify that it is now required for connect_nodes to work?

    furszy commented at 1:59 pm on May 21, 2024:
    sure. Pushed.
  21. maflcko commented at 6:30 am on May 21, 2024: member

    This approach is fragile, as any of the peers involved in the process can drop, lose, or create a connection at any step, causing subsequent wait_until checks to stall indefinitely even when the peers in question were connected successfully.

    I’d say the tests should avoid racy code and deterministically execute the test code line-by-line, but enforcing this in connect_nodes is the wrong approach. So Concept ACK.

  22. test: improve robustness of connect_nodes()
    The 'connect_nodes' function in the test framework relies
    on a stable number of peer connections to verify the new
    connection between the nodes is successfully established.
    This approach is fragile, as any of the peers involved in
    the process can drop, lose, or create a connection at any
    step, causing subsequent 'wait_until' checks to stall
    indefinitely even when the peers in question are connected
    successfully.
    
    This commit improves the situation by using the nodes' subversion
    and the connection direction (inbound/outbound) to identify the
    exact peer connection and perform the checks exclusively on it.
    6629d1d0f8
  23. furszy force-pushed on May 21, 2024
  24. in test/functional/test_framework/test_node.py:109 in 6629d1d0f8
    105@@ -106,7 +106,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
    106             "-debugexclude=libevent",
    107             "-debugexclude=leveldb",
    108             "-debugexclude=rand",
    109-            "-uacomment=testnode%d" % i,
    110+            "-uacomment=testnode%d" % i,  # required for subversion uniqueness across peers
    


    maflcko commented at 3:13 pm on May 21, 2024:
    0            f"-uacomment=testnode{i}",  # required for subversion uniqueness across peers
    

    style-nit, if you retouch

  25. in test/functional/test_framework/test_framework.py:648 in 6629d1d0f8
    647+        def check_bytesrecv(peer, msg_type, min_bytes_recv):
    648+            assert peer is not None, "Error: peer disconnected"
    649+            return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv
    650+
    651+        self.wait_until(lambda: check_bytesrecv(find_conn(from_connection, to_connection_subver, inbound=False), 'verack', 21))
    652+        self.wait_until(lambda: check_bytesrecv(find_conn(to_connection, from_connection_subver, inbound=True), 'verack', 21))
    


    maflcko commented at 3:20 pm on May 21, 2024:
    nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

    furszy commented at 1:58 pm on May 22, 2024:

    nit: As you remove the version check, I wonder if this one can be removed as well for the same reason? The final pong test should be necessary and sufficient already.

    I removed the version check because veracks are direct responses to version messages at the protocol level. I’m not sure we should rely on the pong alone because that might change over time?


    maflcko commented at 2:06 pm on May 22, 2024:

    I’m not sure we should rely on the pong alone because that might change over time?

    If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

    See the next line comment:

    0        # The message bytes are counted before processing the message, so make
    1        # sure it was fully processed by waiting for a ping.
    

    furszy commented at 7:41 pm on May 22, 2024:

    I’m not sure we should rely on the pong alone because that might change over time?

    If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

    I don’t think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.

    See the next line comment:

    0        # The message bytes are counted before processing the message, so make
    1        # sure it was fully processed by waiting for a ping.
    

    What about including the flag indicating that the connection is ready (fSuccessfullyConnected) in the getpeerinfo response? It seems generally useful and would clean up all this code.


    maflcko commented at 7:58 pm on May 22, 2024:

    I’m not sure we should rely on the pong alone because that might change over time?

    If it changes (for example the ping is sent before the verack), the test will already start to fail intermittently and will need to be adjusted anyway.

    I don’t think thats applicable here. We cannot freely change the initial negotiation phase (the version-verack window) without a BIP for the p2p protocol change. At the protocol level, only features negotiation messages are allowed in this phase. Any other received message will be ignored and it is a reason for banning the other party.

    Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.


    furszy commented at 8:41 pm on May 22, 2024:

    Yes, that is what I am trying to say. The pong alone should be necessary and sufficient.

    Ah ok. “goto the first message”.. now we are sync. Nice bikeshedding from my end.

    So yeah, we could wait only for the pong or introduce a new field on the getpeerinfo output (fSuccessfullyConnected with another name like “handshake_completed”).


    maflcko commented at 8:26 am on May 23, 2024:
    I think you’ll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won’t know whether it is fine to merge this from your side or not.

    furszy commented at 1:12 pm on May 23, 2024:

    I think you’ll have to tell maintainers if you want to address this nit, or leave it for a follow-up, otherwise they won’t know whether it is fine to merge this from your side or not.

    ok, yes. I was thinking on the second approach, but let’s move forward. This will let me un-draft #27837. Happy to review any follow-up doing this. Thanks Marko for the ping and this discussion.


    maflcko commented at 11:03 am on June 9, 2024:
    Done in #30252, also mentioning fSuccessfullyConnected. Feel free to NACK/ACK/nit.
  26. maflcko approved
  27. maflcko commented at 3:21 pm on May 21, 2024: member
    utACK 6629d1d0f8285d1bf2d87341a856abe903f26c13
  28. DrahtBot requested review from AngusP on May 21, 2024
  29. DrahtBot requested review from stratospher on May 21, 2024
  30. in test/functional/test_framework/test_framework.py:645 in 6629d1d0f8
    644+        self.wait_until(lambda: find_conn(from_connection, to_connection_subver, inbound=False) is not None)
    645+        self.wait_until(lambda: find_conn(to_connection, from_connection_subver, inbound=True) is not None)
    646+
    647+        def check_bytesrecv(peer, msg_type, min_bytes_recv):
    648+            assert peer is not None, "Error: peer disconnected"
    649+            return peer['bytesrecv_per_msg'].pop(msg_type, 0) >= min_bytes_recv
    


    AngusP commented at 5:16 pm on May 21, 2024:
    Style nit: Mixture of ' and " strings here and in a few places, though it was already mixed before – " seems to be the style in this file
  31. AngusP approved
  32. AngusP commented at 5:59 pm on May 21, 2024: contributor
    re-ACK 6629d1d0f8285d1bf2d87341a856abe903f26c13
  33. stratospher commented at 8:01 am on May 22, 2024: contributor
    reACK 6629d1d.
  34. achow101 commented at 8:02 pm on May 22, 2024: member
    ACK 6629d1d0f8285d1bf2d87341a856abe903f26c13
  35. achow101 merged this on May 23, 2024
  36. achow101 closed this on May 23, 2024

  37. furszy deleted the branch on May 23, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-22 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me