test: Avoid race in disconnect_nodes helper #26138

pull MarcoFalke wants to merge 1 commits into bitcoin:master from MarcoFalke:2209-test-race-🍄 changing 1 files +9 −8
  1. MarcoFalke commented at 1:49 PM on September 20, 2022: member

    Also wait for the other node to notice the closed socket. Otherwise, the other node is not able to use the connect helper.

    Fixes https://github.com/bitcoin/bitcoin/issues/26014

  2. test: Avoid race in disconnect_nodes helper faeea28753
  3. MarcoFalke added this to the milestone 24.0 on Sep 20, 2022
  4. fanquake added the label Tests on Sep 20, 2022
  5. stickies-v commented at 6:49 PM on September 20, 2022: contributor

    Makes sense to verify the disconnect has been picked up by both nodes, so this seems like a good improvement to me.

    In connect_nodes(), relying on the number of connected/veracked peers instead of the actual peer seems like a very fragile approach though. I would expect it to break again in the future. I'm not sure we don't have better ways to check 2 nodes being connected?

    For example, I think building on faeea2875 this (MVP - open to improvements) would work:

    <details> <summary>git diff</summary>

    diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
    index b1164b98f..f79b931b6 100755
    --- a/test/functional/test_framework/test_framework.py
    +++ b/test/functional/test_framework/test_framework.py
    @@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         def connect_nodes(self, a, b):
             from_connection = self.nodes[a]
             to_connection = self.nodes[b]
    -        from_num_peers = 1 + len(from_connection.getpeerinfo())
    -        to_num_peers = 1 + len(to_connection.getpeerinfo())
             ip_port = "127.0.0.1:" + str(p2p_port(b))
             from_connection.addnode(ip_port, "onetry")
             # poll until version handshake complete to avoid race conditions
    @@ -590,30 +588,27 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
             # See comments in net_processing:
             # * Must have a version message before anything else
             # * Must have a verack message before anything else
    -        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
    -        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
    -        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
    -        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
    +        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
    +            connected = from_node.get_connected_peers_info(to_node.index)
    +            veracked = [peer for peer in connected if peer['bytesrecv_per_msg'].pop('verack', 0) == 24]
    +            return len(veracked) > 0
    +
    +        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
    +        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
    +
     
         def disconnect_nodes(self, a, b):
             def disconnect_nodes_helper(node_a, node_b):
    -            def get_peer_ids(from_connection, node_num):
    -                result = []
    -                for peer in from_connection.getpeerinfo():
    -                    if "testnode{}".format(node_num) in peer['subver']:
    -                        result.append(peer['id'])
    -                return result
    -
    -            peer_ids = get_peer_ids(node_a, node_b.index)
    -            if not peer_ids:
    +            peers = node_a.get_connected_peers_info(node_b.index)
    +            if not peers:
                     self.log.warning("disconnect_nodes: {} and {} were not connected".format(
                         node_a.index,
                         node_b.index,
                     ))
                     return
    -            for peer_id in peer_ids:
    +            for peer in peers:
                     try:
    -                    node_a.disconnectnode(nodeid=peer_id)
    +                    node_a.disconnectnode(nodeid=peer["id"])
                     except JSONRPCException as e:
                         # If this node is disconnected between calculating the peer id
                         # and issuing the disconnect, don't worry about it.
    @@ -622,8 +617,8 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
                             raise
     
                 # wait to disconnect
    -            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
    -            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
    +            self.wait_until(lambda: not node_a.get_connected_peers_info(node_b.index), timeout=5)
    +            self.wait_until(lambda: not node_b.get_connected_peers_info(node_a.index), timeout=5)
     
             disconnect_nodes_helper(self.nodes[a], self.nodes[b])
     
    diff --git a/test/functional/test_framework/test_node.py b/test/functional/test_framework/test_node.py
    index e35cae006..73ad492a5 100755
    --- a/test/functional/test_framework/test_node.py
    +++ b/test/functional/test_framework/test_node.py
    @@ -21,6 +21,7 @@ import collections
     import shlex
     import sys
     from pathlib import Path
    +from typing import Any, Dict, List
     
     from .authproxy import JSONRPCException
     from .descriptors import descsum_create
    @@ -645,6 +646,13 @@ class TestNode():
     
             return p2p_conn
     
    +    def get_connected_peers_info(self, other_node_index: int) -> List[Dict[str, Any]]:
    +        result = []
    +        for peer in self.getpeerinfo():
    +            if "testnode{}".format(other_node_index) in peer['subver']:
    +                result.append(peer)
    +        return result
    +
         def num_test_p2p_connections(self):
             """Return number of test framework p2p connections to the node."""
             return len([peer for peer in self.getpeerinfo() if peer['subver'] == P2P_SUBVERSION])
    

    </details>

  6. MarcoFalke commented at 8:08 AM on September 21, 2022: member

    get_connected_peers_info would only work with Bitcoin Core peers, not mininode peers, so I'd prefer not to make it public

  7. MarcoFalke commented at 8:10 AM on September 21, 2022: member

    I would expect it to break again in the future.

    The only way I can see it would "break" is when you attempt to connect to the same peer twice. In which case the failure is probably wanted. I am not sure if your approach silently dismisses that failure.

  8. stickies-v commented at 10:41 AM on September 22, 2022: contributor

    get_connected_peers_info would only work with Bitcoin Core peers, not mininode peers, so I'd prefer not to make it public

    Good point, I did not consider that. I've updated the diff (below) to keep it private. Also reduces the diff quite a bit. (Note: get_peer_ids() would benefit from the same list comprehension imo, but not a required change for this to work)

    The only way I can see it would "break" is when you attempt to connect to the same peer twice. In which case the failure is probably wanted. I am not sure if your approach silently dismisses that failure.

    Updated to only return True if we have exactly 1 connection to b, which I think addresses your concern. I don't have any specific ideas of how it would break otherwise, but since connect_nodes() doesn't own the nodes, I think it's not unreasonable to assume that the node may update its peers in the background? If without additional complexity we can check that a is now connected to b and not just to any new node, that seems like a strict improvement to me?

    <details> <summary>git diff</summary>

    diff --git a/test/functional/test_framework/test_framework.py b/test/functional/test_framework/test_framework.py
    index b1164b98f..fa3ddebe3 100755
    --- a/test/functional/test_framework/test_framework.py
    +++ b/test/functional/test_framework/test_framework.py
    @@ -581,8 +581,6 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
         def connect_nodes(self, a, b):
             from_connection = self.nodes[a]
             to_connection = self.nodes[b]
    -        from_num_peers = 1 + len(from_connection.getpeerinfo())
    -        to_num_peers = 1 + len(to_connection.getpeerinfo())
             ip_port = "127.0.0.1:" + str(p2p_port(b))
             from_connection.addnode(ip_port, "onetry")
             # poll until version handshake complete to avoid race conditions
    @@ -590,10 +588,14 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
             # See comments in net_processing:
             # * Must have a version message before anything else
             # * Must have a verack message before anything else
    -        self.wait_until(lambda: sum(peer['version'] != 0 for peer in from_connection.getpeerinfo()) == from_num_peers)
    -        self.wait_until(lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers)
    -        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in from_connection.getpeerinfo()) == from_num_peers)
    -        self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
    +        def is_connection_established(from_node: TestNode, to_node: TestNode) -> bool:
    +            connected = [peer for peer in from_connection.getpeerinfo() if f"testnode{b}" in peer["subver"]]
    +            veracked = [peer for peer in connected if peer["bytesrecv_per_msg"].pop('verack', 0) == 24]
    +            return len(veracked) == 1
    +
    +        self.wait_until(lambda: is_connection_established(from_connection, to_connection))
    +        self.wait_until(lambda: is_connection_established(to_connection, from_connection))
    +
     
         def disconnect_nodes(self, a, b):
             def disconnect_nodes_helper(node_a, node_b):
    
    

    </details

  9. in test/functional/test_framework/test_framework.py:611 in faeea28753
     611 |              if not peer_ids:
     612 |                  self.log.warning("disconnect_nodes: {} and {} were not connected".format(
     613 | -                    from_connection.index,
     614 | -                    node_num,
     615 | +                    node_a.index,
     616 | +                    node_b.index,
    


    stickies-v commented at 10:51 AM on September 22, 2022:

    nit: If you're touching, would use f-string instead:

                    self.log.warning(f"disconnect_nodes: {node_a.index} and {node_b.index} were not connected")
    
  10. in test/functional/test_framework/test_framework.py:600 in faeea28753
     595 | @@ -596,24 +596,24 @@ def connect_nodes(self, a, b):
     596 |          self.wait_until(lambda: sum(peer['bytesrecv_per_msg'].pop('verack', 0) == 24 for peer in to_connection.getpeerinfo()) == to_num_peers)
     597 |  
     598 |      def disconnect_nodes(self, a, b):
     599 | -        def disconnect_nodes_helper(from_connection, node_num):
     600 | -            def get_peer_ids():
     601 | +        def disconnect_nodes_helper(node_a, node_b):
     602 | +            def get_peer_ids(from_connection, node_num):
    


    stickies-v commented at 10:52 AM on September 22, 2022:

    nit: Since you're touching, I think adding type hints here would be helpful. Also, for consistency would prefer calling it node instead of connection and index instead of num?

        def disconnect_nodes(self, a: int, b: int) -> None:
            def disconnect_nodes_helper(node_a: TestNode, node_b: TestNode) -> None:
                def get_peer_ids(from_node: TestNode, to_node_index: int) -> List[int]:
    

    MarcoFalke commented at 7:03 PM on September 28, 2022:

    I kept the name as-is to make the diff minimal. If there are additional changes, they can be made in a follow-up.

    If there is a follow-up, it would probably be fine to also make to_node_index of type TestNode.


    fanquake commented at 9:24 AM on September 29, 2022:

    @stickies-v maybe you can follow up here?

  11. stickies-v approved
  12. stickies-v commented at 10:58 AM on September 22, 2022: contributor

    ACK faeea2875

    I would prefer making connect_nodes() more robust too as per my suggestion, but I think this is a solid improvement on its own and my comment could be implemented in a separate PR if that's preferable or the proposal is contentious.

    A few nits, no blockers.

  13. DrahtBot commented at 3:19 AM on September 23, 2022: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    No conflicts as of last run.

  14. in test/functional/test_framework/test_framework.py:626 in faeea28753
     621 | @@ -622,9 +622,10 @@ def get_peer_ids():
     622 |                          raise
     623 |  
     624 |              # wait to disconnect
     625 | -            self.wait_until(lambda: not get_peer_ids(), timeout=5)
     626 | +            self.wait_until(lambda: not get_peer_ids(node_a, node_b.index), timeout=5)
     627 | +            self.wait_until(lambda: not get_peer_ids(node_b, node_a.index), timeout=5)
    


    glozow commented at 3:19 PM on September 28, 2022:

    Ok I interpret this race to have been

    1. call disconnect(1, 2), wait until 2 no longer shows up in node1.getpeerinfo()
    2. call connect(0, 2). get to_num_peers = 2 when querying node2.getpeerinfo()
    3. node1 actually disconnected from node2. now there's only 1 peer in node2.getpeerinfo().
    4. lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers == 2
    5. hang

    Which is why we need to wait for both.

    And now the order is

    1. call disconnect(1, 2), wait until 2 no longer shows up in node1.getpeerinfo() AND wait until 1 no longer shows up in node2.getpeerinfo()
    2. call connect(0, 2). get to_num_peers = 1 when querying node2.getpeerinfo()
    3. lambda: sum(peer['version'] != 0 for peer in to_connection.getpeerinfo()) == to_num_peers == 1
  15. glozow commented at 3:19 PM on September 28, 2022: member

    ACK faeea28753a94c45618c1b0ba83bb8700c53009a

  16. glozow merged this on Sep 28, 2022
  17. glozow closed this on Sep 28, 2022

  18. fanquake added the label Needs backport (24.x) on Sep 28, 2022
  19. MarcoFalke commented at 6:57 PM on September 28, 2022: member

    I think it's not unreasonable to assume that the node may update its peers in the background?

    The functional test main thread is a single thread, so there should be nothing happening in the background. If something were to happen in the background, it is either irrelevant to the test, or it could result in a rare intermittent false-failure.

  20. MarcoFalke removed the label Needs backport (24.x) on Sep 28, 2022
  21. sidhujag referenced this in commit aec13d64ea on Sep 28, 2022
  22. fanquake referenced this in commit f34c98a460 on Sep 29, 2022
  23. MarcoFalke deleted the branch on Sep 29, 2022
  24. bitcoin locked this on Sep 29, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-17 06:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me