ci: failure in p2p_handshake.py #30368

issue fanquake openend this issue on July 1, 2024
  1. fanquake commented at 10:35 am on July 1, 2024: member

    https://github.com/bitcoin/bitcoin/actions/runs/9741796701/job/26881733806#step:7:6601

     0  node0 2024-07-01T10:31:12.195996Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendheaders (0 bytes) peer=19 
     1 test  2024-07-01T10:32:32.120000Z TestFramework (ERROR): Assertion failed 
     2                                   Traceback (most recent call last):
     3                                     File "/home/runner/work/_temp/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_framework.py", line 132, in main
     4                                       self.run_test()
     5                                     File "/home/runner/work/_temp/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/p2p_handshake.py", line 93, in run_test
     6                                       with node.assert_debug_log(["connected to self", "disconnecting"]):
     7                                     File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
     8                                       next(self.gen)
     9                                     File "/home/runner/work/_temp/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_node.py", line 491, in assert_debug_log
    10                                       self._raise_assertion_error('Expected messages "{}" does not partially match log:\n\n{}\n\n'.format(str(expected_msgs), print_log))
    11                                     File "/home/runner/work/_temp/ci/scratch/build/bitcoin-x86_64-pc-linux-gnu/test/functional/test_framework/test_node.py", line 188, in _raise_assertion_error
    12                                       raise AssertionError(self._node_msg(msg))
    13                                   AssertionError: [node 0] Expected messages "['connected to self', 'disconnecting']" does not partially match log:
    14                                    - 2024-07-01T10:31:12.118903Z [http] [httpserver.cpp:305] [http_request_cb] [http] Received a POST request for / from 127.0.0.1:58234
    15                                    - 2024-07-01T10:31:12.119030Z [httpworker.3] [rpc/request.cpp:232] [parse] [rpc] ThreadRPCServer method=addconnection user=__cookie__
    16                                    - 2024-07-01T10:31:12.119166Z [httpworker.3] [net.cpp:407] [ConnectNode] [net] trying v1 connection 127.0.0.1:11072 lastseen=0.0hrs
    17                                    - 2024-07-01T10:31:12.119487Z [httpworker.3] [net.cpp:3747] [CNode] [net] Added connection peer=19
    18                                    - 2024-07-01T10:31:12.119532Z [net] [net.cpp:3747] [CNode] [net] Added connection peer=20
    19                                    - 2024-07-01T10:31:12.119620Z [net] [net.cpp:1814] [CreateNodeFromAcceptedSocket] [net] connection from 127.0.0.1:36064 accepted
    20                                    - 2024-07-01T10:31:12.120224Z [httpworker.3] [net.cpp:3791] [PushMessage] [net] sending version (114 bytes) peer=19
    21                                    - 2024-07-01T10:31:12.120525Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: version (114 bytes) peer=20
    22                                    - 2024-07-01T10:31:12.120606Z [msghand] [net.cpp:3791] [PushMessage] [net] sending version (114 bytes) peer=20
    23                                    - 2024-07-01T10:31:12.120711Z [msghand] [net_processing.cpp:1624] [PushNodeVersion] [net] send version message: version 70016, blocks=202, txrelay=1, peer=20
    24                                    - 2024-07-01T10:31:12.120739Z [msghand] [net.cpp:3791] [PushMessage] [net] sending wtxidrelay (0 bytes) peer=20
    25                                    - 2024-07-01T10:31:12.120810Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendaddrv2 (0 bytes) peer=20
    26                                    - 2024-07-01T10:31:12.121500Z [msghand] [net.cpp:3791] [PushMessage] [net] sending verack (0 bytes) peer=20
    27                                    - 2024-07-01T10:31:12.121619Z [msghand] [net_processing.cpp:3865] [ProcessMessage] [net] receive version message: /Satoshi:27.99.0(testnode0)/: version 70016, blocks=202, us=[::]:0, txrelay=1, peer=20
    28                                    - 2024-07-01T10:31:12.121699Z [httpworker.3] [net_processing.cpp:1624] [PushNodeVersion] [net] send version message: version 70016, blocks=202, txrelay=1, peer=19
    29                                    - 2024-07-01T10:31:12.173332Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: version (114 bytes) peer=19
    30                                    - 2024-07-01T10:31:12.173389Z [msghand] [net.cpp:3791] [PushMessage] [net] sending wtxidrelay (0 bytes) peer=19
    31                                    - 2024-07-01T10:31:12.173492Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendaddrv2 (0 bytes) peer=19
    32                                    - 2024-07-01T10:31:12.174163Z [msghand] [net.cpp:3791] [PushMessage] [net] sending verack (0 bytes) peer=19
    33                                    - 2024-07-01T10:31:12.174318Z [msghand] [net.cpp:3791] [PushMessage] [net] sending getaddr (0 bytes) peer=19
    34                                    - 2024-07-01T10:31:12.174407Z [msghand] [net_processing.cpp:3865] [ProcessMessage] [net] receive version message: /Satoshi:27.99.0(testnode0)/: version 70016, blocks=202, us=[::]:0, txrelay=1, peer=19
    35                                    - 2024-07-01T10:31:12.174440Z [msghand] [node/timeoffsets.cpp:31] [Add] [net] Added time offset +0s, total samples 8
    36                                    - 2024-07-01T10:31:12.174635Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: wtxidrelay (0 bytes) peer=19
    37                                    - 2024-07-01T10:31:12.174736Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendaddrv2 (0 bytes) peer=19
    38                                    - 2024-07-01T10:31:12.174851Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: verack (0 bytes) peer=19
    39                                    - 2024-07-01T10:31:12.174884Z [msghand] [net_processing.cpp:3910] [ProcessMessage] New outbound-full-relay v1 peer connected: version: 70016, blocks=202, peer=19
    40                                    - 2024-07-01T10:31:12.174908Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendcmpct (9 bytes) peer=19
    41                                    - 2024-07-01T10:31:12.175052Z [msghand] [net.cpp:3791] [PushMessage] [net] sending ping (8 bytes) peer=19
    42                                    - 2024-07-01T10:31:12.175299Z [msghand] [net.cpp:3791] [PushMessage] [net] sending getheaders (645 bytes) peer=19
    43                                    - 2024-07-01T10:31:12.175392Z [msghand] [net_processing.cpp:5886] [SendMessages] [net] initial getheaders (201) to peer=19 (startheight:202)
    44                                    - 2024-07-01T10:31:12.175533Z [msghand] [net.cpp:3791] [PushMessage] [net] sending feefilter (8 bytes) peer=19
    45                                    - 2024-07-01T10:31:12.181934Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: wtxidrelay (0 bytes) peer=20
    46                                    - 2024-07-01T10:31:12.182171Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendaddrv2 (0 bytes) peer=20
    47                                    - 2024-07-01T10:31:12.182305Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: verack (0 bytes) peer=20
    48                                    - 2024-07-01T10:31:12.182339Z [msghand] [net_processing.cpp:3910] [ProcessMessage] New inbound v1 peer connected: version: 70016, blocks=202, peer=20
    49                                    - 2024-07-01T10:31:12.182363Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendcmpct (9 bytes) peer=20
    50                                    - 2024-07-01T10:31:12.182510Z [msghand] [net.cpp:3791] [PushMessage] [net] sending ping (8 bytes) peer=20
    51                                    - 2024-07-01T10:31:12.182633Z [msghand] [net.cpp:3791] [PushMessage] [net] sending getheaders (645 bytes) peer=20
    52                                    - 2024-07-01T10:31:12.182731Z [msghand] [net_processing.cpp:5886] [SendMessages] [net] initial getheaders (201) to peer=20 (startheight:202)
    53                                    - 2024-07-01T10:31:12.182868Z [msghand] [net.cpp:3791] [PushMessage] [net] sending feefilter (8 bytes) peer=20
    54                                    - 2024-07-01T10:31:12.183073Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: getaddr (0 bytes) peer=20
    55                                    - 2024-07-01T10:31:12.183266Z [msghand] [addrman.cpp:840] [GetAddr_] [addrman] GetAddr returned 0 random addresses
    56                                    - 2024-07-01T10:31:12.183804Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendcmpct (9 bytes) peer=19
    57                                    - 2024-07-01T10:31:12.183937Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: ping (8 bytes) peer=19
    58                                    - 2024-07-01T10:31:12.183965Z [msghand] [net.cpp:3791] [PushMessage] [net] sending pong (8 bytes) peer=19
    59                                    - 2024-07-01T10:31:12.186653Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendcmpct (9 bytes) peer=20
    60                                    - 2024-07-01T10:31:12.186783Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: ping (8 bytes) peer=20
    61                                    - 2024-07-01T10:31:12.186810Z [msghand] [net.cpp:3791] [PushMessage] [net] sending pong (8 bytes) peer=20
    62                                    - 2024-07-01T10:31:12.186931Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: getheaders (645 bytes) peer=19
    63                                    - 2024-07-01T10:31:12.186975Z [msghand] [net_processing.cpp:4463] [ProcessMessage] [net] getheaders 202 to end from peer=19
    64                                    - 2024-07-01T10:31:12.187008Z [msghand] [net.cpp:3791] [PushMessage] [net] sending headers (82 bytes) peer=19
    65                                    - 2024-07-01T10:31:12.187180Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: feefilter (8 bytes) peer=19
    66                                    - 2024-07-01T10:31:12.187218Z [msghand] [net_processing.cpp:5242] [ProcessMessage] [net] received: feefilter of 0.00001000 BTC/kvB from peer=19
    67                                    - 2024-07-01T10:31:12.187289Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: getheaders (645 bytes) peer=20
    68                                    - 2024-07-01T10:31:12.187321Z [msghand] [net_processing.cpp:4463] [ProcessMessage] [net] getheaders 202 to end from peer=20
    69                                    - 2024-07-01T10:31:12.187355Z [msghand] [net.cpp:3791] [PushMessage] [net] sending headers (82 bytes) peer=20
    70                                    - 2024-07-01T10:31:12.190584Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: feefilter (8 bytes) peer=20
    71                                    - 2024-07-01T10:31:12.190632Z [msghand] [net_processing.cpp:5242] [ProcessMessage] [net] received: feefilter of 0.00001000 BTC/kvB from peer=20
    72                                    - 2024-07-01T10:31:12.190758Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: pong (8 bytes) peer=19
    73                                    - 2024-07-01T10:31:12.190831Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: pong (8 bytes) peer=20
    74                                    - 2024-07-01T10:31:12.190965Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: headers (82 bytes) peer=19
    75                                    - 2024-07-01T10:31:12.191601Z [msghand] [net_processing.cpp:2995] [UpdatePeerStateForReceivedHeaders] [net] Protecting outbound peer=19 from eviction
    76                                    - 2024-07-01T10:31:12.191651Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendheaders (0 bytes) peer=19
    77                                    - 2024-07-01T10:31:12.191862Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: headers (82 bytes) peer=20
    78                                    - 2024-07-01T10:31:12.192242Z [msghand] [net.cpp:3791] [PushMessage] [net] sending sendheaders (0 bytes) peer=20
    79                                    - 2024-07-01T10:31:12.195693Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendheaders (0 bytes) peer=20
    80                                    - 2024-07-01T10:31:12.195996Z [msghand] [net_processing.cpp:3664] [ProcessMessage] [net] received: sendheaders (0 bytes) peer=19
    81 test  2024-07-01T10:32:32.124000Z TestFramework (DEBUG): Closing down network thread 
    82 test  2024-07-01T10:32:32.174000Z TestFramework (INFO): Stopping nodes 
    83 test  2024-07-01T10:32:32.174000Z TestFramework.node0 (DEBUG): Stopping node 
    

    Related to #30362. cc @theStack

  2. willcl-ark added the label Tests on Jul 1, 2024
  3. willcl-ark added the label P2P on Jul 1, 2024
  4. theStack commented at 12:37 pm on July 1, 2024: contributor

    Seems like it’s possible that the VERSION message for a new outbound connection is sent out, received locally and processed before the corresponding CNode instance is added to the m_nodes list, so the self-connection can’t be detected yet. I can reproduce the failure locally with the following patch:

     0diff --git a/src/net.cpp b/src/net.cpp
     1index 990c58ee3d..64b5995dbe 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -2921,6 +2921,7 @@ void CConnman::OpenNetworkConnection(const CAddress& addrConnect, bool fCountFai
     5     pnode->grantOutbound = std::move(grant_outbound);
     6
     7     m_msgproc->InitializeNode(*pnode, nLocalServices);
     8+    std::this_thread::sleep_for(std::chrono::seconds{1});
     9     {
    10         LOCK(m_nodes_mutex);
    11         m_nodes.push_back(pnode);
    

    One simple solution would be to put the InitializeNode call also under the m_nodes_mutex lock, like e.g. (the delay is included for demonstration purposes):

     0diff --git a/src/net.cpp b/src/net.cpp
     1index 990c58ee3d..e314675356 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -2920,9 +2920,10 @@ void CConnman::OpenNetworkConnection(const CAddress& addrConnect, bool fCountFai
     5         return;
     6     pnode->grantOutbound = std::move(grant_outbound);
     7
     8-    m_msgproc->InitializeNode(*pnode, nLocalServices);
     9     {
    10         LOCK(m_nodes_mutex);
    11+        m_msgproc->InitializeNode(*pnode, nLocalServices);
    12+        std::this_thread::sleep_for(std::chrono::seconds{1});
    13         m_nodes.push_back(pnode);
    14
    15         // update connection count by network
    

    This works for me, but I’m not sure if this has other unintended side-effects or there is a better solution. Pinging some contributors who have worked deeper in the net/P2P areas for ideas/comments: @mzumsande @sipa @sr-gi @vasild

  5. Sjors commented at 12:38 pm on July 1, 2024: member
    Noticed as well for #30356 which is not based on #30362, but I guess CI runs on a merge commit? https://github.com/bitcoin/bitcoin/actions/runs/9742619062/job/26884283814?pr=30356
  6. maflcko removed the label Tests on Jul 1, 2024
  7. maflcko added the label CI failed on Jul 1, 2024
  8. vasild commented at 4:28 pm on July 1, 2024: contributor

    Calling InitializeNode() after acquiring m_nodes_mutex has the potential to cause a deadlock because of acquiring cs_main and m_nodes_mutex in different order in different parts of the code:

    1. Existent code: PeerManagerImpl::NewPoWValidBlock() acquires cs_main and calls m_connman.ForEachNode() which acquires m_nodes_mutex.
    2. New code: CConnman::OpenNetworkConnection() acquires m_nodex_mutex and calls InitializeNode() which acquires cs_main

    :bomb:

  9. vasild commented at 4:54 pm on July 1, 2024: contributor

    Rudimentary idea: insert the entry in CConnman::m_nodes just before PushVersion().

    0void PeerManagerImpl::InitializeNode(CNode& node, ServiceFlags our_services)
    1{
    2...
    3    // void CConnman::NodesAppend() { WITH_LOCK(m_nodes_mutex, m_nodes.push_back(node)); }
    4    m_connman.NodesAppend(node); 
    5    if (!node.IsInboundConn()) {
    6        PushNodeVersion(node, *peer);
    7    }
    8}
    

    or, same idea but implement by splitting the PushNodeVersion() bit away from InitializeNode():

    0    // no PushNodeVersion() by InitializeNode()
    1    m_msgproc->InitializeNode(*pnode, nLocalServices);
    2    {   
    3          LOCK(m_nodes_mutex);
    4          m_nodes.push_back(pnode);
    5...
    6    }
    7    // BeginHandshake() { if (!node.IsInboundConn()) { PushNodeVersion(node, *peer); } }
    8    m_msgproc->BeginHandshake();
    
  10. mzumsande commented at 5:03 pm on July 1, 2024: contributor

    Looks to me like this is an actual bug that could happen in the real network when making an automatic outbound connection to yourself (where the opencon thread instead of a httpworker thread opens the connection, whereas the net thread accepts connection in both cases), so it’s probably not just a ci/test failure?!

    Maybe a workable fix would be to take the PushNodeVersion call out of InitializeNode and call it separately after having added the node to m_nodes (which would mean adding PushNodeVersion to NetEventsInterface) [Edit: looks like this is similar to vasild’s second suggestion which I didn’t see before submitting this answer]

  11. mzumsande commented at 6:40 pm on July 1, 2024: contributor
    since the failures are frequent and it seems that the solution won’t be all that trivial, I think it might be best to revert #30362 temporarily and re-introduce it together with a fix in net - without any pressure to get the CI green quickly.
  12. theStack commented at 6:59 pm on July 1, 2024: contributor

    Thanks for the suggestions @vasild and @mzumsande, moving the PushNodeVersion part out of InitializeNode and calling it after the CNode instance is added to the list sounds like a good idea.

    since the failures are frequent and it seems that the solution won’t be all that trivial, I think it might be best to revert #30362 temporarily and re-introduce it together with a fix in net - without any pressure to get the CI green quickly.

    Agree, opened a revert PR #30374.

  13. fanquake referenced this in commit 04d63fb8c3 on Jul 2, 2024
  14. vasild commented at 10:53 am on July 2, 2024: contributor
    In case you settle with adding to m_nodes before PushNodeVersion() and I do not manage to review, it is good to ensure that no code will be upset by an existence of an (oubound) entry in m_nodes that does not have version pushed.
  15. glozow closed this on Jul 16, 2024

  16. glozow referenced this in commit 35dddbccf1 on Jul 16, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me