net: fix race condition in self-connect detection #30394

pull theStack wants to merge 3 commits into bitcoin:master from theStack:202407-p2p-fix_selfdetection_racecond changing 4 files +25 −11
  1. theStack commented at 7:06 pm on July 4, 2024: contributor

    This PR fixes a recently discovered race condition in the self-connect detection (see #30362 and #30368).

    Initiating an outbound network connection currently involves the following steps after the socket connection is established (see CConnman::OpenNetworkConnection method):

    1. set up node state
    2. queue VERSION message (both steps 1 and 2 happen in InitializeNode)
    3. add new node to vector m_nodes

    If we connect to ourself, it can happen that the sent VERSION message (step 2) is received and processed locally before the node object is added to the connection manager’s m_nodes vector (step 3). In this case, the self-connect remains undiscovered, as the detection doesn’t find the outbound peer in m_nodes yet (see CConnman::CheckIncomingNonce).

    Fix this by swapping the order of 2. and 3., by taking the PushNodeVersion call out of InitializeNode and doing that in the SendMessages method instead, which is only called for CNode instances in m_nodes.

    The temporarily reverted test introduced in #30362 is readded. Fixes #30368.

    Thanks go to vasild, mzumsande and dergoegge for suggestions on how to fix this (see #30368 (comment) ff. and #30394 (review)).

  2. DrahtBot commented at 7:06 pm on July 4, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK naiyoma, mzumsande, tdb3, glozow, dergoegge
    Stale ACK maflcko, furszy, sipa

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30111 (locks: introduce mutex for tx download, flush rejection filters on UpdatedBlockTip by glozow)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label P2P on Jul 4, 2024
  4. tdb3 commented at 9:21 pm on July 4, 2024: contributor

    Approach ACK

    Great job finding the bug and fixing it.

    To sanity check, re-arranged calls to mimick failure recreation (https://github.com/bitcoin/bitcoin/issues/30368#issuecomment-2200034205). p2p_handshake failed as expected after the re-arrange.

     0diff --git a/src/net.cpp b/src/net.cpp
     1index a10be3e33a..6ed167d6fd 100644
     2--- a/src/net.cpp
     3+++ b/src/net.cpp
     4@@ -2921,6 +2921,8 @@ void CConnman::OpenNetworkConnection(const CAddress& addrConnect, bool fCountFai
     5     pnode->grantOutbound = std::move(grant_outbound);
     6 
     7     m_msgproc->InitializeNode(*pnode, nLocalServices);
     8+    m_msgproc->SendInitialMessages(*pnode);
     9+    std::this_thread::sleep_for(std::chrono::seconds{1});
    10     {
    11         LOCK(m_nodes_mutex);
    12         m_nodes.push_back(pnode);
    13@@ -2928,7 +2930,7 @@ void CConnman::OpenNetworkConnection(const CAddress& addrConnect, bool fCountFai
    14         // update connection count by network
    15         if (pnode->IsManualOrFullOutboundConn()) ++m_network_conn_counts[pnode->addr.GetNetwork()];
    16     }
    17-    m_msgproc->SendInitialMessages(*pnode);
    18+    //m_msgproc->SendInitialMessages(*pnode);
    19 }
    

    Also started on signet, connected to peers, and synced the latest ~1500 blocks successfully.

    The name SendInitialMessages() seems reasonable to me, since it’s generic enough to adapt to future change without rename.

    I’d like to double check @vasild’s comment (https://github.com/bitcoin/bitcoin/issues/30368#issuecomment-2202728919) before full ACK.

    In case you settle with adding to m_nodes before PushNodeVersion() and I do not manage to review, it is good to ensure that no code will be upset by an existence of an (oubound) entry in m_nodes that does not have version pushed.

  5. maflcko commented at 9:36 am on July 5, 2024: member
    ACK 7c9684a5a357e300f4901167a4bb0e1bdafd075d
  6. DrahtBot requested review from tdb3 on Jul 5, 2024
  7. in src/net.h:998 in 7c9684a5a3 outdated
    990@@ -991,9 +991,12 @@ class NetEventsInterface
    991     /** Mutex for anything that is only accessed via the msg processing thread */
    992     static Mutex g_msgproc_mutex;
    993 
    994-    /** Initialize a peer (setup state, queue any initial messages) */
    995+    /** Initialize a peer (setup state) */
    996     virtual void InitializeNode(CNode& node, ServiceFlags our_services) = 0;
    997 
    998+    /** Queue initial messages for peer (only relevant for outbound connections) */
    999+    virtual void SendInitialMessages(CNode& node) = 0;
    


    dergoegge commented at 9:18 am on July 8, 2024:
    I think I’d be nicer to not change this interface and I believe it would be trivial to just use SendMessages? i.e. push the version message to outbound peers at the start of SendMessages (if it hasn’t been pushed already).

    theStack commented at 2:23 pm on July 8, 2024:
    Sounds like a good idea. The approach has the small drawback that the CNode state has to be extended by a boolean variable (something like e.g. m_initial_version_message_sent), but that seems to be the lesser evil than changing the NetEventsInterface. Will adapt later if there are no objections.

    dergoegge commented at 2:25 pm on July 8, 2024:
    Please put that state on Peer if you go this route.

    theStack commented at 5:39 pm on July 8, 2024:
    Done as suggested.
  8. theStack force-pushed on Jul 8, 2024
  9. theStack commented at 5:38 pm on July 8, 2024: contributor
    Thanks for the reviews! I force-pushed following @dergoegge’s suggestion of doing the version message push in SendMessages instead, with the advantage of not needing to change the NetEventsInterface (sorry for invalidating the ACK @maflcko). The previous version is still available on the following branch: https://github.com/theStack/bitcoin/tree/202407-p2p-fix_selfdetection_racecond_oldversion
  10. in src/net.h:994 in 56d24db9f6 outdated
    990@@ -991,7 +991,7 @@ class NetEventsInterface
    991     /** Mutex for anything that is only accessed via the msg processing thread */
    992     static Mutex g_msgproc_mutex;
    993 
    994-    /** Initialize a peer (setup state, queue any initial messages) */
    995+    /** Initialize a peer (setup state) */
    


    dergoegge commented at 8:23 am on July 9, 2024:
    (not in this PR but) Since we don’t want any messages send in InitializeNode, maybe it makes sense to disallow that somehow, perhaps by not passing a CNode?

    theStack commented at 10:31 am on July 9, 2024:
    Makes sense. I think the easiest way to achieve this is by passing the CNode reference as const instead. Since that is simple enough, I’ve added an extra commit for this (with an example of a compiler error in the commit message).
  11. in src/net_processing.cpp:5828 in 56d24db9f6 outdated
    5816@@ -5817,6 +5817,12 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
    5817     // disconnect misbehaving peers even before the version handshake is complete.
    5818     if (MaybeDiscourageAndDisconnect(*pto, *peer)) return true;
    5819 
    5820+    // Initiate version handshake for outbound connections
    5821+    if (!pto->IsInboundConn() && !peer->m_initial_version_message_sent) {
    5822+        PushNodeVersion(*pto, *peer);
    5823+        peer->m_initial_version_message_sent = true;
    5824+    }
    


    dergoegge commented at 8:35 am on July 9, 2024:

    This isn’t fully atomic, i.e. in theory two threads could enter this section at the same time.

    Not tested but maybe the following works:

    0    if (!pto->IsInboundConn() && peer->m_initial_version_message_sent.compare_exchange_strong(false, true)) {
    1        PushNodeVersion(*pto, *peer);
    2    }
    

    Alternatively, we could guard m_initial_version_message_sent with NetEventsInterface::g_msgproc_mutex, since it is only accessed from the “msghand” thread.


    theStack commented at 10:31 am on July 9, 2024:
    Thanks, I’ve decided to take the second approach, i.e. guarding with NetEventsInterface::g_msgproc_mutex.
  12. theStack force-pushed on Jul 9, 2024
  13. dergoegge approved
  14. dergoegge commented at 10:35 am on July 9, 2024: member
    Code review ACK fb587ce36afcc8789214af45a36082c4f5238e50
  15. DrahtBot requested review from maflcko on Jul 9, 2024
  16. in test/functional/p2p_handshake.py:96 in fb587ce36a outdated
    94-        # the race condition causing issue #30368 is fixed
    95+        self.log.info("Check that connecting to ourself leads to immediate disconnect")
    96+        with node.assert_debug_log(["connected to self", "disconnecting"]):
    97+            node_listen_addr = f"127.0.0.1:{p2p_port(0)}"
    98+            node.addconnection(node_listen_addr, "outbound-full-relay", self.options.v2transport)
    99+        self.wait_until(lambda: len(node.getpeerinfo()) == 0)
    


    maflcko commented at 12:33 pm on July 9, 2024:
    0            self.wait_until(lambda: len(node.getpeerinfo()) == 0)
    

    nit in the last commit: I am not a fan of using assert_debug_log to sync the test logic. Otherwise the test may fail on slow machines.

    If you want to keep it, please make the sync explicit by increasing the timeout of assert_debug_log. Otherwise, you can do a proper wait (as suggested) by moving the wait_until into the inner scope.

  17. in src/net_processing.cpp:5823 in fb587ce36a outdated
    5816@@ -5817,6 +5817,12 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
    5817     // disconnect misbehaving peers even before the version handshake is complete.
    5818     if (MaybeDiscourageAndDisconnect(*pto, *peer)) return true;
    5819 
    5820+    // Initiate version handshake for outbound connections
    5821+    if (!pto->IsInboundConn() && !peer->m_initial_version_message_sent) {
    5822+        PushNodeVersion(*pto, *peer);
    5823+        peer->m_initial_version_message_sent = true;
    


    maflcko commented at 12:33 pm on July 9, 2024:
    0        peer->m_outbound_version_message_sent = true;
    

    nit: Could clarify the name

  18. maflcko commented at 12:34 pm on July 9, 2024: member
    left two nits.
  19. DrahtBot requested review from maflcko on Jul 9, 2024
  20. maflcko commented at 12:34 pm on July 9, 2024: member
    ACK fb587ce36afcc8789214af45a36082c4f5238e50
  21. furszy commented at 1:28 pm on July 9, 2024: member

    Code review ACK fb587ce36afcc8789214af45a36082c4f5238e50

    Nice finding.

  22. in src/net.h:994 in e1349beed4 outdated
    990@@ -991,8 +991,8 @@ class NetEventsInterface
    991     /** Mutex for anything that is only accessed via the msg processing thread */
    992     static Mutex g_msgproc_mutex;
    993 
    994-    /** Initialize a peer (setup state, queue any initial messages) */
    


    glozow commented at 2:03 pm on July 9, 2024:
    nitty nit: this diff should be in f8594553fab6f865b13be4d5873b4ff55aae62d2, not e1349beed448cbecb276991019e78f0fccf579cd
  23. glozow commented at 3:01 pm on July 9, 2024: member
    lgtm, do you plan to take any of the suggestions?
  24. theStack force-pushed on Jul 9, 2024
  25. theStack commented at 3:40 pm on July 9, 2024: contributor

    lgtm, do you plan to take any of the suggestions?

    Tackled all of the ordinary and nitty nits (thanks Marco and Gloria), re-review should be trivial.

  26. sipa commented at 3:55 pm on July 9, 2024: member
    utACK a3122d38c65577cc13d88f8690da1e8f9d959c74, only reviewed the test change lightly.
  27. DrahtBot requested review from dergoegge on Jul 9, 2024
  28. DrahtBot requested review from furszy on Jul 9, 2024
  29. mzumsande commented at 4:15 pm on July 9, 2024: contributor

    I wonder if this could this lead to problems with “non-shy” software that would send VERSION immediately on incoming connections (which bitcoin core doesn’t do thanks to Hal Finney’s #101, but alternative clients might).

    When Bitcoin Core makes an outgoing connection after this patch, it won’t send out the version message immediately, so it could receive a version message before sending one. In that case, I think that ProcessMessage could send out other messages (WTXIDRELAY etc.) before sending the version message, which would be a protocol violation.

    One possible fix for this might be to not process messages of outbound peers before having sent the version message.

    I didn’t test any of this, this is just from reading the code.

  30. theStack commented at 5:23 pm on July 9, 2024: contributor

    @mzumsande: Good catch! With that in my mind, I see the following paths forward:

    1. Queue outbound VERSION message immediately in CConnman::OpenNetworkConnection (as done in a previous version of this PR: https://github.com/theStack/bitcoin/tree/202407-p2p-fix_selfdetection_racecond_oldversion, extending the NetEventsInterface with a new SendInitialMessages method, called at the end)
    2. Adapt the current state of this PR to only start processing messages of outbound peers after m_outbound_version_message_sent is set (as you suggested above)
    3. Keep it as it is, as there are probably no relevant node implementations out there that are still non-shy (light clients generally don’t accept inbound connections as far as I’m aware, so they wouldn’t be affected by it)

    Any other options? 2. seems overly complex and 3. is more of a strong gut feeling (but can hardly be proven and it’s kind of irresponsible, so I guess we really shouldn’t do that), so I would tend to go for option 1.

  31. dergoegge commented at 5:30 pm on July 9, 2024: member
    Option 2. seems the best to me.
  32. sipa commented at 5:34 pm on July 9, 2024: member
    I think (2) is a 1-line change: return immediately from ProcessMessages if no version message has been sent by us to the peer.
  33. theStack force-pushed on Jul 9, 2024
  34. theStack commented at 5:59 pm on July 9, 2024: contributor

    Option 2. seems the best to me.

    I think (2) is a 1-line change: return immediately from ProcessMessages if no version message has been sent by us to the peer.

    Thanks for the quick feedback, done.

  35. mzumsande commented at 6:51 pm on July 9, 2024: contributor
    I also think that 2) would be the best approach, because 1) doesn’t seem to remove the race, just makes it more unlikely: If a version message was received after InitializeNode() and before SendInitialMessages() I think the race would still be happening.
  36. DrahtBot added the label CI failed on Jul 9, 2024
  37. DrahtBot commented at 7:33 pm on July 9, 2024: contributor

    🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

    Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    Leave a comment here, if you need help tracking down a confusing failure.

    Debug: https://github.com/bitcoin/bitcoin/runs/27231374754

  38. net: fix race condition in self-connect detection
    Initiating an outbound network connection currently involves the
    following steps after the socket connection is established (see
     `CConnman::OpenNetworkConnection` method):
        1. set up node state
        2. queue VERSION message
        3. add new node to vector `m_nodes`
    
    If we connect to ourself, it can happen that the sent VERSION message
    (step 2) is received and processed locally *before* the node object
    is added to the connection manager's `m_nodes` vector (step 3). In this
    case, the self-connect remains undiscovered, as the detection doesn't
    find the outbound peer in `m_nodes` yet (see `CConnman::CheckIncomingNonce`).
    
    Fix this by swapping the order of 2. and 3., by taking the `PushNodeVersion`
    call out of `InitializeNode` and doing that in the `SendMessages` method
    instead, which is only called for `CNode` instances in `m_nodes`.
    
    Thanks go to vasild, mzumsande, dergoegge and sipa for suggestions on
    how to fix this.
    66673f1c13
  39. net: prevent sending messages in `NetEventsInterface::InitializeNode`
    Now that the queueing of the VERSION messages has been moved out of
    `InitializeNode`, there is no need to pass a mutable `CNode` reference any
    more. With a const reference, trying to send messages in this method would
    lead to a compile-time error, e.g.:
    
    ----------------------------------------------------------------------------------------------------------------------------------
    ...
    net_processing.cpp: In member function ‘virtual void {anonymous}::PeerManagerImpl::InitializeNode(const CNode&, ServiceFlags)’:
    net_processing.cpp:1683:21: error: binding reference of type ‘CNode&’ to ‘const CNode’ discards qualifiers
     1683 |     PushNodeVersion(node, *peer);
    ...
    ----------------------------------------------------------------------------------------------------------------------------------
    0dbcd4c148
  40. Reapply "test: p2p: check that connecting to ourself leads to disconnect"
    This reverts commit 9ec2c53701a391629b55aeb2804e8060d2c453a4 with
    a tiny change included (identation of the wait_until call).
    16bd283b3a
  41. theStack force-pushed on Jul 9, 2024
  42. theStack commented at 8:11 pm on July 9, 2024: contributor
    The unit test failure in the CI was caused by the ConnmanTestMsg::Handshake helper (used in the unit test outbound_slow_chain_eviction), where a version handshake is done manually by calling the relevant connman/peerman methods. Due to a missing call to SendMessages after InitializeNode (not needed previously), the ProcessMessages call after wouldn’t have any effect with the new logic. Should be fixed now.
  43. furszy commented at 9:11 pm on July 9, 2024: member

    When Bitcoin Core makes an outgoing connection after this patch, it won’t send out the version message immediately, so it could receive a version message before sending one. In that case, I think that ProcessMessage could send out other messages (WTXIDRELAY etc.) before sending the version message, which would be a protocol violation.

    Can we get a test for this? Fixes are great but tests are awesome to not break it again in the future.

  44. DrahtBot removed the label CI failed on Jul 9, 2024
  45. mzumsande commented at 9:19 pm on July 9, 2024: contributor

    Can we get a test for this? Fixes are great but tests are awesome to not break it again in the future.

    Our own test framework was behaving that way before #29353, so maybe the functional tests would have caught it? I thought about reverting c340503b67585b631909584f1acaa327f5af25d3 locally to find out, but didn’t have the time yet.

  46. naiyoma commented at 12:41 pm on July 10, 2024: contributor
    tested ACK https://github.com/bitcoin/bitcoin/pull/30394/commits/16bd283b3ad05daa41259a062aee0fc05b463fa6, built and tested locally, test passes successfully.
  47. DrahtBot requested review from sipa on Jul 10, 2024
  48. mzumsande commented at 6:35 pm on July 10, 2024: contributor

    ACK 16bd283b3ad05daa41259a062aee0fc05b463fa6

    Our own test framework was behaving that way before #29353, so maybe the functional tests would have caught it? I thought about reverting c340503 locally to find out, but didn’t have the time yet.

    FYI, I tested reverting c340503b67585b631909584f1acaa327f5af25d3 on top of this branch now and without the latest fix (2) to ProcessMessages, other messages are actually sent before version, and p2p_add_connection.py fails intermittently - so the fix works.

  49. tdb3 approved
  50. tdb3 commented at 0:43 am on July 11, 2024: contributor
    ACK 16bd283b3ad05daa41259a062aee0fc05b463fa6
  51. glozow commented at 1:17 pm on July 15, 2024: member
    ACK 16bd283b3ad05daa41259a062aee0fc05b463fa6
  52. dergoegge approved
  53. dergoegge commented at 4:40 pm on July 15, 2024: member
    ACK 16bd283b3ad05daa41259a062aee0fc05b463fa6
  54. glozow merged this on Jul 16, 2024
  55. glozow closed this on Jul 16, 2024

  56. theStack deleted the branch on Jul 16, 2024
  57. maflcko commented at 11:42 am on July 16, 2024: member
  58. maflcko commented at 1:07 pm on July 16, 2024: member

    Re: #30394#pullrequestreview-2166824948

    In that case, I think that ProcessMessage could send out other messages (WTXIDRELAY etc.) before sending the version message, which would be a protocol violation.

    Can you add a reference for this? In the BIP I could only find “The wtxidrelay message MUST be sent in response to a version message from a peer”, not that it has to be sent after the own version.

  59. mzumsande commented at 2:31 pm on July 16, 2024: contributor
    I didn’t find a reference in a BIP (there isn’t any for the general version message flow as far as I can see) and maybe “violation” is too strong, but any message received before the version would be ignored by bitcoin core peers: https://github.com/bitcoin/bitcoin/blob/46878326808f643f08ec3f69dcec5c8fafeb7b59/src/net_processing.cpp#L3889-L3893 So this is clearly not the intended way the protocol should work.
  60. fanquake added the label Needs backport (27.x) on Jul 16, 2024
  61. fanquake referenced this in commit 7c43043188 on Jul 17, 2024
  62. fanquake referenced this in commit 7b9dd554aa on Jul 17, 2024
  63. fanquake referenced this in commit 5193af3ff9 on Jul 17, 2024
  64. fanquake referenced this in commit 0933cf53b4 on Jul 17, 2024
  65. fanquake referenced this in commit 064f214673 on Jul 17, 2024
  66. fanquake referenced this in commit ab42206652 on Jul 17, 2024
  67. fanquake commented at 10:32 am on July 17, 2024: member
    Backported to 27.x in #30467.
  68. glozow referenced this in commit 20ccb30b7a on Jul 18, 2024
  69. fanquake removed the label Needs backport (27.x) on Jul 24, 2024
  70. fanquake referenced this in commit 0cbdc6b380 on Jul 24, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-16 19:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me