p2p: protect manual evictions #34743

pull willcl-ark wants to merge 4 commits into bitcoin:master from willcl-ark:protect-manual-evictions changing 5 files +131 −11
  1. willcl-ark commented at 1:28 PM on March 5, 2026: member

    Ref: #5097

    Manual peers added via -addnode, -connect, or the addnode RPC are currently disconnected during IBD if they trigger block-stalling logic. That can conflict with explicit operator intent to keep those peers connected, especially in -connect-based setups.

    This PR changes only the block-stalling path for manual peers. Instead of disconnecting the peer, it releases that peer’s in-flight block requests so other peers can pick them up and IBD can continue.

    The change also adds a one-round recovery flag to avoid immediately reassigning the freed blocks back to the same manual peer, which could otherwise happen due to per-peer SendMessages ordering.
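    As an illustration only (not the actual net_processing code), the intended flow can be sketched as a toy Python model. All names here (`PeerState`, `handle_stall`, `maybe_request_blocks`, `stall_recovery`) are invented for the sketch; only the shape of the logic mirrors the PR: on a stall, a manual peer's in-flight blocks are released back to a shared queue and one request round is skipped.

    ```python
    # Toy model (not Bitcoin Core code) of the manual-peer stalling path:
    # on stall, release the peer's in-flight blocks instead of disconnecting,
    # and skip one round of requests so another peer can claim them first.

    class PeerState:
        def __init__(self, is_manual):
            self.is_manual = is_manual
            self.in_flight = []          # block hashes requested from this peer
            self.stall_recovery = False  # skip the next request round
            self.disconnected = False

    def handle_stall(state, global_queue):
        """Invoked when this peer is identified as the stalling peer."""
        if not state.is_manual:
            state.disconnected = True    # current behaviour for non-manual peers
            return
        # Manual peer: free its blocks for other peers and back off one round.
        global_queue.extend(state.in_flight)
        state.in_flight.clear()
        state.stall_recovery = True

    def maybe_request_blocks(state, global_queue, window=2):
        """One SendMessages-style round of block requests for this peer."""
        if state.stall_recovery:
            state.stall_recovery = False  # one-shot: resume next round
            return []
        got = global_queue[:window]
        del global_queue[:window]
        state.in_flight.extend(got)
        return got

    queue = []
    manual = PeerState(is_manual=True)
    manual.in_flight = ["block_a", "block_b"]
    handle_stall(manual, queue)
    print(manual.disconnected, queue)            # False ['block_a', 'block_b']
    print(maybe_request_blocks(manual, queue))   # [] (skipped round)
    print(maybe_request_blocks(manual, queue))   # ['block_a', 'block_b']
    ```

    In the real code another peer would normally claim the freed blocks during the skipped round; the single-peer toy above only shows that the flag suppresses exactly one round.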

  2. willcl-ark renamed this:
    Protect manual evictions
    p2p: protect manual evictions
    on Mar 5, 2026
  3. DrahtBot commented at 1:29 PM on March 5, 2026: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK w0xlt, sedited, kevkevinpal, brunoerg

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.


    Conflicts

    No conflicts as of last run.


    LLM Linter (✨ experimental)

    Possible places where comparison-specific test macros should replace generic comparisons:

    • [test/functional/p2p_ibd_stalling.py] assert manual_peer.is_connected -> Replace with a comparison-specific helper, e.g. assert_equal(manual_peer.is_connected, True) (or assert_equal(bool(manual_peer.is_connected), True)).

    <sup>2026-03-24 14:32:15</sup>

  4. DrahtBot added the label P2P on Mar 5, 2026
  5. w0xlt commented at 3:30 PM on March 5, 2026: contributor

    Concept ACK

  6. in test/functional/test_framework/test_node.py:921 in 547f119668
     916 | +
     917 | +        def addnode_callback(address, port):
     918 | +            if v2:
     919 | +                self.addnode('%s:%d' % (address, port), "onetry", v2transport=True)
     920 | +            else:
     921 | +                self.addnode('%s:%d' % (address, port), "onetry")
    


    kevkevinpal commented at 7:59 PM on March 5, 2026:

    Might be able to do this instead. It seems redundant to have the two calls.

    self.addnode('%s:%d' % (address, port), "onetry", v2transport=v2)
    

    willcl-ark commented at 10:08 AM on March 9, 2026:

    Taken in be002a7

  7. jonatack commented at 2:51 AM on March 6, 2026: member

    Thanks for picking this up, Will.

  8. sedited commented at 7:35 AM on March 6, 2026: contributor

    Concept ACK

  9. kevkevinpal commented at 3:45 PM on March 6, 2026: contributor

    Concept ACK

  10. willcl-ark force-pushed on Mar 6, 2026
  11. brunoerg commented at 2:20 PM on March 9, 2026: contributor

    Concept ACK

  12. DrahtBot added the label Needs rebase on Mar 11, 2026
  13. willcl-ark force-pushed on Mar 11, 2026
  14. DrahtBot removed the label Needs rebase on Mar 11, 2026
  15. sedited requested review from ajtowns on Mar 20, 2026
  16. in src/net_processing.cpp:6111 in 749f425094 outdated
    6113 | +                // Return early so the block download logic below doesn't
    6114 | +                // immediately re-assign them to this peer.
    6115 | +                while (!state.vBlocksInFlight.empty()) {
    6116 | +                    RemoveBlockRequest(state.vBlocksInFlight.front().pindex->GetBlockHash(), node.GetId());
    6117 | +                }
    6118 | +                state.m_stall_recovery = true;
    


    ajtowns commented at 1:36 AM on March 23, 2026:

    Would it be better for this to be a timer -- ie, if it stalls, stop requesting any blocks from this peer for X seconds? Having the peer immediately start providing blocks again seems like it sets things up for regular stall/clear/stall/clear behaviour. I could see an argument for setting X to be fairly large, eg 10 minutes, to minimise that behaviour during IBD, while still recovering gracefully after IBD is complete.
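    A timer-based variant of this suggestion could look roughly like the following toy sketch (not Core code; `PeerBackoff`, `may_request`, and the 600-second constant are all illustrative): after a stall, the peer is simply ineligible for block requests until the backoff has elapsed.

    ```python
    # Toy sketch of a timer-based backoff: after a stall, make no block
    # requests to the peer until STALL_BACKOFF has elapsed, instead of
    # skipping a single SendMessages round.

    STALL_BACKOFF = 600  # seconds; e.g. the ~10 minutes suggested above

    class PeerBackoff:
        def __init__(self):
            self.stalled_at = None  # timestamp of the last stall, or None

        def on_stall(self, now):
            self.stalled_at = now

        def may_request(self, now):
            if self.stalled_at is None:
                return True
            return now - self.stalled_at >= STALL_BACKOFF

    p = PeerBackoff()
    p.on_stall(now=1000)
    print(p.may_request(now=1001))   # False: still inside the backoff window
    print(p.may_request(now=1700))   # True: 700s elapsed >= 600s
    ```

    Compared to the one-round flag, this trades a possible stall/clear/stall cycle for a bounded period in which the manual peer contributes nothing to IBD.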

  17. in src/net_processing.cpp:6135 in accb5e4edf
    6130 | @@ -6115,27 +6131,39 @@ bool PeerManagerImpl::SendMessages(CNode& node)
    6131 |              QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
    6132 |              int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
    6133 |              if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    6134 | -                LogInfo("Timeout downloading block %s, %s", queuedBlock.pindex->GetBlockHash().ToString(), node.DisconnectMsg());
    6135 | -                node.fDisconnect = true;
    6136 | -                return true;
    6137 | +                if (node.IsManualConn()) {
    6138 | +                    LogDebug(BCLog::NET, "Not disconnecting manual peer for block download timeout, %s", node.DisconnectMsg());
    


    ajtowns commented at 1:45 AM on March 23, 2026:

    I'm not sure this change makes sense -- if a peer can't deliver a block to us in ten minutes, it's not very useful even if it was a manual connection?

  18. in src/net_processing.cpp:6159 in accb5e4edf
    6160 | +                    // or manual peers; just reset their sync state.
    6161 |                      // Note: If all our peers are inbound, then we won't
    6162 |                      // disconnect our sync peer for stalling; we have bigger
    6163 |                      // problems if we can't get any outbound peers.
    6164 | -                    if (!node.HasPermission(NetPermissionFlags::NoBan)) {
    6165 | +                    if (!node.HasPermission(NetPermissionFlags::NoBan) && !node.IsManualConn()) {
    


    ajtowns commented at 1:50 AM on March 23, 2026:

    Likewise here -- if a peer can't reply to our getheaders request in 15 minutes it doesn't seem very useful; and the "NoBan" permission is already there for keeping connections to not very useful nodes.

  19. ajtowns commented at 1:54 AM on March 23, 2026: contributor

    The PR description is quite complicated, could it be simplified?

    I believe the approach we currently have is "if stalling in block downloads is detected, unstall by disconnecting the peer with the oldest queued block", and this introduces a new unstalling technique "reset the peer with the oldest queued block's entire download queue, but let it restart downloading blocks almost immediately".

  20. ajtowns commented at 2:05 AM on March 23, 2026: contributor

    It might be good to run a test simulating this behaviour -- I think having three nodes: A doing IBD, manually connected to B and C; B normal; C with its bandwidth shaped by the OS to, say, 10kB/s (trickle -u 10 bitcoind ...)? With master, that should see A disconnect C for stalling; with this PR, I think it should see C repeatedly stall, and as a result a somewhat slower IBD than if C were disconnected.

  21. net: don't disconnect manual peers for block stalling
    Manual connections (-addnode, -connect) represent explicit user intent to
    maintain the connection, yet they currently get disconnected during IBD
    when the block download window stalls.
    
    Instead of disconnecting a stalled manual peer, release its in-flight
    blocks so other peers can request them. Set m_stall_recovery to skip one
    round of block download for this peer so its next SendMessages call
    doesn't immediately reclaim the same blocks before another peer can pick
    them up.
    
    This behavior first showed up as intermittent failures in the functional
    tests with few peers, but it can also delay recovery outside tests when a
    manual peer is revisited before other peers have a chance to request the
    freed blocks.
    409313bab6
  22. test: add add_manual_p2p_connection helper to TestNode
    Add a helper for creating manual (addnode onetry) p2p connections,
    mirroring the existing add_outbound_p2p_connection method. This will
    be used by tests that verify manual peer behavior during IBD.
    6c245778c6
  23. test: verify manual peers survive block stalling timeout
    Test that a manual (addnode) peer which stalls on a block is not
    disconnected when the stalling timeout fires.
    7c519790f5
  24. doc: update addnode rpc help and add release note 935afd4042
  25. willcl-ark force-pushed on Mar 24, 2026
  26. frankomosh commented at 7:49 AM on April 1, 2026: contributor

    Looked into the test coverage for the new m_stall_recovery mechanism and have some findings. I ran mutation testing to check how well the functional test covers the m_stall_recovery mechanism. I targeted lines 6100–6127 (the IsManualConn() stalling branch) and lines 6177–6188 (the m_stall_recovery skip logic).

    Setup:

    mutation-core mutate -f="src/net_processing.cpp" --range 6100 6127   # 8 mutants
    mutation-core mutate -f="src/net_processing.cpp" --range 6177 6188   # 9 mutants
    mutation-core analyze -f="muts-net_processing-cpp" \
      -c="cmake --build build -j$(nproc) && build/test/functional/p2p_ibd_stalling.py"
    

    Result: 52.94% mutation score (9 killed / 17 total)

    Four surviving mutants are directly in the PR's new code and seem to suggest the test doesn't exercise m_stall_recovery at all:

    # Survivor 1: disabling the flag entirely, test doesn't notice
    -                state.m_stall_recovery = true;
    +                state.m_stall_recovery = false;
     
    # Survivor 2: not returning early after releasing blocks, test doesn't notice
                    state.m_stall_recovery = true;
    -                return true;
    +                return false;
     
    # Survivor 3: skip logic completely removed, test doesn't notice
    -        if (state.m_stall_recovery) {
    +        if (1==0) {
     
    # Survivor 4: flag never resets, permanently skipping downloads, test doesn't notice
    -            state.m_stall_recovery = false;
    +            state.m_stall_recovery = true;
    

    The current test (test_manual_peer_stalling) verifies two things well:

    1. The manual peer is not disconnected (killed mutant 3.r1: fDisconnect = true mutated to false)
    2. IBD completes through other peers

    But it does not verify that the m_stall_recovery one-round skip actually fires, or has any effect. All four mutations to the skip mechanism survive because the outbound peers download blocks fast enough that the skip is irrelevant to the test outcome.

    This is not blocking in any way, the core protection (no disconnect) is well-tested. But if m_stall_recovery is meant to be a meaningful part of the design (preventing immediate re-assignment of freed blocks), it may be worth adding a test assertion that verifies the manual peer doesn't get new block requests in the immediate round after release?

    Tested on Ubuntu 24.04
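    The suggested assertion can be illustrated with a self-contained toy check (invented names, not the functional test framework): after a release, the peer must request nothing in the very next round and resume in the round after. An assertion of this shape distinguishes the skip-enabled code from the surviving mutants above.

    ```python
    # Toy check of the suggested assertion: after a stall release, the
    # manual peer gets no block requests in the immediate round, then
    # resumes.  All names here are invented for illustration.

    def rounds_after_release(skip_one_round):
        """Return how many blocks the peer requests in the two rounds
        following the release of its in-flight blocks."""
        queue = ["a", "b"]              # blocks freed by the release
        pending_skip = skip_one_round   # models m_stall_recovery
        per_round = []
        for _ in range(2):
            if pending_skip:
                pending_skip = False    # one-shot skip
                per_round.append(0)
                continue
            per_round.append(len(queue))
            queue.clear()
        return per_round

    # With the skip (m_stall_recovery set): round 1 idle, round 2 resumes.
    assert rounds_after_release(skip_one_round=True) == [0, 2]
    # A mutant with the skip disabled behaves observably differently,
    # so an assertion on round 1 would kill it.
    assert rounds_after_release(skip_one_round=False) == [2, 0]
    ```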



github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-27 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me