p2p: protect addnode peers during IBD #32051

pull jonatack wants to merge 4 commits into bitcoin:master from jonatack:2025-03-addnode-p2p changing 2 files +17 −8
  1. jonatack commented at 3:27 am on March 13, 2025: member

    When an addnode peer is disconnected by our IBD headers/blocks download timeout or stalling logic, ThreadOpenAddedConnections attempts to immediately reconnect it (unless “onetry” was passed to the addnode RPC) up to the limit of 8 addnode connections that is separate from regular peer connection limits.

    During IBD in a low-bandwidth environment, I see these disconnections+reconnections very frequently with high-quality addnode peers; see https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083973. The logging for these disconnections was: “Peer is stalling block download, disconnecting .” The ping times of the affected peers were often 20-100 seconds.

    After IBD, I continue to see the following addnode peer disconnections, albeit much less frequently: “Timeout downloading block , disconnecting .”

    This is probably network, resource and time intensive to an extent, particularly in the case of I2P peers that involves destroying and rebuilding 2 tunnels per peer connection, and worth avoiding where it is simple to do so.

    Automatic (non-manually added) peers are also disconnected, but they are a different category and case (non-trusted non-protected peers, no immediate reconnection) that would require monitoring over time to adjust the timeouts accordingly; mzumsande was looking into this (see https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083993).

    The goal of this pull is therefore to avoid unnecessary disconnections/immediate reconnections of addnode peers.

  2. p2p: allow BLOCK_STALLING_TIMEOUT_MAX for addnode peers 921b89e1d4
  3. DrahtBot commented at 3:27 am on March 13, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32051.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK pinheadmz, i-am-yuvi
    Concept ACK 1440000bytes, vasild, enirox001

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. DrahtBot added the label P2P on Mar 13, 2025
  5. fanquake commented at 3:33 am on March 13, 2025: member
    0[23:32:25.656] /ci_container_base/src/rpc/net.cpp:310:125: error: invalid operands to binary expression ('const char[198]' and 'const char[196]')
    1[23:32:25.656]                 "Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n" +
    2[23:32:25.656]                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
    3[23:32:25.656] 1 error generated.
    
  6. jonatack force-pushed on Mar 13, 2025
  7. DrahtBot commented at 3:35 am on March 13, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38682150505

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  8. DrahtBot added the label CI failed on Mar 13, 2025
  9. p2p: don't disconnect addnode peers for block download timeout 3463a7f481
  10. p2p: don't disconnect addnode peers for slow initial-headers-sync 64b956f422
  11. rpc, doc: update addnode documentation 93b07997e9
  12. jonatack force-pushed on Mar 13, 2025
  13. DrahtBot removed the label CI failed on Mar 13, 2025
  14. jonatack marked this as ready for review on Mar 13, 2025
  15. jonatack commented at 3:32 pm on March 13, 2025: member
    Ready for review.
  16. in src/net_processing.cpp:5806 in 93b07997e9
    5802@@ -5801,16 +5803,21 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
    5803             return true;
    5804         }
    5805         // In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
    5806-        // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
    5807+        // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout
    


    yancyribbens commented at 4:58 pm on March 14, 2025:
    0        // (with N being the number of peers from which we're downloading validated blocks), disconnect due to timeout
    
  17. in src/net_processing.cpp:5818 in 3463a7f481 outdated
    5816-                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    5817-                pto->fDisconnect = true;
    5818+                if (pto->IsManualConn()) {
    5819+                    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    5820+                } else {
    5821+                    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    


    mzumsande commented at 5:38 pm on March 17, 2025:
    This is not the timeout specific for IBD stalling but the general one (which is generally longer). So we’d still disconnect for stalling (see log message “Peer is stalling block download”) - is that on purpose / can you explain a bit more the reasoning behind this?
  18. mzumsande commented at 5:45 pm on March 17, 2025: contributor

    I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

    While we have more trust in manual peers than automatic ones, I don’t think that means the trust should be without bounds. Even trusted nodes can become unresponsive temporarily (e.g. very slow internet). I’m not sure if we should give one addnode peer the power to let our entire IBD come to a halt completely.

    An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn’t be disconnected but also wouldn’t be able to stall us indefinitely.

  19. hodlinator commented at 10:56 pm on March 28, 2025: contributor

    I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

    Maybe one could add yet another option to make manually specified nodes be treated with extra leniency? First I was thinking that maybe we could treat -connect-nodes vs -addnodes differently automatically in this respect. But it would be more natural for -connect to be protected, and your (jonatack’s) use-case is -addnode from what I gather.

  20. ajtowns commented at 0:13 am on April 24, 2025: contributor

    I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

    IMHO: It could be nice if we could “score” peers during IBD, and scale the number of blocks we request from them at once according to that score. In that case, automatic peers with low scores would get dropped, and manual peers with low scores would just not get any blocks requested.

  21. jonatack commented at 0:21 am on April 24, 2025: member
    Appreciate the thoughtful feedback. I’ll look at an update.
  22. in src/net_processing.cpp:5816 in 3463a7f481 outdated
    5814             int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
    5815             if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    5816-                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    5817-                pto->fDisconnect = true;
    5818+                if (pto->IsManualConn()) {
    5819+                    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    


    naiyoma commented at 9:15 am on April 30, 2025:
    nit: I think the message can be a bit clearer, e.g., “Timeout downloading block %s from manual peer (addnode/connect), not disconnecting
  23. 1440000bytes commented at 1:23 am on May 23, 2025: none
    Concept ACK
  24. achow101 added the label Review club on May 23, 2025
  25. vasild commented at 6:23 am on May 23, 2025: contributor

    Concept ACK to improve this, but:

    what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion? … one addnode peer the power to let our entire IBD come to a halt completely

    Would it be too difficult to request a given block from more than one peer concurrently? That is, if a block is requested from a peer and not received within some time, then request it from other(s) as well, but don’t cancel the original request. Then we take the block from whoever delivers it first. Subsequent receptions of the block from others would be an excess/redundant traffic, but I guess that is ok, if this technique is used carefully. Is this the same:

    An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers)

  26. in src/net_processing.cpp:5840 in 64b956f422 outdated
    5838                         LogInfo("Timeout downloading headers, %s\n", pto->DisconnectMsg(fLogIPs));
    5839                         pto->fDisconnect = true;
    5840                         return true;
    5841                     } else {
    5842-                        LogInfo("Timeout downloading headers from noban peer, not %s\n", pto->DisconnectMsg(fLogIPs));
    5843+                        LogInfo("Timeout downloading headers from %s peer, not %s\n", pto->IsManualConn() ? "addnode" : "noban", pto->DisconnectMsg(fLogIPs));
    


    pinheadmz commented at 2:41 pm on May 23, 2025:

    64b956f4220300254126b166deb97849b236c2ed

    Nit: could you use NetPermissions::ToStrings(NetPermissionFlags flags) just to avoid hard coding things?

  27. pinheadmz commented at 2:46 pm on May 23, 2025: member

    concept ACK and untested code review 93b07997e9

    I believe this also closes an 11-year-old issue: #5097

  28. DrahtBot requested review from vasild on May 23, 2025
  29. pinheadmz commented at 2:48 pm on May 23, 2025: member

    I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

    If we’re considering a case where the user specifically chooses specific peers for IBD for privacy or whatever reasons, don’t we just let them have the slow sync they trade-off for?

  30. i-am-yuvi commented at 1:54 pm on May 26, 2025: contributor
    Concept ACK
  31. i-am-yuvi commented at 8:17 am on May 28, 2025: contributor

    Code Review ACK 93b07997e9a38523f5ab850aa32ca57983fd2552

    I’ve made a review notes for this pr -> here

  32. stringintech commented at 9:13 am on May 28, 2025: none

    An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn’t be disconnected but also wouldn’t be able to stall us indefinitely.

    This makes sense, as the PR tries to achieve a similar goal with initial headers sync timeouts in commit 64b956f - when a timeout occurs, the sync state is reset to allow retrying with other peers while preserving the connection, and the PR extends this behavior to also apply to addnode peers.

  33. enirox001 commented at 7:23 pm on May 28, 2025: contributor

    Concept ACK and Code Review ACK.

    I think the PR’s increased timeout tolerance for addnode peers is valuable. Given my limited domain knowledge, I’d suggest test coverage on potential silent stalls to ensure blocks are efficiently re-requested


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-05-30 00:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me