p2p: protect addnode peers during IBD

jonatack commented at 3:27 am on March 13, 2025: member

When an addnode peer is disconnected by our IBD headers/blocks download timeout or stalling logic, ThreadOpenAddedConnections attempts to immediately reconnect it (unless “onetry” was passed to the addnode RPC) up to the limit of 8 addnode connections that is separate from regular peer connection limits.

During IBD in a low-bandwidth environment, I see these disconnections+reconnections very frequently with high-quality addnode peers; see https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083973. The logging for these disconnections was: “Peer is stalling block download, disconnecting .” The ping times of the affected peers were often 20-100 seconds.

After IBD, I continue to see the following addnode peer disconnections, albeit much less frequently: “Timeout downloading block , disconnecting .”

This is probably network, resource and time intensive to an extent, particularly in the case of I2P peers that involves destroying and rebuilding 2 tunnels per peer connection, and worth avoiding where it is simple to do so.

Automatic (non-manually added) peers are also disconnected, but they are a different category and case (non-trusted non-protected peers, no immediate reconnection) that would require monitoring over time to adjust the timeouts accordingly; mzumsande was looking into this (see https://bitcoin-irc.chaincode.com/bitcoin-core-dev/2025-01-22#1083993).

The goal of this pull is therefore to avoid unnecessary disconnections/immediate reconnections of addnode peers.

p2p: allow BLOCK_STALLING_TIMEOUT_MAX for addnode peers 921b89e1d4

DrahtBot commented at 3:27 am on March 13, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32051.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	pinheadmz, yuvicc
Concept ACK	1440000bytes, vasild, enirox001, hodlinator

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

DrahtBot added the label P2P on Mar 13, 2025

fanquake commented at 3:33 am on March 13, 2025: member

0[23:32:25.656] /ci_container_base/src/rpc/net.cpp:310:125: error: invalid operands to binary expression ('const char[198]' and 'const char[196]')
1[23:32:25.656]                 "Nodes added using addnode (or -connect) are protected from disconnection due to DoS or IBD header/block\n" +
2[23:32:25.656]                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
3[23:32:25.656] 1 error generated.

jonatack force-pushed on Mar 13, 2025

DrahtBot commented at 3:35 am on March 13, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38682150505

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot added the label CI failed on Mar 13, 2025

p2p: don't disconnect addnode peers for block download timeout 3463a7f481

p2p: don't disconnect addnode peers for slow initial-headers-sync 64b956f422

rpc, doc: update addnode documentation 93b07997e9

jonatack force-pushed on Mar 13, 2025

DrahtBot removed the label CI failed on Mar 13, 2025

jonatack marked this as ready for review on Mar 13, 2025

jonatack commented at 3:32 pm on March 13, 2025: member

Ready for review.

in src/net_processing.cpp:5806 in 93b07997e9

5802@@ -5801,16 +5803,21 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
5803             return true;
5804         }
5805         // In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
5806-        // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
5807+        // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout

yancyribbens commented at 4:58 pm on March 14, 2025:

0        // (with N being the number of peers from which we're downloading validated blocks), disconnect due to timeout

in src/net_processing.cpp:5818 in 3463a7f481 outdated

5816-                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
5817-                pto->fDisconnect = true;
5818+                if (pto->IsManualConn()) {
5819+                    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
5820+                } else {
5821+                    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));

mzumsande commented at 5:38 pm on March 17, 2025:

This is not the timeout specific for IBD stalling but the general one (which is generally longer). So we’d still disconnect for stalling (see log message “Peer is stalling block download”) - is that on purpose / can you explain a bit more the reasoning behind this?

mzumsande commented at 5:45 pm on March 17, 2025: contributor

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

While we have more trust in manual peers than automatic ones, I don’t think that means the trust should be without bounds. Even trusted nodes can become unresponsive temporarily (e.g. very slow internet). I’m not sure if we should give one addnode peer the power to let our entire IBD come to a halt completely.

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn’t be disconnected but also wouldn’t be able to stall us indefinitely.

hodlinator commented at 10:56 pm on March 28, 2025: contributor

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

Maybe one could add yet another option to make manually specified nodes be treated with extra leniency? First I was thinking that maybe we could treat -connect-nodes vs -addnodes differently automatically in this respect. But it would be more natural for -connect to be protected, and your (jonatack’s) use-case is -addnode from what I gather.

ajtowns commented at 0:13 am on April 24, 2025: contributor

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

IMHO: It could be nice if we could “score” peers during IBD, and scale the number of blocks we request from them at once according to that score. In that case, automatic peers with low scores would get dropped, and manual peers with low scores would just not get any blocks requested.

jonatack commented at 0:21 am on April 24, 2025: member

Appreciate the thoughtful feedback. I’ll look at an update.

in src/net_processing.cpp:5816 in 3463a7f481 outdated

5814             int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
5815             if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
5816-                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
5817-                pto->fDisconnect = true;
5818+                if (pto->IsManualConn()) {
5819+                    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));

naiyoma commented at 9:15 am on April 30, 2025:

nit: I think the message can be a bit clearer, e.g., “Timeout downloading block %s from manual peer (addnode/connect), not disconnecting

1440000bytes commented at 1:23 am on May 23, 2025: none

Concept ACK

achow101 added the label Review club on May 23, 2025

vasild commented at 6:23 am on May 23, 2025: contributor

Concept ACK to improve this, but:

what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion? … one addnode peer the power to let our entire IBD come to a halt completely

Would it be too difficult to request a given block from more than one peer concurrently? That is, if a block is requested from a peer and not received within some time, then request it from other(s) as well, but don’t cancel the original request. Then we take the block from whoever delivers it first. Subsequent receptions of the block from others would be an excess/redundant traffic, but I guess that is ok, if this technique is used carefully. Is this the same:

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers)

in src/net_processing.cpp:5840 in 64b956f422 outdated

5838                         LogInfo("Timeout downloading headers, %s\n", pto->DisconnectMsg(fLogIPs));
5839                         pto->fDisconnect = true;
5840                         return true;
5841                     } else {
5842-                        LogInfo("Timeout downloading headers from noban peer, not %s\n", pto->DisconnectMsg(fLogIPs));
5843+                        LogInfo("Timeout downloading headers from %s peer, not %s\n", pto->IsManualConn() ? "addnode" : "noban", pto->DisconnectMsg(fLogIPs));

pinheadmz commented at 2:41 pm on May 23, 2025:

64b956f4220300254126b166deb97849b236c2ed

Nit: could you use NetPermissions::ToStrings(NetPermissionFlags flags) just to avoid hard coding things?

pinheadmz commented at 2:46 pm on May 23, 2025: member

concept ACK and untested code review 93b07997e9

I believe this also closes an 11-year-old issue: #5097

DrahtBot requested review from vasild on May 23, 2025

pinheadmz commented at 2:48 pm on May 23, 2025: member

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

If we’re considering a case where the user specifically chooses specific peers for IBD for privacy or whatever reasons, don’t we just let them have the slow sync they trade-off for?

yuvicc commented at 1:54 pm on May 26, 2025: contributor

Concept ACK

yuvicc commented at 8:17 am on May 28, 2025: contributor

Code Review ACK 93b07997e9a38523f5ab850aa32ca57983fd2552

I’ve made a review notes for this pr -> here

stringintech commented at 9:13 am on May 28, 2025: contributor

An alternative would be to not give these peers unlimited time but clear their block requests (so that these blocks can be requested from other peers) so that they wouldn’t be disconnected but also wouldn’t be able to stall us indefinitely.

This makes sense, as the PR tries to achieve a similar goal with initial headers sync timeouts in commit 64b956f - when a timeout occurs, the sync state is reset to allow retrying with other peers while preserving the connection, and the PR extends this behavior to also apply to addnode peers.

enirox001 commented at 7:23 pm on May 28, 2025: contributor

Concept ACK and Code Review ACK.

I think the PR’s increased timeout tolerance for addnode peers is valuable. Given my limited domain knowledge, I’d suggest test coverage on potential silent stalls to ensure blocks are efficiently re-requested

achow101 commented at 10:24 pm on June 2, 2025: member

I think the main conceptual question is what should happen in the case of a stalling situation where the addnode peer is actually slow and causing congestion?

I’d like to see this addressed before moving forward here.

Maybe one could add yet another option to make manually specified nodes be treated with extra leniency?

It could be a whitebind/whitelist option.

in src/net_processing.cpp:5816 in 93b07997e9

5814             int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
5815             if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
5816-                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
5817-                pto->fDisconnect = true;
5818+                if (pto->IsManualConn()) {
5819+                    LogInfo("Timeout downloading block %s from addnode peer, not %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));

hodlinator commented at 1:49 pm on June 3, 2025:

Is there any risk that this log message becomes spammy if we keep the peer connected? - Should we do something like reset state.m_downloading_since to current_time, clear state.vBlocksInFlight or add another state-field to throttle the log?

pinheadmz commented at 1:55 pm on June 3, 2025: member

@achow101 re: slow preferred peer “causing congestion” – wouldn’t this only affect the single user who explicitly set the preferred peer?

hodlinator commented at 2:15 pm on June 3, 2025: contributor

Concept ACK 93b07997e9a38523f5ab850aa32ca57983fd2552

achow101 commented at 0:28 am on July 4, 2025: member

@achow101 re: slow preferred peer “causing congestion” – wouldn’t this only affect the single user who explicitly set the preferred peer?

Yes, however I have seen people post on various forums suggestions of specific nodes that they can addnode, and users may be addnode’ing such nodes without realizing that there are potential adverse consequences of doing so. Generally, I would prefer to have things that change how we treat nodes be limited to the whitelist options rather than addnode also implying that those nodes will get specific special treatment.

p2p: protect addnode peers during IBD #32051

Code Coverage & Benchmarks

Reviews