I find it a bit confusing, I’d prefer separating stall detection logic from the action we take. Peers are stalling regardless of whether they’re added manually or not, so the stalling logic should be unaffected.
I think we should always log when a stall is detected, even for manual peers, so operators can see what’s happening:
- for automatic peers, disconnect as usual;
- for manual peers, only disconnect if they exceed
BLOCK_STALLING_TIMEOUT_MAX
- otherwise keep them connected.
That way the logs clearly show tolerated stalls, and the policy difference is obvious without blending it into the timeout calculation, something like:
0diff --git a/src/net_processing.cpp b/src/net_processing.cpp
1--- a/src/net_processing.cpp (revision 8405fdb06e8ff3220590bfac84e98547067ec6b7)
2+++ b/src/net_processing.cpp (date 1755201170971)
3@@ -5814,14 +5814,16 @@
4 // the download window should be much larger than the to-be-downloaded set of blocks, so disconnection
5 // should only happen during initial block download.
6 LogInfo("Peer is stalling block download, %s\n", pto->DisconnectMsg(fLogIPs));
7- pto->fDisconnect = true;
8- // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
9- // bandwidth is insufficient.
10- const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
11- if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
12- LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
13- }
14- return true;
15+ if (!pto->IsManualConn() || state.m_stalling_since < current_time - BLOCK_STALLING_TIMEOUT_MAX) { // Allow more time for addnode peers
16+ pto->fDisconnect = true;
17+ // Increase timeout for the next peer so that we don't disconnect multiple peers if our own
18+ // bandwidth is insufficient.
19+ const auto new_timeout = std::min(2 * stalling_timeout, BLOCK_STALLING_TIMEOUT_MAX);
20+ if (stalling_timeout != new_timeout && m_block_stalling_timeout.compare_exchange_strong(stalling_timeout, new_timeout)) {
21+ LogDebug(BCLog::NET, "Increased stalling timeout temporarily to %d seconds\n", count_seconds(new_timeout));
22+ }
23+ return true;
24+ }
25 }
26 // In case there is a block that has been in flight from this peer for block_interval * (1 + 0.5 * N)
27 // (with N the number of peers from which we're downloading validated blocks), disconnect due to timeout.
I would also prefer seeing a motivation being explained in the commit message for this change, maybe something like:
0p2p: allow BLOCK_STALLING_TIMEOUT_MAX for addnode peers
1
2When a manually added peer stalls during IBD, the default timeout is often too short for high-latency/low-bandwidth environments.
3This results in frequent disconnect/reconnect cycles, especially over Tor/I2P, which are wasteful and slow down IBD.
4
5This change uses `BLOCK_STALLING_TIMEOUT_MAX` for manual peers when detecting block download stalls, giving them a longer grace period before disconnection.
6Non-manual peers keep the existing timeout and behavior.