p2p: protect manual evictions #34743

pull willcl-ark wants to merge 4 commits into bitcoin:master from willcl-ark:protect-manual-evictions changing 5 files +131 −11
  1. willcl-ark commented at 1:28 PM on March 5, 2026: member

    Ref: #5097

    Manual peers added via -addnode, -connect, or the addnode RPC are currently disconnected during IBD if they trigger block-stalling logic. That can conflict with explicit operator intent to keep those peers connected, especially in -connect-based setups.

    This PR changes only the block-stalling path for manual peers. Instead of disconnecting the peer, it releases that peer’s in-flight block requests so other peers can pick them up and IBD can continue.

    The change also adds a one-round recovery flag to avoid immediately reassigning the freed blocks back to the same manual peer, which could otherwise happen due to per-peer SendMessages ordering.
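    As an illustration only (not the actual net_processing code), the intended flow can be sketched as a toy Python model. All names here (`PeerState`, `handle_stall`, `maybe_request_blocks`, `stall_recovery`) are invented for the sketch; only the shape of the logic mirrors the PR: on a stall, a manual peer's in-flight blocks are released back to a shared queue and one request round is skipped.

    ```python
    # Toy model (not Bitcoin Core code) of the manual-peer stalling path:
    # on stall, release the peer's in-flight blocks instead of disconnecting,
    # and skip one round of requests so another peer can claim them first.

    class PeerState:
        def __init__(self, is_manual):
            self.is_manual = is_manual
            self.in_flight = []          # block hashes requested from this peer
            self.stall_recovery = False  # skip the next request round
            self.disconnected = False

    def handle_stall(state, global_queue):
        """Invoked when this peer is identified as the stalling peer."""
        if not state.is_manual:
            state.disconnected = True    # current behaviour for non-manual peers
            return
        # Manual peer: free its blocks for other peers and back off one round.
        global_queue.extend(state.in_flight)
        state.in_flight.clear()
        state.stall_recovery = True

    def maybe_request_blocks(state, global_queue, window=2):
        """One SendMessages-style round of block requests for this peer."""
        if state.stall_recovery:
            state.stall_recovery = False  # one-shot: resume next round
            return []
        got = global_queue[:window]
        del global_queue[:window]
        state.in_flight.extend(got)
        return got

    queue = []
    manual = PeerState(is_manual=True)
    manual.in_flight = ["block_a", "block_b"]
    handle_stall(manual, queue)
    print(manual.disconnected, queue)            # False ['block_a', 'block_b']
    print(maybe_request_blocks(manual, queue))   # [] (skipped round)
    print(maybe_request_blocks(manual, queue))   # ['block_a', 'block_b']
    ```

    In the real code another peer would normally claim the freed blocks during the skipped round; the single-peer toy above only shows that the flag suppresses exactly one round.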

  2. willcl-ark renamed this:
    Protect manual evictions
    p2p: protect manual evictions
    on Mar 5, 2026
  3. DrahtBot commented at 1:29 PM on March 5, 2026: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK w0xlt, sedited, kevkevinpal, brunoerg

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.


    Conflicts

    No conflicts as of last run.


    LLM Linter (✨ experimental)

    Possible places where comparison-specific test macros should replace generic comparisons:

    • [test/functional/p2p_ibd_stalling.py] assert manual_peer.is_connected -> Replace with a comparison-specific helper, e.g. assert_equal(manual_peer.is_connected, True) (or assert_equal(bool(manual_peer.is_connected), True)).

    <sup>2026-03-24 14:32:15</sup>

  4. DrahtBot added the label P2P on Mar 5, 2026
  5. w0xlt commented at 3:30 PM on March 5, 2026: contributor

    Concept ACK

  6. in test/functional/test_framework/test_node.py:921 in 547f119668
     916 | +
     917 | +        def addnode_callback(address, port):
     918 | +            if v2:
     919 | +                self.addnode('%s:%d' % (address, port), "onetry", v2transport=True)
     920 | +            else:
     921 | +                self.addnode('%s:%d' % (address, port), "onetry")
    


    kevkevinpal commented at 7:59 PM on March 5, 2026:

    Might be able to do this instead. It seems redundant to have the two calls.

    self.addnode('%s:%d' % (address, port), "onetry", v2transport=v2)
    

    willcl-ark commented at 10:08 AM on March 9, 2026:

    Taken in be002a7

  7. jonatack commented at 2:51 AM on March 6, 2026: member

    Thanks for picking this up, Will.

  8. sedited commented at 7:35 AM on March 6, 2026: contributor

    Concept ACK

  9. kevkevinpal commented at 3:45 PM on March 6, 2026: contributor

    Concept ACK

  10. willcl-ark force-pushed on Mar 6, 2026
  11. brunoerg commented at 2:20 PM on March 9, 2026: contributor

    Concept ACK

  12. DrahtBot added the label Needs rebase on Mar 11, 2026
  13. willcl-ark force-pushed on Mar 11, 2026
  14. DrahtBot removed the label Needs rebase on Mar 11, 2026
  15. sedited requested review from ajtowns on Mar 20, 2026
  16. in src/net_processing.cpp:6111 in 749f425094 outdated
    6113 | +                // Return early so the block download logic below doesn't
    6114 | +                // immediately re-assign them to this peer.
    6115 | +                while (!state.vBlocksInFlight.empty()) {
    6116 | +                    RemoveBlockRequest(state.vBlocksInFlight.front().pindex->GetBlockHash(), node.GetId());
    6117 | +                }
    6118 | +                state.m_stall_recovery = true;
    


    ajtowns commented at 1:36 AM on March 23, 2026:

    Would it be better for this to be a timer -- ie, if it stalls, stop requesting any blocks from this peer for X seconds? Having the peer immediately start providing blocks again seems like it sets things up for regular stall/clear/stall/clear behaviour. I could see an argument for setting X to be fairly large, eg 10 minutes, to minimise that behaviour during IBD, while still recovering gracefully after IBD is complete.
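    A timer-based variant of this suggestion could look roughly like the following toy sketch (not Core code; `PeerBackoff`, `may_request`, and the 600-second constant are all illustrative): after a stall, the peer is simply ineligible for block requests until the backoff has elapsed.

    ```python
    # Toy sketch of a timer-based backoff: after a stall, make no block
    # requests to the peer until STALL_BACKOFF has elapsed, instead of
    # skipping a single SendMessages round.

    STALL_BACKOFF = 600  # seconds; e.g. the ~10 minutes suggested above

    class PeerBackoff:
        def __init__(self):
            self.stalled_at = None  # timestamp of the last stall, or None

        def on_stall(self, now):
            self.stalled_at = now

        def may_request(self, now):
            if self.stalled_at is None:
                return True
            return now - self.stalled_at >= STALL_BACKOFF

    p = PeerBackoff()
    p.on_stall(now=1000)
    print(p.may_request(now=1001))   # False: still inside the backoff window
    print(p.may_request(now=1700))   # True: 700s elapsed >= 600s
    ```

    Compared to the one-round flag, this trades a possible stall/clear/stall cycle for a bounded period in which the manual peer contributes nothing to IBD.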

  17. in src/net_processing.cpp:6135 in accb5e4edf
    6130 | @@ -6115,27 +6131,39 @@ bool PeerManagerImpl::SendMessages(CNode& node)
    6131 |              QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
    6132 |              int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
    6133 |              if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    6134 | -                LogInfo("Timeout downloading block %s, %s", queuedBlock.pindex->GetBlockHash().ToString(), node.DisconnectMsg());
    6135 | -                node.fDisconnect = true;
    6136 | -                return true;
    6137 | +                if (node.IsManualConn()) {
    6138 | +                    LogDebug(BCLog::NET, "Not disconnecting manual peer for block download timeout, %s", node.DisconnectMsg());
    


    ajtowns commented at 1:45 AM on March 23, 2026:

    I'm not sure this change makes sense -- if a peer can't deliver a block to us in ten minutes, it's not very useful even if it was a manual connection?

  18. in src/net_processing.cpp:6159 in accb5e4edf
    6160 | +                    // or manual peers; just reset their sync state.
    6161 |                      // Note: If all our peers are inbound, then we won't
    6162 |                      // disconnect our sync peer for stalling; we have bigger
    6163 |                      // problems if we can't get any outbound peers.
    6164 | -                    if (!node.HasPermission(NetPermissionFlags::NoBan)) {
    6165 | +                    if (!node.HasPermission(NetPermissionFlags::NoBan) && !node.IsManualConn()) {
    


    ajtowns commented at 1:50 AM on March 23, 2026:

    Likewise here -- if a peer can't reply to our getheaders request in 15 minutes it doesn't seem very useful; and the "NoBan" permission is already there for keeping connections to not very useful nodes.

  19. ajtowns commented at 1:54 AM on March 23, 2026: contributor

    The PR description is quite complicated, could it be simplified?

    I believe the approach we currently have is "if stalling in block downloads is detected, unstall by disconnecting the peer with the oldest queued block", and this introduces a new unstalling technique "reset the peer with the oldest queued block's entire download queue, but let it restart downloading blocks almost immediately".

  20. ajtowns commented at 2:05 AM on March 23, 2026: contributor

    It might be good to run a test simulating this behaviour -- I think having three nodes: A doing IBD, manually connected to B and C; B normal; C with its bandwidth shaped by the OS to, say, 10kB/s (trickle -u 10 bitcoind ...)? With master, that should see A disconnect C for stalling; with this PR, I think it should see C repeatedly stall, and as a result a somewhat slower IBD than if C were disconnected.

  21. net: don't disconnect manual peers for block stalling
    Manual connections (-addnode, -connect) represent explicit user intent to
    maintain the connection, yet they currently get disconnected during IBD
    when the block download window stalls.
    
    Instead of disconnecting a stalled manual peer, release its in-flight
    blocks so other peers can request them. Set m_stall_recovery to skip one
    round of block download for this peer so its next SendMessages call
    doesn't immediately reclaim the same blocks before another peer can pick
    them up.
    
    This behavior first showed up as intermittent failures in the functional
    tests with few peers, but it can also delay recovery outside tests when a
    manual peer is revisited before other peers have a chance to request the
    freed blocks.
    409313bab6
  22. test: add add_manual_p2p_connection helper to TestNode
    Add a helper for creating manual (addnode onetry) p2p connections,
    mirroring the existing add_outbound_p2p_connection method. This will
    be used by tests that verify manual peer behavior during IBD.
    6c245778c6
  23. test: verify manual peers survive block stalling timeout
    Test that a manual (addnode) peer which stalls on a block is not
    disconnected when the stalling timeout fires.
    7c519790f5
  24. doc: update addnode rpc help and add release note 935afd4042
  25. willcl-ark force-pushed on Mar 24, 2026
  26. frankomosh commented at 7:49 AM on April 1, 2026: contributor

    Looked into the test coverage for the new m_stall_recovery mechanism and have some findings. I ran mutation testing to check how well the functional test covers the m_stall_recovery mechanism. I targeted lines 6100–6127 (the IsManualConn() stalling branch) and lines 6177–6188 (the m_stall_recovery skip logic).

    Setup:

    mutation-core mutate -f="src/net_processing.cpp" --range 6100 6127   # 8 mutants
    mutation-core mutate -f="src/net_processing.cpp" --range 6177 6188   # 9 mutants
    mutation-core analyze -f="muts-net_processing-cpp" \
      -c="cmake --build build -j$(nproc) && build/test/functional/p2p_ibd_stalling.py"
    

    Result: 52.94% mutation score (9 killed / 17 total)

    Four surviving mutants are directly in the PR's new code and seem to suggest the test doesn't exercise m_stall_recovery at all:

    # Survivor 1: disabling the flag entirely, test doesn't notice
    -                state.m_stall_recovery = true;
    +                state.m_stall_recovery = false;
     
    # Survivor 2: not returning early after releasing blocks, test doesn't notice
                    state.m_stall_recovery = true;
    -                return true;
    +                return false;
     
    # Survivor 3: skip logic completely removed, test doesn't notice
    -        if (state.m_stall_recovery) {
    +        if (1==0) {
     
    # Survivor 4: flag never resets, permanently skipping downloads, test doesn't notice
    -            state.m_stall_recovery = false;
    +            state.m_stall_recovery = true;
    

    The current test (test_manual_peer_stalling) verifies two things well:

    1. The manual peer is not disconnected (killed mutant 3.r1: fDisconnect = true mutated to false)
    2. IBD completes through other peers

    But it does not verify that the m_stall_recovery one-round skip actually fires, or has any effect. All four mutations to the skip mechanism survive because the outbound peers download blocks fast enough that the skip is irrelevant to the test outcome.

    This is not blocking in any way, the core protection (no disconnect) is well-tested. But if m_stall_recovery is meant to be a meaningful part of the design (preventing immediate re-assignment of freed blocks), it may be worth adding a test assertion that verifies the manual peer doesn't get new block requests in the immediate round after release?

    Tested on Ubuntu 24.04
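    The suggested assertion can be illustrated with a self-contained toy check (invented names, not the functional test framework): after a release, the peer must request nothing in the very next round and resume in the round after. An assertion of this shape distinguishes the skip-enabled code from the surviving mutants above.

    ```python
    # Toy check of the suggested assertion: after a stall release, the
    # manual peer gets no block requests in the immediate round, then
    # resumes.  All names here are invented for illustration.

    def rounds_after_release(skip_one_round):
        """Return how many blocks the peer requests in the two rounds
        following the release of its in-flight blocks."""
        queue = ["a", "b"]              # blocks freed by the release
        pending_skip = skip_one_round   # models m_stall_recovery
        per_round = []
        for _ in range(2):
            if pending_skip:
                pending_skip = False    # one-shot skip
                per_round.append(0)
                continue
            per_round.append(len(queue))
            queue.clear()
        return per_round

    # With the skip (m_stall_recovery set): round 1 idle, round 2 resumes.
    assert rounds_after_release(skip_one_round=True) == [0, 2]
    # A mutant with the skip disabled behaves observably differently,
    # so an assertion on round 1 would kill it.
    assert rounds_after_release(skip_one_round=False) == [2, 0]
    ```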



github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-27 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me