p2p: Relax BlockRequestAllowed to respond to advertised blocks #32869

pull: instagibbs wants to merge 2 commits into bitcoin:master from instagibbs:2025-07-invalid-cb-stall, changing 3 files (+76 −1)
  1. instagibbs commented at 1:42 pm on July 3, 2025: member

    The first commit adds coverage for a case similar to #13370 . h/t dergoegge for detecting this issue independently. Documenting the current behavior with a test is then followed by a proposed change: any time we considered a block VALID_TRANSACTIONS and may have advertised it via compact blocks, we should respond to a getdata request for it if the peer fell back to requesting the full block for whatever reason.

    If the second commit is deemed controversial, it can be shelved and the current behavior documented alone.

  2. DrahtBot commented at 1:42 pm on July 3, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32869.

    Reviews

    See the guideline for information on the review process.

    Type          Reviewers
    Concept ACK   dergoegge

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. instagibbs force-pushed on Jul 3, 2025
  4. DrahtBot added the label CI failed on Jul 3, 2025
  5. DrahtBot commented at 1:46 pm on July 3, 2025: contributor

    🚧 At least one of the CI tasks failed.
    Task lint: https://github.com/bitcoin/bitcoin/runs/45297064338
    LLM reason (✨ experimental): The CI failure is caused by an unfixable lint error detected by rust and Python linters, leading to the overall build failure.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  6. dergoegge commented at 1:56 pm on July 3, 2025: member

    Concept ACK

    h/t dergoegge for detecting this issue independently.

    For context, I found this case (as demonstrated by the first commit in the PR) by adding a netsplit oracle (see https://gist.github.com/dergoegge/3036796551095a9ce7535fb5d74e6656#detecting-network-split-bugs) to fuzzamoto. While extremely unrealistic/unlikely to trigger in the wild, I’d still consider this timeout case a bug, especially because it could cause disconnects across the whole network (as opposed to isolated instances with short id collisions).

    Resolving this will help avoid special casing the oracle for this scenario.

  7. instagibbs renamed this:
    WIP p2p: Relax BlockRequestAllowed to respond to advertised blocks
    p2p: Relax BlockRequestAllowed to respond to advertised blocks
    on Jul 3, 2025
  8. instagibbs force-pushed on Jul 3, 2025
  9. fanquake requested review from mzumsande on Jul 3, 2025
  10. DrahtBot removed the label CI failed on Jul 3, 2025
  11. DrahtBot added the label P2P on Jul 3, 2025
  12. in test/functional/p2p_invalid_cb.py:62 in cd13a2086a outdated
    57+        # No disconnect yet, node1 is waiting on response to getdata
    58+        assert_equal(len(self.nodes[1].getpeerinfo()), 1)
    59+
    60+        # Make sure node1 has processed the compact block
    61+        self.wait_until(lambda: len(self.nodes[1].getchaintips()) == 2)
    62+        assert {"height": 13, "hash": block.hash, "branchlen": 1, "status": "headers-only"} in node.getchaintips()
    


    davidgumberg commented at 0:11 am on July 4, 2025:
            assert {"height": 13, "hash": block.hash, "branchlen": 1, "status": "headers-only"} in self.nodes[1].getchaintips()
    

    instagibbs commented at 12:16 pm on July 7, 2025:
    fixed
  13. in test/functional/p2p_invalid_cb.py:45 in cd13a2086a outdated
    40+        self.generate(self.nodes[0], 12)
    41+
    42+        # Make 200+ minutes pass with wiggle room added
    43+        # to disallow direct fetch of block
    44+        for node in self.nodes:
    45+            node.setmocktime(now + 200 * 60 * 2)
    


    davidgumberg commented at 0:12 am on July 4, 2025:

    nit:

                node.bumpmocktime(200 * 60 * 2)
    

    instagibbs commented at 11:35 am on July 7, 2025:
    will do on touchup

    instagibbs commented at 12:16 pm on July 7, 2025:
    done
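
    For context on the 200-minute figure in the snippet above: direct fetch of an announced block is only attempted while the tip is recent (roughly 20 block intervals, per CanDirectFetch() in net_processing.cpp). A minimal standalone sketch of that cutoff, with the constants written out here as assumptions rather than quoted from the source:

    # Illustrative only: mirrors the shape of the tip-recency check that gates
    # direct block fetch (constants assumed, not quoted from net_processing.cpp).
    POW_TARGET_SPACING = 10 * 60                    # seconds
    DIRECT_FETCH_WINDOW = 20 * POW_TARGET_SPACING   # 200 minutes
    def can_direct_fetch(tip_time, now):
        return tip_time > now - DIRECT_FETCH_WINDOW
    now = 1_700_000_000
    assert can_direct_fetch(now, now)
    # After bumping mocktime by 200 * 60 * 2 seconds the tip is stale, so the
    # node falls back to the parallel block download logic instead.
    assert not can_direct_fetch(now, now + 200 * 60 * 2)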
  14. davidgumberg commented at 0:24 am on July 4, 2025: contributor

    Maybe a dumb question, but why not relax the other side here? I don’t know what the new criteria would be, but in an ideal world, nodes should not be sending block announcements for blocks they have identified as invalid just to avoid upsetting their peers, so maybe the constraints set by PeerManagerImpl::SendMessages() are too rigid?

    https://github.com/bitcoin/bitcoin/blob/e3f416dbf7633b2fb19c933e5508bd231cc7e9cf/src/net_processing.cpp#L5865-L5873

  15. instagibbs commented at 11:35 am on July 7, 2025: member
    @davidgumberg might be talking past each other, but what’s happening here is that locally, as soon as the block is reconstructed (including merkle checks) and prior to script checks, the block is advertised forward via compact blocks to its peers. This is in line with the protocol description, and helps speed up block propagation.
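
    For readers less familiar with BIP 152, a rough sketch of the high-bandwidth relay sequence being described (message names only, purely illustrative; not test code):

    hb_relay_flow = [
        ("peer -> node", "cmpctblock"),    # unsolicited header + short txids
        ("node -> peer", "getblocktxn"),   # only if some transactions are missing
        ("peer -> node", "blocktxn"),      # the missing transactions
        # Once the block reconstructs and the merkle root checks out, the node
        # may relay its own cmpctblock to high-bandwidth peers before running
        # script validation.
        ("node -> hb peers", "cmpctblock"),
    ]
    for direction, msg in hb_relay_flow:
        print(f"{direction}: {msg}")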
  16. test: Add coverage for issue #13370
    If a PoW-and-merkle-valid compact block turns out to be
    invalid during validation, a peer can end up sending a
    getdata for the full block, and we will ignore that
    request.

    This scenario arises when no block has been mined for 200+
    minutes, disallowing direct fetch and causing us to fall
    back to the parallel block fetching logic, rather than the
    short txid collision scenario.
    831e1b44de
  17. p2p: BlockRequestAllowed relaxed to allow invalid getdata
    In the case where a block was advertised due to it being
    deemed VALID_TRANSACTIONS at any point, a peer may request
    the full block, and we currently ignore those requests,
    causing disconnection.

    Since we already answer requests for valid blocks, there is
    no reason not to also answer these requests, which avoids
    partition risk in the network. The PoW is valid and
    "recent", so there is no DoS risk either.
    260558a295
  18. instagibbs force-pushed on Jul 7, 2025
  19. in src/net_processing.cpp:1871 in 260558a295
    1865@@ -1866,7 +1866,9 @@ bool PeerManagerImpl::BlockRequestAllowed(const CBlockIndex* pindex)
    1866 {
    1867     AssertLockHeld(cs_main);
    1868     if (m_chainman.ActiveChain().Contains(pindex)) return true;
    1869-    return pindex->IsValid(BLOCK_VALID_SCRIPTS) && (m_chainman.m_best_header != nullptr) &&
    1870+    // HaveNumChainTxs implies we at some point reached VALID_TRANSACTIONS, meaning
    1871+    // we may have advertised it to peers via compact blocks
    1872+    return pindex->HaveNumChainTxs() && (m_chainman.m_best_header != nullptr) &&
    


    mzumsande commented at 3:20 pm on July 7, 2025:
    Should adjust doc of BlockRequestAllowed above (“we fully-validated them at some point”)
  20. mzumsande commented at 3:30 pm on July 7, 2025: contributor

    I’ll have to think a bit more about the consequences of this - but if we do this change, can we remove the need_activate_chain / ActivateBestChain logic from ProcessGetBlockData()? It doesn’t belong in net_processing from an architectural POV, and now it shouldn’t matter anymore for the request if the block is fully validated / invalid.

    Also, just noting that as a side effect we’d now serve blockfilters for invalid / unvalidated blocks (PrepareBlockFilterRequest).

  21. davidgumberg commented at 4:23 pm on July 7, 2025: contributor

    @davidgumberg might be talking past each other, but what’s happening here is that locally, as soon as the block is reconstructed (including merkle checks) and prior to script checks, the block is advertised forward via compact blocks to its peers. This is in line with the protocol description, and helps speed up block propagation.

    I’m not suggesting any changes to the pre-validation relay described in BIP-152. I’m asking about propagating blocks post-validation that a node knows to be invalid; shouldn’t this be avoided if possible?


    One thought I had about an approach that relaxes the “other side” of the restriction here would be to only disconnect peers after the time-out period if the BlockRequested is part of the active chain, something like:

    diff --git a/src/net_processing.cpp b/src/net_processing.cpp
    index 92a2a17956..bce906dd83 100644
    --- a/src/net_processing.cpp
    +++ b/src/net_processing.cpp
    @@ -5866,9 +5866,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                 QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
                 int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
                 if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    -                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    -                pto->fDisconnect = true;
    -                return true;
    +                if (!m_chainman.ActiveChain().Contains(queuedBlock.pindex)) {
    +                    RemoveBlockRequest(queuedBlock.pindex->GetBlockHash(), pto->GetId());
    +                } else {
    +                    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    +                    pto->fDisconnect = true;
    +                    return true;
    +                }
                 }
             }
    

    But, maybe this approach doesn’t sufficiently punish peers that send headers for blocks that don’t exist. (Surprisingly, this change breaks no existing tests!)
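
    As a rough gauge of the window involved in the diff above (the multipliers are assumed to be BLOCK_DOWNLOAD_TIMEOUT_BASE = 1 and BLOCK_DOWNLOAD_TIMEOUT_PER_PEER = 0.5, i.e. 10 minutes plus 5 minutes per other downloading peer):

    POW_TARGET_SPACING = 10 * 60   # seconds
    TIMEOUT_BASE = 1               # multiples of the block interval (assumed value)
    TIMEOUT_PER_PEER = 0.5         # extra fraction per other downloading peer (assumed value)
    def block_download_timeout(other_downloading_peers):
        # Seconds a requested block may stay in flight before the peer is disconnected.
        return POW_TARGET_SPACING * (TIMEOUT_BASE + TIMEOUT_PER_PEER * other_downloading_peers)
    assert block_download_timeout(0) == 600    # sole downloading peer: 10 minutes
    assert block_download_timeout(2) == 1200   # two other downloading peers: 20 minutes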

  22. mzumsande commented at 5:15 pm on July 7, 2025: contributor

    disconnect peers after the time-out period if the BlockRequested is part of the active chain

    Not sure I understand this - why would the block be part of the active chain by that time? We started a request in the first place because we don’t know the full block yet, and if that request has timed out, chances are we still don’t have the block, in which case it won’t be part of our active chain. Or do you count on a parallel compact-block download being successful in the meantime?

    But, maybe this approach doesn’t sufficiently punish peers that send headers for blocks that don’t exist.

    headers contain the PoW, so headers for blocks that don’t exist are very expensive to create.

  23. davidgumberg commented at 6:48 pm on July 7, 2025: contributor

    Not sure I understand this - why would the block be part of the active chain by that time? We started a request in the first place because we don’t know the full block yet, and if that request has timed out, chances are we still don’t have the block, in which case it won’t be part of our active chain. Or do you count on a parallel compact-block download being successful in the meantime?

    My assumption was that it was pretty unlikely for it to ever take >= 10 minutes to receive a block after hearing about it unless all of a node’s peers are stalling or all of its connections are extremely slow (4 MiB / 10 min ≈ 6.8 KiB/s), but I didn’t realize that a node will not GETDATA a block that is already in flight:

    https://github.com/bitcoin/bitcoin/blob/831e1b44de7596e703d68e92657863afbc3bdf90/src/net_processing.cpp#L1492-L1499

    But won’t a node GETDATA a block that it has received a HEADERS message for in HeadersDirectFetchBlocks()?


    This particular approach I suggested probably doesn’t work, but I think the more general question comes first: should #13370 be solved by relaxing BlockRequestAllowed() to allow nodes to send invalid blocks they were likely to have announced, or should it be solved by relaxing the block download timeout criteria (https://github.com/bitcoin/bitcoin/pull/5608)?

  24. instagibbs commented at 6:47 pm on July 9, 2025: member
    @davidgumberg for completeness, the other solution, as stated in the issue tracker, would be to send a notfound in response to requests for invalid blocks. I’m not sure where people landed on that, as that discussion was a long time ago.
  25. dergoegge commented at 2:40 pm on July 11, 2025: member

    Drafted a branch for the notfound approach here: https://github.com/dergoegge/bitcoin/commits/2025-07-blk-notfound/. I think I’d slightly prefer it because:

    • we wouldn’t propagate invalid blocks (this does not really matter that much but why do it if we can avoid it)
    • we wouldn’t get rid of the timeout punishment

    Happy to review if someone wants to run with it (haven’t tested my branch and it’s missing a functional test).
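
    To make the comparison concrete, here is a hypothetical sketch of what a functional test for the notfound approach might assert; names and flow are illustrative only (not taken from the linked branch) and assume the existing test_framework helpers:

    from test_framework.messages import CInv, MSG_BLOCK, msg_getdata
    from test_framework.p2p import P2PInterface

    class NotFoundTracker(P2PInterface):
        """Records notfound messages so the test can assert on them."""
        def __init__(self):
            super().__init__()
            self.notfounds = []
        def on_notfound(self, message):
            self.notfounds.append(message)

    # Inside a BitcoinTestFramework.run_test(), once the node has marked the
    # advertised block invalid:
    #   peer = node.add_p2p_connection(NotFoundTracker())
    #   peer.send_and_ping(msg_getdata(inv=[CInv(MSG_BLOCK, int(block_hash, 16))]))
    #   peer.wait_until(lambda: any(inv.hash == int(block_hash, 16)
    #                               for msg in peer.notfounds for inv in msg.vec))
    #   # ...and the peer should stay connected rather than being disconnected.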

  26. instagibbs commented at 5:23 pm on July 11, 2025: member
    @dergoegge a node can still avoid punishment by announcing a block and then sending a notfound after 9m59s, but this may be better-behaved average behavior. Might be worth bringing this up at an IRC meeting to get some more eyes on this problem and potential solution?
  27. instagibbs commented at 2:06 pm on July 18, 2025: member

    we wouldn’t propagate invalid blocks (this does not really matter that much but why do it if we can avoid it)

    Thinking about this some more, I don’t think this particular point is persuasive. Invalid blocks are propagated quite easily through compact blocks’ high bandwidth connections already.

    1. If all txns are known to the receiving peer, the block is forwarded before full validation
    2. If txns are missing, they are requested, the sending peer unconditionally responds to these messages, and the block is thus forwarded by the receiver once merkle checks are complete.

    Still considering all the options presented here

    edit: oops, this approach is clearly broken because full BLOCK messages do cause disconnect; my test setup was just wrong:

    2025-07-18T14:14:27.891958Z (mocktime: 2025-07-18T20:54:26Z) [msghand] [net_processing.cpp:4970] [MaybeDiscourageAndDisconnect] Warning: not punishing manually connected peer 0!
    
  28. instagibbs marked this as a draft on Jul 18, 2025
  29. instagibbs closed this on Jul 18, 2025

