p2p: Relax BlockRequestAllowed to respond to advertised blocks #32869

pull: instagibbs wants to merge 2 commits into bitcoin:master from instagibbs:2025-07-invalid-cb-stall, changing 3 files (+76 −1)
  1. instagibbs commented at 1:42 pm on July 3, 2025: member

    The first commit adds coverage for a case similar to #13370 . h/t dergoegge for detecting this issue independently. Documenting the current behavior with a test is then followed by a proposed change: any time we considered a block VALID_TRANSACTIONS and may have advertised it via compact blocks, we should respond to a getdata request for it if the peer fell back to requesting the full block for whatever reason.

    If the second commit is deemed controversial, it can be shelved and the current behavior documented alone.

  2. DrahtBot commented at 1:42 pm on July 3, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32869.

    Reviews

    See the guideline for information on the review process.

    Type          Reviewers
    Concept ACK   dergoegge

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  3. instagibbs force-pushed on Jul 3, 2025
  4. DrahtBot added the label CI failed on Jul 3, 2025
  5. DrahtBot commented at 1:46 pm on July 3, 2025: contributor

    🚧 At least one of the CI tasks failed.
    Task lint: https://github.com/bitcoin/bitcoin/runs/45297064338
    LLM reason (✨ experimental): The CI failure is caused by an unfixable lint error detected by rust and Python linters, leading to the overall build failure.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  6. dergoegge commented at 1:56 pm on July 3, 2025: member

    Concept ACK

    h/t dergoegge for detecting this issue independently.

    For context, I found this case (as demonstrated by the first commit in the PR) by adding a netsplit oracle (see https://gist.github.com/dergoegge/3036796551095a9ce7535fb5d74e6656#detecting-network-split-bugs) to fuzzamoto. While extremely unrealistic/unlikely to trigger in the wild, I’d still consider this timeout case a bug, especially because it could cause disconnects across the whole network (as opposed to isolated instances with short id collisions).

    Resolving this will help avoid special casing the oracle for this scenario.

  7. instagibbs renamed this:
    WIP p2p: Relax BlockRequestAllowed to respond to advertised blocks
    p2p: Relax BlockRequestAllowed to respond to advertised blocks
    on Jul 3, 2025
  8. instagibbs force-pushed on Jul 3, 2025
  9. fanquake requested review from mzumsande on Jul 3, 2025
  10. DrahtBot removed the label CI failed on Jul 3, 2025
  11. DrahtBot added the label P2P on Jul 3, 2025
  12. in test/functional/p2p_invalid_cb.py:62 in cd13a2086a outdated
    57+        # No disconnect yet, node1 is waiting on response to getdata
    58+        assert_equal(len(self.nodes[1].getpeerinfo()), 1)
    59+
    60+        # Make sure node1 has processed the compact block
    61+        self.wait_until(lambda: len(self.nodes[1].getchaintips()) == 2)
    62+        assert {"height": 13, "hash": block.hash, "branchlen": 1, "status": "headers-only"} in node.getchaintips()
    


    davidgumberg commented at 0:11 am on July 4, 2025:
            assert {"height": 13, "hash": block.hash, "branchlen": 1, "status": "headers-only"} in self.nodes[1].getchaintips()
    

    instagibbs commented at 12:16 pm on July 7, 2025:
    fixed
  13. in test/functional/p2p_invalid_cb.py:45 in cd13a2086a outdated
    40+        self.generate(self.nodes[0], 12)
    41+
    42+        # Make 200+ minutes pass with wiggle room added
    43+        # to disallow direct fetch of block
    44+        for node in self.nodes:
    45+            node.setmocktime(now + 200 * 60 * 2)
    


    davidgumberg commented at 0:12 am on July 4, 2025:

    nit:

                node.bumpmocktime(200 * 60 * 2)
    

    instagibbs commented at 11:35 am on July 7, 2025:
    will do on touchup

    instagibbs commented at 12:16 pm on July 7, 2025:
    done
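
    For context on the 200-minute figure in the snippet above: direct fetch of an announced block is only attempted while the tip is recent (roughly 20 block intervals, per CanDirectFetch() in net_processing.cpp). A minimal standalone sketch of that cutoff, with the constants written out here as assumptions rather than quoted from the source:

    # Illustrative only: mirrors the shape of the tip-recency check that gates
    # direct block fetch (constants assumed, not quoted from net_processing.cpp).
    POW_TARGET_SPACING = 10 * 60                    # seconds
    DIRECT_FETCH_WINDOW = 20 * POW_TARGET_SPACING   # 200 minutes
    def can_direct_fetch(tip_time, now):
        return tip_time > now - DIRECT_FETCH_WINDOW
    now = 1_700_000_000
    assert can_direct_fetch(now, now)
    # After bumping mocktime by 200 * 60 * 2 seconds the tip is stale, so the
    # node falls back to the parallel block download logic instead.
    assert not can_direct_fetch(now, now + 200 * 60 * 2)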
  14. davidgumberg commented at 0:24 am on July 4, 2025: contributor

    Maybe a dumb question, but why not relax the other side here? I don’t know what the new criteria would be, but in an ideal world, nodes should not be sending block announcements for blocks they have identified as invalid just to avoid upsetting their peers, so maybe the constraints set by PeerManagerImpl::SendMessages() are too rigid?

    https://github.com/bitcoin/bitcoin/blob/e3f416dbf7633b2fb19c933e5508bd231cc7e9cf/src/net_processing.cpp#L5865-L5873

  15. instagibbs commented at 11:35 am on July 7, 2025: member
    @davidgumberg might be talking past each other, but what’s happening here is that locally, as soon as the block is reconstructed (including merkle checks) and prior to script checks, the block is advertised forward via compact blocks to its peers. This is in line with the protocol description, and helps speed up block propagation.
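
    For readers less familiar with BIP 152, a rough sketch of the high-bandwidth relay sequence being described (message names only, purely illustrative; not test code):

    hb_relay_flow = [
        ("peer -> node", "cmpctblock"),    # unsolicited header + short txids
        ("node -> peer", "getblocktxn"),   # only if some transactions are missing
        ("peer -> node", "blocktxn"),      # the missing transactions
        # Once the block reconstructs and the merkle root checks out, the node
        # may relay its own cmpctblock to high-bandwidth peers before running
        # script validation.
        ("node -> hb peers", "cmpctblock"),
    ]
    for direction, msg in hb_relay_flow:
        print(f"{direction}: {msg}")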
  16. test: Add coverage for issue #13370
    If a PoW-and-merkle-valid compact block turns out to be
    invalid during validation, a peer can end up sending a
    getdata for the full block, and we will ignore that
    request.

    This scenario arises when no block has been mined for 200+
    minutes, disallowing direct fetch and causing us to fall
    back to the parallel block fetching logic, rather than the
    short txid collision scenario.
    831e1b44de
  17. p2p: BlockRequestAllowed relaxed to allow invalid getdata
    In the case where a block was advertised due to it being
    deemed VALID_TRANSACTIONS at any point, a peer may request
    the full block, and we currently ignore those requests,
    causing disconnection.

    Since we already answer requests for valid blocks, there is
    no reason not to also answer these requests, which avoids
    partition risk in the network. The PoW is valid and
    "recent", so there is no DoS risk either.
    260558a295
  18. instagibbs force-pushed on Jul 7, 2025
  19. in src/net_processing.cpp:1871 in 260558a295
    1865@@ -1866,7 +1866,9 @@ bool PeerManagerImpl::BlockRequestAllowed(const CBlockIndex* pindex)
    1866 {
    1867     AssertLockHeld(cs_main);
    1868     if (m_chainman.ActiveChain().Contains(pindex)) return true;
    1869-    return pindex->IsValid(BLOCK_VALID_SCRIPTS) && (m_chainman.m_best_header != nullptr) &&
    1870+    // HaveNumChainTxs implies we at some point reached VALID_TRANSACTIONS, meaning
    1871+    // we may have advertised it to peers via compact blocks
    1872+    return pindex->HaveNumChainTxs() && (m_chainman.m_best_header != nullptr) &&
    


    mzumsande commented at 3:20 pm on July 7, 2025:
    Should adjust doc of BlockRequestAllowed above (“we fully-validated them at some point”)
  20. mzumsande commented at 3:30 pm on July 7, 2025: contributor

    I’ll have to think a bit more about the consequences of this - but if we do this change, can we remove the need_activate_chain / ActivateBestChain logic from ProcessGetBlockData()? It doesn’t belong in net_processing from an architectural POV, and now it shouldn’t matter anymore for the request if the block is fully validated / invalid.

    Also, just noting that as a side effect we’d now serve blockfilters for invalid / unvalidated blocks (PrepareBlockFilterRequest).

  21. davidgumberg commented at 4:23 pm on July 7, 2025: contributor

    @davidgumberg might be talking past each other, but what’s happening here is that locally, as soon as the block is reconstructed (including merkle checks) and prior to script checks, the block is advertised forward via compact blocks to its peers. This is in line with the protocol description, and helps speed up block propagation.

    I’m not suggesting any changes to the pre-validation relay described in BIP-152. I’m asking about propagating blocks post-validation that a node knows to be invalid; shouldn’t this be avoided if possible?


    One thought I had about an approach that relaxes the “other side” of the restriction here would be to only disconnect peers after the time-out period if the BlockRequested is part of the active chain, something like:

    diff --git a/src/net_processing.cpp b/src/net_processing.cpp
    index 92a2a17956..bce906dd83 100644
    --- a/src/net_processing.cpp
    +++ b/src/net_processing.cpp
    @@ -5866,9 +5866,13 @@ bool PeerManagerImpl::SendMessages(CNode* pto)
                 QueuedBlock &queuedBlock = state.vBlocksInFlight.front();
                 int nOtherPeersWithValidatedDownloads = m_peers_downloading_from - 1;
                 if (current_time > state.m_downloading_since + std::chrono::seconds{consensusParams.nPowTargetSpacing} * (BLOCK_DOWNLOAD_TIMEOUT_BASE + BLOCK_DOWNLOAD_TIMEOUT_PER_PEER * nOtherPeersWithValidatedDownloads)) {
    -                LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    -                pto->fDisconnect = true;
    -                return true;
    +                if (!m_chainman.ActiveChain().Contains(queuedBlock.pindex)) {
    +                    RemoveBlockRequest(queuedBlock.pindex->GetBlockHash(), pto->GetId());
    +                } else {
    +                    LogInfo("Timeout downloading block %s, %s\n", queuedBlock.pindex->GetBlockHash().ToString(), pto->DisconnectMsg(fLogIPs));
    +                    pto->fDisconnect = true;
    +                    return true;
    +                }
                 }
             }
    

    But, maybe this approach doesn’t sufficiently punish peers that send headers for blocks that don’t exist. (Surprisingly, this change breaks no existing tests!)
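
    As a rough gauge of the window involved in the diff above (the multipliers are assumed to be BLOCK_DOWNLOAD_TIMEOUT_BASE = 1 and BLOCK_DOWNLOAD_TIMEOUT_PER_PEER = 0.5, i.e. 10 minutes plus 5 minutes per other downloading peer):

    POW_TARGET_SPACING = 10 * 60   # seconds
    TIMEOUT_BASE = 1               # multiples of the block interval (assumed value)
    TIMEOUT_PER_PEER = 0.5         # extra fraction per other downloading peer (assumed value)
    def block_download_timeout(other_downloading_peers):
        # Seconds a requested block may stay in flight before the peer is disconnected.
        return POW_TARGET_SPACING * (TIMEOUT_BASE + TIMEOUT_PER_PEER * other_downloading_peers)
    assert block_download_timeout(0) == 600    # sole downloading peer: 10 minutes
    assert block_download_timeout(2) == 1200   # two other downloading peers: 20 minutes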

  22. mzumsande commented at 5:15 pm on July 7, 2025: contributor

    disconnect peers after the time-out period if the BlockRequested is part of the active chain

    Not sure I understand this - why would the block be part of the active chain by that time? We started a request in the first place because we don’t know the full block yet, and if that request has timed out, chances are we still don’t have the block, in which case it won’t be part of our active chain. Or do you count on a parallel compact-block download being successful in the meantime?

    But, maybe this approach doesn’t sufficiently punish peers that send headers for blocks that don’t exist.

    headers contain the PoW, so headers for blocks that don’t exist are very expensive to create.

  23. davidgumberg commented at 6:48 pm on July 7, 2025: contributor

    Not sure I understand this - why would the block be part of the active chain by that time? We started a request in the first place because we don’t know the full block yet, and if that request has timed out, chances are we still don’t have the block, in which case it won’t be part of our active chain. Or do you count on a parallel compact-block download being successful in the meantime?

    My assumption was that it was pretty unlikely for it to ever take >= 10 minutes to receive a block after hearing about it unless all of a node’s peers are stalling or all of its connections are extremely slow (4 MiB / 10 min ≈ 6.8 KiB/s), but I didn’t realize that a node will not GETDATA a block that is already in flight:

    https://github.com/bitcoin/bitcoin/blob/831e1b44de7596e703d68e92657863afbc3bdf90/src/net_processing.cpp#L1492-L1499

    But won’t a node GETDATA a block that it has received a HEADERS message for in HeadersDirectFetchBlocks()?


    This particular approach I suggested probably doesn’t work, but I think the more general question comes first: should #13370 be solved by relaxing BlockRequestAllowed() to allow nodes to send invalid blocks they were likely to have announced, or should it be solved by relaxing the block download timeout criteria (https://github.com/bitcoin/bitcoin/pull/5608)?

  24. instagibbs commented at 6:47 pm on July 9, 2025: member
    @davidgumberg for completeness, the other solution, as stated in the issue tracker, would be to send a notfound in response to requests for invalid blocks. I’m not sure where people landed on that, as that discussion was a long time ago.
  25. dergoegge commented at 2:40 pm on July 11, 2025: member

    Drafted a branch for the notfound approach here: https://github.com/dergoegge/bitcoin/commits/2025-07-blk-notfound/. I think I’d slightly prefer it because:

    • we wouldn’t propagate invalid blocks (this does not really matter that much but why do it if we can avoid it)
    • we wouldn’t get rid of the timeout punishment

    Happy to review if someone wants to run with it (haven’t tested my branch and it’s missing a functional test).
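
    To make the comparison concrete, here is a hypothetical sketch of what a functional test for the notfound approach might assert; names and flow are illustrative only (not taken from the linked branch) and assume the existing test_framework helpers:

    from test_framework.messages import CInv, MSG_BLOCK, msg_getdata
    from test_framework.p2p import P2PInterface

    class NotFoundTracker(P2PInterface):
        """Records notfound messages so the test can assert on them."""
        def __init__(self):
            super().__init__()
            self.notfounds = []
        def on_notfound(self, message):
            self.notfounds.append(message)

    # Inside a BitcoinTestFramework.run_test(), once the node has marked the
    # advertised block invalid:
    #   peer = node.add_p2p_connection(NotFoundTracker())
    #   peer.send_and_ping(msg_getdata(inv=[CInv(MSG_BLOCK, int(block_hash, 16))]))
    #   peer.wait_until(lambda: any(inv.hash == int(block_hash, 16)
    #                               for msg in peer.notfounds for inv in msg.vec))
    #   # ...and the peer should stay connected rather than being disconnected.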

  26. instagibbs commented at 5:23 pm on July 11, 2025: member
    @dergoegge a node can still avoid punishment by announcing a block and then sending a notfound after 9m59s, but this may be better-behaved average behavior. Might be worth bringing this up at an IRC meeting to get some more eyes on this problem and potential solution?
  27. instagibbs commented at 2:06 pm on July 18, 2025: member

    we wouldn’t propagate invalid blocks (this does not really matter that much but why do it if we can avoid it)

    Thinking about this some more, I don’t think this particular point is persuasive. Invalid blocks are propagated quite easily through compact blocks’ high bandwidth connections already.

    1. If all txns are known to the receiving peer, the block is forwarded before full validation
    2. If txns are missing, they are requested, the sending peer unconditionally responds to these messages, and the block is thus forwarded by the receiver once merkle checks are complete.

    Still considering all the options presented here

    edit: oops, this approach is clearly broken because full BLOCK messages do cause disconnect; my test setup was just wrong:

    2025-07-18T14:14:27.891958Z (mocktime: 2025-07-18T20:54:26Z) [msghand] [net_processing.cpp:4970] [MaybeDiscourageAndDisconnect] Warning: not punishing manually connected peer 0!
    
  28. instagibbs marked this as a draft on Jul 18, 2025
  29. instagibbs closed this on Jul 18, 2025

