net: wait for validation queue flush for missing compact filter for an already known block #35354

pull randomlogin wants to merge 1 commits into bitcoin:master from randomlogin:p2p-cbf-wait-for-indexed changing 5 files +431 −10
  1. randomlogin commented at 10:07 PM on May 21, 2026: none

    This PR is another approach to fix a race condition bug when a yet-unconstructed compact filter is requested for an already known block.

    Fixes: #29655, #27085 Related discussion: 1, 2 Another approach which received NACK: #35262

    The problem: Bitcoin Core receives a new block and processes it, then advertises it; however, by that moment the compact filter might not yet have been constructed. If a peer asks for a filter in that time window, Bitcoin Core simply does not respond. From the peer's perspective this is a misbehaviour (we advertised a block and don't provide filters for it), which results in disconnects or bans. While it does not critically affect core nodes, it breaks the network topology for BIP 157 clients.

    More precisely, BIP157 specifies that for getcfilters requestor:

    StopHash MUST be known to belong to a block accepted by the receiving peer. This is the case if the peer had previously sent a headers or inv message with that block or any descendents. A node that receives getcfilters with an unknown StopHash SHOULD NOT respond.
    

    Current behaviour clearly contradicts the above specification.

    Proposed solution:

    As was suggested, we can:

    1. attempt to respond to a request (getcfilters, getcfheaders, getcfcheckpt which uses LookupFilterHeader); if it fails,
    2. check that we know the requested block hash and that the height difference between the last constructed filter block and the requested block is small — from which we deduce that pending validation events would fix the miss,
    3. wait for validation-queue flush and then retry.

    Waiting for the validation queue flush parks the single msghand thread, which blocks all peers' ProcessMessages for the duration of the drain. So the trigger has to be tight: an external actor can otherwise force us into that wait.

    This PR introduces BaseIndex::WaitForRacingWrite(target, max_ahead), which only drains when:

    1. the index is synced (m_synced == true),
    2. we have a non-null target block index,
    3. the target is at most max_ahead blocks ahead of the index's last processed block (and not below it).

    BlockFilterIndex calls this from all four lookup paths (LookupFilter, LookupFilterHeader, LookupFilterRange, LookupFilterHashRange) with max_ahead = CF_MAX_BLOCKS_AHEAD_RACE_WAIT = 2, retrying the lookup once after the drain.

    Reorg handling:

    The same primitive cleanly handles the reorg cases the bug originally exposed:

    • New chain extends past the old tip (ahead > 0): the queued BlockConnected events for the new branch will, in order, run BaseIndex::BlockConnected's internal Rewind (which calls CustomRemove on the disconnected blocks — BlockFilterIndex::CustomRemove migrates each old filter from its height-keyed slot to a hash-keyed slot) and then CustomAppend for the new blocks. After the drain, both the old (hash-keyed) and new (height-keyed) filters are reachable.
    • Same-height sibling reorg (ahead == 0): the index's last processed block and the requested block sit at the same height on different branches. The first lookup misses (height-keyed slot still holds the old block's data, hash-keyed slot for the new block doesn't exist yet). The ahead == 0 branch of the helper drains the queue, which runs the disconnect/connect for the sibling reorg, after which the retry succeeds.
    • Stale-branch lookup after the reorg has settled (ahead < 0): the helper does not wait — there is no in-flight callback that would help. The lookup falls back to the hash-keyed slot independently via the existing dual-key storage.

    This PR does not affect block propagation, as the very root of this bug is the independence of block processing and filter construction in CustomAppend. Nor does it affect IBD: during IBD m_synced is false, the helper returns immediately, and we simply don't respond to compact filter messages (the writes are happening on the sync thread, not via the validation queue, so draining wouldn't help anyway). The IBD short-circuit also closes the DoS surface during the period when the node is most resource-constrained.

    Drive-by:

    LookupFilterHeader previously held m_cs_headers_cache across the DB lookup. Now that the lookup path can call SyncWithValidationInterfaceQueue (which must not be invoked under a lock that a queued callback might need), the cache lock is narrowed to wrap only the cache reads/writes, and the cache insert is switched to try_emplace since the check-then-insert is no longer atomic.

    Tests:

    A new cfilter_race_tests suite uses a PreFilterBlocker CValidationInterface registered before the filter index, so its own BlockConnected fires first on the scheduler thread and stalls it before the filter index gets to run CustomAppend. While the scheduler is parked, the test issues the lookup on a worker thread and asserts:

    1. the worker remains blocked (proves the wait path is engaged — a regression that bypasses it would let the worker return immediately with false);
    2. after releasing the scheduler, the worker returns the correct filter/header/range.

    Cases covered:

    • cfilter_available_during_append_windowLookupFilter for the new tip racing with BlockConnected.
    • cfilter_range_available_during_append_windowLookupFilterRange covering indexed history plus the racing tip.
    • cfilter_header_available_during_append_windowLookupFilterHeader on the racing tip.
    • cfilter_available_during_same_height_reorg — same-height sibling reorg path (ahead == 0).

    I've used Claude Code for this PR.

  2. DrahtBot added the label P2P on May 21, 2026
  3. DrahtBot commented at 10:08 PM on May 21, 2026: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35354.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    <!--174a7506f384e20aa4161008e828411d-->

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #24230 (indexes: Stop using node internal types and locking cs_main, improve sync logic by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  4. randomlogin marked this as ready for review on May 21, 2026
  5. DrahtBot added the label CI failed on May 22, 2026
  6. DrahtBot commented at 10:47 AM on May 22, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/26255963544/job/77353717155</sub> <sub>LLM reason (✨ experimental): CI failed because ctest detected a test failure in cfilter_race_tests (1 test failed).</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  7. net: wait validation queue on cfilter request
    Compact block filter requests can arrive after a block has been
    connected and advertised but before the filter index's CustomAppend has
    written the filter on the validation-signals scheduler thread. In that
    window lookups fail, and requesting peers receive no reply.
    
    Check if the requested compact filter corresponds to a recent block, and
    if so wait for the validation queue to finish processing. This mitigates
    the race condition.
    b68a721e64
  8. randomlogin force-pushed on May 22, 2026
  9. randomlogin commented at 10:50 AM on May 22, 2026: none

    Squashed two commits in one (previously there was a distinct commit for tests addition).

  10. DrahtBot removed the label CI failed on May 22, 2026

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-05-22 20:51 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me