net: wait for validation queue flush for missing compact filter for an already known block #35354

pull randomlogin wants to merge 1 commits into bitcoin:master from randomlogin:p2p-cbf-wait-for-indexed changing 5 files +431 −10

randomlogin commented at 10:07 PM on May 21, 2026: none
This PR is another approach to fix a race condition bug when a yet-unconstructed compact filter is requested for an already known block.

Fixes: #29655, #27085 Related discussion: 1, 2 Another approach which received NACK: #35262

The problem: Bitcoin Core receives a new block and processes it, then advertises it; however, by that moment the compact filter might not yet have been constructed. If a peer asks for a filter in that time window, Bitcoin Core simply does not respond. From the peer's perspective this is a misbehaviour (we advertised a block and don't provide filters for it), which results in disconnects or bans. While it does not critically affect core nodes, it breaks the network topology for BIP 157 clients.

More precisely, BIP157 specifies that for getcfilters requestor:
```
StopHash MUST be known to belong to a block accepted by the receiving peer. This is the case if the peer had previously sent a headers or inv message with that block or any descendents. A node that receives getcfilters with an unknown StopHash SHOULD NOT respond.
```
Current behaviour clearly contradicts the above specification.

Proposed solution:

As was suggested, we can:
1. attempt to respond to a request (getcfilters, getcfheaders, getcfcheckpt which uses LookupFilterHeader); if it fails,
2. check that we know the requested block hash and that the height difference between the last constructed filter block and the requested block is small — from which we deduce that pending validation events would fix the miss,
3. wait for validation-queue flush and then retry.
Waiting for the validation queue flush parks the single msghand thread, which blocks all peers' ProcessMessages for the duration of the drain. So the trigger has to be tight: an external actor can otherwise force us into that wait.

This PR introduces BaseIndex::WaitForRacingWrite(target, max_ahead), which only drains when:
1. the index is synced (m_synced == true),
2. we have a non-null target block index,
3. the target is at most max_ahead blocks ahead of the index's last processed block (and not below it).
BlockFilterIndex calls this from all four lookup paths (LookupFilter, LookupFilterHeader, LookupFilterRange, LookupFilterHashRange) with max_ahead = CF_MAX_BLOCKS_AHEAD_RACE_WAIT = 2, retrying the lookup once after the drain.

Reorg handling:

The same primitive cleanly handles the reorg cases the bug originally exposed:
- New chain extends past the old tip (ahead > 0): the queued BlockConnected events for the new branch will, in order, run BaseIndex::BlockConnected's internal Rewind (which calls CustomRemove on the disconnected blocks — BlockFilterIndex::CustomRemove migrates each old filter from its height-keyed slot to a hash-keyed slot) and then CustomAppend for the new blocks. After the drain, both the old (hash-keyed) and new (height-keyed) filters are reachable.
- Same-height sibling reorg (ahead == 0): the index's last processed block and the requested block sit at the same height on different branches. The first lookup misses (height-keyed slot still holds the old block's data, hash-keyed slot for the new block doesn't exist yet). The ahead == 0 branch of the helper drains the queue, which runs the disconnect/connect for the sibling reorg, after which the retry succeeds.
- Stale-branch lookup after the reorg has settled (ahead < 0): the helper does not wait — there is no in-flight callback that would help. The lookup falls back to the hash-keyed slot independently via the existing dual-key storage.
This PR does not affect block propagation, as the very root of this bug is the independence of block processing and filter construction in CustomAppend. Nor does it affect IBD: during IBD m_synced is false, the helper returns immediately, and we simply don't respond to compact filter messages (the writes are happening on the sync thread, not via the validation queue, so draining wouldn't help anyway). The IBD short-circuit also closes the DoS surface during the period when the node is most resource-constrained.

Drive-by:

LookupFilterHeader previously held m_cs_headers_cache across the DB lookup. Now that the lookup path can call SyncWithValidationInterfaceQueue (which must not be invoked under a lock that a queued callback might need), the cache lock is narrowed to wrap only the cache reads/writes, and the cache insert is switched to try_emplace since the check-then-insert is no longer atomic.

Tests:

A new cfilter_race_tests suite uses a PreFilterBlocker CValidationInterface registered before the filter index, so its own BlockConnected fires first on the scheduler thread and stalls it before the filter index gets to run CustomAppend. While the scheduler is parked, the test issues the lookup on a worker thread and asserts:
1. the worker remains blocked (proves the wait path is engaged — a regression that bypasses it would let the worker return immediately with false);
2. after releasing the scheduler, the worker returns the correct filter/header/range.
Cases covered:
- cfilter_available_during_append_window — LookupFilter for the new tip racing with BlockConnected.
- cfilter_range_available_during_append_window — LookupFilterRange covering indexed history plus the racing tip.
- cfilter_header_available_during_append_window — LookupFilterHeader on the racing tip.
- cfilter_available_during_same_height_reorg — same-height sibling reorg path (ahead == 0).
I've used Claude Code for this PR.
DrahtBot added the label P2P on May 21, 2026
DrahtBot commented at 10:08 PM on May 21, 2026: contributor


The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.



Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/35354.



Reviews

See the guideline for information on the review process. A summary of reviews will appear here.



Conflicts

Reviewers, this pull request conflicts with the following ones:
- #24230 (indexes: Stop using node internal types and locking cs_main, improve sync logic by ryanofsky)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
randomlogin marked this as ready for review on May 21, 2026
DrahtBot added the label CI failed on May 22, 2026
DrahtBot commented at 10:47 AM on May 22, 2026: contributor


🚧 At least one of the CI tasks failed. <sub>Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/26255963544/job/77353717155</sub> <sub>LLM reason (✨ experimental): CI failed because ctest detected a test failure in cfilter_race_tests (1 test failed).</sub>

<details><summary>Hints</summary>

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
- Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
- A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
- An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.

</details>

net: wait validation queue on cfilter request

Compact block filter requests can arrive after a block has been
connected and advertised but before the filter index's CustomAppend has
written the filter on the validation-signals scheduler thread. In that
window lookups fail, and requesting peers receive no reply.

Check if the requested compact filter corresponds to a recent block, and
if so wait for the validation queue to finish processing. This mitigates
the race condition.

b68a721e64

randomlogin force-pushed on May 22, 2026
randomlogin commented at 10:50 AM on May 22, 2026: none

Squashed two commits in one (previously there was a distinct commit for tests addition).
DrahtBot removed the label CI failed on May 22, 2026
sedited requested review from ajtowns on May 25, 2026
sedited commented at 9:47 PM on June 21, 2026: contributor

Reading through this, I don't think this should be handled on the index level. While I'm still not sold that this is not something that is just a bit flawed in the spec, I would prefer an approach where net_processing dealt with this directly in the case of an index query returning nothing by checking if the requested block lies in the two block tolerance range and if so syncing the validation interface queue.
sedited requested review from rustaceanrob on Jun 21, 2026

Contributors

randomlogin

DrahtBot

sedited

Review Requested
ajtowns rustaceanrob

Labels

P2P

Linked (view graph)

#27085 Failing to fetch `cfheader` corresponding to block header in `headers` message #29655 CFilters reject and disconnect peers with a valid block hash #35262 net: calculate compact filters on the fly for an already advertised block

net: wait for validation queue flush for missing compact filter for an already known block #35354

Code Coverage & Benchmarks

Reviews

Conflicts