This PR is another approach to fix a race condition bug when a yet-unconstructed compact filter is requested for an already known block.
Fixes: #29655, #27085 Related discussion: 1, 2 Another approach which received NACK: #35262
The problem: Bitcoin Core receives a new block and processes it, then advertises it; however, by that moment the compact filter might not yet have been constructed. If a peer asks for a filter in that time window, Bitcoin Core simply does not respond. From the peer's perspective this is a misbehaviour (we advertised a block and don't provide filters for it), which results in disconnects or bans. While it does not critically affect core nodes, it breaks the network topology for BIP 157 clients.
More precisely, BIP157 specifies that for getcfilters requestor:
StopHash MUST be known to belong to a block accepted by the receiving peer. This is the case if the peer had previously sent a headers or inv message with that block or any descendents. A node that receives getcfilters with an unknown StopHash SHOULD NOT respond.
Current behaviour clearly contradicts the above specification.
Proposed solution:
As was suggested, we can:
- attempt to respond to a request (
getcfilters,getcfheaders,getcfcheckptwhich usesLookupFilterHeader); if it fails, - check that we know the requested block hash and that the height difference between the last constructed filter block and the requested block is small — from which we deduce that pending validation events would fix the miss,
- wait for validation-queue flush and then retry.
Waiting for the validation queue flush parks the single msghand thread, which blocks all peers' ProcessMessages for the duration of the drain. So the trigger has to be tight: an external actor can otherwise force us into that wait.
This PR introduces BaseIndex::WaitForRacingWrite(target, max_ahead), which only drains when:
- the index is synced (
m_synced == true), - we have a non-null target block index,
- the target is at most
max_aheadblocks ahead of the index's last processed block (and not below it).
BlockFilterIndex calls this from all four lookup paths (LookupFilter, LookupFilterHeader, LookupFilterRange, LookupFilterHashRange) with max_ahead = CF_MAX_BLOCKS_AHEAD_RACE_WAIT = 2, retrying the lookup once after the drain.
Reorg handling:
The same primitive cleanly handles the reorg cases the bug originally exposed:
- New chain extends past the old tip (
ahead > 0): the queuedBlockConnectedevents for the new branch will, in order, runBaseIndex::BlockConnected's internal Rewind (which callsCustomRemoveon the disconnected blocks —BlockFilterIndex::CustomRemovemigrates each old filter from its height-keyed slot to a hash-keyed slot) and thenCustomAppendfor the new blocks. After the drain, both the old (hash-keyed) and new (height-keyed) filters are reachable. - Same-height sibling reorg (
ahead == 0): the index's last processed block and the requested block sit at the same height on different branches. The first lookup misses (height-keyed slot still holds the old block's data, hash-keyed slot for the new block doesn't exist yet). Theahead == 0branch of the helper drains the queue, which runs the disconnect/connect for the sibling reorg, after which the retry succeeds. - Stale-branch lookup after the reorg has settled (
ahead < 0): the helper does not wait — there is no in-flight callback that would help. The lookup falls back to the hash-keyed slot independently via the existing dual-key storage.
This PR does not affect block propagation, as the very root of this bug is the independence of block processing and filter construction in CustomAppend. Nor does it affect IBD: during IBD m_synced is false, the helper returns immediately, and we simply don't respond to compact filter messages (the writes are happening on the sync thread, not via the validation queue, so draining wouldn't help anyway). The IBD short-circuit also closes the DoS surface during the period when the node is most resource-constrained.
Drive-by:
LookupFilterHeader previously held m_cs_headers_cache across the DB lookup. Now that the lookup path can call SyncWithValidationInterfaceQueue (which must not be invoked under a lock that a queued callback might need), the cache lock is narrowed to wrap only the cache reads/writes, and the cache insert is switched to try_emplace since the check-then-insert is no longer atomic.
Tests:
A new cfilter_race_tests suite uses a PreFilterBlocker CValidationInterface registered before the filter index, so its own BlockConnected fires first on the scheduler thread and stalls it before the filter index gets to run CustomAppend. While the scheduler is parked, the test issues the lookup on a worker thread and asserts:
- the worker remains blocked (proves the wait path is engaged — a regression that bypasses it would let the worker return immediately with
false); - after releasing the scheduler, the worker returns the correct filter/header/range.
Cases covered:
cfilter_available_during_append_window—LookupFilterfor the new tip racing withBlockConnected.cfilter_range_available_during_append_window—LookupFilterRangecovering indexed history plus the racing tip.cfilter_header_available_during_append_window—LookupFilterHeaderon the racing tip.cfilter_available_during_same_height_reorg— same-height sibling reorg path (ahead == 0).
I've used Claude Code for this PR.