Cache block filter headers at heights of multiples of 1000 in memory.
Block filter headers at height 1000x are checkpointed, and will be the most frequently requested. Cache them in memory to avoid costly disk reads.
```diff
@@ -31,6 +31,8 @@ constexpr char DB_FILTER_POS = 'P';
 constexpr unsigned int MAX_FLTR_FILE_SIZE = 0x1000000; // 16 MiB
 /** The pre-allocation chunk size for fltr?????.dat files */
 constexpr unsigned int FLTR_FILE_CHUNK_SIZE = 0x100000; // 1 MiB
+/** Maximum size of the cfheaders cache */
+constexpr size_t CF_HEADERS_CACHE_MAX_SZ{2000};
```
I’m almost ready to ACK this PR, just have some minor suggestion/discussion points. Would it make sense to only check the cache for checkpoint heights, i.e. a method structure like:

```cpp
bool is_checkpoint = (block_index->nHeight % CFCHECKPT_INTERVAL == 0);
if (is_checkpoint)
    // Check the headers cache

// Regular lookup

if (is_checkpoint)
    // Add to the headers cache
```

Of course, the drawback of this approach is that the condition needs to be removed if the cache is ever used for filter headers other than those at the checkpoints. I’m also not sure whether it’s really worth it from a performance point of view, but in 999 out of 1000 cases the lookup in the map could be skipped.
Add LogPrintfs in the cache hit / cache add cases, run the (maybe slightly modified) functional test, and analyze the log output to check whether it behaves as expected.

```diff
@@ -30,6 +33,9 @@ class BlockFilterIndex final : public BaseIndex
     bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter) const;
     size_t WriteFilterToDisk(FlatFilePos& pos, const BlockFilter& filter);
 
+    RecursiveMutex m_cs_headers_cache;
```

nit (feel free to ignore): I haven’t tested this, but from reading the code, a non-recursive mutex should be sufficient:

```cpp
Mutex m_cs_headers_cache;
```
```diff
 
+    if (block_index->nHeight % CFCHECKPT_INTERVAL == 0) {
+        // Add to the headers cache
+        LOCK(m_cs_headers_cache);
+        if (m_headers_cache.size() < CF_HEADERS_CACHE_MAX_SZ) {
+            m_headers_cache.emplace(block_index->GetBlockHash(), entry.header);
```

I wonder what would happen if CFCHECKPT_INTERVAL were ever to change. I know the chances of that happening are slim, and if it happens we can deal with it then; I just wanted to put it out there.
The test suite explicitly checks that filters for stale blocks may still be fetched.
we could end up with headers of orphaned blocks in the cache
Correct. If there’s a re-org across a checkpoint block, then the disconnected block’s filter header stays in the cache. That’s an explicit design decision, since:
if CFCHECKPT_INTERVAL would ever change
In that case, we’d need to change the software, which would involve a stop/start. This is an in-memory cache so gets dropped on shutdown.
```diff
     }
 
+    if (block_index->nHeight % CFCHECKPT_INTERVAL == 0) {
+        // Add to the headers cache
+        LOCK(m_cs_headers_cache);
+        if (m_headers_cache.size() < CF_HEADERS_CACHE_MAX_SZ) {
```
If someone can mine enough blocks to fill up this cache, we have other problems. I think an adversary would need to mine blocks on the order of the 2000th triangle number to fill this up.
And when this cache fills up for natural reasons (in 20 years), disks will be fast enough that we won’t need a cache.
Remind me to change this to 4000 in 2026 :)
I agree that a benchmark would be nice.
```cpp
// TODO change to 4000 in 2026
```
;)
Would it make sense to also only check the cache if we have checkpoints, i.e. a method structure like
Done @MarcoFalke
a non-recursive mutex should be sufficient.
Changed RecursiveMutex -> Mutex.
I’m still working on adding a microbench, although it’s not clear to me that it’ll show any improvement since the disk reads may be cached by the OS for the microbench.
```diff
@@ -30,6 +33,9 @@ class BlockFilterIndex final : public BaseIndex
     bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter) const;
     size_t WriteFilterToDisk(FlatFilePos& pos, const BlockFilter& filter);
 
+    Mutex m_cs_headers_cache;
+    std::map<uint256, uint256> m_headers_cache GUARDED_BY(m_cs_headers_cache);
```
Is there a reason std::unordered_map can’t be used? I see that the tree-based std::map and std::set are used primarily throughout the codebase, but figured it may just be historical. Is there another reason these are preferred?
I expect mostly laziness. With an unordered_map, you need to provide a hash function. In this case, it’s easy: the data is all hash digests from block chain data, so I can just take the first 8 bytes of the uint256 filter header. I’ll make that change tomorrow.
Concept ACK.
Kudos on the simplicity of this approach!
Using jnewbery’s nice node-shell module (a modification of test-shell that allows the functional test framework to connect to a running full node) I crafted a quick and dirty test script (see branch https://github.com/theStack/bitcoin/tree/pr18960_test) that sends getcfcheckpt requests with the stop-hash of the halving block 630000 and measures the processing time.
Freshly started bitcoind from master branch (w/o checkpoint headers cache implementation), after mempool imports are done:
```
getcfcheckpt request processing times
=====================================
-> first: 53.75ms
-> following: 50.70ms (average)
=====================================
```
Freshly started bitcoind from PR branch (with checkpoint headers cache implementation), after mempool imports are done:
```
getcfcheckpt request processing times
=====================================
-> first: 50.92ms
-> following: 50.80ms (average)
=====================================
```
There doesn’t seem to be a noticeable difference. I don’t know if my methodology is flawed, though. Maybe all the surrounding processing takes so much more time that the actual part of fetching the header (from the cache or from disk) is negligible. If anyone has an idea how to improve the benchmark script, happy to hear suggestions!
// EDIT: As jnewbery found out in the course of today’s PR review club meeting, there is a good explanation on why the values are always close to 50ms:
```
19:36 < jnewbery> There are 100ms sleeps in the messagehandler thread between looping through all peers:
                  https://github.com/jnewbery/bitcoin/blob/7ad83ed252914d6d5b8ed81f103aecf052c68fb7/src/net.cpp#L2061
19:37 < jnewbery> on a non-busy node, 50ms is the average time you have to wait for the message handler thread to wake up
```
take the m_cs_headers_cache for the entirety of the LookupFilterHeader() function, rather than taking, releasing and taking it again

That seems like a reasonable request to me. We can’t ever block important work with this lock, so holding it a bit longer than strictly necessary is not an issue.
@theStack - thank you for doing some benching of this. Your approach inspired me to do something similar myself, using debug logs to find the exact time to construct a cfcheckpt response. I’ve added a debug_log_delta context wrapper to the node_shell, which allows me to provide two debug log messages, then do actions on the node and extract the time delta between those two logs. I applied that patch to:
And then ran the following node-shell script:
```python
REPO_PATH="<path_to_repo>"
DATADIR_PATH="<path_to_datadir>"
import sys
sys.path.insert(0, f"{REPO_PATH}/test/functional")
from test_framework.node_shell import NodeShell
test = NodeShell()
test.setup(datadir=DATADIR_PATH)
# <test_framework.node_shell.NodeShell.__TestShell object at 0x7f7704820490>
node = test.nodes[0]
bb = node.getbestblockhash()
from test_framework.messages import FILTER_TYPE_BASIC, msg_getcfcheckpt
request = msg_getcfcheckpt(filter_type=FILTER_TYPE_BASIC, stop_hash=int(bb, 16))
for i in range(15):
    with node.debug_log_delta("getcfcheckpt request received", "cfcheckpt response constructed"):
        node.p2ps[0].send_and_ping(request)
test.shutdown()
```
```
getcfcheckpt request received to cfcheckpt response constructed: 7.824ms
getcfcheckpt request received to cfcheckpt response constructed: 7.054ms
getcfcheckpt request received to cfcheckpt response constructed: 7.021ms
getcfcheckpt request received to cfcheckpt response constructed: 7.303ms
getcfcheckpt request received to cfcheckpt response constructed: 7.355ms
getcfcheckpt request received to cfcheckpt response constructed: 7.26ms
getcfcheckpt request received to cfcheckpt response constructed: 7.169ms
getcfcheckpt request received to cfcheckpt response constructed: 7.166ms
getcfcheckpt request received to cfcheckpt response constructed: 7.194ms
getcfcheckpt request received to cfcheckpt response constructed: 8.574ms
getcfcheckpt request received to cfcheckpt response constructed: 7.119ms
getcfcheckpt request received to cfcheckpt response constructed: 7.156ms
getcfcheckpt request received to cfcheckpt response constructed: 4.153ms
getcfcheckpt request received to cfcheckpt response constructed: 8.162ms
getcfcheckpt request received to cfcheckpt response constructed: 6.485ms
```
There does seem to be a very slight performance improvement after the first request. Maybe around 5-10% from the mean, but variance is quite high, so it might be noise.
```
getcfcheckpt request received to cfcheckpt response constructed: 12.2ms
getcfcheckpt request received to cfcheckpt response constructed: 2.855ms
getcfcheckpt request received to cfcheckpt response constructed: 2.992ms
getcfcheckpt request received to cfcheckpt response constructed: 2.869ms
getcfcheckpt request received to cfcheckpt response constructed: 3.141ms
getcfcheckpt request received to cfcheckpt response constructed: 3.692ms
getcfcheckpt request received to cfcheckpt response constructed: 3.452ms
getcfcheckpt request received to cfcheckpt response constructed: 2.916ms
getcfcheckpt request received to cfcheckpt response constructed: 3.413ms
getcfcheckpt request received to cfcheckpt response constructed: 3.648ms
getcfcheckpt request received to cfcheckpt response constructed: 3.816ms
getcfcheckpt request received to cfcheckpt response constructed: 3.84ms
getcfcheckpt request received to cfcheckpt response constructed: 3.812ms
getcfcheckpt request received to cfcheckpt response constructed: 3.869ms
getcfcheckpt request received to cfcheckpt response constructed: 3.445ms
```
The initial request is slower (presumably because the cache needs to be warmed), but subsequent requests are on average >50% faster than with no cache.
```
getcfcheckpt request received to cfcheckpt response constructed: 11.785ms
getcfcheckpt request received to cfcheckpt response constructed: 3.35ms
getcfcheckpt request received to cfcheckpt response constructed: 3.19ms
getcfcheckpt request received to cfcheckpt response constructed: 3.475ms
getcfcheckpt request received to cfcheckpt response constructed: 3.175ms
getcfcheckpt request received to cfcheckpt response constructed: 3.345ms
getcfcheckpt request received to cfcheckpt response constructed: 3.274ms
getcfcheckpt request received to cfcheckpt response constructed: 3.161ms
getcfcheckpt request received to cfcheckpt response constructed: 3.259ms
getcfcheckpt request received to cfcheckpt response constructed: 3.217ms
getcfcheckpt request received to cfcheckpt response constructed: 3.238ms
getcfcheckpt request received to cfcheckpt response constructed: 3.253ms
getcfcheckpt request received to cfcheckpt response constructed: 3.33ms
getcfcheckpt request received to cfcheckpt response constructed: 3.32ms
getcfcheckpt request received to cfcheckpt response constructed: 3.36ms
```
The first request is slightly faster than when taking locks twice (although that might be noise). Subsequent requests are the same speed as the ’taking locks twice’ branch, which is what we’d expect because locks are only taken once in that branch if the header is found in the cache.
@MarcoFalke - I think these full node manual benches are more meaningful than microbenches, so I’m not planning to add any automated benches to /src/bench.
@narula suggested in today’s PR Review Club meeting that it’d be less cognitive overhead to simply take the m_cs_headers_cache for the entirety of the LookupFilterHeader() function, rather than taking, releasing and taking it again.
I’ll make this change tomorrow.
In yesterday’s review club meeting on this PR, it was suggested to put LogPrintf() directly into the network processing code parts to measure the execution time of getcfcheckpt requests. (My previous approach of measuring the request response time from the test framework was not precise enough, as the time also included python overhead and the time for the message handler to wake up.) I did some testing with this refined approach and can roughly confirm the results from jnewbery (see commit https://github.com/theStack/bitcoin/commit/48cff1bb4d27ef0040b18a860a3c897752e3c296):
without cache:
```
2020-05-14T20:15:11Z getcfcheckpt processing took 10247us
2020-05-14T20:15:11Z getcfcheckpt processing took 6105us
2020-05-14T20:15:11Z getcfcheckpt processing took 6437us
2020-05-14T20:15:11Z getcfcheckpt processing took 5330us
2020-05-14T20:15:11Z getcfcheckpt processing took 6964us
2020-05-14T20:15:11Z getcfcheckpt processing took 5523us
2020-05-14T20:15:12Z getcfcheckpt processing took 5413us
2020-05-14T20:15:12Z getcfcheckpt processing took 5843us
2020-05-14T20:15:12Z getcfcheckpt processing took 7060us
2020-05-14T20:15:12Z getcfcheckpt processing took 6165us
...
```
with cache:
```
2020-05-14T20:21:26Z getcfcheckpt processing took 9801us
2020-05-14T20:21:26Z getcfcheckpt processing took 2619us
2020-05-14T20:21:26Z getcfcheckpt processing took 3338us
2020-05-14T20:21:26Z getcfcheckpt processing took 2446us
2020-05-14T20:21:26Z getcfcheckpt processing took 2560us
2020-05-14T20:21:27Z getcfcheckpt processing took 2331us
2020-05-14T20:21:27Z getcfcheckpt processing took 2703us
2020-05-14T20:21:27Z getcfcheckpt processing took 2614us
2020-05-14T20:21:27Z getcfcheckpt processing took 2598us
2020-05-14T20:21:27Z getcfcheckpt processing took 2444us
...
```
One can see that there’s a speedup of at least 2x with the cache.
Note that my approach to get these numbers was slightly different, as I directly measured the time in bitcoind via GetTimeMicros() (commit https://github.com/theStack/bitcoin/commit/49f5f3d5444ed6c872dd41be910e2149308b35e7):
```diff
diff --git a/src/net_processing.cpp b/src/net_processing.cpp
index 1df1fab..6e620d9 100644
--- a/src/net_processing.cpp
+++ b/src/net_processing.cpp
@@ -3380,7 +3380,10 @@ bool ProcessMessage(CNode* pfrom, const std::string& msg_type, CDataStream& vRec
     }
 
     if (msg_type == NetMsgType::GETCFCHECKPT) {
+        int64_t start_time = GetTimeMicros();
         ProcessGetCFCheckPt(pfrom, vRecv, chainparams, connman);
+        int64_t processing_duration = GetTimeMicros() - start_time;
+        LogPrintf("getcfcheckpt processing took %lldus\n", processing_duration);
         return true;
     }
```
but the idea is the same.
Now I’m very curious whether the change from std::map to std::unordered_map shows a noticeable difference in performance.
@MarcoFalke: Nice idea to drop the cache between the requests. Unfortunately, I don’t have the permissions (root rights) on my current development box to do this :/ If anyone has the possibility, the results would be very interesting.
```diff
@@ -31,6 +31,8 @@ constexpr char DB_FILTER_POS = 'P';
 constexpr unsigned int MAX_FLTR_FILE_SIZE = 0x1000000; // 16 MiB
 /** The pre-allocation chunk size for fltr?????.dat files */
 constexpr unsigned int FLTR_FILE_CHUNK_SIZE = 0x100000; // 1 MiB
+/** Maximum size of the cfheaders cache */
```
```diff
 
+    // Add to the headers cache
+    if (m_headers_cache.size() < CF_HEADERS_CACHE_MAX_SZ) {
+        m_headers_cache.emplace(block_index->GetBlockHash(), entry.header);
+    }
+
```
I think this should only happen when block_index->nHeight % CFCHECKPT_INTERVAL == 0, otherwise the most recent 2000 requested headers will be cached rather than the desired ones.
I’ve force-pushed a branch that changes this to use an unordered_map and only lock once.
Comparing the unordered map implementation to map:
```
getcfcheckpt request received to cfcheckpt response constructed: 12.808ms
getcfcheckpt request received to cfcheckpt response constructed: 2.783ms
getcfcheckpt request received to cfcheckpt response constructed: 2.712ms
getcfcheckpt request received to cfcheckpt response constructed: 2.656ms
getcfcheckpt request received to cfcheckpt response constructed: 2.672ms
getcfcheckpt request received to cfcheckpt response constructed: 2.82ms
getcfcheckpt request received to cfcheckpt response constructed: 2.994ms
getcfcheckpt request received to cfcheckpt response constructed: 3.068ms
getcfcheckpt request received to cfcheckpt response constructed: 3.335ms
getcfcheckpt request received to cfcheckpt response constructed: 3.444ms
getcfcheckpt request received to cfcheckpt response constructed: 3.159ms
getcfcheckpt request received to cfcheckpt response constructed: 3.194ms
getcfcheckpt request received to cfcheckpt response constructed: 2.601ms
getcfcheckpt request received to cfcheckpt response constructed: 2.692ms
getcfcheckpt request received to cfcheckpt response constructed: 2.982ms
getcfcheckpt request received to cfcheckpt response constructed: 3.209ms
getcfcheckpt request received to cfcheckpt response constructed: 3.116ms
getcfcheckpt request received to cfcheckpt response constructed: 3.136ms
getcfcheckpt request received to cfcheckpt response constructed: 2.666ms
getcfcheckpt request received to cfcheckpt response constructed: 3.091ms
getcfcheckpt request received to cfcheckpt response constructed: 3.444ms
getcfcheckpt request received to cfcheckpt response constructed: 3.385ms
getcfcheckpt request received to cfcheckpt response constructed: 3.147ms
getcfcheckpt request received to cfcheckpt response constructed: 3.145ms
getcfcheckpt request received to cfcheckpt response constructed: 3.053ms
```
```
getcfcheckpt request received to cfcheckpt response constructed: 12.959ms
getcfcheckpt request received to cfcheckpt response constructed: 3.001ms
getcfcheckpt request received to cfcheckpt response constructed: 3.656ms
getcfcheckpt request received to cfcheckpt response constructed: 2.07ms
getcfcheckpt request received to cfcheckpt response constructed: 2.537ms
getcfcheckpt request received to cfcheckpt response constructed: 2.838ms
getcfcheckpt request received to cfcheckpt response constructed: 2.994ms
getcfcheckpt request received to cfcheckpt response constructed: 3.143ms
getcfcheckpt request received to cfcheckpt response constructed: 3.383ms
getcfcheckpt request received to cfcheckpt response constructed: 3.666ms
getcfcheckpt request received to cfcheckpt response constructed: 3.889ms
getcfcheckpt request received to cfcheckpt response constructed: 3.553ms
getcfcheckpt request received to cfcheckpt response constructed: 3.394ms
getcfcheckpt request received to cfcheckpt response constructed: 3.3ms
getcfcheckpt request received to cfcheckpt response constructed: 3.5ms
getcfcheckpt request received to cfcheckpt response constructed: 3.372ms
getcfcheckpt request received to cfcheckpt response constructed: 3.329ms
getcfcheckpt request received to cfcheckpt response constructed: 3.308ms
getcfcheckpt request received to cfcheckpt response constructed: 3.564ms
getcfcheckpt request received to cfcheckpt response constructed: 3.746ms
getcfcheckpt request received to cfcheckpt response constructed: 3.714ms
getcfcheckpt request received to cfcheckpt response constructed: 3.423ms
getcfcheckpt request received to cfcheckpt response constructed: 3.517ms
getcfcheckpt request received to cfcheckpt response constructed: 3.636ms
getcfcheckpt request received to cfcheckpt response constructed: 3.595ms
```
So from this single test, it appears that an unordered_map is indeed faster. That’s what we’d expect: a single lookup is O(1) and the number of lookups required for a getcfcheckpt request is O(n), so the total is O(n). For a map, a single lookup is O(log n), so the total is O(n log n).
The original implementation is faster still, since it’s storing the checkpts in a single vector, but the aim of this PR is to avoid disk reads for getcfcheckpt messages, which the simple unordered_map implementation does. Further performance improvements could be done as a follow-up if required.
ACK b6eba9e492e1dfb38baec894bfda1cfb4d709e5d
Ran all tests.
```diff
@@ -387,13 +393,26 @@ bool BlockFilterIndex::LookupFilter(const CBlockIndex* block_index, BlockFilter&
     return ReadFilterFromDisk(entry.pos, filter_out);
 }
 
-bool BlockFilterIndex::LookupFilterHeader(const CBlockIndex* block_index, uint256& header_out) const
+bool BlockFilterIndex::LookupFilterHeader(const CBlockIndex* block_index, uint256& header_out)
 {
+    LOCK(m_cs_headers_cache);
+    auto header = m_headers_cache.find(block_index->GetBlockHash());
```
LOCK(m_cs_headers_cache). Thanks for catching this.
Cache block filter headers at heights of multiples of 1000 in memory.
Block filter headers at height 1000x are checkpointed, and will be the
most frequently requested. Cache them in memory to avoid costly disk
reads.
ACK 0187d4c118ab4c0f5c2d4fb180c2a8dea8ac53cf :tada:
By the way, I’m neutral on the question of whether to use std::map or std::unordered_map for the cache. std::unordered_map does indeed seem to lead to some performance improvement (my tests show a rough 10% improvement, see numbers below; jnewbery’s tests show an even higher improvement of close to 20%), but on the other hand it takes both more space (as pointed out by MarcoFalke) and more code, due to the need to provide a hash function. Maybe someone has a clearer opinion on the choice; personally, I’m okay with ACKing either version.
Test with std::map:
```
2020-05-19T09:09:01Z getcfcheckpt processing took 13817us
2020-05-19T09:09:01Z getcfcheckpt processing took 2855us
2020-05-19T09:09:01Z getcfcheckpt processing took 3573us
2020-05-19T09:09:01Z getcfcheckpt processing took 3444us
2020-05-19T09:09:01Z getcfcheckpt processing took 3236us
2020-05-19T09:09:01Z getcfcheckpt processing took 3385us
2020-05-19T09:09:01Z getcfcheckpt processing took 3543us
2020-05-19T09:09:01Z getcfcheckpt processing took 3378us
2020-05-19T09:09:01Z getcfcheckpt processing took 3450us
2020-05-19T09:09:01Z getcfcheckpt processing took 3041us
2020-05-19T09:09:01Z getcfcheckpt processing took 3561us
...
```
:arrow_right: Average subsequent lookup: 3.346ms
Test with std::unordered_map:
```
2020-05-19T08:59:53Z getcfcheckpt processing took 12205us
2020-05-19T08:59:53Z getcfcheckpt processing took 3650us
2020-05-19T08:59:53Z getcfcheckpt processing took 4660us
2020-05-19T08:59:53Z getcfcheckpt processing took 3393us
2020-05-19T08:59:53Z getcfcheckpt processing took 3669us
2020-05-19T08:59:53Z getcfcheckpt processing took 2919us
2020-05-19T08:59:53Z getcfcheckpt processing took 2565us
2020-05-19T08:59:53Z getcfcheckpt processing took 2804us
2020-05-19T08:59:53Z getcfcheckpt processing took 2348us
2020-05-19T08:59:53Z getcfcheckpt processing took 2511us
2020-05-19T08:59:53Z getcfcheckpt processing took 2264us
...
```
:arrow_right: Average subsequent lookup: 3.078ms
re-utACK 0187d4c118ab4c0f5c2d4fb180c2a8dea8ac53cf
Compared to my last review, an is_checkpoint variable is introduced for better readability, and an if statement that was removed in error is added back: https://github.com/bitcoin/bitcoin/compare/b6eba9e492e1dfb38baec894bfda1cfb4d709e5d..0187d4c118ab4c0f5c2d4fb180c2a8dea8ac53cf
```diff
@@ -30,6 +38,9 @@ class BlockFilterIndex final : public BaseIndex
     bool ReadFilterFromDisk(const FlatFilePos& pos, BlockFilter& filter) const;
     size_t WriteFilterToDisk(FlatFilePos& pos, const BlockFilter& filter);
 
+    Mutex m_cs_headers_cache;
+    std::unordered_map<uint256, uint256, FilterHeaderHasher> m_headers_cache GUARDED_BY(m_cs_headers_cache);
```
Code Review ACK 0187d4c.
I prefer this approach of caching inside the index rather than in net_processing, compared to #16442; it reduces cognitive load. I also agree with ignoring re-orgs as argued above: the probability of the event is rare enough, and the overhead cheap enough, to justify having some junk in the cache.