When fetching inputs in ConnectBlock, each input is fetched from the cache in series. A cache miss means a round trip to the disk db to fetch the outpoint and insert it into the cache. Since the db is locked from being written during ConnectTip, we can fetch all block inputs missing from the cache in parallel on multiple threads before entering ConnectBlock. Using this strategy resulted in a 10% faster IBD.
Doing IBD with 16 vcores from a local peer with default settings, stopping at height 850k:
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
branch | 17065.138 ± 117.439 | 16982.096 | 17148.181 | 1.00 |
master | 18731.509 ± 94.142 | 18731.509 | 18864.646 | 1.10 |
For later blocks, this change makes block connection even faster. Syncing from an assumeutxo snapshot at block 840k to block 850k with 15 worker threads is 26% faster with this change. With just a single worker thread, the same benchmark is 6% faster. Benchmarks and flame graphs were collected for both the 15 worker thread and the 1 worker thread runs.
I have fuzzed for over 500 million iterations with the provided fuzz harness with no issues.
This approach is heavily inspired by `CCheckQueue`, but we could not easily reuse it since it only checks for validity and doesn’t allow us to store results. So, this PR creates a new `InputFetcher` that loops through all inputs of a block on the main thread and adds their outpoints to a shared vector. Once the vector is written, the main thread and the worker threads claim ranges of outpoints from it, fetch them from the db, and push the resulting coins onto a thread-local vector. Once the threads have finished reading all inputs, the main thread loops through all thread-local vectors and inserts the results into the cache.
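The flow described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `Outpoint`, `Coin`, `Db`, `FetchInputsSketch` and the `batch` parameter are hypothetical stand-ins, a `std::map` plays the role of both the db and the cache, and per-thread result vectors indexed by thread id are used in place of true `thread_local` storage.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are not Bitcoin Core's types.
using Outpoint = std::uint64_t;
using Coin = std::uint64_t;
using Db = std::map<Outpoint, Coin>;

// Fetch all inputs missing from `cache` out of `db`, using `workers`
// extra threads plus the calling thread, then insert them into `cache`.
// Assumes nothing writes to `db` while this function runs.
void FetchInputsSketch(const Db& db, Db& cache,
                       const std::vector<Outpoint>& inputs,
                       std::size_t workers, std::size_t batch = 4)
{
    assert(batch > 0);

    // Step 1 (main thread): collect the outpoints that miss the cache.
    std::vector<Outpoint> missing;
    for (const Outpoint& op : inputs) {
        if (cache.find(op) == cache.end()) missing.push_back(op);
    }

    // Step 2: every thread claims ranges from the shared vector through an
    // atomic counter and stores what it finds in its own result vector.
    std::atomic<std::size_t> next{0};
    std::vector<std::vector<std::pair<Outpoint, Coin>>> results(workers + 1);

    auto work = [&](std::size_t id) {
        auto& local = results[id];
        for (;;) {
            const std::size_t begin = next.fetch_add(batch);
            if (begin >= missing.size()) return;
            const std::size_t end = std::min(begin + batch, missing.size());
            for (std::size_t i = begin; i < end; ++i) {
                const auto it = db.find(missing[i]); // concurrent const reads
                if (it != db.end()) local.emplace_back(it->first, it->second);
            }
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t id = 1; id <= workers; ++id) threads.emplace_back(work, id);
    work(0); // the main thread fetches too
    for (auto& t : threads) t.join();

    // Step 3 (main thread): insert all per-thread results into the cache.
    for (const auto& local : results) {
        for (const auto& [op, coin] : local) cache.emplace(op, coin);
    }
}
```

Calling `FetchInputsSketch(db, cache, inputs, 3)` fills `cache` with every input that exists in `db`; inputs absent from `db` are simply skipped. Only the main thread ever touches `cache`, so the cache itself needs no locking in this sketch.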
This PR uses the `-par` value for the number of threads, which defaults to the number of vcores on the machine or 15, whichever is lower. This is the same value used for `CCheckQueue`, so any users who have disabled multithreaded validation with `-par=1` will also have this feature disabled. This also means the maximum number of input fetching threads is capped at 15.
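As a rough illustration of the rule just described, the thread count could be modelled like this. The sketch covers only what is stated here and is not Bitcoin Core's actual option parsing: `FetchThreads`, `ParallelFetchEnabled` and `kMaxFetchThreads` are hypothetical names, and treating `par == 0` as "not set" is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>

// Cap stated in the description above (hypothetical constant name).
constexpr int kMaxFetchThreads = 15;

// Number of threads used for input fetching.
// Assumption: par == 0 means the user did not set -par.
int FetchThreads(int par, int vcores)
{
    assert(par >= 0 && vcores >= 1);
    const int requested = (par == 0) ? vcores : par;
    return std::min(requested, kMaxFetchThreads);
}

// With a single thread there is nothing to parallelize, so -par=1
// disables parallel input fetching in this model.
bool ParallelFetchEnabled(int par, int vcores)
{
    return FetchThreads(par, vcores) > 1;
}
```

Under these assumptions, a 16-vcore machine with default settings gets 15 threads, and `-par=1` disables the feature regardless of core count.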
Since `InputFetcher::FetchInputs` is blocking, a follow-up can update this to share the thread pool between `CCheckQueue` and `InputFetcher`.