When fetching inputs in ConnectBlock, each input is fetched from the cache serially, and every cache miss means a round trip to the disk db to read the outpoint and insert it into the cache. Since the db cannot be written to during ConnectTip, we can fetch all block inputs missing from the cache in parallel on multiple threads before entering ConnectBlock. Using this strategy resulted in a ~17% faster IBD (equivalently, master was ~21% slower).
Doing IBD with 16 vcores from a local peer with default settings, stopping at height 850k:
| | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| branch | 22187.488 ± 50.159 | 22152.021 | 22222.956 | 1.00 |
| master | 26865.884 ± 33.498 | 26842.197 | 26889.570 | 1.21 |
This approach is heavily inspired by CCheckQueue, but we could not easily reuse it since it only checks for validity and doesn't allow us to store results in a queue. So, this PR creates a new InputFetcher that loops through all inputs of a block on the main thread and adds their outpoints to a queue to be fetched in parallel. Worker threads pull outpoints from the queue, fetch them from the db, and push the resulting coins onto another queue. Once the main thread has finished looping through the block inputs, it pulls results from the coins queue and inserts them into the cache.
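As a rough illustration of that producer/consumer flow, here is a minimal, self-contained sketch. It uses stand-in types (`Outpoint`, `Coin`, plain `std::map`s for the disk db and the cache) rather than Bitcoin Core's real classes, and for simplicity it joins the workers before draining the results queue, so names and structure are illustrative of the idea rather than the PR's actual `InputFetcher` implementation.

```cpp
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <queue>
#include <thread>
#include <tuple>
#include <utility>
#include <vector>

// Stand-ins for COutPoint and Coin.
struct Outpoint {
    uint64_t txid;
    uint32_t n;
    bool operator<(const Outpoint& o) const { return std::tie(txid, n) < std::tie(o.txid, o.n); }
};
struct Coin { int64_t value; };

class InputFetcherSketch {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::queue<Outpoint> m_work;                      // outpoints still to be fetched
    std::queue<std::pair<Outpoint, Coin>> m_results;  // coins fetched by the workers
    bool m_no_more_work{false};
    const std::map<Outpoint, Coin>& m_db;             // stands in for the disk db

    // Worker loop: pop an outpoint, read it from the "db", push the coin back.
    void WorkerLoop() {
        while (true) {
            Outpoint out;
            {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_cv.wait(lock, [&] { return m_no_more_work || !m_work.empty(); });
                if (m_work.empty()) return;  // nothing left and no more is coming
                out = m_work.front();
                m_work.pop();
            }
            // The (simulated) disk read happens off the main thread, outside the lock.
            const auto it = m_db.find(out);
            if (it != m_db.end()) {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_results.emplace(out, it->second);
            }
        }
    }

public:
    explicit InputFetcherSketch(const std::map<Outpoint, Coin>& db) : m_db(db) {}

    // Main thread: walk the block's inputs, queue the ones missing from the
    // cache, then pull the workers' results into the cache.
    void FetchInputs(const std::vector<Outpoint>& block_inputs,
                     std::map<Outpoint, Coin>& cache,
                     unsigned num_threads)
    {
        std::vector<std::thread> workers;
        for (unsigned i = 0; i < num_threads; ++i) {
            workers.emplace_back([this] { WorkerLoop(); });
        }

        // Loop through the block's inputs on the main thread, queueing any
        // outpoint that is not already in the cache.
        for (const auto& out : block_inputs) {
            if (cache.count(out)) continue;
            {
                std::lock_guard<std::mutex> lock(m_mutex);
                m_work.push(out);
            }
            m_cv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_no_more_work = true;  // the whole block has been walked
        }
        m_cv.notify_all();
        for (auto& t : workers) t.join();

        // Workers are done; insert the fetched coins into the cache.
        std::lock_guard<std::mutex> lock(m_mutex);
        while (!m_results.empty()) {
            cache.insert(m_results.front());
            m_results.pop();
        }
    }
};
```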
This PR uses the number of cores to decide how many worker threads to create, but since the work on those threads is I/O bound, it might benefit from using a multiple of the core count. However, that would result in more memory usage and lock contention, so it's unclear what the optimal number is (a possible knob for this is sketched below).
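One way to expose that trade-off as a tunable is a simple multiplier on the core count; the helper name and the multiplier parameter below are hypothetical, not part of the PR.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: derive the number of fetch workers from the core count,
// with an optional multiplier for I/O-bound work.
unsigned WorkerThreadCount(unsigned io_multiplier = 1)
{
    // hardware_concurrency() can return 0 when it cannot be determined.
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    // A multiplier > 1 keeps more disk reads in flight while threads block on
    // I/O, at the cost of extra memory usage and lock contention.
    return cores * io_multiplier;
}
```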