validation: fetch block inputs on parallel threads #31132

pull andrewtoth wants to merge 13 commits into bitcoin:master from andrewtoth:threaded-inputs changing 10 files +476 −28
  1. andrewtoth commented at 2:40 PM on October 22, 2024: contributor

    Parts of this PR are isolated in independent smaller PRs to ease review:


    This PR parallelizes fetching all input prevouts of a block during block connection, achieving over 3x faster IBD performance in some scenarios[^1][^2][^3][^4][^5].

    Problem

    Currently, when fetching inputs in ConnectBlock, each input is fetched from the cache sequentially. A cache miss requires a round trip to the disk database to fetch the outpoint and insert it into the cache. Since the database is read-only during ConnectBlock, we can fetch all inputs of a block in parallel on multiple threads while connecting.

    Solution

    We add a ThreadPool to CoinsViewOverlay to fetch block inputs in parallel. The block is passed to the CoinsViewOverlay before entering ConnectBlock, which kicks off the worker threads to begin fetching the inputs. The cache returns fetched coins as they become available via the overridden FetchCoinFromBase method. If a coin is not available yet, the main thread also fetches coins while it waits.

    Implementation Details

    The CoinsViewOverlay implements a lock-free MPSC (Multiple Producer, Single Consumer) queue design:

    • Work Distribution: Collects all input prevouts into a queue and uses a barrier to start all worker threads simultaneously
    • Synchronization: Worker threads use an atomic counter to claim which inputs to fetch, and each input has an atomic flag to signal completion to the main thread
    • Main Thread Processing: The main thread waits for inputs in order and moves their Coin into the cacheCoins map as they become available.
    • Work Stealing: If the main thread catches up to the workers, it assists with fetching to maximize parallelism
    • Completion: All thread futures are waited on in StopFetching, which is called from any mutating method (Flush/Sync/SetBackend/Reset). The ResetGuard going out of scope ensures this happens before the block is destroyed.
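
    The design in the bullets above can be sketched as follows. This is a minimal, self-contained illustration with stand-in types (Outpoint, Coin, FetchFromDisk, Slot, and FetchAll are all hypothetical names, and the barrier/ResetGuard details are omitted); the real implementation lives in CoinsViewOverlay:

```cpp
#include <atomic>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real types: the actual code fetches Coin
// values for COutPoint keys from the backing CCoinsView.
using Outpoint = int;
using Coin = int;

// Stand-in for the slow, read-only disk lookup.
static Coin FetchFromDisk(Outpoint o) { return o * 2; }

struct Slot {
    std::optional<Coin> coin;
    std::atomic<bool> done{false}; // per-input completion flag
};

// Workers claim inputs via an atomic counter; the main thread consumes the
// results in order and "steals" work whenever it catches up to the workers.
std::vector<Coin> FetchAll(const std::vector<Outpoint>& outpoints, int workers)
{
    std::vector<Slot> slots(outpoints.size());
    std::atomic<size_t> next{0}; // lock-free work claiming

    const auto fetch_one = [&](size_t i) {
        slots[i].coin = FetchFromDisk(outpoints[i]);
        slots[i].done.store(true, std::memory_order_release);
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            size_t i;
            while ((i = next.fetch_add(1)) < outpoints.size()) fetch_one(i);
        });
    }

    std::vector<Coin> results;
    results.reserve(outpoints.size());
    for (size_t i = 0; i < outpoints.size(); ++i) {
        // Wait for input i; help with unclaimed work in the meantime.
        while (!slots[i].done.load(std::memory_order_acquire)) {
            if (const size_t j = next.fetch_add(1); j < outpoints.size()) {
                fetch_one(j);
            } else {
                std::this_thread::yield();
            }
        }
        results.push_back(*slots[i].coin);
    }
    for (auto& t : pool) t.join();
    return results;
}
```

    Here the main thread takes each fetched coin in input order, mirroring how the real overlay moves coins into the cacheCoins map as they become available; the atomic fetch_add makes work claiming lock-free, and the per-slot done flag is the only synchronization between a worker and the main thread.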

    Safety and Correctness

    • The CoinsViewOverlay works on a block that has not been fully validated, but it does not interfere with or modify any of the validation performed during ConnectBlock
    • It simply fetches inputs in parallel, which must be fetched before a transaction can be validated anyway
    • Invalid blocks: If an invalid block is mined, the temporary cache is reset without being flushed. This is an improvement over the current behavior, where existing inputs pulled through for an invalid block are inserted into the main CoinsTip() cache

    Performance

    Benchmarks show over 3x faster IBD in a cloud environment with network-attached storage[^1], 3x faster IBD on an M4 Mac, and up to 46% faster with directly attached storage[^2][^3][^4][^5]. Parallelizing the expensive disk lookups provides a significant speedup.

    Flamegraphs show how the execution profile changes.

    Credits

    Inspired by this comment.

    Resolves #34121.

    [^1]: #31132 (comment)
    [^2]: #31132#pullrequestreview-3515011880
    [^3]: #31132 (comment)
    [^4]: #31132#pullrequestreview-3347436866
    [^5]: #31132 (comment)

  2. DrahtBot commented at 2:40 PM on October 22, 2024: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.


    Reviews

    See the guideline for information on the review process.

    If your review is incorrectly listed, please copy-paste `<!--meta-tag:bot-skip-->` into the comment that the bot should ignore.


    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #35078 (validation: merge PeekCoin into GetCoin by l0rinc)
    • #34320 (coins: remove redundant and confusing CCoinsViewDB::HaveCoin by l0rinc)
    • #34132 (coins: drop error catcher, centralize fatal read handling by l0rinc)
    • #28690 (build: Introduce internal kernel library by sedited)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.


  3. DrahtBot added the label Validation on Oct 22, 2024
  4. andrewtoth force-pushed on Oct 22, 2024
  5. DrahtBot commented at 2:45 PM on October 22, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/31894441286</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  6. DrahtBot added the label CI failed on Oct 22, 2024
  7. andrewtoth renamed this:
    validation: fetch block inputs parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads ~17% faster IBD
    on Oct 22, 2024
  8. andrewtoth force-pushed on Oct 22, 2024
  9. andrewtoth force-pushed on Oct 22, 2024
  10. andrewtoth force-pushed on Oct 22, 2024
  11. DrahtBot removed the label CI failed on Oct 22, 2024
  12. in src/inputfetcher.h:151 in e9e23b59f8 outdated
     146 | +        : m_batch_size(batch_size)
     147 | +    {
     148 | +        m_worker_threads.reserve(worker_thread_count);
     149 | +        for (size_t n = 0; n < worker_thread_count; ++n) {
     150 | +            m_worker_threads.emplace_back([this, n]() {
     151 | +                util::ThreadRename(strprintf("inputfetch.%i", n));
    


    l0rinc commented at 10:26 AM on October 23, 2024:

    Q: Is this a leftover hack for non-owning LevelDB threads, or is this really the best way to name threads in a cross-platform way?


    andrewtoth commented at 1:49 PM on October 23, 2024:

    Unsure, copied from CScriptCheck. If the state of the art of thread naming has advanced since that was written, please let me know!


    sipa commented at 1:54 PM on October 23, 2024:

    The C++ standard library, as far as I know, has no way of renaming threads at all. src/util/threadnames.{h,cpp} is our wrapper around the various platform-dependent ways of doing so on supported systems.
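
    As a rough illustration of what such a wrapper does under the hood (RenameThisThread is a hypothetical name; the real helper is util::ThreadRename in src/util/threadnames.{h,cpp} and also records the name for logging):

```cpp
#include <string>
#if defined(__linux__) || defined(__APPLE__)
#include <pthread.h>
#endif

// Minimal sketch of a cross-platform thread-rename helper; each platform
// exposes a different non-standard API for this.
static void RenameThisThread(const std::string& name)
{
#if defined(__linux__)
    // Linux truncates thread names to 15 characters plus the NUL.
    pthread_setname_np(pthread_self(), name.substr(0, 15).c_str());
#elif defined(__APPLE__)
    // macOS can only rename the calling thread.
    pthread_setname_np(name.c_str());
#else
    (void)name; // no known API: silently do nothing
#endif
}
```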


    l0rinc commented at 2:03 PM on October 23, 2024:

    Thank you, please resolve the comment.

  13. in src/inputfetcher.h:189 in e9e23b59f8 outdated
     184 | +                    continue;
     185 | +                }
     186 | +
     187 | +                buffer.emplace_back(outpoint);
     188 | +                if (buffer.size() == m_batch_size) {
     189 | +                    Add(std::move(buffer));
    


    l0rinc commented at 11:29 AM on October 23, 2024:

    We're mostly creating the buckets randomly here, so each thread will need access to basically all of the keys. Since we have an idea of how LevelDB works here (i.e. Sorted String Table), we could likely improve cache locality (would likely be most beneficial on HDDs) and minimize lock contention by splitting the reads by sorted transactions instead.


    andrewtoth commented at 1:46 PM on October 23, 2024:

    I don't think there is any lock contention here if we are doing multithreaded reading?

    I also think what you're suggesting would add a lot more complexity to this PR, when this is "good enough".


    l0rinc commented at 2:02 PM on October 23, 2024:

    This might be as simple as sorting by tx before we create the buckets.


    andrewtoth commented at 2:12 PM on October 23, 2024:

    If a benchmark shows that it is better, then great!
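
    A sketch of what "sorting by tx before we create the buckets" could look like, with illustrative types (OutPointKey and MakeBatches are hypothetical; the real key is a 256-bit txid plus an output index):

```cpp
#include <algorithm>
#include <vector>

struct OutPointKey {
    int txid;  // stands in for the real 256-bit txid
    int index; // output index within the transaction
};

// Sort the outpoints by key before slicing them into batches, so each
// worker's reads land close together in LevelDB's sorted SST files.
std::vector<std::vector<OutPointKey>> MakeBatches(std::vector<OutPointKey> outpoints,
                                                  size_t batch_size)
{
    std::sort(outpoints.begin(), outpoints.end(), [](const auto& a, const auto& b) {
        return a.txid != b.txid ? a.txid < b.txid : a.index < b.index;
    });
    std::vector<std::vector<OutPointKey>> batches;
    for (size_t i = 0; i < outpoints.size(); i += batch_size) {
        batches.emplace_back(outpoints.begin() + i,
                             outpoints.begin() + std::min(i + batch_size, outpoints.size()));
    }
    return batches;
}
```

    Sorting groups lookups for the same or neighbouring txids into the same batch, which should help cache locality most on spinning disks.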

  14. in src/inputfetcher.h:73 in e9e23b59f8 outdated
      68 | +        std::vector<std::pair<COutPoint, Coin>> pairs{};
      69 | +        do {
      70 | +            std::vector<COutPoint> outpoints{};
      71 | +            outpoints.reserve(m_batch_size);
      72 | +            {
      73 | +                WAIT_LOCK(m_mutex, lock);
    


    l0rinc commented at 11:30 AM on October 23, 2024:

    I'm wondering if we really need to (b)lock here or whether we could create a read-only snapshot instead and avoid stalling?


    andrewtoth commented at 1:34 PM on October 23, 2024:

    This is blocking so we can access the queue of shared outpoints that we need to fetch from. It is not blocking for LevelDB, we access the db once we are out of the critical section.


    l0rinc commented at 2:04 PM on October 23, 2024:

    As mentioned before, why do we need shared outpoints here?


    andrewtoth commented at 2:15 PM on October 23, 2024:

    The main thread adds all outpoints to a global vector, which all workers will fetch their work from.


    andrewtoth commented at 12:56 AM on December 4, 2024:

    We no longer need to block on the shared outpoints vector. We write to it once in the main thread before notifying the other threads and then only read from it afterwards.

  15. in src/inputfetcher.h:29 in e9e23b59f8 outdated
      24 | + * onto the queue, where they are fetched by N worker threads. The resulting
      25 | + * coins are pushed onto another queue after they are read from disk. When
       26 | + * the main thread is done adding outpoints, it starts writing the results of the read
      27 | + * queue to the cache.
      28 | + */
      29 | +class InputFetcher
    


    l0rinc commented at 11:39 AM on October 23, 2024:

    I know it's not a trivial request, but can we add a test for this class which fetches everything in parallel and sequentially and asserts that the results are equivalent? And preferably also a benchmark, like we have for https://github.com/bitcoin/bitcoin/blob/master/src/bench/checkqueue.cpp. I would gladly help here, if needed.


    andrewtoth commented at 1:35 PM on October 23, 2024:

    Yes, I can add these but I am waiting for some more conceptual support.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Added tests and a benchmark. The test has random parameters, one combination of which ends up using a single worker thread.


    andrewtoth commented at 8:59 PM on November 16, 2024:

    Also added fuzz harness
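
    The core of such an equivalence test can be sketched like this (SlowFetch, FetchSequential, and FetchParallel are hypothetical stand-ins; the real test exercises InputFetcher against a coins view):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Stand-in for a slow, read-only disk lookup.
static int SlowFetch(int key) { return key * key + 1; }

// Reference implementation: fetch everything on one thread.
std::vector<int> FetchSequential(const std::vector<int>& keys)
{
    std::vector<int> out;
    out.reserve(keys.size());
    for (const int k : keys) out.push_back(SlowFetch(k));
    return out;
}

// Parallel implementation under test: workers claim indices atomically and
// write to disjoint slots, so the result order matches the sequential one.
std::vector<int> FetchParallel(const std::vector<int>& keys, int workers)
{
    std::vector<int> out(keys.size());
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            size_t i;
            while ((i = next.fetch_add(1)) < keys.size()) out[i] = SlowFetch(keys[i]);
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

    The test then asserts FetchParallel(keys, n) == FetchSequential(keys) for several worker counts, including one.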

  16. in src/inputfetcher.h:145 in e9e23b59f8 outdated
     140 | +    }
     141 | +
     142 | +
     143 | +public:
     144 | +    //! Create a new input fetcher
     145 | +    explicit InputFetcher(size_t batch_size, size_t worker_thread_count) noexcept
    


    l0rinc commented at 11:46 AM on October 23, 2024:

    For consistency (see: explicit CCheckQueue(unsigned int batch_size, int worker_threads_num)) and simplicity (m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)}, and to follow modern C++ directions where sizes seem to be preferred as signed values, see: #30927 (review)), please consider making these int(s) instead.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Done.

  17. in src/validation.cpp:6251 in e9e23b59f8 outdated
    6247 | @@ -6243,6 +6248,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6248 |  
    6249 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6250 |      : m_script_check_queue{/*batch_size=*/128, options.worker_threads_num},
    6251 | +      m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)},
    


    l0rinc commented at 12:07 PM on October 23, 2024:

    Unlike the script checks, these fetches aren't CPU bound, so there is no reason to use the number of CPUs as the number of parallel threads. I don't know if we care about HDD performance here or not, but we can likely find a multiplier that makes this better for both SSD and HDD.

    Quoting from https://pkolaczk.github.io/disk-parallelism:

    It was surprising to me that even 64 threads, which are far more than the number of CPU cores (4 physical, 8 virtual), still improved the performance. I guess that with requests of such a small size to such a fast storage, you need to submit really many of them to keep the SSD busy.

    If we can provide a benchmark for this usecase we can likely find an optimal multiplier here - I won't nack but this part is very important for me.


    andrewtoth commented at 1:43 PM on October 23, 2024:

    Adding more threads will require more memory, which is one reason to not use many more.

    I did a benchmark using 64 threads on the same 16 vcore machine, and it was slightly slower :/


    l0rinc commented at 2:00 PM on October 23, 2024:

    4x may be too much to begin with, but 1.5-2x sounds plausible, I'll help with benchmarking this once my current batches finish.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Added a benchmark to experiment with these.

  18. in src/inputfetcher.h:188 in e9e23b59f8 outdated
     183 | +                if (cache.HaveCoinInCache(outpoint)) {
     184 | +                    continue;
     185 | +                }
     186 | +
     187 | +                buffer.emplace_back(outpoint);
     188 | +                if (buffer.size() == m_batch_size) {
    


    l0rinc commented at 12:10 PM on October 23, 2024:

    Would it be possible to create the batch sizes dynamically? Since the number of missing values differs for every block (and every dbcache size), it may make more sense to calculate the optimal split instead of using the arbitrary value of 128. Coroutines might alleviate this problem.


    andrewtoth commented at 1:42 PM on October 23, 2024:

    I'm not sure it would warrant the complexity; I think this batch size is "good enough" for now. In a follow-up we could add config options to experiment with whether there really are more optimal settings.


    andrewtoth commented at 3:05 PM on November 7, 2024:

    I changed the batch size to be number of workers.

  19. in src/inputfetcher.h:197 in e9e23b59f8 outdated
     192 | +                }
     193 | +            }
     194 | +            txids.insert(tx->GetHash());
     195 | +        }
     196 | +
     197 | +        Add(std::move(buffer));
    


    l0rinc commented at 12:11 PM on October 23, 2024:

    Do we always have leftovers, or will this process the last batch twice (or process an empty one) if the total happens to be divisible by batch_size?


    andrewtoth commented at 1:37 PM on October 23, 2024:

    It won't process twice, but it could pass in an empty vector, which is ignored if you look at Add implementation.

  20. in src/inputfetcher.h:65 in e9e23b59f8 outdated
      60 | +
      61 | +    std::vector<std::thread> m_worker_threads;
      62 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      63 | +
      64 | +    /** Internal function that does the fetching from disk. */
      65 | +    void Loop() noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 12:35 PM on October 23, 2024:

    We're basically mimicking RocksDB's MultiGet here - but prewarming the cache instead in separate get requests, since we can't really access LevelDB's internals.

    Since splitting into buckets isn't trivial and since MultiGet seems to rely on C++20 coroutines (which wasn't available in 2012 when CCheckQueue was written), I'm wondering how much simpler this fetching would be if we had lightweight suspendible threads instead: https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html#multiget


    andrewtoth commented at 1:40 PM on October 23, 2024:

    I think it would be similar in complexity, we would still need all the locking mechanisms to prevent multithreaded access.

    What would really be great is if we had a similar construction to Rust's std::sync::mpsc.


    l0rinc commented at 1:58 PM on October 23, 2024:

    Can you tell me why we need to prevent multithreaded access exactly? We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread, right?

    How would mpsc solve this better? Do you think we need work stealing to make it perfectly parallel? Wouldn't coroutines already achieve the same?


    sipa commented at 2:07 PM on October 23, 2024:

    I haven't yet experimented with them, but as far as I understand it, coroutines are just a programming paradigm, not magic; they don't do anything of their own, besides making things that were already possible easier to write. In particular, you still need a thread pool or some mechanism for scheduling how to run them.


    andrewtoth commented at 2:09 PM on October 23, 2024:

    We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread

    If the vectors are thread local, then how can the main thread access them at the end to write them? We also want to be writing throughout while the workers are fetching, not just at the end.

    How would mpsc solve this better?

    Instead of each worker thread having a local queue of results, which they then append to the global results queue, they could just push each result to the channel individually. The main thread could just pull results off the channel as they arrive, instead of waiting to be awoken by a worker thread that appended all its results to the global queue.

    work stealing

    That is a concept for async rust, or std::async::mpsc. We can do all this without introducing an async runtime. But, this is getting off topic.


    l0rinc commented at 3:15 PM on October 23, 2024:

    coroutines are just programming paradigm, not magic

    That's also what I was counting on! :D

    In RocksDB they have high and low priority work (I assume that's just added to the front or the back of a background work deque) – this could align well with @furszy's suggestion for mixing different kinds of background work units.

    I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime. Combined with an appropriate scheduling mechanism (such as a thread pool), we could maximize both CPU and IO usage, if I'm not mistaken. Instead of each thread handling just one task, it could suspend a coroutine while waiting on IO (e.g., a database fetch) and resume it later, effectively maximizing CPU and IO work without needing to know the exact details of the work.

    If the vectors are thread local

    The vector would still be global, but each thread would only access a single bucket (i.e. global vector of vectors, with each thread from the pool writing only to vector[thread_id], which contains a vector of fetched coins). When all the work is finished, we'd iterate over the global vector and merge the results into the cache on a single thread. As mentioned, sorting the outpoints before fetching could help improve data locality and reduce lock contention, and the coroutines above would help with work stealing, ensuring that all threads finish roughly at the same time.

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.


    sipa commented at 3:28 PM on October 23, 2024:

    I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime.

    That needs async I/O, and is unrelated to coroutines, as far as I understand it. Coroutines just help with keeping track of what to do when the reads come back inside rocksdb.

    As long as LevelDB (or whatever database engine we use) internally does not use async I/O, there will be one (waiting) thread per parallel outstanding read request from the database.


    HowHsu commented at 4:06 PM on June 28, 2025:

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.

    As far as I know, the advantage of coroutines over threads is faster context switching, since it doesn’t go through the operating system kernel. This advantage only becomes apparent under extremely high concurrency, such as hundreds of thousands of concurrent tasks. Using coroutines does not eliminate the need for synchronization mechanisms where they are inherently required.

  21. l0rinc changes_requested
  22. l0rinc commented at 12:44 PM on October 23, 2024: contributor

    Concept ACK

    I'm still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I'd be interested in how coroutines would perform here instead of trying to find the best batching size manually.

  23. furszy commented at 2:25 PM on October 23, 2024: member

    Cool idea.

    Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks? The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

    This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

  24. andrewtoth commented at 2:48 PM on October 23, 2024: contributor

    implementing a general-purpose thread pool shared among the validation checks?

    Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

    #26966 introduces such structure inside https://github.com/bitcoin/bitcoin/commit/401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

    Concept ACK!

  25. l0rinc commented at 6:12 PM on October 24, 2024: contributor

    Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

    Summary
    'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
     1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0'
    

    f278ca4ec3 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours)
    e9e23b59f8 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)


    ~So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.~ Maybe we could add a new config option (iothreads or iothreadmultiplier or something). The defaults should likely depend on whether it's an SSD or HDD.


    Edit:

    <details> <summary>Previous results</summary>

    "command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [39993.343777768874],
    
    "command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40929.84310861388],
    

    </details>

    I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

    "command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40579.00445769842],
    

    So it's a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it's not faster.


    Edit:

    Running it on a HDD with a low dbcache value reproduces the original result:

    <details> <summary>benchmark</summary>

    hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    

    </details>

    8207d372b2 validation: fetch block inputs in parallel
    92fc718592 coins: allow emplacing non-dirty coins internally
    Summary
      'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
        1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    
  26. andrewtoth commented at 6:31 PM on October 24, 2024: contributor

    So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.

    I'm not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It is also syncing from untrusted network peers, so there is some variance that could account for the 2% difference.

  27. andrewtoth force-pushed on Oct 24, 2024
  28. andrewtoth force-pushed on Oct 24, 2024
  29. DrahtBot commented at 7:25 PM on October 24, 2024: contributor

    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/32027275494</sub>

  30. DrahtBot added the label CI failed on Oct 24, 2024
  31. DrahtBot removed the label CI failed on Oct 25, 2024
  32. andrewtoth force-pushed on Oct 26, 2024
  33. andrewtoth force-pushed on Oct 26, 2024
  34. DrahtBot added the label CI failed on Oct 26, 2024
  35. DrahtBot commented at 6:52 PM on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/32107893176</sub>

  36. andrewtoth force-pushed on Oct 26, 2024
  37. andrewtoth force-pushed on Oct 26, 2024
  38. andrewtoth force-pushed on Oct 26, 2024
  39. andrewtoth force-pushed on Oct 26, 2024
  40. andrewtoth force-pushed on Oct 27, 2024
  41. andrewtoth force-pushed on Oct 27, 2024
  42. andrewtoth marked this as a draft on Oct 27, 2024
  43. andrewtoth force-pushed on Oct 27, 2024
  44. andrewtoth force-pushed on Oct 27, 2024
  45. andrewtoth force-pushed on Oct 27, 2024
  46. andrewtoth force-pushed on Oct 27, 2024
  47. andrewtoth force-pushed on Oct 27, 2024
  48. andrewtoth force-pushed on Oct 27, 2024
  49. andrewtoth force-pushed on Oct 27, 2024
  50. andrewtoth force-pushed on Nov 7, 2024
  51. andrewtoth force-pushed on Nov 7, 2024
  52. DrahtBot removed the label CI failed on Nov 7, 2024
  53. andrewtoth force-pushed on Nov 7, 2024
  54. andrewtoth marked this as ready for review on Nov 7, 2024
  55. andrewtoth commented at 3:05 PM on November 7, 2024: contributor

    @furszy I tried to switch to using a shared threadpool, but it is much slower that way. We need a way to share state between threads for this, instead of just scheduling tasks. I suppose the generic threadpool is great for scheduling independent tasks like indexing individual blocks, but it is not well optimized for quickly pulling outpoints off a shared vector.

    From #29386:

    I just noticed the comment in the code:

    For each thread a thread stack needs to be allocated. By default on Linux, threads take up 8MiB for the thread stack on a 64-bit system, and 4MiB in a 32-bit system.

    Only 8MiB of Virtual Memory is allocated, which doesn't really mean anything. Due to the CoW mechanism, only the parts of the stack that are actually used will be allocated as Physical Memory, which is the one that actually matters.

    So, I don't think it matters much to have an extra threadpool owned by the input fetcher.

    I think this is ready for more review. I also added tests and a benchmark.

  56. andrewtoth force-pushed on Nov 7, 2024
  57. andrewtoth force-pushed on Nov 9, 2024
  58. andrewtoth commented at 3:28 PM on November 13, 2024: contributor

    For later blocks where cache misses are much more common, this change has an even bigger impact. This benchmark report shows a 40% speedup measuring from blocks 840k to 850k. Also, compare flamegraphs of master and this branch, where the latter has 15 worker threads fetching coins from disk. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11798124132/index.html

  59. andrewtoth commented at 3:35 AM on November 16, 2024: contributor

    Even with just 2 worker threads, there is significant (~30%) speed improvement for syncing recent blocks. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11865650166/index.html

  60. andrewtoth force-pushed on Nov 16, 2024
  61. andrewtoth force-pushed on Nov 16, 2024
  62. andrewtoth force-pushed on Nov 16, 2024
  63. andrewtoth force-pushed on Nov 16, 2024
  64. DrahtBot commented at 7:45 PM on November 16, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33086747731</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  65. DrahtBot added the label CI failed on Nov 16, 2024
  66. andrewtoth force-pushed on Nov 16, 2024
  67. andrewtoth force-pushed on Nov 16, 2024
  68. DrahtBot removed the label CI failed on Nov 16, 2024
  69. andrewtoth force-pushed on Nov 17, 2024
  70. in src/test/fuzz/inputfetcher.cpp:32 in 2bd5f0f03b outdated
      27 | +    CBlock block;
      28 | +    Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
      29 | +
      30 | +    const auto txs{fuzzed_data_provider.ConsumeIntegralInRange<uint32_t>(1,
      31 | +        std::numeric_limits<uint32_t>::max())};
      32 | +    for (uint32_t i{0}; i < txs; ++i) {
    


    dergoegge commented at 2:16 PM on November 18, 2024:

    This will create very long-running inputs (e.g. txs = std::numeric_limits<uint32_t>::max()).

        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), N) {
    

    or

        LIMITED_WHILE(fuzzed_data_provider.remaining_bytes(), N) {
    

    andrewtoth commented at 5:04 PM on November 18, 2024:

    Thanks, done!

  71. in src/test/fuzz/inputfetcher.cpp:36 in 2bd5f0f03b outdated
      31 | +        std::numeric_limits<uint32_t>::max())};
      32 | +    for (uint32_t i{0}; i < txs; ++i) {
      33 | +        CMutableTransaction tx;
      34 | +
      35 | +        const auto inputs{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
      36 | +        for (uint32_t j{0}; j < inputs; ++j) {
    


    dergoegge commented at 2:17 PM on November 18, 2024:

    Same as above, this will create long-running inputs and might even run out of memory?

  72. andrewtoth force-pushed on Nov 18, 2024
  73. andrewtoth force-pushed on Nov 18, 2024
  74. DrahtBot commented at 3:34 PM on November 18, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33143571653</sub>


  75. DrahtBot added the label CI failed on Nov 18, 2024
  76. andrewtoth force-pushed on Nov 18, 2024
  77. andrewtoth force-pushed on Nov 18, 2024
  78. andrewtoth force-pushed on Nov 18, 2024
  79. andrewtoth force-pushed on Nov 18, 2024
  80. andrewtoth force-pushed on Nov 18, 2024
  81. andrewtoth force-pushed on Nov 18, 2024
  82. andrewtoth force-pushed on Nov 18, 2024
  83. DrahtBot removed the label CI failed on Nov 18, 2024
  84. andrewtoth force-pushed on Nov 19, 2024
  85. in src/coins.h:420 in 3a4af55071 outdated
     415 | @@ -416,13 +416,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     416 |      void AddCoin(const COutPoint& outpoint, Coin&& coin, bool possible_overwrite);
     417 |  
     418 |      /**
     419 | -     * Emplace a coin into cacheCoins without performing any checks, marking
     420 | -     * the emplaced coin as dirty.
     421 | +     * Emplace a coin into cacheCoins without performing any checks, optionally
     422 | +     * marking the emplaced coin as dirty.
    


    sedited commented at 2:38 PM on November 19, 2024:

    Should this rather say "optionally marking the emplaced coin as not dirty", since the default is always dirty?


    andrewtoth commented at 6:53 PM on November 20, 2024:

    I'm not sure that's the best though, since we do not mark a coin as not dirty. That is the default state.

    What about "marking the coin as dirty unless set_dirty is set to false"?


    sedited commented at 7:15 PM on November 20, 2024:

    That sounds good to me :+1:


    andrewtoth commented at 9:56 PM on November 20, 2024:

    Done.

  86. andrewtoth force-pushed on Nov 19, 2024
  87. andrewtoth force-pushed on Nov 20, 2024
  88. andrewtoth force-pushed on Nov 20, 2024
  89. DrahtBot added the label CI failed on Nov 20, 2024
  90. DrahtBot commented at 6:46 PM on November 20, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33279820062</sub>


  91. andrewtoth force-pushed on Nov 20, 2024
  92. DrahtBot removed the label CI failed on Nov 20, 2024
  93. andrewtoth force-pushed on Nov 21, 2024
  94. DrahtBot commented at 5:12 PM on November 21, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33335042693</sub>


  95. DrahtBot added the label CI failed on Nov 21, 2024
  96. andrewtoth force-pushed on Nov 21, 2024
  97. DrahtBot removed the label CI failed on Nov 21, 2024
  98. andrewtoth force-pushed on Nov 22, 2024
  99. andrewtoth force-pushed on Nov 24, 2024
  100. in src/test/inputfetcher_tests.cpp:55 in 3c201bcffc outdated
      50 | +    const auto cores{GetNumCores()};
      51 | +    const auto num_txs{m_rng.randrange(cores * 10)};
      52 | +    const auto block{CreateBlock(num_txs)};
      53 | +    const auto batch_size{m_rng.randrange<int32_t>(block.vtx.size() * 2)};
      54 | +    const auto worker_threads{m_rng.randrange(cores * 2)};
      55 | +    InputFetcher fetcher{batch_size, worker_threads};
    


    ismaelsadeeq commented at 8:31 PM on November 26, 2024:

    In 3c201bcffc1d7e382e8afa9a88750a4c261c1cf8 "tests: add inputfetcher tests" You can set this up in InputFetcherTest so that you don't have to repeat it in the rest of the tests.


    andrewtoth commented at 12:58 AM on December 4, 2024:

    Done.

  101. in src/bench/inputfetcher.cpp:15 in 2349ac7d60 outdated
      10 | +#include <primitives/block.h>
      11 | +#include <serialize.h>
      12 | +#include <streams.h>
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto QUEUE_BATCH_SIZE{128};
    


    ismaelsadeeq commented at 8:33 PM on November 26, 2024:

    In 2349ac7d6071746a80223358bce0d5e556b277d7 "bench: add inputfetcher bench" How did you select this batch size?


    andrewtoth commented at 12:58 AM on December 4, 2024:

    This is the hardcoded batch size used in CheckQueue. Not sure why that was selected, but I deferred to previous choices.


    l0rinc commented at 3:45 PM on September 29, 2025:

    I would prefer retesting those assumptions (I don't even think we need a batch here)


    andrewtoth commented at 7:06 PM on October 3, 2025:

    Removed the batch size 🎉

  102. in src/bench/inputfetcher.cpp:19 in 2349ac7d60 outdated
      14 | +
      15 | +static constexpr auto QUEUE_BATCH_SIZE{128};
      16 | +static constexpr auto DELAY{2ms};
      17 | +
      18 | +//! Simulates a DB by adding a delay when calling GetCoin
      19 | +class DelayedCoinsView : public CCoinsView
    


    ismaelsadeeq commented at 8:47 PM on November 26, 2024:

    In 2349ac7d6071746a80223358bce0d5e556b277d7 "bench: add inputfetcher bench" nit: it would be nice if we had block 413567's input data to read, so that we don't have to simulate this.


    andrewtoth commented at 12:59 AM on December 4, 2024:

    We're reading the previous outpoints of that block's inputs, which are in many other previous blocks. So, not sure this is feasible.


    andrewtoth commented at 4:09 PM on November 30, 2025:

    I now use a leveldb and add mock input data before the benchmark, so it's more of a real-world benchmark now :+1: . Thanks!

  103. ismaelsadeeq commented at 8:55 PM on November 26, 2024: member

    Concept ACK This is nice. Although I have not yet benchmarked this branch, I also like @furszy's idea of having a general-purpose thread pool.

    I just have one test improvement comment, question and a nit after first pass of the PR

  104. DrahtBot added the label Needs rebase on Dec 4, 2024
  105. andrewtoth force-pushed on Dec 4, 2024
  106. andrewtoth commented at 1:02 AM on December 4, 2024: contributor

    Rebased. Since #30039, reading inputs is much faster, so the effect of this change is somewhat less significant (17% -> 10%). It's still a significant speedup, though, so still worth it, especially for the worst case where the cache is completely empty, like on startup or right after it gets flushed due to size.

    It is also refactored significantly. The main thread now writes everything before notifying threads, and then joins in working. This lets us do significantly less work in the critical section and parallelize more checks.

  107. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads 10% faster IBD
    on Dec 4, 2024
  108. andrewtoth force-pushed on Dec 4, 2024
  109. DrahtBot commented at 1:29 AM on December 4, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33884531020</sub>


  110. DrahtBot added the label CI failed on Dec 4, 2024
  111. andrewtoth force-pushed on Dec 4, 2024
  112. DrahtBot removed the label Needs rebase on Dec 4, 2024
  113. DrahtBot removed the label CI failed on Dec 4, 2024
  114. DrahtBot added the label Needs rebase on Dec 5, 2024
  115. andrewtoth force-pushed on Dec 5, 2024
  116. DrahtBot removed the label Needs rebase on Dec 5, 2024
  117. in src/validation.cpp:3198 in b2da764446 outdated
    3194 | @@ -3195,6 +3195,8 @@ bool Chainstate::ConnectTip(BlockValidationState& state, CBlockIndex* pindexNew,
    3195 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3196 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3197 |      {
    3198 | +        m_chainman.GetInputFetcher().FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    


    l0rinc commented at 9:26 AM on May 11, 2025:

    Can we let the objects do the job instead of querying their internals and doing it ourselves ("tell, don't ask"):

            m_chainman.FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    

    andrewtoth commented at 2:20 PM on October 2, 2025:

    I tried to mimic the script validation like GetCheckQueue. But, I guess this is different enough. Will change next time I push.


    andrewtoth commented at 12:00 AM on October 4, 2025:

    Done.

  118. DrahtBot added the label CI failed on Jun 2, 2025
  119. maflcko commented at 7:02 AM on June 10, 2025: member

    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe a different limit is being hit?

  120. HowHsu commented at 4:42 PM on June 26, 2025: contributor

    Hi folks, this looks great: if all the prevout coins of all transactions of a block are loaded in advance, then the optimization in #32791 makes sense.

  121. sedited commented at 9:09 AM on September 17, 2025: contributor

    What's the status here?

  122. andrewtoth force-pushed on Sep 17, 2025
  123. andrewtoth force-pushed on Sep 17, 2025
  124. andrewtoth force-pushed on Sep 17, 2025
  125. andrewtoth commented at 8:38 PM on September 17, 2025: contributor

    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe it is a different limit hit?

    Thanks, I added -par=1 to all nodes spawned in feature_proxy.py in 6980852416040bdddf111df3cea3ec50639f010a. That test spawns lots of nodes, and block validation is not relevant to it.

    What's the status here?

    Rebased to fix silent conflicts and added the fix for feature_proxy.py.

  126. andrewtoth commented at 10:57 PM on September 21, 2025: contributor

    I benchmarked the latest branch with default dbcache up to 912683. Results are a speedup of 14% - 5:07 vs 5:49.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:--|--:|--:|--:|--:|
    | `echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683` | 18430.672 ± 19.856 | 18416.631 | 18444.712 | 1.00 |
    | `echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683` | 20937.219 ± 62.635 | 20892.929 | 20981.508 | 1.14 ± 0.00 |
  127. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads 10% faster IBD
    validation: fetch block inputs on parallel threads >10% faster IBD
    on Sep 21, 2025
  128. maflcko removed the label CI failed on Sep 22, 2025
  129. andrewtoth commented at 5:18 PM on September 23, 2025: contributor

    Did the same benchmark with 5000 dbcache and there is a 6% speedup :rocket: - 4:27 vs 4:44. Even with far fewer cache misses this change is still a benefit, and will continue to improve block connection speed as the blockchain and utxo set get bigger.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:--|--:|--:|--:|--:|
    | `echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000` | 16021.047 ± 5.892 | 16016.881 | 16025.213 | 1.00 |
    | `echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000` | 17057.947 ± 42.032 | 17028.226 | 17087.668 | 1.06 ± 0.00 |
  130. in test/functional/feature_proxy.py:141 in 688c03597a outdated
     135 | @@ -136,6 +136,9 @@ def setup_nodes(self):
     136 |          if self.have_unix_sockets:
     137 |              args[5] = ['-listen', f'-proxy=unix:{socket_path}']
     138 |              args[6] = ['-listen', f'-onion=unix:{socket_path}']
     139 | +        # Keep validation threads low to avoid CI thread/pid limits.
     140 | +        # Ensure even empty arg lists get '-par=1'.
     141 | +        args = [a + ['-par=1'] if a else ['-par=1'] for a in args]
    


    maflcko commented at 8:16 PM on September 24, 2025:

    Seems a bit odd to have the number of nodes in a test influence whether or not the test has to be edited to remove or add -par=1 everywhere. Would it not be easier to just globally set -par=2 for all functional tests?

    diff --git a/test/functional/test_framework/util.py b/test/functional/test_framework/util.py
    index e5a5938f07..42bb213dd3 100644
    --- a/test/functional/test_framework/util.py
    +++ b/test/functional/test_framework/util.py
    @@ -459,6 +459,7 @@ def write_config(config_path, *, n, chain, extra_config="", disable_autoconnect=
             f.write("printtoconsole=0\n")
             f.write("natpmp=0\n")
             f.write("shrinkdebugfile=0\n")
    +        f.write("par=2\n")
             # To improve SQLite wallet performance so that the tests don't timeout, use -unsafesqlitesync
             f.write("unsafesqlitesync=1\n")
             if disable_autoconnect:
    

    andrewtoth commented at 9:01 PM on September 24, 2025:

    Yes, I wondered if that would be more invasive to other tests though.


    maflcko commented at 7:42 AM on September 25, 2025:

    It disables the auto-detection for all functional tests by default, which I can't really find a downside to. Also, it removes idle "spam" threads while debugging (gdb and other tools will display less script check threads), which also seems beneficial to have?


    andrewtoth commented at 4:09 PM on September 27, 2025:

    Makes sense. Done here #33485.

  131. in src/inputfetcher.h:145 in 688c03597a outdated
     140 | +                        m_in_flight_outpoints_count -= m_last_outpoint_index;
     141 | +                        m_last_outpoint_index = 0;
     142 | +                        break;
     143 | +                    }
     144 | +                }
     145 | +            } catch (const std::runtime_error&) {
    


    l0rinc commented at 2:31 PM on September 29, 2025:

    nit: is there anything in the error that we may want to log?


    andrewtoth commented at 6:04 PM on October 11, 2025:

    Added a log.

  132. in src/inputfetcher.h:107 in 688c03597a
     102 | +                while (m_last_outpoint_index == 0) {
     103 | +                    if ((is_main_thread && m_in_flight_outpoints_count == 0) || m_request_stop) {
     104 | +                        return;
     105 | +                    }
     106 | +                    ++m_idle_worker_count;
     107 | +                    cond.wait(lock);
    


    l0rinc commented at 2:33 PM on September 29, 2025:

    I haven't reviewed it in detail, but I was wondering why we need locking here; it should be possible to do most of this lock-free (especially if we sort the keys first so that threads are more likely to access different regions). I have started reviewing and testing it in detail, but to make some progress I'm sharing my observations as I go along.


    andrewtoth commented at 2:21 PM on October 2, 2025:

    I've updated to use semaphores instead of mutex. That should be more efficient.

    especially if we sort the keys first so that threads are more likely to access different regions

    I don't understand what this has to do with being lock free.


    l0rinc commented at 2:40 PM on October 2, 2025:

    I don't understand what this has to do with being lock free.

    We may have fewer file system locks if the threads are accessing different regions

    I've updated to use semaphores instead of mutex

    I will review that in more detail soon, probably next week.


    andrewtoth commented at 2:47 PM on October 2, 2025:

    We may have fewer file system locks if the threads are accessing different regions

    Ok, but that is not the same as this InputFetcher construction being lock free.


    l0rinc commented at 2:54 PM on October 2, 2025:

    No, that's orthogonal, it's another area where we could possibly reduce contention.

  133. in src/inputfetcher.h:82 in 688c03597a
      77 | +    const CCoinsViewCache* m_cache{nullptr};
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
      80 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
    


    l0rinc commented at 2:35 PM on September 29, 2025:

    instead of the comment, can we express this in the method name?


    andrewtoth commented at 2:22 PM on October 2, 2025:

    I updated the thread name to ThreadLoop, which just does the loop. There is another function now, FetchInputsOnThread, that fetches for each block until finished.

  134. in src/inputfetcher.h:79 in 688c03597a outdated
      74 | +    //! DB coins view to fetch from.
      75 | +    const CCoinsView* m_db{nullptr};
      76 | +    //! The cache to check if we already have this input.
      77 | +    const CCoinsViewCache* m_cache{nullptr};
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
    


    l0rinc commented at 2:38 PM on September 29, 2025:

    I have tried std::jthread in l0rinc/bitcoin@6afe2e8 (#40) but it seems the CI's libc++ doesn’t provide it

    Q: it's just the second commit and we're already doing the fetching on multiple threads. Can we add a single-threaded input fetcher first and add multithreading only as the very last step?


    andrewtoth commented at 3:55 PM on October 4, 2025:

    Can we add a single-threaded input fetcher first and add multithreading only as the very last step?

    Done.


    andrewtoth commented at 2:09 PM on October 14, 2025:

    I have tried std::jthread

    I looked at this, but it doesn't really add anything to this implementation. We could have a std::stop_token for each thread, but we would have to request_stop() each jthread before releasing the semaphore in the destructor anyway. So it doesn't let us remove the destructor, and saves a line for not having to declare m_request_stop. I don't think it's worth it to use jthreads here.


    l0rinc commented at 2:49 PM on October 14, 2025:

    I couldn't get it to work on CI anyway

  135. in src/inputfetcher.h:73 in 688c03597a
      68 | +     */
      69 | +    int32_t m_in_flight_outpoints_count GUARDED_BY(m_mutex){0};
      70 | +    //! The number of worker threads that are waiting on m_worker_cv
      71 | +    int32_t m_idle_worker_count GUARDED_BY(m_mutex){0};
      72 | +    //! The maximum number of outpoints to be assigned in one batch
      73 | +    const int32_t m_batch_size;
    


    l0rinc commented at 2:42 PM on September 29, 2025:

    What if, instead of locking, each thread iterates every nth element (with a stride of n threads, offset by the thread index), implicitly dividing the input into n disjoint buckets without locking? Each thread would work on a distinct set of values; we can pre-filter for existing values on a single thread before forking off. This won't have work stealing, but we can likely assume a uniform distribution, and the solution would be trivial and lock-free.


    andrewtoth commented at 2:24 PM on October 2, 2025:

    Prefiltering on the main thread is too slow; it's faster if we do the filtering in parallel. So we still need a smaller batch size, because otherwise the work would not be divided evenly: one thread could get all cache misses while the others all have cached inputs.


    l0rinc commented at 2:40 PM on October 2, 2025:

    Not sure why that's problematic, we don't have to have perfect parallelism, it seems to me we can assume uniform distribution - it's fine if there are outliers if that makes the code simpler (which I think it should, it could even eliminate most locks, since the jobs are basically completely independent)

  136. in src/inputfetcher.h:77 in 688c03597a outdated
      72 | +    //! The maximum number of outpoints to be assigned in one batch
      73 | +    const int32_t m_batch_size;
      74 | +    //! DB coins view to fetch from.
      75 | +    const CCoinsView* m_db{nullptr};
      76 | +    //! The cache to check if we already have this input.
      77 | +    const CCoinsViewCache* m_cache{nullptr};
    


    l0rinc commented at 2:44 PM on September 29, 2025:

    Could we pre-filter on a single thread and send the results to the fetcher instead? That way we can also decide not to do multi-threaded access for small sets (we can experiment with the values, but a reasonable starting point is that any set smaller than nproc is handled on a single thread).


    andrewtoth commented at 2:25 PM on October 2, 2025:

    Prefiltering on the main thread is too slow. It is several milliseconds to check every input in large blocks whether they exist in the cache.

  137. in src/inputfetcher.h:1 in 688c03597a outdated
       0 | @@ -0,0 +1,246 @@
       1 | +// Copyright (c) 2024-present The Bitcoin Core developers
    


    l0rinc commented at 2:45 PM on September 29, 2025:

    nit: the curse of long review queues

    // Copyright (c) 2025-present The Bitcoin Core developers
    

    andrewtoth commented at 2:25 PM on October 2, 2025:

    Done.

  138. in src/coins.h:432 in 688c03597a outdated
     430 | +     * NOT FOR GENERAL USE. Used when loading coins from a UTXO snapshot, and
     431 | +     * in the InputFetcher.
     432 |       * @sa ChainstateManager::PopulateAndValidateSnapshot()
     433 |       */
     434 | -    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin);
     435 | +    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty = true);
    


    l0rinc commented at 2:48 PM on September 29, 2025:

    To peel away the preparatory commits, it would simplify review to extract these into tiny, focused PRs. This PR has been in review for some time, and it's a very good change that I'd like to see make progress.

    nit: I understand the default param is meant to make the diff smaller, but it doesn't help with understanding the effect of the change, i.e. seeing where this is used and what we're changing.


    andrewtoth commented at 4:09 PM on October 15, 2025:

    I think I can just drop this first commit entirely. We don't actually need to avoid marking the coins we fetch as dirty. In the happy path, all these coins will be spent immediately after ConnectBlock, so they would be set to dirty anyway. In the unhappy path, where a block with valid proof-of-work turns out to be invalid, the dirty coins we added will just be overwritten with the same data from the db at the next flush.

  139. in src/coins.cpp:114 in 688c03597a outdated
     109 | @@ -110,10 +110,15 @@ void CCoinsViewCache::AddCoin(const COutPoint &outpoint, Coin&& coin, bool possi
     110 |             (bool)it->second.coin.IsCoinBase());
     111 |  }
     112 |  
     113 | -void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin) {
     114 | -    cachedCoinsUsage += coin.DynamicMemoryUsage();
     115 | +void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
     116 | +    const auto mem_usage{coin.DynamicMemoryUsage()};
    


    l0rinc commented at 2:49 PM on September 29, 2025:

    af8a366bd6a08d9362e69a89b0b89b5c94eb63ca I had something similar in https://github.com/bitcoin/bitcoin/pull/32313/files#diff-f0ed73d62dae6ca28ebd3045e5fc0d5d02eaaacadb4c2a292985a3fbd7e1c77cR254

    Can you please explain in the commit message why this change is necessary?


    andrewtoth commented at 3:59 PM on October 3, 2025:

    Added some explanation in the commit message. Please let me know if it makes it more clear.

  140. in src/validation.cpp:6271 in 688c03597a outdated
    6267 | @@ -6266,6 +6268,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6268 |  
    6269 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6270 |      : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6271 | +      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 2:50 PM on September 29, 2025:

    I have tested this with different par values and surprisingly it barely had any effect. Is it because of the locking?


    andrewtoth commented at 2:27 PM on October 2, 2025:

    I believe this is resolved.

  141. in src/coins.cpp:117 in af8a366bd6 outdated
     115 | +void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
     116 | +    const auto mem_usage{coin.DynamicMemoryUsage()};
     117 |      auto [it, inserted] = cacheCoins.try_emplace(std::move(outpoint), std::move(coin));
     118 | -    if (inserted) CCoinsCacheEntry::SetDirty(*it, m_sentinel);
     119 | +    if (inserted) {
     120 | +        cachedCoinsUsage += mem_usage;
    


    l0rinc commented at 2:57 PM on September 29, 2025:

    This seems like a change in behavior, but the old code assumed the coin was never already in the cache; the if (inserted) check doesn't make that obvious, so it likely isn't an actual change.

    Is that still something we can assume (hence the "DANGER"), right? And if it's always inserted, does the insertion guard still make sense? It's a bit even more confusing now :/


    andrewtoth commented at 2:33 PM on October 2, 2025:

    This is dangerous because it doesn't check for freshness or if already inserted. It is meant to bulk load new utxos from the assume utxo set. Since assume utxo assumes the utxo set is currently empty, the coins would always be inserted. This is repurposed here to bulk load utxos from the db directly into the cache. However, an invalid block could be mined which spends an already spent utxo that is in the cache but has not been synced to the db yet. In that case, the insertion will fail here. There is a unit test specifically for this scenario.

  142. in src/inputfetcher.h:24 in 912f26b81e outdated
      19 | +#include <unordered_set>
      20 | +#include <vector>
      21 | +
      22 | +/**
      23 | + * Input fetcher for fetching inputs from the CoinsDB and inserting
      24 | + * into the CoinsTip.
    


    l0rinc commented at 2:59 PM on September 29, 2025:
     * Helper for fetching inputs from the CoinsDB and inserting into the CoinsTip.
    

    andrewtoth commented at 3:59 PM on October 3, 2025:

    Done.

  143. in src/inputfetcher.h:27 in 912f26b81e outdated
      22 | +/**
      23 | + * Input fetcher for fetching inputs from the CoinsDB and inserting
      24 | + * into the CoinsTip.
      25 | + *
      26 | + * The main thread loops through the block and writes all input prevouts to a
      27 | + * global vector. It then wakes all workers and starts working as well. Each
    


    l0rinc commented at 3:00 PM on September 29, 2025:

    do we need to write to a global vector or can we safely iterate the prevouts directly from each thread?


    andrewtoth commented at 2:36 PM on October 2, 2025:

    We now iterate the prevouts directly from the block, but we store the tx index and vin index of each input in a global vector. This way we can flatten the inputs instead of having to scan the txs to see how many inputs each has.

  144. in src/inputfetcher.h:83 in 912f26b81e outdated
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
      80 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
      83 | +    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 3:04 PM on September 29, 2025:

    Q: do we really need the main thread to be part of this? I expect this to be disk bound rather than CPU bound, so we should be able to go beyond nproc; it should be safe to leave the main thread out of this as far as I can tell...


    andrewtoth commented at 2:37 PM on October 2, 2025:

    Not sure, would have to benchmark this. I have updated the functions though to make the main thread's entrance clearer.


    l0rinc commented at 2:43 PM on October 2, 2025:

    My benchmarks so far indicate the opposite: after 3-4 threads there is no benefit to the parallelization (either on SSD or HDD). I will remeasure your new changes after you give me the 👍

  145. in src/inputfetcher.h:130 in 912f26b81e outdated
     125 | +                    // block, it won't be in the cache yet but it also won't be
     126 | +                    // in the db either.
     127 | +                    if (m_txids.contains(outpoint.hash)) {
     128 | +                        continue;
     129 | +                    }
     130 | +                    if (m_cache->HaveCoinInCache(outpoint)) {
    


    l0rinc commented at 3:28 PM on September 29, 2025:

    as mentioned before I think it should be safe to pre-filter on a single thread instead


    andrewtoth commented at 2:38 PM on October 2, 2025:

    It is definitely safe to do, since all access would be on the main thread. It is also safe to do from parallel threads if we don't write until all threads are done reading, which is what this PR does. Prefiltering is slow though (several milliseconds), so it is better done in parallel.


    l0rinc commented at 2:45 PM on October 2, 2025:

    But prefiltering would allow sorting, which should untangle the threads. Each thread would then keep accessing the same files (which are more likely to differ from the files the other threads are requesting), so they may profit from cache locality if the OS supports it - that's why I suggested giving it a try.

  146. in src/inputfetcher.h:127 in 912f26b81e outdated
     122 | +                for (auto i{end_index - local_batch_size}; i < end_index; ++i) {
     123 | +                    const auto& outpoint{m_outpoints[i]};
     124 | +                    // If an input spends an outpoint from earlier in the
     125 | +                    // block, it won't be in the cache yet but it also won't be
     126 | +                    // in the db either.
     127 | +                    if (m_txids.contains(outpoint.hash)) {
    


    l0rinc commented at 3:31 PM on September 29, 2025:

    what if this ends up on different threads, i.e. a spend from an earlier outpoint is processed on a different thread? Wouldn't we take care of those automatically? We can likely skip all values that are not found, since we will revalidate everything after this cache-warming call - we just have to document that it's theoretically possible that some values won't be in the cache after this call (though the internal spends should be added, just on a different thread, right?).


    andrewtoth commented at 2:40 PM on October 2, 2025:

    I'm not sure I understand this. The m_txids set is computed on the main thread, and is only read from multiple threads. If we didn't do this we would try to fetch non-existent outputs from the db, which would be much slower.

  147. in src/inputfetcher.h:136 in 912f26b81e outdated
     131 | +                        continue;
     132 | +                    }
     133 | +                    if (auto coin{m_db->GetCoin(outpoint)}; coin) {
     134 | +                        local_pairs.emplace_back(outpoint, std::move(*coin));
     135 | +                    } else {
     136 | +                        // Missing an input. This block will fail validation.
    


    l0rinc commented at 3:32 PM on September 29, 2025:

    do we really care about this, it's not our job here to validate, just fetch whatever we can, the validation will happen after this pre-warming.


    andrewtoth commented at 2:41 PM on October 2, 2025:

    We don't really care, but it would be good to not continue doing work here if we know it's pointless. This just exits early. No validation is happening.


    l0rinc commented at 2:50 PM on October 2, 2025:

    I think I would prefer a less opinionated version, as long as it's still correct. No need to optimize for the consensus failure speed in my opinion, I would prefer simpler code for a change as risky as this one.

  148. in src/inputfetcher.h:162 in 912f26b81e outdated
     157 | +
     158 | +    //! Create a new input fetcher
     159 | +    explicit InputFetcher(int32_t batch_size, int32_t worker_thread_count) noexcept
     160 | +        : m_batch_size(batch_size)
     161 | +    {
     162 | +        if (worker_thread_count < 1) {
    


    l0rinc commented at 3:34 PM on September 29, 2025:

    what's the reason for allowing a negative worker_thread_count? In other cases I think it was used to signal how many CPUs to reserve, but that doesn't seem to be the case here, and since we're clamping to a minimum of 0, consider:

            if (worker_thread_count == 0) {
    

    andrewtoth commented at 3:58 PM on October 3, 2025:

    Done.

  149. in src/inputfetcher.h:192 in 912f26b81e outdated
     187 | +    void FetchInputs(CCoinsViewCache& cache,
     188 | +                     const CCoinsView& db,
     189 | +                     const CBlock& block) noexcept
     190 | +        EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
     191 | +    {
     192 | +        if (m_worker_threads.empty() || block.vtx.size() <= 1) {
    


    l0rinc commented at 3:37 PM on September 29, 2025:

    Can we maybe do something like this instead?

            if (block.vtx.size() < m_worker_threads.size()) {
    

    andrewtoth commented at 2:42 PM on October 2, 2025:

    This is to not enter if there is only a coinbase tx, since it has no inputs to fetch. If there were 2 txs, and the second has 1000 inputs, we would still want to enter here.

  150. in src/inputfetcher.h:198 in 912f26b81e outdated
     193 | +            return;
     194 | +        }
     195 | +
     196 | +        // Set the db and cache to use for this block.
     197 | +        m_db = &db;
     198 | +        m_cache = &cache;
    


    l0rinc commented at 3:38 PM on September 29, 2025:

    can we avoid mutating the state in a multithreaded class for safety? It's easier to follow along knowing that the class is immutable and the state is passed along...


    andrewtoth commented at 2:43 PM on October 2, 2025:

    I don't think we can do that. We need to set these here for other threads to read. These are only read from other threads, never written to. We also only read from other threads after the main thread has released the counting_semaphore, so we know the pointers are synced across the threads.


    l0rinc commented at 2:48 PM on October 2, 2025:

    I really dislike that, will try to come up with a lock-free version later (maybe next week)

  151. in src/test/inputfetcher_tests.cpp:51 in c705c6f1f1 outdated
      46 | +
      47 | +        return block;
      48 | +    }
      49 | +
      50 | +public:
      51 | +    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    


    l0rinc commented at 3:40 PM on September 29, 2025:

    can we add these tests before the multithreading change - i.e. have a single-threaded InputFetcher first, add tests and benchmarks next, and do the actual multithreading as a very last step? That would construct the whole scenario in smaller steps, proving that every change is safe.


    andrewtoth commented at 3:57 PM on October 4, 2025:

    Done.

  152. in src/test/inputfetcher_tests.cpp:55 in c705c6f1f1 outdated
      50 | +public:
      51 | +    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
      52 | +                             TestOpts opts = {})
      53 | +        : BasicTestingSetup{chainType, opts}
      54 | +    {
      55 | +        SeedRandomForTest(SeedRand::ZEROS);
    


    l0rinc commented at 3:41 PM on September 29, 2025:

    I understand why benchmarks need predictability, but wouldn't we want variance for tests?


    andrewtoth commented at 6:04 PM on October 11, 2025:

    Changed to FIXED_SEED.

  153. in src/test/inputfetcher_tests.cpp:171 in c705c6f1f1 outdated
     166 | +
     167 | +class ThrowCoinsView : public CCoinsView
     168 | +{
     169 | +    std::optional<Coin> GetCoin(const COutPoint& outpoint) const override
     170 | +    {
     171 | +        throw std::runtime_error("database error");
    


    l0rinc commented at 3:44 PM on September 29, 2025:

    consider std::terminate


    andrewtoth commented at 2:44 PM on October 2, 2025:

    Err, we want to throw a runtime error here to test the try/catch in the inputfetcher.

  154. in src/test/fuzz/inputfetcher.cpp:150 in 1faf0595a5 outdated
     145 | +            // Check any newly added coins in the cache are the same as the db
     146 | +            const auto& coin{cache.AccessCoin(outpoint)};
     147 | +            assert(!coin.IsSpent());
     148 | +            assert(coin.fCoinBase == (*maybe_coin).fCoinBase);
     149 | +            assert(coin.nHeight == (*maybe_coin).nHeight);
     150 | +            assert(coin.out == (*maybe_coin).out);
    


    l0rinc commented at 3:47 PM on September 29, 2025:
                assert(coin.fCoinBase == maybe_coin->fCoinBase);
                assert(coin.nHeight == maybe_coin->nHeight);
                assert(coin.out == maybe_coin->out);
    

    andrewtoth commented at 6:04 PM on October 11, 2025:

    Done!

  155. l0rinc commented at 4:16 PM on September 29, 2025: contributor

    I have re-reviewed the changes again lightly and did quite a few benchmarks on different platforms. There were a lot of surprises, see my measurements:

    <details> <summary>rpi5-16 IBD from local node & reindex-chainstate seems ~27% faster</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        29732.695 s               [User: 60441.083 s, System: 5856.247 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        37896.082 s               [User: 60968.810 s, System: 7062.414 s]
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.27          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    Retested it separately with:

    # cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -connect=rpi5-16-3.local
    
    cat ../BitcoinData/debug.log | egrep 'height=0|height=916000'
    2025-09-25T17:03:06Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2009-01-03T18:15:05Z' progress=0.000000 cache=0.3MiB(0txo)
    2025-09-26T01:02:56Z UpdateTip: new best=000000000000000000003ca9748080f4c3d1230ba9fa4bed66be6ded05f9b6e6 height=916000 version=0x2000e000 log2_work=95.840381 tx=1246369867 date='2025-09-23T07:22:08Z' progress=0.998966 cache=367.7MiB(2821413txo)
    7h 59m 50s
    

    </details>

    Doing the same on an Intel i9 with SSD shows similar results

    <details> <summary>i9 with SSD, IBD from real peers/reindex-chainstate seems 24%/25% faster for default memory, done in 6h/3.5h</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        12698.166 s               [User: 33794.242 s, System: 3015.471 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        15928.708 s               [User: 28382.232 s, System: 2308.299 s]
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.25          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    and

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=916000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 3 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (mean ± σ):     21484.108 s ± 1187.956 s    [User: 42976.944 s, System: 4356.289 s]
      Range (min … max):   20112.390 s … 22175.559 s    3 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (mean ± σ):     26589.393 s ± 1171.370 s    [User: 36011.245 s, System: 3193.496 s]
      Range (min … max):   25607.731 s … 27886.055 s    3 runs
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.24 ±  0.09  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    </details>

    Increasing the memory decreases the difference:

    <details> <summary>i9 reindex-chainstate seems ~9% faster with increased memory (dbcache=4500), done in 3.3h</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=4500; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        11801.704 s               [User: 20216.598 s, System: 1181.879 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        12916.432 s               [User: 17150.579 s, System: 747.711 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.09          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)

    </details>
    
    Note that the difference between small and big dbcache has also shrunk, from 23% to 7.5%!
    
    Checked the same on an i7 with HDD; it seems the speedup is best on non-rotating disks, so maybe we could consider reducing the parallelism for rotating ones:
    
    <details>
    <summary>i7 with HDD, IBD/reindex-chainstate seems ~16% faster for default memory</summary>
    
    

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=916000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
        grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        35766.853 s               [User: 39688.514 s, System: 2853.808 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        41355.517 s               [User: 35667.321 s, System: 2872.506 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.16          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)


    
    </details>
    
    Checking the same on my M4 max laptop was the most surprising:
    <details>
    <summary>M4 max with SSD, IBD/reindex-chainstate seems ~2.9x faster for default memory</summary>
    
    

    STOP=916000; DBCACHE=450; \
    DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
    hyperfine \
      --sort command \
      --runs 1 \
      --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        20658.825 s               [User: 19679.918 s, System: 4763.490 s]
    Benchmark 1b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        20186.312 s               [User: 19481.126 s, System: 4716.728 s]

    Benchmark 2: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7131.178 s               [User: 17244.133 s, System: 12850.427 s]
    Benchmark 2b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7180.762 s               [User: 17430.360 s, System: 12949.731 s]

    Relative speed comparison
            2.90          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
            1.00          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)

    
    It was hard to believe this was true, so I re-ran it a few times, and it was consistent.
    
    </details>
    
    I have tried -par=32 on my laptop as well - exactly the same speed:
    <details>
    <summary>-par=32</summary>
    
    

    STOP=916000; DBCACHE=450; \
    DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
    hyperfine \
      --sort command \
      --runs 1 \
      --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32"

    Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7109.626 s               [User: 17210.848 s, System: 12938.964 s]

    note, the commit had:
    
      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, 10 * MAX_SCRIPTCHECK_THREADS)},
    
    </details>
    
  156. l0rinc changes_requested
  157. l0rinc commented at 4:17 PM on September 29, 2025: contributor

    <duplicate>

  158. in src/inputfetcher.h:86 in 688c03597a
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
      83 | +    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
      84 | +    {
      85 | +        auto local_batch_size{0};
      86 | +        auto end_index{0};
    


    l0rinc commented at 5:41 PM on September 29, 2025:

    I think we should add exact types here to make sure calculations like end_index - local_batch_size can't underflow


    andrewtoth commented at 2:44 PM on October 2, 2025:

    I rewrote this part, these are gone now.

  159. in src/bench/inputfetcher.cpp:47 in 688c03597a outdated
      42 | +    DelayedCoinsView db(DELAY);
      43 | +    CCoinsViewCache cache(&db);
      44 | +
      45 | +    // The main thread should be counted to prevent thread oversubscription, and
      46 | +    // to decrease the variance of benchmark results.
      47 | +    const auto worker_threads_num{GetNumCores() - 1};
    


    l0rinc commented at 5:44 PM on September 29, 2025:

    I'm a bit conflicted here: this way we're all measuring something slightly different - which is especially problematic since the work here isn't even CPU bound. What if we did a min of ncpu and 4?

  160. Raimo33 commented at 2:23 PM on October 1, 2025: contributor

    Concept ACK

  161. andrewtoth force-pushed on Oct 2, 2025
  162. andrewtoth force-pushed on Oct 2, 2025
  163. andrewtoth force-pushed on Oct 2, 2025
  164. andrewtoth force-pushed on Oct 2, 2025
  165. andrewtoth force-pushed on Oct 2, 2025
  166. andrewtoth commented at 3:47 PM on October 2, 2025: contributor

    Updated the input fetcher significantly:

    • uses counting_semaphores to synchronize threads instead of mutex + condvar.
    • stores tx + vin indexes in global vector instead of copying the COutPoints. The COutPoints are read from a global CBlock pointer.
    • The fetch queue counter is an atomic int instead of a mutex guarded int.
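
    The claiming scheme described above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: the member names, the `int` "inputs", and the thread count are all made up to show how a semaphore wake-up plus an atomic `fetch_add` counter distributes work without a mutex.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstddef>
    #include <semaphore>
    #include <thread>
    #include <vector>

    int main()
    {
        std::vector<int> inputs(1000, 0);        // one slot per block input
        std::atomic<std::size_t> counter{0};     // next input index to claim
        std::counting_semaphore<> start{0};      // wakes workers when a block is posted
        std::atomic<std::size_t> fetched{0};

        auto worker = [&] {
            start.acquire();                     // park until work is posted
            for (;;) {
                // fetch_add hands each index to exactly one thread, lock-free
                const std::size_t i{counter.fetch_add(1, std::memory_order_relaxed)};
                if (i >= inputs.size()) break;   // queue drained
                inputs[i] = 1;                   // stand-in for the db fetch
                fetched.fetch_add(1, std::memory_order_relaxed);
            }
        };

        std::vector<std::thread> threads;
        for (int t{0}; t < 4; ++t) threads.emplace_back(worker);
        start.release(4);                        // kick off all workers at once
        for (auto& th : threads) th.join();

        // every input was claimed exactly once across the four threads
        assert(fetched.load() == inputs.size());
        return 0;
    }
    ```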
  167. andrewtoth force-pushed on Oct 3, 2025
  168. andrewtoth force-pushed on Oct 3, 2025
  169. andrewtoth force-pushed on Oct 3, 2025
  170. andrewtoth commented at 5:33 PM on October 3, 2025: contributor

    Removed m_batch_size. Each thread now increments the atomic counter by 1.

  171. andrewtoth force-pushed on Oct 3, 2025
  172. l0rinc commented at 12:53 AM on October 4, 2025: contributor

    The latest version seems very promising, I like that the algorithm is getting simpler. I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost: since we already have everything in the cache, it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that.

  173. andrewtoth force-pushed on Oct 4, 2025
  174. l0rinc commented at 6:52 PM on October 7, 2025: contributor

    Compared it against master on a Raspberry Pi 5 synchronizing from real peers for realism, ran it twice for good measure until 917000 blocks with dbcache 450:

    This isn't the latest version of the PR, but should likely be representative anyway.

    <details> <summary>First run: 19% faster, finished IBD in 13h:11m | IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
    STOP=917000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    a8f9a80675 validation: fetch block inputs in parallel
    f0dc19f168 coins: allow emplacing non-dirty coins internally
    
    IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
      Time (abs ≡):        47485.682 s               [User: 79615.847 s, System: 9374.261 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
      Time (abs ≡):        56374.354 s               [User: 78807.079 s, System: 10196.290 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
            1.19          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    

    </details>

    <details> <summary>Second run: 21% faster, finished IBD in 12h:45m | IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
    STOP=917000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    a8f9a80675 validation: fetch block inputs in parallel
    f0dc19f168 coins: allow emplacing non-dirty coins internally
    
    IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
      Time (abs ≡):        45907.874 s               [User: 81006.258 s, System: 10039.919 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
      Time (abs ≡):        55612.464 s               [User: 81830.349 s, System: 11913.754 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
            1.21          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    

    </details>

    The variance between the runs is 1% for master and 3% for the PR, indicating that we're likely nearing the network bandwidth limitations.

  175. andrewtoth force-pushed on Oct 11, 2025
  176. andrewtoth force-pushed on Oct 11, 2025
  177. andrewtoth force-pushed on Oct 14, 2025
  178. andrewtoth commented at 8:28 PM on October 14, 2025: contributor

    I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost - since we already have everything in the cache, so it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that.

    @l0rinc There is already quite a lot to review here, and your benchmarks (and mine) show very promising results. So, I would prefer to keep this idea as a follow-up. We can do isolated benchmarks with your suggested change afterwards and propose an improvement accordingly.

  179. l0rinc commented at 9:04 PM on October 14, 2025: contributor

    I'm fine with doing that in a follow-up if you think it's too complicated (though it's likely quite simple, we can just track the very first cache miss and always prefetch after that - that heuristic would even survive node restarts. Maybe we need to skip the Bip30 values though, but it's just a heuristic anyway). But I don't see this PR as close to being final yet - do you? I still want to review it thoroughly, I don't think we should ossify yet :)
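
    The suggested heuristic could be as small as a latch that flips on the first cache miss. A hypothetical sketch, not code from the PR; the class and method names here are invented for illustration:

    ```cpp
    #include <cassert>

    // Hypothetical gate for the suggested heuristic: skip prefetching while
    // the cache is still known to contain every coin created so far (i.e.
    // before the first miss / flush-and-erase); always prefetch afterwards.
    class PrefetchGate
    {
        bool m_seen_miss{false};

    public:
        void NoteCacheMiss() { m_seen_miss = true; }
        bool ShouldPrefetch() const { return m_seen_miss; }
    };

    int main()
    {
        PrefetchGate gate;
        assert(!gate.ShouldPrefetch()); // cold start: cache holds everything
        gate.NoteCacheMiss();           // first miss observed
        assert(gate.ShouldPrefetch());  // prefetch from now on
        return 0;
    }
    ```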

  180. DrahtBot added the label Needs rebase on Oct 15, 2025
  181. andrewtoth force-pushed on Oct 16, 2025
  182. andrewtoth force-pushed on Oct 16, 2025
  183. DrahtBot removed the label Needs rebase on Oct 16, 2025
  184. andrewtoth commented at 2:07 AM on October 16, 2025: contributor

    Rebased due to conflicts. Removed the first commit. We don't need to modify EmplaceCoinInternalDANGER, we can just insert the coins as dirty into the cache. They will be set dirty when they are spent anyway. If a block fails validation inside ConnectBlock, then the dirty coins will just rewrite the same value to the db on the next flush or sync.

  185. in src/inputfetcher.h:35 in 063946d6bd outdated
      30 | +private:
      31 | +    /**
      32 | +     * The flattened indexes to each input in the block. The first item in the
      33 | +     * pair is the index of the tx, and the second is the index of the vin.
      34 | +     */
      35 | +    std::vector<std::pair<size_t, size_t>> m_inputs{};
    


    l0rinc commented at 11:56 PM on October 16, 2025:

    As far as I understood the tx index and the vin index cannot exceed the limits of a uint32_t, see https://github.com/bitcoin/bitcoin/blob/e744fd1249bf9577274614eaf3997bf4bbb612ff/src/primitives/transaction.h#L32

    Please consider std::vector<std::pair<uint32_t, uint32_t>> instead, which would likely halve the memory footprint. Based on https://godbolt.org/z/Wb918bWaM it seems to me this layout allows modern compilers to coalesce the two member accesses into a single, optimal 64-bit load from memory.

    Packing it into a single uint64_t seems to be the best, but that's completely unreadable, so I'd go with the above std::pair<uint32_t, uint32_t>.
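
    The footprint difference and the readability cost of packing can both be shown in a few lines. This is an illustrative sketch, assuming a typical 64-bit platform where `size_t` is 8 bytes; the `Pack`/`TxOf`/`VinOf` helpers are invented names, not anything from the PR:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>

    // Packing the tx/vin indexes into one word: same 8-byte footprint as
    // std::pair<uint32_t, uint32_t>, but much less readable at use sites.
    uint64_t Pack(uint32_t tx, uint32_t vin) { return (uint64_t{tx} << 32) | vin; }
    uint32_t TxOf(uint64_t p) { return static_cast<uint32_t>(p >> 32); }
    uint32_t VinOf(uint64_t p) { return static_cast<uint32_t>(p); }

    int main()
    {
        // Two uint32_t halve the per-input footprint of a size_t pair
        // (16 bytes on a typical 64-bit platform).
        static_assert(sizeof(std::pair<uint32_t, uint32_t>) == 8);
        assert(sizeof(std::pair<std::size_t, std::size_t>) == 2 * sizeof(std::size_t));

        const uint64_t p{Pack(413567, 42)};
        assert(TxOf(p) == 413567 && VinOf(p) == 42); // lossless round-trip
        return 0;
    }
    ```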


    andrewtoth commented at 1:59 PM on October 28, 2025:

    Done, using uint32_t in the Input struct now; not sure it will have much effect on the struct layout though.

  186. in src/validation.cpp:3142 in 64de911053 outdated
    3136 | @@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3139 |      {
    3140 | +        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    3141 | +
    3142 |          CCoinsViewCache view(&CoinsTip());
    


    l0rinc commented at 2:02 PM on October 18, 2025:

    I have played with this to see if we can construct the new cache layer before the fetcher, so that it populates the new layer instead of reading & writing to the old one - seemed like a cleaner separation.

    But unfortunately we would need access to both in-memory cache layers inside FetchInputs in that case - to read for presence from the old cache and to write the newly fetched values to the new one (which will be flushed together with the new outputs when exiting this scope).

    But given that we're adding these missing entries only to spend them a moment later in the other cache layer (and to avoid having all of the missing ones require two-hop-lookups) we should still try reading from the stable cache and writing to the temporary one.

    Collecting the missing inputs to a separate cache would also help with benchmarking and testing since the underlying cache would only be modified once block connection finishes - while the complete diff would be in the top cache layer. We also shouldn't add the entries to the cache if the block fails validation.


    andrewtoth commented at 1:30 PM on October 28, 2025:

    Done.

  187. in src/test/coinsviewcacheasync_tests.cpp:41 in 64de911053 outdated
      36 | +
      37 | +        Txid prevhash{Txid::FromUint256(uint256(1))};
      38 | +
      39 | +        for (auto i{1}; i < num_txs; ++i) {
      40 | +            CMutableTransaction tx;
      41 | +            const auto txid{m_rng.randbool() ? Txid::FromUint256(uint256(i)) : prevhash};
    


    l0rinc commented at 4:14 PM on October 18, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    Maybe we can mix it up a bit by something like

    if (m_rng.randbool()) {
        prevhash = tx.GetHash(); // TODO This can theoretically simulate double spends
    }
    

    andrewtoth commented at 2:04 PM on October 28, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    I don't think that's what it means. The same thread could fetch two inputs in a row.

    Maybe we can mix it up a bit by something like

    So we can use the same prevhash in different txs? What is the benefit of this?

  188. in src/test/inputfetcher_tests.cpp:153 in 64de911053 outdated
     148 | +}
     149 | +
     150 | +BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, InputFetcherTest)
     151 | +{
     152 | +    const auto& block{getBlock()};
     153 | +    for (auto i{0}; i < 3; ++i) {
    


    l0rinc commented at 4:34 PM on October 18, 2025:

    What's the point of the loop in these, is there a state we're changing in each iteration?


    andrewtoth commented at 1:31 PM on October 28, 2025:

    The InputFetcher is stateful, so this is making sure previous state does not leak into the next fetch phase.

  189. in src/test/fuzz/inputfetcher.cpp:47 in 64de911053 outdated
      42 | +{
      43 | +public:
      44 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      45 | +    {
      46 | +        abort();
      47 | +    }
    


    l0rinc commented at 4:52 PM on October 18, 2025:

    nit:

        std::optional<Coin> GetCoin(const COutPoint&) const override { std::abort(); }
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  190. in src/test/fuzz/inputfetcher.cpp:26 in 64de911053 outdated
      21 | +class DbCoinsView : public CCoinsView
      22 | +{
      23 | +private:
      24 | +    DbMap& m_map;
      25 | +
      26 | +public:
    


    l0rinc commented at 4:53 PM on October 18, 2025:

    nit: in test code I'd strive for simpler code instead of needlessly "safe"

    struct DbCoinsView : CCoinsView
    {
        DbMap& m_map;
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  191. in src/bench/inputfetcher.cpp:29 in 64de911053 outdated
      24 | +    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
      25 | +
      26 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      27 | +    {
      28 | +        UninterruptibleSleep(m_delay);
      29 | +        return Coin{};
    


    l0rinc commented at 4:54 PM on October 18, 2025:

    GetCoin shouldn't return spent entries


    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  192. in src/bench/inputfetcher.cpp:24 in 64de911053 outdated
      19 | +{
      20 | +private:
      21 | +    std::chrono::milliseconds m_delay;
      22 | +
      23 | +public:
      24 | +    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
    


    l0rinc commented at 4:54 PM on October 18, 2025:

    this seems overly general to me, I think we can inline the delay for now


    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  193. in src/inputfetcher.h:153 in 64de911053 outdated
     148 | +            return;
     149 | +        }
     150 | +
     151 | +        m_db = &db;
     152 | +        m_cache = &cache;
     153 | +        m_block = &block;
    


    l0rinc commented at 4:55 PM on October 18, 2025:

    As mentioned before, I really dislike these lines, a fetcher shouldn't change the internal state (especially since they're const). Since we need a continuously running threads, can we package these to avoid state mutations? This would likely be solved by sending these off to a ThreadPool instance.


    andrewtoth commented at 1:36 PM on October 28, 2025:

    Indeed, this would be gracefully solved with #33689. We could pass all state into the worker threads via lambda capture.

  194. in src/test/fuzz/inputfetcher.cpp:120 in 64de911053 outdated
     115 | +
     116 | +            prevhash = tx.GetHash();
     117 | +            block.vtx.push_back(MakeTransactionRef(tx));
     118 | +        }
     119 | +
     120 | +        fetcher.FetchInputs(cache, db, block);
    


    l0rinc commented at 5:00 PM on October 18, 2025:

    We shouldn't test this with invalid (empty) blocks:

            if (block.vtx.empty()) continue;
            fetcher.FetchInputs(cache, db, block);
    

    andrewtoth commented at 1:37 PM on October 28, 2025:

    Why not? The InputFetcher should not make assumptions about the structure of the block being passed in.


    l0rinc commented at 4:30 PM on October 28, 2025:

    Even consensus invalid ones? Or did I misunderstand the context here?


    andrewtoth commented at 4:38 PM on October 28, 2025:

    Yeah, InputFetcher doesn't know if a block is consensus valid or not yet. It hasn't passed ConnectBlock yet before entering here.

  195. in src/bench/inputfetcher.cpp:39 in 64de911053 outdated
      34 | +
      35 | +static void InputFetcherBenchmark(benchmark::Bench& bench)
      36 | +{
      37 | +    DataStream stream{benchmark::data::block413567};
      38 | +    CBlock block;
      39 | +    stream >> TX_WITH_WITNESS(block);
    


    l0rinc commented at 5:08 PM on October 18, 2025:

    nit:

        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  196. in src/bench/inputfetcher.cpp:45 in 64de911053 outdated
      40 | +
      41 | +    DelayedCoinsView db(DELAY);
      42 | +    CCoinsViewCache cache(&db);
      43 | +
      44 | +    // The main thread should be counted to prevent thread oversubscription, and
      45 | +    // to decrease the variance of benchmark results.
    


    l0rinc commented at 5:09 PM on October 18, 2025:

    "should be counted" is the reason for the "- 1"?

  197. in src/bench/inputfetcher.cpp:47 in 64de911053 outdated
      42 | +    CCoinsViewCache cache(&db);
      43 | +
      44 | +    // The main thread should be counted to prevent thread oversubscription, and
      45 | +    // to decrease the variance of benchmark results.
      46 | +    const auto worker_threads_num{GetNumCores() - 1};
      47 | +    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    


    l0rinc commented at 5:12 PM on October 18, 2025:

    nit: if we keep the processor count here (which kinda' makes the benchmark measure different things on different platforms, but I don't really have a better idea, unless we assume that even the simplest machines where this matters (e.g. rpi4) already have at least 4 threads - and since this isn't even CPU bound, we shouldn't pretend that it does):

        const auto worker_threads_num{size_t(GetNumCores() - 1)};
        const InputFetcher fetcher{worker_threads_num};
    

    or

        const InputFetcher fetcher{/*max_thread_count=*/4};
    

    andrewtoth commented at 1:38 PM on October 28, 2025:

    Why don't we want different benchmarks on different machines? All benchmarks are subtly different depending on the host machine.


    l0rinc commented at 4:32 PM on October 28, 2025:

    Yeah, but we want to understand where the differences are coming from, otherwise we'd have the "faster-on-my-machine" syndrome. If you disagree, just resolve the issue.

  198. in src/bench/inputfetcher.cpp:50 in 64de911053 outdated
      45 | +    // to decrease the variance of benchmark results.
      46 | +    const auto worker_threads_num{GetNumCores() - 1};
      47 | +    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
      48 | +
      49 | +    bench.run([&] {
      50 | +        const auto ok{cache.Flush()};
    


    l0rinc commented at 5:17 PM on October 18, 2025:

    Doesn't this change the behavior of the benchmark after the first iteration?


    andrewtoth commented at 1:39 PM on October 28, 2025:

    Fixed in latest iteration.

  199. in src/validation.cpp:6270 in 64de911053 outdated
    6266 | @@ -6265,6 +6267,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6267 |  
    6268 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6269 |      : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6270 | +      m_input_fetcher{std::clamp<size_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 5:25 PM on October 18, 2025:

    I don't understand what 0 threads means. It likely means turn off prefetching.

    I would consider it to be a lot more intuitive if this started with 1 (and not be bound by the number of unrelated MAX_SCRIPTCHECK_THREADS since it's not script related and not even CPU bound).

    Note: chainstatemanager_snapshot_init creates it with 0 workers by default, not sure it's intended


    andrewtoth commented at 1:41 PM on October 28, 2025:

    0 threads means it is turned off, yes. If -par=1 is configured, we want to pass 0 to input fetcher to disable prefetching. Also if options.worker_threads_num was negative for some reason. Do you have a suggestion of how this can be made more clear?

  200. in src/inputfetcher.h:113 in 64de911053 outdated
     108 | +                }
     109 | +                if (m_cache->HaveCoinInCache(outpoint)) {
     110 | +                    continue;
     111 | +                }
     112 | +                if (auto coin{m_db->GetCoin(outpoint)}; coin) {
     113 | +                    m_coins[thread_index].emplace_back(outpoint, std::move(*coin));
    


    l0rinc commented at 5:38 PM on October 18, 2025:

    It seems to me that since the original m_coins is never written by different threads, we shouldn't have a false sharing problem here - right?


    andrewtoth commented at 1:43 PM on October 28, 2025:

    Possibly. This is no longer relevant in the current implementation.

  201. in src/inputfetcher.h:78 in 64de911053 outdated
      73 | +    const CBlock* m_block{nullptr};
      74 | +
      75 | +    std::vector<std::thread> m_worker_threads;
      76 | +    std::counting_semaphore<> m_start_semaphore{0};
      77 | +    std::counting_semaphore<> m_complete_semaphore{0};
      78 | +    std::atomic<bool> m_request_stop{false};
    


    l0rinc commented at 5:41 PM on October 18, 2025:

    This style of dynamic work-stealing seems too complicated for such a uniform problem - static slicing would likely be a lot simpler and I would expect it to perform equally well.

    I also realize that some of these are needed to avoid recreating the threads for every call. Currently the parallelism and the thread recreation are done in a single commit - could we first implement multithreading with thread recreation, and do the thread reuse as a separate concern (through a ThreadPool with a dedicated test)?
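
    Static slicing as suggested could look roughly like this: each thread gets one contiguous index range up front, so no atomic claim counter is needed. Illustrative only; the function name and signature are invented for the sketch:

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <utility>

    // Thread t of n_threads gets [begin, end) of the n_inputs indexes; the
    // first (n_inputs % n_threads) threads receive one extra element.
    std::pair<std::size_t, std::size_t> Slice(std::size_t n_inputs,
                                              std::size_t n_threads,
                                              std::size_t t)
    {
        const std::size_t base{n_inputs / n_threads};
        const std::size_t extra{n_inputs % n_threads};
        const std::size_t begin{t * base + std::min(t, extra)};
        const std::size_t end{begin + base + (t < extra ? 1 : 0)};
        return {begin, end};
    }

    int main()
    {
        // The slices partition all 10 inputs exactly once across 3 threads.
        std::size_t covered{0};
        for (std::size_t t{0}; t < 3; ++t) {
            const auto [b, e] = Slice(10, 3, t);
            covered += e - b;
        }
        assert(covered == 10);
        return 0;
    }
    ```

    The trade-off against the PR's dynamic counter is that a static split can stall on skew: one slow slice (many cache misses) leaves the other threads idle, which is what work stealing avoids.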


    andrewtoth commented at 2:05 PM on October 28, 2025:

    It should be simpler now. I'm not sure if this comment is still valid though with the current approach. I agree if we already had a ThreadPool this would be much cleaner.

  202. in src/test/inputfetcher_tests.cpp:134 in 64de911053 outdated
     129 | +
     130 | +        // Add all inputs as spent already in cache
     131 | +        for (const auto& tx : block.vtx) {
     132 | +            for (const auto& in : tx->vin) {
     133 | +                auto outpoint{in.prevout};
     134 | +                Coin coin{}; // Not setting nValue implies spent
    


    l0rinc commented at 6:09 PM on October 18, 2025:
                    Coin coin{};
                    assert(coin.IsSpent());
    

    andrewtoth commented at 2:01 PM on October 28, 2025:

    Done.

  203. andrewtoth force-pushed on Oct 19, 2025
  204. andrewtoth commented at 3:26 PM on October 19, 2025: contributor

    Updated to use std::barrier for the completion synchronization instead of acquiring a semaphore for each thread, as suggested by @l0rinc .

  205. andrewtoth force-pushed on Oct 19, 2025
  206. andrewtoth force-pushed on Oct 19, 2025
  207. andrewtoth force-pushed on Oct 19, 2025
  208. andrewtoth force-pushed on Oct 19, 2025
  209. andrewtoth force-pushed on Oct 19, 2025
  210. in src/inputfetcher.h:96 in a07936a62c
      91 | +            m_start_semaphore.acquire();
      92 | +            if (m_request_stop.load(std::memory_order_relaxed)) {
      93 | +                return;
      94 | +            }
      95 | +            Work(thread_index);
      96 | +            [[maybe_unused]] const auto arrival_token{m_complete_barrier.arrive()};
    


    l0rinc commented at 1:23 AM on October 21, 2025:

    Would this suffice?

                (void)m_complete_barrier.arrive();
    
  211. in src/inputfetcher.h:190 in a07936a62c
     185 | +        m_input_counter.store(0, std::memory_order_relaxed);
     186 | +        m_start_semaphore.release(m_worker_threads.size());
     187 | +
     188 | +        // Have the main thread work too before we wait for other threads
     189 | +        Work(m_worker_threads.size());
     190 | +        m_complete_barrier.arrive_and_wait();
    


    l0rinc commented at 1:29 AM on October 21, 2025:

    What's the reason for not doing the completion here instead of the OnCompletion workaround?

            m_complete_barrier.arrive_and_wait();
            for (auto& coins : m_coins) {
                for (auto& [outpoint, coin] : coins) {
                    m_cache->EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
                }
                coins.clear();
            }
    
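
    For context on the two options being compared: `std::barrier`'s completion function runs exactly once per phase, in an unspecified one of the participating threads, before any of them is released; doing the merge after `arrive_and_wait` instead pins it to the main thread specifically. A minimal illustration of the phase semantics (standalone sketch, unrelated to the PR's members; note the completion function must be nothrow-invocable):

    ```cpp
    #include <atomic>
    #include <barrier>
    #include <cassert>
    #include <thread>
    #include <vector>

    std::atomic<int> completions{0};

    int main()
    {
        // Three participants, one phase: the completion lambda fires once,
        // before any arrive_and_wait call returns.
        std::barrier barrier{3, []() noexcept { completions.fetch_add(1); }};

        std::vector<std::thread> threads;
        for (int t{0}; t < 3; ++t) {
            threads.emplace_back([&] { barrier.arrive_and_wait(); });
        }
        for (auto& th : threads) th.join();

        assert(completions.load() == 1); // single phase, single completion call
        return 0;
    }
    ```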
  212. in src/inputfetcher.h:149 in a07936a62c
     144 | +        }
     145 | +    }
     146 | +
     147 | +public:
     148 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
     149 | +        : m_complete_barrier{static_cast<int32_t>(worker_thread_count + 1), OnCompletionWrapper{this}}
    


    l0rinc commented at 1:30 AM on October 21, 2025:

    std::ptrdiff_t seems more appropriate here: https://en.cppreference.com/w/cpp/thread/barrier/barrier.html

        explicit InputFetcher(size_t worker_thread_count) noexcept : m_complete_barrier{std::ptrdiff_t(worker_thread_count + 1)}
    
  213. in src/inputfetcher.h:165 in a07936a62c outdated
     160 | +    }
     161 | +
     162 | +    //! Fetch all block inputs from db, and insert into cache.
     163 | +    void FetchInputs(CCoinsViewCache& cache, const CCoinsView& db, const CBlock& block) noexcept
     164 | +    {
     165 | +        if (block.vtx.size() <= 1 || m_worker_threads.size() == 0) {
    


    l0rinc commented at 2:21 PM on October 21, 2025:

    wouldn't m_worker_threads.size() == 0 be an error?


    andrewtoth commented at 1:44 PM on October 28, 2025:

    No, we can have no worker threads, for instance if started with -par=1. In that case just disable prefetching.


    l0rinc commented at 5:58 PM on November 2, 2025:

    Not a biggy, but this also seems uncovered by the unit tests.

  214. in src/bench/inputfetcher.cpp:18 in a07936a62c outdated
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto DELAY{2ms};
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +class DelayedCoinsView : public CCoinsView
    


    l0rinc commented at 2:24 PM on October 21, 2025:

    I personally would favor going for simpler code as opposed to going for theoretically better encapsulation, similarly to current inputfetcher_tests.cpp:

    struct DelayedCoinsView : CCoinsView
    

    andrewtoth commented at 2:01 PM on October 28, 2025:

    Done.

  215. in src/bench/inputfetcher.cpp:32 in a07936a62c outdated
      27 | +    {
      28 | +        UninterruptibleSleep(m_delay);
      29 | +        return Coin{};
      30 | +    }
      31 | +
      32 | +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    


    l0rinc commented at 2:25 PM on October 21, 2025:

    We can add better assertions if we count the iterator size here:

    bool BatchWrite(CoinsViewCacheCursor& cursor, const uint256&) override
    {
        for (auto it{cursor.Begin()}; it != cursor.End(); it = cursor.NextAndMaybeErase(*it)) {
            m_write_count++;
        }
        return true;
    }
    

    So the bench could be (without flush which makes the bench runs equivalent):

    bench.run([&] {
        CCoinsViewCache block_cache{&cache};
        fetcher.FetchInputs(cache, block_cache, db, block);
        assert(db.m_write_count == 0 && cache.GetCacheSize() == 0 && block_cache.GetCacheSize() == 4599);
    });
    

    andrewtoth commented at 1:49 PM on October 28, 2025:

    When do we do a batch write though in this example?


    l0rinc commented at 4:28 PM on October 28, 2025:

    Originally we flushed, so if we still want to, we may want to extend this to assert that behavior. Or avoid flushing and simplify the bench. Or add the flushing behavior above to a test, etc.


    andrewtoth commented at 4:37 PM on October 28, 2025:

    We avoid flushing now.


    l0rinc commented at 6:08 PM on November 2, 2025:

    I think we could still add

            ankerl::nanobench::doNotOptimizeAway(&temp_cache);
            Assert(temp_cache.GetCacheSize() == 4599);
    

    to document that the benchmark can loop now (i.e. every iteration should be the same)

  216. in src/validation.cpp:3140 in a07936a62c
    3136 | @@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3139 |      {
    3140 | +        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    


    l0rinc commented at 2:30 PM on October 21, 2025:

    Given that we already create the actual temporary top cache here, it would be great if we could separate the reading and writing (read from old/big in-memory cache and write to the temporary small one):

    CCoinsViewCache* cache{&CoinsTip()};
    CCoinsViewCache new_cache{cache};
    m_chainman.FetchInputs(*cache, new_cache, CoinsDB(), *block_to_connect);
    

    This would also mean that subsequent reads come from the top-layer cache directly instead of always requiring two hops (assuming many missing coins). It also makes sense not to add inputs from blocks that fail validation - which we'd get for free this way.


    andrewtoth commented at 1:50 PM on October 28, 2025:

    Done.

  217. l0rinc commented at 3:32 PM on October 21, 2025: contributor

    I love the new structure, it's a lot easier to track the progress compared to previous versions. It's also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet! It also pairs well with other Siphash related changes that are in the pipeline.

    I have reimplemented most of it locally to make sure I can do a meaningful review, see https://github.com/l0rinc/bitcoin/pull/47 for my attempt. It's not finished, I'm still experimenting with different alternatives to make sure we can make this as simple and useful as possible (e.g. using barriers, filtering and sorting before fetch, adding dedicated ThreadPool, splitting reading and writing caches etc.), but wanted to publish my observations so far.

    My biggest remaining concern is that the threadpool should definitely be untangled from the InputFetcher logic (I hate concurrent mutability); it's an independent concern mixed with non-trivial fetcher logic. We also shouldn't parallelize based on CPU count in the first place; I don't see why we'd do that. And the ThreadPool part still needs independent tests.

    I have tried fetching everything on a single thread, and the same with sorted UTXOs, and it does seem to be ~6% faster on an SSD (I'd expected an even bigger difference on HDD, still measuring that) - but I don't have a final solution that's faster than everything else (since I could only solve it with single-threaded filtering, which is slow).

    <details> <summary>Details</summary>

    b6ccd542fd single-threaded NO sort
    c50a6bd981 single-threaded + sorted fetch
    
    reindex-chainstate | 700000 blocks | dbcache 100 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b6ccd542fdc371d3ecd90164169e3d5d7c60e82d)
      Time (abs ≡):        11322.014 s               [User: 20431.572 s, System: 1375.995 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = c50a6bd9815ebb2e0d86d5e6b679da5ac021b796)
      Time (abs ≡):        10646.049 s               [User: 19231.516 s, System: 1268.395 s]
    

    </details>

  218. andrewtoth commented at 4:55 PM on October 21, 2025: contributor

    Thank you for your review @l0rinc!

    I love the new structure, it's a lot easier to track the progress compared to previous versions. It's also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet!

    :rocket:

    we can make this as simple and useful as possible

    What else do you have in mind for this being useful? It has a very focused purpose in my view. I would love to make it as simple as possible.

    using barriers

    Done :)

    filtering and sorting before fetch

    Can you expand on why we would want this? Prefiltering on the main thread was always slower than filtering in parallel when I was testing this idea.

    adding dedicated ThreadPool

    Yes, see #26966. Hopefully we can pull that out and it can be used here. See https://github.com/andrewtoth/bitcoin/tree/test-thread-pool.

    splitting reading and writing caches

    I'm not sure what you mean by this? There is only one cache, and we read from it concurrently, then write to it on a single thread. Can you expand on how we can split the cache into two? Also, why would we want separate caches?

    My remaining biggest concern is that the threadpool should definitely be untangled from the InputFetcher logic

    Is this not accomplished by ThreadLoop, Work and OnCompletion functions? The Work function has no thread pool logic at all, it is completely independent and can be run by one thread or many. Do you have any concrete suggestions, or specific codepaths that are tangled that are concerning?

    I hate concurrent mutability

    I'm not sure I understand this. There is no concurrent mutability in my implementation. That would be undefined behavior. Can you point out the data members that are being concurrently mutated and how? Perhaps we have different definitions of concurrent mutability.

    We also shouldn't parallelize based on CPU in the first place, I don't see why we'd do that.

    This is following the logic of the check queue. Ideally we could reuse both threads in the same threadpool in the future. Do you have any concrete recommendations on how many threads we should run? Using the number of threads per CPU is yielding a 20% speed boost, so it seems like a sane enough choice.

  219. l0rinc commented at 6:02 PM on October 24, 2025: contributor

    Let me summarize our offline discussions:

    Cache hierarchy

    During block connection we're adding an extra temporary in-memory dbcache layer on top so that whatever happens during block connection doesn't end up polluting the big dbcache or leveldb. I think we should take advantage of this and use the temporary top layer to collect the missing inputs:

    • if block validation fails we can just throw it out
    • it's easier to test and benchmark, we can rerun the same operation without changing the underlying state
    • the missing values will all be fetched from the top layer now, avoiding two-hop lookups
    • since the worker threads never read from the temp cache and never write to the big cache, the main thread can also copy the needed inputs from the big in-memory cache into the temporary top layer (without locking) while the other threads fetch from the DB (those are IO bound, this is CPU bound). That would create a dedicated dbcache per block which we could eventually flush independently.

    This might need some workarounds, but it does enable new cache invalidation opportunities.

    Sorted fetch

    Since LevelDB writes (via the CDBBatch and MemTable) are already sorted and result in a significant speedup compared to inserting the values one-by-one (though likely this isn't just the effect of sorting), I thought it would make sense to experiment with sorted fetches as well. I have added an extra fetch (similarly to this PR, see https://github.com/l0rinc/bitcoin/commit/b72f67d4a88495a0222cbb9ae825daa6ea38e4df) before calling ConnectBlock so that the cache is pre-warmed, but on a single thread. This is the baseline; in the next commit, after gathering the missing values, I sort them and still fetch them from the db one-by-one, but in sorted order.

    The results indicate that sorted fetching is a lot faster, especially on lower-end devices (i7 with HDD and Rpi5 with SSD).

    <details> <summary>6% faster reindex-chainstate | 919191 blocks | dbcache 450 | rpi5-16-2</summary>

    COMMITS="7d27af98c7cf858b5ab5a02e64f89a857cc53172 ccc748b05858a9ebeb375cc1f3e7426698394470 ead8da2b33117807e27a66f814cf11cdf676d194"; \
    STOP=919191; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz
    ccc748b058 coins: prefetch inputs on single thread
    ead8da2b33 coins: sorted input prefetch
    
    reindex-chainstate | 919191 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
      Time (abs ≡):        42271.083 s               [User: 66842.872 s, System: 8003.776 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
      Time (abs ≡):        40776.988 s               [User: 67394.771 s, System: 7130.144 s]
     
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
      Time (abs ≡):        38435.516 s               [User: 64537.357 s, System: 6716.634 s]
     
    Relative speed comparison
            1.10          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
            1.06          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
    

    </details>

    (interestingly it seems this is even faster than master by a lot, we're not yet sure what's causing that since the fetches are still single-threaded)

    Similar speedup on HDD:

    <details> <summary>5% faster reindex-chainstate | 919191 blocks | dbcache 450 | i7-hdd</summary>

    STOP=919191; DBCACHE=450; \                                                                                                                   
    CC=gcc; CXX=g++; \                                                                                                                                                                  
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                                           
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                                      
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \                                  
      --sort command \                           
      --runs 1 \                                 
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                                      
      --parameter-list COMMIT ${COMMITS// /,} \                                               
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                                  
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                                  
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                                        
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                                   
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                                
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"                         
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                             
    ccc748b058 coins: prefetch inputs on single thread                                        
    ead8da2b33 coins: sorted input prefetch                                                   
    
    reindex-chainstate | 919191 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD                                      
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                                        
      Time (abs ≡):        42897.195 s               [User: 38613.661 s, System: 3154.561 s]                                                                                            
                                                 
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                                               
      Time (abs ≡):        42015.404 s               [User: 40096.242 s, System: 3180.038 s]                                                                                            
                                                 
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                                               
      Time (abs ≡):        40176.361 s               [User: 38614.897 s, System: 3047.461 s]                                                                                            
                                                 
    Relative speed comparison                    
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                               
            1.05          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                               
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)   
    

    </details>

    On a very powerful i9 with a very performant SSD, sorting isn't faster than master, but it's still a lot faster than random fetching:

    <details> <summary>4% faster reindex-chainstate | 919191 blocks | dbcache 450 | i9-ssd</summary>

    STOP=919191; DBCACHE=450; \                                                                                                      
    CC=gcc; CXX=g++; \                                                                                                                                                     
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                              
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                         
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\                                  
    hyperfine \                              
      --sort command \                       
      --runs 1 \                             
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                         
      --parameter-list COMMIT ${COMMITS// /,} \                                        
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                     
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                     
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                           
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                      
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                   
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"            
    
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                      
    ccc748b058 coins: prefetch inputs on single thread                                 
    ead8da2b33 coins: sorted input prefetch                                            
    
    reindex-chainstate | 919191 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD                        
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                    
      Time (abs ≡):        20383.488 s               [User: 38259.799 s, System: 2702.188 s]                                                                               
                                             
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                    
      Time (abs ≡):        21213.645 s               [User: 39728.945 s, System: 2709.999 s]                                                                               
                                             
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                    
      Time (abs ≡):        20476.751 s               [User: 38448.254 s, System: 2597.600 s]                                                                               
                                             
    Relative speed comparison                
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)           
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)           
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)  
    

    </details>

    Theoretically this can also be affected by LevelDB's internal options.block_cache and options.block_size and iteroptions.fill_cache (currently experimenting with different values to see if it makes any difference). This is why I have suggested trying a read-only snapshot per thread: if they're reading sorted values, the previous path could theoretically be cached, which can help avoid a few lookups.

    Separate ThreadPool

    I think we should be able to use a single barrier for starting and stopping, assuming we either want to use all of the threads or none of them, something like https://github.com/l0rinc/bitcoin/commit/67e79e041b524eb1b0039ff39d4fc5b235d89b8f#diff-7ad6c5646dfd3749d2634861dfaf98e8c6fbc041a8e8eded973ff39929acffd7 or #33689. This would help completely separate the multithreaded part from the InputFetcher, which would also allow testing that critical behavior separately. It would also help minimize the mutable state; otherwise concurrent calls could theoretically swap out the work from under the started threads (so likely the m_task in my example should be atomic, I haven't given it enough thought).

    Since sorted fetching currently needs a preprocessing step, we could use the available threads for the filtering as well (single-threaded filtering is slow, as shown above): filter for missing inputs on all threads, sort the missing ones on the main thread, then spread the db fetches across all threads. That leaves the main thread free for other work (db fetching isn't really CPU bound, so there's no need to involve it), such as the outer-layer warming from the main in-memory dbcache mentioned above. An alternative to the single-threaded "fetch-missing & sort & get" is "fetch-all & sort & filter & get", which is more parallelizable. And since everything in the outer layer will be spent anyway, in the future maybe we could even flush it to the db on a background thread (given that we've just copied the items to the temp cache, after a successful flush we could simply remove them from the main cache). That would likely simplify the main dbcache's dirty and spent behaviors; it's outside the scope of this PR, but it could provide extra motivation for these changes.

  220. andrewtoth force-pushed on Oct 27, 2025
  221. andrewtoth force-pushed on Oct 27, 2025
  222. andrewtoth commented at 1:58 AM on October 27, 2025: contributor

    Thank you @l0rinc for your detailed review and suggestions! I have taken some of them. The input fetcher has now been redesigned.

    • Coins are written to the ephemeral cache that is created just to be used in ConnectBlock, instead of the main cache. This requires a new method in CCoinsViewCache - GetPossiblySpentCoinFromCache. Since we write to an empty cache instead of CoinsTip(), we could insert a Coin that exists in the db but is spent in CoinsTip(). Previously that insertion would fail since we were inserting again into CoinsTip() which would not overwrite the spent coin.
    • Because of this we can safely write to the ephemeral cache on the main thread, since the worker threads will be reading from a different cache. So the main thread does not do any fetching, but it writes fetched Coins to the cache in parallel as workers fetch them. It is a lock-free MPSC queue. The workers and main thread synchronize on an atomic Status member of each input. The workers set it to READY while the main thread spins on it until it is no longer WAITING. This is a substantial speed improvement over the previous version, and it also lets us insert Coins from the main cache into the ephemeral cache faster. So this change speeds up block connection even when there are no cache misses, and makes the workers utilize more CPU rather than just IO.
    • A single barrier is used to synchronize the threads. Both the main and worker threads call arrive_and_wait before they begin their work and after completion.

    I removed the last commit, and instead added the InputFetcher as multi-threaded from the start. It didn't make sense to split it, since the worker threads and the main thread now do different things.

    I haven't been able to effectively utilize a sorting strategy. Sorting the inputs before fetching doesn't seem to have a benefit in the multi-threaded approach; the overhead of copying the COutPoints and sorting them always outweighed the gains. I think the parallel fetching dominates any speedup achieved by sorted fetching anyway.

  223. andrewtoth force-pushed on Oct 27, 2025
  224. l0rinc commented at 9:48 PM on October 27, 2025: contributor

    Looks like I forgot to push the comments last time - that explains your surprise. I pushed all of them, many are out of date, please just resolve them. Sorry for the confusion.

    Since I'm reposting these old comments (since commented from the main GitHub view), lemme' post a recently finished Rpi4 benchmark showing a 31% speedup for an older push \:D/

    <details> <summary>31% faster reindex-chainstate | 919191 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD</summary>

    COMMITS="063946d6bd78035276d12e070a208d84492ac5cd 64de91105312d36dadb5f71ec01fc6af9b14da69"; \
    STOP=919191; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    063946d6bd coins: add inputfetcher
    64de911053 coins: fetch coins on parallel threads
    
    reindex-chainstate | 919191 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
      Time (abs ≡):        329713.086 s               [User: 172834.637 s, System: 87207.916 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
      Time (abs ≡):        251976.600 s               [User: 177923.005 s, System: 116804.580 s]
     
    Relative speed comparison
            1.31          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
    

    </details>

  225. andrewtoth force-pushed on Oct 28, 2025
  226. in src/bench/inputfetcher.cpp:28 in 2aa5103481 outdated
      23 | +        Coin coin{};
      24 | +        coin.out.nValue = 1;
      25 | +        return coin;
      26 | +    }
      27 | +
      28 | +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    


    l0rinc commented at 4:40 PM on October 28, 2025:

    nit: we could throw now to indicate we don't have unrepeatable side-effects


    andrewtoth commented at 11:08 PM on November 2, 2025:

    We could just get rid of this line now.


    andrewtoth commented at 11:31 PM on November 2, 2025:

    Got rid of it.

  227. in src/bench/inputfetcher.cpp:18 in 2aa5103481 outdated
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto DELAY{2ms};
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +struct DelayedCoinsView : CCoinsView
    


    l0rinc commented at 4:40 PM on October 28, 2025:

    Nit: structs are now formatted consistently on the same line


    andrewtoth commented at 11:32 PM on November 2, 2025:

    Done.


    l0rinc commented at 8:44 AM on November 3, 2025:

    There are other ones that I didn't mention, reformatted the change, please take the ones that make sense:

    <details> <summary>Details</summary>

    diff --git a/src/coins.cpp b/src/coins.cpp
    index 2ef2e36ccc..baac1a32b5 100644
    --- a/src/coins.cpp
    +++ b/src/coins.cpp
    @@ -173,7 +173,8 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
         return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     }
     
    -std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const noexcept {
    +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept
    +{
         if (auto it{cacheCoins.find(outpoint)}; it != cacheCoins.end()) return it->second.coin;
         return std::nullopt;
     }
    diff --git a/src/coins.h b/src/coins.h
    index e7d28ace97..02c3ea9e15 100644
    --- a/src/coins.h
    +++ b/src/coins.h
    @@ -407,7 +407,7 @@ public:
          * Used in InputFetcher to make sure we do not add a coin from the backing
          * view when it is spent in the cache but not yet flushed to the parent.
          */
    -    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const noexcept;
    +    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept;
     
         /**
          * Return a reference to Coin in the cache, or coinEmpty if not found. This is
    diff --git a/src/inputfetcher.h b/src/inputfetcher.h
    index a8a3f4d1ad..74c655caf1 100644
    --- a/src/inputfetcher.h
    +++ b/src/inputfetcher.h
    @@ -46,8 +46,8 @@ private:
         struct Input {
             enum class Status : uint8_t {
                 WAITING, // The coin has not been fetched yet
    -            READY, // The coin has been fetched and is ready to be inserted into the cache
    -            FAILED, // The coin failed to be fetched
    +            READY,   // The coin has been fetched and is ready to be inserted into the cache
    +            FAILED,  // The coin failed to be fetched
                 SKIPPED, // The coin is created and spent in the same block so cannot be fetched
             };
     
    diff --git a/src/test/fuzz/inputfetcher.cpp b/src/test/fuzz/inputfetcher.cpp
    index cd2a0f5c68..70f5153912 100644
    --- a/src/test/fuzz/inputfetcher.cpp
    +++ b/src/test/fuzz/inputfetcher.cpp
    @@ -18,8 +18,7 @@
     
     using DbMap = std::map<const COutPoint, std::pair<std::optional<const Coin>, bool>>;
     
    -struct DbCoinsView : CCoinsView
    -{
    +struct DbCoinsView : CCoinsView {
         DbMap& m_map;
         DbCoinsView(DbMap& map) noexcept : m_map(map) {}
     
    @@ -35,8 +34,7 @@ struct DbCoinsView : CCoinsView
         }
     };
     
    -struct NoAccessCoinsView : CCoinsView
    -{
    +struct NoAccessCoinsView : CCoinsView {
         std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
     };
     
    @@ -49,7 +47,8 @@ FUZZ_TARGET(inputfetcher)
             fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
         InputFetcher fetcher{worker_threads};
     
    -    LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000) {
    +    LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000)
    +    {
             CBlock block;
             Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
     
    @@ -61,13 +60,13 @@ FUZZ_TARGET(inputfetcher)
             NoAccessCoinsView back;
             CCoinsViewCache main_cache(&back);
     
    -        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000) {
    +        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000)
    +        {
                 CMutableTransaction tx;
     
    -            LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10) {
    -                const auto txid{fuzzed_data_provider.ConsumeBool()
    -                    ? Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))
    -                    : prevhash};
    +            LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10)
    +            {
    +                const auto txid{fuzzed_data_provider.ConsumeBool() ? Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider)) : prevhash};
                     const auto index{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
                     const COutPoint outpoint(txid, index);
     
    @@ -87,8 +86,8 @@ FUZZ_TARGET(inputfetcher)
                         maybe_coin = std::nullopt;
                     }
                     db_map.try_emplace(outpoint, std::make_pair(
    -                    maybe_coin,
    -                    fuzzed_data_provider.ConsumeBool()));
    +                                                 maybe_coin,
    +                                                 fuzzed_data_provider.ConsumeBool()));
     
                     // Add the coin to the cache
                     if (fuzzed_data_provider.ConsumeBool()) {
    diff --git a/src/test/inputfetcher_tests.cpp b/src/test/inputfetcher_tests.cpp
    index 33fb8c6cb0..83f3d19432 100644
    --- a/src/test/inputfetcher_tests.cpp
    +++ b/src/test/inputfetcher_tests.cpp
    @@ -48,7 +48,7 @@ private:
     
     public:
         explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    -                             TestOpts opts = {})
    +                              TestOpts opts = {})
             : BasicTestingSetup{chainType, opts}
         {
             SeedRandomForTest(SeedRand::FIXED_SEED);
    @@ -200,8 +200,7 @@ BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, InputFetcherTest)
         }
     }
     
    -struct ThrowCoinsView : CCoinsView
    -{
    +struct ThrowCoinsView : CCoinsView {
         std::optional<Coin> GetCoin(const COutPoint&) const override
         {
             throw std::runtime_error("database error");
    diff --git a/src/validation.h b/src/validation.h
    index 26141305cb..2ff7c5ef6d 100644
    --- a/src/validation.h
    +++ b/src/validation.h
    @@ -1341,7 +1341,8 @@ public:
         void RecalculateBestHeader() EXCLUSIVE_LOCKS_REQUIRED(::cs_main);
     
         CCheckQueue<CScriptCheck>& GetCheckQueue() { return m_script_check_queue; }
    -    void FetchInputs(CCoinsViewCache& temp_cache, const CCoinsViewCache& main_cache, const CCoinsView& db, const CBlock& block) noexcept {
    +    void FetchInputs(CCoinsViewCache& temp_cache, const CCoinsViewCache& main_cache, const CCoinsView& db, const CBlock& block) noexcept
    +    {
             m_input_fetcher.FetchInputs(temp_cache, main_cache, db, block);
         }
    

    </details>


    andrewtoth commented at 10:15 PM on November 4, 2025:

    Took the formats, thanks!

  228. l0rinc commented at 8:40 PM on October 29, 2025: contributor

    I'd say it's time to update the PR description:

    <details> <summary>24% faster reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85 2aa510348143521a14146e41b5cf87cb3e60b29e"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log; \
                  grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    cb0fdfdf37 coins: add inputfetcher
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (abs ≡):        40539.887 s               [User: 69358.879 s, System: 7393.185 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
      Time (abs ≡):        32768.672 s               [User: 69495.022 s, System: 11880.553 s]
     
    Relative speed comparison
            1.24          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>

    With this change we're getting a 9-hour full reindex-chainstate on an rpi5 with the default 450MB memory. Wow!

  229. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >10% faster IBD
    validation: fetch block inputs on parallel threads >20% faster IBD
    on Oct 29, 2025
  230. andrewtoth force-pushed on Oct 31, 2025
  231. andrewtoth force-pushed on Oct 31, 2025
  232. l0rinc commented at 8:20 AM on November 1, 2025: contributor

    Rebased the latest version of the PR now that #31645 was merged and measured a reindex on my M4 laptop: it finished all 3 runs with dbcache 450, 4500 and 45000 overnight.

    <details> <summary>Details</summary>

    time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 \
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 \
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 
    
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   8805.12s user 1456.47s system 134% cpu 2:07:22.36 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate    12201.80s user 3698.94s system 197% cpu 2:13:54.50 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20676.11s user 10582.81s system 358% cpu 2:25:26.52 total
    

    repeated the same later to check for stability:

    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   8873.87s user 1468.17s system 134% cpu 2:07:47.62 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate    12178.09s user 3621.39s system 197% cpu 2:13:11.48 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20725.97s user 10666.85s system 359% cpu 2:25:33.12 total
    

    </details>

    <img width="1185" height="580" alt="image" src="https://github.com/user-attachments/assets/a530f7d2-e7e5-4fa4-9ab0-59d0359aa763" />

    Note: I did get a `bitcoind(70369,0x16f5ff000) malloc: Failed to allocate segment from range group - out of space` warning above; maybe 45 GB of memory was a bit too much, but I can continue the block connection after the reindexes, so the measurements are likely representative. Edit: reran the 45 GB case; it completed successfully in a similar time without errors.

    <details> <summary>Details</summary>

    reindex-chainstate seems correct, it can continue after:

    ./build/bin/bitcoind
    2025-11-01T08:16:25Z Bitcoin Core version v30.99.0-45fe0c0e5bed (release build)
    ...
    2025-11-01T08:16:29Z nBestHeight = 921129
    ...
    2025-11-01T08:16:29Z UpdateTip: new best=00000000000000000000a216ce3209897114b757dbac3b651c72484829637c0e height=921130 version=0x343ba000 log2_work=95.904905 tx=1262937761 date='2025-10-28T06:06:11Z' progress=0.998475 cache=0.5MiB(3812txo)
    2025-11-01T08:16:29Z UpdateTip: new best=00000000000000000001965d9ff2ef639dd6829da2ad19c3f5d46691475f0df1 height=921131 version=0x25472000 log2_work=95.904918 tx=1262938885 date='2025-10-28T06:13:50Z' progress=0.998477 cache=1.6MiB(9726txo)
    

    </details>

  233. andrewtoth force-pushed on Nov 1, 2025
  234. andrewtoth commented at 9:47 PM on November 1, 2025: contributor

    Rebased to test behavior with #31645. Some other touch-ups include:

    • Use const COutPoint& in Input struct instead of vin and vtx indexes
    • Cleanup shared vectors and pointers at the end of loop
    • Refactor inner work loop to do fewer existence checks at the expense of some duplicated code
    • Use std::atomic_size_t instead of std::atomic<size_t>
    • Use GetPossiblySpentCoinFromCache in tests and fuzz harness for better clarity and correctness
  235. in src/coins.h:489 in 62868c8846
     481 | @@ -474,6 +482,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     482 |      //! See: https://stackoverflow.com/questions/42114044/how-to-release-unordered-map-memory
     483 |      void ReallocateCache();
     484 |  
     485 | +    /**
     486 | +     * Reserve enough space in the cache so the underlying unordered_map will
     487 | +     * not have to rehash unless capacity is exceeded.
     488 | +     */
     489 | +    void Reserve(size_t capacity) {
    


    l0rinc commented at 2:39 PM on November 2, 2025:

    Nit: can you please reformat the change with latest clang-format after rebase? Nit2: this seems to belong closer to GetCacheSize, one reserves, the other returns the actual size


    andrewtoth commented at 11:32 PM on November 2, 2025:

    I couldn't get clang-format to work, but I made this a one-liner and moved it under GetCacheSize.

  236. in src/inputfetcher.h:165 in 62868c8846
     160 | +        m_cache = &cache;
     161 | +        m_input_head.store(0, std::memory_order_relaxed);
     162 | +        m_barrier.arrive_and_wait();
     163 | +
     164 | +        // Insert fetched coins into the temp_cache as they are set to READY.
     165 | +        temp_cache.Reserve(m_inputs.size() + outputs_count);
    


    l0rinc commented at 2:50 PM on November 2, 2025:

    I like that we're doing this, we've usually avoided preallocating our maps - we should do it more often!


    I know we're providing empty caches via the parameters, but for completeness either assert that they're empty or:

            temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
    

    I have added

            Assert(temp_cache.GetCacheSize() <= temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
    

    after the `for (auto& input : m_inputs) {` loop to validate that the reservation holds - the tests pass! 👍


    andrewtoth commented at 11:32 PM on November 2, 2025:

    Done.

  237. in src/inputfetcher.h:149 in 62868c8846 outdated
     144 | +        }
     145 | +
     146 | +        // Loop through the inputs of the block and set them in the queue.
     147 | +        // Construct the set of txids to filter, and count the outputs to reserve for temp_cache.
     148 | +        auto outputs_count{block.vtx[0]->vout.size()};
     149 | +        for (size_t i{1}; i < block.vtx.size(); ++i) {
    


    l0rinc commented at 2:57 PM on November 2, 2025:

    we're also just adding stuff to the m_inputs without any reservation, the first few times we will have some needless copying - we could likely alleviate that by doing a rough reservation

            m_txids.reserve(block.vtx.size());
            m_inputs.reserve(2 * block.vtx.size()); // rough guess
    

    andrewtoth commented at 11:07 PM on November 2, 2025:

    This won't have a measurable effect if we are connecting lots of blocks though.


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Did it anyways.

  238. in src/inputfetcher.h:71 in 62868c8846 outdated
      66 | +    /**
      67 | +     * The set of txids of all txs in the block being fetched.
      68 | +     * Used to filter out inputs that are created and spent in the same block,
      69 | +     * since they will not be in the db or the cache.
      70 | +     */
      71 | +    std::unordered_set<Txid, SaltedTxidHasher> m_txids{};
    


    l0rinc commented at 3:06 PM on November 2, 2025:

    I understand if you don't want to bother with this, but how many of these do we expect per block? I wonder if we want to incur the hashing cost here or if using a sorted set or a sorted vector with std::binary_search (or even (unsorted?) SIMD-enabled linear scan) would be faster or simpler here.

    I also thought of whether we could just add these to temp_cache directly, but that would likely pollute the up-coming validation (unless we can differentiate these from existing inputs).


    Taking block413567, we can assume for the benchmark that 5% of the txs are internal spends (we need to measure this properly before we decide); assuming 1556 txs (4886 inputs, 3581 outputs) and 287 internal spends, we can compare the alternatives - here's a bench for motivation.

    <details> <summary>Benchmark comparing the miss and hit count of Txid_SetOrdered, Txid_UnorderedSalted, Txid_VectorBinarySearch, Txid_VectorLinearScan</summary>

    // Copyright (c) 2022-present The Bitcoin Core developers
    // Distributed under the MIT software license, see the accompanying
    // file COPYING or https://www.opensource.org/licenses/mit-license.php.
    
    #include <bench/bench.h>
    #include <bench/nanobench.h>
    #include <primitives/transaction_identifier.h>
    #include <random.h>
    #include <util/check.h>
    #include <util/hasher.h>
    
    #include <algorithm>
    #include <ranges>
    #include <set>
    #include <unordered_set>
    #include <vector>
    
    namespace {
    
    constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    constexpr size_t tx_count{5500};
    
    struct Dataset {
        std::set<Txid> sorted_set;
        std::unordered_set<Txid, SaltedTxidHasher> unsorted_set;
        std::vector<Txid> vec_sorted;
        std::vector<Txid> vec_unsorted;
    
        std::vector<Txid> queries;
    };
    
    std::vector<Dataset> BuildDatasets()
    {
        FastRandomContext rng(/*fDeterministic=*/true);
    
        std::vector<Dataset> datasets;
        datasets.reserve(iterations);
    
        for (size_t d{0}; d < iterations; ++d) {
            Dataset ds;
            ds.queries.reserve(tx_count);
            ds.unsorted_set.reserve(tx_count);
            ds.vec_sorted.reserve(tx_count);
            ds.vec_unsorted.reserve(tx_count);
    
            for (size_t i{0}; i < tx_count; ++i) {
                Txid t{Txid::FromUint256(rng.rand256())};
                ds.sorted_set.emplace(t);
                ds.unsorted_set.emplace(t);
                ds.vec_sorted.emplace_back(t);
                ds.vec_unsorted.emplace_back(t);
    
                ds.queries.emplace_back(i < hits_count ? t : Txid::FromUint256(rng.rand256()));
            }
    
            std::ranges::shuffle(ds.queries, rng);
            std::ranges::shuffle(ds.vec_unsorted, rng);
            std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    
            datasets.emplace_back(std::move(ds));
        }
        return datasets;
    }
    
    } // namespace
    
    static void Txid_UnorderedSalted(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    

    </details>

    cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='Txid_.*'
    

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          217,247.09 |            4,603.05 |    0.0% |      0.02 | `Txid_SetOrdered`
    |          202,607.50 |            4,935.65 |    0.0% |      0.02 | `Txid_UnorderedSalted`
    |          194,347.91 |            5,145.41 |    0.0% |      0.02 | `Txid_VectorBinarySearch`
    |       12,449,880.83 |               80.32 |    0.0% |      1.24 | `Txid_VectorLinearScan`

    <details> <summary>Same measurement on an Rpi5 with GCC instead</summary>

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        1,179,326.18 |              847.94 |    0.0% |    2,361,313.36 |    2,820,742.15 |  0.837 |     475,693.50 |    1.2% |      0.12 | `Txid_SetOrdered`
    |          912,335.48 |            1,096.09 |    0.0% |    1,346,377.49 |    2,183,167.97 |  0.617 |      84,874.38 |    5.7% |      0.09 | `Txid_UnorderedSalted`
    |          732,493.03 |            1,365.20 |    0.0% |    2,411,291.90 |    1,752,928.05 |  1.376 |     531,672.95 |    8.1% |      0.07 | `Txid_VectorBinarySearch`
    |       25,449,442.73 |               39.29 |    0.0% |  235,989,723.07 |   60,929,976.87 |  3.873 |  58,998,462.96 |    0.0% |      2.54 | `Txid_VectorLinearScan`

    </details>


    So the benchmark isn't as revealing as I was hoping; please verify my assumptions, and we can still test these with some macro-benchmark to see if there's any performance or memory advantage to any of them.


    andrewtoth commented at 9:38 PM on November 2, 2025:

    I did try using a sorted vector and doing binary search in the workers, but it did not make a measurable performance difference. In theory it should be faster, since txid comparison is much faster than siphash. I can do this if you want, but I think the unordered_set makes the code clearer.


    andrewtoth commented at 1:42 AM on November 3, 2025:

    I wrote benchmarks to check how fast the filter is to construct, since this is the work that is not done in parallel:

    static void SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::vector<Txid> v{};
        v.reserve(block.vtx.size());
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
        });
    }
    
    static void UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::unordered_set<Txid, SaltedTxidHasher> u{};
        u.reserve(block.vtx.size());
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
        });
    }
    
    static void SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::set<Txid> s{};
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
        });
    }
    

    Results:

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           45,794.65 |           21,836.61 |    1.6% |      777,851.91 |      110,515.70 |  7.038 |      88,000.15 |    0.5% |      0.01 | `UnorderedSetBenchmark`
    |           96,598.09 |           10,352.17 |    1.8% |      574,593.10 |      233,289.00 |  2.463 |     129,942.30 |    1.5% |      0.01 | `SetBenchmark`
    |        1,037,787.00 |              963.59 |    5.1% |   12,280,496.00 |    2,472,876.00 |  4.966 |   2,863,217.00 |    1.3% |      0.02 | :wavy_dash: `SortedVectorBenchmark` (Unstable with ~1.9 iters. Increase `minEpochIterations` to e.g. 19)
    

    Obviously we don't want to use the sorted vector. unordered_set is roughly twice as fast as set. I don't think it's worth pursuing a different filter container.


    l0rinc commented at 9:09 AM on November 3, 2025:

    Hmm, I'm not sure this is correct.

    Obviously we don't want to use the sorted vector

    That's not what I'm getting. You were inserting to the same collection over and over, I'm not sure what we were measuring there. Adding the collection creation and an assertion to not optimize it away (to make sure we're measuring the same thing in every iteration) reveals something completely different.

    <details> <summary>updated benchmarking code</summary>

    #include <bench/bench.h>
    #include <bench/data/block413567.raw.h>
    #include <coins.h>
    #include <inputfetcher.h>
    #include <primitives/block.h>
    #include <serialize.h>
    #include <streams.h>
    #include <util/hasher.h>
    
    #include <algorithm>
    #include <set>
    #include <unordered_set>
    #include <vector>
    
    static void InputFetcher_SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::vector<Txid> v{};
            v.reserve(block.vtx.size());
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
            ankerl::nanobench::doNotOptimizeAway(v);
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::unordered_set<Txid, SaltedTxidHasher> u{};
            u.reserve(block.vtx.size());
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(u);
        });
    }
    
    static void InputFetcher_SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::set<Txid> s{};
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(s);
        });
    }
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark, benchmark::PriorityLevel::HIGH);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark, benchmark::PriorityLevel::HIGH);
    BENCHMARK(InputFetcher_SetBenchmark, benchmark::PriorityLevel::HIGH);
    

    </details>

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          246,121.59 |            4,063.03 |    0.0% |      978,180.25 |      883,481.68 |  1.107 |     224,954.25 |    2.8% |     11.00 | `InputFetcher_SetBenchmark`
    |          130,850.86 |            7,642.29 |    0.2% |      608,319.13 |      469,743.71 |  1.295 |     135,588.13 |    4.4% |     11.04 | `InputFetcher_SortedVectorBenchmark`
    |          171,092.09 |            5,844.81 |    0.0% |    1,207,752.53 |      614,142.35 |  1.967 |     174,278.58 |    1.1% |     11.00 | `InputFetcher_UnorderedSetBenchmark`

    <img width="1500" height="860" alt="image" src="https://github.com/user-attachments/assets/548ca5a4-bfe0-478c-9268-d76c6cd91e75" />


    andrewtoth commented at 1:30 PM on November 3, 2025:

    Hmm, right - I was not clearing the containers, so the sorting was dominating. I retried with clearing, and the unordered_set is still slightly faster. In your benchmark you are creating and reserving the container inside the benchmark loop instead of before it. In this implementation the reserved memory is kept in between blocks, so creating and reserving outside makes more sense.


    l0rinc commented at 10:48 AM on November 5, 2025:

    > In this implementation the reserved memory is kept in between blocks

    K, so let's reserve outside and clear inside. Now that missing values aren't failures, we can experiment with shortids (even though I wouldn't expect any collisions in 64 bits either, assuming uniform distribution; but even if the distribution isn't uniform, we can likely store it safely). 64-bit ids for internal spends would mean that in case of a collision we attempt to fetch something from disk that was actually in the current block - so an attacker can at best slow down block validation by a few milliseconds.

    <details> <summary>sorted/unsorted/vector benchmarks & shortids</summary>

    // Copyright (c) 2022-present The Bitcoin Core developers
    // Distributed under the MIT software license, see the accompanying
    // file COPYING or https://www.opensource.org/licenses/mit-license.php.
    
    #include <algorithm>
    #include <bench/bench.h>
    #include <bench/data/block413567.raw.h>
    #include <bench/nanobench.h>
    #include <coins.h>
    #include <functional>
    #include <inputfetcher.h>
    #include <primitives/block.h>
    #include <primitives/transaction_identifier.h>
    #include <random.h>
    #include <ranges>
    #include <serialize.h>
    #include <set>
    #include <streams.h>
    #include <unordered_set>
    #include <util/check.h>
    #include <util/hasher.h>
    #include <vector>
    
    namespace {
    constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    constexpr size_t tx_count{5500};
    
    uint64_t GetShortID(const Txid& txid)
    {
        return txid.ToUint256().GetUint64(0);
    }
    
    template <typename T>
    struct Dataset {
        std::set<T> sorted_set;
        std::unordered_set<T, std::conditional_t<std::is_same_v<T, Txid>, SaltedTxidHasher, std::identity>> unsorted_set;
        std::vector<T> vec_sorted;
        std::vector<T> vec_unsorted;
        std::vector<T> queries;
    
        static T Convert(const Txid& txid)
        {
            if constexpr (std::is_same_v<T, Txid>) {
                return txid;
            } else {
                static_assert(std::is_same_v<T, uint64_t>);
                return GetShortID(txid);
            }
        }
    };
    
    template <typename T>
    std::vector<Dataset<T>> BuildDatasets()
    {
        FastRandomContext rng(/*fDeterministic=*/true);
    
        std::vector<Dataset<T>> datasets;
        datasets.reserve(iterations);
    
        for (size_t d{0}; d < iterations; ++d) {
            Dataset<T> ds;
            ds.queries.reserve(tx_count);
            ds.unsorted_set.reserve(tx_count);
            ds.vec_sorted.reserve(tx_count);
            ds.vec_unsorted.reserve(tx_count);
    
            for (size_t i{0}; i < tx_count; ++i) {
                T t1{Dataset<T>::Convert(Txid::FromUint256(rng.rand256()))};
                ds.sorted_set.emplace(t1);
                ds.unsorted_set.emplace(t1);
                ds.vec_sorted.emplace_back(t1);
                ds.vec_unsorted.emplace_back(t1);
    
                T t2{Dataset<T>::Convert(Txid::FromUint256(rng.rand256()))};
                ds.queries.emplace_back(i < hits_count ? t1 : t2);
            }
    
            std::ranges::shuffle(ds.queries, rng);
            std::ranges::shuffle(ds.vec_unsorted, rng);
            std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    
            datasets.emplace_back(std::move(ds));
        }
        return datasets;
    }
    } // namespace
    
    static void Txid_UnorderedSalted(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_UnorderedSalted_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::ranges::binary_search(s.vec_sorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        const auto contains_linear{[](const std::vector<uint64_t>& v, uint64_t x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void InputFetcher_SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::vector<Txid> v{};
        v.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
            ankerl::nanobench::doNotOptimizeAway(v);
            v.clear();
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::unordered_set<Txid, SaltedTxidHasher> u{};
        u.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(u);
            u.clear();
        });
    }
    
    static void InputFetcher_SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::set<Txid> s{};
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(s);
            s.clear();
        });
    }
    
    static void InputFetcher_SortedVectorBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::vector<uint64_t> v{};
        v.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(GetShortID(tx->GetHash()));
            }
            std::ranges::sort(v);
            ankerl::nanobench::doNotOptimizeAway(v);
            v.clear();
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::unordered_set<uint64_t> u{};
        u.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(GetShortID(tx->GetHash()));
            }
            ankerl::nanobench::doNotOptimizeAway(u);
            u.clear();
        });
    }
    
    static void InputFetcher_SetBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::set<uint64_t> s{};
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(GetShortID(tx->GetHash()));
            }
            ankerl::nanobench::doNotOptimizeAway(s);
            s.clear();
        });
    }
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_SetBenchmark, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_SetBenchmark_shortid, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(Txid_UnorderedSalted_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan_shortid, benchmark::PriorityLevel::LOW);
    

    </details>

    <img width="1487" height="864" alt="image" src="https://github.com/user-attachments/assets/8f971f62-87c7-496f-a060-0dec2bb9cc51" />

    The new measurements indicate that the single-threaded preparation is 4x faster with a sorted vector, while multithreaded search is more than 2x faster with short ids compared to the current approach. Worth investigating further, I'd say :)

  239. in src/coins.cpp:176 in aeec2e421d outdated
     172 | @@ -173,6 +173,11 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
     173 |      return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     174 |  }
     175 |  
     176 | +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const {
    


    l0rinc commented at 4:40 PM on November 2, 2025:

    This seems like a code smell that we should pay attention to. I already brought this up a year ago in #30673 (review); it seems to keep biting us. Instead of adding a new method that basically does the same thing, can we keep GetCoin simply returning the coin and move the spentness checks to the call sites? I understand that would likely require another PR, but it would make this one cleaner - I don't like that we need a workaround for a scenario that isn't out of the ordinary.


    andrewtoth commented at 9:52 PM on November 2, 2025:

    This is a consequence of skipping the main cache and writing directly to the temp cache.

    I don't think we can modify GetCoin to return spent coins; it's part of its definition: "Retrieve the Coin (unspent transaction output) for a given outpoint". This also goes against your change #30849, where you added TODOs so that our test GetCoins, which currently return spent coins, stop doing so. I don't think a PR to return spent coins from GetCoin would get enough support to be merged.

    If GetCoin could return spent coins, we would also not be able to use it without first calling HaveCoinInCache. This is because even though GetCoin is const, it modifies cacheCoins internally (which is marked mutable for this purpose). But HaveCoinInCache also returns false if the coin is spent, so we would need to modify that too. So we would have to change two methods, and we would then also have to do two lookups. This method is very simple and defined precisely for this special purpose, similar to EmplaceCoinInCacheDANGER.

    I'm open to other suggestions, but modifying GetCoin (and then HaveCoinInCache) doesn't seem like the way.
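The single-lookup argument can be illustrated with a toy cache - `ToyCache`, `ToyCoin`, and the integer outpoint key are all hypothetical stand-ins, not the real CCoinsViewCache API. The lookup returns the entry even when spent, so the caller applies its own spentness policy after a single map access:

```cpp
#include <optional>
#include <unordered_map>

// Toy coin: a value plus a spent flag, mimicking Coin::IsSpent().
struct ToyCoin {
    int value{0};
    bool spent{false};
    bool IsSpent() const { return spent; }
};

// Toy stand-in for the coins cache, keyed by a simplified outpoint id.
struct ToyCache {
    std::unordered_map<int, ToyCoin> cacheCoins;

    // Cache-only lookup that returns the entry even if it is spent, so the
    // caller decides what spentness means - one lookup, no second
    // HaveCoinInCache-style probe needed.
    std::optional<ToyCoin> GetPossiblySpentCoinFromCache(int outpoint) const
    {
        const auto it{cacheCoins.find(outpoint)};
        if (it == cacheCoins.end()) return std::nullopt;
        return it->second;
    }
};
```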

  240. in src/inputfetcher.h:132 in aeec2e421d outdated
     127 | +public:
     128 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
     129 | +        : m_barrier{static_cast<int32_t>(worker_thread_count + 1)}
     130 | +    {
     131 | +        for (size_t n{0}; n < worker_thread_count; ++n) {
     132 | +            m_worker_threads.emplace_back([this, n]() {
    


    l0rinc commented at 4:54 PM on November 2, 2025:

    nit:

                m_worker_threads.emplace_back([this, n] {
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  241. in src/inputfetcher.h:128 in aeec2e421d outdated
     123 | +            m_barrier.arrive_and_wait();
     124 | +        }
     125 | +    }
     126 | +
     127 | +public:
     128 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
    


    l0rinc commented at 4:56 PM on November 2, 2025:
        explicit InputFetcher(int32_t worker_thread_count) noexcept : m_barrier{(worker_thread_count + 1)}
    

    this would remove a few static casts


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  242. in src/inputfetcher.h:62 in aeec2e421d outdated
      57 | +        const COutPoint& outpoint;
      58 | +        //! The coin that workers will fetch and main thread will insert into cache.
      59 | +        Coin coin{};
      60 | +
      61 | +        Input(Input&& other) noexcept : outpoint{other.outpoint} {} // Only moved in setup for reallocation.
      62 | +        Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    


    l0rinc commented at 4:58 PM on November 2, 2025:

    nit:

            explicit Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  243. in src/inputfetcher.h:43 in aeec2e421d outdated
      38 | + */
      39 | +class InputFetcher
      40 | +{
      41 | +private:
      42 | +    //! The latest input being fetched. Workers atomically increment this when fetching.
      43 | +    alignas(64) std::atomic_size_t m_input_head{0};
    


    l0rinc commented at 5:01 PM on November 2, 2025:

    is the alignas a false sharing guard? Does it have a measurable effect?


    andrewtoth commented at 9:54 PM on November 2, 2025:

    It was for false sharing. I don't think it has a measurable effect, but it might on some systems? I think it's harmless to keep, since there's only one InputFetcher.


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Removed it.

  244. in src/inputfetcher.h:189 in aeec2e421d outdated
     184 | +        m_cache = nullptr;
     185 | +    }
     186 | +
     187 | +    ~InputFetcher()
     188 | +    {
     189 | +        m_request_stop = true;
    


    l0rinc commented at 5:05 PM on November 2, 2025:

    I don't like that we need a separate field for this only, but I guess without std::jthread we have to do this manually.

    Since this field isn't checked on every iteration, only once per thread per job as far as I can tell, we could repurpose m_input_head and use it as

    alignas(64) std::atomic_int32_t m_input_head{0};
    ...
    if (m_input_head.load(std::memory_order_acquire) < 0) [[unlikely]] {
    ...
    m_input_head.store(-1, std::memory_order_relaxed);
    

    andrewtoth commented at 9:55 PM on November 2, 2025:

    I don't think that's actually preferable. The current way is more readable IMO.

  245. in src/inputfetcher.h:190 in aeec2e421d outdated
     185 | +    }
     186 | +
     187 | +    ~InputFetcher()
     188 | +    {
     189 | +        m_request_stop = true;
     190 | +        m_barrier.arrive_and_wait();
    


    l0rinc commented at 5:10 PM on November 2, 2025:

    not certain, but I think this should likely be:

            m_barrier.arrive_and_drop();
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  246. in src/inputfetcher.h:120 in aeec2e421d outdated
     115 | +                } catch (const std::runtime_error& e) {
     116 | +                    LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.\n", e.what());
     117 | +                }
     118 | +                // Input missing or spent. This block will fail validation.
     119 | +                // Skip remaining inputs.
     120 | +                m_input_head.store(m_inputs.size(), std::memory_order_relaxed);
    


    l0rinc commented at 5:13 PM on November 2, 2025:

    I understand that this poison pill broadcast is needed to stop execution when errors occur - but I'm not sure we should care here about fetching error, I think we should just try to continue if that makes the code simpler


    andrewtoth commented at 10:30 PM on November 2, 2025:

    In ConnectBlock if we encounter a missing input we abort validation immediately. It is wasted work to continue. Why should we do it here?


    l0rinc commented at 7:44 AM on November 3, 2025:

    I don't think we should abort, it's not the fetcher's job to validate. If we didn't, we could even use short ids for the intra-block spends (64 bit likely won't even result in a single duplicate, and when they do collide we would just do a db check)


    andrewtoth commented at 1:31 PM on November 3, 2025:

    Why would short ids matter here though? Isn't that for keeping the bandwidth smaller for compact blocks?


    l0rinc commented at 1:36 PM on November 3, 2025:

    no, I mean we don't actually need 256 bits of precision here, just a probabilistic check, so taking the first 32/64 bits of the hash should suffice. The worst case is just going to disk, so it's not a tragedy if there are false positives, as long as the average-case checks are a lot faster.


    andrewtoth commented at 2:25 PM on November 3, 2025:

    I don't see why aborting early would prevent us from using less precision. It's doubtful there would be a collision, and if there were, that block would just be a little slower to connect. I don't think it would have a measurable effect though - siphashing 64 bits vs 256 bits here?


    andrewtoth commented at 10:15 PM on November 4, 2025:

    I removed the abort early logic, so we keep going if we don't find an input. It makes the logic much simpler, but we will do some more work if we get a block mined that is double spending.

  247. in src/inputfetcher.h:91 in aeec2e421d outdated
      86 | +            if (m_request_stop) [[unlikely]] {
      87 | +                return;
      88 | +            }
      89 | +            while (true) {
      90 | +                const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      91 | +                if (i >= m_inputs.size()) [[unlikely]] {
    


    l0rinc commented at 5:47 PM on November 2, 2025:

    I don't mind the unlikely parts, but we're barely using them in the code and they have some weird side effects when merged - which may not be the case here, so I'm fine either way; if nothing else, it documents the usage.

  248. in src/inputfetcher.h:33 in aeec2e421d outdated
      28 | + * into the ephemeral cache used in ConnectBlock.
      29 | + *
      30 | + * It spawns a fixed set of worker threads that fetch Coins for each input
      31 | + * in a block. The Coin is moved into the Input struct and then the status is
      32 | + * atomically updated to READY. The main thread spin loops on the status field
      33 | + * until it is READY and then inserts it into the temporary cache.
    


    l0rinc commented at 5:49 PM on November 2, 2025:

    I'm surprised synchronized insertion is faster than collecting them in a lock-free way - guess it's all the copying, but we should be able to avoid that, I don't like that we're locking again


    andrewtoth commented at 10:00 PM on November 2, 2025:

    There are no locks... This is a lock free implementation. It is synchronized with atomics per input, and the threads are work stealing via the m_input_head. Or do you mean something else?
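The work-stealing scheme described here can be sketched in isolation - the names and sizes below are illustrative, not the PR's code. Workers share one atomic head counter and `fetch_add` to claim the next index, so each index is claimed exactly once and no lock is ever held:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each fetch_add returns a unique index, so indices are claimed exactly
// once; summing them stands in for "fetch input i".
int SumClaimedIndices(int n_inputs, int n_threads)
{
    std::atomic<int> head{0};
    std::atomic<int> sum{0};

    auto worker = [&] {
        while (true) {
            const int i{head.fetch_add(1, std::memory_order_relaxed)};
            if (i >= n_inputs) break;   // nothing left to claim
            sum.fetch_add(i);
        }
    };
    std::vector<std::thread> threads;
    for (int t{0}; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return sum.load();                  // 0 + 1 + ... + (n_inputs - 1)
}
```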


    l0rinc commented at 7:46 AM on November 3, 2025:

    Atomics work via CAS; they need to retry in case of high contention. The previous solution didn't have contention: the threads all knew beforehand what to work on.


    andrewtoth commented at 2:20 PM on November 3, 2025:

    Some contention on shared resources is unavoidable. The threads need to synchronize on atomics. There are no locks in this implementation that all other threads need to wait on.

    The main thread may have to wait here if there is a slow fetch, but it would be reading uncontested memory right up until one worker thread flips this bit. None of the other threads are trying to write to this same bit, so there is no contention between the main thread and any other worker threads.


    andrewtoth commented at 10:16 PM on November 4, 2025:

    I've updated this to be an `atomic_bool` instead of this enum. Since there is only a false value that is set to true, the main thread calls `input.ready.wait(false, std::memory_order_acquire);`. This way we don't spin.

  249. in src/inputfetcher.h:50 in aeec2e421d outdated
      45 | +    //! The inputs of the block which is being fetched.
      46 | +    struct Input {
      47 | +        enum class Status : uint8_t {
      48 | +            WAITING, // The coin has not been fetched yet
      49 | +            READY, // The coin has been fetched and is ready to be inserted into the cache
      50 | +            FAILED, // The coin failed to be fetched
    


    l0rinc commented at 5:50 PM on November 2, 2025:

    as mentioned before, is FAILED an important state for the fetcher? Why not just debug/warn log and continue?


    andrewtoth commented at 10:03 PM on November 2, 2025:

    We can't have the main thread insert a failed fetch; the coin will not be updated. I suppose the main thread could check if the coin is unspent. We also want to exit ASAP if we can't find an input, so we need to signal the main thread that it should exit the loop.


    l0rinc commented at 7:47 AM on November 3, 2025:

    that's my point, I don't think we should validate at all, it would simplify the code if we didn't


    andrewtoth commented at 10:19 PM on November 4, 2025:

    Done, this Status enum has been replaced. It is now just an `atomic_bool ready`.

  250. in src/inputfetcher.h:104 in aeec2e421d outdated
      99 | +                    continue;
     100 | +                }
     101 | +                try {
     102 | +                    if (auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)}) {
     103 | +                        input.coin = std::move(*coin);
     104 | +                        if (!input.coin.IsSpent()) [[likely]] { // Coin from cache could be spent
    


    l0rinc commented at 5:57 PM on November 2, 2025:

    hmm, code states it's likely, but the new tests are passing with:

    if (!input.coin.IsSpent()) {
        throw "";
    }
    

    andrewtoth commented at 10:42 PM on November 2, 2025:

    Right, didn't cover this new happy path in tests. Unlikely case is covered though. Will add. Thanks.

  251. in src/test/inputfetcher_tests.cpp:25 in e3045d2237 outdated
      20 | +#include <string>
      21 | +#include <unordered_set>
      22 | +
      23 | +BOOST_AUTO_TEST_SUITE(inputfetcher_tests)
      24 | +
      25 | +struct InputFetcherTest : BasicTestingSetup {
    


    l0rinc commented at 5:59 PM on November 2, 2025:

    I think the test belongs with the implementation, I need it to be able to review the first commit (otherwise it's just dead code, with the tests it has at least a test user)


    andrewtoth commented at 11:34 PM on November 2, 2025:

    Done.

  252. in src/test/inputfetcher_tests.cpp:21 in e3045d2237 outdated
      16 | +
      17 | +#include <cstdint>
      18 | +#include <memory>
      19 | +#include <stdexcept>
      20 | +#include <string>
      21 | +#include <unordered_set>
    


    l0rinc commented at 6:02 PM on November 2, 2025:

    nit: some of these seem unused:

    #include <memory>
    #include <stdexcept>
    #include <unordered_set>
    

    andrewtoth commented at 11:04 PM on November 2, 2025:

    We need `<cstdint>` because we use `int32_t`.

  253. in src/test/inputfetcher_tests.cpp:85 in 62868c8846
      80 | +                db.EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
      81 | +            }
      82 | +        }
      83 | +
      84 | +        CCoinsViewCache main_cache(&db);
      85 | +        CCoinsViewCache cache(&cache);
    


    l0rinc commented at 8:03 PM on November 2, 2025:

    hmm, how does this even compile?


    andrewtoth commented at 11:34 PM on November 2, 2025:

    Fixed. But, it wouldn't affect the correctness of the test.


    l0rinc commented at 7:49 AM on November 3, 2025:

    how so? Can we assert the behavior of the main cache as well so that the previous version doesn't pass?


    andrewtoth commented at 2:43 PM on November 5, 2025:

    > Can we assert the behavior of the main cache as well so that the previous version doesn't pass?

    Neither the temp cache nor the main cache touches its backing view in the input fetcher, so we can't assert a failure if the backing cache is something else. We can assert that the backing view is not touched during FetchInputs, which is done in the fuzz harness.

  254. in src/test/fuzz/inputfetcher.cpp:43 in 62868c8846 outdated
      38 | +struct NoAccessCoinsView : CCoinsView
      39 | +{
      40 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
      41 | +};
      42 | +
      43 | +FUZZ_TARGET(inputfetcher)
    


    l0rinc commented at 8:24 PM on November 2, 2025:

    I don't have a good fuzzer locally - does this cover all the new code?


    andrewtoth commented at 11:05 PM on November 2, 2025:

    Yes, though I still need to fuzz it myself. The CI is fuzzing it a little bit.

  255. in src/inputfetcher.h:105 in 62868c8846
     100 | +                }
     101 | +                try {
     102 | +                    if (auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)}) {
     103 | +                        input.coin = std::move(*coin);
     104 | +                        if (!input.coin.IsSpent()) [[likely]] { // Coin from cache could be spent
     105 | +                            // We need release here, so setting coin 2 lines above happens before the main thread loads.
    


    l0rinc commented at 8:28 PM on November 2, 2025:

    The comments indicate that you also don't think this code is very intuitive - I'm a bit lost here too; can we simplify this somehow? I don't even understand what "release" means here, why both branches result in Status::READY (can we unify them?), or what happens if the first inner if isn't fulfilled, or the outer one, or the else, etc. The branching + continue + try/catch doesn't help.


    andrewtoth commented at 11:35 PM on November 2, 2025:

    Rewrote this part to make it more clear. I was trying to be clever by avoiding some extra checks, but they are probably meaningless and clarity is better here.

  256. l0rinc changes_requested
  257. l0rinc commented at 8:31 PM on November 2, 2025: contributor

    I like how we're progressing here! I think we need a few more things and have to try out a few alternatives (I haven't given up on sorting yet, especially now with bigger dbcache) and want to see how this combines with the other optimizations (threadpool, SipHash13, map hash caching, etc), but I'm definitely getting closer and closer to an ACK :D

    I'm testing full IBD locally on my servers, but those runs are always slower than reindex-chainstate since the peer nodes can't send the blocks fast enough - I don't have a seeding node yet, but I'm working on it.

    "It simply inserts inputs into the temporary cache, which must be fetched before a transaction is validated anyways." - beautiful, I think some of this should be added to the commit messages as well. Commit message nit: "coins: add InputFetcher" (the commit message content and formatting are a bit sloppy).

    I'd merge `fuzz: add inputfetcher fuzz harness`, `tests: add inputfetcher tests`, and `coins: add inputfetcher`, since the clients are needed for the review of InputFetcher - otherwise it's just dead code being added...

    <details> <summary>local patch I had during review - they're not necessarily suggestions, just changes I did locally</summary>

    diff --git a/src/bench/CMakeLists.txt b/src/bench/CMakeLists.txt
    index 9d03f075a7..dcb6281699 100644
    --- a/src/bench/CMakeLists.txt
    +++ b/src/bench/CMakeLists.txt
    @@ -52,6 +52,7 @@ add_executable(bench_bitcoin
       streams_findbyte.cpp
       strencodings.cpp
       txgraph.cpp
    +  txid_membership.cpp
       txorphanage.cpp
       util_time.cpp
       verify_script.cpp
    diff --git a/src/bench/inputfetcher.cpp b/src/bench/inputfetcher.cpp
    index c10fcc5b5e..c1660c3ccf 100644
    --- a/src/bench/inputfetcher.cpp
    +++ b/src/bench/inputfetcher.cpp
    @@ -12,20 +12,18 @@
     #include <streams.h>
     #include <util/time.h>
     
    -static constexpr auto DELAY{2ms};
    -
     //! Simulates a DB by adding a delay when calling GetCoin
     struct DelayedCoinsView : CCoinsView
     {
         std::optional<Coin> GetCoin(const COutPoint&) const override
         {
    -        UninterruptibleSleep(DELAY);
    +        UninterruptibleSleep(2ms);
             Coin coin{};
             coin.out.nValue = 1;
             return coin;
         }
     
    -    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { throw std::logic_error{"unused"}; }
     };
     
     static void InputFetcherBenchmark(benchmark::Bench& bench)
    @@ -39,11 +37,13 @@ static void InputFetcherBenchmark(benchmark::Bench& bench)
         // The main thread should be counted to prevent thread oversubscription, and
         // to decrease the variance of benchmark results.
         const auto worker_threads_num{GetNumCores() - 1};
    -    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    +    InputFetcher fetcher{worker_threads_num};
     
         bench.run([&] {
             CCoinsViewCache temp_cache(&main_cache);
             fetcher.FetchInputs(temp_cache, main_cache, db, block);
    +        ankerl::nanobench::doNotOptimizeAway(&temp_cache);
    +        Assert(temp_cache.GetCacheSize() == 4599);
         });
     }
     
    diff --git a/src/bench/txid_membership.cpp b/src/bench/txid_membership.cpp
    new file mode 100644
    index 0000000000..e646bb2a4a
    --- /dev/null
    +++ b/src/bench/txid_membership.cpp
    @@ -0,0 +1,134 @@
    +// Copyright (c) 2022-present The Bitcoin Core developers
    +// Distributed under the MIT software license, see the accompanying
    +// file COPYING or https://www.opensource.org/licenses/mit-license.php.
    +
    +#include <bench/bench.h>
    +#include <bench/nanobench.h>
    +#include <primitives/transaction_identifier.h>
    +#include <random.h>
    +#include <util/check.h>
    +#include <util/hasher.h>
    +
    +#include <algorithm>
    +#include <ranges>
    +#include <set>
    +#include <unordered_set>
    +#include <vector>
    +
    +namespace {
    +
    +constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    +constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    +constexpr size_t tx_count{5500};
    +
    +struct Dataset {
    +    std::set<Txid> sorted_set;
    +    std::unordered_set<Txid, SaltedTxidHasher> unsorted_set;
    +    std::vector<Txid> vec_sorted;
    +    std::vector<Txid> vec_unsorted;
    +
    +    std::vector<Txid> queries;
    +};
    +
    +std::vector<Dataset> BuildDatasets()
    +{
    +    FastRandomContext rng(/*fDeterministic=*/true);
    +
    +    std::vector<Dataset> datasets;
    +    datasets.reserve(iterations);
    +
    +    for (size_t d{0}; d < iterations; ++d) {
    +        Dataset ds;
    +        ds.queries.reserve(tx_count);
    +        ds.unsorted_set.reserve(tx_count);
    +        ds.vec_sorted.reserve(tx_count);
    +        ds.vec_unsorted.reserve(tx_count);
    +
    +        for (size_t i{0}; i < tx_count; ++i) {
    +            Txid t{Txid::FromUint256(rng.rand256())};
    +            ds.sorted_set.emplace(t);
    +            ds.unsorted_set.emplace(t);
    +            ds.vec_sorted.emplace_back(t);
    +            ds.vec_unsorted.emplace_back(t);
    +
    +            ds.queries.emplace_back(i < hits_count ? t : Txid::FromUint256(rng.rand256()));
    +        }
    +
    +        std::ranges::shuffle(ds.queries, rng);
    +        std::ranges::shuffle(ds.vec_unsorted, rng);
    +        std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    +
    +        datasets.emplace_back(std::move(ds));
    +    }
    +    return datasets;
    +}
    +
    +} // namespace
    +
    +static void Txid_UnorderedSalted(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += s.unsorted_set.contains(q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_SetOrdered(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += s.sorted_set.contains(q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_VectorLinearScan(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
    +        return std::ranges::find(v, x) != v.end();
    +    }};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += contains_linear(s.vec_unsorted, q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    diff --git a/src/coins.h b/src/coins.h
    index 1b3dcfc309..95c3bfe2f5 100644
    --- a/src/coins.h
    +++ b/src/coins.h
    @@ -486,7 +486,8 @@ public:
          * Reserve enough space in the cache so the underlying unordered_map will
          * not have to rehash unless capacity is exceeded.
          */
    -    void Reserve(size_t capacity) {
    +    void Reserve(size_t capacity)
    +    {
             cacheCoins.reserve(capacity);
         }
     
    diff --git a/src/inputfetcher.h b/src/inputfetcher.h
    index cd9eaed6ea..d15c4874cc 100644
    --- a/src/inputfetcher.h
    +++ b/src/inputfetcher.h
    @@ -17,6 +17,7 @@
     #include <atomic>
     #include <barrier>
     #include <cstdint>
    +#include <set>
     #include <stdexcept>
     #include <thread>
     #include <unordered_set>
    @@ -38,9 +39,8 @@
      */
     class InputFetcher
     {
    -private:
         //! The latest input being fetched. Workers atomically increment this when fetching.
    -    alignas(64) std::atomic_size_t m_input_head{0};
    +    alignas(64) std::atomic_int32_t m_input_head{0};
     
         //! The inputs of the block which is being fetched.
         struct Input {
    @@ -59,7 +59,7 @@ private:
             Coin coin{};
     
             Input(Input&& other) noexcept : outpoint{other.outpoint} {} // Only moved in setup for reallocation.
    -        Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    +        explicit Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
         std::vector<Input> m_inputs{};
     
    @@ -68,7 +68,7 @@ private:
          * Used to filter out inputs that are created and spent in the same block,
          * since they will not be in the db or the cache.
          */
    -    std::unordered_set<Txid, SaltedTxidHasher> m_txids{};
    +    std::set<Txid> m_txids{};
     
         //! DB coins view to fetch from.
         const CCoinsView* m_db{nullptr};
    @@ -77,18 +77,15 @@ private:
     
         std::vector<std::thread> m_worker_threads{};
         std::barrier<> m_barrier;
    -    bool m_request_stop{false};
     
         void WorkLoop() noexcept
         {
             while (true) {
                 m_barrier.arrive_and_wait();
    -            if (m_request_stop) [[unlikely]] {
    -                return;
    -            }
    +            if (m_input_head.load(std::memory_order_relaxed) < 0) [[unlikely]] return;
                 while (true) {
    -                const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
    -                if (i >= m_inputs.size()) [[unlikely]] {
    +                const auto i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
    +                if (i >= int32_t(m_inputs.size())) [[unlikely]] {
                         break;
                     }
                     auto& input{m_inputs[i]};
    @@ -125,11 +122,10 @@ private:
         }
     
     public:
    -    explicit InputFetcher(size_t worker_thread_count) noexcept
    -        : m_barrier{static_cast<int32_t>(worker_thread_count + 1)}
    +    explicit InputFetcher(int32_t worker_thread_count) noexcept : m_barrier{(worker_thread_count + 1)}
         {
    -        for (size_t n{0}; n < worker_thread_count; ++n) {
    -            m_worker_threads.emplace_back([this, n]() {
    +        for (int32_t n{0}; n < worker_thread_count; ++n) {
    +            m_worker_threads.emplace_back([this, n] {
                     util::ThreadRename(strprintf("inputfetch.%i", n));
                     WorkLoop();
                 });
    @@ -145,6 +141,8 @@ public:
     
             // Loop through the inputs of the block and set them in the queue.
             // Construct the set of txids to filter, and count the outputs to reserve for temp_cache.
    +        //m_txids.reserve(block.vtx.size());
    +        m_inputs.reserve(2 * block.vtx.size()); // rough guess
             auto outputs_count{block.vtx[0]->vout.size()};
             for (size_t i{1}; i < block.vtx.size(); ++i) {
                 const auto& tx{block.vtx[i]};
    @@ -162,7 +160,7 @@ public:
             m_barrier.arrive_and_wait();
     
             // Insert fetched coins into the temp_cache as they are set to READY.
    -        temp_cache.Reserve(m_inputs.size() + outputs_count);
    +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
             for (auto& input : m_inputs) {
                 auto status{input.status.load(std::memory_order_acquire)};
                 while (status == Input::Status::WAITING) {
    @@ -175,6 +173,7 @@ public:
                     break;
                 }
             }
    +        Assert(temp_cache.GetCacheSize() <= temp_cache.GetCacheSize() + m_inputs.size() + outputs_count); // TODO remove
     
             m_barrier.arrive_and_wait();
             // Cleanup after all worker threads have exited the inner loop.
    @@ -186,11 +185,9 @@ public:
     
         ~InputFetcher()
         {
    -        m_request_stop = true;
    -        m_barrier.arrive_and_wait();
    -        for (auto& t : m_worker_threads) {
    -            t.join();
    -        }
    +        m_input_head.store(-1, std::memory_order_relaxed);
    +        m_barrier.arrive_and_drop();
    +        for (auto& t : m_worker_threads) t.join();
         }
     };
     
    diff --git a/src/test/inputfetcher_tests.cpp b/src/test/inputfetcher_tests.cpp
    index b92a15d291..3d085bc843 100644
    --- a/src/test/inputfetcher_tests.cpp
    +++ b/src/test/inputfetcher_tests.cpp
    @@ -14,10 +14,8 @@
     
     #include <boost/test/unit_test.hpp>
     
    -#include <cstdint>
     #include <memory>
     #include <stdexcept>
    -#include <string>
     #include <unordered_set>
     
     BOOST_AUTO_TEST_SUITE(inputfetcher_tests)
    @@ -82,7 +80,7 @@ BOOST_FIXTURE_TEST_CASE(fetch_inputs, InputFetcherTest)
             }
     
             CCoinsViewCache main_cache(&db);
    -        CCoinsViewCache cache(&cache);
    +        CCoinsViewCache cache(&main_cache);
             getFetcher().FetchInputs(cache, main_cache, db, block);
     
             std::unordered_set<Txid, SaltedTxidHasher> txids{};
    diff --git a/src/validation.cpp b/src/validation.cpp
    index 7564b97a07..07ab71852f 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -6299,7 +6299,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
     
     ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
         : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    -      m_input_fetcher{std::clamp<size_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    +      m_input_fetcher{std::clamp<int32_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
           m_interrupt{interrupt},
           m_options{Flatten(std::move(options))},
           m_blockman{interrupt, std::move(blockman_options)},
    

    </details>

  258. andrewtoth force-pushed on Nov 2, 2025
  259. l0rinc commented at 9:34 AM on November 3, 2025: contributor

    I have the IBD numbers for the i7-hdd and i9-ssd servers. They're not as glorious as our reindex-chainstate measurements, most likely because I don't yet have a way to test IBD from extremely fast peers. But as a sanity check I think it's fine - we're still bandwidth bound, which is a good problem to have.

    <details> <summary>7% faster IBD | 921129 blocks | dbcache 4500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; \
    STOP=921129; DBCACHE=4500; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)      
      Time (mean ± σ):     36470.995 s ± 113.187 s    [User: 38024.880 s, System: 2035.545 s]
      Range (min … max):   36390.960 s … 36551.030 s    2 runs
      
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)                                                                         
      Time (mean ± σ):     33962.782 s ± 375.163 s    [User: 41686.176 s, System: 2832.686 s]
      Range (min … max):   33697.502 s … 34228.062 s    2 runs
      
    Relative speed comparison
            1.07 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
    

    </details>

    and

    <details> <summary>12% faster IBD | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    COMMITS="aeec2e421d2ba102d905633d474f0fb88f91a9bf 62868c8846f043477d128788eadced3e71522417"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    aeec2e421d coins: add inputfetcher
    62868c8846 validation: fetch block inputs via InputFetcher before connecting
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = aeec2e421d2ba102d905633d474f0fb88f91a9bf)
      Time (mean ± σ):     30907.958 s ± 1761.510 s    [User: 51503.334 s, System: 3833.881 s]
      Range (min … max):   29662.383 s … 32153.534 s    2 runs
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 62868c8846f043477d128788eadced3e71522417)
      Time (mean ± σ):     27504.900 s ± 2529.652 s    [User: 59336.996 s, System: 5796.247 s]
      Range (min … max):   25716.165 s … 29293.634 s    2 runs
     
    Relative speed comparison
            1.12 ±  0.12  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = aeec2e421d2ba102d905633d474f0fb88f91a9bf)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 62868c8846f043477d128788eadced3e71522417)
    

    </details>

    and

    <details> <summary>9% faster IBD | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="2aa510348143521a14146e41b5cf87cb3e60b29e cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting
    cb0fdfdf37 coins: add inputfetcher
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
      Time (mean ± σ):     61239.351 s ± 4942.104 s    [User: 90457.852 s, System: 16836.057 s]
      Range (min … max):   57744.756 s … 64733.946 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (mean ± σ):     66848.997 s ± 1122.176 s    [User: 88025.800 s, System: 11057.384 s]
      Range (min … max):   66055.499 s … 67642.496 s    2 runs
      
    Relative speed comparison
            1.09 ±  0.09  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>


    The reindex-chainstate cases (which we can look at as a more stable way of testing offline IBD) show very good results even for the max-memory use case (45 GB dbcache) - and confirm @andrewtoth's claim that we may be able to deprecate the -dbcache argument, since it has barely any effect after this change!

    <details> <summary>3% faster reindex-chainstate | 921129 blocks | dbcache 45000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; \
    STOP=921129; DBCACHE=45000; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
      Time (abs ≡):        16044.026 s               [User: 23421.874 s, System: 695.027 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
      Time (abs ≡):        15643.115 s               [User: 26237.588 s, System: 984.588 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
    

    </details>

    The same measurement with the default dbcache is essentially as fast as with max memory (450 MB -> 15818 seconds vs 45 GB -> 15643 seconds):

    <details> <summary>30% faster reindex-chainstate | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
      Time (abs ≡):        20500.654 s               [User: 40766.355 s, System: 2845.314 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
      Time (abs ≡):        15818.604 s               [User: 45952.420 s, System: 4127.137 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
            1.30          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
    

    </details>

    <details> <summary>17% faster reindex-chainstate | 921129 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85 2aa510348143521a14146e41b5cf87cb3e60b29e"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") && hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log; \
                  grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    cb0fdfdf37 coins: add inputfetcher                                            
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting                                                                                
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (abs ≡):        43407.876 s               [User: 40230.765 s, System: 3077.358 s]
                                                                                  
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)                       
      Time (abs ≡):        37189.669 s               [User: 45706.002 s, System: 4452.708 s]
      
    Relative speed comparison
            1.17          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>

    <img width="1503" height="869" alt="image" src="https://github.com/user-attachments/assets/2a3fa7c0-6d5e-4afb-a346-99aa733ac9e8" />

  260. in src/test/fuzz/inputfetcher.cpp:50 in 21e5e10bf3 outdated
      45 | +    SeedRandomStateForTest(SeedRand::ZEROS);
      46 | +    FuzzedDataProvider fuzzed_data_provider(buffer.data(), buffer.size());
      47 | +
      48 | +    const auto worker_threads{
      49 | +        fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
      50 | +    InputFetcher fetcher{worker_threads};
    


    sedited commented at 11:19 AM on November 3, 2025:

    I'm observing a memory leak in this fuzz test similar to the one we had for the thread pool. Over there, we disabled logging and instantiate only a single pool instance: https://github.com/bitcoin/bitcoin/pull/33689/files#diff-68602d972fe2b027e3987eff0042c27f3f00fc161b7fe871bdb147571f348298R49-R58 . Maybe similar things can be done here?


    l0rinc commented at 11:22 AM on November 3, 2025:

    Thanks for reporting, that explains why I was seeing

    bench_bitcoin(10874,0x207c84800) malloc: Failed to allocate segment from range group - out of space
    

    and

    bitcoind(70369,0x16f5ff000) malloc: Failed to allocate segment from range group - out of space
    

    recently (cc: @maflcko)


    maflcko commented at 11:44 AM on November 3, 2025:

    Interesting, how would a fuzz target lead to a crash in bench or bitcoind? Did you run it in parallel?

    Though, the suggestion to disable logging is correct, because while fuzzing, we probably don't want to spend cycles on log formatting. I guess those logs only end up in the buffer which causes the memory to grow? (There should be a DEFAULT_MAX_LOG_BUFFER, so if buffering is the issue, the memory should be limited)


    sedited commented at 12:23 PM on November 3, 2025:

    The following patch seems to stabilize memory consumption:

    diff --git a/src/test/fuzz/inputfetcher.cpp b/src/test/fuzz/inputfetcher.cpp
    index cd2a0f5c68..609b8e1191 100644
    --- a/src/test/fuzz/inputfetcher.cpp
    +++ b/src/test/fuzz/inputfetcher.cpp
    @@ -43 +43,10 @@ struct NoAccessCoinsView : CCoinsView
    -FUZZ_TARGET(inputfetcher)
    +std::optional<InputFetcher> g_fetcher{};
    +
    +static void setup_threadpool_test()
    +{
    +    LogInstance().DisableLogging();
    +    g_fetcher.emplace(3);
    +}
    +
    +FUZZ_TARGET(inputfetcher, .init = setup_threadpool_test)
    @@ -48,4 +56,0 @@ FUZZ_TARGET(inputfetcher)
    -    const auto worker_threads{
    -        fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
    -    InputFetcher fetcher{worker_threads};
    -
    @@ -115 +120 @@ FUZZ_TARGET(inputfetcher)
    -        fetcher.FetchInputs(cache, main_cache, db, block);
    +        g_fetcher->FetchInputs(cache, main_cache, db, block);
    

    l0rinc commented at 12:27 PM on November 3, 2025:

    Are we sure we're not just masking a real problem with the disabled logger?


    sedited commented at 12:33 PM on November 3, 2025:

    The logger is way less problematic in terms of its effect on the memory growth, and I find it difficult to really pin down its effect. Having a global input fetcher that does not get instantiated with every fuzzer iteration has an immediate and clear effect. It is not clear to me if we are actually leaking anything through the threads, or if creating and destroying thousands of threads per second puts too much pressure on the OS (same for the threadpool).


    andrewtoth commented at 10:18 PM on November 4, 2025:

    @TheCharlatan thanks for fuzzing, and the diff for the fuzzer! I have taken it, and added you as a co-author :heart_hands:. @l0rinc it is concerning that you are getting malloc errors. Are there any other details you can share about this?


    l0rinc commented at 7:49 AM on November 6, 2025:

    Did the same on master:

    git log -1
    commit 5c5704e730796c6f31e2d7891bf6334674a04219 (HEAD, upstream/master, upstream/HEAD, origin/master, origin/HEAD)
    

    and unfortunately I'm getting the same:

    time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0\
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0\
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0
    bitcoind(71239,0x170f2f000) malloc: Failed to allocate segment from range group - out of space
    ...
    

    so it's not related to this PR

    Also fuzzed for almost a day, no problems came up:

    ...
    #12935791       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1640/4082 MS: 3 EraseBytes-PersAutoDict-InsertRepeatedBytes- DE: "\001\\"-
    #12941278       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1734/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\377\031"-
    #12943860       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1303/4082 MS: 2 InsertByte-EraseBytes-
    #12956441       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2407/4082 MS: 1 EraseBytes-
    #12973964       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2400/4082 MS: 3 ChangeByte-ChangeBinInt-EraseBytes-
    #12980710       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1835/4082 MS: 1 EraseBytes-
    #12986931       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 471/4082 MS: 1 EraseBytes-
    #12987148       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1295/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\377\377\377\377"-
    #13001660       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2399/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\332\377\377\377"-
    #13008530       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 3079/4082 MS: 5 ChangeByte-PersAutoDict-EraseBytes-ChangeASCIIInt-InsertRepeatedBytes- DE: "\363\006\000\000\000\000\000\000"-
    #13015051       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 3063/4082 MS: 1 EraseBytes-
    #13026498       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2374/4082 MS: 2 ChangeBinInt-EraseBytes-
    
  261. andrewtoth force-pushed on Nov 4, 2025
  262. DrahtBot added the label CI failed on Nov 4, 2025
  263. andrewtoth force-pushed on Nov 4, 2025
  264. DrahtBot removed the label CI failed on Nov 4, 2025
  265. andrewtoth commented at 1:46 AM on November 5, 2025: contributor

    Benchmarked the latest up to block 921129 and it's 16% faster :rocket:. Not as fast as some of @l0rinc's numbers, but this is on a laptop with an internal NVMe SSD. This change will benefit most when disk IO has higher latency, like network-connected storage.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | `echo d606c36a13ca2a055d1a4eb4c623fb6aa45405b2 && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 18498.670 ± 16.716 | 18486.850 | 18510.490 | 1.00 |
    | `echo 25c45bb0d0bd6618ec9296a1a43605657124e5de && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 21537.077 ± 123.626 | 21449.660 | 21624.494 | 1.16 ± 0.01 |

    Also refactored to not stop early if an input is missing. This lets us simplify the logic: we can get rid of the different status flags and just synchronize each input on an atomic bool.

  266. andrewtoth force-pushed on Nov 5, 2025
  267. andrewtoth force-pushed on Nov 6, 2025
  268. in src/inputfetcher.h:81 in 63dde36c1d outdated
      76 | +     * @return false if there are no more inputs in the queue to fetch
      77 | +     */
      78 | +    bool FetchCoin() noexcept
      79 | +    {
      80 | +        const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      81 | +        if (i >= m_inputs.size()) [[unlikely]] return false;
    


    l0rinc commented at 3:22 PM on November 6, 2025:

    when can this be true?


    andrewtoth commented at 5:28 PM on November 6, 2025:

    This is true when all inputs have been fetched from the block. We want the compiler to optimize for the case where we have work.

  269. in src/inputfetcher.h:83 in 63dde36c1d
      78 | +    bool FetchCoin() noexcept
      79 | +    {
      80 | +        const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      81 | +        if (i >= m_inputs.size()) [[unlikely]] return false;
      82 | +        auto& input{m_inputs[i]};
      83 | +        if (std::binary_search(m_txids.begin(), m_txids.end(), input.outpoint.hash.ToUint256().GetUint64(0))) {
    


    l0rinc commented at 3:22 PM on November 6, 2025:
            if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    
  270. in src/inputfetcher.h:93 in 63dde36c1d
      88 | +        auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)};
      89 | +        if (!coin) {
      90 | +            try {
      91 | +                coin = m_db->GetCoin(input.outpoint);
      92 | +            } catch (const std::runtime_error& e) {
      93 | +                LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.\n", e.what());
    


    l0rinc commented at 3:24 PM on November 6, 2025:

    nit: trailing newline shouldn't be needed anymore

                    LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.", e.what());
    
  271. in src/inputfetcher.h:120 in 63dde36c1d
     115 | +            const auto& tx{block.vtx[i]};
     116 | +            outputs_count += tx->vout.size();
     117 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     118 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     119 | +        }
     120 | +        std::sort(m_txids.begin(), m_txids.end());
    


    l0rinc commented at 3:24 PM on November 6, 2025:
            std::ranges::sort(m_txids);
    
  272. DrahtBot added the label CI failed on Nov 6, 2025
  273. DrahtBot commented at 3:25 PM on November 6, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task lint: https://github.com/bitcoin/bitcoin/actions/runs/19139133943/job/54698986986</sub> <sub>LLM reason (✨ experimental): Trailing whitespace detected in src/inputfetcher.h caused the lint check to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  274. in src/inputfetcher.h:134 in 63dde36c1d
     129 | +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
     130 | +        for (auto& input : m_inputs) {
     131 | +            while (!input.ready.test(std::memory_order_acquire)) {
     132 | +                // Work too while we wait
     133 | +                if (!FetchCoin()) {
     134 | +                    input.ready.wait(false, std::memory_order_acquire);
    


    l0rinc commented at 3:26 PM on November 6, 2025:

    based on https://en.cppreference.com/w/cpp/atomic/atomic_flag/wait.html

                        input.ready.wait(/*old*/false, std::memory_order_acquire);
    
  275. in src/inputfetcher.h:132 in 63dde36c1d
     127 | +
     128 | +        // Insert fetched coins into the temp_cache as they are set to ready.
     129 | +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
     130 | +        for (auto& input : m_inputs) {
     131 | +            while (!input.ready.test(std::memory_order_acquire)) {
     132 | +                // Work too while we wait
    


    l0rinc commented at 3:27 PM on November 6, 2025:

    not yet sure I fully understand why this is needed; it's more complex, but does it result in at least a theoretical speedup?


    andrewtoth commented at 5:31 PM on November 6, 2025:

    The work done on the other threads is much slower than what the main thread is doing. Fetching inputs possibly from disk vs inserting coins into the temporary cache. So, in cases where there are fewer worker threads the main thread will likely be waiting on the work to be done. In these cases we can just start fetching as well. I think this would have a large impact in cases where there are few worker threads (like rpis), or if using a low but > 1 -par value. For instance, using -par=2 this would in theory double the efficiency of fetching inputs from disk. It is likely not a measurable effect for setups that have 16 or more vcpus.


    l0rinc commented at 6:09 PM on November 6, 2025:

    I'm not sure I understand why: what's the difference between two threads working while main is waiting vs one thread working and main also working? Where does the difference come from (especially given that the work is not fully CPU-bound)?


    andrewtoth commented at 6:29 PM on November 6, 2025:

    two threads working while main is waiting vs one thread working and main also working?

    It would be two threads working while main is waiting vs two threads working and main also working? So the latter has 3 threads working vs the former's 2 threads working. If using -par=3 this is the case. 50% more work is done in parallel.


    l0rinc commented at 6:52 PM on November 6, 2025:

    So why not just do the work that par defined and leave main asleep, wouldn't that be simpler while basically achieving exactly the same?


    andrewtoth commented at 7:02 PM on November 6, 2025:

    No, because we would have to insert all entries into the temp cache at the end, instead of parallelizing that work as well. That was the previous implementation, where we waited for every thread to be done and then inserted everything in series before exiting. Now the main thread does both: it inserts while others are fetching, but if it inserts fast enough that it ends up waiting for new entries, it will also fetch entries.


    andrewtoth commented at 7:10 PM on November 6, 2025:

    On a system with 15 worker threads + main, it is likely that main will not be waiting much. The other 15 threads are busy setting newly fetched coins to ready, so the main thread can continuously read true for the ready flags. On a system with only 3 worker threads + main, it is likely that the 3 workers will not be able to fetch and mark coins ready fast enough for the main thread to never wait. For instance, all 3 threads are fetching from disk, and the main thread reads the next input while its ready flag is still false. It can either wait until one of the 3 workers fetches an input from disk, or start helping out and fetch from disk itself. This increases parallel throughput by 33%, since 4 workers are fetching instead of just 3. When main returns with its fetched coin, it can insert the rest of the coins that the workers fetched and marked ready while it was busy. Then it catches up again to the latest input that is not yet ready, and fetches another coin.


    l0rinc commented at 8:18 PM on November 6, 2025:

    We discussed this out of band, here's the summary:

    • the main thread is special because it has access to the dbcache, it can insert there without locking the cache.
    • the threads each have their inputs now, each of which have a switch to signal when they've fetched the input and move on to the next
    • the threads each compete for which input to fetch, marking them one-by-one as ready
    • the main thread goes in order, spins until the next one is available, if it needs to wait, it does some fetching itself, rechecks later if the given value is available and if so, it inserts to the dbcache, after which it checks the next value in order.

    This way the IO and CPU bound work is parallelized, so we don't need to do the heavy rehashing at the end, it's done while the other threads are doing IO work - that's why it's faster.

    I have to think about this, it sounds like we can simplify this further, but this is already an improvement over previous solutions.


    And the worst that can happen for an internal spend is that another input in the same block shares the same prefix: we wouldn't fetch it here, and it would be fetched during block connection as before. This means we likely don't even need 64 bits for that; 32 are likely enough. Back-of-the-napkin calculations indicate that roughly 1/1000 blocks would contain transactions that won't be fetched by InputFetcher and would need to be fetched during block connection instead, on a single thread. As long as 32-bit checks are faster than 64-bit (which should definitely be the case for the sorted-vector case), this should likely result in an overall speedup. We definitely need to add a test case for that.

  276. in src/inputfetcher.h:139 in 63dde36c1d
     134 | +                    input.ready.wait(false, std::memory_order_acquire);
     135 | +                    break;
     136 | +                }
     137 | +            }
     138 | +            if (input.coin.IsSpent()) continue;
     139 | +            temp_cache.EmplaceCoinInternalDANGER(COutPoint{input.outpoint}, std::move(input.coin));
    


    l0rinc commented at 3:27 PM on November 6, 2025:

    `continue` effectively skipping the last line is a bit confusing and dangerous; it should only skip the next line:

                if (!input.coin.IsSpent()) {
                    temp_cache.EmplaceCoinInternalDANGER(COutPoint{input.outpoint}, std::move(input.coin));
                }
    
  277. in src/inputfetcher.h:61 in 63dde36c1d outdated
      56 | +    /**
      57 | +     * The set of first 8 bytes of txids of all txs in the block being fetched.
      58 | +     * Used to filter out inputs that are created and spent in the same block,
      59 | +     * since they will not be in the db or the cache.
      60 | +     */
      61 | +    std::vector<uint64_t> m_txids{};
    


    l0rinc commented at 3:29 PM on November 6, 2025:

    should we document here what happens in case of a cache miss or collision?

  278. in src/inputfetcher.h:76 in 63dde36c1d
      71 | +
      72 | +    /**
      73 | +     * Fetches the next input in the queue. Safe to call from any thread once inside the barrier.
      74 | +     * 
      75 | +     * @return true if there are more inputs in the queue to fetch
      76 | +     * @return false if there are no more inputs in the queue to fetch
    


    l0rinc commented at 3:33 PM on November 6, 2025:

    this will fix the linter failure as well (whitespace after *):

         * 
     * @return whether there are more inputs in the queue to fetch
    
  279. l0rinc changes_requested
  280. in src/inputfetcher.h:78 in 63dde36c1d
      73 | +     * Fetches the next input in the queue. Safe to call from any thread once inside the barrier.
      74 | +     * 
      75 | +     * @return true if there are more inputs in the queue to fetch
      76 | +     * @return false if there are no more inputs in the queue to fetch
      77 | +     */
      78 | +    bool FetchCoin() noexcept
    


    l0rinc commented at 5:00 PM on November 6, 2025:

    can we call it something else to not coincide with CCoinsViewCache::FetchCoin

  281. andrewtoth force-pushed on Nov 6, 2025
  282. DrahtBot removed the label CI failed on Nov 6, 2025
  283. andrewtoth force-pushed on Nov 8, 2025
  284. andrewtoth force-pushed on Nov 8, 2025
  285. andrewtoth force-pushed on Nov 12, 2025
  286. andrewtoth force-pushed on Nov 14, 2025
  287. l0rinc commented at 10:29 AM on November 14, 2025: contributor

    I was still wondering how the number of parallel threads affects this, given that it's not a CPU-bound task.

    The measurements were done on an i9, an M4 and an rpi4. The first two have 16 threads, the rpi has 4. The results still indicate to me that it doesn't make sense to set parallelism directly from the number of CPUs: beyond 4-8 threads the systems didn't really perform any better.

    <img width="1489" height="861" alt="image" src="https://github.com/user-attachments/assets/c6065154-0625-4609-928f-867c957ba2e4" />

    I will continue measuring it on other systems as well, but wanted to share preliminary results since these measurements take a lot of time.

    <details> <summary>i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 2 4 8 16 32 64; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par; done
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    real    94m1.512s
    user    169m2.995s
    sys     11m59.017s
    
    real    86m23.533s
    user    171m32.112s
    sys     12m42.929s
    
    real    82m58.013s
    user    179m20.981s
    sys     14m25.288s
    
    real    82m38.540s
    user    197m24.427s
    sys     20m31.753s
    
    real    82m38.442s
    user    197m16.954s
    sys     20m32.770s
    
    real    82m46.060s
    user    197m44.609s
    sys     20m37.426s
    

    </details>

    <details> <summary>Macbook Pro M4 Max</summary>

    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && \
    git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
    for par in 2 4 8 16 32 64; do
      time ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10711.00s user 1495.61s system 181% cpu 1:51:52.59 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10601.27s user 1311.29s system 207% cpu 1:35:49.25 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     11362.34s user 2558.23s system 242% cpu 1:35:29.27 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12203.63s user 5783.56s system 299% cpu 1:40:07.62 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12124.30s user 5764.53s system 298% cpu 1:40:01.07 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12276.36s user 5816.05s system 300% cpu 1:40:13.27 total
    

    </details>

    <details> <summary>rpi4-8-1</summary>

    root@rpi4-8-1:/mnt/my_storage/bitcoin# for par in 4 8; do \
        COMMITS="c2b0239001629a43d50cb8eb00e884423db89b38"; \
        STOP=700000; DBCACHE=450; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par"; \
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=4 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36057.895 s               [User: 61089.585 s, System: 12571.590 s]
     
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=8 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        35682.583 s               [User: 61828.582 s, System: 14162.402 s]
    

    </details>

    Edit: updated times:

    <img width="1495" height="866" alt="image" src="https://github.com/user-attachments/assets/d5c8257f-5b46-40ce-9730-6a6b39049fb1" />

    Edit2: update with even more fine-grained measurements <img width="1493" height="868" alt="image" src="https://github.com/user-attachments/assets/7636f9f0-3e3a-484f-a8c0-61a00a6515fa" />

    <details> <summary>Raw measurements</summary>

    
    i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 2 4 8 16 32 64; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par; done
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    real    94m1.512s
    user    169m2.995s
    sys     11m59.017s
    
    real    86m23.533s
    user    171m32.112s
    sys     12m42.929s
    
    real    82m58.013s
    user    179m20.981s
    sys     14m25.288s
    
    real    82m38.540s
    user    197m24.427s
    sys     20m31.753s
    
    real    82m38.442s
    user    197m16.954s
    sys     20m32.770s
    
    real    82m46.060s
    user    197m44.609s
    sys     20m37.426s
    
    
    
    root@rpi4-8-1:/mnt/my_storage/bitcoin# for par in 2 4 8 16; do \
        COMMITS="c2b0239001629a43d50cb8eb00e884423db89b38"; \
        STOP=700000; DBCACHE=450; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par"; \
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=2 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        39770.114 s               [User: 62016.921 s, System: 12176.118 s]
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=4 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36057.895 s               [User: 61089.585 s, System: 12571.590 s]
    
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=8 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        35682.583 s               [User: 61828.582 s, System: 14162.402 s]
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=16 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36414.980 s               [User: 63043.265 s, System: 16780.513 s]
    
    
    
    
    
    rpi5-16-1:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -par=$par -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    393m14.069s
    user    611m13.972s
    sys     64m44.860s
    
    real    335m41.301s
    user    615m9.967s
    sys     61m58.381s
    
    real    314m37.969s
    user    617m42.106s
    sys     60m47.302s
    
    real    313m26.580s
    user    619m26.740s
    sys     63m59.801s
    
    real    314m35.267s
    user    622m30.314s
    sys     66m53.456s
    
    real    314m14.335s
    user    621m22.819s
    sys     68m30.831s
    
    real    314m49.454s
    user    621m3.552s
    sys     70m1.733s
    
    real    316m10.270s
    user    624m1.233s
    sys     70m48.074s
    
    real    315m57.060s
    user    619m48.948s
    sys     72m10.586s
    
    real    316m27.926s
    user    622m56.166s
    sys     73m16.170s
    
    
    
    i9:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18; do   time ./build/bin/bitcoind -par=$par -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    395m38.371s
    user    624m54.760s
    sys     50m2.017s
    
    real    321m54.706s
    user    638m0.914s
    sys     45m16.018s
    
    real    295m13.108s
    user    635m21.111s
    sys     44m41.847s
    
    real    281m53.289s
    user    636m26.743s
    sys     45m4.569s
    
    real    274m35.436s
    user    638m44.171s
    sys     46m25.320s
    
    real    270m4.622s
    user    641m52.896s
    sys     45m44.507s
    
    real    266m44.501s
    user    647m12.988s
    sys     47m8.929s
    
    real    265m9.994s
    user    661m8.192s
    sys     48m54.941s
    
    real    263m49.009s
    user    675m41.091s
    sys     49m52.024s
    
    real    262m13.541s
    user    687m23.237s
    sys     51m29.839s
    
    real    262m29.564s
    user    701m19.813s
    sys     52m20.824s
    
    real    261m29.692s
    user    717m48.727s
    sys     53m54.110s
    
    real    260m36.882s
    user    727m35.223s
    sys     56m30.900s
    
    real    259m58.230s
    user    740m17.961s
    sys     57m37.615s
    
    real    259m46.496s
    user    750m59.419s
    sys     59m55.428s
    
    real    262m3.069s
    user    756m49.146s
    sys     63m51.639s
    
    real    262m29.892s
    user    755m7.876s
    sys     63m29.261s
    
    real    262m54.519s
    user    759m4.919s
    sys     62m51.652s
    
    
    
    i7:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10 11 12 13; do   time ./build/bin/bitcoind -par=$par -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    677m53.522s
    user    540m13.974s
    sys     48m0.685s
    
    real    616m58.994s
    user    547m24.056s
    sys     46m0.808s
    
    real    582m52.884s
    user    554m55.162s
    sys     45m21.949s
    
    real    577m5.937s
    user    563m47.629s
    sys     45m28.084s
    
    real    568m25.225s
    user    576m19.153s
    sys     46m34.582s
    
    real    566m31.568s
    user    586m8.162s
    sys     46m41.220s
    
    real    564m43.096s
    user    594m42.382s
    sys     49m17.091s
    
    real    558m40.218s
    user    600m10.625s
    sys     52m24.030s
    
    real    556m23.944s
    user    606m36.724s
    sys     55m27.147s
    
    real    565m47.020s
    user    607m41.449s
    sys     56m21.510s
    
    real    563m37.784s
    user    609m40.123s
    sys     57m35.573s
    
    real    567m43.207s
    user    608m53.189s
    sys     57m43.208s
    
    real    563m14.968s
    user    611m1.819s
    sys     58m43.991s
    
    
    Macbook Pro M4 Max
    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && \
    git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
    for par in 2 4 8 16 32 64; do
      time ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10711.00s user 1495.61s system 181% cpu 1:51:52.59 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10601.27s user 1311.29s system 207% cpu 1:35:49.25 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     11362.34s user 2558.23s system 242% cpu 1:35:29.27 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12203.63s user 5783.56s system 299% cpu 1:40:07.62 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12124.30s user 5764.53s system 298% cpu 1:40:01.07 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12276.36s user 5816.05s system 300% cpu 1:40:13.27 total
    

    </details>

  288. andrewtoth force-pushed on Nov 23, 2025
  289. andrewtoth force-pushed on Nov 23, 2025
  290. DrahtBot added the label CI failed on Nov 23, 2025
  291. DrahtBot commented at 1:55 AM on November 23, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task fuzzer,address,undefined,integer, no depends: https://github.com/bitcoin/bitcoin/actions/runs/19603900050/job/56139454218</sub> <sub>LLM reason (✨ experimental): Fuzz run crashes with UndefinedBehaviorSanitizer: null-pointer-use (null CTransaction dereference in CoinsViewCacheAsync).</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  292. andrewtoth marked this as a draft on Nov 23, 2025
  293. andrewtoth force-pushed on Nov 23, 2025
  294. andrewtoth force-pushed on Nov 23, 2025
  295. andrewtoth force-pushed on Nov 23, 2025
  296. andrewtoth force-pushed on Nov 23, 2025
  297. andrewtoth force-pushed on Nov 23, 2025
  298. andrewtoth force-pushed on Nov 23, 2025
  299. andrewtoth force-pushed on Nov 23, 2025
  300. andrewtoth force-pushed on Nov 23, 2025
  301. DrahtBot removed the label CI failed on Nov 23, 2025
  302. andrewtoth marked this as ready for review on Nov 23, 2025
  303. andrewtoth commented at 5:13 PM on November 23, 2025: contributor

    I've updated the PR to make the InputFetcher a subclass of CCoinsViewCache. Instead of waiting on the MPSC queue to finish before connecting the block, the queue can now be processed inside CCoinsViewCache::FetchCoin during ConnectBlock. This makes the fetching non-blocking, which is a significant performance improvement. It's been renamed to CoinsViewCacheAsync.

    @l0rinc thank you for your very thorough benchmarks. I've updated this to use 4 worker threads, which yields the same speed as 15 threads on my benchmark machine. I've also made the thread count non-configurable, as I don't see a reason why a user would want to change it. It should help performance on single-core machines as well, since the parallel work is I/O bound.

    This new non-blocking version using 4 threads is significantly faster. On the same machine where the previous version yielded a 16% IBD speedup to block 921129, this version is now 21% faster :rocket:. I'm curious to see benchmarks on other machines.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | `echo 42e68d48b4da838361b045a977242d3262f8b351 && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 17842.357 ± 131.143 | 17749.624 | 17935.089 | 1.00 |
    | `echo 25c45bb0d0bd6618ec9296a1a43605657124e5de && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 21537.077 ± 123.626 | 21449.660 | 21624.494 | 1.21 |
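
    The non-blocking consume path can be sketched roughly as follows. This is a minimal standalone model under assumed shape, not the PR's code; `Slot` and `process_one` are illustrative names. The key idea is that when the coin the main thread needs has not been fetched yet, it claims and processes other pending inputs itself rather than blocking, so validation never stalls behind a slow worker:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Slot {
        int value{0};
        std::atomic_flag ready{}; // set once the slot has been filled
    };

    int main()
    {
        constexpr int N{256};
        std::vector<Slot> slots(N);
        std::atomic<int> head{0};

        // Claim and process one pending input; returns false when none are left.
        auto process_one = [&]() -> bool {
            const int i{head.fetch_add(1, std::memory_order_relaxed)};
            if (i >= N) return false;
            slots[i].value = i + 1; // simulate a slow DB read
            slots[i].ready.test_and_set(std::memory_order_release);
            slots[i].ready.notify_one();
            return true;
        };

        std::jthread worker{[&] { while (process_one()) {} }};

        long sum{0};
        for (int i{0}; i < N; ++i) {
            // Instead of spinning on slot i, help fetch other inputs; only
            // wait once every input has been claimed by someone.
            while (!slots[i].ready.test(std::memory_order_acquire)) {
                if (!process_one()) {
                    slots[i].ready.wait(false, std::memory_order_acquire);
                    break;
                }
            }
            sum += slots[i].value;
        }
        std::cout << sum << "\n"; // 1 + 2 + ... + 256
        assert(sum == 32896);
        return 0;
    }
    ```

    The acquire load on `ready` pairs with the worker's release store, so the filled `value` is visible to the consumer without any lock.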
  304. andrewtoth force-pushed on Nov 24, 2025
  305. andrewtoth force-pushed on Nov 24, 2025
  306. DrahtBot added the label CI failed on Nov 24, 2025
  307. DrahtBot commented at 1:28 AM on November 24, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task fuzzer,address,undefined,integer, no depends: https://github.com/bitcoin/bitcoin/actions/runs/19620057148/job/56178655244</sub> <sub>LLM reason (✨ experimental): libFuzzer crash (deadly signal) from an assertion in DbCoinsView::GetCoin during fuzzing.</sub>


  308. andrewtoth force-pushed on Nov 24, 2025
  309. andrewtoth force-pushed on Nov 24, 2025
  310. andrewtoth force-pushed on Nov 24, 2025
  311. andrewtoth force-pushed on Nov 24, 2025
  312. andrewtoth force-pushed on Nov 24, 2025
  313. DrahtBot removed the label CI failed on Nov 25, 2025
  314. andrewtoth force-pushed on Nov 25, 2025
  315. andrewtoth force-pushed on Nov 25, 2025
  316. andrewtoth force-pushed on Nov 26, 2025
  317. andrewtoth force-pushed on Nov 27, 2025
  318. in src/coins.h:405 in 0f3778fbfb
     400 | @@ -401,6 +401,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     401 |       */
     402 |      bool HaveCoinInCache(const COutPoint &outpoint) const;
     403 |  
     404 | +    /**
     405 | +     * Retrieves the coin from the cache even if it is spent, without calling
    


    l0rinc commented at 12:21 PM on November 27, 2025:

    nit: other comments use the bare verb form

         * Retrieve the coin from the cache even if it is spent, without calling
    
  319. in src/coinsviewcacheasync.h:85 in 0f3778fbfb outdated
      80 | +     * Similar to CCoinsViewCache::GetCoin, but it does not mutate internally.
      81 | +     * Therefore safe to call from any thread once inside the barrier.
      82 | +     */
      83 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      84 | +    {
      85 | +        auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)};
    


    l0rinc commented at 12:28 PM on November 27, 2025:

    This highlights that our dbcache layering is a bit messy – having to static_cast the base view to CCoinsViewCache here is quite ugly. It would be good to clean this up in a follow-up.

  320. in src/test/fuzz/coinsviewcacheasync.cpp:164 in 0f3778fbfb outdated
     159 | +                assert(coin->nHeight == db_coin.nHeight);
     160 | +                assert(coin->out == db_coin.out);
     161 | +            }
     162 | +        }
     163 | +        assert(cache.GetCacheSize() == outpoints_in_cache.size());
     164 | +        fuzzed_data_provider.ConsumeBool() ? (void)cache.Flush() : cache.Reset();
    


    l0rinc commented at 1:05 PM on November 27, 2025:

    We’re moving towards Flush not returning a value – can we avoid adding new code that relies on its return? Not sure it’s possible before that other PR lands, but worth keeping in mind.


    andrewtoth commented at 4:53 PM on November 27, 2025:

    The (void) here does not rely on the return value. Or do you mean something else?


    l0rinc commented at 5:09 PM on November 27, 2025:

    we wouldn't need it if Flush were a void itself, like Reset, right?


    andrewtoth commented at 5:15 PM on November 27, 2025:

    Yes, but if the other PR is merged, then this will still compile with no errors or warnings right?


    l0rinc commented at 5:25 PM on November 27, 2025:

    you will have a conflict where you're making Flush abstract and they're making it void

  321. in src/bench/coinsviewcacheasync.cpp:21 in 0f3778fbfb
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +struct DelayedCoinsView : CCoinsView {
      19 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      20 | +    {
      21 | +        UninterruptibleSleep(DELAY);
    


    l0rinc commented at 1:06 PM on November 27, 2025:

    I wonder if we could change this benchmark to use an actual LevelDB-backed view instead of a synthetic delay – and maybe reuse the setup from #32554 for a more realistic measurement?

  322. in src/coinsviewcacheasync.h:208 in fc4346fe05 outdated
     203 | +    explicit CoinsViewCacheAsync(CCoinsViewCache& cache, const CCoinsView& db, bool deterministic = false) noexcept
     204 | +        : CCoinsViewCache{&cache, deterministic}, m_db{db}, m_barrier{WORKER_THREADS + 1}
     205 | +    {
     206 | +        for (uint32_t n{0}; n < WORKER_THREADS; ++n) {
     207 | +            m_worker_threads.emplace_back([this, n] {
     208 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
    


    l0rinc commented at 1:08 PM on November 27, 2025:

    nit: now that this is CoinsViewCacheAsync, maybe rename the thread prefix from "inputfetcher.%i" to something like "coinsviewcacheasync.%i" or similar, to reflect the current design.


    andrewtoth commented at 5:03 AM on November 30, 2025:

    Hmm the thread is just an input fetcher though. That's all it does. I like this name for it.

  323. in src/bench/coinsviewcacheasync.cpp:43 in 0f3778fbfb outdated
      38 | +        async_cache.StartFetching(block);
      39 | +        async_cache.Reset();
      40 | +    });
      41 | +}
      42 | +
      43 | +BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
    


    l0rinc commented at 1:15 PM on November 27, 2025:

    nit: the formatter should enforce a trailing newline at the end of C++ source files; could you add one here?

  324. in src/bench/coinsviewcacheasync.cpp:38 in 0f3778fbfb outdated
      33 | +    DelayedCoinsView db{};
      34 | +    CCoinsViewCache main_cache(&db);
      35 | +    CoinsViewCacheAsync async_cache{main_cache, db};
      36 | +
      37 | +    bench.run([&] {
      38 | +        async_cache.StartFetching(block);
    


    l0rinc commented at 1:16 PM on November 27, 2025:

    What if we added a second run inside the same benchmark that calls StartFetching again or a series of AccessCoin calls - to exercise the “everything is already in the cache” path (similar to a large -dbcache scenario)?


    andrewtoth commented at 4:54 PM on November 27, 2025:

    Ah, yes the Reset now just short circuits it. So this is not a very good benchmark anymore. Will fix to access all coins.

  325. in src/test/fuzz/coinsviewcacheasync.cpp:39 in 0f3778fbfb
      34 | +        cacheCoins.clear();
      35 | +        ReallocateCache();
      36 | +        cachedCoinsUsage = 0;
      37 | +    }
      38 | +
      39 | +    CoinsViewDb() : CCoinsViewCache(nullptr, /*deterministic=*/true) {}
    


    l0rinc commented at 1:18 PM on November 27, 2025:

    Same idea as in the bench: would it be feasible to fuzz this against an actual LevelDB-backed view instead of a synthetic cache, or is that too expensive / complicated for now?

  326. in src/test/fuzz/coinsviewcacheasync.cpp:67 in 0f3778fbfb outdated
      62 | +
      63 | +        std::map<const COutPoint, const Coin> db_map{};
      64 | +        std::map<const COutPoint, const Coin> cache_map{};
      65 | +        std::vector<COutPoint> input_outpoints{};
      66 | +
      67 | +        CCoinsViewCache main_cache(&*g_db, /*deterministic=*/true);
    


    l0rinc commented at 1:18 PM on November 27, 2025:

    Why do we need this cache to be constructed with deterministic=true here? Is the deterministic ordering actually required for the fuzz harness, or could we drop that?

  327. in src/coins.h:410 in 0f3778fbfb outdated
     405 | +     * Retrieves the coin from the cache even if it is spent, without calling
     406 | +     * the backing CCoinsView if no coin exists.
     407 | +     * Used in CoinsViewCacheAsync to make sure we do not add a coin from the backing
     408 | +     * view when it is spent in the cache but not yet flushed to the parent.
     409 | +     */
     410 | +    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept;
    


    l0rinc commented at 1:20 PM on November 27, 2025:

    As mentioned before, I’m not a fan of adding a separate “get-possibly-spent” helper – it feels like spentness is a separate concern and we’re encoding it into the accessor. I still think long term it’d be cleaner if GetCoin always returned the raw coin and call sites handled spentness explicitly, but I understand that’s probably a separate cleanup.

  328. in src/coinsviewcacheasync.h:47 in 0f3778fbfb outdated
      42 | +{
      43 | +private:
      44 | +    //! The latest input not yet being fetched. Workers atomically increment this when fetching.
      45 | +    mutable std::atomic_uint32_t m_input_head{0};
      46 | +    //! The latest input not yet accessed by a consumer. Only the main thread increments this.
      47 | +    mutable uint32_t m_input_tail{0};
    


    l0rinc commented at 1:21 PM on November 27, 2025:

    Hmm, can the main thread keep track of this locally instead of storing m_input_tail as a member? It looks like we’re effectively simulating a queue here (claiming work from the “head” and consuming from the “tail”). Could we use an actual std::deque or similar, or would that invalidate indices / references that the workers rely on?


    andrewtoth commented at 4:57 PM on November 27, 2025:

    Yes, this is a queue! An MPSC queue to be exact. I don't think it's possible to not store this as an instance member. I don't think we can use a std::deque though because we can't actually mutate the container. The benefit of a deque would be to actually push and pop, but here we set up the container before kicking off the worker threads.
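
    A minimal standalone model of that fixed-vector MPSC scheme (illustrative names, assuming the shape described above): workers claim indices by atomically incrementing a shared head, the single consumer advances its tail in order, and the vector itself is never resized, which is why no deque-style push/pop is needed:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Each element is set up before the workers start; afterwards the
    // container is never mutated, only the two indices move.
    struct Slot {
        int result{0};
        std::atomic_flag ready{}; // set by a worker once the slot is filled
    };

    int main()
    {
        constexpr int N{1000};
        std::vector<Slot> slots(N);
        std::atomic<int> head{0}; // next index not yet claimed by any worker

        auto worker = [&] {
            while (true) {
                const int i{head.fetch_add(1, std::memory_order_relaxed)};
                if (i >= N) return;
                slots[i].result = 2 * i; // simulate fetching a coin
                slots[i].ready.test_and_set(std::memory_order_release);
                slots[i].ready.notify_one();
            }
        };
        std::vector<std::jthread> workers;
        for (int t{0}; t < 4; ++t) workers.emplace_back(worker);

        // Single consumer: walk the slots in order; the acquire pairs with
        // the worker's release so each result is visible once ready is set.
        long sum{0};
        for (int tail{0}; tail < N; ++tail) {
            slots[tail].ready.wait(false, std::memory_order_acquire);
            sum += slots[tail].result;
        }
        std::cout << sum << "\n"; // 2 * (0 + 1 + ... + 999)
        assert(sum == 999000);
        return 0;
    }
    ```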

  329. in src/coinsviewcacheasync.h:70 in 0f3778fbfb outdated
      65 | +    mutable std::vector<InputToFetch> m_inputs{};
      66 | +
      67 | +    /**
      68 | +     * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
      69 | +     * are created and spent in the same block, since they will not be in the db or the cache.
      70 | +     * Using only the first 8 bytes is a performance improvement, versus storing the entire 32 bytes. In case of a
    


    l0rinc commented at 1:23 PM on November 27, 2025:

    Do we have tests that explicitly exercise the 8-byte prefix collision case (i.e. two txids in the same block sharing the same first 64 bits)? If we used 32-bit prefixes instead, collisions would be much more frequent, but the structure would be smaller; did we benchmark 32-bit vs 64-bit prefixes, or is 64-bit a conservative choice?


    andrewtoth commented at 4:59 PM on November 27, 2025:

    did we benchmark 32-bit vs 64-bit prefixes

    I did do microbenchmarks with 32-bit, and it was not more performant. Also, there is no uint256::GetUint32, so we would either just cast the 64 bits to 32 or have to write a new method on uint256. I don't think it's worth exploring this more.
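
    The same-block spend filter discussed here can be sketched like this (an illustrative standalone model, not the PR's code; `TxidBytes` and `Prefix64` are stand-ins): store only the first 8 bytes of each txid in a sorted vector and test membership with `std::binary_search`. A prefix collision only produces a false positive, which merely skips the prefetch for that input; the coin is then fetched by the normal synchronous path:

    ```cpp
    #include <algorithm>
    #include <array>
    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <vector>

    using TxidBytes = std::array<uint8_t, 32>; // stand-in for a real txid type

    // First 8 bytes of the txid, in native byte order; enough to filter with
    // a negligible collision rate within a single block.
    static uint64_t Prefix64(const TxidBytes& txid)
    {
        uint64_t x;
        std::memcpy(&x, txid.data(), sizeof(x));
        return x;
    }

    int main()
    {
        TxidBytes a{}; a[0] = 1; // txids of txs in the "block"
        TxidBytes b{}; b[0] = 2;
        TxidBytes c{}; c[0] = 3; // a txid not in the block

        std::vector<uint64_t> prefixes{Prefix64(a), Prefix64(b)};
        std::sort(prefixes.begin(), prefixes.end());

        auto in_block = [&](const TxidBytes& txid) {
            return std::binary_search(prefixes.begin(), prefixes.end(), Prefix64(txid));
        };

        std::cout << in_block(a) << in_block(b) << in_block(c) << "\n";
        assert(in_block(a) && in_block(b) && !in_block(c));
        return 0;
    }
    ```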

  330. in src/coinsviewcacheasync.h:122 in 0f3778fbfb
     117 | +    {
     118 | +        const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
     119 | +        if (inserted) {
     120 | +            for (auto i{m_input_tail}; i < m_inputs.size(); ++i) {
     121 | +                auto& input{m_inputs[i]};
     122 | +                if (input.outpoint != outpoint) continue;
    


    l0rinc commented at 1:28 PM on November 27, 2025:

    I find this iteration quite hard to follow. Could we extract the search into a helper and then act on the found index, something like:

        std::optional<uint32_t> FindInputIndex(const COutPoint& outpoint) const
        {
            for (size_t i{m_input_tail}; i < m_inputs.size(); ++i) {
                if (m_inputs[i].outpoint == outpoint) {
                    return i;
                }
            }
            return std::nullopt;
        }
    
    
        CCoinsMap::iterator FetchCoin(const COutPoint& outpoint) const override
        {
            const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
            if (!inserted) return ret;
    
            if (const auto idx_opt{FindInputIndex(outpoint)}) {
                auto& input{m_inputs[*idx_opt]};
    
                // Wait until this input is ready. Acquire matches the worker's release.
                while (!input.ready.test(std::memory_order_acquire)) {
                    // Try to process other inputs instead of just waiting
                    if (!ProcessInputInBackground()) {
                        // No more work; just wait on this one
                        input.ready.wait(/*old=*/false, std::memory_order_acquire);
                        break;
                    }
                }
    
                if (input.coin) [[likely]]
                    ret->second.coin = std::move(*input.coin);
                m_input_tail = *idx_opt + 1; // We will never need to scan earlier entries again
            }
    
            if (ret->second.coin.IsSpent()) [[unlikely]] {
                // We only get here for BIP30 checks, txid collisions, or missing/spent inputs.
                if (auto coin{FetchCoinFromParent(outpoint)}) {
                    ret->second.coin = std::move(*coin);
                } else {
                    cacheCoins.erase(ret);
                    return cacheCoins.end();
                }
            }
    
            cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
            return ret;
        }
    
  331. in src/coinsviewcacheasync.h:177 in 0f3778fbfb outdated
     172 | +            const auto& tx{block.vtx[i]};
     173 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     174 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     175 | +        }
     176 | +        if (m_inputs.size() == 0) return;
     177 | +        std::ranges::sort(m_txids);
    


    l0rinc commented at 2:46 PM on November 27, 2025:

    It might be worth mentioning (either in a comment here or in the PR description) that benchmarks indicated this sorted std::vector<uint64_t> + binary_search approach is significantly faster than a std::unordered_set<Txid> or std::set<Txid> for the expected hit/miss mix.

  332. in src/coinsviewcacheasync.h:212 in 0f3778fbfb outdated
     207 | +            m_worker_threads.emplace_back([this, n] {
     208 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
     209 | +                while (true) {
     210 | +                    m_barrier.arrive_and_wait();
     211 | +                    if (m_request_stop) [[unlikely]] return;
     212 | +                    while (ProcessInputInBackground()) {}
    


    l0rinc commented at 2:49 PM on November 27, 2025:

    nit: this is very compact, but maybe a bit hard to see, consider:

    for (;;) {
        if (!ProcessInputInBackground()) break;
    }
    

    andrewtoth commented at 5:05 AM on November 30, 2025:

    I prefer the current version.

  333. in src/validation.cpp:3100 in 0f3778fbfb outdated
    3096 | @@ -3095,6 +3097,7 @@ bool Chainstate::ConnectTip(
    3097 |              if (state.IsInvalid())
    3098 |                  InvalidBlockFound(pindexNew, state);
    3099 |              LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    3100 | +            view.Reset();
    


    l0rinc commented at 2:49 PM on November 27, 2025:

    Is Reset() strictly necessary here, given that a failed ConnectBlock will discard the ephemeral view anyway? Could any async state otherwise carry over to the next block?


    andrewtoth commented at 5:01 PM on November 27, 2025:

    We don't actually throw away the ephemeral view anymore. We just reset the state each time. This is a big performance improvement since we don't have to reallocate anything. If we didn't do this we would need an external thread pool as well, since the threads are owned by the CoinsViewCacheAsync. I think it's a fair tradeoff for one extra line here. Perhaps calling it ephemeral_view is a misnomer though. I'm not sure what a better name would be right now.


    l0rinc commented at 5:13 PM on November 27, 2025:

    We don't actually throw away the ephemeral view anymore

    Hah, I missed that in the latest push, thanks. 👍

    calling it ephemeral_view is a misnomer though

    Yeah :)


    andrewtoth commented at 5:04 AM on November 30, 2025:

    Renamed to m_connect_block_view.

  334. in src/coinsviewcacheasync.h:87 in 0f3778fbfb outdated
      82 | +     */
      83 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      84 | +    {
      85 | +        auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)};
      86 | +        if (!coin) coin = m_db.GetCoin(outpoint);
      87 | +        if (coin && !coin->IsSpent()) [[likely]] return coin;
    


    l0rinc commented at 2:51 PM on November 27, 2025:

    I don’t mind the [[likely]] hints (they can help document expectations), but I know others are opposed to using them in our codebase.


    andrewtoth commented at 5:02 PM on November 27, 2025:

    There are some already in the codebase. They were added in C++20, so why not use them? Maybe in future compiler versions the hints will become more useful for optimization. I only use them on paths that a valid block will always or never take (except for BIP30 checks, which we only do at the very beginning). An invalid block with valid proof of work is very rare, so we can optimize for the case where we don't get one; if we do get an invalid block, validation speed matters less.


    l0rinc commented at 5:23 PM on November 27, 2025:

    andrewtoth commented at 5:40 PM on November 27, 2025:

    The first link shows a benefit to using them, and that is now in our codebase. The second and third links reference this blog post which has crazy convoluted usages. I don't find that blog post convincing. The usages here are straightforward and in very hot paths. They are only used in paths where we know a valid block will always or never go into (aside from BIP30).

  335. in src/coinsviewcacheasync.h:176 in 0f3778fbfb
     171 | +        for (size_t i{1}; i < block.vtx.size(); ++i) {
     172 | +            const auto& tx{block.vtx[i]};
     173 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     174 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     175 | +        }
     176 | +        if (m_inputs.size() == 0) return;
    


    l0rinc commented at 2:54 PM on November 27, 2025:

    What does the if (m_inputs.size() == 0) guard against exactly?


    andrewtoth commented at 5:06 PM on November 27, 2025:

    I use this instead of an m_is_fetching or equivalent boolean state. We don't want to exit the barrier if we haven't entered it. We don't enter above if the block has fewer than 2 txs, and we don't enter if it is an invalid block whose 2 or more txs all have zero inputs. After we stop the threads and exit the barrier, we also clear the inputs so we don't exit again.

  336. l0rinc commented at 3:26 PM on November 27, 2025: contributor

    This new design achieves the best results I've seen so far across all platforms measured, excellent job!!

    It evolved from an InputFetcher helper (where the parallelism happened before ConnectBlock was called) to CoinsViewCacheAsync, a multithreaded CCoinsViewCache subclass whose worker threads run in parallel with ConnectBlock itself.

    Internal spends are still filtered using short txid prefixes stored in a sorted std::vector and checked via std::binary_search. In the rare case of a prefix collision, the async fetcher will simply skip that input, and it will be fetched later by the normal synchronous path.

    For now the async view uses a fixed worker thread count of 4. The workload is primarily I/O-bound on DB latency rather than CPU-bound, so 4 workers already hide most of the latency and it simplifies the implementation. If needed we can make this configurable or tie it to -par later.

    This way the I/O-bound work runs in parallel with the CPU-bound validation work, and the preliminary results are very encouraging: on a Raspberry Pi 5 the best -reindex-chainstate so far is about 7.3 hours with -dbcache=4500 and about 7.7 hours with the default 450 MB, roughly 36% and 46% faster than the current single-threaded baseline.

    The new implementation has been fuzzed for several days - it would be good to get some more eyes on it.

    <img width="1500" height="861" alt="Image" src="https://github.com/user-attachments/assets/c631ce84-9610-46ed-8cc2-8e171fd2d0f5" />

    (some of these measurements are suspiciously good, especially the M4 versions (maybe it switches from the energy efficient cores to the performance ones), I will try to replicate the results in followups)

    <details> <summary>Details</summary>

    > M4 @ dbcache=450:
    for commit in dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f; do
      git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
      rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
      time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 || exit 1
    done
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20994.00s user 5348.08s system 120% cpu 6:03:05.28 total
    
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     16748.42s user 1830.34s system 276% cpu 1:51:50.94 total
    
    
    > M4 @ dbcache=4500:
    d5ed4ba9d8 Merge bitcoin/bitcoin#33906: depends: Add patch for Windows11Style plugin
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -blocksonly    8895.93s user 751.51s system 133% cpu 2:00:08.82 total
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -blocksonly    10327.49s user 940.30s system 186% cpu 1:40:57.44 total
    
    > M4 @ dbcache=45000:
    for commit in dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab b1a791db1c75a47569b690baf7b074b78e08ca5a; do
      git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
      rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
      time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 || exit 1
    done
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   6338.71s user 332.32s system 131% cpu 1:24:22.56 total
    
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   7301.13s user 471.62s system 162% cpu 1:19:35.33 total
    
    
    > i9 @ dbcache=450:
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        20659.626 s               [User: 40733.602 s, System: 2871.904 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (abs ≡):        14729.736 s               [User: 39566.674 s, System: 2313.959 s]
    
    Relative speed comparison
            1.40          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > i9  @ dbcache=4500:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        16615.768 s               [User: 25458.915 s, System: 859.662 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        13689.366 s               [User: 26290.581 s, System: 991.037 s]
    
    Relative speed comparison
            1.21          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i9  @ dbcache=45000:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=45000; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 45000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        16118.775 s               [User: 23433.898 s, System: 725.843 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        14429.306 s               [User: 23850.818 s, System: 792.987 s]
    
    Relative speed comparison
            1.12          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i7 @ dbcache=450:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        42473.571 s               [User: 40584.287 s, System: 3012.074 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        34193.205 s               [User: 42326.030 s, System: 2778.267 s]
    
    Relative speed comparison
            1.24          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i7 @ dbcache=4500:
    COMMITS="d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac b1a791db1c75a47569b690baf7b074b78e08ca5a"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    d5ed4ba9d8 Merge bitcoin/bitcoin#33906: depends: Add patch for Windows11Style plugin
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        27190.152 s               [User: 33685.961 s, System: 1842.096 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        23873.513 s               [User: 27793.779 s, System: 1036.030 s]
    
    Relative speed comparison
            1.14          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    
    
    > i7 @ dbcache=45000:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; \
    STOP=921129; DBCACHE=45000; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 45000 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (mean ± σ):     22133.846 s ± 42.629 s    [User: 24498.825 s, System: 634.139 s]
      Range (min … max):   22103.703 s … 22163.990 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (mean ± σ):     20547.518 s ±  8.809 s    [User: 25074.730 s, System: 695.076 s]
      Range (min … max):   20541.289 s … 20553.747 s    2 runs
    
    Relative speed comparison
            1.08 ±  0.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > rpi5 @ dbcache=450:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-1 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (mean ± σ):     41084.236 s ± 701.352 s    [User: 68642.573 s, System: 7256.334 s]
      Range (min … max):   40588.305 s … 41580.166 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (mean ± σ):     28200.555 s ± 297.983 s    [User: 66305.959 s, System: 5678.536 s]
      Range (min … max):   27989.850 s … 28411.261 s    2 runs
    
    Relative speed comparison
            1.46 ±  0.03  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > rpi5 @ dbcache=4500:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | rpi5-16-1 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        35867.389 s               [User: 41695.508 s, System: 3281.868 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (abs ≡):        26440.662 s               [User: 43495.688 s, System: 3743.419 s]
    
    Relative speed comparison
            1.36          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    

    </details>

  337. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >20% faster IBD
    validation: fetch block inputs on parallel threads >40% faster IBD
    on Nov 27, 2025
  338. andrewtoth force-pushed on Nov 30, 2025
  339. andrewtoth force-pushed on Nov 30, 2025
  340. DrahtBot added the label CI failed on Nov 30, 2025
  341. DrahtBot commented at 4:01 AM on November 30, 2025: contributor

    🚧 At least one of the CI tasks failed. <sub>Task ASan + LSan + UBSan + integer: https://github.com/bitcoin/bitcoin/actions/runs/19793395783/job/56710184816</sub> <sub>LLM reason (✨ experimental): Compiler errors: thread-safety checks fail in coinsviewcacheasync.cpp (requires holding cs_main exclusively), causing build to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  342. andrewtoth force-pushed on Nov 30, 2025
  343. DrahtBot removed the label CI failed on Nov 30, 2025
  344. andrewtoth commented at 3:15 PM on November 30, 2025: contributor

    Thank you very much for your review and benchmarking @l0rinc! The speedup this offers is great. I have taken most of your suggestions.

  345. andrewtoth force-pushed on Nov 30, 2025
  346. andrewtoth force-pushed on Nov 30, 2025
  347. DrahtBot added the label CI failed on Nov 30, 2025
  348. DrahtBot commented at 4:16 PM on November 30, 2025: contributor

    🚧 At least one of the CI tasks failed. <sub>Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/19801430839/job/56729280722</sub> <sub>LLM reason (✨ experimental): Linker failure: undefined reference to util::TraceThread causing the bitcoinkernel.dll build to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  349. DrahtBot removed the label CI failed on Nov 30, 2025
  350. andrewtoth force-pushed on Nov 30, 2025
  351. in src/coinsviewcacheasync.h:234 in 69310ec003
     229 | +                }
     230 | +            });
     231 | +        }
     232 | +    }
     233 | +
     234 | +    ~CoinsViewCacheAsync()
    


    l0rinc commented at 6:00 PM on November 30, 2025:

    nit:

        ~CoinsViewCacheAsync() override
    
  352. in src/coinsviewcacheasync.h:227 in 69310ec003
     222 | +            m_worker_threads.emplace_back([this, n] {
     223 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
     224 | +                while (true) {
     225 | +                    m_barrier.arrive_and_wait();
     226 | +                    while (ProcessInputInBackground()) {}
     227 | +                    if (m_inputs.size() == 0) return;
    


    l0rinc commented at 6:01 PM on November 30, 2025:

    I understand that size() > 0 may be more descriptive than !empty(), but there's a dedicated method for this case (there are a few more cases):

                        if (m_inputs.empty()) return;
    
  353. in src/bench/coinsviewcacheasync.cpp:37 in 69310ec003
      32 | +            coins_tip.EmplaceCoinInternalDANGER(COutPoint{in.prevout}, std::move(coin));
      33 | +        }
      34 | +    }
      35 | +    chainstate.ForceFlushStateToDisk();
      36 | +    const auto& coins_db{WITH_LOCK(testing_setup->m_node.chainman->GetMutex(), return chainstate.CoinsDB();)};
      37 | +    CoinsViewCacheAsync async_cache{coins_tip, coins_db, /*deterministic=*/true};
    


    l0rinc commented at 6:08 PM on November 30, 2025:

    deterministic sounds like something we should do in a benchmark, but it just predefines the salts for testability: https://github.com/bitcoin/bitcoin/blob/9c24cda72edb2085edfa75296d6b42fab34433d9/src/util/hasher.cpp#L22-L25

    As far as I can tell we don't need it here, so we can likely remove that constructor arg completely:

        explicit CoinsViewCacheAsync(CCoinsViewCache& cache, const CCoinsView& db) noexcept
            : CCoinsViewCache{&cache}, m_db{db}, m_barrier{WORKER_THREADS + 1}
    
  354. in src/coinsviewcacheasync.h:121 in 69310ec003
     116 | +        input.ready.notify_one();
     117 | +        return true;
     118 | +    }
     119 | +
     120 | +    //! Get the index in m_inputs for the given outpoint. Advances m_input_tail if found.
     121 | +    std::optional<uint32_t> GetInputIndex(const COutPoint &outpoint) const noexcept
    


    l0rinc commented at 6:18 PM on November 30, 2025:

    Should we maybe document that this assumes that ConnectBlock will fetch the inputs in the same order?

    nit: could we use the current formatting for new code?

  355. in src/test/coinsviewcacheasync_tests.cpp:66 in 69310ec003
      61 | +        return block;
      62 | +    }
      63 | +
      64 | +public:
      65 | +    explicit CoinsViewCacheAsyncTest(const ChainType chainType = ChainType::MAIN,
      66 | +                              TestOpts opts = {})
    


    l0rinc commented at 6:19 PM on November 30, 2025:

    As https://corecheck.dev/bitcoin/bitcoin/pulls/31132 hints, this could also be passed as a const reference.

    nit: formatting


    l0rinc commented at 9:13 AM on December 1, 2025:

    formatting is still off though

  356. in src/coinsviewcacheasync.h:166 in 69310ec003
     161 | +        cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
     162 | +        return ret;
     163 | +    }
     164 | +
     165 | +    std::vector<std::thread> m_worker_threads{};
     166 | +    std::barrier<> m_barrier;
    


    l0rinc commented at 6:26 PM on November 30, 2025:

    we should likely init it here instead:

        std::barrier<> m_barrier{WORKER_THREADS + 1};
    
  357. in src/validation.h:489 in 69310ec003 outdated
     485 | @@ -485,6 +486,10 @@ class CoinsViews {
     486 |      //! can fit per the dbcache setting.
     487 |      std::unique_ptr<CCoinsViewCache> m_cacheview GUARDED_BY(cs_main);
     488 |  
     489 | +    //! Used as an empty view that is only passed into ConnectBlock to help speed up block validation,
    


    l0rinc commented at 6:32 PM on November 30, 2025:

    speed up block validation

    Should we mention any parallelism here?


    andrewtoth commented at 10:04 PM on November 30, 2025:

    That's more an implementation detail that can be found by reading the header file, no?

  358. in src/test/coinsviewcacheasync_tests.cpp:33 in 69310ec003
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
      31 | +    std::unique_ptr<CBlock> m_block{nullptr};
      32 | +
      33 | +    CBlock CreateBlock(int32_t num_txs)
    


    l0rinc commented at 6:33 PM on November 30, 2025:

    can we make this static or at least const?

  359. in src/coins.cpp:176 in 765b57d1b1 outdated
     172 | @@ -173,6 +173,12 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
     173 |      return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     174 |  }
     175 |  
     176 | +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept
    


    l0rinc commented at 6:40 PM on November 30, 2025:

    nit: the first commit is huge, maybe we could split out the coins.[cpp|h] changes to a commit before it. Not sure what else we could split out, though... Would it help if we split out the single-threaded internal spends case, so that they avoid the cache entirely? Wouldn't that already speed up IBD - in which case it's definitely a separate feature. I also don't mind if we do that in a separate PR to have some progress.


    andrewtoth commented at 10:04 PM on November 30, 2025:

    I would prefer to split out the implementation, the tests, the fuzz harness... But you prefer those all be in the same commit?


    l0rinc commented at 6:40 AM on December 1, 2025:

    I prefer simple but fully functioning chunks that converge towards a feature (as someone illustrated: "skateboard -> bicycle -> scooter -> motorcycle -> car" instead of "left wheels -> right wheels -> doors -> wipers -> radio antenna -> windows -> etc"). So if we can carve out chunks (such as the internal spend + main fallback, or a single-threaded fetcher first), we could guide the reviewer instead of having a big-bang change that's really hard to fully comprehend as such.


    andrewtoth commented at 5:49 PM on December 20, 2025:

    I've split the PR up into multiple commits that build on each other. Please let me know what you think.

  360. in src/coinsviewcacheasync.h:208 in 765b57d1b1 outdated
     203 | +        cacheCoins.clear();
     204 | +        cachedCoinsUsage = 0;
     205 | +        hashBlock = uint256::ZERO;
     206 | +    }
     207 | +
     208 | +    bool Flush() override
    


    l0rinc commented at 6:42 PM on November 30, 2025:

    we may want to explain in a comment why the parent CCoinsViewCache::Flush isn't called here (i.e. to stop propagation to disk)


    l0rinc commented at 8:28 PM on November 30, 2025:

    this whole function never seems to be called from unit tests


    andrewtoth commented at 10:00 PM on November 30, 2025:

    It gets called by functional tests though.


    andrewtoth commented at 10:05 PM on November 30, 2025:

    (i.e. to stop propagation to disk)

    The parent could be called, but this is faster since we skip calling ReallocateCache.


    l0rinc commented at 6:58 AM on December 1, 2025:

    wouldn't that percolate to the database layer?


    andrewtoth commented at 2:24 PM on December 1, 2025:

    Oh, we override so we make sure all threads are stopped before we do the batch write. This is in a comment two lines below.

  361. in src/coinsviewcacheasync.h:63 in 765b57d1b1 outdated
      58 | +        /**
      59 | +         * We only move when m_inputs reallocates during setup.
      60 | +         * We never move after work begins, so we don't have to copy other members.
      61 | +         */
      62 | +        InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
      63 | +        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    


    l0rinc commented at 6:47 PM on November 30, 2025:

    Sonar is complaining that we're violating the Rule of Five here, it's a bit verbose, but maybe we could:

            InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
            explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
            InputToFetch(const InputToFetch&) = delete;
            InputToFetch& operator=(const InputToFetch&) = delete;
            InputToFetch& operator=(InputToFetch&&) = delete;
    

    But it begs the question: why are we even moving these, and why is the "move" only copying partial state? Is it meant to avoid reallocations in StartFetching? But m_inputs is always empty at the start of StartFetching, and we could easily reserve the actual size to avoid moves as far as I understood, something like:

        //! Start fetching all block inputs in parallel.
        void StartFetching(const CBlock& block) noexcept
        {
        const size_t input_count{std::accumulate(block.vtx.begin() + 1, block.vtx.end(), size_t{0}, [](size_t s, const auto& t) { return s + t->vin.size(); })};
            m_inputs.reserve(input_count);
    
            // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
            for (const auto& tx : block.vtx | std::views::drop(1)) {
                for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
                m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
            }
    

    which would allow us to delete most of the other constructors instead:

        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
        InputToFetch(const InputToFetch&) = delete;
        InputToFetch& operator=(const InputToFetch&) = delete;
        InputToFetch(InputToFetch&&) = delete;
        InputToFetch& operator=(InputToFetch&&) = delete;
    

    andrewtoth commented at 10:08 PM on November 30, 2025:

    I'm not sure if we need all these deletes here? What are we gaining from this?

    We need to define the move constructor for it to compile. The compiler can't automatically move the atomic_flag. So, it doesn't really matter if it gets called or not, we still need to define it. And, since it only happens before we start doing work, we might as well keep it simple and not bother moving the other fields. I don't think we need to bother reserving since we keep the capacity over many blocks. The looping is just extra overhead, and it doesn't allow us to remove the move constructor.


    l0rinc commented at 7:00 AM on December 1, 2025:

    I disagree, I think the current move construction is incorrect and if I understand it correctly, we should reserve instead and delete (or ignore) the other constructors.


    l0rinc commented at 2:24 PM on December 1, 2025:

    Looks like this needs more work, it's not as easy as I thought - we could use a std::deque instead of a vector here:

    Subject: [PATCH] std::deque<InputToFetch> m_inputs
    ---
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    --- a/src/coinsviewcacheasync.h	(revision a3f56354d6e3f64eaca84a16e4951e6073090f60)
    +++ b/src/coinsviewcacheasync.h	(revision 462408b897197de3a7067dcbdee318ad9dc1e546)
    @@ -17,6 +17,7 @@
     #include <atomic>
     #include <barrier>
     #include <cstdint>
    +#include <deque>
     #include <optional>
     #include <ranges>
     #include <thread>
    @@ -55,14 +56,9 @@
             //! The coin that workers will fetch and main thread will insert into cache.
             std::optional<Coin> coin{std::nullopt};
     
    -        /**
    -         * We only move when m_inputs reallocates during setup.
    -         * We never move after work begins, so we don't have to copy other members.
    -         */
    -        InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
             explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
    -    mutable std::vector<InputToFetch> m_inputs{};
    +    mutable std::deque<InputToFetch> m_inputs{};
     
         /**
          * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
    

    But unfortunately the speed difference is measurable (5% slower):

    vector:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,644,638.02 |              608.04 |    0.1% |      1.09 | `CoinsViewCacheAsyncBenchmark`
    

    deque:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,732,962.13 |              577.05 |    0.1% |      1.10 | `CoinsViewCacheAsyncBenchmark`
    

    Maybe we could split out the atomic and have a vector + a deque, something like:

    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    --- a/src/coinsviewcacheasync.h	(revision 462408b897197de3a7067dcbdee318ad9dc1e546)
    +++ b/src/coinsviewcacheasync.h	(date 1764598896036)
    @@ -48,17 +48,17 @@
         mutable uint32_t m_input_tail{0};
     
         //! The inputs of the block which is being fetched.
    -    struct InputToFetch {
    -        //! Workers set this after setting the coin. The main thread tests this before reading the coin.
    -        std::atomic_flag ready{};
    +    struct InputData {
             //! The outpoint of the input to fetch;
             const COutPoint& outpoint;
             //! The coin that workers will fetch and main thread will insert into cache.
             std::optional<Coin> coin{std::nullopt};
     
    -        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    +        explicit InputData(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
    -    mutable std::deque<InputToFetch> m_inputs{};
    +    //! Workers set this after setting the coin. The main thread tests this before reading the coin.
    +    mutable std::deque<std::atomic_flag> m_ready_flags{};
    +    mutable std::vector<InputData> m_inputs{};
     
         /**
          * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
    @@ -97,19 +97,18 @@
             const auto i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
             if (i >= m_inputs.size()) [[unlikely]] return false;
     
    -        auto& input{m_inputs[i]};
             // Inputs spending a coin from a tx earlier in the block won't be in the cache or db
    -        if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    +        if (std::ranges::binary_search(m_txids, m_inputs[i].outpoint.hash.ToUint256().GetUint64(0))) {
                 // We can use relaxed ordering here since we don't write the coin.
    -            input.ready.test_and_set(std::memory_order_relaxed);
    -            input.ready.notify_one();
    +            m_ready_flags[i].test_and_set(std::memory_order_relaxed);
    +            m_ready_flags[i].notify_one();
                 return true;
             }
     
    -        if (auto coin{GetCoinWithoutMutating(input.outpoint)}) [[likely]] input.coin.emplace(std::move(*coin));
    +        if (auto coin{GetCoinWithoutMutating(m_inputs[i].outpoint)}) [[likely]] m_inputs[i].coin.emplace(std::move(*coin));
             // We need release here, so writing coin in the line above happens before the main thread acquires.
    -        input.ready.test_and_set(std::memory_order_release);
    -        input.ready.notify_one();
    +        m_ready_flags[i].test_and_set(std::memory_order_release);
    +        m_ready_flags[i].notify_one();
             return true;
         }
     
    @@ -135,17 +134,16 @@
             if (!inserted) return ret;
     
             if (const auto i{GetInputIndex(outpoint)}) [[likely]] {
    -            auto& input{m_inputs[*i]};
                 // Check if the coin is ready to be read. We need to acquire to match the worker thread's release.
    -            while (!input.ready.test(std::memory_order_acquire)) {
    +            while (!m_ready_flags[*i].test(std::memory_order_acquire)) {
                     // Work instead of waiting if the coin is not ready
                     if (!ProcessInputInBackground()) {
                         // No more work, just wait
    -                    input.ready.wait(/*old=*/false, std::memory_order_acquire);
    +                    m_ready_flags[*i].wait(/*old=*/false, std::memory_order_acquire);
                         break;
                     }
                 }
    -            if (input.coin) [[likely]] ret->second.coin = std::move(*input.coin);
    +            if (m_inputs[*i].coin) [[likely]] ret->second.coin = std::move(*m_inputs[*i].coin);
             }
     
             if (ret->second.coin.IsSpent()) [[unlikely]] {
    @@ -180,6 +178,10 @@
         //! Start fetching all block inputs in parallel.
         void StartFetching(const CBlock& block) noexcept
         {
    +        const size_t input_count{std::accumulate(block.vtx.begin() + 1, block.vtx.end(), size_t{0}, [](size_t s, const auto& t) { return s + t->vin.size(); })};
    +        m_ready_flags.resize(input_count);
    +        m_inputs.reserve(input_count);
    +
             // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
    

    (maybe deduplicating the index accesses speeds it back up, not sure) but this would be even slower:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,880,070.41 |              531.89 |    0.3% |      1.10 | `CoinsViewCacheAsyncBenchmark`


    andrewtoth commented at 2:37 PM on December 1, 2025:

    I disagree, I think the current move construction is incorrect and if I understand it correctly, we should reserve instead and delete (or ignore) the other constructors.

    I think you are disagreeing with my statement:

    since it only happens before we start doing work, might as well make it simple and not bother moving the other fields

    because the other things I said were facts. Do you think we should also construct new atomic_flags when doing the move construction? As I mentioned, I don't think it's worth it. We can see that the only time we will move is during a reallocation of the vector when capacity is reached. It doesn't matter whether we reserve or not; the compiler cannot deduce that, so it will still require the move constructor. A deque does not need to reallocate and move all its elements when one is added beyond capacity, which is why it is ok to use atomics there without a custom move constructor. But it is obviously much slower than a vector.

  362. in src/test/coinsviewcacheasync_tests.cpp:85 in 765b57d1b1 outdated
      80 | +    for (const auto& tx : block.vtx) {
      81 | +        for (const auto& in : tx->vin) {
      82 | +            auto outpoint{in.prevout};
      83 | +            Coin coin{};
      84 | +            if (!spent) coin.out.nValue = 1;
      85 | +            BOOST_CHECK(spent ? coin.IsSpent() : !coin.IsSpent());
    


    l0rinc commented at 6:53 PM on November 30, 2025:
                BOOST_CHECK_EQUAL(coin.IsSpent(), spent);
    
  363. in src/test/coinsviewcacheasync_tests.cpp:50 in 69310ec003 outdated
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
    


    l0rinc commented at 7:54 PM on November 30, 2025:

    I'm not sure I understand how we're actually testing this. NoAccessCoinsView is designed to abort on access, so in the shortid collision scenario how does it simulate going to disk? Shouldn't we assert here that the number of collisions coincides with the number of simulated disk reads?


    andrewtoth commented at 10:11 PM on November 30, 2025:

    It doesn't simulate going to disk. It simulates not setting the coin in ProcessInputInBackground even though the base has it. Then the if (ret->second.coin.IsSpent()) [[unlikely]] { branch is executed in FetchCoin and the coin is fetched from base via GetCoinWithoutMutating.


    l0rinc commented at 9:02 AM on December 1, 2025:

    isn't GetCoinWithoutMutating meant to simulate going one layer deeper in the cache - which is basically going to disk on the main thread, right?


    andrewtoth commented at 2:48 PM on December 1, 2025:

    NoAccessCoinsView is designed to abort on access

    It is designed to abort on accessing the db via the main cache. We want to access the db only via our m_db ref and not go through the main cache's base pointer. This is unrelated to the test of short txid collisions. For those, we want to successfully go to disk on the main thread, while getting a nullopt from our m_input coin.

  364. in src/test/coinsviewcacheasync_tests.cpp:56 in 69310ec003
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
      53 | +                WriteLE64(u.data(), shorttxid);
      54 | +                txid = Txid::FromUint256(u);
      55 | +            }
      56 | +            tx.vin.emplace_back(COutPoint(txid, 0));
    


    l0rinc commented at 7:59 PM on November 30, 2025:

    nit: emplace doesn't necessarily need the class name

                tx.vin.emplace_back(txid, 0);
    
  365. in src/test/coinsviewcacheasync_tests.cpp:166 in 69310ec003
     161 | +        view.StartFetching(block);
     162 | +        for (const auto& tx : block.vtx) {
     163 | +            for (const auto& in : tx->vin) view.AccessCoin(in.prevout);
     164 | +        }
     165 | +        // Coins are not added to the view, even though they exist unspent in the parent db
     166 | +        BOOST_CHECK(view.GetCacheSize() == 0);
    


    l0rinc commented at 8:07 PM on November 30, 2025:

    any reason not to use:

            BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    

    ? Or just a copy-paste convenience from fuzz :)?


    l0rinc commented at 9:11 AM on December 1, 2025:

    There's one remaining that wasn't migrated:

    BOOST_CHECK(cache.GetCacheSize() == counter);
    
  366. in src/test/coinsviewcacheasync_tests.cpp:52 in 69310ec003
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
    


    l0rinc commented at 8:08 PM on November 30, 2025:

    aren't we already overwriting the first 64 bits in the next step anyway?

  367. in src/test/coinsviewcacheasync_tests.cpp:46 in 69310ec003 outdated
      41 | +
      42 | +        for (const auto i : std::views::iota(1, num_txs)) {
      43 | +            CMutableTransaction tx;
      44 | +            Txid txid;
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
    


    l0rinc commented at 8:11 PM on November 30, 2025:

    isn't this too small for our purposes? Or since there's no randomness involved, we just store at most 100 values on 256 bits? That might work I guess, I would have just gone with Txid::FromUint256(m_rng.rand256());


    andrewtoth commented at 9:53 PM on November 30, 2025:

    isn't this too small for our purposes?

    I don't see why we need any randomness or larger values for these?


    l0rinc commented at 9:10 AM on December 1, 2025:

    some values are tiny now while internal spends (real or fake) have full hashes. It might be easier to work with small values, but internal spends will still result in big ugly full hashes anyway. I'm fine with it as it is.

  368. in src/test/coinsviewcacheasync_tests.cpp:53 in 69310ec003
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
      53 | +                WriteLE64(u.data(), shorttxid);
    


    l0rinc commented at 8:13 PM on November 30, 2025:

    we're not switching platforms during testing, not sure we need LE/BE conversions here:

    uint256 u{m_rng.rand256()};
    std::memcpy(u.begin(), prevhash.ToUint256().begin(), 8);
    txid = Txid::FromUint256(u);
    
  369. in src/test/coinsviewcacheasync_tests.cpp:122 in 69310ec003
     117 | +
     118 | +BOOST_FIXTURE_TEST_CASE(fetch_inputs_from_db, CoinsViewCacheAsyncTest)
     119 | +{
     120 | +    const auto& block{getBlock()};
     121 | +    NoAccessCoinsView dummy;
     122 | +    CCoinsViewCache db(&dummy);
    


    l0rinc commented at 8:16 PM on November 30, 2025:

    nit: brace init may be slightly better here to differentiate it from function calls (many such cases):

        const CCoinsViewCache db{&dummy};
    
  370. in src/test/fuzz/coinsviewcacheasync.cpp:137 in 69310ec003 outdated
     132 | +            } else {
     133 | +                const auto txid{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
     134 | +                const auto index{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
     135 | +                outpoint = COutPoint(txid, index);
     136 | +            }
     137 | +            cache.AccessCoin(outpoint);
    


    l0rinc commented at 8:17 PM on November 30, 2025:

    we could validate the result values of AccessCoin throughout the tests

  371. in src/coinsviewcacheasync.h:183 in 69310ec003 outdated
     178 | +
     179 | +public:
     180 | +    //! Start fetching all block inputs in parallel.
     181 | +    void StartFetching(const CBlock& block) noexcept
     182 | +    {
     183 | +        // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
    


    l0rinc commented at 8:22 PM on November 30, 2025:

    should we assume that the cache is empty here - or can you imagine a scenario where we wouldn't want that? Otherwise tests like:

    for (auto i{0}; i < 3; ++i) {
        view.StartFetching(block);
        CheckCache(block, view);
        view.Reset();
    }
    

    would basically pass (but hang) if we forget to Reset


    andrewtoth commented at 10:21 PM on November 30, 2025:

    would basically pass (but hang) if we forget to Reset

    A hanging test is treated as failure in the CI. I don't think it's necessary to do anything else here.

  372. in src/coinsviewcacheasync.h:135 in 69310ec003 outdated
     130 | +    }
     131 | +
     132 | +    CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const override
     133 | +    {
     134 | +        const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
     135 | +        if (!inserted) return ret;
    


    l0rinc commented at 8:26 PM on November 30, 2025:

    this early exit doesn't seem to be covered by unit tests in coinsviewcacheasync_tests (same for StopFetching)

  373. in src/coinsviewcacheasync.h:142 in 69310ec003 outdated
     137 | +        if (const auto i{GetInputIndex(outpoint)}) [[likely]] {
     138 | +            auto& input{m_inputs[*i]};
     139 | +            // Check if the coin is ready to be read. We need to acquire to match the worker thread's release.
     140 | +            while (!input.ready.test(std::memory_order_acquire)) {
     141 | +                // Work instead of waiting if the coin is not ready
     142 | +                if (!ProcessInputInBackground()) {
    


    l0rinc commented at 8:27 PM on November 30, 2025:

    this also never seems to be triggered by unit tests in coinsviewcacheasync_tests - could we selectively block other threads to make sure we get here?


    andrewtoth commented at 9:59 PM on November 30, 2025:

    Hmm not sure how we'd test this with a unit test. It surely gets exercised by fuzzing though. Also the function is called by other threads.


    l0rinc commented at 9:08 AM on December 1, 2025:

    Hmm not sure how we'd test this with a unit test

    We could block the thread pool by adding a dummy underlying cache which blocks for the gets and make sure the main thread can still do the fetch on a single thread, when you can unblock the cache and check that we still managed to make progress on a single thread.


    andrewtoth commented at 2:23 PM on December 1, 2025:

    That would be racy. We could have the worker thread count be a variable instead of hard coded, then for a test we could make it zero.


    l0rinc commented at 2:30 PM on December 1, 2025:

    you sure it would be racy?


    andrewtoth commented at 2:42 PM on December 1, 2025:

    If we have a backing cache that blocks, how can we know if it's the main thread or worker threads that need to be blocked? And if we block the main thread by mistake, it will make no progress even though the worker thread can fetch all inputs


    andrewtoth commented at 12:26 AM on December 7, 2025:

    This is tested now by having zero worker threads.
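
    For reference, the claim/ready handoff under discussion can be sketched in isolation: threads claim inputs through a shared atomic counter, publish results with a release store on a per-slot flag, and the main thread steals work while waiting. With zero worker threads, the main thread alone still drains the queue. All names here (`Slot`, `process_one`, `run_demo`) are hypothetical stand-ins, not the PR's code:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstddef>
    #include <optional>
    #include <thread>
    #include <vector>

    // Illustrative per-input state; the real code holds a COutPoint and a Coin.
    struct Slot {
        int prevout{0};            // the input to fetch
        std::optional<int> coin{}; // the fetched result
        std::atomic_flag ready{};  // set with release once coin is written
    };

    bool run_demo()
    {
        std::vector<Slot> inputs(1000);
        for (size_t i{0}; i < inputs.size(); ++i) inputs[i].prevout = static_cast<int>(i);

        // Work claiming: workers and the main thread bump this counter to
        // pick the next unclaimed input.
        std::atomic<size_t> next{0};

        auto process_one = [&]() -> bool {
            const size_t i{next.fetch_add(1, std::memory_order_relaxed)};
            if (i >= inputs.size()) return false;
            inputs[i].coin = inputs[i].prevout * 2; // simulate a db lookup
            inputs[i].ready.test_and_set(std::memory_order_release);
            return true;
        };

        std::thread worker{[&] { while (process_one()) {} }};

        // The main thread consumes in block order; if a slot is not ready it
        // steals work, and once everything is claimed it spins briefly.
        bool ok{true};
        for (auto& in : inputs) {
            while (!in.ready.test(std::memory_order_acquire)) {
                if (!process_one()) std::this_thread::yield();
            }
            ok &= (*in.coin == in.prevout * 2);
        }
        worker.join();
        return ok;
    }

    int main()
    {
        assert(run_demo());
        return 0;
    }
    ```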

  374. in src/test/coinsviewcacheasync_tests.cpp:109 in 69310ec003
     104 | +                if (should_have) {
     105 | +                    cache.AccessCoin(outpoint);
     106 | +                    ++counter;
     107 | +                }
     108 | +                const auto have{cache.GetPossiblySpentCoinFromCache(outpoint)};
     109 | +                BOOST_CHECK(should_have ? !!have : !have);
    


    l0rinc commented at 8:29 PM on November 30, 2025:

    We could also use a less brancy equality check here

                    BOOST_CHECK_EQUAL(should_have, !!have);
    

    or

                    BOOST_CHECK_NE(should_have, !have);
    

    (but the latter might be too much for some :p)

  375. in src/bench/coinsviewcacheasync.cpp:51 in 69310ec003 outdated
      46 | +        }
      47 | +        async_cache.Reset();
      48 | +    });
      49 | +}
      50 | +
      51 | +BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
    


    l0rinc commented at 8:35 PM on November 30, 2025:

    could you please add new lines to the end of the files? I also don't like them, but it seems to be necessary in certain cases...


    andrewtoth commented at 9:45 PM on November 30, 2025:

    If there is a missing new line github will show a red arrow at the end of the file.


    l0rinc commented at 8:53 AM on December 1, 2025:

    You're right, my mistake

  376. in src/bench/coinsviewcacheasync.cpp:41 in 69310ec003 outdated
      36 | +    const auto& coins_db{WITH_LOCK(testing_setup->m_node.chainman->GetMutex(), return chainstate.CoinsDB();)};
      37 | +    CoinsViewCacheAsync async_cache{coins_tip, coins_db, /*deterministic=*/true};
      38 | +
      39 | +    bench.run([&] {
      40 | +        async_cache.StartFetching(block);
      41 | +        for (const auto& tx : block.vtx | std::views::drop(1)) {
    


    l0rinc commented at 8:37 PM on November 30, 2025:

    I like these functional constructs, they may take some getting used to, but they have a lot fewer moving parts which help with separating iteration (= glue code) from important parts!


    Note: could you publish the benchmark results in the commit messages before/after and can you reproduce 4 threads saturating the parallelism factor with it? (I don't have any available benchmarking servers at the moment...)


    andrewtoth commented at 10:01 PM on November 30, 2025:

    before/after

    Not sure there are any before results we can use for this.

    can you reproduce 4 threads saturating the parallelism factor with it?

    How would I know if I do that?


    l0rinc commented at 6:44 AM on December 1, 2025:

    any before results

    I just noticed the benchmark only tests the new state - what if the benchmark originally measured the current behavior, and in the commit that switches to multithreaded connection we update the benchmark to reflect that (if needed; maybe the same benchmark can even switch automatically if the original CoinsViewCacheAsync implementation reimplements everything on a single thread at first).

    In that case we could have something like (fake numbers):

    bench: add CoinsViewCacheAsync benchmark

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,764,037.75 | 566.88 | 0.2% | 11.02 | CoinsViewCacheAsyncBenchmark

    validation: fetch block inputs via CCoinsViewCacheAsync during connection

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,304,607.17 | 766.51 | 0.7% | 10.60 | CoinsViewCacheAsyncBenchmark


    can you reproduce 4 threads saturating the parallelism factor with it?

    Changing:

    static constexpr uint32_t WORKER_THREADS{1};
    

    which gives me

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,294,995.86 | 772.20 | 1.4% | 10.61 | CoinsViewCacheAsyncBenchmark

    and 2:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,237,416.55 | 808.14 | 1.9% | 10.83 | CoinsViewCacheAsyncBenchmark

    and 3:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,321,210.42 | 756.88 | 1.3% | 10.84 | CoinsViewCacheAsyncBenchmark

    and for 4:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,788,112.75 | 559.25 | 0.3% | 10.91 | CoinsViewCacheAsyncBenchmark

    and 5:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 2,126,524.56 | 470.25 | 2.0% | 9.88 | CoinsViewCacheAsyncBenchmark

    and 6:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 3,053,001.11 | 327.55 | 3.3% | 10.57 | CoinsViewCacheAsyncBenchmark

    So it kinda' reproduces that it doesn't make sense to do more than 4


    andrewtoth commented at 2:41 PM on December 1, 2025:

    There is a ConnectBlock benchmark already. But, it only does internal spends so it won't be that great for this. I can maybe extend it to also work on a block with inputs from leveldb. WDYT?


    andrewtoth commented at 4:12 PM on December 1, 2025:

    So it kinda' reproduces that it doesn't make sense to do more than 4

    Looks like it doesn't make sense to do more than 2?


    l0rinc commented at 4:24 PM on December 1, 2025:

    maybe, but leveldb is basically empty, we shouldn't take it too seriously


    andrewtoth commented at 4:42 PM on December 28, 2025:

    Added some benchmark results in the commits where it makes sense.

  377. in src/validation.h:490 in 69310ec003 outdated
     485 | @@ -485,6 +486,10 @@ class CoinsViews {
     486 |      //! can fit per the dbcache setting.
     487 |      std::unique_ptr<CCoinsViewCache> m_cacheview GUARDED_BY(cs_main);
     488 |  
     489 | +    //! Used as an empty view that is only passed into ConnectBlock to help speed up block validation,
     490 | +    //! as well as not pollute the underlying cache with newly created coins in case the block is invalid.
    


    l0rinc commented at 8:48 PM on November 30, 2025:

    Do we have a specific test case that verifies cache isolation in a failure scenario?


    andrewtoth commented at 9:44 PM on November 30, 2025:

    I think some invalid block tests exercise this. If a block is invalid then the outputs of any txs are not added to the utxo set.


    l0rinc commented at 9:04 AM on December 1, 2025:

    I think some invalid block tests exercise this. If a block is invalid then the outputs of any txs are not added to the utxo set.

    Are those tests realistic and using the new cache or simulating it somehow? My question is: is the new functionality covered for this case, when a block has e.g. a double-spend to make sure it's not propagated to the main cache?


    andrewtoth commented at 2:57 PM on December 1, 2025:

    when a block has e.g. a double-spend to make sure it's not propagated to the main cache

    That would be a necessary condition of a double-spending block test. If the double spend is propagated to the main cache it is part of the utxo set and the test is a failure. The main cache is treated as the source of truth.
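
    The isolation property being described, that overlay writes stay invisible to the main cache unless explicitly flushed, can be illustrated with a generic overlay map (this is not the CCoinsViewCache API; `Overlay` and `run_demo` are made up for the sketch):

    ```cpp
    #include <cassert>
    #include <map>
    #include <optional>

    // Sketch: entries added to the overlay only reach the base on an explicit
    // Flush(). Dropping the overlay, as validation does for an invalid block,
    // leaves the base (the main cache) untouched.
    struct Overlay {
        std::map<int, int>& base;
        std::map<int, int> local;
        void Add(int k, int v) { local[k] = v; }
        std::optional<int> Get(int k) const {
            if (auto it = local.find(k); it != local.end()) return it->second;
            if (auto it = base.find(k); it != base.end()) return it->second;
            return std::nullopt;
        }
        void Flush() { for (auto& [k, v] : local) base[k] = v; local.clear(); }
    };

    bool run_demo()
    {
        std::map<int, int> base{{1, 10}};
        {
            Overlay bad{base};
            bad.Add(2, 20); // outputs of an invalid block
            // overlay discarded without Flush()
        }
        if (base.contains(2)) return false; // never reached the main cache

        Overlay good{base};
        good.Add(3, 30);
        good.Flush(); // valid block: writes propagate
        return base.contains(3);
    }

    int main()
    {
        assert(run_demo());
        return 0;
    }
    ```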

  378. l0rinc approved
  379. l0rinc commented at 9:15 PM on November 30, 2025: contributor

    The new version cleverly uses m_inputs being empty as the shared shutdown signal (handling both 'no work' and 'shutdown' cases). This finally allowed us to eliminate the m_request_stop flag I disliked :).

    It also benchmarks real-world I/O latency via a real LevelDB access, while the fuzz tests use in-memory LevelDB now - sweet! The new design basically falls back to synchronous fetching gracefully in cases of collisions or delays (we may want to test that specifically).

    Regarding 'faster IBD performance', I think it would be more accurate to call it 'validation performance', as this benefits block connection generally, not just during initial download (which we can't reliably measure anyway).

    Should we also add some log message announcing that input fetching is running on parallel threads? We should definitely add release notes for this feature once it stabilizes - and I like this latest push a lot!

  380. andrewtoth force-pushed on Nov 30, 2025
  381. in src/test/coinsviewcacheasync_tests.cpp:106 in 236e7a4374 outdated
     101 | +                const auto& outpoint{in.prevout};
     102 | +                const auto should_have{!txids.contains(outpoint.hash)};
     103 | +                if (should_have) {
     104 | +                    cache.AccessCoin(outpoint);
     105 | +                    ++counter;
     106 | +                }
    


    l0rinc commented at 7:29 AM on December 1, 2025:

    ConnectBlock calls AccessCoin for every input, shouldn't the test do the same? Aren't we cheating by only calling it when we already know it's not an internal spend?

        for (const auto& tx : block.vtx) {
            if (tx->IsCoinBase()) {
                BOOST_CHECK(!cache.GetPossiblySpentCoinFromCache(tx->vin[0].prevout));
            } else {
                for (const auto& outpoint : tx->vin | std::views::transform(&CTxIn::prevout)) {
                    const auto external{!txids.contains(outpoint.hash)};
                    const auto& c{cache.AccessCoin(outpoint)};
                    BOOST_CHECK_EQUAL(c.IsSpent(), !external);
    
                    counter += external;
                    const bool in_cache{!!cache.GetPossiblySpentCoinFromCache(outpoint)};
                    BOOST_CHECK_EQUAL(external, in_cache);
                }
                txids.emplace(tx->GetHash());
            }
        }
        BOOST_CHECK_EQUAL(cache.GetCacheSize(), counter);
    

    andrewtoth commented at 3:29 PM on December 1, 2025:

    But ConnectBlock also inserts the newly created utxos into the cache, so that the next call to AccessCoin will just get it from the cache's cacheCoins map.

  382. in src/test/coinsviewcacheasync_tests.cpp:88 in 236e7a4374 outdated
      83 | +            if (!spent) coin.out.nValue = 1;
      84 | +            BOOST_CHECK_EQUAL(coin.IsSpent(), spent);
      85 | +            cache.EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
      86 | +        }
      87 | +    }
      88 | +}
    


    l0rinc commented at 7:50 AM on December 1, 2025:

    why are coinbases and internal spends added to the cache? That's not what happens in reality, right? It should represent the state before the block is connected, and it should populate the backing and db caches as well, so maybe something like:

    void PopulateCache(const CBlock& block, CCoinsView& view, bool spent = false)
    {
        CCoinsViewCache cache{&view};
        cache.SetBestBlock(uint256::ONE);
    
        std::unordered_set<Txid, SaltedTxidHasher> txids{};
        txids.reserve(block.vtx.size() - 1);
        for (const auto& tx : block.vtx | std::views::drop(1)) {
            for (const auto& in : tx->vin) {
                if (!txids.contains(in.prevout.hash)) {
                    Coin coin{};
                    if (!spent) coin.out.nValue = 1;
                    cache.EmplaceCoinInternalDANGER(COutPoint{in.prevout}, std::move(coin));
                }
            }
            txids.emplace(tx->GetHash());
        }
    
        cache.Flush();
    }
    
  383. in src/test/coinsviewcacheasync_tests.cpp:25 in 236e7a4374 outdated
      20 | +#include <unordered_set>
      21 | +
      22 | +BOOST_AUTO_TEST_SUITE(coinsviewcacheasync_tests)
      23 | +
      24 | +struct NoAccessCoinsView : CCoinsView {
      25 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
    


    l0rinc commented at 8:15 AM on December 1, 2025:

    instead of std::abort we should just return an std::nullopt, it's closer to what the database would do - especially since GetPossiblySpentCoinFromCache is noexcept. Or even better, what if we also used an in-memory leveldb here instead and populated a CCoinsView& view in PopulateCache instead (see below)?


    andrewtoth commented at 2:44 PM on December 1, 2025:

    We want to specifically test that we do not access the main cache's backing view, by e.g. calling GetCoin on it. If we return a nullopt here then it would correctly go to the db layer (while then mutating base non-atomically and causing UB), while we want to make sure we blow up here because it is a failed test.

  384. in src/test/coinsviewcacheasync_tests.cpp:33 in 236e7a4374 outdated
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
      31 | +    std::unique_ptr<CBlock> m_block{nullptr};
      32 | +
      33 | +    CBlock CreateBlock(int32_t num_txs) const noexcept
    


    l0rinc commented at 8:41 AM on December 1, 2025:

    we're only using 100 here, we might as well:

        static constexpr auto num_txs{100};
        CBlock CreateBlock() const noexcept
    
  385. in src/test/coinsviewcacheasync_tests.cpp:30 in 236e7a4374 outdated
      25 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
      26 | +};
      27 | +
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
    


    l0rinc commented at 8:42 AM on December 1, 2025:

    m_async_cache seems unused

  386. in src/test/coinsviewcacheasync_tests.cpp:48 in 236e7a4374 outdated
      43 | +            CMutableTransaction tx;
      44 | +            Txid txid;
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
    


    l0rinc commented at 8:42 AM on December 1, 2025:

    we could add comments for these cases as well:

                if (i % 3 == 0) {
                    // External input
                    txid = Txid::FromUint256(uint256(i));
                } else if (i % 3 == 1) {
                    // Internal spend (prev tx)
                    txid = prevhash;
                } else {
                    // Test shortid collisions (looks internal, but is external)
    
  387. in src/test/coinsviewcacheasync_tests.cpp:175 in 236e7a4374 outdated
     170 | +BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, CoinsViewCacheAsyncTest)
     171 | +{
     172 | +    const auto& block{getBlock()};
     173 | +    CCoinsView db;
     174 | +    CCoinsViewCache main_cache(&db);
     175 | +    CoinsViewCacheAsync view{main_cache, db};
    


    l0rinc commented at 8:47 AM on December 1, 2025:

    we could also add an in-memory leveldb here to simplify the code and make it more realistic:

        CCoinsViewDB db{{.path = "", .cache_bytes = 1_MiB, .memory_only = true}, {}};
        CCoinsViewCache main_cache{&db};
        CoinsViewCacheAsync view{main_cache, db};
    
  388. in src/test/coinsviewcacheasync_tests.cpp:179 in 236e7a4374 outdated
     174 | +    CCoinsViewCache main_cache(&db);
     175 | +    CoinsViewCacheAsync view{main_cache, db};
     176 | +    for (auto i{0}; i < 3; ++i) {
     177 | +        view.StartFetching(block);
     178 | +        for (const auto& tx : block.vtx) {
     179 | +            for (const auto& in : tx->vin) view.AccessCoin(in.prevout);
    


    l0rinc commented at 8:48 AM on December 1, 2025:

    we should do something with the result:

                for (const auto& in : tx->vin) {
                    const auto& c{view.AccessCoin(in.prevout)};
                    BOOST_CHECK(c.IsSpent());
                }
    
  389. l0rinc commented at 9:43 AM on December 1, 2025: contributor

    Redid the measurements on a Mac with AppleClang for different sizes to check why there's such a massive speedup for low memory:

    for DBCACHE in 450 4500 45000; do \
        COMMITS="d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac b1a791db1c75a47569b690baf7b074b78e08ca5a"; \
        STOP=921129; \
        DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
        mkdir -p "$LOG_DIR"; \
        COMMA_COMMITS=${COMMITS// /,}; \
        (echo ""; echo "$COMMITS" | tr ' ' '\n' | while read -r c; do git log -1 --pretty='%h %s' -- "$c" || exit 1; done;) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$LOG_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-appleclang.json" \
          --parameter-list COMMIT "$COMMA_COMMITS" \
          --prepare "killall bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'height=0' \"$DATA_DIR\"/debug.log && grep -q 'Disabling script verification at block #1' \"$DATA_DIR\"/debug.log && grep -q \"height=$STOP\" \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
                      cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/debug-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
          "./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    

    dbcache 450

    reindex-chainstate | 921129 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        26759.295 s               [User: 29786.899 s, System: 7379.722 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        8826.595 s               [User: 23102.926 s, System: 2391.832 s]
    
    Relative speed comparison
            3.03          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    dbcache 4500

    reindex-chainstate | 921129 blocks | dbcache 4500 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        12563.690 s               [User: 15217.346 s, System: 1087.166 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        7786.335 s               [User: 14306.318 s, System: 1220.685 s]
    
    Relative speed comparison
            1.61          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    dbcache 45000

    reindex-chainstate | 921129 blocks | dbcache 45000 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        5256.592 s               [User: 6551.334 s, System: 337.214 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        4727.896 s               [User: 7191.973 s, System: 467.989 s]
    
    Relative speed comparison
            1.11          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    The huge difference may come from the multithreaded solution using the performance cores more heavily:

    Master: <img width="869" height="429" alt="Image" src="https://github.com/user-attachments/assets/caa6ddaf-51c8-45cd-835a-232d2dc943af" />

    PR: <img width="868" height="426" alt="Image" src="https://github.com/user-attachments/assets/5a410b64-b460-423e-b7f1-133bd266d02f" />

    @andrewtoth can you reproduce the low-memory speedup results?

  390. andrewtoth force-pushed on Dec 1, 2025
  391. DrahtBot added the label CI failed on Dec 1, 2025
  392. DrahtBot commented at 7:26 PM on December 1, 2025: contributor


    🚧 At least one of the CI tasks failed.

    Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/19833312863/job/56824503489

    LLM reason (✨ experimental): Narrowing conversion in CoinsViewCacheAsync constructor triggers -Werror, causing the build to fail.

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  393. andrewtoth force-pushed on Dec 1, 2025
  394. DrahtBot removed the label CI failed on Dec 1, 2025
  395. andrewtoth commented at 3:12 PM on December 5, 2025: contributor

    I've generated some flamegraphs from perf data recorded during IBD for ~10k blocks between 850-900k for both master and this branch.

    The 4 worker threads can be seen clearly on the left of the graph. The binary search through short txids is barely noticeable, which confirms our approach to not use an unordered_set + salted hasher.

    Looking at the b-msghand (main) thread, we see ConnectBlock dominating it on master, while block serialization and tx hash computation become the dominant factors on this branch. The main thread calling ProcessInputInBackground on this branch indicates that work stealing is working correctly, and that there might be some speedup if the worker thread count were increased.

    I see these don't work well in the browser when hosted on github. If you right-click and download them, then open them in a browser, they are easier to inspect.

    perf_master perf_branch

  396. l0rinc commented at 4:58 PM on December 5, 2025: contributor

    The flames look impressive; my differential flames for all 900k blocks should also finish in a few days.

    Parallelism vs speedup on different platforms

    <img width="1268" height="910" alt="image" src="https://github.com/user-attachments/assets/b9a7538d-0e28-46bf-b1cc-6861cc459bd8" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
    mkdir -p "$LOG_DIR"; \
    COMMA_COMMITS=${COMMITS// /,}; \
    (echo ""; for c in $(echo $COMMITS); do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$LOG_DIR/rdx-$(echo "$COMMITS" | sed -E 's/([a-f0-9]{8})[a-f0-9]+ ?/\1-/g;s/-$//')-$STOP-$DBCACHE-appleclang.json" \
      --parameter-list COMMIT "$COMMA_COMMITS" \
      --prepare "killall -9 bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'height=0' \"$DATA_DIR\"/debug.log && grep -q 'Disabling script verification at block #1' \"$DATA_DIR\"/debug.log && grep -q \"height=$STOP\" \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
                  cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/debug-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
      "./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        4188.146 s               [User: 7387.420 s, System: 932.198 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        3683.049 s               [User: 7346.612 s, System: 833.443 s]
    
    Benchmark 3: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        3483.915 s               [User: 7427.734 s, System: 815.722 s]
    
    Benchmark 4: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        3349.891 s               [User: 7531.310 s, System: 839.720 s]
    
    Benchmark 5: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        3402.258 s               [User: 7836.059 s, System: 1139.180 s]
    
    Benchmark 6: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        3399.448 s               [User: 8072.648 s, System: 1508.136 s]
    
    Benchmark 7: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        3404.973 s               [User: 8226.177 s, System: 1889.810 s]
    
    Benchmark 8: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        3398.617 s               [User: 8358.164 s, System: 2256.116 s]
    
    Relative speed comparison
            1.25          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.10          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.04          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.02          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.01          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.02          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.01          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    <img width="1257" height="910" alt="image" src="https://github.com/user-attachments/assets/ff770b73-4ad5-4477-9f51-3d9f267d50d1" />

    <details> <summary>reindex-chainstate | 923319 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=923319; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 923319 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        17079.999 s               [User: 41791.411 s, System: 2436.225 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        15860.805 s               [User: 41236.804 s, System: 2345.765 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        15450.583 s               [User: 41354.937 s, System: 2364.782 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        15319.447 s               [User: 41812.192 s, System: 2357.605 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        15266.043 s               [User: 42357.261 s, System: 2514.361 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        15206.688 s               [User: 42723.482 s, System: 2511.710 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        15241.462 s               [User: 43303.245 s, System: 2591.484 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        15234.312 s               [User: 43964.600 s, System: 2819.547 s]
    
    Relative speed comparison
            1.12          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    <img width="1273" height="911" alt="image" src="https://github.com/user-attachments/assets/2a05ed7f-fdc3-4373-9853-a04c18faf5fb" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        18390.604 s               [User: 16855.672 s, System: 1349.628 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        17781.743 s               [User: 17041.495 s, System: 1368.022 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        17337.544 s               [User: 17490.866 s, System: 1413.850 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        17775.681 s               [User: 17832.946 s, System: 1436.415 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        17699.749 s               [User: 18151.899 s, System: 1476.080 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        17115.228 s               [User: 18467.571 s, System: 1506.336 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        17546.582 s               [User: 18532.849 s, System: 1551.217 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        17452.997 s               [User: 18550.132 s, System: 1546.017 s]
    
    Relative speed comparison
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    Edit:

    <img width="1272" height="912" alt="image" src="https://github.com/user-attachments/assets/29e79143-7ed0-4a32-9fca-3dae4548fd08" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="9a29b2d331eed5b4cbd6922f63e397b68ff12447 8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    9a29b2d331 Merge bitcoin/bitcoin#33857: doc: Add `x86_64-w64-mingw32ucrt` triplet to `depends/README.md`
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
      Time (abs ≡):        17037.553 s               [User: 26114.648 s, System: 2505.015 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        13967.390 s               [User: 27084.842 s, System: 2533.624 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        13030.059 s               [User: 27638.137 s, System: 2473.673 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        13077.949 s               [User: 27739.880 s, System: 2496.343 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        13051.649 s               [User: 27609.668 s, System: 2538.616 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        13287.758 s               [User: 27771.809 s, System: 2615.043 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        13308.250 s               [User: 27744.112 s, System: 2646.436 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        13436.808 s               [User: 27789.127 s, System: 2709.751 s]
    
    Benchmark 9: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        13430.790 s               [User: 27676.672 s, System: 2727.739 s]
    
    Relative speed comparison
            1.31          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    Edit2:

    <img width="1264" height="915" alt="image" src="https://github.com/user-attachments/assets/9f0bbaea-e742-48af-9ef6-73ce79bfacdc" />

    <details> <summary>reindex-chainstate | 600000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD</summary>

    COMMITS="9a29b2d331eed5b4cbd6922f63e397b68ff12447 8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=600000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    9a29b2d331 Merge bitcoin/bitcoin#33857: doc: Add `x86_64-w64-mingw32ucrt` triplet to `depends/README.md`
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 600000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
      Time (abs ≡):        28988.956 s               [User: 36830.542 s, System: 6768.571 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        23269.261 s               [User: 38584.449 s, System: 7103.664 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        21678.210 s               [User: 39240.948 s, System: 7259.279 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        21506.209 s               [User: 39363.857 s, System: 7569.643 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        21428.512 s               [User: 39616.150 s, System: 7698.857 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        21392.758 s               [User: 39653.354 s, System: 8054.084 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        21395.545 s               [User: 39365.692 s, System: 8235.890 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        21449.737 s               [User: 39321.387 s, System: 8314.124 s]
    
    Benchmark 9: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        21505.684 s               [User: 39558.723 s, System: 8648.682 s]
    
    Relative speed comparison
            1.36          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
            1.09          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    My conclusion (given that threads aren't free) is that 4 threads should suffice for now!

    Memory usage

    I've also measured the peak memory usage during reindex-chainstate. Unsurprisingly, the additional threads (4 threads * 8 MB stack = ~32 MB on glibc) and the reused internal caches (m_inputs.clear() does not free the allocated capacity, so the peak block input size will be held for the lifetime of the node) result in a measurable memory overhead (>100 MB peak extra). Once we're close to the finish line, we can discuss whether we should do anything about it (e.g., lower the default dbcache size, try to reduce memory usage in other ways to compensate, or just document the increase).
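
    As a standalone C++ sketch of the reused-capacity effect (illustrative only, not the PR's code): std::vector::clear() drops the elements but keeps the allocation, so a buffer sized for the largest block seen stays that large until explicitly released:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Capacity retained after clear() vs. after swap-with-empty, for a
    // vector that once held n elements.
    static std::pair<std::size_t, std::size_t> capacities(std::size_t n)
    {
        std::vector<int> inputs(n);
        inputs.clear();                  // size -> 0, allocation kept
        const std::size_t after_clear = inputs.capacity();
        std::vector<int>{}.swap(inputs); // actually releases the allocation
        return {after_clear, inputs.capacity()};
    }

    int main()
    {
        const auto [after_clear, after_swap] = capacities(30'000);
        assert(after_clear >= 30'000); // clear() kept the peak-sized buffer
        assert(after_swap == 0);       // swap-with-empty freed it
    }
    ```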

    The peak memory of v30 was 744.4 MB (measured via assumeutxo):

        MB
    744.4^                            :
         |# :::: :  ::: ::  :::::  ::@:::::   ::::::::@@
         |# : :: :  ::: ::  : ::   ::@:: ::   ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: ::   ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :: ::::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :: : ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :::: ::: ::: @
         |#:: :: :::::: ::  : ::   ::@:: :::: ::: ::: @
         |#:: ::::: ::: ::::: :: ::::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ ::::::::@@::@::::::::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
       0 +----------------------------------------------------------------------->h
         0                                                                   1.603
    

    The peak memory of master is 819.6 MB (measured via assumeutxo; see #31645 (comment)):

        MB
    819.6^      ##
         | :    #       :      :  :           :
         | :  ::# :::@@:::::::::  :  @:::::: :::::::
         | :  : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: :::::  @:: ::  :::: :
         | :::: # :: @ ::: :: ::: :::@:: ::  :::: :
         | :::: # :: @ ::: :: ::: :: @:: ::  :::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : :::::::::::@::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
       0 +----------------------------------------------------------------------->h
         0                                                                   1.484
    

    and the peak memory after this PR is 944.1 MB (measured via reindex-chainstate; we can't measure it via assumeutxo):

        MB
    944.1^                                                         #              
         |                       :@                    @           #              
         |                       :@                    @           #              
         |                       :@                    @      :    #              
         |     :::               :@     :         @    @   :  :  @ #        :     
         |     : :            :  :@     :    @    @    @ :::  :: @ #:       :     
         |     : :         :: : ::@     :   :@   :@    @ :::  :: @ #:: :    :     
         |  :  : :      :  : :: ::@:: : : :::@   :@   @@::::  :: @ #:: :    :     
         |  :  : :::    : :: :: ::@:  :@: : :@ : :@  :@@::::  :::@ #:: ::   : :   
         |  :  : ::::   : :: :::::@:  :@: : :@ : :@  :@@::::  :::@:#:: ::::@: ::  
         |  :  : ::::   :::: :::::@:  :@: : :@ : :@ ::@@::::  :::@:#:: ::::@: ::  
         | ::  : ::::::::::: :::::@: ::@: : :@:: :@ ::@@::::@ :::@:#:: ::::@: ::  
         | ::::: :::::: :::: :::::@: ::@: : :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
       0 +----------------------------------------------------------------------->h
         0                                                                   95.25
    

    massif-0f3778fbfb03fc9083326e9cf62b3d3293a7f623-921129-450.txt

    Also, now that the whole process finishes in half the time, it's possible that the higher peak memory is mostly a result of different memory-intensive phases coinciding more often.

  397. andrewtoth force-pushed on Dec 7, 2025
  398. DrahtBot added the label CI failed on Dec 7, 2025
  399. DrahtBot commented at 1:43 AM on December 7, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task TSan: https://github.com/bitcoin/bitcoin/actions/runs/19996361340/job/57344491981</sub> <sub>LLM reason (✨ experimental): ThreadSanitizer data race detected in CCoinsViewCache::FetchCoin during coinsviewcacheasync_tests.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  400. andrewtoth force-pushed on Dec 7, 2025
  401. andrewtoth force-pushed on Dec 7, 2025
  402. DrahtBot removed the label CI failed on Dec 7, 2025
  403. DrahtBot added the label Needs rebase on Dec 11, 2025
  404. andrewtoth force-pushed on Dec 11, 2025
  405. DrahtBot removed the label Needs rebase on Dec 11, 2025
  406. TheBlueMatt commented at 4:21 PM on December 11, 2025: contributor

    > For now the async view uses a fixed worker thread count of 4. The workload is primarily I/O-bound on DB latency rather than CPU-bound, so 4 workers already hide most of the latency and it simplifies the implementation. If needed we can make this configurable or tie it to -par later.

    Probably makes sense to benchmark this kind of change in a cloud environment as well. There you'll likely see fixed, higher latency but more consistent as you push more IOPS, which I anticipate might result in substantially different results/optimal thread counts compared to physically attached flash.

  407. andrewtoth commented at 10:45 PM on December 11, 2025: contributor

    I've rebased due to #33602, and added several touchups. Thank you @l0rinc for the suggestions!

    Thank you also @l0rinc for your very thorough measurements. I think 4 threads is a decent choice for now, but as @TheBlueMatt suggests I will try to run benchmarks in a cloud environment with network-connected storage.

    I would like to reproduce the memory findings, but #33351 makes it a little difficult to determine exact numbers. I think running an IBD on each branch and confirming the max RSS would be a better indicator than running an assumeutxo load.
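
    For reference, the "max RSS" in question is what GNU time -v reports as "Maximum resident set size"; a minimal standalone sketch of the same number via POSIX getrusage (unrelated to the PR's code):

    ```cpp
    #include <cassert>
    #include <cstring>
    #include <sys/resource.h>
    #include <vector>

    int main()
    {
        // Touch ~64 MiB so the peak resident set size is clearly measurable.
        std::vector<char> buf(64 * 1024 * 1024);
        std::memset(buf.data(), 1, buf.size());

        rusage usage{};
        assert(getrusage(RUSAGE_SELF, &usage) == 0);
        // On Linux, ru_maxrss is the process's peak RSS in kilobytes --
        // the same figure /usr/bin/time -v prints.
        assert(usage.ru_maxrss >= 64 * 1024);
    }
    ```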

    > the additional threads (4 threads * 8 MB stack = ~32 MB on glibc), reused internal caches (m_inputs.clear() does not free the allocated memory capacity so the peak block input size will be held for the lifetime of the node) result in a measurable memory overhead (>100 MB peak extra).

    I am skeptical about this claim. An InputToFetch is 72 bytes, plus 8 bytes per txid stored. Let's be generous and round up to 100 bytes per input. The theoretical maximum number of inputs in a block is 1 MB / 41 bytes ≈ 24.3k; call it 30k. Let's be generous again and double that to 60k, since vectors double their capacity when they grow. That's 100 bytes * 60k = 6 MB, which plus the 32 MB for the thread stacks is only 38 MB. Where does the extra memory come from? Or am I not accounting for something big in my math here?

    Edit: Actually, double that, since the cacheCoins map also keeps its allocation. That would still only be a maximum of 12 MB (in reality much less) on top of the 32 MB.
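
    The arithmetic above, written out as a compile-time check (all constants are the rough estimates from this comment, not measured values):

    ```cpp
    #include <cstddef>

    // ~72 B per InputToFetch + ~8 B of txid storage, rounded up generously.
    constexpr std::size_t bytes_per_input = 100;
    // ~24.3k theoretical max inputs per block, rounded up to 30k, then
    // doubled to account for vector capacity doubling on growth.
    constexpr std::size_t max_inputs = 2 * 30'000;
    constexpr std::size_t queue_peak = bytes_per_input * max_inputs;
    // 4 worker threads * 8 MiB default glibc stack.
    constexpr std::size_t thread_stacks = 4 * 8 * 1024 * 1024;

    static_assert(queue_peak == 6'000'000);                 // 6 MB
    static_assert(queue_peak + thread_stacks < 40'000'000); // ~38 MB total

    int main() { return 0; }
    ```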

  408. l0rinc commented at 1:50 PM on December 12, 2025: contributor

    As mentioned on IRC yesterday:

    > We also observed that on the 16 GB system, runs with -dbcache values of 4 GB and higher were a lot slower than with -dbcache of 3 GB, and that an rpi5 with 16 GB of memory ran out of memory with -dbcache of 10 GB.

    My assumption was that it's caused by the UTXO set size getting closer to the total memory, so I ran it on the i9 and i7 servers, both of which have 64 GB of memory:

    <img width="1490" height="873" alt="image" src="https://github.com/user-attachments/assets/be18a827-bbeb-4031-b0cc-79f30aa37d45" />

    <details> <summary>reindex-chainstate | 900000 blocks | dbcache 100-7000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    for DBCACHE in 100 200 300 400 500 1000 2000 3000 4000 5000 6000 7000; do \
        COMMITS="f6acbef1084e34f126bf530df99e4ef6a11c38e8 eee2204d6f7117c5b39abaf47d7d329ff0951638"; \
        STOP=900000; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
          --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 100 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        23016.826 s               [User: 40886.840 s, System: 2758.769 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        15579.671 s               [User: 39232.678 s, System: 2491.046 s]
    
    Relative speed comparison
            1.48          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 200 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        21436.283 s               [User: 37890.294 s, System: 2736.881 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14707.903 s               [User: 35945.513 s, System: 2246.392 s]
    
    Relative speed comparison
            1.46          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 300 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        20189.990 s               [User: 35193.471 s, System: 2875.028 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14072.230 s               [User: 33903.334 s, System: 2159.474 s]
    
    Relative speed comparison
            1.43          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 400 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        19874.210 s               [User: 33637.992 s, System: 2759.377 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14185.335 s               [User: 32711.873 s, System: 2197.601 s]
    
    Relative speed comparison
            1.40          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 500 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        19176.938 s               [User: 31471.188 s, System: 2511.966 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13856.071 s               [User: 31289.683 s, System: 2210.229 s]
    
    Relative speed comparison
            1.38          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 1000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        17434.420 s               [User: 25518.649 s, System: 1736.314 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13178.801 s               [User: 26366.210 s, System: 1707.001 s]
    
    Relative speed comparison
            1.32          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 2000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        16285.543 s               [User: 20987.530 s, System: 1073.606 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12830.954 s               [User: 22300.342 s, System: 1206.297 s]
    
    Relative speed comparison
            1.27          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 3000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15768.416 s               [User: 19226.843 s, System: 863.531 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12788.084 s               [User: 20314.162 s, System: 965.156 s]
    
    Relative speed comparison
            1.23          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 4000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15685.029 s               [User: 18706.301 s, System: 811.667 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12892.918 s               [User: 19746.475 s, System: 903.910 s]
    
    Relative speed comparison
            1.22          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 5000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15478.130 s               [User: 18161.854 s, System: 764.612 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12924.056 s               [User: 19084.441 s, System: 852.527 s]
    
    Relative speed comparison
            1.20          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 6000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15563.687 s               [User: 17939.937 s, System: 754.868 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13059.707 s               [User: 18685.622 s, System: 808.353 s]
    
    Relative speed comparison
            1.19          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 7000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15630.441 s               [User: 17853.930 s, System: 776.871 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13210.208 s               [User: 18681.925 s, System: 822.040 s]
    
    Relative speed comparison
            1.18          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    

    </details>

    <details> <summary>reindex-chainstate | 900000 blocks | dbcache 100-500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    for DBCACHE in 100 200 300 400 500; do \
        COMMITS="f6acbef1084e34f126bf530df99e4ef6a11c38e8 eee2204d6f7117c5b39abaf47d7d329ff0951638"; \
        STOP=900000; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 100 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        44039.358 s               [User: 40502.406 s, System: 3048.444 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        34822.469 s               [User: 43549.781 s, System: 2913.842 s]
    
    Relative speed comparison
            1.26          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 200 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        43276.275 s               [User: 37875.550 s, System: 3095.389 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        33394.767 s               [User: 39767.262 s, System: 2773.980 s]
    
    Relative speed comparison
            1.30          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 300 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        53339.879 s               [User: 37057.843 s, System: 3635.662 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        39372.213 s               [User: 37610.647 s, System: 2897.763 s]
    
    Relative speed comparison
            1.35          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 400 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        41460.250 s               [User: 33577.389 s, System: 3144.309 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        33277.494 s               [User: 35893.949 s, System: 2803.631 s]
    
    Relative speed comparison
            1.25          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        39183.661 s               [User: 31615.517 s, System: 2853.531 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        32531.387 s               [User: 33908.272 s, System: 2720.594 s]
    
    Relative speed comparison
            1.20          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    

    </details>

    My conclusions from the above are:

    • It's not about the size of the UTXO set vs total memory;
    • This change performs better the lower the memory (we knew this already);
    • After 3 GB of dbcache there isn't any measurable speedup for some reason - with the PR it even gets slightly slower;
    • Master seems to behave similarly to the PR in this regard;
    • HDD isn't very stable, but the SSD is ridiculously predictable;
    • The PR performs better with 100 MB of dbcache than master with 7 GB;
    • The PR performs exactly the same with 1 GB and 7 GB.
  409. andrewtoth commented at 6:41 PM on December 12, 2025: contributor

    The PR performs better with 100 MB of dbcache than master with 7 GB;

    wat

  410. andrewtoth force-pushed on Dec 20, 2025
  411. in src/validation.cpp:3127 in 6db7f75f53 outdated
    3123 | @@ -3122,6 +3124,7 @@ bool Chainstate::ConnectTip(
    3124 |              if (state.IsInvalid())
    3125 |                  InvalidBlockFound(pindexNew, state);
    3126 |              LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    3127 | +            view.Reset();
    


    l0rinc commented at 11:49 AM on December 21, 2025:

    // local CCoinsViewCache goes out of scope

    This isn't true anymore as far as I can tell

  412. in src/test/fuzz/coinsviewcacheasync.cpp:42 in 6db7f75f53 outdated
      37 | +        .path = "",
      38 | +        .cache_bytes = 1_MiB,
      39 | +        .memory_only = true,
      40 | +    };
      41 | +    g_db.emplace(std::move(db_params), CoinsViewOptions{});
      42 | +    CCoinsViewCache cache{nullptr};
    


    l0rinc commented at 12:42 PM on December 21, 2025:

    hmmm, isn't this UB, aren't we getting lifetime problems here?

    nit: why are the tests separated from the implementation? They're part of the "feature", how can we review one fully without the other?


    andrewtoth commented at 3:00 PM on December 21, 2025:

    hmmm, isn't this UB, aren't we getting lifetime problems here?

    Nothing touches base while it is nullptr. We need to call a method to get UB, and we don't do that until we replace it with a valid pointer. Fuzzing would have revealed any UB by now.


    andrewtoth commented at 3:01 PM on December 21, 2025:

    why are the tests separated from the implementation?

    This seems to be a standard way to do this. Less cognitive load per commit. See e.g. #29415.


    l0rinc commented at 3:04 PM on December 21, 2025:

    Less cognitive load per commit

    Strongly disagree. We simply don't have enough information to review it properly and just skip to the next commit. So in a way, yes, less cognitive load - but not the good kind...


    l0rinc commented at 3:37 PM on December 21, 2025:

    Nothing touches base while it is nullptr

    It's not nullptr, it's a dangling pointer. The CoinsViewCacheAsync constructor takes a CCoinsViewCache& to a stack object, which gets destroyed after setup_threadpool_test exits - something like: https://godbolt.org/z/3e9qoPTP1

    The fix could simply be:

     std::optional<CoinsViewCacheAsync> g_async_cache{};
     std::optional<CCoinsViewDB> g_db{};
    +std::optional<CCoinsViewCache> g_cache{};
     
     static void setup_threadpool_test()
     {
    @@ -39,8 +40,8 @@
             .memory_only = true,
         };
         g_db.emplace(std::move(db_params), CoinsViewOptions{});
    -    CCoinsViewCache cache{nullptr};
    -    g_async_cache.emplace(cache, *g_db);
    +    g_cache.emplace(nullptr);
    +    g_async_cache.emplace(*g_cache, *g_db);
     }
    

    andrewtoth commented at 6:01 PM on December 21, 2025:

    Right, now I see. Yes, it is a dangling pointer, but again, UB is not triggered unless the pointer is dereferenced.


    andrewtoth commented at 3:21 PM on December 22, 2025:

    We can get rid of the dangling pointer by just doing this:

    diff --git a/src/test/fuzz/coinsviewcacheasync.cpp b/src/test/fuzz/coinsviewcacheasync.cpp
    index 77c378288e..8796c51926 100644
    --- a/src/test/fuzz/coinsviewcacheasync.cpp
    +++ b/src/test/fuzz/coinsviewcacheasync.cpp
    @@ -41,6 +41,7 @@ static void setup_threadpool_test()
         g_db.emplace(std::move(db_params), CoinsViewOptions{});
         CCoinsViewCache cache{nullptr};
         g_async_cache.emplace(cache, *g_db);
    +    g_async_cache->SetBackend(nullptr);
     }
     
     FUZZ_TARGET(coinsviewcacheasync, .init = setup_threadpool_test)
    

    andrewtoth commented at 3:00 AM on December 23, 2025:

    This is updated to have the same constructor interface as CCoinsViewCache, so fixed.

  413. in src/coinsviewcacheasync.h:240 in 6db7f75f53 outdated
     235 | +        }
     236 | +    }
     237 | +
     238 | +    ~CoinsViewCacheAsync() override
     239 | +    {
     240 | +        m_barrier.arrive_and_drop();
    


    l0rinc commented at 12:53 PM on December 21, 2025:

    we seem to be calling StopFetching everywhere except here - since we rely on m_inputs.empty() for releasing the barrier, it might be safer to do that here, too - what do you think? Or maybe a Flush() - not sure...


    l0rinc commented at 8:43 AM on December 25, 2025:

    Hmmm, if I just call StartFetching and let the destructor do its job, I get a hanging test:

    BOOST_AUTO_TEST_CASE(destructor_without_reset)
    {
        CCoinsViewDB db{{.path = "", .memory_only = true}, {}};
        CCoinsViewCache main_cache{&db};
        CoinsViewCacheAsync view{&main_cache};
        view.StartFetching(CreateBlock());
        // Destructor called WITHOUT Reset() or Flush()
    }
    

    adding an explicit stop to the destructor fixes it for me:

    ~CoinsViewCacheAsync() override
    {
        StopFetching();
        m_barrier.arrive_and_drop();
        for (auto& t : m_worker_threads) t.join();
    }
    

    If you think this is a misuse of the API, we could add an assert in the destructor instead.

  414. in src/coinsviewcacheasync.h:25 in 6db7f75f53
      20 | +#include <ranges>
      21 | +#include <thread>
      22 | +#include <utility>
      23 | +#include <vector>
      24 | +
      25 | +static constexpr int32_t WORKER_THREADS{8};
    


    l0rinc commented at 1:08 PM on December 21, 2025:

    Based on our previous measurements I think this should remain 4 - unless you have better data.

  415. in src/coinsviewcacheasync.h:86 in 6db7f75f53 outdated
      81 | +     * Similar to CCoinsViewCache::GetCoin, but it does not mutate internally.
      82 | +     * Therefore safe to call from any thread once inside the barrier.
      83 | +     */
      84 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      85 | +    {
      86 | +        if (auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)}) {
    


    l0rinc commented at 1:23 PM on December 21, 2025:

    this is still super-fishy to me, we have to fix some abstractions here first ...


    andrewtoth commented at 3:04 PM on December 21, 2025:

    We know we will always have a cache as the base here. This doesn't always hold for the base class; e.g., a CCoinsViewCache can have a CCoinsViewDB as its base.


    andrewtoth commented at 2:59 AM on December 23, 2025:

    Updated to use a FetchCoinWithoutMutating protected method.


    l0rinc commented at 4:47 PM on December 25, 2025:

    My understanding is that as long as the current cache and its descendants are CCoinsViewCache instances, we will try to get the outpoint from them - but as mentioned, I think that violates their interface and introduces unannounced unpredictability (i.e. it surprises the reviewer by assuming more locally than the interface promised).

    Not yet sure if that would be better, but if we introduced an additional CCoinsViewCache method to peek into the structure, we wouldn't need the iteration and casting, i.e. something like:

    std::optional<Coin> CCoinsViewBacked::PeekCoin(const COutPoint& outpoint) const { return base->PeekCoin(outpoint); }
    

    and

    std::optional<Coin> CCoinsViewCache::PeekCoin(const COutPoint& outpoint) const
    {
        if (auto it{cacheCoins.find(outpoint)}; it != cacheCoins.end()) {
            return it->second.coin.IsSpent() ? std::nullopt : std::optional{it->second.coin};
        }
        return base->PeekCoin(outpoint);
    }
    
  416. in src/coins.h:495 in ab1614473b outdated
     491 | @@ -484,7 +492,7 @@ class CCoinsViewCache : public CCoinsViewBacked
     492 |       * @note this is marked const, but may actually append to `cacheCoins`, increasing
     493 |       * memory usage.
     494 |       */
     495 | -    CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const;
     496 | +    virtual CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const;
    


    l0rinc commented at 1:25 PM on December 21, 2025:

    Doesn't this incur an additional hot-path virtual dispatch cost? I will carve this out to a separate commit and run a reindex-chainstate with minimal dbcache to force Flush & FetchCoin calls with and without virtual access to see if this is a valid concern.


    andrewtoth commented at 3:05 PM on December 21, 2025:

    I'm not sure why that matters. Our benchmarks show a very large speedup with this change?


    l0rinc commented at 6:48 PM on December 25, 2025:

    Measured it a few times and got very different results; I guess we can assume for now that this isn't a problem.


    andrewtoth commented at 7:43 PM on January 11, 2026:

    Got rid of this via #34165.

  417. in src/coinsviewcacheasync.h:14 in dc4c3f6cac outdated
       9 | +
      10 | +class CoinsViewCacheAsync : public CCoinsViewCache
      11 | +{
      12 | +public:
      13 | +    //! Reset state.
      14 | +    void Reset() noexcept
    


    l0rinc commented at 1:30 PM on December 21, 2025:

    Reset doesn't sound like an async property - would it make sense to make that a CCoinsViewCache method instead?


    andrewtoth commented at 3:24 PM on December 22, 2025:

    We would need a virtual method in the base CCoinsViewCache class, and then override it here because we have to clear our subclass members as well. But it would never be called through the base class anywhere. It might make the commits easier to follow though, since I could introduce it on the base class and use that to reuse just a CCoinsViewCache, then introduce CoinsViewCacheAsync later. I will experiment.


    l0rinc commented at 6:49 PM on December 25, 2025:

    The way you've added it is excellent, carving out a genuine sub-feature (which we could push as a separate PR)

  418. in src/coinsviewcacheasync.h:83 in b9edd77b49 outdated
      78 | @@ -67,6 +79,11 @@ class CoinsViewCacheAsync : public CCoinsViewCache
      79 |          if (i >= m_inputs.size()) [[unlikely]] return false;
      80 |  
      81 |          auto& input{m_inputs[i]};
      82 | +        // Inputs spending a coin from a tx earlier in the block won't be in the cache or db
      83 | +        if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    


    l0rinc commented at 1:44 PM on December 21, 2025:

    b9edd77b4960f68afc761447e4e3372371be2143: this feature is nicely split out of the whole - but I'm missing a test in the commit that could help me debug it locally before I move on to the next commit.

  419. in src/test/coinsviewoverlay_tests.cpp:206 in 85c20a57d4 outdated
     201 | @@ -202,4 +202,20 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
     202 |      }
     203 |  }
     204 |  
     205 | +// Test that the main thread can make progress with no workers
     206 | +BOOST_AUTO_TEST_CASE(fetch_main_thread)
    


    l0rinc commented at 1:46 PM on December 21, 2025:

    can we add this to the commit that introduces the feature that this is validating?

  420. in src/coinsviewcacheasync.h:27 in 6db7f75f53 outdated
      23 | @@ -24,6 +24,19 @@
      24 |  
      25 |  static constexpr int32_t WORKER_THREADS{8};
      26 |  
      27 | +/**
    


    l0rinc commented at 1:46 PM on December 21, 2025:

    6db7f75f53bd89e4e9b019c6ecd0c31f43d0f219: comments aren't features - why not add them when CoinsViewCacheAsync is introduced, and extend the description with every added feature?

  421. in src/coinsviewcacheasync.h:127 in 6db7f75f53 outdated
     122 | +    {
     123 | +        // This assumes ConnectBlock accesses all inputs in the same order as they are added to m_inputs
     124 | +        // in StartFetching. Some outpoints are not accessed because they are created by the block, so we scan until we
     125 | +        // come across the requested input. We advance the tail since the input will be cached and not accessed through
     126 | +        // this method again.
     127 | +        for (const auto i : std::views::iota(m_input_tail, m_inputs.size())) [[likely]] {
    


    l0rinc commented at 1:49 PM on December 21, 2025:

    does the [[likely]] have an effect on the if below? As mentioned before, I think it has some weird effect when nested...


    andrewtoth commented at 2:54 PM on December 21, 2025:

    No, it does not affect anything other than the loop branch. It's not nested here.

  422. l0rinc commented at 1:56 PM on December 21, 2025: contributor

    I went through the changes quickly, planning on recreating everything locally to understand the constraints more fundamentally.

    I want to investigate a few more issues before I can do that (e.g. the virtual dispatch cost on the critical path, how removing noexcept and simpler siphash changes would affect the constraints, and whether we can clean up the coins area a bit more before we proceed).

    To reduce risk (since this is at the heart of the project), I want to continue carving out cleanup PRs that would derisk and simplify this one. I appreciate your patience and quick reaction time here.

  423. andrewtoth commented at 2:32 PM on December 21, 2025: contributor

    Did some benchmarks for cloud connected storage in AWS.

    I used 2 c6in.xlarge instances (4 vCPU, 8 GB RAM) with 800 GB gp3 volumes attached. The volumes were configured with 12500 IOPS and 391 MB/s throughput, which is the baseline for that instance type. The instance type was chosen because it had the highest baseline EBS throughput for that size class.

    I ran a reindex-chainstate up to block 921129 with this branch and master, and this branch was ~2.6x faster. 12h43m vs 33h20m.

    branch: 39139.06user 4836.53system 12:42:43elapsed 96%CPU (0avgtext+0avgdata 6748912maxresident)k 25383964712inputs+5043205272outputs (140197459major+76837294minor)pagefaults 0swaps

    master: 34586.03user 4147.82system 33:19:56elapsed 32%CPU (0avgtext+0avgdata 6931572maxresident)k 25758328464inputs+5040510928outputs (137873360major+81380971minor)pagefaults 0swaps

    On network-connected storage, master is completely dominated by the serial latency of fetching inputs one by one. It can't push past around 140 MB/s of read throughput, so it didn't get close to maxing out the disk. This branch easily hits the volume limits with 4 threads by parallelizing away this latency. After it completed, I ran it again with the gp3 volume throughput limit bumped to 1000 MB/s. It managed over 500 MB/s and completed in 10h01m - a ~3.3x speedup over master. Coincidentally, both the second run and master completed at almost the exact same time, which you can see in the graph below.

    <img width="2720" height="593" alt="Screenshot from 2025-12-20 23-17-52" src="https://github.com/user-attachments/assets/06c771b9-dbef-409a-9870-0647d39986c0" />

    Afterwards, I tested running with 8 and 12 threads, with both volumes bumped to 1000 MB/s. The 12-thread variant hit the throughput limit, but the 8-thread one did not. However, after 3 hours of bursting above the baseline, both got throttled back. These finished in 10h04m and 8h51m respectively - 3.3x and ~3.8x speedups - so they are not much faster with this instance type. But with a bigger instance with a higher EBS baseline and gp3 settings allowing more read throughput, they could be much faster.

    So, if you have a lot of sustained read throughput available, it might make sense to bump to many more threads. @TheBlueMatt

    The below graph shows this branch but with 12 worker threads (called master in the graph) and 8 worker threads (called branch in the graph). After 3 hours of bursting above baseline they get throttled back to ~390 MB/s. <img width="2720" height="593" alt="Screenshot from 2025-12-21 09-27-01" src="https://github.com/user-attachments/assets/ce9f24d2-d416-44bd-99d7-41921a869928" />

    Also, the max RSS reported by /usr/bin/time in the output above shows that master actually has the higher max RSS: 6931572k vs 6748912k. That would imply an RSS of 6.9 GB vs 6.7 GB, which doesn't really make sense to me. Both were using the default dbcache, so I'm not entirely sure how these values are computed. However, it seems there is less memory pressure on this branch than on master. @l0rinc

  424. andrewtoth force-pushed on Dec 21, 2025
  425. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >40% faster IBD
    validation: fetch block inputs on parallel threads 3x faster IBD
    on Dec 22, 2025
  426. andrewtoth force-pushed on Dec 23, 2025
  427. andrewtoth force-pushed on Dec 23, 2025
  428. DrahtBot added the label CI failed on Dec 23, 2025
  429. DrahtBot commented at 3:06 AM on December 23, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task test max 6 ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/20450003058/job/58760941958</sub> <sub>LLM reason (✨ experimental): Compilation error: parameter 'base' shadows inherited member in CoinsViewCacheAsync, causing CI failure.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  430. andrewtoth force-pushed on Dec 23, 2025
  431. andrewtoth force-pushed on Dec 23, 2025
  432. DrahtBot removed the label CI failed on Dec 23, 2025
  433. in src/validation.cpp:3117 in 14f1b79138 outdated
    3113 | @@ -3113,7 +3114,7 @@ bool Chainstate::ConnectTip(
    3114 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3115 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3116 |      {
    3117 | -        CCoinsViewCache view(&CoinsTip());
    3118 | +        auto& view{*m_coins_views->m_connect_block_view};
    


    l0rinc commented at 11:39 AM on December 24, 2025:

    14f1b7913861d95e3ffeda658d45b0b88e7019e7:

    I love this separation, we could extract it to a new PR since it kinda' makes sense on its own - carving out a small but self-sufficient portion of this PR. It could also help us in future refactors (e.g. preallocating the coins cache maps to avoid resizes).


    Could we add a Start here as well, which could assert that this was in a clean state at the beginning (as a sanity check that all return paths are calling it)?

    void CCoinsViewCache::Start()
    {
        Assert(cacheCoins.empty());
        Assert(cachedCoinsUsage == 0);
        Assert(m_sentinel.second.Next() == &m_sentinel);
        Assert(m_sentinel.second.Prev() == &m_sentinel);
    
        SetBestBlock(base->GetBestBlock());
    }
    

    Besides the symmetry with Reset, it could help with reusing this same view for the other /*will_reuse_cache=*/false call sites. The assertion would ensure that the two cannot accidentally run at the same time (proving that we can reuse the same instance).

    • currently applied to Chainstate::ConnectTip, but could be added (here or in a separate cleanup PR) to Chainstate::DisconnectTip and TestBlockValidity cleanly
    • CVerifyDB::VerifyDB and Chainstate::ReplayBlocks & Chainstate::RollforwardBlock need to hold more than 1 block, not sure it's worth reusing the cache for those - though parallelization would be welcome in both cases...

    We could also add Reset() and Start() to the constructor, but that would require fixing the mentioned UB in #34124 (comment)


    Note: we could access this through a dedicated method instead like we do with other similar ones:

    CoinsViewCacheAsync& ConnectBlockView() EXCLUSIVE_LOCKS_REQUIRED(::cs_main)
    {
        AssertLockHeld(::cs_main);
        Assert(m_coins_views);
        return *Assert(m_coins_views->m_connect_block_view);
    }
    

    andrewtoth commented at 7:34 PM on December 28, 2025:

    Hmm doing SetBestBlock(base->GetBestBlock()); seems like a behavior change which I don't want to do here. Reset is basically returning the state of CCoinsViewCache to a fresh copy, so it is identical behavior-wise to what we had before.

    If we don't call SetBestBlock, then I don't see a benefit to including a Start method.


    l0rinc commented at 7:37 PM on December 28, 2025:

    seems like a behavior change which I don't want to do here

    Don't we have a race condition otherwise because of the lazy init + setter?


    andrewtoth commented at 7:41 PM on December 28, 2025:

    a race condition

    I don't think so? Only the base's cacheCoins is accessed via multiple threads, not hashBlock. So it is safe to mutate hashBlock on the main thread.

  434. in src/coins.cpp:284 in 14f1b79138 outdated
     276 | @@ -275,6 +277,13 @@ bool CCoinsViewCache::Sync()
     277 |      return fOk;
     278 |  }
     279 |  
     280 | +void CCoinsViewCache::Reset() noexcept
     281 | +{
     282 | +    cacheCoins.clear();
     283 | +    cachedCoinsUsage = 0;
     284 | +    hashBlock.SetNull();
    


    l0rinc commented at 12:15 PM on December 24, 2025:

    14f1b7913861d95e3ffeda658d45b0b88e7019e7: we're not setting view.SetBestBlock in Chainstate::ConnectTip in this commit - but I think we should, to avoid leaving the lazy getter which is in a race with the setter. Since this is done in a multithreaded code, I think we should set it deterministically and remove the lazy init. Especially since Flush/Sync access hashBlock directly...


    andrewtoth commented at 7:42 PM on December 29, 2025:

    the lazy getter which is in a race with the setter

    Can you describe this race? The current behavior creates a CCoinsViewCache with a null hashBlock and passes it to ConnectBlock. This behavior now resets the CoinsViewCacheAsync hashBlock to null and passes it to ConnectBlock. We don't modify any behavior inside ConnectBlock. hashBlock is not accessed in multithreaded code.

  435. in src/coins.cpp:256 in 14f1b79138 outdated
     252 | @@ -253,11 +253,13 @@ bool CCoinsViewCache::Flush(bool will_reuse_cache) {
     253 |      auto cursor{CoinsViewCacheCursor(m_sentinel, cacheCoins, /*will_erase=*/true)};
     254 |      bool fOk = base->BatchWrite(cursor, hashBlock);
     255 |      if (fOk) {
     256 | -        cacheCoins.clear();
     257 |          if (will_reuse_cache) {
    


    l0rinc commented at 10:28 AM on December 26, 2025:

    we can likely get rid of the will_reuse_cache now that we have a reusable cache that we can reset - will attempt in a follow-up


    andrewtoth commented at 8:57 PM on January 3, 2026:

    Done in #34164.

  436. andrewtoth force-pushed on Dec 26, 2025
  437. andrewtoth force-pushed on Dec 26, 2025
  438. DrahtBot added the label CI failed on Dec 26, 2025
  439. DrahtBot commented at 4:41 PM on December 26, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20525757781/job/58968699149</sub> <sub>LLM reason (✨ experimental): Lint failure due to trailing whitespace in src/test/fuzz/coinscache_sim.cpp:59.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  440. andrewtoth force-pushed on Dec 26, 2025
  441. andrewtoth force-pushed on Dec 26, 2025
  442. DrahtBot removed the label CI failed on Dec 26, 2025
  443. andrewtoth commented at 8:19 PM on December 26, 2025: contributor

    Thank you again @l0rinc for your review. I've taken most of your suggestions. Reset() is now on the base CCoinsViewCache class and GetPossiblySpentCoinFromCache has been replaced with a protected FetchCoinWithoutMutating.

    I've removed the new fuzz harness and instead integrated the CoinsViewCacheAsync to our coins_view and coinscache_sim fuzz targets. So, we can fuzz them as we would a CCoinsViewCache and make sure the new subclass behaves the same as before. I have been fuzzing the three new targets for a while now.

    virtual dispatch cost on critical path

    I don't see why this needs investigating. Both our benchmarks show large speedups along the critical path, so even if there is a minor increased cost here it is a net benefit.

    how removing noexcept and simpler siphash changes would affect the constraints

    These seem like good ideas to investigate, but I don't see how they are applicable to the change proposed here. Can you elaborate?

    whether we can clean up the coins area a bit more before we proceed

    That is a worthy goal. I have refactored to make the changes to coins more straightforward.

  444. in src/test/fuzz/coins_view.cpp:91 in 738c40a566 outdated
      84 | @@ -85,6 +85,11 @@ void TestCoinsView(FuzzedDataProvider& fuzzed_data_provider, CCoinsView& backend
      85 |                  if (is_db && best_block.IsNull()) best_block = uint256::ONE;
      86 |                  coins_view_cache.SetBestBlock(best_block);
      87 |              },
      88 | +            [&] {
      89 | +                coins_view_cache.Reset();
      90 | +                // Set best block hash to non-null to satisfy the assertion in CCoinsViewDB::BatchWrite().
      91 | +                if (is_db) coins_view_cache.SetBestBlock(uint256::ONE);
    


    l0rinc commented at 7:20 AM on December 27, 2025:

    we only need this for caches that actually write to db, so we might as well:

                    if (!will_reuse_cache && is_db) coins_view_cache.SetBestBlock(uint256::ONE);
    
  445. in src/test/coins_tests.cpp:1124 in 738c40a566 outdated
    1120 | @@ -1121,4 +1121,28 @@ BOOST_AUTO_TEST_CASE(ccoins_emplace_duplicate_keeps_usage_balanced)
    1121 |      BOOST_CHECK(cache.AccessCoin(outpoint) == coin1);
    1122 |  }
    1123 |  
    1124 | +BOOST_AUTO_TEST_CASE(ccoins_reset)
    


    l0rinc commented at 7:28 AM on December 27, 2025:

    we could extend this with idempotency checks - especially if we add the mentioned Start method:

    BOOST_AUTO_TEST_CASE(ccoins_start)
    {
        test_only_CheckFailuresAreExceptionsNotAborts mock_checks{};
    
        CCoinsView root;
        CCoinsViewCacheTest cache{&root};
    
        // Start fails if state wasn't reset
        cache.Start();
        cache.EmplaceCoinInternalDANGER({Txid::FromUint256(m_rng.rand256()), m_rng.rand32()}, {});
        BOOST_CHECK_THROW(cache.Start(), NonFatalCheckError);
    
        // Resetting allows start again
        cache.Reset();
        cache.Start();
    
        // Reset and Start are idempotent
        cache.Reset();
        cache.Reset();
        cache.Start();
        cache.Start();
    }
    

    andrewtoth commented at 8:39 PM on December 28, 2025:

    Added in #34164.

  446. in src/coins.cpp:278 in 738c40a566 outdated
     274 | @@ -275,6 +275,13 @@ bool CCoinsViewCache::Sync()
     275 |      return fOk;
     276 |  }
     277 |  
     278 | +void CCoinsViewCache::Reset() noexcept
    


    l0rinc commented at 7:42 AM on December 27, 2025:

    the constructor should likely call this reset at the beginning, so this should likely adjust m_sentinel.second as well, something like:

    CCoinsViewCache::CCoinsViewCache(CCoinsView* baseIn, bool deterministic) :
        CCoinsViewBacked(baseIn), m_deterministic(deterministic),
        cacheCoins(0, SaltedOutpointHasher(/*deterministic=*/deterministic), CCoinsMap::key_equal{}, &m_cache_coins_memory_resource)
    {
        CCoinsViewCache::Reset();
        Start();
    }
    
    void CCoinsViewCache::Start()
    {
        Assert(cacheCoins.empty());
        Assert(cachedCoinsUsage == 0);
        Assert(m_sentinel.second.Next() == &m_sentinel);
        Assert(m_sentinel.second.Prev() == &m_sentinel);
    
        SetBestBlock(base->GetBestBlock());
    }
    
    void CCoinsViewCache::Reset() noexcept
    {
        cacheCoins.clear();
        cachedCoinsUsage = 0;
        hashBlock.SetNull();
        m_sentinel.second.SelfRef(m_sentinel);
    }
    

    andrewtoth commented at 8:58 PM on January 3, 2026:

    I don't think we need to add a Start() method to the cache. I'd rather not touch the hashBlock behavior in this PR; it can be cleaned up in a parallel PR. No multithreaded code touches it here.

  447. in src/validation.cpp:3136 in 738c40a566 outdated
    3132 | @@ -3131,8 +3133,9 @@ bool Chainstate::ConnectTip(
    3133 |                   Ticks<MillisecondsDouble>(time_3 - time_2),
    3134 |                   Ticks<SecondsDouble>(m_chainman.time_connect_total),
    3135 |                   Ticks<MillisecondsDouble>(m_chainman.time_connect_total) / m_chainman.num_blocks_total);
    3136 | -        bool flushed = view.Flush(/*will_reuse_cache=*/false); // local CCoinsViewCache goes out of scope
    3137 | +        bool flushed = view.Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
    


    l0rinc commented at 7:45 AM on December 27, 2025:

    👍 for the new comment

  448. in src/coinsviewcacheasync.h:177 in b9ecb3c9ba outdated
     172 | +        m_txids.clear();
     173 | +    }
     174 | +
     175 | +public:
     176 | +    //! Fetch all block inputs.
     177 | +    void StartFetching(const CBlock& block) noexcept
    


    l0rinc commented at 9:11 AM on December 27, 2025:

    Is this idempotent, are we sure all state is reset after the previous block? Could we call a reset here or an assert to make sure we're not accidentally inheriting anything from a previous (failed?) fetch?


    andrewtoth commented at 7:22 PM on December 28, 2025:

    It's not idempotent. We need to call Reset/Flush/Sync/SetBackend on the cache before calling this again. There is an Assume(m_inputs.empty()); check. I'm not sure we want to call Reset here, since calling this before Reset is an error and we should crash.


    andrewtoth commented at 7:50 PM on January 11, 2026:
  449. in src/test/coinsviewcacheasync_tests.cpp:134 in b9ecb3c9ba outdated
     129 | +    PopulateView(block, main_cache);
     130 | +    CoinsViewCacheAsync view{&main_cache};
     131 | +    for (auto i{0}; i < 3; ++i) {
     132 | +        view.StartFetching(block);
     133 | +        CheckCache(block, view);
     134 | +        view.Reset();
    


    l0rinc commented at 9:13 AM on December 27, 2025:

    what if we forget to call Reset() between two StartFetching calls?


    andrewtoth commented at 7:19 PM on December 28, 2025:

    Bad things. We need to call Reset before the block is destroyed.


    andrewtoth commented at 7:50 PM on January 11, 2026:

    I have updated StartFetching to call StopFetching first. This way you can fetch two blocks without resetting the cache (useful for VerifyDB). I've also added a RAII control object that is returned from StartFetching and calls StopFetching when it goes out of scope. This is much safer than having to remember to call one of the stopping methods. It also confines the Reset concern to the base CCoinsViewCache only.

    This is bound to the lifetime of the block as well, so we have static analysis that will ensure we don't keep fetching after the block is destroyed (causing UB).


    l0rinc commented at 8:27 PM on January 11, 2026:

    That's definitely better. Added some comments in #31132#pullrequestreview-3648380425 and #31132 (review) that might make this even more lightweight.

  450. in src/coinsviewcacheasync.h:183 in b9ecb3c9ba outdated
     178 | +    {
     179 | +        Assume(m_inputs.empty());
     180 | +        // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
     181 | +        for (const auto& tx : block.vtx | std::views::drop(1)) [[likely]] {
     182 | +            for (const auto& input : tx->vin) [[likely]] m_inputs.emplace_back(input.prevout);
     183 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
    


    l0rinc commented at 9:21 AM on December 27, 2025:

    This is internal, so we don't necessarily need endian conversion here (would be skipped on most popular platforms anyway), but we could simplify a few of these regardless:

                m_txids.emplace_back(ReadLE64(tx->GetHash().begin()));
    

    andrewtoth commented at 7:39 PM on December 29, 2025:

    Is this simpler? GetUint64 calls ReadLE64 internally. This would require another import for ReadLE64 as well. We're just skipping the conversion to uint256.


    l0rinc commented at 8:51 PM on December 29, 2025:

    yes, it's simpler, we're skipping a call and a conversion and it's shorter - fewer moving parts. If you don't like it, resolve this comment.


    andrewtoth commented at 3:30 PM on January 3, 2026:

    You're right. Taken, thanks!

  451. andrewtoth force-pushed on Dec 28, 2025
  452. andrewtoth force-pushed on Dec 28, 2025
  453. DrahtBot added the label CI failed on Dec 28, 2025
  454. l0rinc commented at 6:01 PM on December 28, 2025: contributor

    While I was reviewing you pushed two new versions, so let me add my half-baked comments in the meantime. I'm also experimenting with adding the features in smaller steps in https://github.com/l0rinc/bitcoin/pull/79/commits - a separate resettable and reusable cache (applied to other temp places as well), introducing a single-threaded fetcher at first, changing it to newly created threads next, and optimizing it via a barrier-guarded thread pool in a follow-up. It's definitely not done yet, but I want to make sure we have progress; I'd appreciate it if you could take a look and see what we can use here from these ideas. I think we should be able to extract the cache-reuse commit to a dedicated PR.

  455. l0rinc commented at 7:49 AM on December 31, 2025: contributor

    Now that I have access to a Windows benchmarking server, managed to run a few rounds of reindex-chainstate with default 450 dbcache until tip.

    Edit: I previously posted some measurements for the PR, but it turns out Windows didn't actually check out the new commit (some chmod leftovers), so I was measuring just variance.

    Edit2: reran the PR separately, seems we maintain the speedup we were hoping for:

    <img width="1448" height="845" alt="image" src="https://github.com/user-attachments/assets/8f189bf8-1d67-4e9d-a163-aa5a3958e848" />

    <details> <summary>Details</summary>

    for DBCACHE in 450; do \
      COMMITS="7f295e1d9b44c225c823242c1f04239f46fb27a6 0827d5d363d68f38feff89124347e9914de83cfa"; \
      STOP=927719; \
      BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
      (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
      (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | SSD"; echo "") && \
      hyperfine \
        --sort command \
        --runs 2 \
        --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
        --parameter-list COMMIT ${COMMITS// /,} \
        --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git clean -fxd; git reset --hard {COMMIT} && \
          cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
          ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
        --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
        "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    0827d5d363 validation: fetch inputs on parallel threads
    
    reindex-chainstate | 927719 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7f295e1d9b44c225c823242c1f04239f46fb27a6)
      Time (mean ± σ):     119997.373 s ± 2661.660 s    [User: 85751.035 s, System: 34713.420 s]
      Range (min … max):   118115.295 s … 121879.451 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 0827d5d363d68f38feff89124347e9914de83cfa)
      Time (mean ± σ):     48412.140 s ± 1510.148 s    [User: 86485.276 s, System: 21483.316 s]
      Range (min … max):   47344.305 s … 49479.976 s    2 runs
    

    </details>


    Edit: I tried a native executable built with clang and couldn't reproduce any speedup vs master that way. Couldn't yet make gcc work.

  456. maflcko removed the label CI failed on Jan 1, 2026
  457. maflcko commented at 11:37 AM on January 2, 2026: member

    Now that I have access to a Windows benchmarking server, managed to run a few rounds

    It looks like this is run inside WSL (Linux) and compiled for Linux. I wonder if this is representative of real end-user performance, since end users normally run native Windows (.exe) binaries?

  458. DrahtBot added the label Needs rebase on Jan 3, 2026
  459. andrewtoth force-pushed on Jan 3, 2026
  460. DrahtBot removed the label Needs rebase on Jan 3, 2026
  461. andrewtoth force-pushed on Jan 3, 2026
  462. andrewtoth force-pushed on Jan 3, 2026
  463. DrahtBot added the label CI failed on Jan 3, 2026
  464. andrewtoth force-pushed on Jan 3, 2026
  465. DrahtBot removed the label CI failed on Jan 3, 2026
  466. andrewtoth commented at 9:22 PM on January 3, 2026: contributor

    Measured the performance at tip. This branch is ~64% faster connecting newly seen blocks than master.

    |        | Node 1       | Node 2       | Average      |
    |--------|--------------|--------------|--------------|
    | master | 273.80ms/blk | 319.19ms/blk | 296.50ms/blk |
    | branch | 179.89ms/blk | 181.45ms/blk | 180.67ms/blk |

    I ran 5 t3.small AWS instances with 20 GB gp2 EBS volumes attached, all pruned to 550, with the exact same blocks, chainstate, and mempool.dat uploaded to them. Two nodes ran master and two ran this branch. These 4 nodes connected only to the 5th node, which itself connected to a trusted node outside the VPC. Using debug=bench, we can see the cumulative block connection speed in the debug logs. These are linked in the table above.

    Edit: There was a network outage with the gateway node for 12 hours, and on connection all nodes caught up. This skews results. Restarted the nodes and will get more data.

  467. andrewtoth force-pushed on Jan 5, 2026
  468. andrewtoth force-pushed on Jan 5, 2026
  469. DrahtBot added the label CI failed on Jan 5, 2026
  470. DrahtBot removed the label CI failed on Jan 5, 2026
  471. andrewtoth force-pushed on Jan 9, 2026
  472. andrewtoth force-pushed on Jan 11, 2026
  473. in src/bench/coinsviewcacheasync.cpp:41 in 396f784f8f
      36 | +    }
      37 | +    chainstate.ForceFlushStateToDisk();
      38 | +    CoinsViewCacheAsync async_cache{&coins_tip};
      39 | +
      40 | +    bench.run([&] {
      41 | +        const auto fetch_control{async_cache.StartFetching(block)};
    


    l0rinc commented at 7:41 PM on January 11, 2026:

    I have mixed feelings about these new RAII unused variables. We already have structures like these where e.g. the locks are only applied in a given scope - if we think this automatic cleanup is better, can we do something like that instead?

    <details> <summary>WITH_BLOCK_INPUTS_FETCHING prototype</summary>

    diff --git a/src/bench/coinsviewcacheasync.cpp b/src/bench/coinsviewcacheasync.cpp
    index aa6c9c4cd7..8b7dcc5505 100644
    --- a/src/bench/coinsviewcacheasync.cpp
    +++ b/src/bench/coinsviewcacheasync.cpp
    @@ -38,7 +38,7 @@ static void CoinsViewCacheAsyncBenchmark(benchmark::Bench& bench)
         CoinsViewCacheAsync async_cache{&coins_tip};
     
         bench.run([&] {
    -        const auto fetch_control{async_cache.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(async_cache, block);
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& in : tx->vin) {
                     const auto have{async_cache.HaveCoin(in.prevout)};
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    index 779bb05633..db471bff40 100644
    --- a/src/coinsviewcacheasync.h
    +++ b/src/coinsviewcacheasync.h
    @@ -12,6 +12,7 @@
     #include <primitives/transaction.h>
     #include <tinyformat.h>
     #include <util/check.h>
    +#include <util/macros.h>
     #include <util/threadnames.h>
     
     #include <algorithm>
    @@ -298,4 +299,8 @@ public:
         }
     };
     
    +//! Helper macro to start background fetching of all inputs in a block for the current scope.
    +#define WITH_BLOCK_INPUTS_FETCHING(view, block) \
    +    [[maybe_unused]] const auto UNIQUE_NAME(fetch_control_) = (view).StartFetching(block)
    +
     #endif // BITCOIN_COINSVIEWCACHEASYNC_H
    diff --git a/src/test/coinsviewcacheasync_tests.cpp b/src/test/coinsviewcacheasync_tests.cpp
    index 8e0020de9c..06b0dd3fc6 100644
    --- a/src/test/coinsviewcacheasync_tests.cpp
    +++ b/src/test/coinsviewcacheasync_tests.cpp
    @@ -109,7 +109,7 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_db)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             // Check that no coins have been moved up to main cache from db
             for (const auto& tx : block.vtx) {
    @@ -129,7 +129,7 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_cache)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             view.Reset();
         }
    @@ -147,7 +147,7 @@ BOOST_AUTO_TEST_CASE(fetch_no_double_spend)
         PopulateView(block, main_cache, /*spent=*/true);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     const auto& c{view.AccessCoin(in.prevout)};
    @@ -167,7 +167,7 @@ BOOST_AUTO_TEST_CASE(fetch_no_inputs)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     const auto& c{view.AccessCoin(in.prevout)};
    @@ -191,7 +191,7 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
         main_cache.EmplaceCoinInternalDANGER(COutPoint{Txid::FromUint256(uint256::ZERO), 0}, std::move(coin));
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             const auto& accessed_coin{view.AccessCoin(outpoint)};
             BOOST_CHECK(!accessed_coin.IsSpent());
             view.Reset();
    @@ -207,7 +207,7 @@ BOOST_AUTO_TEST_CASE(fetch_main_thread)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache, /*deterministic=*/false, /*num_workers=*/0};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             view.Reset();
         }
    diff --git a/src/test/fuzz/coins_view.cpp b/src/test/fuzz/coins_view.cpp
    index 3f03ec90f0..aaffe98b88 100644
    --- a/src/test/fuzz/coins_view.cpp
    +++ b/src/test/fuzz/coins_view.cpp
    @@ -384,7 +384,7 @@ FUZZ_TARGET(coins_view_async, .init = initialize_coins_view)
         CCoinsView backend_coins_view;
         g_async_cache->SetBackend(backend_coins_view);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    +    WITH_BLOCK_INPUTS_FETCHING(*g_async_cache, block);
         TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
         g_async_cache->Reset();
     }
    @@ -402,7 +402,7 @@ FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
         g_async_cache->SetBackend(backend_coins_view);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    +    WITH_BLOCK_INPUTS_FETCHING(*g_async_cache, block);
         TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         g_async_cache->Reset();
    diff --git a/src/test/fuzz/coinscache_sim.cpp b/src/test/fuzz/coinscache_sim.cpp
    index c635525a24..aa304aab80 100644
    --- a/src/test/fuzz/coinscache_sim.cpp
    +++ b/src/test/fuzz/coinscache_sim.cpp
    @@ -408,7 +408,7 @@ FUZZ_TARGET(coinscache_sim, .init = setup_coinscache_sim)
                             for (auto& async_cache : g_async_caches) {
                                 if (async_cache.use_count() > 1) continue;
                                 async_cache->SetBackend(*top_cache());
    -                            const auto fetch_control{async_cache->StartFetching(data.block)};
    +                            WITH_BLOCK_INPUTS_FETCHING(*async_cache, data.block);
                                 caches.emplace_back(async_cache);
                                 break;
                             }
    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..7b304b6f5b 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -3100,7 +3100,7 @@ bool Chainstate::ConnectTip(
                  Ticks<MillisecondsDouble>(time_2 - time_1));
         {
             auto& view{*m_coins_views->m_connect_block_view};
    -        const auto fetch_control{view.StartFetching(*block_to_connect)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, *block_to_connect);
             bool rv = ConnectBlock(*block_to_connect, state, pindexNew, view);
             if (m_chainman.m_options.signals) {
                 m_chainman.m_options.signals->BlockChecked(block_to_connect, state);
    

    </details>


    Alternatively (I like this one a lot more), what if we wrapped the existing view itself (making FetchControl a proxy for the view, accessed neatly through the -> and * operators) to make it obvious why we need it in the scope but not outside it? This would indicate that there's state we don't want to touch, but that there's a start/stop layer that is still needed. It would also enable calling Reset automatically (which would already call StopFetching).

    <details> <summary>FetchControl proxy prototype</summary>

    diff --git a/src/bench/coinsviewcacheasync.cpp b/src/bench/coinsviewcacheasync.cpp
    index aa6c9c4cd7..9b66a0e953 100644
    --- a/src/bench/coinsviewcacheasync.cpp
    +++ b/src/bench/coinsviewcacheasync.cpp
    @@ -38,14 +38,13 @@ static void CoinsViewCacheAsyncBenchmark(benchmark::Bench& bench)
         CoinsViewCacheAsync async_cache{&coins_tip};
     
         bench.run([&] {
    -        const auto fetch_control{async_cache.StartFetching(block)};
    +        auto view{async_cache.StartFetching(block)};
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& in : tx->vin) {
    -                const auto have{async_cache.HaveCoin(in.prevout)};
    +                const auto have{view->HaveCoin(in.prevout)};
                     assert(have);
                 }
             }
    -        async_cache.Reset();
         });
     }
     
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    index 779bb05633..dc3f51255b 100644
    --- a/src/coinsviewcacheasync.h
    +++ b/src/coinsviewcacheasync.h
    @@ -96,9 +96,11 @@ class CoinsViewCacheAsync : public CCoinsViewCache
     {
     public:
         /**
    -     * RAII-style controller that guarantees fetching is stopped when it goes out of scope.
    +     * RAII-style controller that guarantees fetching is stopped and the view is reset when it goes out of scope.
          * Returned by StartFetching() and bound to the lifetime of the block.
          * Non-copyable and non-movable to prevent scope escape.
    +     *
    +     * Provides access to the view through operator-> and operator*.
          */
         class FetchControl
         {
    @@ -113,10 +115,13 @@ public:
             FetchControl(FetchControl&&) = delete;
             FetchControl& operator=(FetchControl&&) = delete;
     
    -        ~FetchControl()
    -        {
    -            m_cache.StopFetching();
    -        }
    +        CoinsViewCacheAsync& operator*() noexcept { return m_cache; }
    +        const CoinsViewCacheAsync& operator*() const noexcept { return m_cache; }
    +
    +        CoinsViewCacheAsync* operator->() noexcept { return &m_cache; }
    +        const CoinsViewCacheAsync* operator->() const noexcept { return &m_cache; }
    +
    +        ~FetchControl() { m_cache.Reset(); }
         };
     
     private:
    @@ -228,7 +233,7 @@ private:
         }
     
     public:
    -    //! Start fetching all block inputs and return RAII guard that stops fetching on destruction.
    +    //! Start fetching all block inputs and return RAII guard that resets the view on destruction.
         [[nodiscard]] FetchControl StartFetching(const CBlock& block LIFETIMEBOUND) noexcept
         {
             StopFetching();
    diff --git a/src/test/coinsviewcacheasync_tests.cpp b/src/test/coinsviewcacheasync_tests.cpp
    index 8e0020de9c..39232112b7 100644
    --- a/src/test/coinsviewcacheasync_tests.cpp
    +++ b/src/test/coinsviewcacheasync_tests.cpp
    @@ -109,15 +109,14 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_db)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
             // Check that no coins have been moved up to main cache from db
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     BOOST_CHECK(!main_cache.HaveCoinInCache(in.prevout));
                 }
             }
    -        view.Reset();
         }
     }
     
    @@ -129,9 +128,8 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_cache)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    -        view.Reset();
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
         }
     }
     
    @@ -147,16 +145,15 @@ BOOST_AUTO_TEST_CASE(fetch_no_double_spend)
         PopulateView(block, main_cache, /*spent=*/true);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        auto async_view{view.StartFetching(block)};
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
    -                const auto& c{view.AccessCoin(in.prevout)};
    +                const auto& c{async_view->AccessCoin(in.prevout)};
                     BOOST_CHECK(c.IsSpent());
                 }
             }
             // Coins are not added to the view, even though they exist unspent in the parent db
    -        BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    -        view.Reset();
    +        BOOST_CHECK_EQUAL(async_view->GetCacheSize(), 0);
         }
     }
     
    @@ -167,15 +164,14 @@ BOOST_AUTO_TEST_CASE(fetch_no_inputs)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        auto async_view{view.StartFetching(block)};
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
    -                const auto& c{view.AccessCoin(in.prevout)};
    +                const auto& c{async_view->AccessCoin(in.prevout)};
                     BOOST_CHECK(c.IsSpent());
                 }
             }
    -        BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    -        view.Reset();
    +        BOOST_CHECK_EQUAL(async_view->GetCacheSize(), 0);
         }
     }
     
    @@ -191,10 +187,9 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
         main_cache.EmplaceCoinInternalDANGER(COutPoint{Txid::FromUint256(uint256::ZERO), 0}, std::move(coin));
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        const auto& accessed_coin{view.AccessCoin(outpoint)};
    +        auto async_view{view.StartFetching(block)};
    +        const auto& accessed_coin{async_view->AccessCoin(outpoint)};
             BOOST_CHECK(!accessed_coin.IsSpent());
    -        view.Reset();
         }
     }
     
    @@ -207,9 +202,8 @@ BOOST_AUTO_TEST_CASE(fetch_main_thread)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache, /*deterministic=*/false, /*num_workers=*/0};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    -        view.Reset();
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
         }
     }
     
    diff --git a/src/test/fuzz/coins_view.cpp b/src/test/fuzz/coins_view.cpp
    index 3f03ec90f0..f3df7b19a9 100644
    --- a/src/test/fuzz/coins_view.cpp
    +++ b/src/test/fuzz/coins_view.cpp
    @@ -384,9 +384,8 @@ FUZZ_TARGET(coins_view_async, .init = initialize_coins_view)
         CCoinsView backend_coins_view;
         g_async_cache->SetBackend(backend_coins_view);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    -    TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
    -    g_async_cache->Reset();
    +    auto async_view{g_async_cache->StartFetching(block)};
    +    TestCoinsView(fuzzed_data_provider, *async_view, backend_coins_view, /*is_db=*/false);
     }
     
     FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
    @@ -402,8 +401,7 @@ FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
         g_async_cache->SetBackend(backend_coins_view);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    -    TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
    +    auto async_view{g_async_cache->StartFetching(block)};
    +    TestCoinsView(fuzzed_data_provider, *async_view, backend_coins_view, /*is_db=*/false);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
    -    g_async_cache->Reset();
     }
    diff --git a/src/test/fuzz/coinscache_sim.cpp b/src/test/fuzz/coinscache_sim.cpp
    index c635525a24..4090697b8e 100644
    --- a/src/test/fuzz/coinscache_sim.cpp
    +++ b/src/test/fuzz/coinscache_sim.cpp
    @@ -407,8 +407,8 @@ FUZZ_TARGET(coinscache_sim, .init = setup_coinscache_sim)
                             // Find an unused async cache from the pool
                             for (auto& async_cache : g_async_caches) {
                                 if (async_cache.use_count() > 1) continue;
    +                            async_cache->Reset();
                                 async_cache->SetBackend(*top_cache());
    -                            const auto fetch_control{async_cache->StartFetching(data.block)};
                                 caches.emplace_back(async_cache);
                                 break;
                             }
    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..57108d05ed 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -3099,9 +3099,8 @@ bool Chainstate::ConnectTip(
         LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
                  Ticks<MillisecondsDouble>(time_2 - time_1));
         {
    -        auto& view{*m_coins_views->m_connect_block_view};
    -        const auto fetch_control{view.StartFetching(*block_to_connect)};
    -        bool rv = ConnectBlock(*block_to_connect, state, pindexNew, view);
    +        auto view{m_coins_views->m_connect_block_view->StartFetching(*block_to_connect)};
    +        bool rv = ConnectBlock(*block_to_connect, state, pindexNew, *view);
             if (m_chainman.m_options.signals) {
                 m_chainman.m_options.signals->BlockChecked(block_to_connect, state);
             }
    @@ -3109,7 +3108,6 @@ bool Chainstate::ConnectTip(
                 if (state.IsInvalid())
                     InvalidBlockFound(pindexNew, state);
                 LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    -            view.Reset();
                 return false;
             }
             time_3 = SteadyClock::now();
    @@ -3119,8 +3117,7 @@ bool Chainstate::ConnectTip(
                      Ticks<MillisecondsDouble>(time_3 - time_2),
                      Ticks<SecondsDouble>(m_chainman.time_connect_total),
                      Ticks<MillisecondsDouble>(m_chainman.time_connect_total) / m_chainman.num_blocks_total);
    -        view.Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
    -        view.Reset();
    +        view->Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
         }
         const auto time_4{SteadyClock::now()};
         m_chainman.time_flush += time_4 - time_3;
    

    </details>


    andrewtoth commented at 8:45 PM on January 11, 2026:

    What is the benefit of the macro? I don't see a problem with using it as suggested, but not sure what it is giving us.

    Re proxy - we don't want to call Reset every time we go out of scope. Reset is only needed when we want to return the cache to its initial state. Consider VerifyDB, where we could use it to fetch 6 blocks in a row but we don't want to clear its state each time.

    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..bcd933fa78 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -4697,7 +4697,7 @@ VerifyDBResult CVerifyDB::VerifyDB(
         }
         nCheckLevel = std::max(0, std::min(4, nCheckLevel));
         LogInfo("Verifying last %i blocks at level %i", nCheckDepth, nCheckLevel);
    -    CCoinsViewCache coins(&coinsview);
    +    CoinsViewCacheAsync coins(&coinsview);
         CBlockIndex* pindex;
         CBlockIndex* pindexFailure = nullptr;
         int nGoodTransactions = 0;
    @@ -4799,6 +4799,7 @@ VerifyDBResult CVerifyDB::VerifyDB(
                     LogError("Verification error: ReadBlock failed at %d, hash=%s", pindex->nHeight, pindex->GetBlockHash().ToString());
                     return VerifyDBResult::CORRUPTED_BLOCK_DB;
                 }
    +            const auto fetch_control{coins.StartFetching(block)};
                 if (!chainstate.ConnectBlock(block, state, pindex, coins)) {
                     LogError("Verification error: found unconnectable block at %d, hash=%s (%s)", pindex->nHeight, pindex->GetBlockHash().ToString(), state.ToString());
                     return VerifyDBResult::CORRUPTED_BLOCK_DB;
    

    What this gives us is a guarantee that we will stop fetching before exceeding the lifetime of the block.
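
    A minimal sketch of that lifetime guarantee (hypothetical Block/FetchGuard types, not the PR's classes): C++ destroys automatic objects in reverse order of construction, so a guard declared after the block is always destroyed first, stopping the workers while the block is still alive.

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // Events recorded at destruction time, to observe the ordering.
    static std::vector<std::string> g_events;

    struct Block {
        ~Block() { g_events.push_back("block destroyed"); }
    };

    struct FetchGuard {
        ~FetchGuard() { g_events.push_back("fetching stopped"); }
    };

    // The guard is declared after the block, so on scope exit it is
    // destroyed first: fetching stops before the block goes away.
    void ConnectScope()
    {
        Block block;
        FetchGuard guard;
    }
    ```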


    l0rinc commented at 9:07 PM on January 11, 2026:

    Wouldn't we declare and start fetching at a higher level so that the 6 blocks are all in the same scope?


    andrewtoth commented at 9:14 PM on January 11, 2026:

    We start fetching as soon as we get the block, and the block is destroyed when we exit the scope. I'm not sure what you mean.


    andrewtoth commented at 3:35 PM on January 12, 2026:

    Looking closer at VerifyDB, I don't think this would be useful there. All blocks are disconnected in the same cache without flushing, so all utxos will already be in the cache and no lookups will occur. Maybe it makes sense to just Reset, and then we can get rid of those Reset calls everywhere... Will look into this, thanks!


    andrewtoth commented at 7:22 PM on January 14, 2026:

    Updated to use a controller that returns a handle that dereferences to the cache. When the handle is destroyed it resets the cache.
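
    Roughly, the handle works like this (a hedged sketch with a placeholder Cache type; the real FetchControl wraps CoinsViewCacheAsync): it forwards operator*/operator-> to the cache and calls Reset() when destroyed.

    ```cpp
    #include <cassert>

    // Hypothetical stand-in for the async cache; only Reset() matters here.
    struct Cache {
        int size{0};
        void Reset() { size = 0; }
    };

    // Handle returned by a StartFetching-style call: dereferences to the
    // cache, and resets it when the handle goes out of scope.
    class Handle {
        Cache& m_cache;
    public:
        explicit Handle(Cache& cache) : m_cache{cache} {}
        Handle(const Handle&) = delete;
        Handle& operator=(const Handle&) = delete;
        Cache& operator*() noexcept { return m_cache; }
        Cache* operator->() noexcept { return &m_cache; }
        ~Handle() { m_cache.Reset(); }
    };
    ```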

  474. in src/coinsviewcacheasync.h:234 in 396f784f8f
     229 | +
     230 | +public:
     231 | +    //! Start fetching all block inputs and return RAII guard that stops fetching on destruction.
     232 | +    [[nodiscard]] FetchControl StartFetching(const CBlock& block LIFETIMEBOUND) noexcept
     233 | +    {
     234 | +        StopFetching();
    


    l0rinc commented at 7:57 PM on January 11, 2026:

    Now that this is automatic, we could rather assert that it's stopped, since that would be a programming error, right?


    andrewtoth commented at 8:33 PM on January 11, 2026:

    It's not automatic. Consider:

    const auto fetch_control{view.StartFetching(block)};
    const auto fetch_control2{view.StartFetching(block)};
    

    l0rinc commented at 8:35 PM on January 11, 2026:

    yes, shouldn't that fail with an assertion error instead of the second spitting in the soup of the first?


    andrewtoth commented at 8:48 PM on January 11, 2026:

    I don't think so, it's a valid use of the API. The first fetching will be stopped and the second started. It's just inefficient, but perfectly safe.


    l0rinc commented at 9:08 PM on January 11, 2026:

    Can you come up with a valid use case? It seems like an error to me...


    andrewtoth commented at 7:23 PM on January 14, 2026:

    Changed back to an assertion.
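
    The shape of that check, sketched with an assumed minimal class (not the PR's actual code): starting a second fetch while one is active is treated as a programming error rather than silently stopping the first.

    ```cpp
    #include <cassert>

    class AsyncCache {
        bool m_fetching{false};
    public:
        void StartFetching()
        {
            assert(!m_fetching); // caller must stop/reset between fetches
            m_fetching = true;
        }
        void Reset() { m_fetching = false; }
        bool Fetching() const { return m_fetching; }
    };
    ```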

  475. in src/coinsviewcacheasync.h:86 in 396f784f8f
      81 | + *
      82 | + *       After workers start:
      83 | + *
      84 | + *                                        Worker 2            Worker 0  Worker 3  Worker 1  m_input_head
      85 | + *                                           │                   │         │         │         │
      86 | + *                                           ▼                   ▼         ▼         ▼         ▼
    


    l0rinc commented at 7:58 PM on January 11, 2026:

    Lol, cool ascii art - though it's off by one space :p

     *                                       Worker 2            Worker 0  Worker 3  Worker 1  m_input_head
     *                                          │                   │         │         │         │
     *                                          ▼                   ▼         ▼         ▼         ▼
    
  476. in src/test/fuzz/coinscache_sim.cpp:411 in 396f784f8f
     408 | +                    if (provider.ConsumeBool()) {
     409 | +                        // Find an unused async cache from the pool
     410 | +                        for (auto& async_cache : g_async_caches) {
     411 | +                            if (async_cache.use_count() > 1) continue;
     412 | +                            async_cache->SetBackend(*top_cache());
     413 | +                            const auto fetch_control{async_cache->StartFetching(data.block)};
    


    l0rinc commented at 8:19 PM on January 11, 2026:

    What is the purpose here of starting a fetch, adding it to a vector and resetting it immediately?


    andrewtoth commented at 8:36 PM on January 11, 2026:

    Not much, but it exercises the StartFetching/StopFetching paths. There's not much more we can do here since the fetching is now bound to the scope. Fuzzing the methods while we are still fetching happens in coins_view.cpp fuzz harness.


    andrewtoth commented at 7:23 PM on January 14, 2026:

    The latest version does not return a fetch control object here, so we can continue fetching in the background while exercising different methods.

  477. l0rinc changes_requested
  478. l0rinc commented at 8:26 PM on January 11, 2026: contributor

    I like the new cleanup changes and the ASCII art, I only had time and patience to quickly go over it, hope the comments are useful.

  479. andrewtoth force-pushed on Jan 14, 2026
  480. l0rinc commented at 11:05 AM on January 19, 2026: contributor

    It looks like this is run inside WSL (Linux) and compiled for Linux

    Took me longer than anticipated, but we finally have our first native GCC Windows measurement - after a few failed previous attempts using clang or older gcc versions.

    <img width="2148" height="1400" alt="image" src="https://github.com/user-attachments/assets/f699bc10-fd34-486c-80a6-582007f33288" />

    Results: 29% faster with dbcache=450, 14.5% faster with dbcache=4500 (932239 blocks, native .exe, MinGW GCC 15.2.0)

    <details> <summary>2026-01-17 | reindex-chainstate | 932239 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15</summary>

    for DBCACHE in 450 4500; do \
      COMMITS="ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb b3cb5bb90a41af4199dde17946e5aa9b3cd72db6"; \
      STOP=932239; \
      HOST=x86_64-w64-mingw32; \
      XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
      BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
      WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
      export PATH="$XPACK/bin:$PATH"; \
      mkdir -p "$LOG_DIR"; \
      (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
      (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
      hyperfine \
        --sort command \
        --runs 1 \
        --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
        --parameter-list COMMIT ${COMMITS// /,} \
        --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
          make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
          cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
          ninja -C build bitcoind -j\$(nproc) && \
          ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
        --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
          cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
        "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
    done
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-17 | reindex-chainstate | 932239 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        37260.585 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        28819.823 s               [User: 0.002 s, System: 0.000 s]
    
    Relative speed comparison
            1.29          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-18 | reindex-chainstate | 932239 blocks | dbcache 4500 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        29746.920 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        25974.137 s               [User: 0.002 s, System: 0.000 s]
    
    Relative speed comparison
            1.15          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    

    </details>

    <details> <summary>Earlier attempts</summary>

    Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 10
    Minutes           : 33
    Seconds           : 40
    Milliseconds      : 162
    Ticks             : 380201620408
    TotalDays         : 0.440048171768519
    TotalHours        : 10.5611561224444
    TotalMinutes      : 633.669367346667
    TotalSeconds      : 38020.1620408
    TotalMilliseconds : 38020162.0408
    

    and

     Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 11
    Minutes           : 11
    Seconds           : 40
    Milliseconds      : 78
    Ticks             : 403000786529
    TotalDays         : 0.466436095519676
    TotalHours        : 11.1944662924722
    TotalMinutes      : 671.667977548333
    TotalSeconds      : 40300.0786529
    TotalMilliseconds : 40300078.6529
    

    and

    win@WIN-A2EHOAU4JET:/mnt/my_storage/bitcoin$ git log -1
    commit 7f295e1d9b44c225c823242c1f04239f46fb27a6 (HEAD, l0rinc/master, master)
    Merge: 5e7931af35 fa4cb13b52
    Author: merge-script <fanquake@gmail.com>
    Date:   Fri Dec 19 16:56:02 2025 +0000
    
    Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 9
    Minutes           : 48
    Seconds           : 15
    Milliseconds      : 866
    Ticks             : 352958669523
    TotalDays         : 0.408516978614583
    TotalHours        : 9.80440748675
    TotalMinutes      : 588.264449205
    TotalSeconds      : 35295.8669523
    TotalMilliseconds : 35295866.9523
    

    and this is v30 with official release:

    Measure-Command { C:\my_storage\bitcoin_bins\bitcoin-30.0\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 9
    Minutes           : 54
    Seconds           : 34
    Milliseconds      : 482
    Ticks             : 356744820316
    TotalDays         : 0.412899097587963
    TotalHours        : 9.90957834211111
    TotalMinutes      : 594.574700526667
    TotalSeconds      : 35674.4820316
    TotalMilliseconds : 35674482.0316
    

    </details>


    re-checked pruned IBD - this still seems to be bandwidth bound, so the difference is more modest:

    <details> <summary>18% faster - 2026-01-18 | pruned IBD | 932239 blocks | dbcache 450 | pruning 550 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    COMMITS="22bde74d1d8f861323eabb8dc60401bbf1226544 13d32ed39cf869eb64faf8f489c53f38806a6c29"; \
    STOP=932239; DBCACHE=450; PRUNE=550; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/ShallowBitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "$(date -I) | pruned IBD | ${STOP} blocks | dbcache ${DBCACHE} | pruning ${PRUNE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") && \
    hyperfine \
    --sort command \
    --runs 2 \
    --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall -9 bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git clean -fxd; git reset --hard {COMMIT} && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind -j2 && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -prune=$PRUNE -printtoconsole=0; sleep 20" \
    --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -prune=$PRUNE -printtoconsole=0"
     22bde74d1d Merge bitcoin-core/gui#924: Show an error message if the restored wallet name is empty
     13d32ed39c validation: fetch inputs on parallel threads

     2026-01-18 | pruned IBD | 932239 blocks | dbcache 450 | pruning 550 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD

     Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/ShallowBitcoinData -stopatheight=932239 -dbcache=450 -blocksonly -prune=550 -printtoconsole=0 (COMMIT = 22bde74d1d8f861323eabb8dc60401bbf1226544)
       Time (mean ± σ):     33025.250 s ±  368.595 s    [User: 73666.506 s, System: 4901.220 s]
       Range (min … max):   32764.613 s … 33285.886 s    2 runs

     Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/ShallowBitcoinData -stopatheight=932239 -dbcache=450 -blocksonly -prune=550 -printtoconsole=0 (COMMIT = 13d32ed39cf869eb64faf8f489c53f38806a6c29)
       Time (mean ± σ):     27953.899 s ±  205.665 s    [User: 72179.265 s, System: 4704.044 s]
       Range (min … max):   27808.472 s … 28099.327 s    2 runs

    </details>

    <details> <summary>same measurements with `reindex-chainstate` for dbcache of 3 and 12 GB</summary>

     for DBCACHE in 3000 12000; do \
       COMMITS="ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb b3cb5bb90a41af4199dde17946e5aa9b3cd72db6"; \
       STOP=932239; \
       HOST=x86_64-w64-mingw32; \
       XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
       BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
       WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
       export PATH="$XPACK/bin:$PATH"; \
       mkdir -p "$LOG_DIR"; \
       (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
       (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
       hyperfine \
         --sort command \
         --runs 1 \
         --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
         --parameter-list COMMIT ${COMMITS// /,} \
         --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
           make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
           cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
           ninja -C build bitcoind -j\$(nproc) && \
           ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
         --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
           cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
         "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
     done
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-19 | reindex-chainstate | 932239 blocks | dbcache 3000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        30456.010 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        26194.317 s               [User: 0.006 s, System: 0.000 s]
    
    Relative speed comparison
            1.16          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-19 | reindex-chainstate | 932239 blocks | dbcache 12000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        29192.227 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        26041.974 s               [User: 0.003 s, System: 0.000 s]
    
    Relative speed comparison
            1.12          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    

    </details>


    <details> <summary>WORKER_THREADS{8} is slower</summary>

     for DBCACHE in 3000 12000; do   COMMITS="363e525d8da3c6c495191cb92d8eaf5dbeaeddf5";   STOP=932239;   HOST=x86_64-w64-mingw32;   XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2";   BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs";   WIN_DATA_DIR='C:\\my_storage\\BitcoinData';   export PATH="$XPACK/bin:$PATH";   mkdir -p "$LOG_DIR";   (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) &&   (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") &&   hyperfine     --sort command     --runs 1     --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json"     --parameter-list COMMIT ${COMMITS// /,}     --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
          make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
          cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
          ninja -C build bitcoind -j\$(nproc) && \
          ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log"     --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
          cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log"     "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; done
    
    363e525d8d WORKER_THREADS{8}
    
    2026-01-21 | reindex-chainstate | 932239 blocks | dbcache 3000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 363e525d8da3c6c495191cb92d8eaf5dbeaeddf5)
      Time (abs ≡):        26386.395 s               [User: 0.000 s, System: 0.001 s]
    
    363e525d8d WORKER_THREADS{8}
    
    2026-01-21 | reindex-chainstate | 932239 blocks | dbcache 12000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 363e525d8da3c6c495191cb92d8eaf5dbeaeddf5)
      Time (abs ≡):        26401.538 s               [User: 0.002 s, System: 0.000 s]
    

    </details>

  481. fanquake commented at 4:51 PM on January 22, 2026: member

    Note that this needs a rebase:

    /root/ci_scratch/src/bench/coinsviewcacheasync.cpp:51:71: error: macro ‘BENCHMARK’ passed 2 arguments, but takes just 1
       51 | BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
          |                                                                       ^
    In file included from /root/ci_scratch/src/bench/coinsviewcacheasync.cpp:5:
    /root/ci_scratch/src/bench/bench.h:68:9: note: macro ‘BENCHMARK’ defined here
       68 | #define BENCHMARK(n) \
          |         ^~~~~~~~~
    
  482. DrahtBot added the label CI failed on Jan 22, 2026
  483. DrahtBot commented at 5:26 PM on January 22, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/21006803006/job/61174454902</sub> <sub>LLM reason (✨ experimental): Compilation failed due to BENCHMARK macro usage: it is invoked with two arguments, but the macro defined takes only one, causing a build error.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  484. andrewtoth force-pushed on Jan 22, 2026
  485. andrewtoth force-pushed on Jan 22, 2026
  486. andrewtoth force-pushed on Jan 22, 2026
  487. DrahtBot removed the label CI failed on Jan 23, 2026
  488. willcl-ark commented at 2:10 PM on January 27, 2026: member

    Benchcoin Full IBD Results (to block 930,000)

    Benchmark run: https://github.com/bitcoin-dev-tools/benchcoin/pull/178

    | dbcache | master (2778eb4) | PR (fc72fca) | Δ |
    | --- | --- | --- | --- |
    | 450 MB | 323 min | 260 min | -19.5% |
    | 32000 MB | 266 min | 246 min | -7.5% |

    Configuration:

    • -prune=200GB
    • AMD Ryzen 7 7700 8-Core, 64GB RAM, NVMe SSD
    • 1Gbit network to dedicated seed node

    PR commits: 95ee2d60c217aa2ccf37ed1e5951ea91fdf403d9^..fc72fca292d995de07d98f12dfc4164478826b1f

    Seems like we do indeed get a nice speedup, especially with the default dbcache :)

  489. achow101 referenced this in commit 6750744eb3 on Jan 29, 2026
  490. DrahtBot added the label Needs rebase on Jan 30, 2026
  491. andrewtoth force-pushed on Jan 30, 2026
  492. andrewtoth force-pushed on Jan 30, 2026
  493. DrahtBot added the label CI failed on Jan 30, 2026
  494. DrahtBot commented at 1:54 AM on January 30, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task MSan: https://github.com/bitcoin/bitcoin/actions/runs/21501458749/job/61948501131</sub> <sub>LLM reason (✨ experimental): Compilation failed in src/bench/coinsviewcacheasync.cpp due to undeclared identifier CoinsViewCacheAsyncController, causing the build to abort.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  495. andrewtoth force-pushed on Jan 30, 2026
  496. DrahtBot removed the label Needs rebase on Jan 30, 2026
  497. DrahtBot removed the label CI failed on Jan 30, 2026
  498. in src/coinsviewcacheasync.h:27 in 77c0df7b59
      22 | +#include <ranges>
      23 | +#include <thread>
      24 | +#include <utility>
      25 | +#include <vector>
      26 | +
      27 | +static constexpr int32_t WORKER_THREADS{4};
    


    HowHsu commented at 8:34 AM on February 10, 2026:

    Hi @andrewtoth

    Have you tried a WORKER_THREADS value bigger than this? I have this question because the workloads are not CPU intensive but IO intensive.


    sedited commented at 8:38 AM on February 10, 2026:

    This was extensively benchmarked, see l0rinc's comments #31132 (comment) and #31132 (comment) and the discussions further up in this PR.


    andrewtoth commented at 3:14 PM on February 10, 2026:

    Also see #31132 (comment), where higher thread count indeed correlates with a big speed increase. A system with high IO latency coupled with high IO bandwidth will see the most benefit from this PR in general, and the most benefit from increasing the thread count.

    We decided to keep it simple for now and use a static 4 threads. More threads will translate to higher memory usage of course. I think we can investigate making this configurable in a follow-up if there is interest.

  499. sedited commented at 2:16 PM on February 10, 2026: contributor

    When reindexing on my current system this PR consistently does not perform faster, and I'm getting the impression that it might actually be slower. This is on a system with 32 virtual cores, heaps of RAM, and a fast NVMe drive. I'd also be curious in general what the performance looks like with maxed out dbcache.

  500. l0rinc commented at 3:12 PM on February 10, 2026: contributor

    I'd also be curious in general what the performance looks like with maxed out dbcache.

    With a big enough dbcache we won't really have any disk activity since all the inputs are still in memory, see: #31132 (comment) The parallelization still results in some modest speedups in most cases because of the parallel temporary cache filling and SipHash calculation and map interactions, but that's not the main goal of the change. There are also some fixed costs that we wouldn't need to do if we knew that everything is in memory so some minor regression is expected for max-memory dbcache. Also note that adding more and more memory isn't necessarily faster after a while, e.g. 30 GB dbcache isn't usually faster than 5 GB - sometimes it's even slower, most likely because a larger hashmap spreads entries across more memory, causing more cache misses on the random UTXO lookups.


    Can you please share your measurements so that we can try to reproduce them? This is mainly meant for default or low dbcache since the in-memory cache size matters a lot less after the cache warming.

  501. fanquake commented at 3:18 PM on February 10, 2026: member

    I also haven't seen any speedup running a "real world" sync with this branch, i.e. Guix build the branch, and then run from scratch on a reasonable (16 core, 32GB) machine. IBD time seems the same as master.

  502. andrewtoth commented at 3:23 PM on February 10, 2026: contributor

    When reindexing on my current system this PR consistently does not perform faster, and I'm getting the impression that it might actually be slower.

    IBD time seems the same as master.

    +1 on sharing the commands you are running and the times you are getting.

    I did IBD as well and saw consistently better performance on a machine with locally connected NVMe drive, 16 vcores 32GB RAM #31132 (comment).

    Cores and RAM should not really be a factor with this change. The main bottleneck is higher IO latency. So, for a directly connected NVMe drive you should not see as big of an increase compared to network connected storage. I think most users of this software run it in a cloud environment.

  503. l0rinc commented at 3:28 PM on February 10, 2026: contributor

    IBD time seems the same as master.

    I also noticed that doing actual IBD compared to just a -reindex-chainstate often shows a less dramatic speedup since validation wasn't the main bottleneck in the first place (likely bandwidth was). With the average (100Mbps) global internet speed just downloading the blockchain would take 16 hours.

  504. sedited commented at 4:44 PM on February 10, 2026: contributor

    I re-ran three interleaved runs of ./bitcoind -signet -stopatheight=290000 -reindex-chainstate. This PR averages 8:40, master 7:50.

    Edit: Re-running on mainnet too, but will obviously take a while:

    On an AMD Ryzen 9 9950X3D 16-Core Processor, NVMe drive, and heaps of RAM

    Baseline:

    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate: 3:11:48.66 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=10000: 2:30:11.60 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=30000: 2:10:11.60 total

    This PR:

    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate: 2:09:11.83 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=10000: 2:05:03.33 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=30000: 2:03:49.21 total

    Also checked again what happens when more workers (16) are added and the gains are indeed marginal: ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate 2:07:13.48 total

    So I guess the slowdown I perceived before is just higher dbcache mattering less over time.

  505. l0rinc commented at 4:54 PM on February 10, 2026: contributor

    Thanks, let me retry the latest push (I never tested signet though)

  506. andrewtoth commented at 4:57 PM on February 10, 2026: contributor

    @sedited thanks! I don't think this PR will perform better than master on signet. Blocks on signet seem to have <100 txs with mostly single inputs. The overhead of collecting the inputs and then releasing threads to start fetching them is likely not recouped when fetching so few inputs.

    Also, the size of the chainstate leveldb is very small compared to mainnet. Fetching inputs in series really starts to degrade around block 800,000 when the utxo set is much larger on mainnet. Roughly 90% of the sync time on network connected storage was for 800k to tip.

  507. l0rinc commented at 2:24 PM on February 11, 2026: contributor

    Retried validation of the latest version on Windows; the 30% speedup with default dbcache still reproduces

    <details> <summary>2026-02-10 | reindex-chainstate | 933339 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15</summary>

    > for DBCACHE in 450; do \
    >   COMMITS="5401e673d56198f2c0bad366581e70d5d9cd765c 77c0df7b59ff5a3a77d37e77145f1a157e05db19"; \
    >   STOP=933339; \
    >   HOST=x86_64-w64-mingw32; \
    " && >   XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
    (date -I) |>   BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    >   WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
    | win64>   export PATH="$XPACK/bin:$PATH"; \
    >   mkdir -p "$LOG_DIR"; \
    >   (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    >   (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
    >   hyperfine \
    >     --sort command \
    >     --runs 1 \
    >     --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
    >     --parameter-list COMMIT ${COMMITS// /,} \
    >     --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
    >       make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
    >       cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
    >       ninja -C build bitcoind -j\$(nproc) && \
    >       ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
    >     --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
    >       cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
    >     "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
    > done
    

    </details>

    5401e673d5 Merge bitcoin/bitcoin#33604: p2p: Allow block downloads from peers without snapshot block after assumeutxo validation
    77c0df7b59 validation: fetch inputs on parallel threads

    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 5401e673d56198f2c0bad366581e70d5d9cd765c)
      Time (abs ≡):        37691.648 s               [User: 0.000 s, System: 0.001 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 77c0df7b59ff5a3a77d37e77145f1a157e05db19)
      Time (abs ≡):        28752.722 s               [User: 0.003 s, System: 0.000 s]
    
    Relative speed comparison
            1.31          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 5401e673d56198f2c0bad366581e70d5d9cd765c)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 77c0df7b59ff5a3a77d37e77145f1a157e05db19)
    
  508. sedited referenced this in commit 4a05825a3f on Feb 11, 2026
  509. rustaceanrob commented at 2:10 PM on February 17, 2026: contributor

    I tested this with a simple time-based script. Note that real is the wall-clock time of the reindex.

    <details> <summary>A/B test on Linux systems</summary>

    #!/usr/bin/env bash
    set -euo pipefail
    
    SRC_DIR="${SRC_DIR:-$HOME/bitcoin}"
    COMMIT_A="${COMMIT_A:-5401e673d56198f2c0bad366581e70d5d9cd765c}"
    COMMIT_B="${COMMIT_B:-77c0df7b59ff5a3a77d37e77145f1a157e05db19}"
    STOP="${STOP:-930000}"
    DBCACHE="${DBCACHE:-450}"
    DATA_DIR="${DATA_DIR:-$HOME/.bitcoin}"
    JOBS="${JOBS:-$(nproc)}"
    
    git reset --hard $COMMIT_A
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF
    ninja -C build bitcoind -j $JOBS
    time ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -daemon=0
    
    git reset --hard $COMMIT_B
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF
    ninja -C build bitcoind -j $JOBS
    time ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -daemon=0
    

    </details>

    Results on my first machine:

    <details> <summary>Machine specifications</summary>

    $ lscpu
    Architecture:             x86_64
      CPU op-mode(s):         32-bit, 64-bit
      ...
    CPU(s):                   16
      ...
      Model name:             AMD Ryzen 7 7700 8-Core Processor
    

    </details>

    Before

    real	223m59.233s
    user	435m25.461s
    sys	52m39.860s
    

    After

    real	144m43.344s
    user	429m27.149s
    sys	38m21.716s
    

    Results on my second machine:

    <details> <summary>Machine specifications</summary>

    $ lscpu
    Architecture:             x86_64
      CPU op-mode(s):         32-bit, 64-bit
      ...
    CPU(s):                   16
      ...
    Vendor ID:                GenuineIntel
      Model name:             13th Gen Intel(R) Core(TM) i5-1340P
    

    </details>

    Before

    real	316m47.537s
    user	770m3.013s
    sys	39m59.231s
    

    After

    real	236m34.064s
    user	804m33.719s
    sys	36m46.623s
    
  510. in src/coinsviewcacheasync.h:61 in bc9d1e7ee4
      56 | +     * collision of an input being spent having the same first 8 bytes as a txid of a tx elsewhere in the block,
      57 | +     * the input will not be fetched in the background. The input will still be fetched later on the main thread.
      58 | +     * Using a sorted vector and binary search lookups is a performance improvement. It is faster than
      59 | +     * using std::unordered_set with salted hash or std::set.
      60 | +     */
      61 | +    std::vector<uint64_t> m_txids{};
    


    sipa commented at 9:18 PM on February 18, 2026:

    Could these be salted hashes instead of the first 8 bytes of txids? I'm slightly concerned this could enable deliberate performance degradation using a $2^{32}$ collision search on txids.


    l0rinc commented at 9:26 PM on February 18, 2026:

    What if we randomly add a shift x instead of always using the first few bytes - we'd take [x, x+8] instead?

    Edit: rehashing these kinda' sounds like it would defeat the purpose. What if we did 8 random bytes instead (e.g. shuffle 0..31 and load the first 8 indices)


    andrewtoth commented at 1:38 PM on February 19, 2026:

    Here are the measurements that led us to use a sorted vector + binary search #31132 (review). Note the graph on the left is more important since that's done on the main thread. The right side is mostly done on worker threads so is not as important. Obviously the sorted vector of short txids is the best option. @l0rinc it sounds like you're trying to reinvent siphash.

    What if we removed the sorted vector for now and used a salted unordered_set? That would probably make it easier to review, since we don't have to think about collisions at all. We could introduce a performance improvement for this in a follow-up.
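
    For reference, the filter shape being compared here can be sketched roughly as follows. This is a hypothetical illustration (the class and method names are not the PR's API): collect the 64-bit quick hashes of all txids created in the block, sort once, and answer membership queries with binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch of a sorted-vector membership filter; names are
// hypothetical, not taken from the PR.
class TxidFilter
{
    std::vector<uint64_t> m_txids;

public:
    explicit TxidFilter(std::vector<uint64_t> txids) : m_txids(std::move(txids))
    {
        std::sort(m_txids.begin(), m_txids.end());
    }

    // A false positive only means the input is skipped by the background fetch
    // and fetched later on the main thread, so correctness is unaffected.
    bool Contains(uint64_t quick_hash) const
    {
        return std::binary_search(m_txids.begin(), m_txids.end(), quick_hash);
    }
};
```

    Sorting once up front costs O(n log n), after which each lookup is O(log n) over a contiguous array, which is friendlier to the CPU cache than chasing unordered_set buckets.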


    sipa commented at 1:49 PM on February 19, 2026:

    I think we can use something much weaker than a hash here, if we assume (1) the inputs are cryptographic hashes already (txids are) and (2) the attacker does not get to observe our secret salt in any way (not even through timing leaks - which may be the case here because by the time they observe it, they've already succeeded).

    class QuickHashHasher
    {
        uint64_t m_key[4];
    
    public:
        QuickHashHasher() noexcept
        {
            FastRandomContext rng;
            for (int i = 0; i < 4; ++i) m_key[i] = rng.rand64();
        }
    
        uint64_t operator()(const uint256& hash_input) noexcept
        {
            return (hash_input.GetUint64(0) ^ m_key[0]) +
                   (hash_input.GetUint64(1) ^ m_key[1]) +
                   (hash_input.GetUint64(2) ^ m_key[2]) +
                   (hash_input.GetUint64(3) ^ m_key[3]);
        }
    };
    

    So my suggestion would be to use the current approach, but instead of ReadLE64(txid.begin()), use QuickHashHasher m_hasher; ... m_hasher(txid) .... Would that be acceptably fast? If so, I think I can write up a better formal argument why this is sufficient.


    andrewtoth commented at 4:07 AM on February 20, 2026:

    Adding a benchmark to @l0rinc's code at #31132 (review), I added the above quick hash and used it on each txid before adding to the vector. It did not slow down the vector creation + sorting and even showed a slight speedup. Lookups were essentially the same speed as well.


    andrewtoth commented at 1:57 AM on February 23, 2026:

    Thanks @sipa, I added this and made you a co-author. I did XOR accumulation instead of addition since it was triggering the overflow UB in CI.


    sipa commented at 2:05 AM on February 23, 2026:

    @andrewtoth Sadly, that doesn't work, because now the salt has no impact on which pairs form collisions, so the attacker can find those in 2^32 work again. To see why, let t[4] be the txid and s[4] be the salts, then you're computing (t[0] ^ s[0]) ^ (t[1] ^ s[1]) ^ (t[2] ^ s[2]) ^ (t[3] ^ s[3]), which can be rearranged as (t[0] ^ t[1] ^ t[2] ^ t[3]) ^ (s[0] ^ s[1] ^ s[2] ^ s[3]), so collisions just depend on the xoring of the txid qwords.

    Adding a ubsan suppression should suffice; there is nothing actually UB about uint64_t overflow, it's just our sanitizer that "helpfully" warns about some perfectly legal but suspicious things.
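
    The cancellation argument can be checked mechanically. A minimal sketch, using simplified stand-ins for the real hasher (not the PR's code):

```cpp
#include <array>
#include <cstdint>

using Words = std::array<uint64_t, 4>; // the four qwords of a txid

// XOR accumulation: the salt words cancel out of any collision condition.
uint64_t XorFold(const Words& t, const Words& s)
{
    return (t[0] ^ s[0]) ^ (t[1] ^ s[1]) ^ (t[2] ^ s[2]) ^ (t[3] ^ s[3]);
}

// Addition keeps the salt mixed in; unsigned wrap-around is well defined.
uint64_t AddFold(const Words& t, const Words& s)
{
    return (t[0] ^ s[0]) + (t[1] ^ s[1]) + (t[2] ^ s[2]) + (t[3] ^ s[3]);
}
```

    Two txids whose qwords XOR to the same value (e.g. the same qwords in a different order) collide under XorFold for every possible salt, while under AddFold the set of colliding pairs depends on the secret salt.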


    andrewtoth commented at 2:32 AM on February 23, 2026:

    Aha thanks yes I get it! I reverted and added a ubsan suppression instead.


    andrewtoth commented at 11:33 PM on April 11, 2026:

    @sipa @l0rinc For the same reason an attacker can't create quick hash collisions, they also can't create bucket collisions in an unordered_set. This lets us store the uint64_t quick hash directly in an unordered_set<uint64_t> rather than storing the full Txid with a salted hash. Benchmarks with this method show roughly the same construction time but much faster lookups.

  511. ryanofsky referenced this in commit ee2065fdea on Feb 20, 2026
  512. DrahtBot added the label Needs rebase on Feb 20, 2026
  513. andrewtoth force-pushed on Feb 23, 2026
  514. andrewtoth force-pushed on Feb 23, 2026
  515. DrahtBot added the label CI failed on Feb 23, 2026
  516. DrahtBot commented at 1:01 AM on February 23, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/22288974266/job/64472680743</sub> <sub>LLM reason (✨ experimental): Linker error: undefined reference to util::TraceThread prevents bitcoin-chainstate from linking.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  517. andrewtoth force-pushed on Feb 23, 2026
  518. andrewtoth force-pushed on Feb 23, 2026
  519. andrewtoth commented at 1:55 AM on February 23, 2026: contributor

    Rebased due to #34165 and the ThreadPool in #33689.

    The CoinsViewOverlay is now used for parallel input fetching. It also takes a shared_ptr<ThreadPool> instead of managing threads manually.

    Now instead of managing threads via a std::barrier, we just spawn tasks each time we start a block and wait for the futures to complete.

    This also lets us pass in a global thread pool for tests and fuzzing, so we can recreate CoinsViewOverlays quickly without having to spawn and teardown the threads each iteration.
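
    The lifecycle described above looks roughly like this sketch, with std::async standing in for the shared ThreadPool since its exact interface isn't shown in this comment:

```cpp
#include <future>
#include <vector>

// Hypothetical shape of the per-block task lifecycle: spawn one fetch task per
// worker when a block starts, and drain all futures before any mutating
// operation (Flush/Sync/SetBackend/Reset) can touch the cache.
struct FetchTasks {
    std::vector<std::future<void>> m_futures;

    void StartFetching(int workers)
    {
        for (int i = 0; i < workers; ++i) {
            m_futures.push_back(std::async(std::launch::async, [] {
                // Workers would claim inputs via an atomic counter and
                // fetch their coins from the backing view here.
            }));
        }
    }

    void StopFetching()
    {
        for (auto& f : m_futures) f.wait();
        m_futures.clear();
    }
};
```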

    A variation of the QuickHashHasher suggested in #31132 (review) is now used for the txid filter.

    The benchmark was dropped for this PR. There is already a lot to review as it is, and #34320 contains basically the same benchmark. It just needs to add a StartFetching once this is merged.

  520. DrahtBot removed the label Needs rebase on Feb 23, 2026
  521. andrewtoth force-pushed on Feb 23, 2026
  522. andrewtoth force-pushed on Feb 24, 2026
  523. andrewtoth force-pushed on Feb 24, 2026
  524. DrahtBot removed the label CI failed on Feb 24, 2026
  525. andrewtoth force-pushed on Feb 24, 2026
  526. DrahtBot added the label Needs rebase on Feb 26, 2026
  527. sedited commented at 10:21 AM on March 7, 2026: contributor

    Benched this some more:

    The hetzner box I used is x86_64 and has 8 vCPU cores. The node is also configured to prune (thought that might be interesting because of the slightly increased IO) and uses default dbcache.

    The RockPro64 (which I think is the platform used by nodl and the ronin dojo) is from the original 2018 series, has an NVMe SSD, and active cooling. The node running on it is configured with 1GB of dbcache. The Bitseed is the Qotom Q190N-S01 with 4GB of RAM and no active cooling. Its original HDD died years ago; it now has a SATA SSD. The node is running with default dbcache.

    The two low-power nodes run at home, and the connection I have is not that stable, so I did not want to do repeated IBD runs. I did one IBD run from a local connection on the RockPro64, which clocked in just shy of 35 hours.

    All nodes had -stopatheight=930000.

    | benchmark | base | branch |
    | --- | --- | --- |
    | IBD on hetzner | 13h 32m | 9h 57m |
    | reindex-chainstate RockPro64 | 67h 47m | 32h 24m |
    | reindex-chainstate Bitseed | 108h 48m | 60h 59m |
  528. andrewtoth force-pushed on Mar 8, 2026
  529. DrahtBot removed the label Needs rebase on Mar 8, 2026
  530. andrewtoth commented at 12:26 AM on March 9, 2026: contributor

    Rebased due to #34562. Also split the changes into more atomic commits that should be easier to review.

  531. sedited referenced this in commit 524aa1e533 on Mar 11, 2026
  532. DrahtBot added the label Needs rebase on Mar 11, 2026
  533. andrewtoth force-pushed on Mar 11, 2026
  534. DrahtBot removed the label Needs rebase on Mar 11, 2026
  535. andrewtoth commented at 2:35 PM on March 12, 2026: contributor

    Rebased due to #34576. All split out PRs have been merged. The diff seems a lot more manageable now.

    Thank you everyone for your benchmarks.

    I think this is ready for more review.

  536. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads 3x faster IBD
    validation: fetch block inputs on parallel threads
    on Mar 12, 2026
  537. murchandamus commented at 5:40 PM on March 12, 2026: member

    Concept ACK

    I’ve been loosely following this PR in the context of doing outreach. I have read many complaints about IBD being slow, especially for microcomputers and node in the box setups, which often come with external drives. The preliminary results described in comments on this PR sound promising.

    I don’t think I have valuable code review to add here, but conceptually this seems worthwhile, especially because we can anticipate that a lot of users will be switching out hard drives soon, if they are running a node with a full copy of the blockchain and had a 1 TB drive.

  538. in src/coins.h:744 in 551050628c outdated
     744 |  
     745 | +    std::shared_ptr<ThreadPool> m_thread_pool;
     746 | +    std::vector<std::future<void>> m_futures{};
     747 | +
     748 | +protected:
     749 | +    void Reset() noexcept override {
    


    hodlinator commented at 7:52 PM on March 12, 2026:

    nit:

        void Reset() noexcept override
        {
    
  539. in src/coins.h:1 in 612f420811 outdated


    hodlinator commented at 7:57 PM on March 12, 2026:

    612f420 validation: collect block inputs in CoinsViewOverlay before ConnectBlock:

    I wonder if we could instead have the first 2 commits squashed together and also add reading from m_inputs so that we have a working but non-parallelized implementation?

    Edit: My own attempt at this: https://github.com/hodlinator/bitcoin/tree/pr/31132_suggestions


    andrewtoth commented at 5:41 PM on March 13, 2026:

    Will do this :+1:.


    andrewtoth commented at 4:58 PM on March 22, 2026:

    Done, added you as a co-author.

  540. in src/coins.h:693 in 551050628c outdated
     693 | +
     694 | +        if (auto coin{base->PeekCoin(input.outpoint)}) [[likely]] input.coin.emplace(std::move(*coin));
     695 | +        // We need release here, so writing coin in the line above happens before the main thread acquires.
     696 | +        input.ready.test_and_set(std::memory_order_release);
     697 | +        input.ready.notify_one();
     698 | +        return true;
    


    hodlinator commented at 8:27 PM on March 12, 2026:

    Would it be more correct to return false for the last input to prevent ProcessInput() from being called one extra time?

            return i < m_inputs.size() - 1;
    

    (We will already have returned at the top if m_inputs.size() == 0).


    andrewtoth commented at 3:11 PM on March 13, 2026:

    This would only have an effect on a single thread. The other 3 threads would all call ProcessInput() an extra time.

  541. in src/coins.h:736 in 551050628c outdated
     736 | +            if (input.coin) [[likely]] return std::move(*input.coin);
     737 | +            // If we get here, then this block has missing or spent inputs or there is a txid quick hash collision.
     738 | +            break;
     739 | +        }
     740 | +
     741 | +        // We will only get here for BIP30 checks, txid quick hash collisions or a block with missing or spent inputs.
    


    hodlinator commented at 8:47 PM on March 12, 2026:

    We dropped the coinbase tx (& indirectly input) in StartFetching(), which shouldn't be spendable yet for another 100 blocks anyway, but tests try to access it so technically it could be included in the comment (invalid blocks may try to access it too).

            // We will only get here for BIP30 checks, txid quick hash collisions,
            // a block with missing or spent inputs, or attempts to look up coinbase inputs.
    

    andrewtoth commented at 3:14 PM on March 13, 2026:

    This comment only reflects production code. There is also a unit test to specifically access a coin that is not in the block, and of course fuzz tests will be able to hit here as well. Would it be helpful to clarify these are the only reasons in non-test code?

    invalid blocks may try to access it too

    I think it's more specific to say a block with missing or spent inputs. Other types of invalid blocks will not get here.

    Edit: Actually, a block spending its own coinbase outputs would get here too. Not sure that's possible though...

    Edit again: Ok, I'm pretty sure it's not possible to construct a segwit block that spends from its own coinbase. The merkle root of the witness data in the coinbase creates a circular dependency on the tx spending the coinbase outputs. However, it could be possible for a pre-segwit legacy block. Not sure if we need to spell that out in the comment here?

  542. hodlinator commented at 10:43 PM on March 12, 2026: contributor

    Concept ACK 551050628c0a4e17a72180353888ddeeab7e4030

    Been looking forward to this optimization landing. It is unfortunate that it's only gotten Concept ACKs so far. From briefly looking at the approach, it seems fairly straightforward. I don't have a solid grasp of the edge cases and the surrounding pieces of the puzzle, though.

  543. willcl-ark commented at 11:33 AM on March 13, 2026: member

    I'm concept ACK here, and plan to review more soon.

    I was trying to enumerate the new assumptions that this change would bring, as changing the db (even only the access mechanism) has the potential to lead to consensus bugs (e.g. the 2013 chain split) if done incorrectly.

    The main change here is of course using multiple threads to read from LevelDB. This is a well-supported and widely used LevelDB use case (as I understand it, this is how Chrome and many other LevelDB users use it). However, we now rely on the correctness of this part of the LevelDB implementation, which we did not before. I wonder if we could fuzz multi-threaded reads from LevelDB (or if it's already being done) to try and give more assurance around this new assumption?

    If for example there were a LevelDB bug triggered only under concurrent read load (a corrupted read, stale cache entry, a race in the table cache eviction etc.) an upgraded node could get different coin data than an un-upgraded node. That may result in a chain split. IMO this is the main question we have to be able to assure ourselves of in here (this could also be an argument for adding a config option to disable parallel fetch, if we cannot assure ourselves enough?)

    I have taken a look at some of the levelDB code (which is reasonably new to me outside of tweaking various params) to try and get a better understanding of how it works under the hood under concurrent reads (I was curious whether our threads were just saturating a single pipeline more, or actually executing fully in parallel):

    My initial read is that an internal mutex is held while grabbing a memtable reference (along with pointers to the current db state), which is very fast, and then released before any real expensive work is done. The internal block cache uses 16 independent shards, each with its own mutex held only during O(1) hash table operations. So with 4 worker threads, if we read from different SST files we run fully in parallel. We only contend if we hash to the same shard (a 1/16 chance), and even then we only block on the mutex very briefly. So we are doing genuine parallel reads.

    Most of the other "potential problems" I was trying to consider, I feel, were pretty much quashed by the fact that we do not change the fact we still hold cs_main for the duration of the parallel fetch, as before, and so really the main variable is the leveldb concurrent access.

    I have observed significant speedup benchmarking this, and it feels valuable enough to consider taking IMO, once we get good-enough assurances from the levelDB side of things.

  544. andrewtoth commented at 5:40 PM on March 13, 2026: contributor

    Thanks for your reviews @murchandamus, @hodlinator, and @willcl-ark.

    we do now rely on the correctness of this part of the levelDB implementation, which we did not before

    This is not entirely accurate. We rely on this correctness for our indexes. Concurrent getrawtransaction or getblockfilter calls using txindex or blockfilterindex do concurrent levelDB reads. We just haven't used this for chainstate reads.

    I wonder if we could fuzz multi-threaded reads from levelDB

    I pushed a commit to add a coins_view_stacked fuzz harness. This creates a stack of views similar to what we use in production: a CoinsViewOverlay -> CCoinsViewCache -> CCoinsViewDB stack using an in-memory levelDB. The fuzzer first works on CCoinsViewCache -> CCoinsViewDB by themselves to populate the levelDB, then works on the overlay on top of the main cache to perform concurrent reads through to levelDB, and afterwards works on the cache and db to flush any data from the main cache down to the db.

    I built this harness before and fuzzed with it, and am fuzzing with it now. I'm not sure why I removed it when rebasing at some point.

  545. DrahtBot added the label CI failed on Mar 13, 2026
  546. andrewtoth commented at 10:28 PM on March 16, 2026: contributor

    I collected additional steady-state data.

    I ran four nodes on AWS t2.small instances (1 vCPU, 2 GB RAM) with -prune=550 -debug=bench and 20 GB gp2 EBS volumes. They ran from 1 Jan–3 Mar 2026 (blocks 930,301–939,173). All four started from the same chainstate, block files, and mempool.dat. They all connected to a single gateway node in the same VPS, which itself connected only to two outside trusted nodes. Two nodes ran master and two ran this branch. The log files are attached to this comment.

    On average, the branch nodes were 23.7% faster at connecting blocks (25.1 ms per block). Although that is a modest improvement overall, worst-case block connection times were much better on the branch. The table below lists the 20 slowest blocks (by average connect time across the four nodes), with an average speedup of 2.87×, or about 11.7 seconds per block for that set.

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Average Speedup |
    |------|--------|--------|---------|---------|---------|---------|-----------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 5 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 6 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 7 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 8 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 9 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 10 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 11 | 930311 | 6,915 | 2.6 s | 2.1 s | 11.5 s | 11.8 s | 5.04x |
    | 12 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 13 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 14 | 930336 | 7,847 | 2.9 s | 2.6 s | 10.2 s | 10.9 s | 3.85x |
    | 15 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 16 | 933330 | 3,868 | 2.5 s | 3.2 s | 9.6 s | 8.6 s | 3.17x |
    | 17 | 930312 | 8,812 | 1.4 s | 2.6 s | 8.6 s | 11.3 s | 5.05x |
    | 18 | 930308 | 7,194 | 3.0 s | 3.0 s | 6.6 s | 9.6 s | 2.68x |
    | 19 | 939046 | 9,468 | 4.1 s | 4.1 s | 7.0 s | 6.8 s | 1.68x |
    | 20 | 930339 | 6,554 | 2.8 s | 3.1 s | 4.3 s | 9.7 s | 2.39x |
    | Avg | | | 6.4 s | 6.1 s | 20.0 s | 15.9 s | 2.87x |

    Why the improvement shows up in the worst cases

    If every transaction in a block was added to the mempool after the last cache flush, this change has little effect, because their inputs are already in the cache. After the cache is flushed due to memory limits, however, those inputs are evicted. This can happen regardless of when the transactions entered the mempool. So if large consolidation transactions are in the mempool and the cache then flushes, when those transactions are mined, all their inputs must be fetched from disk. With typical single-digit millisecond latency for network-attached storage, a block with many inputs can easily spend tens of seconds just fetching UTXOs. This effect is illustrated in the chart in the description of #28233.

    The same pattern affects blocks where some transactions were never in the mempool. When missing transactions are fetched to complete a compact block, their inputs will not be in the cache before entering ConnectBlock. That further slows block connection when it is already suboptimal, for example when many transactions are non-standard.

    For the same reason, -blocksonly nodes will also see a significant steady-state speedup.

    branch1.log.gz branch2.log.gz master1.log.gz master2.log.gz

  547. murchandamus commented at 11:10 PM on March 16, 2026: member

    The table below lists the 20 slowest blocks (by average connect time across the four nodes), with an average speedup of 2.87×, or about 11.7 seconds per block for that set.

    Reading that, I became curious. What would the 20 slowest blocks by the average “branch” time and “master” time look like in comparison? Is it largely the same set, or are there perhaps some cases in which the performance shifts one way or the other?

  548. andrewtoth commented at 11:28 PM on March 16, 2026: contributor

    @murchandamus Here are the longest average block times for the branch and master nodes independently. It is largely the same set, but the order is slightly different.

    | Rank | Branch Height | Branch Average Time | Master Height | Master Average Time |
    |------|---------------|---------------------|---------------|---------------------|
    | 1 | 935502 | 17.1 s | 935502 | 46.2 s |
    | 2 | 939086 | 14.4 s | 936879 | 43.3 s |
    | 3 | 935500 | 12.4 s | 935500 | 39.3 s |
    | 4 | 936879 | 11.0 s | 939086 | 34.2 s |
    | 5 | 930335 | 8.5 s | 939021 | 29.1 s |
    | 6 | 939021 | 7.6 s | 930335 | 23.5 s |
    | 7 | 930364 | 6.2 s | 930334 | 14.9 s |
    | 8 | 936669 | 5.8 s | 934760 | 14.6 s |
    | 9 | 930347 | 5.7 s | 930338 | 14.2 s |
    | 10 | 939024 | 5.6 s | 936669 | 11.9 s |
    | 11 | 934760 | 5.2 s | 930311 | 11.6 s |
    | 12 | 929609 | 4.5 s | 930336 | 10.6 s |
    | 13 | 930777 | 4.4 s | 930312 | 10.0 s |
    | 14 | 939046 | 4.1 s | 933330 | 9.1 s |
    | 15 | 930357 | 4.0 s | 930333 | 8.8 s |
    | 16 | 931291 | 4.0 s | 939024 | 8.2 s |
    | 17 | 930333 | 3.9 s | 930308 | 8.1 s |
    | 18 | 930334 | 3.8 s | 930310 | 7.7 s |
    | 19 | 930338 | 3.6 s | 930364 | 7.5 s |
    | 20 | 930305 | 3.4 s | 930313 | 7.2 s |
  549. murchandamus commented at 11:35 PM on March 16, 2026: member

    Oh, I was thinking the same table as above, but selected by the times of the branch or master. I thought it might be interesting to see what the speed-up factor was on blocks that are slow for the branch vs the speed-up factor for blocks that are slow for master, and might be interesting if the overlap isn’t complete. I figured you might have a script that produces the table already. If it’s too much work (because you did this manually), don’t worry—my comment was just from random curiosity inspired by your data dump.

  550. andrewtoth commented at 11:51 PM on March 16, 2026: contributor

    I couldn't exactly parse your request, so I put it into the LLM and it came up with this :) Interestingly, there are a few branch-slow blocks that are slower than master. All master-slow blocks are slower than the branch though.

    931291 seems to be a major outlier. It is 2x slower than master, and it has the typical pattern of a lot of large very low fee consolidation transactions. This one should be a lot faster than master. According to the block audit on mempool.space, this tx with >1200 inputs was confirmed in that block after being seen 7 seconds earlier. So, my theory is that the master nodes accepted the transaction into their mempool right before they saw the block, while the branch nodes did not yet see it.

    Top 20 by average branch time (branch-slow blocks)

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Speedup (m/b) |
    |------|--------|--------|---------|---------|---------|---------|---------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 5 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 6 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 7 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 8 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 9 | 930347 | 8,532 | 5.8 s | 5.6 s | 4.0 s | 4.1 s | 0.71x |
    | 10 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 11 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 12 | 929609 | 7,048 | 4.5 s | 4.5 s | 4.9 s | 5.0 s | 1.10x |
    | 13 | 930777 | 7,240 | 4.4 s | 4.4 s | 4.5 s | 4.7 s | 1.04x |
    | 14 | 939046 | 9,468 | 4.1 s | 4.1 s | 7.0 s | 6.8 s | 1.68x |
    | 15 | 930357 | 7,258 | 3.9 s | 4.2 s | 3.5 s | 3.5 s | 0.87x |
    | 16 | 931291 | 11,867 | 4.0 s | 4.0 s | 1.9 s | 1.9 s | 0.48x |
    | 17 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 18 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 19 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 20 | 930305 | 7,033 | 3.4 s | 3.4 s | 4.4 s | 6.6 s | 1.62x |

    Top 20 by average master time (master-slow blocks)

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Speedup (m/b) |
    |------|--------|--------|---------|---------|---------|---------|---------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 5 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 6 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 7 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 8 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 9 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 10 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 11 | 930311 | 6,915 | 2.6 s | 2.1 s | 11.5 s | 11.8 s | 5.04x |
    | 12 | 930336 | 7,847 | 2.9 s | 2.6 s | 10.2 s | 10.9 s | 3.85x |
    | 13 | 930312 | 8,812 | 1.4 s | 2.6 s | 8.6 s | 11.3 s | 5.05x |
    | 14 | 933330 | 3,868 | 2.5 s | 3.2 s | 9.6 s | 8.6 s | 3.17x |
    | 15 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 16 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 17 | 930308 | 7,194 | 3.0 s | 3.0 s | 6.6 s | 9.6 s | 2.68x |
    | 18 | 930310 | 7,681 | 1.2 s | 1.0 s | 7.6 s | 7.8 s | 7.09x |
    | 19 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 20 | 930313 | 7,913 | 1.3 s | 1.0 s | 4.7 s | 9.8 s | 6.17x |
  551. DrahtBot removed the label CI failed on Mar 18, 2026
  552. andrewtoth force-pushed on Mar 22, 2026
  553. andrewtoth commented at 5:04 PM on March 22, 2026: contributor

    Addressed comments by @hodlinator to rework the commit progression. The first commit is now a fully complete standalone change that fetches all block inputs into a vector before ConnectBlock on a single thread, and scans this vector in FetchCoinFromBase instead of looking up the coin from base.

    The next commits add performance improvements:

    • cache last looked up input in m_input_tail so we don't scan entire vector each lookup (9783ff481fd922cfa59c980046b0491c0241fd83)
    • filter inputs that are created earlier in the block so we don't look them up from disk (150f052ef4d057c45f8a85904be92a6c6fb1418c, 5ec61b1e9c879fa30e15d907c61f2953a140f567)

    Then the next few commits make it safe for parallel lookups:

    • introduce a ready flag in case input is not yet fetched (8eae22f493138f6ff0a4e07020a3036df98e0413)
    • stop fetching whenever any method is called that will mutate base (3e5cdee07720f841b8ae2538f556ee1bb5cb5bc0)

    Then the threadpool is added (f15dd38be78a89139b308e7a7682979adf3b0e0b) and finally used (1ef7474d19cb720d526e29cdabaa85f8f79c9d5f)

    The rest of the commits add documentation, unit tests, and fuzz harness updates.

  554. ryanofsky commented at 9:32 PM on April 7, 2026: contributor

    Concept ACK. Change seems worthwhile and surprisingly not that complicated. One thing I was wondering was about how 4 worker threads were chosen. I see some testing was done #31132 (comment) but if optimal number depends on the type of storage device, maybe it should be configurable.

  555. DrahtBot added the label Needs rebase on Apr 13, 2026
  556. validation: collect block inputs in CoinsViewOverlay before ConnectBlock
    Introduce CoinsViewOverlay::StartFetching, which maps all input prevouts of a
    block to a new m_inputs vector of InputToFetch elements. It returns a
    ResetGuard whose lifetime is bound to the block, as is the lifetime of the
    InputToFetch elements.
    
    Introduce StopFetching to clear the m_inputs vector.
    CCoinsViewCache::Reset is made virtual and is overridden in CoinsViewOverlay.
    StopFetching is called on Reset, so the InputToFetch objects will not
    exceed the lifetime of the block.
    
    Introduce ProcessInput to fetch the utxo of an individual input in m_inputs.
    Each caller fetches the input at m_input_head and increments it, so each call
    will fetch the next input in the queue.
    
    Fetch coins from the m_inputs vector in FetchCoinFromBase by scanning all inputs
    until we discover the input with the correct outpoint.
    
    This is designed deliberately so multiple threads can call ProcessInput independently.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
    4203d58656
  557. andrewtoth force-pushed on Apr 14, 2026
  558. DrahtBot added the label CI failed on Apr 15, 2026
  559. DrahtBot commented at 12:09 AM on April 15, 2026: contributor


    🚧 At least one of the CI tasks failed. <sub>Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/24427713628/job/71365320458</sub> <sub>LLM reason (✨ experimental): CI failed because CTest reported a segmentation fault (SIGSEGV) in validation_block_tests (test 327).</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  560. coins: track last accessed input using m_input_tail
    Inputs are accessed by ConnectBlock in the same order as they
    are created in StartFetching (excepting BIP30 checks).
    We can use this information, as well as the fact that CoinsViewOverlay
    caches coins accessed via FetchCoinFromBase, to skip scanning
    over previously accessed coins.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    4731016c81
  561. coins: introduce QuickHashHasher
    Collapses a 32-byte Txid into a uint64_t, using 4 random uint64_ts.
    Used in place of a hash function as a performance improvement.
    
    Co-authored-by: Pieter Wuille <pieter@wuille.net>
    02756fc5e1
  562. coins: filter inputs spending outputs of same block in ProcessInput
    This is a performance improvement, because we can skip checking on disk
    that the input does not exist.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    e40042eda3
  563. coins: add ready flag to InputToFetch
    Prepares for ProcessInput to be called from multiple threads.
    
    This flag acts as a memory fence around InputToFetch::coin. There is no lock
    guarding reads and writes of the coin field.
    Instead we use the flag's release/acquire semantics to ensure that when the
    main thread reads the coin it will have happened after a worker thread has
    finished writing it.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    39fe4975a3
  564. coins: stop fetching before mutating base
    Prepares for ProcessInput to be called from multiple threads.
    
    ProcessInput reads from base, so for ProcessInput to be safe to call in
    parallel on separate threads, base must not be mutated while fetching is
    in progress. Flush, Sync, and SetBackend can modify base, so we override
    them and call StopFetching before delegating to the base class.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    f6a868595a
  565. validation: add -inputfetchthreads configuration option
    Add a configuration option for the number of worker threads used for
    parallel UTXO input fetching during block connection.
    
    Default is 4 threads, max is 15, 0 disables parallel fetching.
    e56373fc2d
  566. coins: introduce thread pool in CoinsViewOverlay
    Prepares for ProcessInput to be called from multiple threads.
    
    Introduce a ThreadPool shared pointer to CoinsViewOverlay. A pool managed
    externally can be passed in the constructor.
    
    A global thread pool is used in fuzz harnesses since iterations can happen
    faster than the OS can create and tear down thread pools.
    This can cause a memory leak when fuzzing.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    0188760a85
  567. coins: fetch inputs in parallel
    Leverages the thread pool to fetch inputs on multiple threads, while the overlay
    serves inputs on the main thread.
    
    This is a performance improvement over blocking the main thread to fetch inputs.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    5a34853872
  568. doc: update CoinsViewOverlay docstring to describe parallel fetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    34b931df5f
  569. test: add unit tests for CoinsViewOverlay::StartFetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    ff6a56335f
  570. fuzz: update harnesses to cover CoinsViewOverlay::StartFetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    Co-authored-by: sedited <seb.kung@gmail.com>
    1dd2f0fa06
  571. fuzz: add coins_view_stacked fuzz harness to test concurrent leveldb reads cfbff4cd70
  572. andrewtoth force-pushed on Apr 15, 2026
  573. DrahtBot removed the label Needs rebase on Apr 15, 2026
  574. DrahtBot removed the label CI failed on Apr 15, 2026
  575. andrewtoth commented at 2:26 PM on April 15, 2026: contributor

    Rebased due to #34124.

    Added a new commit to add a configuration option to set the number of input fetcher threads -inputfetchthreads e56373fc2d5ee4c617f7bca0e63b0e82e9bbed0d. Default is 4, maximum is 15 like script validation threads, and 0 disables input fetching on threads other than main. Addresses suggestions in #31132 (comment) (thanks @ryanofsky) and #31132 (comment) (thanks @willcl-ark).

    Uses an unordered_set for storing and looking up the quick hashes of txids, instead of a sorted vector with binary search lookups. This is faster than the previous approach, and the same collision-resistance property of the quick hash that avoids collisions between txids and prevout hashes also makes it safe from bucket-filling attacks. See discussion #31132 (review).

    Fixes an issue with the fuzz harnesses using -fork with certain fuzzers (thanks @furszy).

    git range-diff e98d36715eace5ee54a10f2931adcbbc5f6b0a15..62e4ec4bf38e4f22eed3b1015036105b2efa000a 976985eccd546a95e38973b854ccc6589e8afc74..cfbff4cd70092d5b53bf4f1dee3df84b4961a51c

    One thing I was wondering was about how 4 worker threads were chosen. @ryanofsky there are more measurements here with different benchmarks for different values of threads #31132 (comment). Most systems will benefit from more threads, but a few do not and even show slight degradation if more than 4 threads are chosen. @l0rinc and I decided that 4 was a sane conservative default that showed significant speedup across all systems benchmarked. However, #31132 (comment) shows much better performance with more threads on network connected storage. I decided to add a configuration option.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-28 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me