validation: fetch block inputs on parallel threads #31132

pull andrewtoth wants to merge 13 commits into bitcoin:master from andrewtoth:threaded-inputs changing 10 files +476 −28
  1. andrewtoth commented at 2:40 PM on October 22, 2024: contributor

    Parts of this PR are isolated in independent smaller PRs to ease review:


    This PR parallelizes fetching all input prevouts of a block during block connection, achieving over 3x faster IBD performance in some scenarios[^1][^2][^3][^4][^5].

    Problem

    Currently, when fetching inputs in ConnectBlock, each input is fetched from the cache sequentially. A cache miss requires a round trip to the disk database to fetch the outpoint and insert it into the cache. Since the database is read-only during ConnectBlock, we can fetch all inputs of a block in parallel on multiple threads while connecting.

    Solution

    We add a ThreadPool to CoinsViewOverlay to fetch block inputs in parallel. The block is passed to the CoinsViewOverlay before entering ConnectBlock, which kicks off the worker threads to begin fetching the inputs. The cache returns fetched coins as they become available via the overridden FetchCoinFromBase method. If a coin is not available yet, the main thread also fetches coins while it waits.

    Implementation Details

    The CoinsViewOverlay implements a lock-free MPSC (Multiple Producer, Single Consumer) queue design:

    • Work Distribution: Collects all input prevouts into a queue and uses a barrier to start all worker threads simultaneously
    • Synchronization: Worker threads use an atomic counter to claim which inputs to fetch, and each input has an atomic flag to signal completion to the main thread
    • Main Thread Processing: The main thread waits for inputs in order and moves their Coin into the cacheCoins map as they become available.
    • Work Stealing: If the main thread catches up to the workers, it assists with fetching to maximize parallelism
    • Completion: All thread futures are waited on in StopFetching, which is called from any mutating method (Flush/Sync/SetBackend/Reset). The ResetGuard going out of scope ensures this happens before the block is destroyed.
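
    The design in the bullets above can be sketched as follows. This is a minimal, self-contained illustration with stand-in types (Outpoint, Coin, FetchFromDisk, Slot, and FetchAll are all hypothetical names, and the barrier/ResetGuard details are omitted); the real implementation lives in CoinsViewOverlay:

```cpp
#include <atomic>
#include <optional>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the real types: the actual code fetches Coin
// values for COutPoint keys from the backing CCoinsView.
using Outpoint = int;
using Coin = int;

// Stand-in for the slow, read-only disk lookup.
static Coin FetchFromDisk(Outpoint o) { return o * 2; }

struct Slot {
    std::optional<Coin> coin;
    std::atomic<bool> done{false}; // per-input completion flag
};

// Workers claim inputs via an atomic counter; the main thread consumes the
// results in order and "steals" work whenever it catches up to the workers.
std::vector<Coin> FetchAll(const std::vector<Outpoint>& outpoints, int workers)
{
    std::vector<Slot> slots(outpoints.size());
    std::atomic<size_t> next{0}; // lock-free work claiming

    const auto fetch_one = [&](size_t i) {
        slots[i].coin = FetchFromDisk(outpoints[i]);
        slots[i].done.store(true, std::memory_order_release);
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            size_t i;
            while ((i = next.fetch_add(1)) < outpoints.size()) fetch_one(i);
        });
    }

    std::vector<Coin> results;
    results.reserve(outpoints.size());
    for (size_t i = 0; i < outpoints.size(); ++i) {
        // Wait for input i; help with unclaimed work in the meantime.
        while (!slots[i].done.load(std::memory_order_acquire)) {
            if (const size_t j = next.fetch_add(1); j < outpoints.size()) {
                fetch_one(j);
            } else {
                std::this_thread::yield();
            }
        }
        results.push_back(*slots[i].coin);
    }
    for (auto& t : pool) t.join();
    return results;
}
```

    Here the main thread takes each fetched coin in input order, mirroring how the real overlay moves coins into the cacheCoins map as they become available; the atomic fetch_add makes work claiming lock-free, and the per-slot done flag is the only synchronization between a worker and the main thread.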

    Safety and Correctness

    • The CoinsViewOverlay works on a block that has not been fully validated, but it does not interfere with or modify any of the validation performed during ConnectBlock
    • It simply fetches inputs in parallel, which must be fetched before a transaction can be validated anyway
    • Invalid blocks: If an invalid block is mined, the temporary cache is reset without being flushed. This is an improvement over the current behavior, where existing inputs pulled through for an invalid block are inserted into the main CoinsTip() cache

    Performance

    Benchmarks show over 3x faster IBD in a cloud environment with network-attached storage[^1], 3x faster IBD on an M4 Mac, and up to 46% faster with directly attached storage[^2][^3][^4][^5]. Parallelizing the expensive disk lookups provides a significant speedup.

    Flamegraphs show how the execution profile changes.

    Credits

    Inspired by this comment.

    Resolves #34121.

    [^1]: #31132 (comment)
    [^2]: #31132#pullrequestreview-3515011880
    [^3]: #31132 (comment)
    [^4]: #31132#pullrequestreview-3347436866
    [^5]: #31132 (comment)

  2. DrahtBot commented at 2:40 PM on October 22, 2024: contributor


    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.


    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.


    Reviews

    See the guideline for information on the review process.

    If your review is incorrectly listed, please copy-paste `<!--meta-tag:bot-skip-->` into the comment that the bot should ignore.


    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #35078 (validation: merge PeekCoin into GetCoin by l0rinc)
    • #34320 (coins: remove redundant and confusing CCoinsViewDB::HaveCoin by l0rinc)
    • #34132 (coins: drop error catcher, centralize fatal read handling by l0rinc)
    • #28690 (build: Introduce internal kernel library by sedited)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.


  3. DrahtBot added the label Validation on Oct 22, 2024
  4. andrewtoth force-pushed on Oct 22, 2024
  5. DrahtBot commented at 2:45 PM on October 22, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/31894441286</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  6. DrahtBot added the label CI failed on Oct 22, 2024
  7. andrewtoth renamed this:
    validation: fetch block inputs parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads ~17% faster IBD
    on Oct 22, 2024
  8. andrewtoth force-pushed on Oct 22, 2024
  9. andrewtoth force-pushed on Oct 22, 2024
  10. andrewtoth force-pushed on Oct 22, 2024
  11. DrahtBot removed the label CI failed on Oct 22, 2024
  12. in src/inputfetcher.h:151 in e9e23b59f8 outdated
     146 | +        : m_batch_size(batch_size)
     147 | +    {
     148 | +        m_worker_threads.reserve(worker_thread_count);
     149 | +        for (size_t n = 0; n < worker_thread_count; ++n) {
     150 | +            m_worker_threads.emplace_back([this, n]() {
     151 | +                util::ThreadRename(strprintf("inputfetch.%i", n));
    


    l0rinc commented at 10:26 AM on October 23, 2024:

    Q: Is this a leftover hack for non-owning LevelDB threads, or is this really the best way to name threads in a cross-platform way?


    andrewtoth commented at 1:49 PM on October 23, 2024:

    Unsure, copied from CScriptCheck. If the state of the art of thread naming has advanced since that was written, please let me know!


    sipa commented at 1:54 PM on October 23, 2024:

    The C++ standard library, as far as I know, has no way of renaming threads at all. src/util/threadnames.{h,cpp} is our wrapper around the various platform-dependent ways of doing so on supported systems.
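
    As a rough illustration of what such a wrapper does under the hood (RenameThisThread is a hypothetical name; the real helper is util::ThreadRename in src/util/threadnames.{h,cpp} and also records the name for logging):

```cpp
#include <string>
#if defined(__linux__) || defined(__APPLE__)
#include <pthread.h>
#endif

// Minimal sketch of a cross-platform thread-rename helper; each platform
// exposes a different non-standard API for this.
static void RenameThisThread(const std::string& name)
{
#if defined(__linux__)
    // Linux truncates thread names to 15 characters plus the NUL.
    pthread_setname_np(pthread_self(), name.substr(0, 15).c_str());
#elif defined(__APPLE__)
    // macOS can only rename the calling thread.
    pthread_setname_np(name.c_str());
#else
    (void)name; // no known API: silently do nothing
#endif
}
```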


    l0rinc commented at 2:03 PM on October 23, 2024:

    Thank you, please resolve the comment.

  13. in src/inputfetcher.h:189 in e9e23b59f8 outdated
     184 | +                    continue;
     185 | +                }
     186 | +
     187 | +                buffer.emplace_back(outpoint);
     188 | +                if (buffer.size() == m_batch_size) {
     189 | +                    Add(std::move(buffer));
    


    l0rinc commented at 11:29 AM on October 23, 2024:

    We're mostly creating the buckets randomly here, so each thread will need access to basically all of the keys. Since we have an idea of how LevelDB works here (i.e. Sorted String Table), we could likely improve cache locality (would likely be most beneficial on HDDs) and minimize lock contention by splitting the reads by sorted transactions instead.


    andrewtoth commented at 1:46 PM on October 23, 2024:

    I don't think there is any lock contention here if we are doing multithreaded reading?

    I also think what you're suggesting would add a lot more complexity to this PR, when this is "good enough".


    l0rinc commented at 2:02 PM on October 23, 2024:

    This might be as simple as sorting by tx before we create the buckets.


    andrewtoth commented at 2:12 PM on October 23, 2024:

    If a benchmark shows that it is better, then great!
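
    A sketch of what "sorting by tx before we create the buckets" could look like, with illustrative types (OutPointKey and MakeBatches are hypothetical; the real key is a 256-bit txid plus an output index):

```cpp
#include <algorithm>
#include <vector>

struct OutPointKey {
    int txid;  // stands in for the real 256-bit txid
    int index; // output index within the transaction
};

// Sort the outpoints by key before slicing them into batches, so each
// worker's reads land close together in LevelDB's sorted SST files.
std::vector<std::vector<OutPointKey>> MakeBatches(std::vector<OutPointKey> outpoints,
                                                  size_t batch_size)
{
    std::sort(outpoints.begin(), outpoints.end(), [](const auto& a, const auto& b) {
        return a.txid != b.txid ? a.txid < b.txid : a.index < b.index;
    });
    std::vector<std::vector<OutPointKey>> batches;
    for (size_t i = 0; i < outpoints.size(); i += batch_size) {
        batches.emplace_back(outpoints.begin() + i,
                             outpoints.begin() + std::min(i + batch_size, outpoints.size()));
    }
    return batches;
}
```

    Sorting groups lookups for the same or neighbouring txids into the same batch, which should help cache locality most on spinning disks.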

  14. in src/inputfetcher.h:73 in e9e23b59f8 outdated
      68 | +        std::vector<std::pair<COutPoint, Coin>> pairs{};
      69 | +        do {
      70 | +            std::vector<COutPoint> outpoints{};
      71 | +            outpoints.reserve(m_batch_size);
      72 | +            {
      73 | +                WAIT_LOCK(m_mutex, lock);
    


    l0rinc commented at 11:30 AM on October 23, 2024:

    I'm wondering if we really need to (b)lock here or whether we could create a read-only snapshot instead and avoid stalling?


    andrewtoth commented at 1:34 PM on October 23, 2024:

    This is blocking so we can access the queue of shared outpoints that we need to fetch from. It is not blocking for LevelDB, we access the db once we are out of the critical section.


    l0rinc commented at 2:04 PM on October 23, 2024:

    As mentioned before, why do we need shared outpoints here?


    andrewtoth commented at 2:15 PM on October 23, 2024:

    The main thread adds all outpoints to a global vector, which all workers will fetch their work from.


    andrewtoth commented at 12:56 AM on December 4, 2024:

    We no longer need to block on the shared outpoints vector. We write to it once in the main thread before notifying the other threads and then only read from it afterwards.

  15. in src/inputfetcher.h:29 in e9e23b59f8 outdated
      24 | + * onto the queue, where they are fetched by N worker threads. The resulting
      25 | + * coins are pushed onto another queue after they are read from disk. When
       26 | + * the main thread is done adding outpoints, it starts writing the results of the read
      27 | + * queue to the cache.
      28 | + */
      29 | +class InputFetcher
    


    l0rinc commented at 11:39 AM on October 23, 2024:

    I know it's not a trivial request, but can we add a test for this class which fetches everything in parallel and sequentially and asserts that the results are equivalent? And preferably also a benchmark, like we have for https://github.com/bitcoin/bitcoin/blob/master/src/bench/checkqueue.cpp. I would gladly help here, if needed.


    andrewtoth commented at 1:35 PM on October 23, 2024:

    Yes, I can add these but I am waiting for some more conceptual support.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Added tests and a benchmark. The test has random parameters, one combination of which ends up using a single worker thread.


    andrewtoth commented at 8:59 PM on November 16, 2024:

    Also added fuzz harness
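
    The core of such an equivalence test can be sketched like this (SlowFetch, FetchSequential, and FetchParallel are hypothetical stand-ins; the real test exercises InputFetcher against a coins view):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Stand-in for a slow, read-only disk lookup.
static int SlowFetch(int key) { return key * key + 1; }

// Reference implementation: fetch everything on one thread.
std::vector<int> FetchSequential(const std::vector<int>& keys)
{
    std::vector<int> out;
    out.reserve(keys.size());
    for (const int k : keys) out.push_back(SlowFetch(k));
    return out;
}

// Parallel implementation under test: workers claim indices atomically and
// write to disjoint slots, so the result order matches the sequential one.
std::vector<int> FetchParallel(const std::vector<int>& keys, int workers)
{
    std::vector<int> out(keys.size());
    std::atomic<size_t> next{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < workers; ++t) {
        pool.emplace_back([&] {
            size_t i;
            while ((i = next.fetch_add(1)) < keys.size()) out[i] = SlowFetch(keys[i]);
        });
    }
    for (auto& th : pool) th.join();
    return out;
}
```

    The test then asserts FetchParallel(keys, n) == FetchSequential(keys) for several worker counts, including one.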

  16. in src/inputfetcher.h:145 in e9e23b59f8 outdated
     140 | +    }
     141 | +
     142 | +
     143 | +public:
     144 | +    //! Create a new input fetcher
     145 | +    explicit InputFetcher(size_t batch_size, size_t worker_thread_count) noexcept
    


    l0rinc commented at 11:46 AM on October 23, 2024:

    For consistency (see: explicit CCheckQueue(unsigned int batch_size, int worker_threads_num)) and simplicity (m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)}, and to follow modern C++ directions where sizes seem to be preferred as signed values, see: #30927 (review)), please consider making these int(s) instead.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Done.

  17. in src/validation.cpp:6251 in e9e23b59f8 outdated
    6247 | @@ -6243,6 +6248,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6248 |  
    6249 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6250 |      : m_script_check_queue{/*batch_size=*/128, options.worker_threads_num},
    6251 | +      m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)},
    


    l0rinc commented at 12:07 PM on October 23, 2024:

    Unlike the script checks, these fetches aren't CPU bound, so there is no reason to use the number of CPUs as the number of parallel threads. I don't know if we care about HDD performance here or not, but we can likely find a multiplier that makes this better for both SSD and HDD.

    Quoting from https://pkolaczk.github.io/disk-parallelism:

    It was surprising to me that even 64 threads, which are far more than the number of CPU cores (4 physical, 8 virtual), still improved the performance. I guess that with requests of such a small size to such a fast storage, you need to submit really many of them to keep the SSD busy.

    If we can provide a benchmark for this usecase we can likely find an optimal multiplier here - I won't nack but this part is very important for me.


    andrewtoth commented at 1:43 PM on October 23, 2024:

    Adding more threads will require more memory, which is one reason to not use many more.

    I did a benchmark using 64 threads on the same 16 vcore machine, and it was slightly slower :/


    l0rinc commented at 2:00 PM on October 23, 2024:

    4x may be too much to begin with, but 1.5-2x sounds plausible, I'll help with benchmarking this once my current batches finish.


    andrewtoth commented at 3:06 PM on November 7, 2024:

    Added a benchmark to experiment with these.

  18. in src/inputfetcher.h:188 in e9e23b59f8 outdated
     183 | +                if (cache.HaveCoinInCache(outpoint)) {
     184 | +                    continue;
     185 | +                }
     186 | +
     187 | +                buffer.emplace_back(outpoint);
     188 | +                if (buffer.size() == m_batch_size) {
    


    l0rinc commented at 12:10 PM on October 23, 2024:

    Would it be possible to create the batch sizes dynamically? Since the number of missing values differs for every block (and every dbcache size), it may make more sense to calculate the optimal split instead of using the arbitrary value of 128. Coroutines might alleviate this problem.


    andrewtoth commented at 1:42 PM on October 23, 2024:

    I'm not sure it would warrant the complexity; I think this batch size is "good enough" for now. In a follow-up we could add config options to experiment with whether there really are more optimal settings.


    andrewtoth commented at 3:05 PM on November 7, 2024:

    I changed the batch size to be number of workers.

  19. in src/inputfetcher.h:197 in e9e23b59f8 outdated
     192 | +                }
     193 | +            }
     194 | +            txids.insert(tx->GetHash());
     195 | +        }
     196 | +
     197 | +        Add(std::move(buffer));
    


    l0rinc commented at 12:11 PM on October 23, 2024:

    Do we always have leftovers, or will this process the last batch twice (or process an empty one) if the total happens to be divisible by batch_size?


    andrewtoth commented at 1:37 PM on October 23, 2024:

    It won't process twice, but it could pass in an empty vector, which is ignored if you look at Add implementation.

  20. in src/inputfetcher.h:65 in e9e23b59f8 outdated
      60 | +
      61 | +    std::vector<std::thread> m_worker_threads;
      62 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      63 | +
      64 | +    /** Internal function that does the fetching from disk. */
      65 | +    void Loop() noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 12:35 PM on October 23, 2024:

    We're basically mimicking RocksDB's MultiGet here - but prewarming the cache instead in separate get requests, since we can't really access LevelDB's internals.

    Since splitting into buckets isn't trivial and since MultiGet seems to rely on C++20 coroutines (which wasn't available in 2012 when CCheckQueue was written), I'm wondering how much simpler this fetching would be if we had lightweight suspendible threads instead: https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html#multiget


    andrewtoth commented at 1:40 PM on October 23, 2024:

    I think it would be similar in complexity, we would still need all the locking mechanisms to prevent multithreaded access.

    What would really be great is if we had a similar construction to Rust's std::sync::mpsc.


    l0rinc commented at 1:58 PM on October 23, 2024:

    Can you tell me why we need to prevent multithreaded access exactly? We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread, right?

    How would mpsc solve this better? Do you think we need work stealing to make it perfectly parallel? Wouldn't coroutines already achieve the same?


    sipa commented at 2:07 PM on October 23, 2024:

    I haven't yet experimented with them, but as far as I understand it, coroutines are just a programming paradigm, not magic; they don't do anything of their own, besides making things that were already possible easier to write. In particular, you still need a thread pool or some mechanism for scheduling how to run them.


    andrewtoth commented at 2:09 PM on October 23, 2024:

    We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread

    If the vectors are thread local, then how can the main thread access them at the end to write them? We also want to be writing throughout while the workers are fetching, not just at the end.

    How would mpsc solve this better?

    Instead of each worker thread having a local queue of results, which they then append to the global results queue, they could just push each result to the channel individually. The main thread could just pull results off the channel as they arrive, instead of waiting to be awoken by a worker thread that appended all its results to the global queue.

    work stealing

    That is a concept for async rust, or std::async::mpsc. We can do all this without introducing an async runtime. But, this is getting off topic.


    l0rinc commented at 3:15 PM on October 23, 2024:

    coroutines are just programming paradigm, not magic

    That's also what I was counting on! :D

    In RocksDB they have high and low priority work (I assume that's just added to the front or the back of a background work deque) – this could align well with @furszy's suggestion for mixing different kinds of background work units.

    I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime. Combined with an appropriate scheduling mechanism (such as a thread pool), we could maximize both CPU and IO usage, if I'm not mistaken. Instead of each thread handling just one task, it could suspend a coroutine while waiting on IO (e.g., a database fetch) and resume it later, effectively maximizing CPU and IO work without needing to know the exact details of the work.

    If the vectors are thread local

    The vector would still be global, but each thread would only access a single bucket (i.e. global vector of vectors, with each thread from the pool writing only to vector[thread_id], which contains a vector of fetched coins). When all the work is finished, we'd iterate over the global vector and merge the results into the cache on a single thread. As mentioned, sorting the outpoints before fetching could help improve data locality and reduce lock contention, and the coroutines above would help with work stealing, ensuring that all threads finish roughly at the same time.

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.


    sipa commented at 3:28 PM on October 23, 2024:

    I haven't used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime.

    That needs async I/O, and is unrelated to coroutines, as far as I understand it. Coroutines just help with keeping track of what to do when the reads come back inside rocksdb.

    As long as LevelDB (or whatever database engine we use) internally does not use async I/O, there will be one (waiting) thread per parallel outstanding read request from the database.


    HowHsu commented at 4:06 PM on June 28, 2025:

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.

    As far as I know, the advantage of coroutines over threads is faster context switching, since it doesn’t go through the operating system kernel. This advantage only becomes apparent under extremely high concurrency, such as hundreds of thousands of concurrent tasks. Using coroutines does not eliminate the need for synchronization mechanisms where they are inherently required.

  21. l0rinc changes_requested
  22. l0rinc commented at 12:44 PM on October 23, 2024: contributor

    Concept ACK

    I'm still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I'd be interested in how coroutines would perform here instead of trying to find the best batching size manually.

  23. furszy commented at 2:25 PM on October 23, 2024: member

    Cool idea.

    Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks? The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

    This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

  24. andrewtoth commented at 2:48 PM on October 23, 2024: contributor

    implementing a general-purpose thread pool shared among the validation checks?

    Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

    #26966 introduces such structure inside https://github.com/bitcoin/bitcoin/commit/401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

    Concept ACK!

  25. l0rinc commented at 6:12 PM on October 24, 2024: contributor

    Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

    Summary
    'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
     1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0'
    

    f278ca4ec3 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours)
    e9e23b59f8 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)


    ~So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.~ Maybe we could add a new config option (iothreads or iothreadmultiplier or something). The defaults should likely depend on whether it's an SSD or HDD.


    Edit:

    <details> <summary>Previous results</summary>

    "command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [39993.343777768874],
    
    "command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40929.84310861388],
    

    </details>

    I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

    "command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    "times": [40579.00445769842],
    

    So it's a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it's not faster.


    Edit:

    Running it on a HDD with a low dbcache value reproduces the original result:

    <details> <summary>benchmark</summary>

    hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    

    </details>

    8207d372b2 validation: fetch block inputs in parallel
    92fc718592 coins: allow emplacing non-dirty coins internally
    Summary
      'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
        1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    
  26. andrewtoth commented at 6:31 PM on October 24, 2024: contributor

    So likely on HDD we shouldn't use so many threads, apparently it slows down IBD.

    I'm not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It is also syncing from untrusted network peers, so there is some variance that could account for the 2% difference.

  27. andrewtoth force-pushed on Oct 24, 2024
  28. andrewtoth force-pushed on Oct 24, 2024
  29. DrahtBot commented at 7:25 PM on October 24, 2024: contributor

    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/32027275494</sub>

  30. DrahtBot added the label CI failed on Oct 24, 2024
  31. DrahtBot removed the label CI failed on Oct 25, 2024
  32. andrewtoth force-pushed on Oct 26, 2024
  33. andrewtoth force-pushed on Oct 26, 2024
  34. DrahtBot added the label CI failed on Oct 26, 2024
  35. DrahtBot commented at 6:52 PM on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/32107893176</sub>

  36. andrewtoth force-pushed on Oct 26, 2024
  37. andrewtoth force-pushed on Oct 26, 2024
  38. andrewtoth force-pushed on Oct 26, 2024
  39. andrewtoth force-pushed on Oct 26, 2024
  40. andrewtoth force-pushed on Oct 27, 2024
  41. andrewtoth force-pushed on Oct 27, 2024
  42. andrewtoth marked this as a draft on Oct 27, 2024
  43. andrewtoth force-pushed on Oct 27, 2024
  44. andrewtoth force-pushed on Oct 27, 2024
  45. andrewtoth force-pushed on Oct 27, 2024
  46. andrewtoth force-pushed on Oct 27, 2024
  47. andrewtoth force-pushed on Oct 27, 2024
  48. andrewtoth force-pushed on Oct 27, 2024
  49. andrewtoth force-pushed on Oct 27, 2024
  50. andrewtoth force-pushed on Nov 7, 2024
  51. andrewtoth force-pushed on Nov 7, 2024
  52. DrahtBot removed the label CI failed on Nov 7, 2024
  53. andrewtoth force-pushed on Nov 7, 2024
  54. andrewtoth marked this as ready for review on Nov 7, 2024
  55. andrewtoth commented at 3:05 PM on November 7, 2024: contributor

    @furszy I tried to switch to using a shared threadpool, but it is much slower that way. We need a way to share state between threads for this, instead of just scheduling tasks. I suppose the generic threadpool is great for scheduling independent tasks like indexing individual blocks, but it is not well optimized for quickly pulling outpoints off a shared vector.

    From #29386:

    I just noticed the comment in the code:

    For each thread a thread stack needs to be allocated. By default on Linux, threads take up 8MiB for the thread stack on a 64-bit system, and 4MiB in a 32-bit system.

    Only 8MiB of Virtual Memory is allocated, which doesn't really mean anything. Due to the CoW mechanism, only the parts of the stack that are actually used will be allocated as Physical Memory, which is the one that actually matters.

    So, I don't think it matters much to have an extra threadpool owned by the input fetcher.

    I think this is ready for more review. I also added tests and a benchmark.

  56. andrewtoth force-pushed on Nov 7, 2024
  57. andrewtoth force-pushed on Nov 9, 2024
  58. andrewtoth commented at 3:28 PM on November 13, 2024: contributor

    For later blocks where cache misses are much more common, this change has an even bigger impact. This benchmark report shows a 40% speedup measuring from blocks 840k to 850k. Also, compare flamegraphs of master and this branch, where the latter has 15 worker threads fetching coins from disk. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11798124132/index.html

  59. andrewtoth commented at 3:35 AM on November 16, 2024: contributor

    Even with just 2 worker threads, there is significant (~30%) speed improvement for syncing recent blocks. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11865650166/index.html

  60. andrewtoth force-pushed on Nov 16, 2024
  61. andrewtoth force-pushed on Nov 16, 2024
  62. andrewtoth force-pushed on Nov 16, 2024
  63. andrewtoth force-pushed on Nov 16, 2024
  64. DrahtBot commented at 7:45 PM on November 16, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33086747731</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  65. DrahtBot added the label CI failed on Nov 16, 2024
  66. andrewtoth force-pushed on Nov 16, 2024
  67. andrewtoth force-pushed on Nov 16, 2024
  68. DrahtBot removed the label CI failed on Nov 16, 2024
  69. andrewtoth force-pushed on Nov 17, 2024
  70. in src/test/fuzz/inputfetcher.cpp:32 in 2bd5f0f03b outdated
      27 | +    CBlock block;
      28 | +    Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
      29 | +
      30 | +    const auto txs{fuzzed_data_provider.ConsumeIntegralInRange<uint32_t>(1,
      31 | +        std::numeric_limits<uint32_t>::max())};
      32 | +    for (uint32_t i{0}; i < txs; ++i) {
    


    dergoegge commented at 2:16 PM on November 18, 2024:

    This will create very long-running inputs (e.g. txs = std::numeric_limits<uint32_t>::max()).

        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), N) {
    

    or

        LIMITED_WHILE(fuzzed_data_provider.remaining_bytes(), N) {
    

    andrewtoth commented at 5:04 PM on November 18, 2024:

    Thanks, done!

  71. in src/test/fuzz/inputfetcher.cpp:36 in 2bd5f0f03b outdated
      31 | +        std::numeric_limits<uint32_t>::max())};
      32 | +    for (uint32_t i{0}; i < txs; ++i) {
      33 | +        CMutableTransaction tx;
      34 | +
      35 | +        const auto inputs{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
      36 | +        for (uint32_t j{0}; j < inputs; ++j) {
    


    dergoegge commented at 2:17 PM on November 18, 2024:

    Same as above, this will create long-running inputs and might even run out of memory?

  72. andrewtoth force-pushed on Nov 18, 2024
  73. andrewtoth force-pushed on Nov 18, 2024
  74. DrahtBot commented at 3:34 PM on November 18, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33143571653</sub>


  75. DrahtBot added the label CI failed on Nov 18, 2024
  76. andrewtoth force-pushed on Nov 18, 2024
  77. andrewtoth force-pushed on Nov 18, 2024
  78. andrewtoth force-pushed on Nov 18, 2024
  79. andrewtoth force-pushed on Nov 18, 2024
  80. andrewtoth force-pushed on Nov 18, 2024
  81. andrewtoth force-pushed on Nov 18, 2024
  82. andrewtoth force-pushed on Nov 18, 2024
  83. DrahtBot removed the label CI failed on Nov 18, 2024
  84. andrewtoth force-pushed on Nov 19, 2024
  85. in src/coins.h:420 in 3a4af55071 outdated
     415 | @@ -416,13 +416,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     416 |      void AddCoin(const COutPoint& outpoint, Coin&& coin, bool possible_overwrite);
     417 |  
     418 |      /**
     419 | -     * Emplace a coin into cacheCoins without performing any checks, marking
     420 | -     * the emplaced coin as dirty.
     421 | +     * Emplace a coin into cacheCoins without performing any checks, optionally
     422 | +     * marking the emplaced coin as dirty.
    


    sedited commented at 2:38 PM on November 19, 2024:

    Should this rather say "optionally marking the emplaced coin as not dirty", since the default is always dirty?


    andrewtoth commented at 6:53 PM on November 20, 2024:

    I'm not sure that's the best though, since we do not mark a coin as not dirty. That is the default state.

    What about "marking the coin as dirty unless set_dirty is set to false"?


    sedited commented at 7:15 PM on November 20, 2024:

    That sounds good to me :+1:


    andrewtoth commented at 9:56 PM on November 20, 2024:

    Done.

  86. andrewtoth force-pushed on Nov 19, 2024
  87. andrewtoth force-pushed on Nov 20, 2024
  88. andrewtoth force-pushed on Nov 20, 2024
  89. DrahtBot added the label CI failed on Nov 20, 2024
  90. DrahtBot commented at 6:46 PM on November 20, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33279820062</sub>


  91. andrewtoth force-pushed on Nov 20, 2024
  92. DrahtBot removed the label CI failed on Nov 20, 2024
  93. andrewtoth force-pushed on Nov 21, 2024
  94. DrahtBot commented at 5:12 PM on November 21, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33335042693</sub>


  95. DrahtBot added the label CI failed on Nov 21, 2024
  96. andrewtoth force-pushed on Nov 21, 2024
  97. DrahtBot removed the label CI failed on Nov 21, 2024
  98. andrewtoth force-pushed on Nov 22, 2024
  99. andrewtoth force-pushed on Nov 24, 2024
  100. in src/test/inputfetcher_tests.cpp:55 in 3c201bcffc outdated
      50 | +    const auto cores{GetNumCores()};
      51 | +    const auto num_txs{m_rng.randrange(cores * 10)};
      52 | +    const auto block{CreateBlock(num_txs)};
      53 | +    const auto batch_size{m_rng.randrange<int32_t>(block.vtx.size() * 2)};
      54 | +    const auto worker_threads{m_rng.randrange(cores * 2)};
      55 | +    InputFetcher fetcher{batch_size, worker_threads};
    


    ismaelsadeeq commented at 8:31 PM on November 26, 2024:

    In 3c201bcffc1d7e382e8afa9a88750a4c261c1cf8 "tests: add inputfetcher tests" You can set this up in InputFetcherTest so that you don't have to repeat it in the rest of the tests.


    andrewtoth commented at 12:58 AM on December 4, 2024:

    Done.

  101. in src/bench/inputfetcher.cpp:15 in 2349ac7d60 outdated
      10 | +#include <primitives/block.h>
      11 | +#include <serialize.h>
      12 | +#include <streams.h>
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto QUEUE_BATCH_SIZE{128};
    


    ismaelsadeeq commented at 8:33 PM on November 26, 2024:

    In 2349ac7d6071746a80223358bce0d5e556b277d7 "bench: add inputfetcher bench" How did you select this batch size?


    andrewtoth commented at 12:58 AM on December 4, 2024:

    This is the hardcoded batch size used in CheckQueue. Not sure why that was selected, but I deferred to previous choices.


    l0rinc commented at 3:45 PM on September 29, 2025:

    I would prefer retesting those assumptions (I don't even think we need a batch here)


    andrewtoth commented at 7:06 PM on October 3, 2025:

    Removed the batch size 🎉

  102. in src/bench/inputfetcher.cpp:19 in 2349ac7d60 outdated
      14 | +
      15 | +static constexpr auto QUEUE_BATCH_SIZE{128};
      16 | +static constexpr auto DELAY{2ms};
      17 | +
      18 | +//! Simulates a DB by adding a delay when calling GetCoin
      19 | +class DelayedCoinsView : public CCoinsView
    


    ismaelsadeeq commented at 8:47 PM on November 26, 2024:

    In 2349ac7d6071746a80223358bce0d5e556b277d7 "bench: add inputfetcher bench" nit: it would be nice if we had block 413567's input data to read, so that we don't have to simulate this.


    andrewtoth commented at 12:59 AM on December 4, 2024:

    We're reading the previous outpoints of that block's inputs, which are in many other previous blocks. So, not sure this is feasible.


    andrewtoth commented at 4:09 PM on November 30, 2025:

    I now use a leveldb and add mock input data before the benchmark, so it's more of a real-world benchmark now :+1: . Thanks!

  103. ismaelsadeeq commented at 8:55 PM on November 26, 2024: member

    Concept ACK This is nice. Although I have not yet benchmarked this branch, I also like @furszy's idea of having a general-purpose thread pool.

    I just have one test improvement comment, question and a nit after first pass of the PR

  104. DrahtBot added the label Needs rebase on Dec 4, 2024
  105. andrewtoth force-pushed on Dec 4, 2024
  106. andrewtoth commented at 1:02 AM on December 4, 2024: contributor

    Rebased. Since #30039, reading inputs is much faster, so the effect of this change is somewhat less significant (17% -> 10%). It's still a significant speedup, though, so still worth it, especially for the worst case where the cache is completely empty, like on startup or right after it gets flushed due to size.

    It is also refactored significantly. The main thread now writes everything before notifying threads, and then joins in working. This lets us do significantly less work in the critical section and parallelize more checks.

  107. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads 10% faster IBD
    on Dec 4, 2024
  108. andrewtoth force-pushed on Dec 4, 2024
  109. DrahtBot commented at 1:29 AM on December 4, 2024: contributor


    🚧 At least one of the CI tasks failed. <sub>Debug: https://github.com/bitcoin/bitcoin/runs/33884531020</sub>


  110. DrahtBot added the label CI failed on Dec 4, 2024
  111. andrewtoth force-pushed on Dec 4, 2024
  112. DrahtBot removed the label Needs rebase on Dec 4, 2024
  113. DrahtBot removed the label CI failed on Dec 4, 2024
  114. DrahtBot added the label Needs rebase on Dec 5, 2024
  115. andrewtoth force-pushed on Dec 5, 2024
  116. DrahtBot removed the label Needs rebase on Dec 5, 2024
  117. in src/validation.cpp:3198 in b2da764446 outdated
    3194 | @@ -3195,6 +3195,8 @@ bool Chainstate::ConnectTip(BlockValidationState& state, CBlockIndex* pindexNew,
    3195 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3196 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3197 |      {
    3198 | +        m_chainman.GetInputFetcher().FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    


    l0rinc commented at 9:26 AM on May 11, 2025:

    Can we let the objects do the job instead of querying their internals and doing it ourselves ("tell, don't ask"):

            m_chainman.FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    

    andrewtoth commented at 2:20 PM on October 2, 2025:

    I tried to mimic the script validation like GetCheckQueue. But, I guess this is different enough. Will change next time I push.


    andrewtoth commented at 12:00 AM on October 4, 2025:

    Done.

  118. DrahtBot added the label CI failed on Jun 2, 2025
  119. maflcko commented at 7:02 AM on June 10, 2025: member

    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe a different limit is being hit?

  120. HowHsu commented at 4:42 PM on June 26, 2025: contributor

    Hi folks, this looks great: if all the prevout coins of all transactions of a block are loaded in advance, then the optimization in #32791 makes sense.

  121. sedited commented at 9:09 AM on September 17, 2025: contributor

    What's the status here?

  122. andrewtoth force-pushed on Sep 17, 2025
  123. andrewtoth force-pushed on Sep 17, 2025
  124. andrewtoth force-pushed on Sep 17, 2025
  125. andrewtoth commented at 8:38 PM on September 17, 2025: contributor

    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe it is a different limit hit?

    Thanks, I added -par=1 to all nodes spawned in feature_proxy.py in 6980852416040bdddf111df3cea3ec50639f010a. That test spawns lots of nodes, and block validation is not relevant to it.

    What's the status here?

    Rebased to fix silent conflicts and added the fix for feature_proxy.py.

  126. andrewtoth commented at 10:57 PM on September 21, 2025: contributor

    I benchmarked the latest branch with default dbcache up to 912683. Results are a speedup of 14% - 5:07 vs 5:49.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:--|--:|--:|--:|--:|
    | `echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683` | 18430.672 ± 19.856 | 18416.631 | 18444.712 | 1.00 |
    | `echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683` | 20937.219 ± 62.635 | 20892.929 | 20981.508 | 1.14 ± 0.00 |
  127. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads 10% faster IBD
    validation: fetch block inputs on parallel threads >10% faster IBD
    on Sep 21, 2025
  128. maflcko removed the label CI failed on Sep 22, 2025
  129. andrewtoth commented at 5:18 PM on September 23, 2025: contributor

    Did the same benchmark with 5000 dbcache and there is a 6% speedup :rocket: - 4:27 vs 4:44. Even with far fewer cache misses this change is still a benefit, and will continue to improve block connection speed as the blockchain and utxo set get bigger.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:--|--:|--:|--:|--:|
    | `echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000` | 16021.047 ± 5.892 | 16016.881 | 16025.213 | 1.00 |
    | `echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000` | 17057.947 ± 42.032 | 17028.226 | 17087.668 | 1.06 ± 0.00 |
  130. in test/functional/feature_proxy.py:141 in 688c03597a outdated
     135 | @@ -136,6 +136,9 @@ def setup_nodes(self):
     136 |          if self.have_unix_sockets:
     137 |              args[5] = ['-listen', f'-proxy=unix:{socket_path}']
     138 |              args[6] = ['-listen', f'-onion=unix:{socket_path}']
     139 | +        # Keep validation threads low to avoid CI thread/pid limits.
     140 | +        # Ensure even empty arg lists get '-par=1'.
     141 | +        args = [a + ['-par=1'] if a else ['-par=1'] for a in args]
    


    maflcko commented at 8:16 PM on September 24, 2025:

    Seems a bit odd to have the number of nodes in a test influence whether or not the test has to be edited to remove or add -par=1 everywhere. Would it not be easier to just globally set -par=2 for all functional tests?

    diff --git a/test/functional/test_framework/util.py b/test/functional/test_framework/util.py
    index e5a5938f07..42bb213dd3 100644
    --- a/test/functional/test_framework/util.py
    +++ b/test/functional/test_framework/util.py
    @@ -459,6 +459,7 @@ def write_config(config_path, *, n, chain, extra_config="", disable_autoconnect=
             f.write("printtoconsole=0\n")
             f.write("natpmp=0\n")
             f.write("shrinkdebugfile=0\n")
    +        f.write("par=2\n")
             # To improve SQLite wallet performance so that the tests don't timeout, use -unsafesqlitesync
             f.write("unsafesqlitesync=1\n")
             if disable_autoconnect:
    

    andrewtoth commented at 9:01 PM on September 24, 2025:

    Yes, I wondered if that would be more invasive to other tests though.


    maflcko commented at 7:42 AM on September 25, 2025:

    It disables the auto-detection for all functional tests by default, which I can't really find a downside to. Also, it removes idle "spam" threads while debugging (gdb and other tools will display less script check threads), which also seems beneficial to have?


    andrewtoth commented at 4:09 PM on September 27, 2025:

    Makes sense. Done here #33485.

  131. in src/inputfetcher.h:145 in 688c03597a outdated
     140 | +                        m_in_flight_outpoints_count -= m_last_outpoint_index;
     141 | +                        m_last_outpoint_index = 0;
     142 | +                        break;
     143 | +                    }
     144 | +                }
     145 | +            } catch (const std::runtime_error&) {
    


    l0rinc commented at 2:31 PM on September 29, 2025:

    nit: is there anything in the error that we may want to log?


    andrewtoth commented at 6:04 PM on October 11, 2025:

    Added a log.

  132. in src/inputfetcher.h:107 in 688c03597a
     102 | +                while (m_last_outpoint_index == 0) {
     103 | +                    if ((is_main_thread && m_in_flight_outpoints_count == 0) || m_request_stop) {
     104 | +                        return;
     105 | +                    }
     106 | +                    ++m_idle_worker_count;
     107 | +                    cond.wait(lock);
    


    l0rinc commented at 2:33 PM on September 29, 2025:

    I haven't reviewed it in detail, but I was wondering why we need locking here; it should be possible to do most of this lock-free (especially if we sort the keys first so that threads are more likely to access different regions). I have started reviewing and testing it in detail, but to make some progress I'm sharing my observations as I go along.


    andrewtoth commented at 2:21 PM on October 2, 2025:

    I've updated to use semaphores instead of mutex. That should be more efficient.

    especially if we sort the keys first so that threads are more likely to access different regions

    I don't understand what this has to do with being lock free.


    l0rinc commented at 2:40 PM on October 2, 2025:

    I don't understand what this has to do with being lock free.

    We may have fewer file system locks if the threads are accessing different regions

    I've updated to use semaphores instead of mutex

    I will review that in more detail soon, probably next week.


    andrewtoth commented at 2:47 PM on October 2, 2025:

    We may have fewer file system locks if the threads are accessing different regions

    Ok, but that is not the same as this InputFetcher construction being lock free.


    l0rinc commented at 2:54 PM on October 2, 2025:

    No, that's orthogonal, it's another area where we could possibly reduce contention.

  133. in src/inputfetcher.h:82 in 688c03597a
      77 | +    const CCoinsViewCache* m_cache{nullptr};
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
      80 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
    


    l0rinc commented at 2:35 PM on September 29, 2025:

    instead of the comment, can we express this in the method name?


    andrewtoth commented at 2:22 PM on October 2, 2025:

    I updated the thread name to ThreadLoop, which just does the loop. There is another function now, FetchInputsOnThread, that fetches for each block until finished.

  134. in src/inputfetcher.h:79 in 688c03597a outdated
      74 | +    //! DB coins view to fetch from.
      75 | +    const CCoinsView* m_db{nullptr};
      76 | +    //! The cache to check if we already have this input.
      77 | +    const CCoinsViewCache* m_cache{nullptr};
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
    


    l0rinc commented at 2:38 PM on September 29, 2025:

    I have tried std::jthread in l0rinc/bitcoin@6afe2e8 (#40) but it seems the CI's libc++ doesn’t provide it

    Q: it's just the second commit and we're already doing the fetching on multiple threads. Can we add a single-threaded input fetcher first and add multithreading only as the very last step?


    andrewtoth commented at 3:55 PM on October 4, 2025:

    Can we add a single-threaded input fetcher first and add multithreading only as the very last step?

    Done.


    andrewtoth commented at 2:09 PM on October 14, 2025:

    I have tried std::jthread

    I looked at this, but it doesn't really add anything to this implementation. We could have a std::stop_token for each thread, but we would have to request_stop() each jthread before releasing the semaphore in the destructor anyway. So it doesn't let us remove the destructor, and saves a line for not having to declare m_request_stop. I don't think it's worth it to use jthreads here.


    l0rinc commented at 2:49 PM on October 14, 2025:

    I couldn't get it to work on CI anyway

  135. in src/inputfetcher.h:73 in 688c03597a
      68 | +     */
      69 | +    int32_t m_in_flight_outpoints_count GUARDED_BY(m_mutex){0};
      70 | +    //! The number of worker threads that are waiting on m_worker_cv
      71 | +    int32_t m_idle_worker_count GUARDED_BY(m_mutex){0};
      72 | +    //! The maximum number of outpoints to be assigned in one batch
      73 | +    const int32_t m_batch_size;
    


    l0rinc commented at 2:42 PM on September 29, 2025:

    What if, instead of locking, each thread iterates every nth element (with a stride of n threads, offset by the thread index), implicitly dividing the input into n disjoint buckets without locking? Each thread would work on a distinct set of values; we can pre-filter for existing values on a single thread before forking off. This won't have work stealing, but we can likely assume a uniform distribution, and the solution would be trivial and lock-free.


    andrewtoth commented at 2:24 PM on October 2, 2025:

    Prefiltering on the main thread is too slow; it's faster if we do the filtering in parallel. So we still need a smaller batch size, because otherwise the work would not be divided evenly: one thread could get all cache misses while the others all have cached inputs.


    l0rinc commented at 2:40 PM on October 2, 2025:

    Not sure why that's problematic, we don't have to have perfect parallelism, it seems to me we can assume uniform distribution - it's fine if there are outliers if that makes the code simpler (which I think it should, it could even eliminate most locks, since the jobs are basically completely independent)

  136. in src/inputfetcher.h:77 in 688c03597a outdated
      72 | +    //! The maximum number of outpoints to be assigned in one batch
      73 | +    const int32_t m_batch_size;
      74 | +    //! DB coins view to fetch from.
      75 | +    const CCoinsView* m_db{nullptr};
      76 | +    //! The cache to check if we already have this input.
      77 | +    const CCoinsViewCache* m_cache{nullptr};
    


    l0rinc commented at 2:44 PM on September 29, 2025:

    Could we pre-filter on a single thread and send the results to the fetcher instead? That way we can also decide not to do multi-threaded access for small sets (we can experiment with the values, but a reasonable starting point is that any set smaller than nproc is handled on a single thread).


    andrewtoth commented at 2:25 PM on October 2, 2025:

    Prefiltering on the main thread is too slow. It is several milliseconds to check every input in large blocks whether they exist in the cache.

  137. in src/inputfetcher.h:1 in 688c03597a outdated
       0 | @@ -0,0 +1,246 @@
       1 | +// Copyright (c) 2024-present The Bitcoin Core developers
    


    l0rinc commented at 2:45 PM on September 29, 2025:

    nit: the curse of long review queues

    // Copyright (c) 2025-present The Bitcoin Core developers
    

    andrewtoth commented at 2:25 PM on October 2, 2025:

    Done.

  138. in src/coins.h:432 in 688c03597a outdated
     430 | +     * NOT FOR GENERAL USE. Used when loading coins from a UTXO snapshot, and
     431 | +     * in the InputFetcher.
     432 |       * @sa ChainstateManager::PopulateAndValidateSnapshot()
     433 |       */
     434 | -    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin);
     435 | +    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty = true);
    


    l0rinc commented at 2:48 PM on September 29, 2025:

    To peel away the preparatory commits, it would simplify review to extract these into tiny, focused PRs. This PR has been in review for some time, and it's a very good change that I'd like to see make progress.

    nit: I understand the default param is meant to make the diff smaller, but it doesn't help with understanding the effect of the change, i.e. seeing where this is used and what we're changing.


    andrewtoth commented at 4:09 PM on October 15, 2025:

    I think I can just drop this first commit entirely. We don't actually need to avoid marking the coins we fetch as dirty. In the happy path, all these coins will be spent immediately after ConnectBlock, so they would be set to dirty anyway. In the unhappy path, where a block with valid proof-of-work turns out to be invalid, the dirty coins we added will just be overwritten with the same data from the db at the next flush.

  139. in src/coins.cpp:114 in 688c03597a outdated
     109 | @@ -110,10 +110,15 @@ void CCoinsViewCache::AddCoin(const COutPoint &outpoint, Coin&& coin, bool possi
     110 |             (bool)it->second.coin.IsCoinBase());
     111 |  }
     112 |  
     113 | -void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin) {
     114 | -    cachedCoinsUsage += coin.DynamicMemoryUsage();
     115 | +void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
     116 | +    const auto mem_usage{coin.DynamicMemoryUsage()};
    


    l0rinc commented at 2:49 PM on September 29, 2025:

    af8a366bd6a08d9362e69a89b0b89b5c94eb63ca I had something similar in https://github.com/bitcoin/bitcoin/pull/32313/files#diff-f0ed73d62dae6ca28ebd3045e5fc0d5d02eaaacadb4c2a292985a3fbd7e1c77cR254

    Can you please explain in the commit message why this change is necessary?


    andrewtoth commented at 3:59 PM on October 3, 2025:

    Added some explanation in the commit message. Please let me know if it makes it more clear.

  140. in src/validation.cpp:6271 in 688c03597a outdated
    6267 | @@ -6266,6 +6268,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6268 |  
    6269 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6270 |      : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6271 | +      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 2:50 PM on September 29, 2025:

    I have tested this with different par values and surprisingly it barely had any effect. Is it because of the locking?


    andrewtoth commented at 2:27 PM on October 2, 2025:

    I believe this is resolved.

  141. in src/coins.cpp:117 in af8a366bd6 outdated
     115 | +void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
     116 | +    const auto mem_usage{coin.DynamicMemoryUsage()};
     117 |      auto [it, inserted] = cacheCoins.try_emplace(std::move(outpoint), std::move(coin));
     118 | -    if (inserted) CCoinsCacheEntry::SetDirty(*it, m_sentinel);
     119 | +    if (inserted) {
     120 | +        cachedCoinsUsage += mem_usage;
    


    l0rinc commented at 2:57 PM on September 29, 2025:

    This seems like a change in behavior, but the old code assumed the coin was never already in the cache; the if (inserted) check doesn't make that obvious, so it likely isn't an actual change.

    Is that still something we can assume (hence the "DANGER"), right? And if it's always inserted, does the insertion guard still make sense? It's a bit even more confusing now :/


    andrewtoth commented at 2:33 PM on October 2, 2025:

    This is dangerous because it doesn't check for freshness or if already inserted. It is meant to bulk load new utxos from the assume utxo set. Since assume utxo assumes the utxo set is currently empty, the coins would always be inserted. This is repurposed here to bulk load utxos from the db directly into the cache. However, an invalid block could be mined which spends an already spent utxo that is in the cache but has not been synced to the db yet. In that case, the insertion will fail here. There is a unit test specifically for this scenario.

  142. in src/inputfetcher.h:24 in 912f26b81e outdated
      19 | +#include <unordered_set>
      20 | +#include <vector>
      21 | +
      22 | +/**
      23 | + * Input fetcher for fetching inputs from the CoinsDB and inserting
      24 | + * into the CoinsTip.
    


    l0rinc commented at 2:59 PM on September 29, 2025:
     * Helper for fetching inputs from the CoinsDB and inserting into the CoinsTip.
    

    andrewtoth commented at 3:59 PM on October 3, 2025:

    Done.

  143. in src/inputfetcher.h:27 in 912f26b81e outdated
      22 | +/**
      23 | + * Input fetcher for fetching inputs from the CoinsDB and inserting
      24 | + * into the CoinsTip.
      25 | + *
      26 | + * The main thread loops through the block and writes all input prevouts to a
      27 | + * global vector. It then wakes all workers and starts working as well. Each
    


    l0rinc commented at 3:00 PM on September 29, 2025:

    do we need to write to a global vector or can we safely iterate the prevouts directly from each thread?


    andrewtoth commented at 2:36 PM on October 2, 2025:

    We now iterate the prevouts directly from the block, but we store the tx index and vin index of each input in a global vector. This way we can flatten the inputs instead of having to scan the txs to see how many inputs each has.

  144. in src/inputfetcher.h:83 in 912f26b81e outdated
      78 | +
      79 | +    std::vector<std::thread> m_worker_threads;
      80 | +    bool m_request_stop GUARDED_BY(m_mutex){false};
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
      83 | +    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 3:04 PM on September 29, 2025:

    Q: do we really need the main thread to be part of this? I expect this to be disk bound rather than CPU bound, so we should be able to go beyond nproc; it should be safe to leave the main thread out of this as far as I can tell...


    andrewtoth commented at 2:37 PM on October 2, 2025:

    Not sure, would have to benchmark this. I have updated the functions though to make the main thread's entrance clearer.


    l0rinc commented at 2:43 PM on October 2, 2025:

    My benchmarks so far indicate the opposite: after 3-4 threads there is no benefit to the parallelization (either on SSD or HDD). I will remeasure your new changes after you give me the 👍

  145. in src/inputfetcher.h:130 in 912f26b81e outdated
     125 | +                    // block, it won't be in the cache yet but it also won't be
     126 | +                    // in the db either.
     127 | +                    if (m_txids.contains(outpoint.hash)) {
     128 | +                        continue;
     129 | +                    }
     130 | +                    if (m_cache->HaveCoinInCache(outpoint)) {
    


    l0rinc commented at 3:28 PM on September 29, 2025:

    as mentioned before I think it should be safe to pre-filter on a single thread instead


    andrewtoth commented at 2:38 PM on October 2, 2025:

    It is definitely safe to do, since all access would be on the main thread. It is also safe to do from parallel threads if we don't write until all threads are done reading, which is what this PR does. Prefiltering is slow though (several milliseconds), so it is better done in parallel.


    l0rinc commented at 2:45 PM on October 2, 2025:

    But prefiltering would allow sorting, which should untangle the threads. Each thread would then keep accessing the same files (which are more likely to differ from the files the other threads are requesting), so they may profit from cache locality if the OS supports it - that's why I suggested giving it a try.

  146. in src/inputfetcher.h:127 in 912f26b81e outdated
     122 | +                for (auto i{end_index - local_batch_size}; i < end_index; ++i) {
     123 | +                    const auto& outpoint{m_outpoints[i]};
     124 | +                    // If an input spends an outpoint from earlier in the
     125 | +                    // block, it won't be in the cache yet but it also won't be
     126 | +                    // in the db either.
     127 | +                    if (m_txids.contains(outpoint.hash)) {
    


    l0rinc commented at 3:31 PM on September 29, 2025:

    what if this ends up on different threads, i.e. a spend from an earlier outpoint is processed on a different thread? Wouldn't we take care of those automatically? We can likely skip all values that are not found, since we will revalidate everything after this cache-warming call - we just have to document that it's theoretically possible that some values won't be in the cache after this call (though the internal spends should be added, just on a different thread, right?).


    andrewtoth commented at 2:40 PM on October 2, 2025:

    I'm not sure I understand this. The m_txids set is computed on the main thread, and is only read from multiple threads. If we didn't do this we would try to fetch non-existent outputs from the db, which would be much slower.

  147. in src/inputfetcher.h:136 in 912f26b81e outdated
     131 | +                        continue;
     132 | +                    }
     133 | +                    if (auto coin{m_db->GetCoin(outpoint)}; coin) {
     134 | +                        local_pairs.emplace_back(outpoint, std::move(*coin));
     135 | +                    } else {
     136 | +                        // Missing an input. This block will fail validation.
    


    l0rinc commented at 3:32 PM on September 29, 2025:

    do we really care about this, it's not our job here to validate, just fetch whatever we can, the validation will happen after this pre-warming.


    andrewtoth commented at 2:41 PM on October 2, 2025:

    We don't really care, but it would be good to not continue doing work here if we know it's pointless. This just exits early. No validation is happening.


    l0rinc commented at 2:50 PM on October 2, 2025:

    I think I would prefer a less opinionated version, as long as it's still correct. No need to optimize for the consensus failure speed in my opinion, I would prefer simpler code for a change as risky as this one.

  148. in src/inputfetcher.h:162 in 912f26b81e outdated
     157 | +
     158 | +    //! Create a new input fetcher
     159 | +    explicit InputFetcher(int32_t batch_size, int32_t worker_thread_count) noexcept
     160 | +        : m_batch_size(batch_size)
     161 | +    {
     162 | +        if (worker_thread_count < 1) {
    


    l0rinc commented at 3:34 PM on September 29, 2025:

    what's the reason for allowing a negative worker_thread_count? In other cases I think it was used to signal how many CPUs to reserve, but that doesn't seem to be the case here, and since we're clamping to a minimum of 0, consider:

            if (worker_thread_count == 0) {
    

    andrewtoth commented at 3:58 PM on October 3, 2025:

    Done.

  149. in src/inputfetcher.h:192 in 912f26b81e outdated
     187 | +    void FetchInputs(CCoinsViewCache& cache,
     188 | +                     const CCoinsView& db,
     189 | +                     const CBlock& block) noexcept
     190 | +        EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
     191 | +    {
     192 | +        if (m_worker_threads.empty() || block.vtx.size() <= 1) {
    


    l0rinc commented at 3:37 PM on September 29, 2025:

    Can we maybe do something like this instead?

            if (block.vtx.size() < m_worker_threads.size()) {
    

    andrewtoth commented at 2:42 PM on October 2, 2025:

    This is to not enter if there is only a coinbase tx, since it has no inputs to fetch. If there were 2 txs, and the second has 1000 inputs, we would still want to enter here.

  150. in src/inputfetcher.h:198 in 912f26b81e outdated
     193 | +            return;
     194 | +        }
     195 | +
     196 | +        // Set the db and cache to use for this block.
     197 | +        m_db = &db;
     198 | +        m_cache = &cache;
    


    l0rinc commented at 3:38 PM on September 29, 2025:

    can we avoid mutating the state in a multithreaded class for safety? It's easier to follow along knowing that the class is immutable and the state is passed along...


    andrewtoth commented at 2:43 PM on October 2, 2025:

    I don't think we can do that. We need to set these here for other threads to read. These are only read from other threads, never written to. We also only read from other threads after the main thread has released the counting_semaphore, so we know the pointers are synced across the threads.


    l0rinc commented at 2:48 PM on October 2, 2025:

    I really dislike that, will try to come up with a lock-free version later (maybe next week)

  151. in src/test/inputfetcher_tests.cpp:51 in c705c6f1f1 outdated
      46 | +
      47 | +        return block;
      48 | +    }
      49 | +
      50 | +public:
      51 | +    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    


    l0rinc commented at 3:40 PM on September 29, 2025:

    can we add these tests before the multithreading change - i.e. have a single-threaded InputFetcher first, add tests and benchmarks next, and do the actual multithreading as a very last step? That would construct the whole scenario in smaller steps, proving that every change is safe.


    andrewtoth commented at 3:57 PM on October 4, 2025:

    Done.

  152. in src/test/inputfetcher_tests.cpp:55 in c705c6f1f1 outdated
      50 | +public:
      51 | +    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
      52 | +                             TestOpts opts = {})
      53 | +        : BasicTestingSetup{chainType, opts}
      54 | +    {
      55 | +        SeedRandomForTest(SeedRand::ZEROS);
    


    l0rinc commented at 3:41 PM on September 29, 2025:

    I understand why benchmarks need predictability, but wouldn't we want variance for tests?


    andrewtoth commented at 6:04 PM on October 11, 2025:

    Changed to FIXED_SEED.

  153. in src/test/inputfetcher_tests.cpp:171 in c705c6f1f1 outdated
     166 | +
     167 | +class ThrowCoinsView : public CCoinsView
     168 | +{
     169 | +    std::optional<Coin> GetCoin(const COutPoint& outpoint) const override
     170 | +    {
     171 | +        throw std::runtime_error("database error");
    


    l0rinc commented at 3:44 PM on September 29, 2025:

    consider std::terminate


    andrewtoth commented at 2:44 PM on October 2, 2025:

    Err, we want to throw a runtime error here to test the try/catch in the inputfetcher.

  154. in src/test/fuzz/inputfetcher.cpp:150 in 1faf0595a5 outdated
     145 | +            // Check any newly added coins in the cache are the same as the db
     146 | +            const auto& coin{cache.AccessCoin(outpoint)};
     147 | +            assert(!coin.IsSpent());
     148 | +            assert(coin.fCoinBase == (*maybe_coin).fCoinBase);
     149 | +            assert(coin.nHeight == (*maybe_coin).nHeight);
     150 | +            assert(coin.out == (*maybe_coin).out);
    


    l0rinc commented at 3:47 PM on September 29, 2025:
                assert(coin.fCoinBase == maybe_coin->fCoinBase);
                assert(coin.nHeight == maybe_coin->nHeight);
                assert(coin.out == maybe_coin->out);
    

    andrewtoth commented at 6:04 PM on October 11, 2025:

    Done!

  155. l0rinc commented at 4:16 PM on September 29, 2025: contributor

    I have re-reviewed the changes again lightly and did quite a few benchmarks on different platforms. There were a lot of surprises, see my measurements:

    <details> <summary>rpi5-16 IBD from local node & reindex-chainstate seems ~27% faster</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        29732.695 s               [User: 60441.083 s, System: 5856.247 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        37896.082 s               [User: 60968.810 s, System: 7062.414 s]
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.27          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    Retested it separately with:

    # cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -connect=rpi5-16-3.local
    
    cat ../BitcoinData/debug.log | egrep 'height=0|height=916000'
    2025-09-25T17:03:06Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date='2009-01-03T18:15:05Z' progress=0.000000 cache=0.3MiB(0txo)
    2025-09-26T01:02:56Z UpdateTip: new best=000000000000000000003ca9748080f4c3d1230ba9fa4bed66be6ded05f9b6e6 height=916000 version=0x2000e000 log2_work=95.840381 tx=1246369867 date='2025-09-23T07:22:08Z' progress=0.998966 cache=367.7MiB(2821413txo)
    7h 59m 50s
    

    </details>

    Doing the same on an Intel i9 with SSD shows similar results

    <details> <summary>i9 with SSD, IBD from real peers/reindex-chainstate seems 24%/25% faster for default memory, done in 6h/3.5h</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        12698.166 s               [User: 33794.242 s, System: 3015.471 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        15928.708 s               [User: 28382.232 s, System: 2308.299 s]
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.25          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    and

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=916000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 3 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (mean ± σ):     21484.108 s ± 1187.956 s    [User: 42976.944 s, System: 4356.289 s]
      Range (min … max):   20112.390 s … 22175.559 s    3 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (mean ± σ):     26589.393 s ± 1171.370 s    [User: 36011.245 s, System: 3193.496 s]
      Range (min … max):   25607.731 s … 27886.055 s    3 runs
    
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.24 ±  0.09  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    </details>

    Increasing the memory decreases the difference:

    <details> <summary>i9 reindex-chainstate seems ~9% faster with increased memory (dbcache=4500), done in 3.3h</summary>

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=915961; DBCACHE=4500; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        11801.704 s               [User: 20216.598 s, System: 1181.879 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        12916.432 s               [User: 17150.579 s, System: 747.711 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.09          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)

    </details>
    
    Note that the difference between small and big dbcache has also shrunk, from 23% to 7.5%!
    
    Checked the same on an i7 with HDD; it seems the speedup is best on non-rotating disks, so maybe we could consider reducing the parallelism for rotating ones:
    
    <details>
    <summary>i7 with HDD, IBD/reindex-chainstate seems ~16% faster for default memory</summary>
    
    

    COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
    STOP=916000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
        grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    688c03597a validation: fetch block inputs in parallel
    af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        35766.853 s               [User: 39688.514 s, System: 2853.808 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        41355.517 s               [User: 35667.321 s, System: 2872.506 s]

    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
            1.16          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)


    
    </details>
    
    Checking the same on my M4 max laptop was the most surprising:
    <details>
    <summary>M4 max with SSD, IBD/reindex-chainstate seems ~2.9x faster for default memory</summary>
    
    

    STOP=916000; DBCACHE=450; \
    DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
    hyperfine \
      --sort command \
      --runs 1 \
      --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"

    Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        20658.825 s               [User: 19679.918 s, System: 4763.490 s]
    Benchmark 1b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
      Time (abs ≡):        20186.312 s               [User: 19481.126 s, System: 4716.728 s]

    Benchmark 2: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7131.178 s               [User: 17244.133 s, System: 12850.427 s]
    Benchmark 2b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7180.762 s               [User: 17430.360 s, System: 12949.731 s]

    Relative speed comparison
            2.90          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
            1.00          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)

    
    It was hard to believe this was true, so I re-ran it a few times, and it was consistent.
    
    </details>
    
    I have tried -par=32 on my laptop as well - exactly the same speed:
    <details>
    <summary>-par=32</summary>
    
    

    STOP=916000; DBCACHE=450; \
    DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
    hyperfine \
      --sort command \
      --runs 1 \
      --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32"

    Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
      Time (abs ≡):        7109.626 s               [User: 17210.848 s, System: 12938.964 s]

    note, the commit had:
    
      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, 10 * MAX_SCRIPTCHECK_THREADS)},
    
    </details>
    
  156. l0rinc changes_requested
  157. l0rinc commented at 4:17 PM on September 29, 2025: contributor

    <duplicate>

  158. in src/inputfetcher.h:86 in 688c03597a
      81 | +
      82 | +    //! Internal function that does the fetching from disk.
      83 | +    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
      84 | +    {
      85 | +        auto local_batch_size{0};
      86 | +        auto end_index{0};
    


    l0rinc commented at 5:41 PM on September 29, 2025:

    I think we should add exact types here to make sure calculations like end_index - local_batch_size can't underflow


    andrewtoth commented at 2:44 PM on October 2, 2025:

    I rewrote this part, these are gone now.

  159. in src/bench/inputfetcher.cpp:47 in 688c03597a outdated
      42 | +    DelayedCoinsView db(DELAY);
      43 | +    CCoinsViewCache cache(&db);
      44 | +
      45 | +    // The main thread should be counted to prevent thread oversubscription, and
      46 | +    // to decrease the variance of benchmark results.
      47 | +    const auto worker_threads_num{GetNumCores() - 1};
    


    l0rinc commented at 5:44 PM on September 29, 2025:

    I'm a bit conflicted here: this way we're all measuring something slightly different - which is especially problematic since the work here isn't even CPU bound. What if we did a min of ncpu and 4?

  160. Raimo33 commented at 2:23 PM on October 1, 2025: contributor

    Concept ACK

  161. andrewtoth force-pushed on Oct 2, 2025
  162. andrewtoth force-pushed on Oct 2, 2025
  163. andrewtoth force-pushed on Oct 2, 2025
  164. andrewtoth force-pushed on Oct 2, 2025
  165. andrewtoth force-pushed on Oct 2, 2025
  166. andrewtoth commented at 3:47 PM on October 2, 2025: contributor

    Updated the input fetcher significantly:

    • uses counting_semaphores to synchronize threads instead of mutex + condvar.
    • stores tx + vin indexes in global vector instead of copying the COutPoints. The COutPoints are read from a global CBlock pointer.
    • The fetch queue counter is an atomic int instead of a mutex guarded int.
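
    The claiming scheme described above can be sketched roughly as follows. This is an illustrative stand-in, not the PR's actual code: the member names, the `int` "inputs", and the thread count are all made up to show how a semaphore wake-up plus an atomic `fetch_add` counter distributes work without a mutex.

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstddef>
    #include <semaphore>
    #include <thread>
    #include <vector>

    int main()
    {
        std::vector<int> inputs(1000, 0);        // one slot per block input
        std::atomic<std::size_t> counter{0};     // next input index to claim
        std::counting_semaphore<> start{0};      // wakes workers when a block is posted
        std::atomic<std::size_t> fetched{0};

        auto worker = [&] {
            start.acquire();                     // park until work is posted
            for (;;) {
                // fetch_add hands each index to exactly one thread, lock-free
                const std::size_t i{counter.fetch_add(1, std::memory_order_relaxed)};
                if (i >= inputs.size()) break;   // queue drained
                inputs[i] = 1;                   // stand-in for the db fetch
                fetched.fetch_add(1, std::memory_order_relaxed);
            }
        };

        std::vector<std::thread> threads;
        for (int t{0}; t < 4; ++t) threads.emplace_back(worker);
        start.release(4);                        // kick off all workers at once
        for (auto& th : threads) th.join();

        // every input was claimed exactly once across the four threads
        assert(fetched.load() == inputs.size());
        return 0;
    }
    ```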
  167. andrewtoth force-pushed on Oct 3, 2025
  168. andrewtoth force-pushed on Oct 3, 2025
  169. andrewtoth force-pushed on Oct 3, 2025
  170. andrewtoth commented at 5:33 PM on October 3, 2025: contributor

    Removed m_batch_size. Each thread now increments the atomic counter by 1.

  171. andrewtoth force-pushed on Oct 3, 2025
  172. l0rinc commented at 12:53 AM on October 4, 2025: contributor

    The latest version seems very promising, I like that the algorithm is getting simpler. I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost: since we already have everything in the cache, it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that.

  173. andrewtoth force-pushed on Oct 4, 2025
  174. l0rinc commented at 6:52 PM on October 7, 2025: contributor

    Compared it against master on a Raspberry Pi 5 synchronizing from real peers for realism, ran it twice for good measure until 917000 blocks with dbcache 450:

    This isn't the latest version of the PR, but should likely be representative anyway.

    <details> <summary>First run: 19% faster, finished IBD in 13h:11m | IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
    STOP=917000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    a8f9a80675 validation: fetch block inputs in parallel
    f0dc19f168 coins: allow emplacing non-dirty coins internally
    
    IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
      Time (abs ≡):        47485.682 s               [User: 79615.847 s, System: 9374.261 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
      Time (abs ≡):        56374.354 s               [User: 78807.079 s, System: 10196.290 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
            1.19          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    

    </details>

    <details> <summary>Second run: 21% faster, finished IBD in 12h:45m | IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
    STOP=917000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    a8f9a80675 validation: fetch block inputs in parallel
    f0dc19f168 coins: allow emplacing non-dirty coins internally
    
    IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
      Time (abs ≡):        45907.874 s               [User: 81006.258 s, System: 10039.919 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
      Time (abs ≡):        55612.464 s               [User: 81830.349 s, System: 11913.754 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
            1.21          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    

    </details>

    The variance between the runs is 1% for master and 3% for the PR, indicating that we're likely nearing the network bandwidth limitations.

  175. andrewtoth force-pushed on Oct 11, 2025
  176. andrewtoth force-pushed on Oct 11, 2025
  177. andrewtoth force-pushed on Oct 14, 2025
  178. andrewtoth commented at 8:28 PM on October 14, 2025: contributor

    I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost - since we already have everything in the cache, so it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that.

    @l0rinc There is already quite a lot to review here, and your benchmarks (and mine) show very promising results. So, I would prefer to keep this idea as a follow-up. We can do isolated benchmarks with your suggested change afterwards and propose an improvement accordingly.

  179. l0rinc commented at 9:04 PM on October 14, 2025: contributor

    I'm fine with doing that in a follow-up if you think it's too complicated (though it's likely quite simple, we can just track the very first cache miss and always prefetch after that - that heuristic would even survive node restarts. Maybe we need to skip the Bip30 values though, but it's just a heuristic anyway). But I don't see this PR as close to being final yet - do you? I still want to review it thoroughly, I don't think we should ossify yet :)
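
    The suggested heuristic could be as small as a latch that flips on the first cache miss. A hypothetical sketch, not code from the PR; the class and method names here are invented for illustration:

    ```cpp
    #include <cassert>

    // Hypothetical gate for the suggested heuristic: skip prefetching while
    // the cache is still known to contain every coin created so far (i.e.
    // before the first miss / flush-and-erase); always prefetch afterwards.
    class PrefetchGate
    {
        bool m_seen_miss{false};

    public:
        void NoteCacheMiss() { m_seen_miss = true; }
        bool ShouldPrefetch() const { return m_seen_miss; }
    };

    int main()
    {
        PrefetchGate gate;
        assert(!gate.ShouldPrefetch()); // cold start: cache holds everything
        gate.NoteCacheMiss();           // first miss observed
        assert(gate.ShouldPrefetch());  // prefetch from now on
        return 0;
    }
    ```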

  180. DrahtBot added the label Needs rebase on Oct 15, 2025
  181. andrewtoth force-pushed on Oct 16, 2025
  182. andrewtoth force-pushed on Oct 16, 2025
  183. DrahtBot removed the label Needs rebase on Oct 16, 2025
  184. andrewtoth commented at 2:07 AM on October 16, 2025: contributor

    Rebased due to conflicts. Removed the first commit. We don't need to modify EmplaceCoinInternalDANGER, we can just insert the coins as dirty into the cache. They will be set dirty when they are spent anyway. If a block fails validation inside ConnectBlock, then the dirty coins will just rewrite the same value to the db on the next flush or sync.

  185. in src/inputfetcher.h:35 in 063946d6bd outdated
      30 | +private:
      31 | +    /**
      32 | +     * The flattened indexes to each input in the block. The first item in the
      33 | +     * pair is the index of the tx, and the second is the index of the vin.
      34 | +     */
      35 | +    std::vector<std::pair<size_t, size_t>> m_inputs{};
    


    l0rinc commented at 11:56 PM on October 16, 2025:

    As far as I understood the tx index and the vin index cannot exceed the limits of a uint32_t, see https://github.com/bitcoin/bitcoin/blob/e744fd1249bf9577274614eaf3997bf4bbb612ff/src/primitives/transaction.h#L32

    Please consider std::vector<std::pair<uint32_t, uint32_t>> instead, which would likely halve the memory footprint. Based on https://godbolt.org/z/Wb918bWaM it seems to me this layout allows modern compilers to coalesce the two member accesses into a single, optimal 64-bit load from memory.

    Packing it into a single uint64_t seems to be the best, but that's completely unreadable, so I'd go with the above std::pair<uint32_t, uint32_t>.
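
    The footprint difference and the readability cost of packing can both be shown in a few lines. This is an illustrative sketch, assuming a typical 64-bit platform where `size_t` is 8 bytes; the `Pack`/`TxOf`/`VinOf` helpers are invented names, not anything from the PR:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <utility>

    // Packing the tx/vin indexes into one word: same 8-byte footprint as
    // std::pair<uint32_t, uint32_t>, but much less readable at use sites.
    uint64_t Pack(uint32_t tx, uint32_t vin) { return (uint64_t{tx} << 32) | vin; }
    uint32_t TxOf(uint64_t p) { return static_cast<uint32_t>(p >> 32); }
    uint32_t VinOf(uint64_t p) { return static_cast<uint32_t>(p); }

    int main()
    {
        // Two uint32_t halve the per-input footprint of a size_t pair
        // (16 bytes on a typical 64-bit platform).
        static_assert(sizeof(std::pair<uint32_t, uint32_t>) == 8);
        assert(sizeof(std::pair<std::size_t, std::size_t>) == 2 * sizeof(std::size_t));

        const uint64_t p{Pack(413567, 42)};
        assert(TxOf(p) == 413567 && VinOf(p) == 42); // lossless round-trip
        return 0;
    }
    ```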


    andrewtoth commented at 1:59 PM on October 28, 2025:

    Done, using uint32_t in the Input struct now; not sure it will have much effect on the struct layout though.

  186. in src/validation.cpp:3142 in 64de911053 outdated
    3136 | @@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3139 |      {
    3140 | +        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    3141 | +
    3142 |          CCoinsViewCache view(&CoinsTip());
    


    l0rinc commented at 2:02 PM on October 18, 2025:

    I have played with this to see if we can construct the new cache layer before the fetcher, so that it populates the new layer instead of reading & writing to the old one - seemed like a cleaner separation.

    But unfortunately we would need access to both in-memory cache layers inside FetchInputs in that case - to read for presence from the old cache and to write the newly fetched values to the new one (which will be flushed together with the new outputs when exiting this scope).

    But given that we're adding these missing entries only to spend them a moment later in the other cache layer (and to avoid having all of the missing ones require two-hop-lookups) we should still try reading from the stable cache and writing to the temporary one.

    Collecting the missing inputs to a separate cache would also help with benchmarking and testing since the underlying cache would only be modified once block connection finishes - while the complete diff would be in the top cache layer. We also shouldn't add the entries to the cache if the block fails validation.


    andrewtoth commented at 1:30 PM on October 28, 2025:

    Done.

  187. in src/test/coinsviewcacheasync_tests.cpp:41 in 64de911053 outdated
      36 | +
      37 | +        Txid prevhash{Txid::FromUint256(uint256(1))};
      38 | +
      39 | +        for (auto i{1}; i < num_txs; ++i) {
      40 | +            CMutableTransaction tx;
      41 | +            const auto txid{m_rng.randbool() ? Txid::FromUint256(uint256(i)) : prevhash};
    


    l0rinc commented at 4:14 PM on October 18, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    Maybe we can mix it up a bit by something like

    if (m_rng.randbool()) {
        prevhash = tx.GetHash(); // TODO This can theoretically simulate double spends
    }
    

    andrewtoth commented at 2:04 PM on October 28, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    I don't think that's what it means. The same thread could fetch two inputs in a row.

    Maybe we can mix it up a bit by something like

    So we can use the same prevhash in different txs? What is the benefit of this?

  188. in src/test/inputfetcher_tests.cpp:153 in 64de911053 outdated
     148 | +}
     149 | +
     150 | +BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, InputFetcherTest)
     151 | +{
     152 | +    const auto& block{getBlock()};
     153 | +    for (auto i{0}; i < 3; ++i) {
    


    l0rinc commented at 4:34 PM on October 18, 2025:

    What's the point of the loop in these, is there a state we're changing in each iteration?


    andrewtoth commented at 1:31 PM on October 28, 2025:

    The InputFetcher is stateful, so this is making sure previous state does not leak into the next fetch phase.

  189. in src/test/fuzz/inputfetcher.cpp:47 in 64de911053 outdated
      42 | +{
      43 | +public:
      44 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      45 | +    {
      46 | +        abort();
      47 | +    }
    


    l0rinc commented at 4:52 PM on October 18, 2025:

    nit:

        std::optional<Coin> GetCoin(const COutPoint&) const override { std::abort(); }
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  190. in src/test/fuzz/inputfetcher.cpp:26 in 64de911053 outdated
      21 | +class DbCoinsView : public CCoinsView
      22 | +{
      23 | +private:
      24 | +    DbMap& m_map;
      25 | +
      26 | +public:
    


    l0rinc commented at 4:53 PM on October 18, 2025:

    nit: in test code I'd strive for simpler code instead of needlessly "safe"

    struct DbCoinsView : CCoinsView
    {
        DbMap& m_map;
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  191. in src/bench/inputfetcher.cpp:29 in 64de911053 outdated
      24 | +    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
      25 | +
      26 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      27 | +    {
      28 | +        UninterruptibleSleep(m_delay);
      29 | +        return Coin{};
    


    l0rinc commented at 4:54 PM on October 18, 2025:

    GetCoin shouldn't return spent entries


    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  192. in src/bench/inputfetcher.cpp:24 in 64de911053 outdated
      19 | +{
      20 | +private:
      21 | +    std::chrono::milliseconds m_delay;
      22 | +
      23 | +public:
      24 | +    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
    


    l0rinc commented at 4:54 PM on October 18, 2025:

    this seems overly general to me, I think we can inline the delay for now


    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  193. in src/inputfetcher.h:153 in 64de911053 outdated
     148 | +            return;
     149 | +        }
     150 | +
     151 | +        m_db = &db;
     152 | +        m_cache = &cache;
     153 | +        m_block = &block;
    


    l0rinc commented at 4:55 PM on October 18, 2025:

    As mentioned before, I really dislike these lines, a fetcher shouldn't change the internal state (especially since they're const). Since we need a continuously running threads, can we package these to avoid state mutations? This would likely be solved by sending these off to a ThreadPool instance.


    andrewtoth commented at 1:36 PM on October 28, 2025:

    Indeed, this would be gracefully solved with #33689. We could pass all state into the worker threads via lambda capture.

  194. in src/test/fuzz/inputfetcher.cpp:120 in 64de911053 outdated
     115 | +
     116 | +            prevhash = tx.GetHash();
     117 | +            block.vtx.push_back(MakeTransactionRef(tx));
     118 | +        }
     119 | +
     120 | +        fetcher.FetchInputs(cache, db, block);
    


    l0rinc commented at 5:00 PM on October 18, 2025:

    We shouldn't test this with invalid (empty) blocks:

            if (block.vtx.empty()) continue;
            fetcher.FetchInputs(cache, db, block);
    

    andrewtoth commented at 1:37 PM on October 28, 2025:

    Why not? The InputFetcher should not make assumptions about the structure of the block being passed in.


    l0rinc commented at 4:30 PM on October 28, 2025:

    Even consensus invalid ones? Or did I misunderstand the context here?


    andrewtoth commented at 4:38 PM on October 28, 2025:

    Yeah, InputFetcher doesn't know if a block is consensus valid or not yet. It hasn't passed ConnectBlock yet before entering here.

  195. in src/bench/inputfetcher.cpp:39 in 64de911053 outdated
      34 | +
      35 | +static void InputFetcherBenchmark(benchmark::Bench& bench)
      36 | +{
      37 | +    DataStream stream{benchmark::data::block413567};
      38 | +    CBlock block;
      39 | +    stream >> TX_WITH_WITNESS(block);
    


    l0rinc commented at 5:08 PM on October 18, 2025:

    nit:

        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    

    andrewtoth commented at 2:00 PM on October 28, 2025:

    Done.

  196. in src/bench/inputfetcher.cpp:45 in 64de911053 outdated
      40 | +
      41 | +    DelayedCoinsView db(DELAY);
      42 | +    CCoinsViewCache cache(&db);
      43 | +
      44 | +    // The main thread should be counted to prevent thread oversubscription, and
      45 | +    // to decrease the variance of benchmark results.
    


    l0rinc commented at 5:09 PM on October 18, 2025:

    "should be counted" is the reason for the "- 1"?

  197. in src/bench/inputfetcher.cpp:47 in 64de911053 outdated
      42 | +    CCoinsViewCache cache(&db);
      43 | +
      44 | +    // The main thread should be counted to prevent thread oversubscription, and
      45 | +    // to decrease the variance of benchmark results.
      46 | +    const auto worker_threads_num{GetNumCores() - 1};
      47 | +    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    


    l0rinc commented at 5:12 PM on October 18, 2025:

    nit: if we keep the processor count here (which kinda' makes the benchmark measure different things on different platforms, but I don't really have a better idea, unless we assume that even the simplest machines where this matters (e.g. rpi4) already have at least 4 threads - and since this isn't even CPU bound, we shouldn't pretend that it does):

        const auto worker_threads_num{size_t(GetNumCores() - 1)};
        const InputFetcher fetcher{worker_threads_num};
    

    or

        const InputFetcher fetcher{/*max_thread_count=*/4};
    

    andrewtoth commented at 1:38 PM on October 28, 2025:

    Why don't we want different benchmarks on different machines? All benchmarks are subtly different depending on the host machine.


    l0rinc commented at 4:32 PM on October 28, 2025:

    Yeah, but we want to understand where the differences are coming from, otherwise we'd have the "faster-on-my-machine" syndrome. If you disagree, just resolve the issue.

  198. in src/bench/inputfetcher.cpp:50 in 64de911053 outdated
      45 | +    // to decrease the variance of benchmark results.
      46 | +    const auto worker_threads_num{GetNumCores() - 1};
      47 | +    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
      48 | +
      49 | +    bench.run([&] {
      50 | +        const auto ok{cache.Flush()};
    


    l0rinc commented at 5:17 PM on October 18, 2025:

    Doesn't this change the behavior of the benchmark after the first iteration?


    andrewtoth commented at 1:39 PM on October 28, 2025:

    Fixed in latest iteration.

  199. in src/validation.cpp:6270 in 64de911053 outdated
    6266 | @@ -6265,6 +6267,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6267 |  
    6268 |  ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6269 |      : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6270 | +      m_input_fetcher{std::clamp<size_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 5:25 PM on October 18, 2025:

    I don't understand what 0 threads means. It likely means turn off prefetching.

    I would consider it to be a lot more intuitive if this started with 1 (and not be bound by the number of unrelated MAX_SCRIPTCHECK_THREADS since it's not script related and not even CPU bound).

    Note: chainstatemanager_snapshot_init creates it with 0 workers by default, not sure it's intended


    andrewtoth commented at 1:41 PM on October 28, 2025:

    0 threads means it is turned off, yes. If -par=1 is configured, we want to pass 0 to input fetcher to disable prefetching. Also if options.worker_threads_num was negative for some reason. Do you have a suggestion of how this can be made more clear?

  200. in src/inputfetcher.h:113 in 64de911053 outdated
     108 | +                }
     109 | +                if (m_cache->HaveCoinInCache(outpoint)) {
     110 | +                    continue;
     111 | +                }
     112 | +                if (auto coin{m_db->GetCoin(outpoint)}; coin) {
     113 | +                    m_coins[thread_index].emplace_back(outpoint, std::move(*coin));
    


    l0rinc commented at 5:38 PM on October 18, 2025:

    It seems to me that since the original m_coins is never written by different threads, we shouldn't have a false sharing problem here - right?


    andrewtoth commented at 1:43 PM on October 28, 2025:

    Possibly. This is no longer relevant in the current implementation.

  201. in src/inputfetcher.h:78 in 64de911053 outdated
      73 | +    const CBlock* m_block{nullptr};
      74 | +
      75 | +    std::vector<std::thread> m_worker_threads;
      76 | +    std::counting_semaphore<> m_start_semaphore{0};
      77 | +    std::counting_semaphore<> m_complete_semaphore{0};
      78 | +    std::atomic<bool> m_request_stop{false};
    


    l0rinc commented at 5:41 PM on October 18, 2025:

    This style of dynamic work-stealing seems too complicated for such a uniform problem - static slicing would likely be a lot simpler and I would expect it to perform equally well.

    I also realize that some of these are needed to avoid recreating the threads for every call. Currently the parallelism and the thread recreation are done in a single commit - could we first implement multithreading with thread recreation, and do the thread reuse as a separate concern (through a ThreadPool with a dedicated test)?
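
    Static slicing as suggested could look roughly like this: each thread gets one contiguous index range up front, so no atomic claim counter is needed. Illustrative only; the function name and signature are invented for the sketch:

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <utility>

    // Thread t of n_threads gets [begin, end) of the n_inputs indexes; the
    // first (n_inputs % n_threads) threads receive one extra element.
    std::pair<std::size_t, std::size_t> Slice(std::size_t n_inputs,
                                              std::size_t n_threads,
                                              std::size_t t)
    {
        const std::size_t base{n_inputs / n_threads};
        const std::size_t extra{n_inputs % n_threads};
        const std::size_t begin{t * base + std::min(t, extra)};
        const std::size_t end{begin + base + (t < extra ? 1 : 0)};
        return {begin, end};
    }

    int main()
    {
        // The slices partition all 10 inputs exactly once across 3 threads.
        std::size_t covered{0};
        for (std::size_t t{0}; t < 3; ++t) {
            const auto [b, e] = Slice(10, 3, t);
            covered += e - b;
        }
        assert(covered == 10);
        return 0;
    }
    ```

    The trade-off against the PR's dynamic counter is that a static split can stall on skew: one slow slice (many cache misses) leaves the other threads idle, which is what work stealing avoids.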


    andrewtoth commented at 2:05 PM on October 28, 2025:

    It should be simpler now. I'm not sure if this comment is still valid though with the current approach. I agree if we already had a ThreadPool this would be much cleaner.

  202. in src/test/inputfetcher_tests.cpp:134 in 64de911053 outdated
     129 | +
     130 | +        // Add all inputs as spent already in cache
     131 | +        for (const auto& tx : block.vtx) {
     132 | +            for (const auto& in : tx->vin) {
     133 | +                auto outpoint{in.prevout};
     134 | +                Coin coin{}; // Not setting nValue implies spent
    


    l0rinc commented at 6:09 PM on October 18, 2025:
                    Coin coin{};
                    assert(coin.IsSpent());
    

    andrewtoth commented at 2:01 PM on October 28, 2025:

    Done.

  203. andrewtoth force-pushed on Oct 19, 2025
  204. andrewtoth commented at 3:26 PM on October 19, 2025: contributor

    Updated to use std::barrier for the completion synchronization instead of acquiring a semaphore for each thread, as suggested by @l0rinc .

  205. andrewtoth force-pushed on Oct 19, 2025
  206. andrewtoth force-pushed on Oct 19, 2025
  207. andrewtoth force-pushed on Oct 19, 2025
  208. andrewtoth force-pushed on Oct 19, 2025
  209. andrewtoth force-pushed on Oct 19, 2025
  210. in src/inputfetcher.h:96 in a07936a62c
      91 | +            m_start_semaphore.acquire();
      92 | +            if (m_request_stop.load(std::memory_order_relaxed)) {
      93 | +                return;
      94 | +            }
      95 | +            Work(thread_index);
      96 | +            [[maybe_unused]] const auto arrival_token{m_complete_barrier.arrive()};
    


    l0rinc commented at 1:23 AM on October 21, 2025:

    Would this suffice?

                (void)m_complete_barrier.arrive();
    
  211. in src/inputfetcher.h:190 in a07936a62c
     185 | +        m_input_counter.store(0, std::memory_order_relaxed);
     186 | +        m_start_semaphore.release(m_worker_threads.size());
     187 | +
     188 | +        // Have the main thread work too before we wait for other threads
     189 | +        Work(m_worker_threads.size());
     190 | +        m_complete_barrier.arrive_and_wait();
    


    l0rinc commented at 1:29 AM on October 21, 2025:

    What's the reason for not doing the completion here instead of the OnCompletion workaround?

            m_complete_barrier.arrive_and_wait();
            for (auto& coins : m_coins) {
                for (auto& [outpoint, coin] : coins) {
                    m_cache->EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
                }
                coins.clear();
            }
    
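
    For context on the two options being compared: `std::barrier`'s completion function runs exactly once per phase, in an unspecified one of the participating threads, before any of them is released; doing the merge after `arrive_and_wait` instead pins it to the main thread specifically. A minimal illustration of the phase semantics (standalone sketch, unrelated to the PR's members; note the completion function must be nothrow-invocable):

    ```cpp
    #include <atomic>
    #include <barrier>
    #include <cassert>
    #include <thread>
    #include <vector>

    std::atomic<int> completions{0};

    int main()
    {
        // Three participants, one phase: the completion lambda fires once,
        // before any arrive_and_wait call returns.
        std::barrier barrier{3, []() noexcept { completions.fetch_add(1); }};

        std::vector<std::thread> threads;
        for (int t{0}; t < 3; ++t) {
            threads.emplace_back([&] { barrier.arrive_and_wait(); });
        }
        for (auto& th : threads) th.join();

        assert(completions.load() == 1); // single phase, single completion call
        return 0;
    }
    ```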
  212. in src/inputfetcher.h:149 in a07936a62c
     144 | +        }
     145 | +    }
     146 | +
     147 | +public:
     148 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
     149 | +        : m_complete_barrier{static_cast<int32_t>(worker_thread_count + 1), OnCompletionWrapper{this}}
    


    l0rinc commented at 1:30 AM on October 21, 2025:

    std::ptrdiff_t seems more appropriate here: https://en.cppreference.com/w/cpp/thread/barrier/barrier.html

        explicit InputFetcher(size_t worker_thread_count) noexcept : m_complete_barrier{std::ptrdiff_t(worker_thread_count + 1)}
    
  213. in src/inputfetcher.h:165 in a07936a62c outdated
     160 | +    }
     161 | +
     162 | +    //! Fetch all block inputs from db, and insert into cache.
     163 | +    void FetchInputs(CCoinsViewCache& cache, const CCoinsView& db, const CBlock& block) noexcept
     164 | +    {
     165 | +        if (block.vtx.size() <= 1 || m_worker_threads.size() == 0) {
    


    l0rinc commented at 2:21 PM on October 21, 2025:

    wouldn't m_worker_threads.size() == 0 be an error?


    andrewtoth commented at 1:44 PM on October 28, 2025:

    No, we can have no worker threads, for instance if started with -par=1. In that case just disable prefetching.


    l0rinc commented at 5:58 PM on November 2, 2025:

    Not a biggy, but this also seems uncovered by the unit tests.

  214. in src/bench/inputfetcher.cpp:18 in a07936a62c outdated
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto DELAY{2ms};
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +class DelayedCoinsView : public CCoinsView
    


    l0rinc commented at 2:24 PM on October 21, 2025:

    I personally would favor going for simpler code as opposed to going for theoretically better encapsulation, similarly to current inputfetcher_tests.cpp:

    struct DelayedCoinsView : CCoinsView
    

    andrewtoth commented at 2:01 PM on October 28, 2025:

    Done.

  215. in src/bench/inputfetcher.cpp:32 in a07936a62c outdated
      27 | +    {
      28 | +        UninterruptibleSleep(m_delay);
      29 | +        return Coin{};
      30 | +    }
      31 | +
      32 | +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    


    l0rinc commented at 2:25 PM on October 21, 2025:

    We can add better assertions if we count the iterator size here:

    bool BatchWrite(CoinsViewCacheCursor& cursor, const uint256&) override
    {
        for (auto it{cursor.Begin()}; it != cursor.End(); it = cursor.NextAndMaybeErase(*it)) {
            m_write_count++;
        }
        return true;
    }
    

    So the bench could be (without flush which makes the bench runs equivalent):

    bench.run([&] {
        CCoinsViewCache block_cache{&cache};
        fetcher.FetchInputs(cache, block_cache, db, block);
        assert(db.m_write_count == 0 && cache.GetCacheSize() == 0 && block_cache.GetCacheSize() == 4599);
    });
    

    andrewtoth commented at 1:49 PM on October 28, 2025:

    When do we do a batch write though in this example?


    l0rinc commented at 4:28 PM on October 28, 2025:

    Originally we flushed, so if we still want to, we may want to extend this to assert that behavior. Or avoid flushing and simplify the bench. Or add the flushing behavior above to a test, etc.


    andrewtoth commented at 4:37 PM on October 28, 2025:

    We avoid flushing now.


    l0rinc commented at 6:08 PM on November 2, 2025:

    I think we could still add

            ankerl::nanobench::doNotOptimizeAway(&temp_cache);
            Assert(temp_cache.GetCacheSize() == 4599);
    

    to document that the benchmark can loop now (i.e. every iteration should be the same)

  216. in src/validation.cpp:3140 in a07936a62c
    3136 | @@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3139 |      {
    3140 | +        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    


    l0rinc commented at 2:30 PM on October 21, 2025:

    Given that we already create the actual temporary top cache here, it would be great if we could separate the reading and writing (read from old/big in-memory cache and write to the temporary small one):

    CCoinsViewCache* cache{&CoinsTip()};
    CCoinsViewCache new_cache{cache};
    m_chainman.FetchInputs(*cache, new_cache, CoinsDB(), *block_to_connect);
    

    This would also mean that subsequent reads come from the top-layer cache directly instead of always requiring two hops (assuming many missing coins). It also makes sense not to add inputs from blocks that fail validation - which we'd get for free this way.


    andrewtoth commented at 1:50 PM on October 28, 2025:

    Done.

  217. l0rinc commented at 3:32 PM on October 21, 2025: contributor

    I love the new structure, it's a lot easier to track the progress compared to previous versions. It's also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet! It also pairs well with other Siphash related changes that are in the pipeline.

    I have reimplemented most of it locally to make sure I can do a meaningful review, see https://github.com/l0rinc/bitcoin/pull/47 for my attempt. It's not finished, I'm still experimenting with different alternatives to make sure we can make this as simple and useful as possible (e.g. using barriers, filtering and sorting before fetch, adding dedicated ThreadPool, splitting reading and writing caches etc.), but wanted to publish my observations so far.

    My biggest remaining concern is that the threadpool should definitely be untangled from the InputFetcher logic (I hate concurrent mutability); it's an independent concern mixed with non-trivial fetcher logic. We also shouldn't parallelize based on CPU count in the first place; I don't see why we'd do that. And the ThreadPool part still needs independent tests.

    I have tried fetching everything on a single thread, and the same with sorted UTXOs, and it does seem to be ~6% faster on an SSD (I'd expected an even bigger difference on HDD, still measuring that) - but I don't have a final solution that's faster than everything else (since I could only solve it with single-threaded filtering, which is slow).

    <details> <summary>Details</summary>

    b6ccd542fd single-threaded NO sort
    c50a6bd981 single-threaded + sorted fetch
    
    reindex-chainstate | 700000 blocks | dbcache 100 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b6ccd542fdc371d3ecd90164169e3d5d7c60e82d)
      Time (abs ≡):        11322.014 s               [User: 20431.572 s, System: 1375.995 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = c50a6bd9815ebb2e0d86d5e6b679da5ac021b796)
      Time (abs ≡):        10646.049 s               [User: 19231.516 s, System: 1268.395 s]
    

    </details>

  218. andrewtoth commented at 4:55 PM on October 21, 2025: contributor

    Thank you for your review @l0rinc!

    I love the new structure, it's a lot easier to track the progress compared to previous versions. It's also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet!

    :rocket:

    we can make this as simple and useful as possible

    What else do you have in mind for this being useful? It has a very focused purpose in my view. I would love to make it as simple as possible.

    using barriers

    Done :)

    filtering and sorting before fetch

    Can you expand on why we would want this? Prefiltering on the main thread was always slower than filtering in parallel when I was testing this idea.

    adding dedicated ThreadPool

    Yes, see #26966. Hopefully we can pull that out and it can be used here. See https://github.com/andrewtoth/bitcoin/tree/test-thread-pool.

    splitting reading and writing caches

    I'm not sure what you mean by this? There is only one cache, and we read from it concurrently, then write to it on a single thread. Can you expand on how we can split the cache into two? Also, why would we want separate caches?

    My remaining biggest concern is that the threadpool should definitely be untangled from the InputFetcher logic

    Is this not accomplished by ThreadLoop, Work and OnCompletion functions? The Work function has no thread pool logic at all, it is completely independent and can be run by one thread or many. Do you have any concrete suggestions, or specific codepaths that are tangled that are concerning?

    I hate concurrent mutability

    I'm not sure I understand this. There is no concurrent mutability in my implementation. That would be undefined behavior. Can you point out the data members that are being concurrently mutated and how? Perhaps we have different definitions of concurrent mutability.

    We also shouldn't parallelize based on CPU in the first place, I don't see why we'd do that.

    This is following the logic of the check queue. Ideally we could reuse both threads in the same threadpool in the future. Do you have any concrete recommendations on how many threads we should run? Using the number of threads per CPU is yielding a 20% speed boost, so it seems like a sane enough choice.

  219. l0rinc commented at 6:02 PM on October 24, 2025: contributor

    Let me summarize our offline discussions:

    Cache hierarchy

    During block connection we're adding an extra temporary in-memory dbcache layer on top so that whatever happens during block connection doesn't end up polluting the big dbcache or leveldb. I think we should take advantage of this and use the temporary top layer to collect the missing inputs:

    • if block validation fails we can just throw it out
    • it's easier to test and benchmark, we can rerun the same operation without changing the underlying state
    • the missing values will all be fetched from the top layer now, avoiding two-hop lookups
    • since the worker threads never read from the temp cache and never write to the big cache, the main thread can also copy the needed inputs from the big in-memory cache into the temporary top layer (without locking) while the other threads fetch from the DB (those are IO bound, this is CPU bound). That would create a dedicated dbcache per block which we could eventually flush independently.

    This might need some workarounds, but it does enable new cache invalidation opportunities.

    Sorted fetch

    Since LevelDB writes (via the CDBBatch and MemTable) are already sorted and result in a significant speedup compared to inserting the values one-by-one (though likely this isn't just the effect of sorting), I thought it would make sense to experiment with sorted fetches as well. I have added an extra fetch (similarly to this PR, see https://github.com/l0rinc/bitcoin/commit/b72f67d4a88495a0222cbb9ae825daa6ea38e4df) before calling ConnectBlock so that the cache is pre-warmed, but on a single thread. This is the baseline; in the next commit, after gathering the missing values, I sort them and still fetch them from the db one-by-one, but in sorted order.

    The results indicate that sorted fetching is a lot faster, especially on lower-end devices (i7 with HDD and Rpi5 with SSD).

    <details> <summary>6% faster reindex-chainstate | 919191 blocks | dbcache 450 | rpi5-16-2</summary>

    COMMITS="7d27af98c7cf858b5ab5a02e64f89a857cc53172 ccc748b05858a9ebeb375cc1f3e7426698394470 ead8da2b33117807e27a66f814cf11cdf676d194"; \
    STOP=919191; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz
    ccc748b058 coins: prefetch inputs on single thread
    ead8da2b33 coins: sorted input prefetch
    
    reindex-chainstate | 919191 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
      Time (abs ≡):        42271.083 s               [User: 66842.872 s, System: 8003.776 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
      Time (abs ≡):        40776.988 s               [User: 67394.771 s, System: 7130.144 s]
     
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
      Time (abs ≡):        38435.516 s               [User: 64537.357 s, System: 6716.634 s]
     
    Relative speed comparison
            1.10          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
            1.06          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
    

    </details>

    (interestingly it seems this is even faster than master by a lot, we're not yet sure what's causing that since the fetches are still single-threaded)

    Similar speedup on HDD:

    <details> <summary>5% faster reindex-chainstate | 919191 blocks | dbcache 450 | i7-hdd</summary>

    STOP=919191; DBCACHE=450; \                                                                                                                   
    CC=gcc; CXX=g++; \                                                                                                                                                                  
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                                           
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                                      
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \                                  
      --sort command \                           
      --runs 1 \                                 
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                                      
      --parameter-list COMMIT ${COMMITS// /,} \                                               
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                                  
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                                  
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                                        
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                                   
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                                
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"                         
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                             
    ccc748b058 coins: prefetch inputs on single thread                                        
    ead8da2b33 coins: sorted input prefetch                                                   
    
    reindex-chainstate | 919191 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD                                      
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                                        
      Time (abs ≡):        42897.195 s               [User: 38613.661 s, System: 3154.561 s]                                                                                            
                                                 
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                                               
      Time (abs ≡):        42015.404 s               [User: 40096.242 s, System: 3180.038 s]                                                                                            
                                                 
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                                               
      Time (abs ≡):        40176.361 s               [User: 38614.897 s, System: 3047.461 s]                                                                                            
                                                 
    Relative speed comparison                    
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                               
            1.05          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                               
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)   
    

    </details>

    On a very powerful i9 with a very performant SSD, sorting isn't faster than master, but it's still a lot faster than random fetching:

    <details> <summary>4% faster reindex-chainstate | 919191 blocks | dbcache 450 | i9-ssd</summary>

    STOP=919191; DBCACHE=450; \                                                                                                      
    CC=gcc; CXX=g++; \                                                                                                                                                     
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                              
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                         
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\                                  
    hyperfine \                              
      --sort command \                       
      --runs 1 \                             
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                         
      --parameter-list COMMIT ${COMMITS// /,} \                                        
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                     
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                     
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                           
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                      
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                   
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"            
    
    7d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                      
    ccc748b058 coins: prefetch inputs on single thread                                 
    ead8da2b33 coins: sorted input prefetch                                            
    
    reindex-chainstate | 919191 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD                        
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                    
      Time (abs ≡):        20383.488 s               [User: 38259.799 s, System: 2702.188 s]                                                                               
                                             
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                    
      Time (abs ≡):        21213.645 s               [User: 39728.945 s, System: 2709.999 s]                                                                               
                                             
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                    
      Time (abs ≡):        20476.751 s               [User: 38448.254 s, System: 2597.600 s]                                                                               
                                             
    Relative speed comparison                
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)           
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)           
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)  
    

    </details>

    Theoretically this can also be affected by LevelDB's internal options.block_cache and options.block_size and iteroptions.fill_cache (currently experimenting with different values to see if it makes any difference). This is why I have suggested trying a read-only snapshot per thread: if they're reading sorted values, the previous path could theoretically be cached, which can help avoid a few lookups.

    Separate ThreadPool

    I think we should be able to use a single barrier for starting and stopping, assuming we either want to use all of the threads or none of them, something like https://github.com/l0rinc/bitcoin/commit/67e79e041b524eb1b0039ff39d4fc5b235d89b8f#diff-7ad6c5646dfd3749d2634861dfaf98e8c6fbc041a8e8eded973ff39929acffd7 or #33689. This would help completely separate the multithreaded part from the InputFetcher, which would also allow testing that critical behavior separately. It would also help minimize the mutable state; otherwise concurrent calls could theoretically swap out the work from under the started threads (so likely the m_task in my example should be atomic, I haven't given it enough thought).

    Since sorted fetching currently needs a preprocessing step, we could use the available threads for the filtering as well (single-threaded filtering is slow, as shown above): filter for missing inputs on all threads, sort the missing ones on the main thread, then spread the db fetches across all threads. That leaves the main thread free for other work (db fetching isn't really CPU bound, so there's no need to involve it), such as the outer-layer warming from the main in-memory dbcache mentioned above. An alternative to the single-threaded "fetch-missing & sort & get" is "fetch-all & sort & filter & get", which is more parallelizable. And since everything in the outer layer will be spent anyway, in the future maybe we could even flush it to the db on a background thread (given that we've just copied the items to the temp cache, after a successful flush we could simply remove them from the main cache). That would likely simplify the main dbcache's dirty and spent behaviors; it's outside the scope of this PR, but it could provide extra motivation for these changes.

  220. andrewtoth force-pushed on Oct 27, 2025
  221. andrewtoth force-pushed on Oct 27, 2025
  222. andrewtoth commented at 1:58 AM on October 27, 2025: contributor

    Thank you @l0rinc for your detailed review and suggestions! I have taken some of them. The input fetcher has now been redesigned.

    • Coins are written to the ephemeral cache that is created just to be used in ConnectBlock, instead of the main cache. This requires a new method in CCoinsViewCache - GetPossiblySpentCoinFromCache. Since we write to an empty cache instead of CoinsTip(), we could insert a Coin that exists in the db but is spent in CoinsTip(). Previously that insertion would fail since we were inserting again into CoinsTip() which would not overwrite the spent coin.
    • Because of this we can safely write to the ephemeral cache on the main thread, since the worker threads will be reading from a different cache. So the main thread does not do any fetching, but it writes fetched Coins to the cache in parallel as workers fetch them. It is a lock-free MPSC queue. The workers and main thread synchronize on an atomic Status member of each input. The workers set it to READY while the main thread spins on it until it is no longer WAITING. This is a substantial speed improvement over the previous version, and it also lets us insert Coins from the main cache into the ephemeral cache faster. So this change speeds up block connection even when there are no cache misses, and makes the workers utilize more CPU rather than just IO.
    • A single barrier is used to synchronize the threads. Both the main and worker threads call arrive_and_wait before they begin their work and after completion.

    I removed the last commit, and instead added the InputFetcher as multi-threaded from the start. It didn't make sense to split it, since the worker threads and the main thread now do different things.

    I haven't been able to effectively utilize a sorting strategy. Sorting the inputs before fetching doesn't seem to have a benefit in the multi-threaded approach; the overhead of copying the COutPoints and sorting them always outweighed the gains. I think the parallel fetching dominates any speedup achieved by sorted fetching anyway.

  223. andrewtoth force-pushed on Oct 27, 2025
  224. l0rinc commented at 9:48 PM on October 27, 2025: contributor

    Looks like I forgot to push the comments last time - that explains your surprise. I pushed all of them, many are out of date, please just resolve them. Sorry for the confusion.

    Since I'm reposting these old comments (since commented from the main GitHub view), lemme' post a recently finished Rpi4 benchmark showing a 31% speedup for an older push \:D/

    <details> <summary>31% faster reindex-chainstate | 919191 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD</summary>

    COMMITS="063946d6bd78035276d12e070a208d84492ac5cd 64de91105312d36dadb5f71ec01fc6af9b14da69"; \
    STOP=919191; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    063946d6bd coins: add inputfetcher
    64de911053 coins: fetch coins on parallel threads
    
    reindex-chainstate | 919191 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
      Time (abs ≡):        329713.086 s               [User: 172834.637 s, System: 87207.916 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
      Time (abs ≡):        251976.600 s               [User: 177923.005 s, System: 116804.580 s]
     
    Relative speed comparison
            1.31          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
    

    </details>

  225. andrewtoth force-pushed on Oct 28, 2025
  226. in src/bench/inputfetcher.cpp:28 in 2aa5103481 outdated
      23 | +        Coin coin{};
      24 | +        coin.out.nValue = 1;
      25 | +        return coin;
      26 | +    }
      27 | +
      28 | +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    


    l0rinc commented at 4:40 PM on October 28, 2025:

    nit: we could throw now to indicate we don't have unrepeatable side-effects


    andrewtoth commented at 11:08 PM on November 2, 2025:

    We could just get rid of this line now.


    andrewtoth commented at 11:31 PM on November 2, 2025:

    Got rid of it.

  227. in src/bench/inputfetcher.cpp:18 in 2aa5103481 outdated
      13 | +#include <util/time.h>
      14 | +
      15 | +static constexpr auto DELAY{2ms};
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +struct DelayedCoinsView : CCoinsView
    


    l0rinc commented at 4:40 PM on October 28, 2025:

    Nit: structs are now formatted consistently on the same line


    andrewtoth commented at 11:32 PM on November 2, 2025:

    Done.


    l0rinc commented at 8:44 AM on November 3, 2025:

    There are other ones that I didn't mention, reformatted the change, please take the ones that make sense:

    <details> <summary>Details</summary>

    diff --git a/src/coins.cpp b/src/coins.cpp
    index 2ef2e36ccc..baac1a32b5 100644
    --- a/src/coins.cpp
    +++ b/src/coins.cpp
    @@ -173,7 +173,8 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
         return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     }
     
    -std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const noexcept {
    +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept
    +{
         if (auto it{cacheCoins.find(outpoint)}; it != cacheCoins.end()) return it->second.coin;
         return std::nullopt;
     }
    diff --git a/src/coins.h b/src/coins.h
    index e7d28ace97..02c3ea9e15 100644
    --- a/src/coins.h
    +++ b/src/coins.h
    @@ -407,7 +407,7 @@ public:
          * Used in InputFetcher to make sure we do not add a coin from the backing
          * view when it is spent in the cache but not yet flushed to the parent.
          */
    -    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const noexcept;
    +    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept;
     
         /**
          * Return a reference to Coin in the cache, or coinEmpty if not found. This is
    diff --git a/src/inputfetcher.h b/src/inputfetcher.h
    index a8a3f4d1ad..74c655caf1 100644
    --- a/src/inputfetcher.h
    +++ b/src/inputfetcher.h
    @@ -46,8 +46,8 @@ private:
         struct Input {
             enum class Status : uint8_t {
                 WAITING, // The coin has not been fetched yet
    -            READY, // The coin has been fetched and is ready to be inserted into the cache
    -            FAILED, // The coin failed to be fetched
    +            READY,   // The coin has been fetched and is ready to be inserted into the cache
    +            FAILED,  // The coin failed to be fetched
                 SKIPPED, // The coin is created and spent in the same block so cannot be fetched
             };
     
    diff --git a/src/test/fuzz/inputfetcher.cpp b/src/test/fuzz/inputfetcher.cpp
    index cd2a0f5c68..70f5153912 100644
    --- a/src/test/fuzz/inputfetcher.cpp
    +++ b/src/test/fuzz/inputfetcher.cpp
    @@ -18,8 +18,7 @@
     
     using DbMap = std::map<const COutPoint, std::pair<std::optional<const Coin>, bool>>;
     
    -struct DbCoinsView : CCoinsView
    -{
    +struct DbCoinsView : CCoinsView {
         DbMap& m_map;
         DbCoinsView(DbMap& map) noexcept : m_map(map) {}
     
    @@ -35,8 +34,7 @@ struct DbCoinsView : CCoinsView
         }
     };
     
    -struct NoAccessCoinsView : CCoinsView
    -{
    +struct NoAccessCoinsView : CCoinsView {
         std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
     };
     
    @@ -49,7 +47,8 @@ FUZZ_TARGET(inputfetcher)
             fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
         InputFetcher fetcher{worker_threads};
     
    -    LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000) {
    +    LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000)
    +    {
             CBlock block;
             Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
     
    @@ -61,13 +60,13 @@ FUZZ_TARGET(inputfetcher)
             NoAccessCoinsView back;
             CCoinsViewCache main_cache(&back);
     
    -        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000) {
    +        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10000)
    +        {
                 CMutableTransaction tx;
     
    -            LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10) {
    -                const auto txid{fuzzed_data_provider.ConsumeBool()
    -                    ? Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))
    -                    : prevhash};
    +            LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 10)
    +            {
    +                const auto txid{fuzzed_data_provider.ConsumeBool() ? Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider)) : prevhash};
                     const auto index{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
                     const COutPoint outpoint(txid, index);
     
    @@ -87,8 +86,8 @@ FUZZ_TARGET(inputfetcher)
                         maybe_coin = std::nullopt;
                     }
                     db_map.try_emplace(outpoint, std::make_pair(
    -                    maybe_coin,
    -                    fuzzed_data_provider.ConsumeBool()));
    +                                                 maybe_coin,
    +                                                 fuzzed_data_provider.ConsumeBool()));
     
                     // Add the coin to the cache
                     if (fuzzed_data_provider.ConsumeBool()) {
    diff --git a/src/test/inputfetcher_tests.cpp b/src/test/inputfetcher_tests.cpp
    index 33fb8c6cb0..83f3d19432 100644
    --- a/src/test/inputfetcher_tests.cpp
    +++ b/src/test/inputfetcher_tests.cpp
    @@ -48,7 +48,7 @@ private:
     
     public:
         explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    -                             TestOpts opts = {})
    +                              TestOpts opts = {})
             : BasicTestingSetup{chainType, opts}
         {
             SeedRandomForTest(SeedRand::FIXED_SEED);
    @@ -200,8 +200,7 @@ BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, InputFetcherTest)
         }
     }
     
    -struct ThrowCoinsView : CCoinsView
    -{
    +struct ThrowCoinsView : CCoinsView {
         std::optional<Coin> GetCoin(const COutPoint&) const override
         {
             throw std::runtime_error("database error");
    diff --git a/src/validation.h b/src/validation.h
    index 26141305cb..2ff7c5ef6d 100644
    --- a/src/validation.h
    +++ b/src/validation.h
    @@ -1341,7 +1341,8 @@ public:
         void RecalculateBestHeader() EXCLUSIVE_LOCKS_REQUIRED(::cs_main);
     
         CCheckQueue<CScriptCheck>& GetCheckQueue() { return m_script_check_queue; }
    -    void FetchInputs(CCoinsViewCache& temp_cache, const CCoinsViewCache& main_cache, const CCoinsView& db, const CBlock& block) noexcept {
    +    void FetchInputs(CCoinsViewCache& temp_cache, const CCoinsViewCache& main_cache, const CCoinsView& db, const CBlock& block) noexcept
    +    {
             m_input_fetcher.FetchInputs(temp_cache, main_cache, db, block);
         }
    

    </details>


    andrewtoth commented at 10:15 PM on November 4, 2025:

    Took the formats, thanks!

  228. l0rinc commented at 8:40 PM on October 29, 2025: contributor

    I'd say it's time to update the PR description:

    <details> <summary>24% faster reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85 2aa510348143521a14146e41b5cf87cb3e60b29e"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log; \
                  grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    cb0fdfdf37 coins: add inputfetcher
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (abs ≡):        40539.887 s               [User: 69358.879 s, System: 7393.185 s]
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
      Time (abs ≡):        32768.672 s               [User: 69495.022 s, System: 11880.553 s]
     
    Relative speed comparison
            1.24          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>

    With this change we're getting a 9-hour full reindex-chainstate on an rpi5 with the default 450MB memory. Wow!

  229. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >10% faster IBD
    validation: fetch block inputs on parallel threads >20% faster IBD
    on Oct 29, 2025
  230. andrewtoth force-pushed on Oct 31, 2025
  231. andrewtoth force-pushed on Oct 31, 2025
  232. l0rinc commented at 8:20 AM on November 1, 2025: contributor

    Rebased the latest version of the PR now that #31645 was merged and measured a reindex on my M4 laptop: it finished all 3 runs with dbcache 450, 4500 and 45000 overnight.

    <details> <summary>Details</summary>

    time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 \
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 \
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 
    
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   8805.12s user 1456.47s system 134% cpu 2:07:22.36 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate    12201.80s user 3698.94s system 197% cpu 2:13:54.50 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20676.11s user 10582.81s system 358% cpu 2:25:26.52 total
    

    repeated the same later to check for stability:

    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   8873.87s user 1468.17s system 134% cpu 2:07:47.62 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate    12178.09s user 3621.39s system 197% cpu 2:13:11.48 total
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20725.97s user 10666.85s system 359% cpu 2:25:33.12 total
    

    </details>

    <img width="1185" height="580" alt="image" src="https://github.com/user-attachments/assets/a530f7d2-e7e5-4fa4-9ab0-59d0359aa763" />

    Note: I did get a `bitcoind(70369,0x16f5ff000) malloc: Failed to allocate segment from range group - out of space` warning above; maybe 45 GB of memory was a bit too much, but I can continue the block connection after the reindexes, so the measurements are likely representative. Edit: reran the 45 GB case; it completed successfully in a similar time without errors.

    <details> <summary>Details</summary>

    reindex-chainstate seems correct, it can continue after:

    ./build/bin/bitcoind
    2025-11-01T08:16:25Z Bitcoin Core version v30.99.0-45fe0c0e5bed (release build)
    ...
    2025-11-01T08:16:29Z nBestHeight = 921129
    ...
    2025-11-01T08:16:29Z UpdateTip: new best=00000000000000000000a216ce3209897114b757dbac3b651c72484829637c0e height=921130 version=0x343ba000 log2_work=95.904905 tx=1262937761 date='2025-10-28T06:06:11Z' progress=0.998475 cache=0.5MiB(3812txo)
    2025-11-01T08:16:29Z UpdateTip: new best=00000000000000000001965d9ff2ef639dd6829da2ad19c3f5d46691475f0df1 height=921131 version=0x25472000 log2_work=95.904918 tx=1262938885 date='2025-10-28T06:13:50Z' progress=0.998477 cache=1.6MiB(9726txo)
    

    </details>

  233. andrewtoth force-pushed on Nov 1, 2025
  234. andrewtoth commented at 9:47 PM on November 1, 2025: contributor

    Rebased to test behavior with #31645. Some other touch-ups include:

    • Use const COutPoint& in Input struct instead of vin and vtx indexes
    • Cleanup shared vectors and pointers at the end of loop
    • Refactor inner work loop to do fewer existence checks at the expense of some duplicated code
    • Use std::atomic_size_t instead of std::atomic<size_t>
    • Use GetPossiblySpentCoinFromCache in tests and fuzz harness for better clarity and correctness
  235. in src/coins.h:489 in 62868c8846
     481 | @@ -474,6 +482,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     482 |      //! See: https://stackoverflow.com/questions/42114044/how-to-release-unordered-map-memory
     483 |      void ReallocateCache();
     484 |  
     485 | +    /**
     486 | +     * Reserve enough space in the cache so the underlying unordered_map will
     487 | +     * not have to rehash unless capacity is exceeded.
     488 | +     */
     489 | +    void Reserve(size_t capacity) {
    


    l0rinc commented at 2:39 PM on November 2, 2025:

    Nit: can you please reformat the change with latest clang-format after rebase? Nit2: this seems to belong closer to GetCacheSize, one reserves, the other returns the actual size


    andrewtoth commented at 11:32 PM on November 2, 2025:

    I couldn't get clang-format to work, but I made this a one-liner and moved it under GetCacheSize.

  236. in src/inputfetcher.h:165 in 62868c8846
     160 | +        m_cache = &cache;
     161 | +        m_input_head.store(0, std::memory_order_relaxed);
     162 | +        m_barrier.arrive_and_wait();
     163 | +
     164 | +        // Insert fetched coins into the temp_cache as they are set to READY.
     165 | +        temp_cache.Reserve(m_inputs.size() + outputs_count);
    


    l0rinc commented at 2:50 PM on November 2, 2025:

    I like that we're doing this, we've usually avoided preallocating our maps - we should do it more often!


    I know we're providing empty caches via the parameters, but for completeness either assert that they're empty or:

            temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
    

    I have added

            Assert(temp_cache.GetCacheSize() <= temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
    

    after the `for (auto& input : m_inputs) {` loop to validate that the reservation holds - the tests pass! 👍


    andrewtoth commented at 11:32 PM on November 2, 2025:

    Done.

  237. in src/inputfetcher.h:149 in 62868c8846 outdated
     144 | +        }
     145 | +
     146 | +        // Loop through the inputs of the block and set them in the queue.
     147 | +        // Construct the set of txids to filter, and count the outputs to reserve for temp_cache.
     148 | +        auto outputs_count{block.vtx[0]->vout.size()};
     149 | +        for (size_t i{1}; i < block.vtx.size(); ++i) {
    


    l0rinc commented at 2:57 PM on November 2, 2025:

    we're also just adding stuff to the m_inputs without any reservation, the first few times we will have some needless copying - we could likely alleviate that by doing a rough reservation

            m_txids.reserve(block.vtx.size());
            m_inputs.reserve(2 * block.vtx.size()); // rough guess
    

    andrewtoth commented at 11:07 PM on November 2, 2025:

    This won't have a measurable effect if we are connecting lots of blocks though.


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Did it anyways.

  238. in src/inputfetcher.h:71 in 62868c8846 outdated
      66 | +    /**
      67 | +     * The set of txids of all txs in the block being fetched.
      68 | +     * Used to filter out inputs that are created and spent in the same block,
      69 | +     * since they will not be in the db or the cache.
      70 | +     */
      71 | +    std::unordered_set<Txid, SaltedTxidHasher> m_txids{};
    


    l0rinc commented at 3:06 PM on November 2, 2025:

    I understand if you don't want to bother with this, but how many of these do we expect per block? I wonder if we want to incur the hashing cost here or if using a sorted set or a sorted vector with std::binary_search (or even (unsorted?) SIMD-enabled linear scan) would be faster or simpler here.

    I also thought of whether we could just add these to temp_cache directly, but that would likely pollute the up-coming validation (unless we can differentiate these from existing inputs).


    Taking block413567, we can assume for the benchmark that 5% of the txs are internal spends (we need to measure this properly before we decide); assuming 1556 txs (4886 inputs, 3581 outputs) and 287 internal spends, we can compare the alternatives - here's a bench for motivation.

    <details> <summary>Benchmark comparing the miss and hit count of Txid_SetOrdered, Txid_UnorderedSalted, Txid_VectorBinarySearch, Txid_VectorLinearScan</summary>

    // Copyright (c) 2022-present The Bitcoin Core developers
    // Distributed under the MIT software license, see the accompanying
    // file COPYING or https://www.opensource.org/licenses/mit-license.php.
    
    #include <bench/bench.h>
    #include <bench/nanobench.h>
    #include <primitives/transaction_identifier.h>
    #include <random.h>
    #include <util/check.h>
    #include <util/hasher.h>
    
    #include <algorithm>
    #include <ranges>
    #include <set>
    #include <unordered_set>
    #include <vector>
    
    namespace {
    
    constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    constexpr size_t tx_count{5500};
    
    struct Dataset {
        std::set<Txid> sorted_set;
        std::unordered_set<Txid, SaltedTxidHasher> unsorted_set;
        std::vector<Txid> vec_sorted;
        std::vector<Txid> vec_unsorted;
    
        std::vector<Txid> queries;
    };
    
    std::vector<Dataset> BuildDatasets()
    {
        FastRandomContext rng(/*fDeterministic=*/true);
    
        std::vector<Dataset> datasets;
        datasets.reserve(iterations);
    
        for (size_t d{0}; d < iterations; ++d) {
            Dataset ds;
            ds.queries.reserve(tx_count);
            ds.unsorted_set.reserve(tx_count);
            ds.vec_sorted.reserve(tx_count);
            ds.vec_unsorted.reserve(tx_count);
    
            for (size_t i{0}; i < tx_count; ++i) {
                Txid t{Txid::FromUint256(rng.rand256())};
                ds.sorted_set.emplace(t);
                ds.unsorted_set.emplace(t);
                ds.vec_sorted.emplace_back(t);
                ds.vec_unsorted.emplace_back(t);
    
                ds.queries.emplace_back(i < hits_count ? t : Txid::FromUint256(rng.rand256()));
            }
    
            std::ranges::shuffle(ds.queries, rng);
            std::ranges::shuffle(ds.vec_unsorted, rng);
            std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    
            datasets.emplace_back(std::move(ds));
        }
        return datasets;
    }
    
    } // namespace
    
    static void Txid_UnorderedSalted(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets()};
        const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    

    </details>

    cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='Txid_.*'
    

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          217,247.09 |            4,603.05 |    0.0% |      0.02 | `Txid_SetOrdered`
    |          202,607.50 |            4,935.65 |    0.0% |      0.02 | `Txid_UnorderedSalted`
    |          194,347.91 |            5,145.41 |    0.0% |      0.02 | `Txid_VectorBinarySearch`
    |       12,449,880.83 |               80.32 |    0.0% |      1.24 | `Txid_VectorLinearScan`

    <details> <summary>Same measurement on an Rpi5 with GCC instead</summary>

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        1,179,326.18 |              847.94 |    0.0% |    2,361,313.36 |    2,820,742.15 |  0.837 |     475,693.50 |    1.2% |      0.12 | `Txid_SetOrdered`
    |          912,335.48 |            1,096.09 |    0.0% |    1,346,377.49 |    2,183,167.97 |  0.617 |      84,874.38 |    5.7% |      0.09 | `Txid_UnorderedSalted`
    |          732,493.03 |            1,365.20 |    0.0% |    2,411,291.90 |    1,752,928.05 |  1.376 |     531,672.95 |    8.1% |      0.07 | `Txid_VectorBinarySearch`
    |       25,449,442.73 |               39.29 |    0.0% |  235,989,723.07 |   60,929,976.87 |  3.873 |  58,998,462.96 |    0.0% |      2.54 | `Txid_VectorLinearScan`

    </details>


    So the benchmark isn't as revealing as I was hoping; please verify my assumptions, and we can still test these with some macro-benchmark to see if there's any performance or memory advantage to any of them.


    andrewtoth commented at 9:38 PM on November 2, 2025:

    I did try using a sorted vector and doing binary search in the workers, but it did not make a measurable performance difference. In theory it should be faster, since txid comparison is much faster than siphash. I can do this if you want, but I think the unordered_set makes the code clearer.


    andrewtoth commented at 1:42 AM on November 3, 2025:

    I wrote benchmarks to check how fast the filter is to construct, since this is the work that is not done in parallel:

    static void SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::vector<Txid> v{};
        v.reserve(block.vtx.size());
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
        });
    }
    
    static void UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::unordered_set<Txid, SaltedTxidHasher> u{};
        u.reserve(block.vtx.size());
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
        });
    }
    
    static void SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
        std::set<Txid> s{};
    
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
        });
    }
    

    Results:

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           45,794.65 |           21,836.61 |    1.6% |      777,851.91 |      110,515.70 |  7.038 |      88,000.15 |    0.5% |      0.01 | `UnorderedSetBenchmark`
    |           96,598.09 |           10,352.17 |    1.8% |      574,593.10 |      233,289.00 |  2.463 |     129,942.30 |    1.5% |      0.01 | `SetBenchmark`
    |        1,037,787.00 |              963.59 |    5.1% |   12,280,496.00 |    2,472,876.00 |  4.966 |   2,863,217.00 |    1.3% |      0.02 | :wavy_dash: `SortedVectorBenchmark` (Unstable with ~1.9 iters. Increase `minEpochIterations` to e.g. 19)
    

    Obviously we don't want to use the sorted vector. unordered_set is roughly twice as fast as set. I don't think it's worth pursuing a different filter container.


    l0rinc commented at 9:09 AM on November 3, 2025:

    Hmm, I'm not sure this is correct.

    Obviously we don't want to use the sorted vector

    That's not what I'm getting. You were inserting to the same collection over and over, I'm not sure what we were measuring there. Adding the collection creation and an assertion to not optimize it away (to make sure we're measuring the same thing in every iteration) reveals something completely different.

    <details> <summary>updated benchmarking code</summary>

    #include <bench/bench.h>
    #include <bench/data/block413567.raw.h>
    #include <coins.h>
    #include <inputfetcher.h>
    #include <primitives/block.h>
    #include <serialize.h>
    #include <streams.h>
    #include <util/hasher.h>
    
    #include <algorithm>
    #include <set>
    #include <unordered_set>
    #include <vector>
    
    static void InputFetcher_SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::vector<Txid> v{};
            v.reserve(block.vtx.size());
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
            ankerl::nanobench::doNotOptimizeAway(v);
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::unordered_set<Txid, SaltedTxidHasher> u{};
            u.reserve(block.vtx.size());
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(u);
        });
    }
    
    static void InputFetcher_SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        bench.run([&] {
            std::set<Txid> s{};
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(s);
        });
    }
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark, benchmark::PriorityLevel::HIGH);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark, benchmark::PriorityLevel::HIGH);
    BENCHMARK(InputFetcher_SetBenchmark, benchmark::PriorityLevel::HIGH);
    

    </details>

    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          246,121.59 |            4,063.03 |    0.0% |      978,180.25 |      883,481.68 |  1.107 |     224,954.25 |    2.8% |     11.00 | `InputFetcher_SetBenchmark`
    |          130,850.86 |            7,642.29 |    0.2% |      608,319.13 |      469,743.71 |  1.295 |     135,588.13 |    4.4% |     11.04 | `InputFetcher_SortedVectorBenchmark`
    |          171,092.09 |            5,844.81 |    0.0% |    1,207,752.53 |      614,142.35 |  1.967 |     174,278.58 |    1.1% |     11.00 | `InputFetcher_UnorderedSetBenchmark`

    <img width="1500" height="860" alt="image" src="https://github.com/user-attachments/assets/548ca5a4-bfe0-478c-9268-d76c6cd91e75" />


    andrewtoth commented at 1:30 PM on November 3, 2025:

    Hmm, right - I was not clearing the containers, so the sorting was dominating. I retried with clearing, and the unordered_set is still slightly faster. In your benchmark you are creating and reserving the container inside the benchmark loop instead of before it. In this implementation the reserved memory is kept in between blocks, so creating and reserving outside makes more sense.


    l0rinc commented at 10:48 AM on November 5, 2025:

    > In this implementation the reserved memory is kept in between blocks

    K, so let's reserve outside and clear inside. Now that missing values aren't failures, we can experiment with shortids (even though I wouldn't expect any collisions in 64 bits either, assuming uniform distribution; but even if the distribution isn't uniform, we can likely store it safely). 64-bit ids for internal spends would mean that in case of a collision we attempt to fetch something from disk that was actually in the current block - so an attacker can at best slow down block validation by a few milliseconds.

    <details> <summary>sorted/unsorted/vector benchmarks & shortids</summary>

    // Copyright (c) 2022-present The Bitcoin Core developers
    // Distributed under the MIT software license, see the accompanying
    // file COPYING or https://www.opensource.org/licenses/mit-license.php.
    
    #include <algorithm>
    #include <bench/bench.h>
    #include <bench/data/block413567.raw.h>
    #include <bench/nanobench.h>
    #include <coins.h>
    #include <functional>
    #include <inputfetcher.h>
    #include <primitives/block.h>
    #include <primitives/transaction_identifier.h>
    #include <random.h>
    #include <ranges>
    #include <serialize.h>
    #include <set>
    #include <streams.h>
    #include <unordered_set>
    #include <util/check.h>
    #include <util/hasher.h>
    #include <vector>
    
    namespace {
    constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    constexpr size_t tx_count{5500};
    
    uint64_t GetShortID(const Txid& txid)
    {
        return txid.ToUint256().GetUint64(0);
    }
    
    template <typename T>
    struct Dataset {
        std::set<T> sorted_set;
        std::unordered_set<T, std::conditional_t<std::is_same_v<T, Txid>, SaltedTxidHasher, std::identity>> unsorted_set;
        std::vector<T> vec_sorted;
        std::vector<T> vec_unsorted;
        std::vector<T> queries;
    
        static T Convert(const Txid& txid)
        {
            if constexpr (std::is_same_v<T, Txid>) {
                return txid;
            } else {
                static_assert(std::is_same_v<T, uint64_t>);
                return GetShortID(txid);
            }
        }
    };
    
    template <typename T>
    std::vector<Dataset<T>> BuildDatasets()
    {
        FastRandomContext rng(/*fDeterministic=*/true);
    
        std::vector<Dataset<T>> datasets;
        datasets.reserve(iterations);
    
        for (size_t d{0}; d < iterations; ++d) {
            Dataset<T> ds;
            ds.queries.reserve(tx_count);
            ds.unsorted_set.reserve(tx_count);
            ds.vec_sorted.reserve(tx_count);
            ds.vec_unsorted.reserve(tx_count);
    
            for (size_t i{0}; i < tx_count; ++i) {
                T t1{Dataset<T>::Convert(Txid::FromUint256(rng.rand256()))};
                ds.sorted_set.emplace(t1);
                ds.unsorted_set.emplace(t1);
                ds.vec_sorted.emplace_back(t1);
                ds.vec_unsorted.emplace_back(t1);
    
                T t2{Dataset<T>::Convert(Txid::FromUint256(rng.rand256()))};
                ds.queries.emplace_back(i < hits_count ? t1 : t2);
            }
    
            std::ranges::shuffle(ds.queries, rng);
            std::ranges::shuffle(ds.vec_unsorted, rng);
            std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    
            datasets.emplace_back(std::move(ds));
        }
        return datasets;
    }
    } // namespace
    
    static void Txid_UnorderedSalted(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<Txid>()};
        const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_UnorderedSalted_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.unsorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_SetOrdered_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += s.sorted_set.contains(q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorBinarySearch_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += std::ranges::binary_search(s.vec_sorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void Txid_VectorLinearScan_shortid(benchmark::Bench& bench)
    {
        static auto ds{BuildDatasets<uint64_t>()};
        const auto contains_linear{[](const std::vector<uint64_t>& v, uint64_t x) noexcept {
            return std::ranges::find(v, x) != v.end();
        }};
        bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
            size_t sum{0};
            for (const auto& s : ds) {
                for (const auto& q : s.queries) {
                    sum += contains_linear(s.vec_unsorted, q);
                }
            }
            ankerl::nanobench::doNotOptimizeAway(sum);
            Assert(sum == iterations * hits_count);
        });
    }
    
    static void InputFetcher_SortedVectorBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::vector<Txid> v{};
        v.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(tx->GetHash());
            }
            std::sort(v.begin(), v.end());
            ankerl::nanobench::doNotOptimizeAway(v);
            v.clear();
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::unordered_set<Txid, SaltedTxidHasher> u{};
        u.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(u);
            u.clear();
        });
    }
    
    static void InputFetcher_SetBenchmark(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::set<Txid> s{};
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(tx->GetHash());
            }
            ankerl::nanobench::doNotOptimizeAway(s);
            s.clear();
        });
    }
    
    static void InputFetcher_SortedVectorBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::vector<uint64_t> v{};
        v.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                v.emplace_back(GetShortID(tx->GetHash()));
            }
            std::ranges::sort(v);
            ankerl::nanobench::doNotOptimizeAway(v);
            v.clear();
        });
    }
    
    static void InputFetcher_UnorderedSetBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::unordered_set<uint64_t> u{};
        u.reserve(block.vtx.size());
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                u.emplace(GetShortID(tx->GetHash()));
            }
            ankerl::nanobench::doNotOptimizeAway(u);
            u.clear();
        });
    }
    
    static void InputFetcher_SetBenchmark_shortid(benchmark::Bench& bench)
    {
        CBlock block;
        DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    
        std::set<uint64_t> s{};
        bench.run([&] {
            for (const auto& tx : block.vtx) {
                s.insert(GetShortID(tx->GetHash()));
            }
            ankerl::nanobench::doNotOptimizeAway(s);
            s.clear();
        });
    }
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_SetBenchmark, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(InputFetcher_SortedVectorBenchmark_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_UnorderedSetBenchmark_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(InputFetcher_SetBenchmark_shortid, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    
    BENCHMARK(Txid_UnorderedSalted_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_SetOrdered_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorBinarySearch_shortid, benchmark::PriorityLevel::LOW);
    BENCHMARK(Txid_VectorLinearScan_shortid, benchmark::PriorityLevel::LOW);
    

    </details>

    <img width="1487" height="864" alt="image" src="https://github.com/user-attachments/assets/8f971f62-87c7-496f-a060-0dec2bb9cc51" />

    The new measurements indicate that the single-threaded preparation is 4x faster with a sorted vector, while multithreaded search is more than 2x faster with short ids compared to the current approach. Worth investigating further, I'd say :)

  239. in src/coins.cpp:176 in aeec2e421d outdated
     172 | @@ -173,6 +173,11 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
     173 |      return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     174 |  }
     175 |  
     176 | +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint &outpoint) const {
    


    l0rinc commented at 4:40 PM on November 2, 2025:

    This seems like a code smell that we should pay attention to. I already brought this up a year ago in #30673 (review); it seems to keep biting us. Instead of adding a new method that basically does the same thing, can we keep GetCoin simply returning the coin and move the spentness checks to the call sites? I understand that would likely require another PR, but it would make this one cleaner - I don't like that we need a workaround for a scenario that isn't out of the ordinary.


    andrewtoth commented at 9:52 PM on November 2, 2025:

    This is a consequence of skipping the main cache and writing directly to the temp cache.

    I don't think we can modify GetCoin to return spent coins; it's part of its definition: "Retrieve the Coin (unspent transaction output) for a given outpoint". This also goes against your change #30849, where you added TODOs so that our test GetCoins, which currently return spent coins, stop doing so. I don't think a PR to return spent coins from GetCoin would get enough support to be merged.

    If GetCoin could return spent coins, we would also not be able to use it without first calling HaveCoinInCache. This is because even though GetCoin is const, it modifies cacheCoins internally (which is marked mutable for this purpose). But HaveCoinInCache also returns false if the coin is spent, so we would need to modify that too. So we would have to change two methods, and we would then also have to do two lookups. This method is very simple and defined precisely for this special purpose, similar to EmplaceCoinInCacheDANGER.

    I'm open to other suggestions, but modifying GetCoin (and then HaveCoinInCache) doesn't seem like the way.
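The single-lookup argument can be illustrated with a toy cache - `ToyCache`, `ToyCoin`, and the integer outpoint key are all hypothetical stand-ins, not the real CCoinsViewCache API. The lookup returns the entry even when spent, so the caller applies its own spentness policy after a single map access:

```cpp
#include <optional>
#include <unordered_map>

// Toy coin: a value plus a spent flag, mimicking Coin::IsSpent().
struct ToyCoin {
    int value{0};
    bool spent{false};
    bool IsSpent() const { return spent; }
};

// Toy stand-in for the coins cache, keyed by a simplified outpoint id.
struct ToyCache {
    std::unordered_map<int, ToyCoin> cacheCoins;

    // Cache-only lookup that returns the entry even if it is spent, so the
    // caller decides what spentness means - one lookup, no second
    // HaveCoinInCache-style probe needed.
    std::optional<ToyCoin> GetPossiblySpentCoinFromCache(int outpoint) const
    {
        const auto it{cacheCoins.find(outpoint)};
        if (it == cacheCoins.end()) return std::nullopt;
        return it->second;
    }
};
```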

  240. in src/inputfetcher.h:132 in aeec2e421d outdated
     127 | +public:
     128 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
     129 | +        : m_barrier{static_cast<int32_t>(worker_thread_count + 1)}
     130 | +    {
     131 | +        for (size_t n{0}; n < worker_thread_count; ++n) {
     132 | +            m_worker_threads.emplace_back([this, n]() {
    


    l0rinc commented at 4:54 PM on November 2, 2025:

    nit:

                m_worker_threads.emplace_back([this, n] {
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  241. in src/inputfetcher.h:128 in aeec2e421d outdated
     123 | +            m_barrier.arrive_and_wait();
     124 | +        }
     125 | +    }
     126 | +
     127 | +public:
     128 | +    explicit InputFetcher(size_t worker_thread_count) noexcept
    


    l0rinc commented at 4:56 PM on November 2, 2025:
        explicit InputFetcher(int32_t worker_thread_count) noexcept : m_barrier{(worker_thread_count + 1)}
    

    this would remove a few static casts


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  242. in src/inputfetcher.h:62 in aeec2e421d outdated
      57 | +        const COutPoint& outpoint;
      58 | +        //! The coin that workers will fetch and main thread will insert into cache.
      59 | +        Coin coin{};
      60 | +
      61 | +        Input(Input&& other) noexcept : outpoint{other.outpoint} {} // Only moved in setup for reallocation.
      62 | +        Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    


    l0rinc commented at 4:58 PM on November 2, 2025:

    nit:

            explicit Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  243. in src/inputfetcher.h:43 in aeec2e421d outdated
      38 | + */
      39 | +class InputFetcher
      40 | +{
      41 | +private:
      42 | +    //! The latest input being fetched. Workers atomically increment this when fetching.
      43 | +    alignas(64) std::atomic_size_t m_input_head{0};
    


    l0rinc commented at 5:01 PM on November 2, 2025:

    is the alignas a false sharing guard? Does it have a measurable effect?


    andrewtoth commented at 9:54 PM on November 2, 2025:

    It was for false sharing. I don't think it has a measurable effect, but it might on some systems? I think it's harmless to keep, since there's only one InputFetcher.


    andrewtoth commented at 11:33 PM on November 2, 2025:

    Removed it.

  244. in src/inputfetcher.h:189 in aeec2e421d outdated
     184 | +        m_cache = nullptr;
     185 | +    }
     186 | +
     187 | +    ~InputFetcher()
     188 | +    {
     189 | +        m_request_stop = true;
    


    l0rinc commented at 5:05 PM on November 2, 2025:

    I don't like that we need a separate field for this only, but I guess without std::jthread we have to do this manually.

    Since this field isn't checked on every iteration, only once per thread per job as far as I can tell, we could repurpose m_input_head and use it as

    alignas(64) std::atomic_int32_t m_input_head{0};
    ...
    if (m_input_head.load(std::memory_order_acquire) < 0) [[unlikely]] {
    ...
    m_input_head.store(-1, std::memory_order_relaxed);
    

    andrewtoth commented at 9:55 PM on November 2, 2025:

    I don't think that's actually preferable. The current way is more readable IMO.

  245. in src/inputfetcher.h:190 in aeec2e421d outdated
     185 | +    }
     186 | +
     187 | +    ~InputFetcher()
     188 | +    {
     189 | +        m_request_stop = true;
     190 | +        m_barrier.arrive_and_wait();
    


    l0rinc commented at 5:10 PM on November 2, 2025:

    not certain, but I think this should likely be:

            m_barrier.arrive_and_drop();
    

    andrewtoth commented at 11:33 PM on November 2, 2025:

    Done.

  246. in src/inputfetcher.h:120 in aeec2e421d outdated
     115 | +                } catch (const std::runtime_error& e) {
     116 | +                    LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.\n", e.what());
     117 | +                }
     118 | +                // Input missing or spent. This block will fail validation.
     119 | +                // Skip remaining inputs.
     120 | +                m_input_head.store(m_inputs.size(), std::memory_order_relaxed);
    


    l0rinc commented at 5:13 PM on November 2, 2025:

    I understand that this poison pill broadcast is needed to stop execution when errors occur - but I'm not sure we should care here about fetching error, I think we should just try to continue if that makes the code simpler


    andrewtoth commented at 10:30 PM on November 2, 2025:

    In ConnectBlock if we encounter a missing input we abort validation immediately. It is wasted work to continue. Why should we do it here?


    l0rinc commented at 7:44 AM on November 3, 2025:

    I don't think we should abort, it's not the fetcher's job to validate. If we didn't, we could even use short ids for the intra-block spends (64 bit likely won't even result in a single duplicate, and when they do collide we would just do a db check)


    andrewtoth commented at 1:31 PM on November 3, 2025:

    Why would short ids matter here though? Isn't that for keeping the bandwidth smaller for compact blocks?


    l0rinc commented at 1:36 PM on November 3, 2025:

    no, I mean we don't actually need 256 bits of precision here, just a probabilistic check, so taking the first 32/64 bits of the hash should suffice. The worst case is just going to disk, so it's not a tragedy if there are false positives, as long as the average-case checks are a lot faster.


    andrewtoth commented at 2:25 PM on November 3, 2025:

    I don't see why aborting early would prevent us from using less precision. It's doubtful there would be a collision, and if there were, that block would just be a little slower to connect. I don't think it would have a measurable effect though - siphashing 64 bits vs 256 bits here?


    andrewtoth commented at 10:15 PM on November 4, 2025:

    I removed the abort early logic, so we keep going if we don't find an input. It makes the logic much simpler, but we will do some more work if we get a block mined that is double spending.

  247. in src/inputfetcher.h:91 in aeec2e421d outdated
      86 | +            if (m_request_stop) [[unlikely]] {
      87 | +                return;
      88 | +            }
      89 | +            while (true) {
      90 | +                const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      91 | +                if (i >= m_inputs.size()) [[unlikely]] {
    


    l0rinc commented at 5:47 PM on November 2, 2025:

    I don't mind the unlikely parts, but we're barely using them in the code and they have some weird side effects when merged - which may not be the case here, so I'm fine either way; if nothing else, it documents the usage.

  248. in src/inputfetcher.h:33 in aeec2e421d outdated
      28 | + * into the ephemeral cache used in ConnectBlock.
      29 | + *
      30 | + * It spawns a fixed set of worker threads that fetch Coins for each input
      31 | + * in a block. The Coin is moved into the Input struct and then the status is
      32 | + * atomically updated to READY. The main thread spin loops on the status field
      33 | + * until it is READY and then inserts it into the temporary cache.
    


    l0rinc commented at 5:49 PM on November 2, 2025:

    I'm surprised synchronized insertion is faster than collecting them in a lock-free way - guess it's all the copying, but we should be able to avoid that, I don't like that we're locking again


    andrewtoth commented at 10:00 PM on November 2, 2025:

    There are no locks... This is a lock free implementation. It is synchronized with atomics per input, and the threads are work stealing via the m_input_head. Or do you mean something else?
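The work-stealing scheme described here can be sketched in isolation - the names and sizes below are illustrative, not the PR's code. Workers share one atomic head counter and `fetch_add` to claim the next index, so each index is claimed exactly once and no lock is ever held:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each fetch_add returns a unique index, so indices are claimed exactly
// once; summing them stands in for "fetch input i".
int SumClaimedIndices(int n_inputs, int n_threads)
{
    std::atomic<int> head{0};
    std::atomic<int> sum{0};

    auto worker = [&] {
        while (true) {
            const int i{head.fetch_add(1, std::memory_order_relaxed)};
            if (i >= n_inputs) break;   // nothing left to claim
            sum.fetch_add(i);
        }
    };
    std::vector<std::thread> threads;
    for (int t{0}; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return sum.load();                  // 0 + 1 + ... + (n_inputs - 1)
}
```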


    l0rinc commented at 7:46 AM on November 3, 2025:

    Atomics work via CAS; they need to retry in case of high contention. The previous solution didn't have contention: the threads all knew beforehand what to work on.


    andrewtoth commented at 2:20 PM on November 3, 2025:

    Some contention on shared resources is unavoidable. The threads need to synchronize on atomics. There are no locks in this implementation that all other threads need to wait on.

    The main thread may have to wait here if there is a slow fetch, but it would be reading uncontested memory right up until one worker thread flips this bit. None of the other threads are trying to write to this same bit, so there is no contention between the main thread and any other worker threads.


    andrewtoth commented at 10:16 PM on November 4, 2025:

    I've updated this to be an `atomic_bool` instead of this enum. Since there is only a false value that is set to true, the main thread calls `input.ready.wait(false, std::memory_order_acquire);`. This way we don't spin.

  249. in src/inputfetcher.h:50 in aeec2e421d outdated
      45 | +    //! The inputs of the block which is being fetched.
      46 | +    struct Input {
      47 | +        enum class Status : uint8_t {
      48 | +            WAITING, // The coin has not been fetched yet
      49 | +            READY, // The coin has been fetched and is ready to be inserted into the cache
      50 | +            FAILED, // The coin failed to be fetched
    


    l0rinc commented at 5:50 PM on November 2, 2025:

    as mentioned before, is FAILED an important state for the fetcher? Why not just debug/warn log and continue?


    andrewtoth commented at 10:03 PM on November 2, 2025:

    We can't have the main thread insert a failed fetch; the coin will not be updated. I suppose the main thread could check if the coin is unspent. We also want to exit ASAP if we can't find an input, so we need to signal the main thread that it should exit the loop.


    l0rinc commented at 7:47 AM on November 3, 2025:

    that's my point, I don't think we should validate at all, it would simplify the code if we didn't


    andrewtoth commented at 10:19 PM on November 4, 2025:

    Done, this Status enum has been replaced. It is now just an `atomic_bool ready`.

  250. in src/inputfetcher.h:104 in aeec2e421d outdated
      99 | +                    continue;
     100 | +                }
     101 | +                try {
     102 | +                    if (auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)}) {
     103 | +                        input.coin = std::move(*coin);
     104 | +                        if (!input.coin.IsSpent()) [[likely]] { // Coin from cache could be spent
    


    l0rinc commented at 5:57 PM on November 2, 2025:

    hmm, code states it's likely, but the new tests are passing with:

    if (!input.coin.IsSpent()) {
        throw "";
    }
    

    andrewtoth commented at 10:42 PM on November 2, 2025:

    Right, didn't cover this new happy path in tests. Unlikely case is covered though. Will add. Thanks.

  251. in src/test/inputfetcher_tests.cpp:25 in e3045d2237 outdated
      20 | +#include <string>
      21 | +#include <unordered_set>
      22 | +
      23 | +BOOST_AUTO_TEST_SUITE(inputfetcher_tests)
      24 | +
      25 | +struct InputFetcherTest : BasicTestingSetup {
    


    l0rinc commented at 5:59 PM on November 2, 2025:

    I think the test belongs with the implementation, I need it to be able to review the first commit (otherwise it's just dead code, with the tests it has at least a test user)


    andrewtoth commented at 11:34 PM on November 2, 2025:

    Done.

  252. in src/test/inputfetcher_tests.cpp:21 in e3045d2237 outdated
      16 | +
      17 | +#include <cstdint>
      18 | +#include <memory>
      19 | +#include <stdexcept>
      20 | +#include <string>
      21 | +#include <unordered_set>
    


    l0rinc commented at 6:02 PM on November 2, 2025:

    nit: some of these seem unused:

    #include <memory>
    #include <stdexcept>
    #include <unordered_set>
    

    andrewtoth commented at 11:04 PM on November 2, 2025:

    We need `<cstdint>` because we use `int32_t`.

  253. in src/test/inputfetcher_tests.cpp:85 in 62868c8846
      80 | +                db.EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
      81 | +            }
      82 | +        }
      83 | +
      84 | +        CCoinsViewCache main_cache(&db);
      85 | +        CCoinsViewCache cache(&cache);
    


    l0rinc commented at 8:03 PM on November 2, 2025:

    hmm, how does this even compile?


    andrewtoth commented at 11:34 PM on November 2, 2025:

    Fixed. But, it wouldn't affect the correctness of the test.


    l0rinc commented at 7:49 AM on November 3, 2025:

    how so? Can we assert the behavior of the main cache as well so that the previous version doesn't pass?


    andrewtoth commented at 2:43 PM on November 5, 2025:

    > Can we assert the behavior of the main cache as well so that the previous version doesn't pass?

    Neither the temp cache nor the main cache touches its backing view in the input fetcher, so we can't assert a failure if the backing cache is something else. We can assert that the backing view is not touched during FetchInputs, which is done in the fuzz harness.

  254. in src/test/fuzz/inputfetcher.cpp:43 in 62868c8846 outdated
      38 | +struct NoAccessCoinsView : CCoinsView
      39 | +{
      40 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
      41 | +};
      42 | +
      43 | +FUZZ_TARGET(inputfetcher)
    


    l0rinc commented at 8:24 PM on November 2, 2025:

    I don't have a good fuzzer locally - does this cover all the new code?


    andrewtoth commented at 11:05 PM on November 2, 2025:

    Yes, though I still need to fuzz it myself. The CI is fuzzing it a little bit.

  255. in src/inputfetcher.h:105 in 62868c8846
     100 | +                }
     101 | +                try {
     102 | +                    if (auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)}) {
     103 | +                        input.coin = std::move(*coin);
     104 | +                        if (!input.coin.IsSpent()) [[likely]] { // Coin from cache could be spent
     105 | +                            // We need release here, so setting coin 2 lines above happens before the main thread loads.
    


    l0rinc commented at 8:28 PM on November 2, 2025:

    The comments indicate that you also don't think this code is very intuitive - I'm a bit lost here too; can we simplify this somehow? I don't even understand what "release" means here, why both branches result in Status::READY (can we unify them?), or what happens if the first inner if isn't fulfilled, or the outer one, or the else, etc. The branching + continue + try/catch doesn't help.


    andrewtoth commented at 11:35 PM on November 2, 2025:

    Rewrote this part to make it more clear. I was trying to be clever by avoiding some extra checks, but they are probably meaningless and clarity is better here.

  256. l0rinc changes_requested
  257. l0rinc commented at 8:31 PM on November 2, 2025: contributor

    I like how we're progressing here! I think we need a few more things and have to try out a few alternatives (I haven't given up on sorting yet, especially now with bigger dbcache) and want to see how this combines with the other optimizations (threadpool, SipHash13, map hash caching, etc), but I'm definitely getting closer and closer to an ACK :D

    I'm testing full IBD locally on my servers, but those runs are always slower than reindex-chainstate since the peer nodes can't send the blocks fast enough - I don't have a seeding node yet, but I'm working on it.

    "It simply inserts inputs into the temporary cache, which must be fetched before a transaction is validated anyways." - beautiful, I think some of this should be added to the commit messages as well. Commit message nit: "coins: add InputFetcher" (the commit message content and formatting are a bit sloppy).

    I'd merge `fuzz: add inputfetcher fuzz harness`, `tests: add inputfetcher tests`, and `coins: add inputfetcher`, since the clients are needed for the review of InputFetcher - otherwise it's just dead code being added...

    <details> <summary>local patch I had during review - they're not necessarily suggestions, just changes I did locally</summary>

    diff --git a/src/bench/CMakeLists.txt b/src/bench/CMakeLists.txt
    index 9d03f075a7..dcb6281699 100644
    --- a/src/bench/CMakeLists.txt
    +++ b/src/bench/CMakeLists.txt
    @@ -52,6 +52,7 @@ add_executable(bench_bitcoin
       streams_findbyte.cpp
       strencodings.cpp
       txgraph.cpp
    +  txid_membership.cpp
       txorphanage.cpp
       util_time.cpp
       verify_script.cpp
    diff --git a/src/bench/inputfetcher.cpp b/src/bench/inputfetcher.cpp
    index c10fcc5b5e..c1660c3ccf 100644
    --- a/src/bench/inputfetcher.cpp
    +++ b/src/bench/inputfetcher.cpp
    @@ -12,20 +12,18 @@
     #include <streams.h>
     #include <util/time.h>
     
    -static constexpr auto DELAY{2ms};
    -
     //! Simulates a DB by adding a delay when calling GetCoin
     struct DelayedCoinsView : CCoinsView
     {
         std::optional<Coin> GetCoin(const COutPoint&) const override
         {
    -        UninterruptibleSleep(DELAY);
    +        UninterruptibleSleep(2ms);
             Coin coin{};
             coin.out.nValue = 1;
             return coin;
         }
     
    -    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    +    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { throw std::logic_error{"unused"}; }
     };
     
     static void InputFetcherBenchmark(benchmark::Bench& bench)
    @@ -39,11 +37,13 @@ static void InputFetcherBenchmark(benchmark::Bench& bench)
         // The main thread should be counted to prevent thread oversubscription, and
         // to decrease the variance of benchmark results.
         const auto worker_threads_num{GetNumCores() - 1};
    -    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    +    InputFetcher fetcher{worker_threads_num};
     
         bench.run([&] {
             CCoinsViewCache temp_cache(&main_cache);
             fetcher.FetchInputs(temp_cache, main_cache, db, block);
    +        ankerl::nanobench::doNotOptimizeAway(&temp_cache);
    +        Assert(temp_cache.GetCacheSize() == 4599);
         });
     }
     
    diff --git a/src/bench/txid_membership.cpp b/src/bench/txid_membership.cpp
    new file mode 100644
    index 0000000000..e646bb2a4a
    --- /dev/null
    +++ b/src/bench/txid_membership.cpp
    @@ -0,0 +1,134 @@
    +// Copyright (c) 2022-present The Bitcoin Core developers
    +// Distributed under the MIT software license, see the accompanying
    +// file COPYING or https://www.opensource.org/licenses/mit-license.php.
    +
    +#include <bench/bench.h>
    +#include <bench/nanobench.h>
    +#include <primitives/transaction_identifier.h>
    +#include <random.h>
    +#include <util/check.h>
    +#include <util/hasher.h>
    +
    +#include <algorithm>
    +#include <ranges>
    +#include <set>
    +#include <unordered_set>
    +#include <vector>
    +
    +namespace {
    +
    +constexpr size_t iterations{100}; // since the inputs of the benchmarks are mutated by sorting, we can't rerun the benchmarks
    +constexpr size_t hits_count{275}; // assuming ~5% of blocks contain internal spends
    +constexpr size_t tx_count{5500};
    +
    +struct Dataset {
    +    std::set<Txid> sorted_set;
    +    std::unordered_set<Txid, SaltedTxidHasher> unsorted_set;
    +    std::vector<Txid> vec_sorted;
    +    std::vector<Txid> vec_unsorted;
    +
    +    std::vector<Txid> queries;
    +};
    +
    +std::vector<Dataset> BuildDatasets()
    +{
    +    FastRandomContext rng(/*fDeterministic=*/true);
    +
    +    std::vector<Dataset> datasets;
    +    datasets.reserve(iterations);
    +
    +    for (size_t d{0}; d < iterations; ++d) {
    +        Dataset ds;
    +        ds.queries.reserve(tx_count);
    +        ds.unsorted_set.reserve(tx_count);
    +        ds.vec_sorted.reserve(tx_count);
    +        ds.vec_unsorted.reserve(tx_count);
    +
    +        for (size_t i{0}; i < tx_count; ++i) {
    +            Txid t{Txid::FromUint256(rng.rand256())};
    +            ds.sorted_set.emplace(t);
    +            ds.unsorted_set.emplace(t);
    +            ds.vec_sorted.emplace_back(t);
    +            ds.vec_unsorted.emplace_back(t);
    +
    +            ds.queries.emplace_back(i < hits_count ? t : Txid::FromUint256(rng.rand256()));
    +        }
    +
    +        std::ranges::shuffle(ds.queries, rng);
    +        std::ranges::shuffle(ds.vec_unsorted, rng);
    +        std::sort(ds.vec_sorted.begin(), ds.vec_sorted.end());
    +
    +        datasets.emplace_back(std::move(ds));
    +    }
    +    return datasets;
    +}
    +
    +} // namespace
    +
    +static void Txid_UnorderedSalted(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += s.unsorted_set.contains(q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_SetOrdered(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += s.sorted_set.contains(q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_VectorBinarySearch(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += std::binary_search(s.vec_sorted.begin(), s.vec_sorted.end(), q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +static void Txid_VectorLinearScan(benchmark::Bench& bench)
    +{
    +    static auto ds{BuildDatasets()};
    +    const auto contains_linear{[](const std::vector<Txid>& v, const Txid& x) noexcept {
    +        return std::ranges::find(v, x) != v.end();
    +    }};
    +    bench.epochs(1).epochIterations(1).batch(iterations).run([&] {
    +        size_t sum{0};
    +        for (const auto& s : ds) {
    +            for (const auto& q : s.queries) {
    +                sum += contains_linear(s.vec_unsorted, q);
    +            }
    +        }
    +        ankerl::nanobench::doNotOptimizeAway(sum);
    +        Assert(sum == iterations * hits_count);
    +    });
    +}
    +
    +BENCHMARK(Txid_UnorderedSalted, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_SetOrdered, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_VectorBinarySearch, benchmark::PriorityLevel::LOW);
    +BENCHMARK(Txid_VectorLinearScan, benchmark::PriorityLevel::LOW);
    diff --git a/src/coins.h b/src/coins.h
    index 1b3dcfc309..95c3bfe2f5 100644
    --- a/src/coins.h
    +++ b/src/coins.h
    @@ -486,7 +486,8 @@ public:
          * Reserve enough space in the cache so the underlying unordered_map will
          * not have to rehash unless capacity is exceeded.
          */
    -    void Reserve(size_t capacity) {
    +    void Reserve(size_t capacity)
    +    {
             cacheCoins.reserve(capacity);
         }
     
    diff --git a/src/inputfetcher.h b/src/inputfetcher.h
    index cd9eaed6ea..d15c4874cc 100644
    --- a/src/inputfetcher.h
    +++ b/src/inputfetcher.h
    @@ -17,6 +17,7 @@
     #include <atomic>
     #include <barrier>
     #include <cstdint>
    +#include <set>
     #include <stdexcept>
     #include <thread>
     #include <unordered_set>
    @@ -38,9 +39,8 @@
      */
     class InputFetcher
     {
    -private:
         //! The latest input being fetched. Workers atomically increment this when fetching.
    -    alignas(64) std::atomic_size_t m_input_head{0};
    +    alignas(64) std::atomic_int32_t m_input_head{0};
     
         //! The inputs of the block which is being fetched.
         struct Input {
    @@ -59,7 +59,7 @@ private:
             Coin coin{};
     
             Input(Input&& other) noexcept : outpoint{other.outpoint} {} // Only moved in setup for reallocation.
    -        Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    +        explicit Input(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
         std::vector<Input> m_inputs{};
     
    @@ -68,7 +68,7 @@ private:
          * Used to filter out inputs that are created and spent in the same block,
          * since they will not be in the db or the cache.
          */
    -    std::unordered_set<Txid, SaltedTxidHasher> m_txids{};
    +    std::set<Txid> m_txids{};
     
         //! DB coins view to fetch from.
         const CCoinsView* m_db{nullptr};
    @@ -77,18 +77,15 @@ private:
     
         std::vector<std::thread> m_worker_threads{};
         std::barrier<> m_barrier;
    -    bool m_request_stop{false};
     
         void WorkLoop() noexcept
         {
             while (true) {
                 m_barrier.arrive_and_wait();
    -            if (m_request_stop) [[unlikely]] {
    -                return;
    -            }
    +            if (m_input_head.load(std::memory_order_relaxed) < 0) [[unlikely]] return;
                 while (true) {
    -                const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
    -                if (i >= m_inputs.size()) [[unlikely]] {
    +                const auto i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
    +                if (i >= int32_t(m_inputs.size())) [[unlikely]] {
                         break;
                     }
                     auto& input{m_inputs[i]};
    @@ -125,11 +122,10 @@ private:
         }
     
     public:
    -    explicit InputFetcher(size_t worker_thread_count) noexcept
    -        : m_barrier{static_cast<int32_t>(worker_thread_count + 1)}
    +    explicit InputFetcher(int32_t worker_thread_count) noexcept : m_barrier{(worker_thread_count + 1)}
         {
    -        for (size_t n{0}; n < worker_thread_count; ++n) {
    -            m_worker_threads.emplace_back([this, n]() {
    +        for (int32_t n{0}; n < worker_thread_count; ++n) {
    +            m_worker_threads.emplace_back([this, n] {
                     util::ThreadRename(strprintf("inputfetch.%i", n));
                     WorkLoop();
                 });
    @@ -145,6 +141,8 @@ public:
     
             // Loop through the inputs of the block and set them in the queue.
             // Construct the set of txids to filter, and count the outputs to reserve for temp_cache.
    +        //m_txids.reserve(block.vtx.size());
    +        m_inputs.reserve(2 * block.vtx.size()); // rough guess
             auto outputs_count{block.vtx[0]->vout.size()};
             for (size_t i{1}; i < block.vtx.size(); ++i) {
                 const auto& tx{block.vtx[i]};
    @@ -162,7 +160,7 @@ public:
             m_barrier.arrive_and_wait();
     
             // Insert fetched coins into the temp_cache as they are set to READY.
    -        temp_cache.Reserve(m_inputs.size() + outputs_count);
    +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
             for (auto& input : m_inputs) {
                 auto status{input.status.load(std::memory_order_acquire)};
                 while (status == Input::Status::WAITING) {
    @@ -175,6 +173,7 @@ public:
                     break;
                 }
             }
    +        Assert(temp_cache.GetCacheSize() <= temp_cache.GetCacheSize() + m_inputs.size() + outputs_count); // TODO remove
     
             m_barrier.arrive_and_wait();
             // Cleanup after all worker threads have exited the inner loop.
    @@ -186,11 +185,9 @@ public:
     
         ~InputFetcher()
         {
    -        m_request_stop = true;
    -        m_barrier.arrive_and_wait();
    -        for (auto& t : m_worker_threads) {
    -            t.join();
    -        }
    +        m_input_head.store(-1, std::memory_order_relaxed);
    +        m_barrier.arrive_and_drop();
    +        for (auto& t : m_worker_threads) t.join();
         }
     };
     
    diff --git a/src/test/inputfetcher_tests.cpp b/src/test/inputfetcher_tests.cpp
    index b92a15d291..3d085bc843 100644
    --- a/src/test/inputfetcher_tests.cpp
    +++ b/src/test/inputfetcher_tests.cpp
    @@ -14,10 +14,8 @@
     
     #include <boost/test/unit_test.hpp>
     
    -#include <cstdint>
     #include <memory>
     #include <stdexcept>
    -#include <string>
     #include <unordered_set>
     
     BOOST_AUTO_TEST_SUITE(inputfetcher_tests)
    @@ -82,7 +80,7 @@ BOOST_FIXTURE_TEST_CASE(fetch_inputs, InputFetcherTest)
             }
     
             CCoinsViewCache main_cache(&db);
    -        CCoinsViewCache cache(&cache);
    +        CCoinsViewCache cache(&main_cache);
             getFetcher().FetchInputs(cache, main_cache, db, block);
     
             std::unordered_set<Txid, SaltedTxidHasher> txids{};
    diff --git a/src/validation.cpp b/src/validation.cpp
    index 7564b97a07..07ab71852f 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -6299,7 +6299,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
     
     ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
         : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    -      m_input_fetcher{std::clamp<size_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    +      m_input_fetcher{std::clamp<int32_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
           m_interrupt{interrupt},
           m_options{Flatten(std::move(options))},
           m_blockman{interrupt, std::move(blockman_options)},
    

    </details>

  258. andrewtoth force-pushed on Nov 2, 2025
  259. l0rinc commented at 9:34 AM on November 3, 2025: contributor

    I have the IBD numbers for the i7-hdd and i9-ssd servers. They're not as glorious as our reindex-chainstate measurements, most likely because I don't yet have a way to test IBD from extremely fast peers. But as a sanity check I think it's fine - we're still bandwidth bound, which is a good problem to have.

    <details> <summary>7% faster IBD | 921129 blocks | dbcache 4500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; \
    STOP=921129; DBCACHE=4500; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)      
      Time (mean ± σ):     36470.995 s ± 113.187 s    [User: 38024.880 s, System: 2035.545 s]
      Range (min … max):   36390.960 s … 36551.030 s    2 runs
      
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)                                                                         
      Time (mean ± σ):     33962.782 s ± 375.163 s    [User: 41686.176 s, System: 2832.686 s]
      Range (min … max):   33697.502 s … 34228.062 s    2 runs
      
    Relative speed comparison
            1.07 ±  0.01  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -blocksonly -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
    

    </details>

    and

    <details> <summary>12% faster IBD | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    COMMITS="aeec2e421d2ba102d905633d474f0fb88f91a9bf 62868c8846f043477d128788eadced3e71522417"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    aeec2e421d coins: add inputfetcher
    62868c8846 validation: fetch block inputs via InputFetcher before connecting
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = aeec2e421d2ba102d905633d474f0fb88f91a9bf)
      Time (mean ± σ):     30907.958 s ± 1761.510 s    [User: 51503.334 s, System: 3833.881 s]
      Range (min … max):   29662.383 s … 32153.534 s    2 runs
     
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 62868c8846f043477d128788eadced3e71522417)
      Time (mean ± σ):     27504.900 s ± 2529.652 s    [User: 59336.996 s, System: 5796.247 s]
      Range (min … max):   25716.165 s … 29293.634 s    2 runs
     
    Relative speed comparison
            1.12 ±  0.12  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = aeec2e421d2ba102d905633d474f0fb88f91a9bf)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 62868c8846f043477d128788eadced3e71522417)
    

    </details>

    and

    <details> <summary>9% faster IBD | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="2aa510348143521a14146e41b5cf87cb3e60b29e cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
      --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
                 grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting
    cb0fdfdf37 coins: add inputfetcher
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
      Time (mean ± σ):     61239.351 s ± 4942.104 s    [User: 90457.852 s, System: 16836.057 s]
      Range (min … max):   57744.756 s … 64733.946 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (mean ± σ):     66848.997 s ± 1122.176 s    [User: 88025.800 s, System: 11057.384 s]
      Range (min … max):   66055.499 s … 67642.496 s    2 runs
      
    Relative speed comparison
            1.09 ±  0.09  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>


    The reindex-chainstate cases (which we can look at as a more stable way of testing offline IBD) show very good results even for the max-memory use case (45 GB dbcache) - and confirm @andrewtoth's claim that we may be able to deprecate the -dbcache argument, since it has barely any effect after this change!

    <details> <summary>3% faster reindex-chainstate | 921129 blocks | dbcache 45000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; \
    STOP=921129; DBCACHE=45000; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
      Time (abs ≡):        16044.026 s               [User: 23421.874 s, System: 695.027 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
      Time (abs ≡):        15643.115 s               [User: 26237.588 s, System: 984.588 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
    

    </details>

    The same measurement with the default dbcache is essentially as fast as with max memory (450 MB -> 15818 seconds vs 45 GB -> 15643 seconds):

    <details> <summary>30% faster reindex-chainstate | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="bf07cf0adf19889727cb6bea24ebfbbfcc231a0c 45fe0c0e5beddce1c9e836ab5d97aa064069c192"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    bf07cf0adf coins: add inputfetcher
    45fe0c0e5b validation: fetch block inputs via InputFetcher before connecting
     
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
      Time (abs ≡):        20500.654 s               [User: 40766.355 s, System: 2845.314 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
      Time (abs ≡):        15818.604 s               [User: 45952.420 s, System: 4127.137 s]
     
    Relative speed comparison
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 45fe0c0e5beddce1c9e836ab5d97aa064069c192)
            1.30          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = bf07cf0adf19889727cb6bea24ebfbbfcc231a0c)
    

    </details>

    <details> <summary>17% faster reindex-chainstate | 921129 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85 2aa510348143521a14146e41b5cf87cb3e60b29e"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") && hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log; \
                  grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    cb0fdfdf37 coins: add inputfetcher                                            
    2aa5103481 validation: fetch block inputs via InputFetcher before connecting                                                                                
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
      Time (abs ≡):        43407.876 s               [User: 40230.765 s, System: 3077.358 s]
                                                                                  
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)                       
      Time (abs ≡):        37189.669 s               [User: 45706.002 s, System: 4452.708 s]
      
    Relative speed comparison
            1.17          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    </details>

    <img width="1503" height="869" alt="image" src="https://github.com/user-attachments/assets/2a3fa7c0-6d5e-4afb-a346-99aa733ac9e8" />

  260. in src/test/fuzz/inputfetcher.cpp:50 in 21e5e10bf3 outdated
      45 | +    SeedRandomStateForTest(SeedRand::ZEROS);
      46 | +    FuzzedDataProvider fuzzed_data_provider(buffer.data(), buffer.size());
      47 | +
      48 | +    const auto worker_threads{
      49 | +        fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
      50 | +    InputFetcher fetcher{worker_threads};
    


    sedited commented at 11:19 AM on November 3, 2025:

    I'm observing a memory leak in this fuzz test similar to the one we had for the thread pool. Over there, we disabled logging and instantiate only a single pool instance: https://github.com/bitcoin/bitcoin/pull/33689/files#diff-68602d972fe2b027e3987eff0042c27f3f00fc161b7fe871bdb147571f348298R49-R58 . Maybe similar things can be done here?


    l0rinc commented at 11:22 AM on November 3, 2025:

    Thanks for reporting, that explains why I was seeing

    bench_bitcoin(10874,0x207c84800) malloc: Failed to allocate segment from range group - out of space
    

    and

    bitcoind(70369,0x16f5ff000) malloc: Failed to allocate segment from range group - out of space
    

    recently (cc: @maflcko)


    maflcko commented at 11:44 AM on November 3, 2025:

    Interesting, how would a fuzz target lead to a crash in bench or bitcoind? Did you run it in parallel?

    Though, the suggestion to disable logging is correct, because while fuzzing, we probably don't want to spend cycles on log formatting. I guess those logs only end up in the buffer which causes the memory to grow? (There should be a DEFAULT_MAX_LOG_BUFFER, so if buffering is the issue, the memory should be limited)


    sedited commented at 12:23 PM on November 3, 2025:

    The following patch seems to stabilize memory consumption:

    diff --git a/src/test/fuzz/inputfetcher.cpp b/src/test/fuzz/inputfetcher.cpp
    index cd2a0f5c68..609b8e1191 100644
    --- a/src/test/fuzz/inputfetcher.cpp
    +++ b/src/test/fuzz/inputfetcher.cpp
    @@ -43 +43,10 @@ struct NoAccessCoinsView : CCoinsView
    -FUZZ_TARGET(inputfetcher)
    +std::optional<InputFetcher> g_fetcher{};
    +
    +static void setup_threadpool_test()
    +{
    +    LogInstance().DisableLogging();
    +    g_fetcher.emplace(3);
    +}
    +
    +FUZZ_TARGET(inputfetcher, .init = setup_threadpool_test)
    @@ -48,4 +56,0 @@ FUZZ_TARGET(inputfetcher)
    -    const auto worker_threads{
    -        fuzzed_data_provider.ConsumeIntegralInRange<int32_t>(2, 4)};
    -    InputFetcher fetcher{worker_threads};
    -
    @@ -115 +120 @@ FUZZ_TARGET(inputfetcher)
    -        fetcher.FetchInputs(cache, main_cache, db, block);
    +        g_fetcher->FetchInputs(cache, main_cache, db, block);
    

    l0rinc commented at 12:27 PM on November 3, 2025:

    Are we sure we're not just masking a real problem with the disabled logger?


    sedited commented at 12:33 PM on November 3, 2025:

    The logger is way less problematic in terms of its effect on the memory growth, and I find it difficult to really pin down its effect. Having a global input fetcher that does not get instantiated with every fuzzer iteration has an immediate and clear effect. It is not clear to me if we are actually leaking anything through the threads, or if creating and destroying thousands of threads per second puts too much pressure on the OS (same for the threadpool).


    andrewtoth commented at 10:18 PM on November 4, 2025:

    @TheCharlatan thanks for fuzzing, and the diff for the fuzzer! I have taken it, and added you as a co-author :heart_hands:. @l0rinc it is concerning that you are getting malloc errors. Are there any other details you can share about this?


    l0rinc commented at 7:49 AM on November 6, 2025:

    Did the same on master:

    git log -1
    commit 5c5704e730796c6f31e2d7891bf6334674a04219 (HEAD, upstream/master, upstream/HEAD, origin/master, origin/HEAD)
    

    and unfortunately I'm getting the same:

    time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0\
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0\
    && time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0
    bitcoind(71239,0x170f2f000) malloc: Failed to allocate segment from range group - out of space
    ...
    

    so it's not related to this PR

    Also fuzzed for almost a day, no problems came up:

    ...
    #12935791       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1640/4082 MS: 3 EraseBytes-PersAutoDict-InsertRepeatedBytes- DE: "\001\\"-
    #12941278       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1734/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\377\031"-
    #12943860       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1303/4082 MS: 2 InsertByte-EraseBytes-
    #12956441       REDUCE cov: 1485 ft: 9496 corp: 917/727Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2407/4082 MS: 1 EraseBytes-
    #12973964       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2400/4082 MS: 3 ChangeByte-ChangeBinInt-EraseBytes-
    #12980710       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1835/4082 MS: 1 EraseBytes-
    #12986931       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 471/4082 MS: 1 EraseBytes-
    #12987148       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 1295/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\377\377\377\377"-
    #13001660       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2399/4082 MS: 2 PersAutoDict-EraseBytes- DE: "\332\377\377\377"-
    #13008530       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 3079/4082 MS: 5 ChangeByte-PersAutoDict-EraseBytes-ChangeASCIIInt-InsertRepeatedBytes- DE: "\363\006\000\000\000\000\000\000"-
    #13015051       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 3063/4082 MS: 1 EraseBytes-
    #13026498       REDUCE cov: 1485 ft: 9496 corp: 917/726Kb lim: 4096 exec/s: 475 rss: 189Mb L: 2374/4082 MS: 2 ChangeBinInt-EraseBytes-
    
  261. andrewtoth force-pushed on Nov 4, 2025
  262. DrahtBot added the label CI failed on Nov 4, 2025
  263. andrewtoth force-pushed on Nov 4, 2025
  264. DrahtBot removed the label CI failed on Nov 4, 2025
  265. andrewtoth commented at 1:46 AM on November 5, 2025: contributor

    Benchmarked the latest up to block 921129 and it's 16% faster :rocket:. Not as fast as some of @l0rinc's numbers, but this is on a laptop with an internal NVMe SSD. This change will benefit most when disk IO has higher latency, like network-connected storage.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | `echo d606c36a13ca2a055d1a4eb4c623fb6aa45405b2 && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 18498.670 ± 16.716 | 18486.850 | 18510.490 | 1.00 |
    | `echo 25c45bb0d0bd6618ec9296a1a43605657124e5de && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 21537.077 ± 123.626 | 21449.660 | 21624.494 | 1.16 ± 0.01 |

    Also refactored to not stop early if an input is missing. This lets us simplify the logic: we can get rid of the different status flags and just synchronize each input on an atomic bool.

  266. andrewtoth force-pushed on Nov 5, 2025
  267. andrewtoth force-pushed on Nov 6, 2025
  268. in src/inputfetcher.h:81 in 63dde36c1d outdated
      76 | +     * @return false if there are no more inputs in the queue to fetch
      77 | +     */
      78 | +    bool FetchCoin() noexcept
      79 | +    {
      80 | +        const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      81 | +        if (i >= m_inputs.size()) [[unlikely]] return false;
    


    l0rinc commented at 3:22 PM on November 6, 2025:

    when can this be true?


    andrewtoth commented at 5:28 PM on November 6, 2025:

    This is true when all inputs have been fetched from the block. We want the compiler to optimize for the case where we have work.

  269. in src/inputfetcher.h:83 in 63dde36c1d
      78 | +    bool FetchCoin() noexcept
      79 | +    {
      80 | +        const size_t i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
      81 | +        if (i >= m_inputs.size()) [[unlikely]] return false;
      82 | +        auto& input{m_inputs[i]};
      83 | +        if (std::binary_search(m_txids.begin(), m_txids.end(), input.outpoint.hash.ToUint256().GetUint64(0))) {
    


    l0rinc commented at 3:22 PM on November 6, 2025:
            if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    
  270. in src/inputfetcher.h:93 in 63dde36c1d
      88 | +        auto coin{m_cache->GetPossiblySpentCoinFromCache(input.outpoint)};
      89 | +        if (!coin) {
      90 | +            try {
      91 | +                coin = m_db->GetCoin(input.outpoint);
      92 | +            } catch (const std::runtime_error& e) {
      93 | +                LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.\n", e.what());
    


    l0rinc commented at 3:24 PM on November 6, 2025:

    nit: trailing newline shouldn't be needed anymore

                    LogPrintLevel(BCLog::VALIDATION, BCLog::Level::Warning, "InputFetcher failed to fetch input: %s.", e.what());
    
  271. in src/inputfetcher.h:120 in 63dde36c1d
     115 | +            const auto& tx{block.vtx[i]};
     116 | +            outputs_count += tx->vout.size();
     117 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     118 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     119 | +        }
     120 | +        std::sort(m_txids.begin(), m_txids.end());
    


    l0rinc commented at 3:24 PM on November 6, 2025:
            std::ranges::sort(m_txids);
    
  272. DrahtBot added the label CI failed on Nov 6, 2025
  273. DrahtBot commented at 3:25 PM on November 6, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task lint: https://github.com/bitcoin/bitcoin/actions/runs/19139133943/job/54698986986</sub> <sub>LLM reason (✨ experimental): Trailing whitespace detected in src/inputfetcher.h caused the lint check to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  274. in src/inputfetcher.h:134 in 63dde36c1d
     129 | +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
     130 | +        for (auto& input : m_inputs) {
     131 | +            while (!input.ready.test(std::memory_order_acquire)) {
     132 | +                // Work too while we wait
     133 | +                if (!FetchCoin()) {
     134 | +                    input.ready.wait(false, std::memory_order_acquire);
    


    l0rinc commented at 3:26 PM on November 6, 2025:

    based on https://en.cppreference.com/w/cpp/atomic/atomic_flag/wait.html

                        input.ready.wait(/*old*/false, std::memory_order_acquire);
    
  275. in src/inputfetcher.h:132 in 63dde36c1d
     127 | +
     128 | +        // Insert fetched coins into the temp_cache as they are set to ready.
     129 | +        temp_cache.Reserve(temp_cache.GetCacheSize() + m_inputs.size() + outputs_count);
     130 | +        for (auto& input : m_inputs) {
     131 | +            while (!input.ready.test(std::memory_order_acquire)) {
     132 | +                // Work too while we wait
    


    l0rinc commented at 3:27 PM on November 6, 2025:

    not yet sure I fully understand why this is needed; it's more complex, but does it result in at least a theoretical speedup?


    andrewtoth commented at 5:31 PM on November 6, 2025:

    The work done on the other threads is much slower than what the main thread is doing. Fetching inputs possibly from disk vs inserting coins into the temporary cache. So, in cases where there are fewer worker threads the main thread will likely be waiting on the work to be done. In these cases we can just start fetching as well. I think this would have a large impact in cases where there are few worker threads (like rpis), or if using a low but > 1 -par value. For instance, using -par=2 this would in theory double the efficiency of fetching inputs from disk. It is likely not a measurable effect for setups that have 16 or more vcpus.


    l0rinc commented at 6:09 PM on November 6, 2025:

    I'm not sure I understand why: what's the difference between two threads working while main is waiting vs one thread working and main also working? Where does the difference come from (especially given that the work is not fully CPU-bound)?


    andrewtoth commented at 6:29 PM on November 6, 2025:

    two threads working while main is waiting vs one thread working and main also working?

    It would be two threads working while main is waiting vs two threads working and main also working? So the latter has 3 threads working vs the former's 2 threads working. If using -par=3 this is the case. 50% more work is done in parallel.


    l0rinc commented at 6:52 PM on November 6, 2025:

    So why not just do the work that par defined and leave main asleep, wouldn't that be simpler while basically achieving exactly the same?


    andrewtoth commented at 7:02 PM on November 6, 2025:

    No, because we would have to insert all entries into the temp cache at the end, instead of parallelizing that work as well. That was the previous implementation, where we waited for every thread to be done and then inserted everything in series before exiting. Now the main thread does both: it inserts while others are fetching, but if it inserts fast enough that it ends up waiting for new entries, it will also fetch entries.


    andrewtoth commented at 7:10 PM on November 6, 2025:

    On a system with 15 worker threads + main, it is likely that main will not be waiting much. The other 15 threads are busy setting newly fetched coins to ready, so the main thread can continuously read true for the ready flags. On a system with only 3 worker threads + main, it is likely that the 3 workers will not be able to fetch and mark coins ready fast enough for the main thread to never wait. For instance, all 3 threads are fetching from disk, and the main thread reads the next input while its ready flag is still false. It can either wait until one of the 3 workers fetches an input from disk, or start helping out and fetch from disk itself. This increases parallel throughput by 33%, since 4 workers are fetching instead of just 3. When main returns with its fetched coin, it can insert the rest of the coins that the workers fetched and marked ready while it was busy. Then it catches up again to the latest input that is not yet ready, and fetches another coin.


    l0rinc commented at 8:18 PM on November 6, 2025:

    We discussed this out of band, here's the summary:

    • the main thread is special because it has access to the dbcache, it can insert there without locking the cache.
    • the threads each have their inputs now, each of which have a switch to signal when they've fetched the input and move on to the next
    • the threads each compete for which input to fetch, marking them one-by-one as ready
    • the main thread goes in order, spins until the next one is available, if it needs to wait, it does some fetching itself, rechecks later if the given value is available and if so, it inserts to the dbcache, after which it checks the next value in order.

    This way the IO and CPU bound work is parallelized, so we don't need to do the heavy rehashing at the end, it's done while the other threads are doing IO work - that's why it's faster.

    I have to think about this, it sounds like we can simplify this further, but this is already an improvement over previous solutions.


    And the worst that can happen for an internal spend is that another input in the same block shares the same prefix: we wouldn't fetch it here, and it would be fetched during block connection as before. This means we likely don't even need 64 bits for that; 32 are likely enough. Back-of-the-napkin calculations indicate that roughly 1/1000 blocks would contain transactions that won't be fetched by InputFetcher and would need to be fetched during block connection instead, on a single thread. As long as 32-bit checks are faster than 64-bit (which should definitely be the case for the sorted-vector case), this should likely result in an overall speedup. We definitely need to add a test case for that.

  276. in src/inputfetcher.h:139 in 63dde36c1d
     134 | +                    input.ready.wait(false, std::memory_order_acquire);
     135 | +                    break;
     136 | +                }
     137 | +            }
     138 | +            if (input.coin.IsSpent()) continue;
     139 | +            temp_cache.EmplaceCoinInternalDANGER(COutPoint{input.outpoint}, std::move(input.coin));
    


    l0rinc commented at 3:27 PM on November 6, 2025:

    `continue` effectively skipping the last line is a bit confusing and dangerous; it should only skip the next line:

                if (!input.coin.IsSpent()) {
                    temp_cache.EmplaceCoinInternalDANGER(COutPoint{input.outpoint}, std::move(input.coin));
                }
    
  277. in src/inputfetcher.h:61 in 63dde36c1d outdated
      56 | +    /**
      57 | +     * The set of first 8 bytes of txids of all txs in the block being fetched.
      58 | +     * Used to filter out inputs that are created and spent in the same block,
      59 | +     * since they will not be in the db or the cache.
      60 | +     */
      61 | +    std::vector<uint64_t> m_txids{};
    


    l0rinc commented at 3:29 PM on November 6, 2025:

    should we document here what happens in case of a cache miss or collision?

  278. in src/inputfetcher.h:76 in 63dde36c1d
      71 | +
      72 | +    /**
      73 | +     * Fetches the next input in the queue. Safe to call from any thread once inside the barrier.
      74 | +     * 
      75 | +     * @return true if there are more inputs in the queue to fetch
      76 | +     * @return false if there are no more inputs in the queue to fetch
    


    l0rinc commented at 3:33 PM on November 6, 2025:

    this will fix the linter failure as well (whitespace after *):

         * 
     * @return whether there are more inputs in the queue to fetch
    
  279. l0rinc changes_requested
  280. in src/inputfetcher.h:78 in 63dde36c1d
      73 | +     * Fetches the next input in the queue. Safe to call from any thread once inside the barrier.
      74 | +     * 
      75 | +     * @return true if there are more inputs in the queue to fetch
      76 | +     * @return false if there are no more inputs in the queue to fetch
      77 | +     */
      78 | +    bool FetchCoin() noexcept
    


    l0rinc commented at 5:00 PM on November 6, 2025:

    can we call it something else to not coincide with CCoinsViewCache::FetchCoin

  281. andrewtoth force-pushed on Nov 6, 2025
  282. DrahtBot removed the label CI failed on Nov 6, 2025
  283. andrewtoth force-pushed on Nov 8, 2025
  284. andrewtoth force-pushed on Nov 8, 2025
  285. andrewtoth force-pushed on Nov 12, 2025
  286. andrewtoth force-pushed on Nov 14, 2025
  287. l0rinc commented at 10:29 AM on November 14, 2025: contributor

    I was still wondering how the number of parallel threads affects this, given that it's not a CPU-bound task.

    The measurements were done on an i9, an M4 and an rpi4. The first two have 16 threads, the rpi has 4. The results still indicate to me that it doesn't make sense to set parallelism directly from the number of CPUs: beyond 4-8 threads the systems didn't really perform any better.

    <img width="1489" height="861" alt="image" src="https://github.com/user-attachments/assets/c6065154-0625-4609-928f-867c957ba2e4" />

    I will continue measuring it on other systems as well, but wanted to share preliminary results since these measurements take a lot of time.

    <details> <summary>i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 2 4 8 16 32 64; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par; done
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    real    94m1.512s
    user    169m2.995s
    sys     11m59.017s
    
    real    86m23.533s
    user    171m32.112s
    sys     12m42.929s
    
    real    82m58.013s
    user    179m20.981s
    sys     14m25.288s
    
    real    82m38.540s
    user    197m24.427s
    sys     20m31.753s
    
    real    82m38.442s
    user    197m16.954s
    sys     20m32.770s
    
    real    82m46.060s
    user    197m44.609s
    sys     20m37.426s
    

    </details>

    <details> <summary>Macbook Pro M4 Max</summary>

    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && \
    git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
    for par in 2 4 8 16 32 64; do
      time ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10711.00s user 1495.61s system 181% cpu 1:51:52.59 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10601.27s user 1311.29s system 207% cpu 1:35:49.25 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     11362.34s user 2558.23s system 242% cpu 1:35:29.27 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12203.63s user 5783.56s system 299% cpu 1:40:07.62 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12124.30s user 5764.53s system 298% cpu 1:40:01.07 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12276.36s user 5816.05s system 300% cpu 1:40:13.27 total
    

    </details>

    <details> <summary>rpi4-8-1</summary>

    root@rpi4-8-1:/mnt/my_storage/bitcoin# for par in 4 8; do \
        COMMITS="c2b0239001629a43d50cb8eb00e884423db89b38"; \
        STOP=700000; DBCACHE=450; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par"; \
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=4 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36057.895 s               [User: 61089.585 s, System: 12571.590 s]
     
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=8 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        35682.583 s               [User: 61828.582 s, System: 14162.402 s]
    

    </details>

    Edit: updated times:

    <img width="1495" height="866" alt="image" src="https://github.com/user-attachments/assets/d5c8257f-5b46-40ce-9730-6a6b39049fb1" />

    Edit2: update with even more fine-grained measurements <img width="1493" height="868" alt="image" src="https://github.com/user-attachments/assets/7636f9f0-3e3a-484f-a8c0-61a00a6515fa" />

    <details> <summary>Raw measurements</summary>

    
    i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 2 4 8 16 32 64; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par; done
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    real    94m1.512s
    user    169m2.995s
    sys     11m59.017s
    
    real    86m23.533s
    user    171m32.112s
    sys     12m42.929s
    
    real    82m58.013s
    user    179m20.981s
    sys     14m25.288s
    
    real    82m38.540s
    user    197m24.427s
    sys     20m31.753s
    
    real    82m38.442s
    user    197m16.954s
    sys     20m32.770s
    
    real    82m46.060s
    user    197m44.609s
    sys     20m37.426s
    
    
    
    root@rpi4-8-1:/mnt/my_storage/bitcoin# for par in 2 4 8 16; do \
        COMMITS="c2b0239001629a43d50cb8eb00e884423db89b38"; \
        STOP=700000; DBCACHE=450; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par"; \
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=2 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        39770.114 s               [User: 62016.921 s, System: 12176.118 s]
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=4 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36057.895 s               [User: 61089.585 s, System: 12571.590 s]
    
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=8 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        35682.583 s               [User: 61828.582 s, System: 14162.402 s]
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=16 (COMMIT = c2b0239001629a43d50cb8eb00e884423db89b38)
      Time (abs ≡):        36414.980 s               [User: 63043.265 s, System: 16780.513 s]
    
    
    
    
    
    rpi5-16-1:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10; do   time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -par=$par -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    393m14.069s
    user    611m13.972s
    sys     64m44.860s
    
    real    335m41.301s
    user    615m9.967s
    sys     61m58.381s
    
    real    314m37.969s
    user    617m42.106s
    sys     60m47.302s
    
    real    313m26.580s
    user    619m26.740s
    sys     63m59.801s
    
    real    314m35.267s
    user    622m30.314s
    sys     66m53.456s
    
    real    314m14.335s
    user    621m22.819s
    sys     68m30.831s
    
    real    314m49.454s
    user    621m3.552s
    sys     70m1.733s
    
    real    316m10.270s
    user    624m1.233s
    sys     70m48.074s
    
    real    315m57.060s
    user    619m48.948s
    sys     72m10.586s
    
    real    316m27.926s
    user    622m56.166s
    sys     73m16.170s
    
    
    
    i9:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18; do   time ./build/bin/bitcoind -par=$par -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    395m38.371s
    user    624m54.760s
    sys     50m2.017s
    
    real    321m54.706s
    user    638m0.914s
    sys     45m16.018s
    
    real    295m13.108s
    user    635m21.111s
    sys     44m41.847s
    
    real    281m53.289s
    user    636m26.743s
    sys     45m4.569s
    
    real    274m35.436s
    user    638m44.171s
    sys     46m25.320s
    
    real    270m4.622s
    user    641m52.896s
    sys     45m44.507s
    
    real    266m44.501s
    user    647m12.988s
    sys     47m8.929s
    
    real    265m9.994s
    user    661m8.192s
    sys     48m54.941s
    
    real    263m49.009s
    user    675m41.091s
    sys     49m52.024s
    
    real    262m13.541s
    user    687m23.237s
    sys     51m29.839s
    
    real    262m29.564s
    user    701m19.813s
    sys     52m20.824s
    
    real    261m29.692s
    user    717m48.727s
    sys     53m54.110s
    
    real    260m36.882s
    user    727m35.223s
    sys     56m30.900s
    
    real    259m58.230s
    user    740m17.961s
    sys     57m37.615s
    
    real    259m46.496s
    user    750m59.419s
    sys     59m55.428s
    
    real    262m3.069s
    user    756m49.146s
    sys     63m51.639s
    
    real    262m29.892s
    user    755m7.876s
    sys     63m29.261s
    
    real    262m54.519s
    user    759m4.919s
    sys     62m51.652s
    
    
    
    i7:
    commit=d6fac85ee4465cce8e81e36cdfd46636d34725fa && git log -1 --pretty='%h %s' $commit && git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && for par in 1 2 3 4 5 6 7 8 9 10 11 12 13; do   time ./build/bin/bitcoind -par=$par -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0; done
    
    d6fac85ee4 validation: fetch block inputs via InputFetcher before connecting
    
    real    677m53.522s
    user    540m13.974s
    sys     48m0.685s
    
    real    616m58.994s
    user    547m24.056s
    sys     46m0.808s
    
    real    582m52.884s
    user    554m55.162s
    sys     45m21.949s
    
    real    577m5.937s
    user    563m47.629s
    sys     45m28.084s
    
    real    568m25.225s
    user    576m19.153s
    sys     46m34.582s
    
    real    566m31.568s
    user    586m8.162s
    sys     46m41.220s
    
    real    564m43.096s
    user    594m42.382s
    sys     49m17.091s
    
    real    558m40.218s
    user    600m10.625s
    sys     52m24.030s
    
    real    556m23.944s
    user    606m36.724s
    sys     55m27.147s
    
    real    565m47.020s
    user    607m41.449s
    sys     56m21.510s
    
    real    563m37.784s
    user    609m40.123s
    sys     57m35.573s
    
    real    567m43.207s
    user    608m53.189s
    sys     57m43.208s
    
    real    563m14.968s
    user    611m1.819s
    sys     58m43.991s
    
    
    Macbook Pro M4 Max
    commit=c2b0239001629a43d50cb8eb00e884423db89b38 && git log -1 --pretty='%h %s' $commit && \
    git checkout $commit >/dev/null 2>&1 && rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
    for par in 2 4 8 16 32 64; do
      time ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=$par
    done
    
    c2b0239001 validation: fetch block inputs via InputFetcher before connecting
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10711.00s user 1495.61s system 181% cpu 1:51:52.59 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     10601.27s user 1311.29s system 207% cpu 1:35:49.25 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     11362.34s user 2558.23s system 242% cpu 1:35:29.27 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12203.63s user 5783.56s system 299% cpu 1:40:07.62 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12124.30s user 5764.53s system 298% cpu 1:40:01.07 total
    ./build/bin/bitcoind -stopatheight=800000 -dbcache=450 -reindex-chainstate     12276.36s user 5816.05s system 300% cpu 1:40:13.27 total
    

    </details>

  288. andrewtoth force-pushed on Nov 23, 2025
  289. andrewtoth force-pushed on Nov 23, 2025
  290. DrahtBot added the label CI failed on Nov 23, 2025
  291. DrahtBot commented at 1:55 AM on November 23, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task fuzzer,address,undefined,integer, no depends: https://github.com/bitcoin/bitcoin/actions/runs/19603900050/job/56139454218</sub> <sub>LLM reason (✨ experimental): Fuzz run crashes with UndefinedBehaviorSanitizer: null-pointer-use (null CTransaction dereference in CoinsViewCacheAsync).</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  292. andrewtoth marked this as a draft on Nov 23, 2025
  293. andrewtoth force-pushed on Nov 23, 2025
  294. andrewtoth force-pushed on Nov 23, 2025
  295. andrewtoth force-pushed on Nov 23, 2025
  296. andrewtoth force-pushed on Nov 23, 2025
  297. andrewtoth force-pushed on Nov 23, 2025
  298. andrewtoth force-pushed on Nov 23, 2025
  299. andrewtoth force-pushed on Nov 23, 2025
  300. andrewtoth force-pushed on Nov 23, 2025
  301. DrahtBot removed the label CI failed on Nov 23, 2025
  302. andrewtoth marked this as ready for review on Nov 23, 2025
  303. andrewtoth commented at 5:13 PM on November 23, 2025: contributor

    I've updated the PR to make the InputFetcher a subclass of CCoinsViewCache. Instead of waiting on the MPSC queue to finish before connecting the block, the queue can now be processed inside CCoinsViewCache::FetchCoin during ConnectBlock. This makes the fetching non-blocking, which is a significant performance improvement. It's been renamed to CoinsViewCacheAsync.

    @l0rinc thank you for your very thorough benchmarks. I've updated this to use 4 worker threads, which yields the same speed as 15 threads on my benchmark machine. I've also made the thread count non-configurable, as I don't see a reason why a user would want to change it. It should help performance on single-core machines as well, since the parallel work is I/O bound.

    This new non-blocking version using 4 threads is significantly faster. On the same machine where the previous version yielded a 16% IBD speedup to block 921129, this version is now 21% faster :rocket:. I'm curious to see benchmarks on other machines.

    | Command | Mean [s] | Min [s] | Max [s] | Relative |
    |:---|---:|---:|---:|---:|
    | `echo 42e68d48b4da838361b045a977242d3262f8b351 && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 17842.357 ± 131.143 | 17749.624 | 17935.089 | 1.00 |
    | `echo 25c45bb0d0bd6618ec9296a1a43605657124e5de && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=921129` | 21537.077 ± 123.626 | 21449.660 | 21624.494 | 1.21 |
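
    The non-blocking consume path can be sketched roughly as follows. This is a minimal standalone model under assumed shape, not the PR's code; `Slot` and `process_one` are illustrative names. The key idea is that when the coin the main thread needs has not been fetched yet, it claims and processes other pending inputs itself rather than blocking, so validation never stalls behind a slow worker:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <iostream>
    #include <thread>
    #include <vector>

    struct Slot {
        int value{0};
        std::atomic_flag ready{}; // set once the slot has been filled
    };

    int main()
    {
        constexpr int N{256};
        std::vector<Slot> slots(N);
        std::atomic<int> head{0};

        // Claim and process one pending input; returns false when none are left.
        auto process_one = [&]() -> bool {
            const int i{head.fetch_add(1, std::memory_order_relaxed)};
            if (i >= N) return false;
            slots[i].value = i + 1; // simulate a slow DB read
            slots[i].ready.test_and_set(std::memory_order_release);
            slots[i].ready.notify_one();
            return true;
        };

        std::jthread worker{[&] { while (process_one()) {} }};

        long sum{0};
        for (int i{0}; i < N; ++i) {
            // Instead of spinning on slot i, help fetch other inputs; only
            // wait once every input has been claimed by someone.
            while (!slots[i].ready.test(std::memory_order_acquire)) {
                if (!process_one()) {
                    slots[i].ready.wait(false, std::memory_order_acquire);
                    break;
                }
            }
            sum += slots[i].value;
        }
        std::cout << sum << "\n"; // 1 + 2 + ... + 256
        assert(sum == 32896);
        return 0;
    }
    ```

    The acquire load on `ready` pairs with the worker's release store, so the filled `value` is visible to the consumer without any lock.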
  304. andrewtoth force-pushed on Nov 24, 2025
  305. andrewtoth force-pushed on Nov 24, 2025
  306. DrahtBot added the label CI failed on Nov 24, 2025
  307. DrahtBot commented at 1:28 AM on November 24, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task fuzzer,address,undefined,integer, no depends: https://github.com/bitcoin/bitcoin/actions/runs/19620057148/job/56178655244</sub> <sub>LLM reason (✨ experimental): libFuzzer crash (deadly signal) from an assertion in DbCoinsView::GetCoin during fuzzing.</sub>


  308. andrewtoth force-pushed on Nov 24, 2025
  309. andrewtoth force-pushed on Nov 24, 2025
  310. andrewtoth force-pushed on Nov 24, 2025
  311. andrewtoth force-pushed on Nov 24, 2025
  312. andrewtoth force-pushed on Nov 24, 2025
  313. DrahtBot removed the label CI failed on Nov 25, 2025
  314. andrewtoth force-pushed on Nov 25, 2025
  315. andrewtoth force-pushed on Nov 25, 2025
  316. andrewtoth force-pushed on Nov 26, 2025
  317. andrewtoth force-pushed on Nov 27, 2025
  318. in src/coins.h:405 in 0f3778fbfb
     400 | @@ -401,6 +401,14 @@ class CCoinsViewCache : public CCoinsViewBacked
     401 |       */
     402 |      bool HaveCoinInCache(const COutPoint &outpoint) const;
     403 |  
     404 | +    /**
     405 | +     * Retrieves the coin from the cache even if it is spent, without calling
    


    l0rinc commented at 12:21 PM on November 27, 2025:

    nit: other comments use the bare verb form

         * Retrieve the coin from the cache even if it is spent, without calling
    
  319. in src/coinsviewcacheasync.h:85 in 0f3778fbfb outdated
      80 | +     * Similar to CCoinsViewCache::GetCoin, but it does not mutate internally.
      81 | +     * Therefore safe to call from any thread once inside the barrier.
      82 | +     */
      83 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      84 | +    {
      85 | +        auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)};
    


    l0rinc commented at 12:28 PM on November 27, 2025:

    This highlights that our dbcache layering is a bit messy – having to static_cast the base view to CCoinsViewCache here is quite ugly. It would be good to clean this up in a follow-up.

  320. in src/test/fuzz/coinsviewcacheasync.cpp:164 in 0f3778fbfb outdated
     159 | +                assert(coin->nHeight == db_coin.nHeight);
     160 | +                assert(coin->out == db_coin.out);
     161 | +            }
     162 | +        }
     163 | +        assert(cache.GetCacheSize() == outpoints_in_cache.size());
     164 | +        fuzzed_data_provider.ConsumeBool() ? (void)cache.Flush() : cache.Reset();
    


    l0rinc commented at 1:05 PM on November 27, 2025:

    We’re moving towards Flush not returning a value – can we avoid adding new code that relies on its return? Not sure it’s possible before that other PR lands, but worth keeping in mind.


    andrewtoth commented at 4:53 PM on November 27, 2025:

    The (void) here does not rely on the return value. Or do you mean something else?


    l0rinc commented at 5:09 PM on November 27, 2025:

    we wouldn't need it if Flush were a void itself, like Reset, right?


    andrewtoth commented at 5:15 PM on November 27, 2025:

    Yes, but if the other PR is merged, then this will still compile with no errors or warnings right?


    l0rinc commented at 5:25 PM on November 27, 2025:

    you will have a conflict where you're making Flush abstract and they're making it void

  321. in src/bench/coinsviewcacheasync.cpp:21 in 0f3778fbfb
      16 | +
      17 | +//! Simulates a DB by adding a delay when calling GetCoin
      18 | +struct DelayedCoinsView : CCoinsView {
      19 | +    std::optional<Coin> GetCoin(const COutPoint&) const override
      20 | +    {
      21 | +        UninterruptibleSleep(DELAY);
    


    l0rinc commented at 1:06 PM on November 27, 2025:

    I wonder if we could change this benchmark to use an actual LevelDB-backed view instead of a synthetic delay – and maybe reuse the setup from #32554 for a more realistic measurement?

  322. in src/coinsviewcacheasync.h:208 in fc4346fe05 outdated
     203 | +    explicit CoinsViewCacheAsync(CCoinsViewCache& cache, const CCoinsView& db, bool deterministic = false) noexcept
     204 | +        : CCoinsViewCache{&cache, deterministic}, m_db{db}, m_barrier{WORKER_THREADS + 1}
     205 | +    {
     206 | +        for (uint32_t n{0}; n < WORKER_THREADS; ++n) {
     207 | +            m_worker_threads.emplace_back([this, n] {
     208 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
    


    l0rinc commented at 1:08 PM on November 27, 2025:

    nit: now that this is CoinsViewCacheAsync, maybe rename the thread prefix from "inputfetcher.%i" to something like "coinsviewcacheasync.%i" or similar, to reflect the current design.


    andrewtoth commented at 5:03 AM on November 30, 2025:

    Hmm the thread is just an input fetcher though. That's all it does. I like this name for it.

  323. in src/bench/coinsviewcacheasync.cpp:43 in 0f3778fbfb outdated
      38 | +        async_cache.StartFetching(block);
      39 | +        async_cache.Reset();
      40 | +    });
      41 | +}
      42 | +
      43 | +BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
    


    l0rinc commented at 1:15 PM on November 27, 2025:

    nit: the formatter should enforce a trailing newline at the end of C++ source files; could you add one here?

  324. in src/bench/coinsviewcacheasync.cpp:38 in 0f3778fbfb outdated
      33 | +    DelayedCoinsView db{};
      34 | +    CCoinsViewCache main_cache(&db);
      35 | +    CoinsViewCacheAsync async_cache{main_cache, db};
      36 | +
      37 | +    bench.run([&] {
      38 | +        async_cache.StartFetching(block);
    


    l0rinc commented at 1:16 PM on November 27, 2025:

    What if we added a second run inside the same benchmark that calls StartFetching again or a series of AccessCoin calls - to exercise the “everything is already in the cache” path (similar to a large -dbcache scenario)?


    andrewtoth commented at 4:54 PM on November 27, 2025:

    Ah, yes the Reset now just short circuits it. So this is not a very good benchmark anymore. Will fix to access all coins.

  325. in src/test/fuzz/coinsviewcacheasync.cpp:39 in 0f3778fbfb
      34 | +        cacheCoins.clear();
      35 | +        ReallocateCache();
      36 | +        cachedCoinsUsage = 0;
      37 | +    }
      38 | +
      39 | +    CoinsViewDb() : CCoinsViewCache(nullptr, /*deterministic=*/true) {}
    


    l0rinc commented at 1:18 PM on November 27, 2025:

    Same idea as in the bench: would it be feasible to fuzz this against an actual LevelDB-backed view instead of a synthetic cache, or is that too expensive / complicated for now?

  326. in src/test/fuzz/coinsviewcacheasync.cpp:67 in 0f3778fbfb outdated
      62 | +
      63 | +        std::map<const COutPoint, const Coin> db_map{};
      64 | +        std::map<const COutPoint, const Coin> cache_map{};
      65 | +        std::vector<COutPoint> input_outpoints{};
      66 | +
      67 | +        CCoinsViewCache main_cache(&*g_db, /*deterministic=*/true);
    


    l0rinc commented at 1:18 PM on November 27, 2025:

    Why do we need this cache to be constructed with deterministic=true here? Is the deterministic ordering actually required for the fuzz harness, or could we drop that?

  327. in src/coins.h:410 in 0f3778fbfb outdated
     405 | +     * Retrieves the coin from the cache even if it is spent, without calling
     406 | +     * the backing CCoinsView if no coin exists.
     407 | +     * Used in CoinsViewCacheAsync to make sure we do not add a coin from the backing
     408 | +     * view when it is spent in the cache but not yet flushed to the parent.
     409 | +     */
     410 | +    std::optional<Coin> GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept;
    


    l0rinc commented at 1:20 PM on November 27, 2025:

    As mentioned before, I’m not a fan of adding a separate “get-possibly-spent” helper – it feels like spentness is a separate concern and we’re encoding it into the accessor. I still think long term it’d be cleaner if GetCoin always returned the raw coin and call sites handled spentness explicitly, but I understand that’s probably a separate cleanup.

  328. in src/coinsviewcacheasync.h:47 in 0f3778fbfb outdated
      42 | +{
      43 | +private:
      44 | +    //! The latest input not yet being fetched. Workers atomically increment this when fetching.
      45 | +    mutable std::atomic_uint32_t m_input_head{0};
      46 | +    //! The latest input not yet accessed by a consumer. Only the main thread increments this.
      47 | +    mutable uint32_t m_input_tail{0};
    


    l0rinc commented at 1:21 PM on November 27, 2025:

    Hmm, can the main thread keep track of this locally instead of storing m_input_tail as a member? It looks like we’re effectively simulating a queue here (claiming work from the “head” and consuming from the “tail”). Could we use an actual std::deque or similar, or would that invalidate indices / references that the workers rely on?


    andrewtoth commented at 4:57 PM on November 27, 2025:

    Yes, this is a queue! An MPSC queue to be exact. I don't think it's possible to not store this as an instance member. I don't think we can use a std::deque though because we can't actually mutate the container. The benefit of a deque would be to actually push and pop, but here we set up the container before kicking off the worker threads.
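
    A minimal standalone model of that fixed-vector MPSC scheme (illustrative names, assuming the shape described above): workers claim indices by atomically incrementing a shared head, the single consumer advances its tail in order, and the vector itself is never resized, which is why no deque-style push/pop is needed:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <iostream>
    #include <thread>
    #include <vector>

    // Each element is set up before the workers start; afterwards the
    // container is never mutated, only the two indices move.
    struct Slot {
        int result{0};
        std::atomic_flag ready{}; // set by a worker once the slot is filled
    };

    int main()
    {
        constexpr int N{1000};
        std::vector<Slot> slots(N);
        std::atomic<int> head{0}; // next index not yet claimed by any worker

        auto worker = [&] {
            while (true) {
                const int i{head.fetch_add(1, std::memory_order_relaxed)};
                if (i >= N) return;
                slots[i].result = 2 * i; // simulate fetching a coin
                slots[i].ready.test_and_set(std::memory_order_release);
                slots[i].ready.notify_one();
            }
        };
        std::vector<std::jthread> workers;
        for (int t{0}; t < 4; ++t) workers.emplace_back(worker);

        // Single consumer: walk the slots in order; the acquire pairs with
        // the worker's release so each result is visible once ready is set.
        long sum{0};
        for (int tail{0}; tail < N; ++tail) {
            slots[tail].ready.wait(false, std::memory_order_acquire);
            sum += slots[tail].result;
        }
        std::cout << sum << "\n"; // 2 * (0 + 1 + ... + 999)
        assert(sum == 999000);
        return 0;
    }
    ```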

  329. in src/coinsviewcacheasync.h:70 in 0f3778fbfb outdated
      65 | +    mutable std::vector<InputToFetch> m_inputs{};
      66 | +
      67 | +    /**
      68 | +     * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
      69 | +     * are created and spent in the same block, since they will not be in the db or the cache.
      70 | +     * Using only the first 8 bytes is a performance improvement, versus storing the entire 32 bytes. In case of a
    


    l0rinc commented at 1:23 PM on November 27, 2025:

    Do we have tests that explicitly exercise the 8-byte prefix collision case (i.e. two txids in the same block sharing the same first 64 bits)? If we used 32-bit prefixes instead, collisions would be much more frequent, but the structure would be smaller; did we benchmark 32-bit vs 64-bit prefixes, or is 64-bit a conservative choice?


    andrewtoth commented at 4:59 PM on November 27, 2025:

    did we benchmark 32-bit vs 64-bit prefixes

    I did do microbenchmarks with 32-bit, and it was not more performant. Also, there is no uint256::GetUint32, so we would either just cast the 64 bits to 32 or have to write a new method on uint256. I don't think it's worth exploring this more.
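
    The same-block spend filter discussed here can be sketched like this (an illustrative standalone model, not the PR's code; `TxidBytes` and `Prefix64` are stand-ins): store only the first 8 bytes of each txid in a sorted vector and test membership with `std::binary_search`. A prefix collision only produces a false positive, which merely skips the prefetch for that input; the coin is then fetched by the normal synchronous path:

    ```cpp
    #include <algorithm>
    #include <array>
    #include <cassert>
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    #include <vector>

    using TxidBytes = std::array<uint8_t, 32>; // stand-in for a real txid type

    // First 8 bytes of the txid, in native byte order; enough to filter with
    // a negligible collision rate within a single block.
    static uint64_t Prefix64(const TxidBytes& txid)
    {
        uint64_t x;
        std::memcpy(&x, txid.data(), sizeof(x));
        return x;
    }

    int main()
    {
        TxidBytes a{}; a[0] = 1; // txids of txs in the "block"
        TxidBytes b{}; b[0] = 2;
        TxidBytes c{}; c[0] = 3; // a txid not in the block

        std::vector<uint64_t> prefixes{Prefix64(a), Prefix64(b)};
        std::sort(prefixes.begin(), prefixes.end());

        auto in_block = [&](const TxidBytes& txid) {
            return std::binary_search(prefixes.begin(), prefixes.end(), Prefix64(txid));
        };

        std::cout << in_block(a) << in_block(b) << in_block(c) << "\n";
        assert(in_block(a) && in_block(b) && !in_block(c));
        return 0;
    }
    ```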

  330. in src/coinsviewcacheasync.h:122 in 0f3778fbfb
     117 | +    {
     118 | +        const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
     119 | +        if (inserted) {
     120 | +            for (auto i{m_input_tail}; i < m_inputs.size(); ++i) {
     121 | +                auto& input{m_inputs[i]};
     122 | +                if (input.outpoint != outpoint) continue;
    


    l0rinc commented at 1:28 PM on November 27, 2025:

    I find this iteration quite hard to follow. Could we extract the search into a helper and then act on the found index, something like:

        std::optional<uint32_t> FindInputIndex(const COutPoint& outpoint) const
        {
            for (size_t i{m_input_tail}; i < m_inputs.size(); ++i) {
                if (m_inputs[i].outpoint == outpoint) {
                    return i;
                }
            }
            return std::nullopt;
        }
    
    
        CCoinsMap::iterator FetchCoin(const COutPoint& outpoint) const override
        {
            const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
            if (!inserted) return ret;
    
            if (const auto idx_opt{FindInputIndex(outpoint)}) {
                auto& input{m_inputs[*idx_opt]};
    
                // Wait until this input is ready. Acquire matches the worker's release.
                while (!input.ready.test(std::memory_order_acquire)) {
                    // Try to process other inputs instead of just waiting
                    if (!ProcessInputInBackground()) {
                        // No more work; just wait on this one
                        input.ready.wait(/*old=*/false, std::memory_order_acquire);
                        break;
                    }
                }
    
                if (input.coin) [[likely]]
                    ret->second.coin = std::move(*input.coin);
                m_input_tail = *idx_opt + 1; // We will never need to scan earlier entries again
            }
    
            if (ret->second.coin.IsSpent()) [[unlikely]] {
                // We only get here for BIP30 checks, txid collisions, or missing/spent inputs.
                if (auto coin{FetchCoinFromParent(outpoint)}) {
                    ret->second.coin = std::move(*coin);
                } else {
                    cacheCoins.erase(ret);
                    return cacheCoins.end();
                }
            }
    
            cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
            return ret;
        }
    
  331. in src/coinsviewcacheasync.h:177 in 0f3778fbfb outdated
     172 | +            const auto& tx{block.vtx[i]};
     173 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     174 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     175 | +        }
     176 | +        if (m_inputs.size() == 0) return;
     177 | +        std::ranges::sort(m_txids);
    


    l0rinc commented at 2:46 PM on November 27, 2025:

    It might be worth mentioning (either in a comment here or in the PR description) that benchmarks indicated this sorted std::vector<uint64_t> + binary_search approach is significantly faster than a std::unordered_set<Txid> or std::set<Txid> for the expected hit/miss mix.

  332. in src/coinsviewcacheasync.h:212 in 0f3778fbfb outdated
     207 | +            m_worker_threads.emplace_back([this, n] {
     208 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
     209 | +                while (true) {
     210 | +                    m_barrier.arrive_and_wait();
     211 | +                    if (m_request_stop) [[unlikely]] return;
     212 | +                    while (ProcessInputInBackground()) {}
    


    l0rinc commented at 2:49 PM on November 27, 2025:

    nit: this is very compact, but maybe a bit hard to see, consider:

    for (;;) {
        if (!ProcessInputInBackground()) break;
    }
    

    andrewtoth commented at 5:05 AM on November 30, 2025:

    I prefer the current version.

  333. in src/validation.cpp:3100 in 0f3778fbfb outdated
    3096 | @@ -3095,6 +3097,7 @@ bool Chainstate::ConnectTip(
    3097 |              if (state.IsInvalid())
    3098 |                  InvalidBlockFound(pindexNew, state);
    3099 |              LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    3100 | +            view.Reset();
    


    l0rinc commented at 2:49 PM on November 27, 2025:

    Is Reset() strictly necessary here, given that a failed ConnectBlock will discard the ephemeral view anyway? Could any async state otherwise carry over to the next block?


    andrewtoth commented at 5:01 PM on November 27, 2025:

    We don't actually throw away the ephemeral view anymore. We just reset the state each time. This is a big performance improvement since we don't have to reallocate anything. If we didn't do this we would need an external thread pool as well, since the threads are owned by the CoinsViewCacheAsync. I think it's a fair tradeoff for one extra line here. Perhaps calling it ephemeral_view is a misnomer though. I'm not sure what a better name would be right now.


    l0rinc commented at 5:13 PM on November 27, 2025:

    We don't actually throw away the ephemeral view anymore

    Hah, I missed that in the latest push, thanks. 👍

    calling it ephemeral_view is a misnomer though

    Yeah :)


    andrewtoth commented at 5:04 AM on November 30, 2025:

    Renamed to m_connect_block_view.

  334. in src/coinsviewcacheasync.h:87 in 0f3778fbfb outdated
      82 | +     */
      83 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      84 | +    {
      85 | +        auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)};
      86 | +        if (!coin) coin = m_db.GetCoin(outpoint);
      87 | +        if (coin && !coin->IsSpent()) [[likely]] return coin;
    


    l0rinc commented at 2:51 PM on November 27, 2025:

    I don’t mind the [[likely]] hints (they can help document expectations), but I know others are opposed to using them in our codebase.


    andrewtoth commented at 5:02 PM on November 27, 2025:

    There are some already in the codebase. They were added in C++20, so why not use them? Maybe in future compiler versions the hints will become more useful for optimization. I only use them on paths that a valid block will always or never take (except for BIP30 checks, which we only do at the very beginning). An invalid block with valid proof of work is very rare, so we can optimize for the case where we don't get one; if we do get an invalid block, validation speed matters less.


    l0rinc commented at 5:23 PM on November 27, 2025:

    andrewtoth commented at 5:40 PM on November 27, 2025:

    The first link shows a benefit to using them, and that is now in our codebase. The second and third links reference this blog post which has crazy convoluted usages. I don't find that blog post convincing. The usages here are straightforward and in very hot paths. They are only used in paths where we know a valid block will always or never go into (aside from BIP30).

  335. in src/coinsviewcacheasync.h:176 in 0f3778fbfb
     171 | +        for (size_t i{1}; i < block.vtx.size(); ++i) {
     172 | +            const auto& tx{block.vtx[i]};
     173 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
     174 | +            for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
     175 | +        }
     176 | +        if (m_inputs.size() == 0) return;
    


    l0rinc commented at 2:54 PM on November 27, 2025:

    What does the if (m_inputs.size() == 0) guard against exactly?


    andrewtoth commented at 5:06 PM on November 27, 2025:

    I use this instead of an m_is_fetching or equivalent boolean state. We don't want to exit the barrier if we haven't entered it. We don't enter above if the block has fewer than 2 txs, and we don't enter if it is an invalid block whose 2 or more txs all have zero inputs. After we stop the threads and exit the barrier, we also clear the inputs so we don't exit again.

  336. l0rinc commented at 3:26 PM on November 27, 2025: contributor

    This new design achieves the best results I've seen so far across all platforms measured, excellent job!!

    It evolved from an InputFetcher helper (where the parallelism happened before ConnectBlock was called) to CoinsViewCacheAsync, a multithreaded CCoinsViewCache subclass whose worker threads run in parallel with ConnectBlock itself.

    Internal spends are still filtered using short txid prefixes stored in a sorted std::vector and checked via std::binary_search. In the rare case of a prefix collision, the async fetcher will simply skip that input, and it will be fetched later by the normal synchronous path.

    For now the async view uses a fixed worker thread count of 4. The workload is primarily I/O-bound on DB latency rather than CPU-bound, so 4 workers already hide most of the latency and it simplifies the implementation. If needed we can make this configurable or tie it to -par later.

    This way the I/O-bound work runs in parallel with the CPU-bound validation work, and the preliminary results are very encouraging: on a Raspberry Pi 5 the best -reindex-chainstate so far is about 7.3 hours with -dbcache=4500 and about 7.7 hours with the default 450 MB, roughly 36% and 46% faster than the current single-threaded baseline.

    The new implementation has been fuzzed for several days - it would be good to get some more eyes on it.

    <img width="1500" height="861" alt="Image" src="https://github.com/user-attachments/assets/c631ce84-9610-46ed-8cc2-8e171fd2d0f5" />

    (some of these measurements are suspiciously good, especially the M4 versions (maybe it switches from the energy efficient cores to the performance ones), I will try to replicate the results in followups)

    <details> <summary>Details</summary>

    > M4 @ dbcache=450:
    for commit in dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f; do
      git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
      rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
      time ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 || exit 1
    done
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     20994.00s user 5348.08s system 120% cpu 6:03:05.28 total
    
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=450 -reindex-chainstate     16748.42s user 1830.34s system 276% cpu 1:51:50.94 total
    
    
    > M4 @ dbcache=4500:
    d5ed4ba9d8 Merge bitcoin/bitcoin#33906: depends: Add patch for Windows11Style plugin
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -blocksonly    8895.93s user 751.51s system 133% cpu 2:00:08.82 total
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=4500 -blocksonly    10327.49s user 940.30s system 186% cpu 1:40:57.44 total
    
    > M4 @ dbcache=45000:
    for commit in dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab b1a791db1c75a47569b690baf7b074b78e08ca5a; do
      git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
      rm -rf build && cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF >/dev/null 2>&1 && ninja -C build bitcoind -j$(nproc) >/dev/null 2>&1 && \
      time ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 || exit 1
    done
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   6338.71s user 332.32s system 131% cpu 1:24:22.56 total
    
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    ./build/bin/bitcoind -stopatheight=921129 -dbcache=45000 -reindex-chainstate   7301.13s user 471.62s system 162% cpu 1:19:35.33 total
    
    
    > i9 @ dbcache=450:
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        20659.626 s               [User: 40733.602 s, System: 2871.904 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (abs ≡):        14729.736 s               [User: 39566.674 s, System: 2313.959 s]
    
    Relative speed comparison
            1.40          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > i9  @ dbcache=4500:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        16615.768 s               [User: 25458.915 s, System: 859.662 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        13689.366 s               [User: 26290.581 s, System: 991.037 s]
    
    Relative speed comparison
            1.21          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i9  @ dbcache=45000:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=45000; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 45000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        16118.775 s               [User: 23433.898 s, System: 725.843 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        14429.306 s               [User: 23850.818 s, System: 792.987 s]
    
    Relative speed comparison
            1.12          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i7 @ dbcache=450:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab 32de0ff1a97fc2880ba2f507dd00082727badf3f"; STOP=921129; DBCACHE=450; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    32de0ff1a9 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        42473.571 s               [User: 40584.287 s, System: 3012.074 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
      Time (abs ≡):        34193.205 s               [User: 42326.030 s, System: 2778.267 s]
    
    Relative speed comparison
            1.24          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 32de0ff1a97fc2880ba2f507dd00082727badf3f)
    
    
    > i7 @ dbcache=4500:
    COMMITS="d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac b1a791db1c75a47569b690baf7b074b78e08ca5a"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    d5ed4ba9d8 Merge bitcoin/bitcoin#33906: depends: Add patch for Windows11Style plugin
    b1a791db1c validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        27190.152 s               [User: 33685.961 s, System: 1842.096 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        23873.513 s               [User: 27793.779 s, System: 1036.030 s]
    
    Relative speed comparison
            1.14          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    
    
    > i7 @ dbcache=45000:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; \
    STOP=921129; DBCACHE=45000; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 45000 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (mean ± σ):     22133.846 s ± 42.629 s    [User: 24498.825 s, System: 634.139 s]
      Range (min … max):   22103.703 s … 22163.990 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (mean ± σ):     20547.518 s ±  8.809 s    [User: 25074.730 s, System: 695.076 s]
      Range (min … max):   20541.289 s … 20553.747 s    2 runs
    
    Relative speed comparison
            1.08 ±  0.00  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > rpi5 @ dbcache=450:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; \
    STOP=921129; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 2 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-1 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (mean ± σ):     41084.236 s ± 701.352 s    [User: 68642.573 s, System: 7256.334 s]
      Range (min … max):   40588.305 s … 41580.166 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (mean ± σ):     28200.555 s ± 297.983 s    [User: 66305.959 s, System: 5678.536 s]
      Range (min … max):   27989.850 s … 28411.261 s    2 runs
    
    Relative speed comparison
            1.46 ±  0.03  COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    
    
    > rpi5 @ dbcache=4500:
    COMMITS="dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab e86d48527122c803a58d7bfecffd43a0e373c756"; STOP=921129; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block [#1](/bitcoin-bitcoin/1/)' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    dfde31f2ec Merge bitcoin/bitcoin#33864: scripted-diff: fix leftover references to `policy/fees.h`
    e86d485271 validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 921129 blocks | dbcache 4500 | rpi5-16-1 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
      Time (abs ≡):        35867.389 s               [User: 41695.508 s, System: 3281.868 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
      Time (abs ≡):        26440.662 s               [User: 43495.688 s, System: 3743.419 s]
    
    Relative speed comparison
            1.36          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = dfde31f2ec1f90976f3ba6b06f2b38a1307c01ab)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e86d48527122c803a58d7bfecffd43a0e373c756)
    

    </details>

  337. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >20% faster IBD
    validation: fetch block inputs on parallel threads >40% faster IBD
    on Nov 27, 2025
  338. andrewtoth force-pushed on Nov 30, 2025
  339. andrewtoth force-pushed on Nov 30, 2025
  340. DrahtBot added the label CI failed on Nov 30, 2025
  341. DrahtBot commented at 4:01 AM on November 30, 2025: contributor

    🚧 At least one of the CI tasks failed. <sub>Task ASan + LSan + UBSan + integer: https://github.com/bitcoin/bitcoin/actions/runs/19793395783/job/56710184816</sub> <sub>LLM reason (✨ experimental): Compiler errors: thread-safety checks fail in coinsviewcacheasync.cpp (requires holding cs_main exclusively), causing build to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  342. andrewtoth force-pushed on Nov 30, 2025
  343. DrahtBot removed the label CI failed on Nov 30, 2025
  344. andrewtoth commented at 3:15 PM on November 30, 2025: contributor

    Thank you very much for your review and benchmarking @l0rinc! The speedup this offers is great. I have taken most of your suggestions.

  345. andrewtoth force-pushed on Nov 30, 2025
  346. andrewtoth force-pushed on Nov 30, 2025
  347. DrahtBot added the label CI failed on Nov 30, 2025
  348. DrahtBot commented at 4:16 PM on November 30, 2025: contributor

    🚧 At least one of the CI tasks failed. <sub>Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/19801430839/job/56729280722</sub> <sub>LLM reason (✨ experimental): Linker failure: undefined reference to util::TraceThread causing the bitcoinkernel.dll build to fail.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  349. DrahtBot removed the label CI failed on Nov 30, 2025
  350. andrewtoth force-pushed on Nov 30, 2025
  351. in src/coinsviewcacheasync.h:234 in 69310ec003
     229 | +                }
     230 | +            });
     231 | +        }
     232 | +    }
     233 | +
     234 | +    ~CoinsViewCacheAsync()
    


    l0rinc commented at 6:00 PM on November 30, 2025:

    nit:

        ~CoinsViewCacheAsync() override
    
  352. in src/coinsviewcacheasync.h:227 in 69310ec003
     222 | +            m_worker_threads.emplace_back([this, n] {
     223 | +                util::ThreadRename(strprintf("inputfetcher.%i", n));
     224 | +                while (true) {
     225 | +                    m_barrier.arrive_and_wait();
     226 | +                    while (ProcessInputInBackground()) {}
     227 | +                    if (m_inputs.size() == 0) return;
    


    l0rinc commented at 6:01 PM on November 30, 2025:

    I understand that size() > 0 may be more descriptive than !empty(), but there's a dedicated method for this case (there are a few more cases):

                        if (m_inputs.empty()) return;
    
  353. in src/bench/coinsviewcacheasync.cpp:37 in 69310ec003
      32 | +            coins_tip.EmplaceCoinInternalDANGER(COutPoint{in.prevout}, std::move(coin));
      33 | +        }
      34 | +    }
      35 | +    chainstate.ForceFlushStateToDisk();
      36 | +    const auto& coins_db{WITH_LOCK(testing_setup->m_node.chainman->GetMutex(), return chainstate.CoinsDB();)};
      37 | +    CoinsViewCacheAsync async_cache{coins_tip, coins_db, /*deterministic=*/true};
    


    l0rinc commented at 6:08 PM on November 30, 2025:

    deterministic sounds like something we should do in a benchmark, but it just predefines the salts for testability: https://github.com/bitcoin/bitcoin/blob/9c24cda72edb2085edfa75296d6b42fab34433d9/src/util/hasher.cpp#L22-L25

    As far as I can tell we don't need it here, so we can likely remove that constructor arg completely:

        explicit CoinsViewCacheAsync(CCoinsViewCache& cache, const CCoinsView& db) noexcept
            : CCoinsViewCache{&cache}, m_db{db}, m_barrier{WORKER_THREADS + 1}
    
  354. in src/coinsviewcacheasync.h:121 in 69310ec003
     116 | +        input.ready.notify_one();
     117 | +        return true;
     118 | +    }
     119 | +
     120 | +    //! Get the index in m_inputs for the given outpoint. Advances m_input_tail if found.
     121 | +    std::optional<uint32_t> GetInputIndex(const COutPoint &outpoint) const noexcept
    


    l0rinc commented at 6:18 PM on November 30, 2025:

    Should we maybe document that this assumes that ConnectBlock will fetch the inputs in the same order?

    nit: could we use the current formatting for new code?

  355. in src/test/coinsviewcacheasync_tests.cpp:66 in 69310ec003
      61 | +        return block;
      62 | +    }
      63 | +
      64 | +public:
      65 | +    explicit CoinsViewCacheAsyncTest(const ChainType chainType = ChainType::MAIN,
      66 | +                              TestOpts opts = {})
    


    l0rinc commented at 6:19 PM on November 30, 2025:

    As https://corecheck.dev/bitcoin/bitcoin/pulls/31132 hints, this could also be passed as a const reference.

    nit: formatting


    l0rinc commented at 9:13 AM on December 1, 2025:

    formatting is still off though

  356. in src/coinsviewcacheasync.h:166 in 69310ec003
     161 | +        cachedCoinsUsage += ret->second.coin.DynamicMemoryUsage();
     162 | +        return ret;
     163 | +    }
     164 | +
     165 | +    std::vector<std::thread> m_worker_threads{};
     166 | +    std::barrier<> m_barrier;
    


    l0rinc commented at 6:26 PM on November 30, 2025:

    we should likely init it here instead:

        std::barrier<> m_barrier{WORKER_THREADS + 1};
    
  357. in src/validation.h:489 in 69310ec003 outdated
     485 | @@ -485,6 +486,10 @@ class CoinsViews {
     486 |      //! can fit per the dbcache setting.
     487 |      std::unique_ptr<CCoinsViewCache> m_cacheview GUARDED_BY(cs_main);
     488 |  
     489 | +    //! Used as an empty view that is only passed into ConnectBlock to help speed up block validation,
    


    l0rinc commented at 6:32 PM on November 30, 2025:

    speed up block validation

    Should we mention any parallelism here?


    andrewtoth commented at 10:04 PM on November 30, 2025:

    That's more an implementation detail that can be found by reading the header file, no?

  358. in src/test/coinsviewcacheasync_tests.cpp:33 in 69310ec003
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
      31 | +    std::unique_ptr<CBlock> m_block{nullptr};
      32 | +
      33 | +    CBlock CreateBlock(int32_t num_txs)
    


    l0rinc commented at 6:33 PM on November 30, 2025:

    can we make this static or at least const?

  359. in src/coins.cpp:176 in 765b57d1b1 outdated
     172 | @@ -173,6 +173,12 @@ bool CCoinsViewCache::HaveCoinInCache(const COutPoint &outpoint) const {
     173 |      return (it != cacheCoins.end() && !it->second.coin.IsSpent());
     174 |  }
     175 |  
     176 | +std::optional<Coin> CCoinsViewCache::GetPossiblySpentCoinFromCache(const COutPoint& outpoint) const noexcept
    


    l0rinc commented at 6:40 PM on November 30, 2025:

    nit: the first commit is huge, maybe we could split out the coins.[cpp|h] changes to a commit before it. Not sure what else we could split out, though... Would it help if we split out the single-threaded internal spends case, so that they avoid the cache entirely? Wouldn't that already speed up IBD - in which case it's definitely a separate feature. I also don't mind if we do that in a separate PR to have some progress.


    andrewtoth commented at 10:04 PM on November 30, 2025:

    I would prefer to split out the implementation, the tests, the fuzz harness... But you prefer those all be in the same commit?


    l0rinc commented at 6:40 AM on December 1, 2025:

    I prefer simple but fully functioning chunks that converge towards a feature (as someone illustrated: "skateboard -> bicycle -> scooter -> motorcycle -> car" instead of "left wheels -> right wheels -> doors -> wipers -> radio antenna -> windows -> etc"). So if we can carve out chunks (such as the internal spend + main fallback, or a single-threaded fetcher first), we could guide the reviewer instead of having a big-bang change that's really hard to fully comprehend as such.


    andrewtoth commented at 5:49 PM on December 20, 2025:

    I've split the PR up into multiple commits that build on each other. Please let me know what you think.

  360. in src/coinsviewcacheasync.h:208 in 765b57d1b1 outdated
     203 | +        cacheCoins.clear();
     204 | +        cachedCoinsUsage = 0;
     205 | +        hashBlock = uint256::ZERO;
     206 | +    }
     207 | +
     208 | +    bool Flush() override
    


    l0rinc commented at 6:42 PM on November 30, 2025:

    we may want to explain in a comment why the parent CCoinsViewCache::Flush isn't called here (i.e. to stop propagation to disk)


    l0rinc commented at 8:28 PM on November 30, 2025:

    this whole function never seems to be called from unit tests


    andrewtoth commented at 10:00 PM on November 30, 2025:

    It gets called by functional tests though.


    andrewtoth commented at 10:05 PM on November 30, 2025:

    (i.e. to stop propagation to disk)

    The parent could be called, but this is faster since we skip calling ReallocateCache.


    l0rinc commented at 6:58 AM on December 1, 2025:

    wouldn't that percolate to the database layer?


    andrewtoth commented at 2:24 PM on December 1, 2025:

    Oh, we override so we make sure all threads are stopped before we do the batch write. This is in a comment two lines below.

  361. in src/coinsviewcacheasync.h:63 in 765b57d1b1 outdated
      58 | +        /**
      59 | +         * We only move when m_inputs reallocates during setup.
      60 | +         * We never move after work begins, so we don't have to copy other members.
      61 | +         */
      62 | +        InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
      63 | +        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    


    l0rinc commented at 6:47 PM on November 30, 2025:

    Sonar is complaining that we're violating the Rule of Five here, it's a bit verbose, but maybe we could:

            InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
            explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
            InputToFetch(const InputToFetch&) = delete;
            InputToFetch& operator=(const InputToFetch&) = delete;
            InputToFetch& operator=(InputToFetch&&) = delete;
    

    But it begs the question: why are we even moving these, and why is the "move" only copying partial state? Is it meant to avoid reallocations in StartFetching? But m_inputs is always empty at the start of StartFetching, and we could easily reserve the actual size to avoid moves as far as I understood, something like:

        //! Start fetching all block inputs in parallel.
        void StartFetching(const CBlock& block) noexcept
        {
        const size_t input_count{std::accumulate(block.vtx.begin() + 1, block.vtx.end(), size_t{0}, [](size_t s, const auto& t) { return s + t->vin.size(); })};
            m_inputs.reserve(input_count);
    
            // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
            for (const auto& tx : block.vtx | std::views::drop(1)) {
                for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
                m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
            }
    

    which would allow us to delete most of the other constructors instead:

        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
        InputToFetch(const InputToFetch&) = delete;
        InputToFetch& operator=(const InputToFetch&) = delete;
        InputToFetch(InputToFetch&&) = delete;
        InputToFetch& operator=(InputToFetch&&) = delete;
    

    andrewtoth commented at 10:08 PM on November 30, 2025:

    I'm not sure if we need all these deletes here? What are we gaining from this?

    We need to define the move constructor for it to compile. The compiler can't automatically move the atomic_flag. So, it doesn't really matter if it gets called or not, we still need to define it. And, since it only happens before we start doing work, we might as well keep it simple and not bother moving the other fields. I don't think we need to bother reserving since we keep the capacity over many blocks. The looping is just extra overhead, and it doesn't allow us to remove the move constructor.


    l0rinc commented at 7:00 AM on December 1, 2025:

    I disagree, I think the current move construction is incorrect and if I understand it correctly, we should reserve instead and delete (or ignore) the other constructors.


    l0rinc commented at 2:24 PM on December 1, 2025:

    Looks like this needs more work, it's not as easy as I thought - we could use a std::deque instead of a vector here:

    Subject: [PATCH] std::deque<InputToFetch> m_inputs
    ---
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    --- a/src/coinsviewcacheasync.h	(revision a3f56354d6e3f64eaca84a16e4951e6073090f60)
    +++ b/src/coinsviewcacheasync.h	(revision 462408b897197de3a7067dcbdee318ad9dc1e546)
    @@ -17,6 +17,7 @@
     #include <atomic>
     #include <barrier>
     #include <cstdint>
    +#include <deque>
     #include <optional>
     #include <ranges>
     #include <thread>
    @@ -55,14 +56,9 @@
             //! The coin that workers will fetch and main thread will insert into cache.
             std::optional<Coin> coin{std::nullopt};
     
    -        /**
    -         * We only move when m_inputs reallocates during setup.
    -         * We never move after work begins, so we don't have to copy other members.
    -         */
    -        InputToFetch(InputToFetch&& other) noexcept : outpoint{other.outpoint} {}
             explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
    -    mutable std::vector<InputToFetch> m_inputs{};
    +    mutable std::deque<InputToFetch> m_inputs{};
     
         /**
          * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
    

    But unfortunately the speed difference is measurable (5% slower):

    vector:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,644,638.02 |              608.04 |    0.1% |      1.09 | `CoinsViewCacheAsyncBenchmark`
    

    deque:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,732,962.13 |              577.05 |    0.1% |      1.10 | `CoinsViewCacheAsyncBenchmark`
    

    Maybe we could split out the atomic and have a vector + a deque, something like:

    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    --- a/src/coinsviewcacheasync.h	(revision 462408b897197de3a7067dcbdee318ad9dc1e546)
    +++ b/src/coinsviewcacheasync.h	(date 1764598896036)
    @@ -48,17 +48,17 @@
         mutable uint32_t m_input_tail{0};
     
         //! The inputs of the block which is being fetched.
    -    struct InputToFetch {
    -        //! Workers set this after setting the coin. The main thread tests this before reading the coin.
    -        std::atomic_flag ready{};
    +    struct InputData {
             //! The outpoint of the input to fetch;
             const COutPoint& outpoint;
             //! The coin that workers will fetch and main thread will insert into cache.
             std::optional<Coin> coin{std::nullopt};
     
    -        explicit InputToFetch(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
    +        explicit InputData(const COutPoint& o LIFETIMEBOUND) noexcept : outpoint{o} {}
         };
    -    mutable std::deque<InputToFetch> m_inputs{};
    +    //! Workers set this after setting the coin. The main thread tests this before reading the coin.
    +    mutable std::deque<std::atomic_flag> m_ready_flags{};
    +    mutable std::vector<InputData> m_inputs{};
     
         /**
          * The first 8 bytes of txids of all txs in the block being fetched. This is used to filter out inputs that
    @@ -97,19 +97,18 @@
             const auto i{m_input_head.fetch_add(1, std::memory_order_relaxed)};
             if (i >= m_inputs.size()) [[unlikely]] return false;
     
    -        auto& input{m_inputs[i]};
             // Inputs spending a coin from a tx earlier in the block won't be in the cache or db
    -        if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    +        if (std::ranges::binary_search(m_txids, m_inputs[i].outpoint.hash.ToUint256().GetUint64(0))) {
                 // We can use relaxed ordering here since we don't write the coin.
    -            input.ready.test_and_set(std::memory_order_relaxed);
    -            input.ready.notify_one();
    +            m_ready_flags[i].test_and_set(std::memory_order_relaxed);
    +            m_ready_flags[i].notify_one();
                 return true;
             }
     
    -        if (auto coin{GetCoinWithoutMutating(input.outpoint)}) [[likely]] input.coin.emplace(std::move(*coin));
    +        if (auto coin{GetCoinWithoutMutating(m_inputs[i].outpoint)}) [[likely]] m_inputs[i].coin.emplace(std::move(*coin));
             // We need release here, so writing coin in the line above happens before the main thread acquires.
    -        input.ready.test_and_set(std::memory_order_release);
    -        input.ready.notify_one();
    +        m_ready_flags[i].test_and_set(std::memory_order_release);
    +        m_ready_flags[i].notify_one();
             return true;
         }
     
    @@ -135,17 +134,16 @@
             if (!inserted) return ret;
     
             if (const auto i{GetInputIndex(outpoint)}) [[likely]] {
    -            auto& input{m_inputs[*i]};
                 // Check if the coin is ready to be read. We need to acquire to match the worker thread's release.
    -            while (!input.ready.test(std::memory_order_acquire)) {
    +            while (!m_ready_flags[*i].test(std::memory_order_acquire)) {
                     // Work instead of waiting if the coin is not ready
                     if (!ProcessInputInBackground()) {
                         // No more work, just wait
    -                    input.ready.wait(/*old=*/false, std::memory_order_acquire);
    +                    m_ready_flags[*i].wait(/*old=*/false, std::memory_order_acquire);
                         break;
                     }
                 }
    -            if (input.coin) [[likely]] ret->second.coin = std::move(*input.coin);
    +            if (m_inputs[*i].coin) [[likely]] ret->second.coin = std::move(*m_inputs[*i].coin);
             }
     
             if (ret->second.coin.IsSpent()) [[unlikely]] {
    @@ -180,6 +178,10 @@
         //! Start fetching all block inputs in parallel.
         void StartFetching(const CBlock& block) noexcept
         {
    +        const size_t input_count{std::accumulate(block.vtx.begin() + 1, block.vtx.end(), size_t{0}, [](size_t s, const auto& t) { return s + t->vin.size(); })};
    +        m_ready_flags.resize(input_count);
    +        m_inputs.reserve(input_count);
    +
             // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& input : tx->vin) m_inputs.emplace_back(input.prevout);
    

    (maybe deduplicating the index accesses speeds it back up, not sure) but this would be even slower:

    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,880,070.41 |              531.89 |    0.3% |      1.10 | `CoinsViewCacheAsyncBenchmark`


    andrewtoth commented at 2:37 PM on December 1, 2025:

    I disagree, I think the current move construction is incorrect and if I understand it correctly, we should reserve instead and delete (or ignore) the other constructors.

    I think you are disagreeing with my statement:

    since it only happens before we start doing work, might as well make it simple and not bother moving the other fields

    because the other things I said were facts. Do you think we should also construct new atomic_flags when doing the move construction? As I mentioned, I don't think it's worth it. We can see that the only time we will move is during a reallocation of the vector when capacity is reached. It doesn't matter whether we reserve or not; the compiler cannot deduce that, so it will still require the move constructor. A deque does not need to reallocate and move all its elements when one is added beyond capacity, which is why it is ok to use atomics there without a custom move constructor. But it is obviously much slower than a vector.

  362. in src/test/coinsviewcacheasync_tests.cpp:85 in 765b57d1b1 outdated
      80 | +    for (const auto& tx : block.vtx) {
      81 | +        for (const auto& in : tx->vin) {
      82 | +            auto outpoint{in.prevout};
      83 | +            Coin coin{};
      84 | +            if (!spent) coin.out.nValue = 1;
      85 | +            BOOST_CHECK(spent ? coin.IsSpent() : !coin.IsSpent());
    


    l0rinc commented at 6:53 PM on November 30, 2025:
                BOOST_CHECK_EQUAL(coin.IsSpent(), spent);
    
  363. in src/test/coinsviewcacheasync_tests.cpp:50 in 69310ec003 outdated
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
    


    l0rinc commented at 7:54 PM on November 30, 2025:

    I'm not sure I understand how we're actually testing this. NoAccessCoinsView is designed to abort on access, so in the shortid collision scenario how does it simulate going to disk? Shouldn't we assert here that the number of collisions coincides with the number of simulated disk reads?


    andrewtoth commented at 10:11 PM on November 30, 2025:

    It doesn't simulate going to disk. It simulates not setting the coin in ProcessInputInBackground even though the base has it. Then the if (ret->second.coin.IsSpent()) [[unlikely]] { branch is executed in FetchCoin and the coin is fetched from base via GetCoinWithoutMutating.


    l0rinc commented at 9:02 AM on December 1, 2025:

    isn't GetCoinWithoutMutating meant to simulate going one layer deeper in the cache - which is basically going to disk on the main thread, right?


    andrewtoth commented at 2:48 PM on December 1, 2025:

    NoAccessCoinsView is designed to abort on access

    It is designed to abort on accessing the db via the main cache. We want to access the db only via our m_db ref and not go through the main cache's base pointer. This is unrelated to the test of short txid collisions. For those, we want to successfully go to disk on the main thread, while getting a nullopt from our m_input coin.

  364. in src/test/coinsviewcacheasync_tests.cpp:56 in 69310ec003
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
      53 | +                WriteLE64(u.data(), shorttxid);
      54 | +                txid = Txid::FromUint256(u);
      55 | +            }
      56 | +            tx.vin.emplace_back(COutPoint(txid, 0));
    


    l0rinc commented at 7:59 PM on November 30, 2025:

    nit: emplace doesn't necessarily need the class name

                tx.vin.emplace_back(txid, 0);
    
  365. in src/test/coinsviewcacheasync_tests.cpp:166 in 69310ec003
     161 | +        view.StartFetching(block);
     162 | +        for (const auto& tx : block.vtx) {
     163 | +            for (const auto& in : tx->vin) view.AccessCoin(in.prevout);
     164 | +        }
     165 | +        // Coins are not added to the view, even though they exist unspent in the parent db
     166 | +        BOOST_CHECK(view.GetCacheSize() == 0);
    


    l0rinc commented at 8:07 PM on November 30, 2025:

    any reason not to use:

            BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    

    ? Or just a copy-paste convenience from fuzz :)?


    l0rinc commented at 9:11 AM on December 1, 2025:

    There's one remaining that wasn't migrated:

    BOOST_CHECK(cache.GetCacheSize() == counter);
    
  366. in src/test/coinsviewcacheasync_tests.cpp:52 in 69310ec003
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
    


    l0rinc commented at 8:08 PM on November 30, 2025:

    aren't we already overwriting the first 64 bits in the next step anyway?

  367. in src/test/coinsviewcacheasync_tests.cpp:46 in 69310ec003 outdated
      41 | +
      42 | +        for (const auto i : std::views::iota(1, num_txs)) {
      43 | +            CMutableTransaction tx;
      44 | +            Txid txid;
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
    


    l0rinc commented at 8:11 PM on November 30, 2025:

    isn't this too small for our purposes? Or since there's no randomness involved, we just store at most 100 values on 256 bits? That might work I guess, I would have just gone with Txid::FromUint256(m_rng.rand256());


    andrewtoth commented at 9:53 PM on November 30, 2025:

    isn't this too small for our purposes?

    I don't see why we need any randomness or larger values for these?


    l0rinc commented at 9:10 AM on December 1, 2025:

    some values are tiny now while internal spends (real or fake) have full hashes. It might be easier to work with small values, but internal spends will still result in big ugly full hashes anyway. I'm fine with it as it is.

  368. in src/test/coinsviewcacheasync_tests.cpp:53 in 69310ec003
      48 | +                txid = prevhash;
      49 | +            } else {
      50 | +                // Test shortid collisions
      51 | +                const uint64_t shorttxid{prevhash.ToUint256().GetUint64(0)};
      52 | +                uint256 u(i);
      53 | +                WriteLE64(u.data(), shorttxid);
    


    l0rinc commented at 8:13 PM on November 30, 2025:

    we're not switching platforms during testing, not sure we need LE/BE conversions here:

    uint256 u{m_rng.rand256()};
    std::memcpy(u.begin(), prevhash.ToUint256().begin(), 8);
    txid = Txid::FromUint256(u);
    
  369. in src/test/coinsviewcacheasync_tests.cpp:122 in 69310ec003
     117 | +
     118 | +BOOST_FIXTURE_TEST_CASE(fetch_inputs_from_db, CoinsViewCacheAsyncTest)
     119 | +{
     120 | +    const auto& block{getBlock()};
     121 | +    NoAccessCoinsView dummy;
     122 | +    CCoinsViewCache db(&dummy);
    


    l0rinc commented at 8:16 PM on November 30, 2025:

    nit: brace init may be slightly better here to differentiate it from function calls (many such cases):

        const CCoinsViewCache db{&dummy};
    
  370. in src/test/fuzz/coinsviewcacheasync.cpp:137 in 69310ec003 outdated
     132 | +            } else {
     133 | +                const auto txid{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
     134 | +                const auto index{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
     135 | +                outpoint = COutPoint(txid, index);
     136 | +            }
     137 | +            cache.AccessCoin(outpoint);
    


    l0rinc commented at 8:17 PM on November 30, 2025:

    we could validate the result values of AccessCoin throughout the tests

  371. in src/coinsviewcacheasync.h:183 in 69310ec003 outdated
     178 | +
     179 | +public:
     180 | +    //! Start fetching all block inputs in parallel.
     181 | +    void StartFetching(const CBlock& block) noexcept
     182 | +    {
     183 | +        // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
    


    l0rinc commented at 8:22 PM on November 30, 2025:

    should we assume that the cache is empty here - or can you imagine a scenario where we wouldn't want that? Otherwise tests like:

    for (auto i{0}; i < 3; ++i) {
        view.StartFetching(block);
        CheckCache(block, view);
        view.Reset();
    }
    

    would basically pass (but hang) if we forget to Reset


    andrewtoth commented at 10:21 PM on November 30, 2025:

    would basically pass (but hang) if we forget to Reset

    A hanging test is treated as failure in the CI. I don't think it's necessary to do anything else here.

  372. in src/coinsviewcacheasync.h:135 in 69310ec003 outdated
     130 | +    }
     131 | +
     132 | +    CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const override
     133 | +    {
     134 | +        const auto [ret, inserted] = cacheCoins.try_emplace(outpoint);
     135 | +        if (!inserted) return ret;
    


    l0rinc commented at 8:26 PM on November 30, 2025:

    this early exit doesn't seem to be covered by unit tests in coinsviewcacheasync_tests (same for StopFetching)

  373. in src/coinsviewcacheasync.h:142 in 69310ec003 outdated
     137 | +        if (const auto i{GetInputIndex(outpoint)}) [[likely]] {
     138 | +            auto& input{m_inputs[*i]};
     139 | +            // Check if the coin is ready to be read. We need to acquire to match the worker thread's release.
     140 | +            while (!input.ready.test(std::memory_order_acquire)) {
     141 | +                // Work instead of waiting if the coin is not ready
     142 | +                if (!ProcessInputInBackground()) {
    


    l0rinc commented at 8:27 PM on November 30, 2025:

    this also never seems to be triggered by unit tests in coinsviewcacheasync_tests - could we selectively block other threads to make sure we get here?


    andrewtoth commented at 9:59 PM on November 30, 2025:

    Hmm not sure how we'd test this with a unit test. It surely gets exercised by fuzzing though. Also the function is called by other threads.


    l0rinc commented at 9:08 AM on December 1, 2025:

    Hmm not sure how we'd test this with a unit test

    We could block the thread pool by adding a dummy underlying cache which blocks for the gets and make sure the main thread can still do the fetch on a single thread, when you can unblock the cache and check that we still managed to make progress on a single thread.


    andrewtoth commented at 2:23 PM on December 1, 2025:

    That would be racy. We could have the worker thread count be a variable instead of hard coded, then for a test we could make it zero.


    l0rinc commented at 2:30 PM on December 1, 2025:

    you sure it would be racy?


    andrewtoth commented at 2:42 PM on December 1, 2025:

    If we have a backing cache that blocks, how can we know if it's the main thread or worker threads that need to be blocked? And if we block the main thread by mistake, it will make no progress even though the worker thread can fetch all inputs


    andrewtoth commented at 12:26 AM on December 7, 2025:

    This is tested now by having zero worker threads.
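
    For reference, the claim/ready handoff under discussion can be sketched in isolation: threads claim inputs through a shared atomic counter, publish results with a release store on a per-slot flag, and the main thread steals work while waiting. With zero worker threads, the main thread alone still drains the queue. All names here (`Slot`, `process_one`, `run_demo`) are hypothetical stand-ins, not the PR's code:

    ```cpp
    #include <atomic>
    #include <cassert>
    #include <cstddef>
    #include <optional>
    #include <thread>
    #include <vector>

    // Illustrative per-input state; the real code holds a COutPoint and a Coin.
    struct Slot {
        int prevout{0};            // the input to fetch
        std::optional<int> coin{}; // the fetched result
        std::atomic_flag ready{};  // set with release once coin is written
    };

    bool run_demo()
    {
        std::vector<Slot> inputs(1000);
        for (size_t i{0}; i < inputs.size(); ++i) inputs[i].prevout = static_cast<int>(i);

        // Work claiming: workers and the main thread bump this counter to
        // pick the next unclaimed input.
        std::atomic<size_t> next{0};

        auto process_one = [&]() -> bool {
            const size_t i{next.fetch_add(1, std::memory_order_relaxed)};
            if (i >= inputs.size()) return false;
            inputs[i].coin = inputs[i].prevout * 2; // simulate a db lookup
            inputs[i].ready.test_and_set(std::memory_order_release);
            return true;
        };

        std::thread worker{[&] { while (process_one()) {} }};

        // The main thread consumes in block order; if a slot is not ready it
        // steals work, and once everything is claimed it spins briefly.
        bool ok{true};
        for (auto& in : inputs) {
            while (!in.ready.test(std::memory_order_acquire)) {
                if (!process_one()) std::this_thread::yield();
            }
            ok &= (*in.coin == in.prevout * 2);
        }
        worker.join();
        return ok;
    }

    int main()
    {
        assert(run_demo());
        return 0;
    }
    ```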

  374. in src/test/coinsviewcacheasync_tests.cpp:109 in 69310ec003
     104 | +                if (should_have) {
     105 | +                    cache.AccessCoin(outpoint);
     106 | +                    ++counter;
     107 | +                }
     108 | +                const auto have{cache.GetPossiblySpentCoinFromCache(outpoint)};
     109 | +                BOOST_CHECK(should_have ? !!have : !have);
    


    l0rinc commented at 8:29 PM on November 30, 2025:

    We could also use a less brancy equality check here

                    BOOST_CHECK_EQUAL(should_have, !!have);
    

    or

                    BOOST_CHECK_NE(should_have, !have);
    

    (but the latter might be too much for some :p)

  375. in src/bench/coinsviewcacheasync.cpp:51 in 69310ec003 outdated
      46 | +        }
      47 | +        async_cache.Reset();
      48 | +    });
      49 | +}
      50 | +
      51 | +BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
    


    l0rinc commented at 8:35 PM on November 30, 2025:

    could you please add new lines to the end of the files? I also don't like them, but it seems to be necessary in certain cases...


    andrewtoth commented at 9:45 PM on November 30, 2025:

    If there is a missing new line github will show a red arrow at the end of the file.


    l0rinc commented at 8:53 AM on December 1, 2025:

    You're right, my mistake

  376. in src/bench/coinsviewcacheasync.cpp:41 in 69310ec003 outdated
      36 | +    const auto& coins_db{WITH_LOCK(testing_setup->m_node.chainman->GetMutex(), return chainstate.CoinsDB();)};
      37 | +    CoinsViewCacheAsync async_cache{coins_tip, coins_db, /*deterministic=*/true};
      38 | +
      39 | +    bench.run([&] {
      40 | +        async_cache.StartFetching(block);
      41 | +        for (const auto& tx : block.vtx | std::views::drop(1)) {
    


    l0rinc commented at 8:37 PM on November 30, 2025:

    I like these functional constructs, they may take some getting used to, but they have a lot fewer moving parts which help with separating iteration (= glue code) from important parts!


    Note: could you publish the benchmark results in the commit messages before/after and can you reproduce 4 threads saturating the parallelism factor with it? (I don't have any available benchmarking servers at the moment...)


    andrewtoth commented at 10:01 PM on November 30, 2025:

    before/after

    Not sure there are any before results we can use for this.

    can you reproduce 4 threads saturating the parallelism factor with it?

    How would I know if I do that?


    l0rinc commented at 6:44 AM on December 1, 2025:

    any before results

    I just noticed the benchmark only tests the new state - what if the benchmark originally measured the current behavior, and in the commit that switches to multithreaded connection we update the benchmark to reflect that (if needed; maybe the same benchmark can even switch automatically if the original CoinsViewCacheAsync implementation reimplements everything on a single thread at first).

    In that case we could have something like (fake numbers):

    bench: add CoinsViewCacheAsync benchmark

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,764,037.75 | 566.88 | 0.2% | 11.02 | CoinsViewCacheAsyncBenchmark

    validation: fetch block inputs via CCoinsViewCacheAsync during connection

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,304,607.17 | 766.51 | 0.7% | 10.60 | CoinsViewCacheAsyncBenchmark


    can you reproduce 4 threads saturating the parallelism factor with it?

    Changing:

    static constexpr uint32_t WORKER_THREADS{1};
    

    which gives me

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,294,995.86 | 772.20 | 1.4% | 10.61 | CoinsViewCacheAsyncBenchmark

    and 2:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,237,416.55 | 808.14 | 1.9% | 10.83 | CoinsViewCacheAsyncBenchmark

    and 3:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,321,210.42 | 756.88 | 1.3% | 10.84 | CoinsViewCacheAsyncBenchmark

    and for 4:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 1,788,112.75 | 559.25 | 0.3% | 10.91 | CoinsViewCacheAsyncBenchmark

    and 5:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 2,126,524.56 | 470.25 | 2.0% | 9.88 | CoinsViewCacheAsyncBenchmark

    and 6:

    | ns/op | op/s | err% | total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    | 3,053,001.11 | 327.55 | 3.3% | 10.57 | CoinsViewCacheAsyncBenchmark

    So it kinda' reproduces that it doesn't make sense to do more than 4


    andrewtoth commented at 2:41 PM on December 1, 2025:

    There is a ConnectBlock benchmark already. But, it only does internal spends so it won't be that great for this. I can maybe extend it to also work on a block with inputs from leveldb. WDYT?


    andrewtoth commented at 4:12 PM on December 1, 2025:

    So it kinda' reproduces that it doesn't make sense to do more than 4

    Looks like it doesn't make sense to do more than 2?


    l0rinc commented at 4:24 PM on December 1, 2025:

    maybe, but leveldb is basically empty, we shouldn't take it too seriously


    andrewtoth commented at 4:42 PM on December 28, 2025:

    Added some benchmark results in the commits where it makes sense.

  377. in src/validation.h:490 in 69310ec003 outdated
     485 | @@ -485,6 +486,10 @@ class CoinsViews {
     486 |      //! can fit per the dbcache setting.
     487 |      std::unique_ptr<CCoinsViewCache> m_cacheview GUARDED_BY(cs_main);
     488 |  
     489 | +    //! Used as an empty view that is only passed into ConnectBlock to help speed up block validation,
     490 | +    //! as well as not pollute the underlying cache with newly created coins in case the block is invalid.
    


    l0rinc commented at 8:48 PM on November 30, 2025:

    Do we have a specific test case that verifies cache isolation in a failure scenario?


    andrewtoth commented at 9:44 PM on November 30, 2025:

    I think some invalid block tests exercise this. If a block is invalid then the outputs of any txs are not added to the utxo set.


    l0rinc commented at 9:04 AM on December 1, 2025:

    I think some invalid block tests exercise this. If a block is invalid then the outputs of any txs are not added to the utxo set.

    Are those tests realistic and using the new cache or simulating it somehow? My question is: is the new functionality covered for this case, when a block has e.g. a double-spend to make sure it's not propagated to the main cache?


    andrewtoth commented at 2:57 PM on December 1, 2025:

    when a block has e.g. a double-spend to make sure it's not propagated to the main cache

    That would be a necessary condition of a double-spending block test. If the double spend is propagated to the main cache it is part of the utxo set and the test is a failure. The main cache is treated as the source of truth.
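
    The isolation property being described, that overlay writes stay invisible to the main cache unless explicitly flushed, can be illustrated with a generic overlay map (this is not the CCoinsViewCache API; `Overlay` and `run_demo` are made up for the sketch):

    ```cpp
    #include <cassert>
    #include <map>
    #include <optional>

    // Sketch: entries added to the overlay only reach the base on an explicit
    // Flush(). Dropping the overlay, as validation does for an invalid block,
    // leaves the base (the main cache) untouched.
    struct Overlay {
        std::map<int, int>& base;
        std::map<int, int> local;
        void Add(int k, int v) { local[k] = v; }
        std::optional<int> Get(int k) const {
            if (auto it = local.find(k); it != local.end()) return it->second;
            if (auto it = base.find(k); it != base.end()) return it->second;
            return std::nullopt;
        }
        void Flush() { for (auto& [k, v] : local) base[k] = v; local.clear(); }
    };

    bool run_demo()
    {
        std::map<int, int> base{{1, 10}};
        {
            Overlay bad{base};
            bad.Add(2, 20); // outputs of an invalid block
            // overlay discarded without Flush()
        }
        if (base.contains(2)) return false; // never reached the main cache

        Overlay good{base};
        good.Add(3, 30);
        good.Flush(); // valid block: writes propagate
        return base.contains(3);
    }

    int main()
    {
        assert(run_demo());
        return 0;
    }
    ```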

  378. l0rinc approved
  379. l0rinc commented at 9:15 PM on November 30, 2025: contributor

    The new version cleverly uses m_inputs being empty as the shared shutdown signal (handling both 'no work' and 'shutdown' cases). This finally allowed us to eliminate the m_request_stop flag I disliked :).

    It also benchmarks real-world I/O latency via a real LevelDB access, while the fuzz tests use in-memory LevelDB now - sweet! The new design basically falls back to synchronous fetching gracefully in cases of collisions or delays (we may want to test that specifically).

    Regarding 'faster IBD performance', I think it would be more accurate to call it 'validation performance', as this benefits block connection generally, not just during initial download (which we can't reliably measure anyway).

    Should we also add some log message announcing that input fetching is running on parallel threads? We should definitely add release notes for this feature once it stabilizes - and I like this latest push a lot!

  380. andrewtoth force-pushed on Nov 30, 2025
  381. in src/test/coinsviewcacheasync_tests.cpp:106 in 236e7a4374 outdated
     101 | +                const auto& outpoint{in.prevout};
     102 | +                const auto should_have{!txids.contains(outpoint.hash)};
     103 | +                if (should_have) {
     104 | +                    cache.AccessCoin(outpoint);
     105 | +                    ++counter;
     106 | +                }
    


    l0rinc commented at 7:29 AM on December 1, 2025:

    ConnectBlock calls AccessCoin for every input, shouldn't the test do the same? Aren't we cheating by only calling it when we already know it's not an internal spend?

        for (const auto& tx : block.vtx) {
            if (tx->IsCoinBase()) {
                BOOST_CHECK(!cache.GetPossiblySpentCoinFromCache(tx->vin[0].prevout));
            } else {
                for (const auto& outpoint : tx->vin | std::views::transform(&CTxIn::prevout)) {
                    const auto external{!txids.contains(outpoint.hash)};
                    const auto& c{cache.AccessCoin(outpoint)};
                    BOOST_CHECK_EQUAL(c.IsSpent(), !external);
    
                    counter += external;
                    const bool in_cache{!!cache.GetPossiblySpentCoinFromCache(outpoint)};
                    BOOST_CHECK_EQUAL(external, in_cache);
                }
                txids.emplace(tx->GetHash());
            }
        }
        BOOST_CHECK_EQUAL(cache.GetCacheSize(), counter);
    

    andrewtoth commented at 3:29 PM on December 1, 2025:

    But ConnectBlock also inserts the newly created utxos into the cache, so that the next call to AccessCoin will just get it from the cache's cacheCoins map.

  382. in src/test/coinsviewcacheasync_tests.cpp:88 in 236e7a4374 outdated
      83 | +            if (!spent) coin.out.nValue = 1;
      84 | +            BOOST_CHECK_EQUAL(coin.IsSpent(), spent);
      85 | +            cache.EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
      86 | +        }
      87 | +    }
      88 | +}
    


    l0rinc commented at 7:50 AM on December 1, 2025:

    why are coinbases and internal spends added to the cache? That's not what happens in reality, right? It should represent the state before the block is connected, and it should populate the backing and db caches as well, so maybe something like:

    void PopulateCache(const CBlock& block, CCoinsView& view, bool spent = false)
    {
        CCoinsViewCache cache{&view};
        cache.SetBestBlock(uint256::ONE);
    
        std::unordered_set<Txid, SaltedTxidHasher> txids{};
        txids.reserve(block.vtx.size() - 1);
        for (const auto& tx : block.vtx | std::views::drop(1)) {
            for (const auto& in : tx->vin) {
                if (!txids.contains(in.prevout.hash)) {
                    Coin coin{};
                    if (!spent) coin.out.nValue = 1;
                    cache.EmplaceCoinInternalDANGER(COutPoint{in.prevout}, std::move(coin));
                }
            }
            txids.emplace(tx->GetHash());
        }
    
        cache.Flush();
    }
    
  383. in src/test/coinsviewcacheasync_tests.cpp:25 in 236e7a4374 outdated
      20 | +#include <unordered_set>
      21 | +
      22 | +BOOST_AUTO_TEST_SUITE(coinsviewcacheasync_tests)
      23 | +
      24 | +struct NoAccessCoinsView : CCoinsView {
      25 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
    


    l0rinc commented at 8:15 AM on December 1, 2025:

    instead of std::abort we should just return an std::nullopt, it's closer to what the database would do - especially since GetPossiblySpentCoinFromCache is noexcept. Or even better, what if we also used an in-memory leveldb here instead and populated a CCoinsView& view in PopulateCache instead (see below)?


    andrewtoth commented at 2:44 PM on December 1, 2025:

    We want to specifically test that we do not access the main cache's backing view, by e.g. calling GetCoin on it. If we return a nullopt here then it would correctly go to the db layer (while then mutating base non-atomically and causing UB), while we want to make sure we blow up here because it is a failed test.

  384. in src/test/coinsviewcacheasync_tests.cpp:33 in 236e7a4374 outdated
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
      31 | +    std::unique_ptr<CBlock> m_block{nullptr};
      32 | +
      33 | +    CBlock CreateBlock(int32_t num_txs) const noexcept
    


    l0rinc commented at 8:41 AM on December 1, 2025:

    we're only using 100 here, we might as well:

        static constexpr auto num_txs{100};
        CBlock CreateBlock() const noexcept
    
  385. in src/test/coinsviewcacheasync_tests.cpp:30 in 236e7a4374 outdated
      25 | +    std::optional<Coin> GetCoin(const COutPoint&) const override { abort(); }
      26 | +};
      27 | +
      28 | +struct CoinsViewCacheAsyncTest : BasicTestingSetup {
      29 | +private:
      30 | +    std::unique_ptr<CoinsViewCacheAsync> m_async_cache{nullptr};
    


    l0rinc commented at 8:42 AM on December 1, 2025:

    m_async_cache seems unused

  386. in src/test/coinsviewcacheasync_tests.cpp:48 in 236e7a4374 outdated
      43 | +            CMutableTransaction tx;
      44 | +            Txid txid;
      45 | +            if (i % 3 == 0) {
      46 | +                txid = Txid::FromUint256(uint256(i));
      47 | +            } else if (i % 3 == 1) {
      48 | +                txid = prevhash;
    


    l0rinc commented at 8:42 AM on December 1, 2025:

    we could add comments for these cases as well:

                if (i % 3 == 0) {
                    // External input
                    txid = Txid::FromUint256(uint256(i));
                } else if (i % 3 == 1) {
                    // Internal spend (prev tx)
                    txid = prevhash;
                } else {
                    // Test shortid collisions (looks internal, but is external)
    
  387. in src/test/coinsviewcacheasync_tests.cpp:175 in 236e7a4374 outdated
     170 | +BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, CoinsViewCacheAsyncTest)
     171 | +{
     172 | +    const auto& block{getBlock()};
     173 | +    CCoinsView db;
     174 | +    CCoinsViewCache main_cache(&db);
     175 | +    CoinsViewCacheAsync view{main_cache, db};
    


    l0rinc commented at 8:47 AM on December 1, 2025:

    we could also add an in-memory leveldb here to simplify the code and make it more realistic:

        CCoinsViewDB db{{.path = "", .cache_bytes = 1_MiB, .memory_only = true}, {}};
        CCoinsViewCache main_cache{&db};
        CoinsViewCacheAsync view{main_cache, db};
    
  388. in src/test/coinsviewcacheasync_tests.cpp:179 in 236e7a4374 outdated
     174 | +    CCoinsViewCache main_cache(&db);
     175 | +    CoinsViewCacheAsync view{main_cache, db};
     176 | +    for (auto i{0}; i < 3; ++i) {
     177 | +        view.StartFetching(block);
     178 | +        for (const auto& tx : block.vtx) {
     179 | +            for (const auto& in : tx->vin) view.AccessCoin(in.prevout);
    


    l0rinc commented at 8:48 AM on December 1, 2025:

    we should do something with the result:

                for (const auto& in : tx->vin) {
                    const auto& c{view.AccessCoin(in.prevout)};
                    BOOST_CHECK(c.IsSpent());
                }
    
  389. l0rinc commented at 9:43 AM on December 1, 2025: contributor

    Redid the measurements on a Mac with AppleClang for different sizes to check why there's such a massive speedup for low memory:

    for DBCACHE in 450 4500 45000; do \
        COMMITS="d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac b1a791db1c75a47569b690baf7b074b78e08ca5a"; \
        STOP=921129; \
        DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
        mkdir -p "$LOG_DIR"; \
        COMMA_COMMITS=${COMMITS// /,}; \
        (echo ""; echo "$COMMITS" | tr ' ' '\n' | while read -r c; do git log -1 --pretty='%h %s' -- "$c" || exit 1; done;) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$LOG_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-appleclang.json" \
          --parameter-list COMMIT "$COMMA_COMMITS" \
          --prepare "killall bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'height=0' \"$DATA_DIR\"/debug.log && grep -q 'Disabling script verification at block #1' \"$DATA_DIR\"/debug.log && grep -q \"height=$STOP\" \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
                      cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/debug-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
          "./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    

    dbcache 450

    reindex-chainstate | 921129 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        26759.295 s               [User: 29786.899 s, System: 7379.722 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        8826.595 s               [User: 23102.926 s, System: 2391.832 s]
    
    Relative speed comparison
            3.03          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    dbcache 4500

    reindex-chainstate | 921129 blocks | dbcache 4500 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        12563.690 s               [User: 15217.346 s, System: 1087.166 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        7786.335 s               [User: 14306.318 s, System: 1220.685 s]
    
    Relative speed comparison
            1.61          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    dbcache 45000

    reindex-chainstate | 921129 blocks | dbcache 45000 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
      Time (abs ≡):        5256.592 s               [User: 6551.334 s, System: 337.214 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
      Time (abs ≡):        4727.896 s               [User: 7191.973 s, System: 467.989 s]
    
    Relative speed comparison
            1.11          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = d5ed4ba9d8627f1897322ce7eb5b34e08e4f73ac)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=921129 -dbcache=45000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b1a791db1c75a47569b690baf7b074b78e08ca5a)
    

    The huge difference may come from the multithreaded solution using the performance cores more heavily:

    Master: <img width="869" height="429" alt="Image" src="https://github.com/user-attachments/assets/caa6ddaf-51c8-45cd-835a-232d2dc943af" />

    PR: <img width="868" height="426" alt="Image" src="https://github.com/user-attachments/assets/5a410b64-b460-423e-b7f1-133bd266d02f" />

    @andrewtoth can you reproduce the low-memory speedup results?

  390. andrewtoth force-pushed on Dec 1, 2025
  391. DrahtBot added the label CI failed on Dec 1, 2025
  392. DrahtBot commented at 7:26 PM on December 1, 2025: contributor


    🚧 At least one of the CI tasks failed.

    Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/19833312863/job/56824503489

    LLM reason (✨ experimental): Narrowing conversion in CoinsViewCacheAsync constructor triggers -Werror, causing the build to fail.

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  393. andrewtoth force-pushed on Dec 1, 2025
  394. DrahtBot removed the label CI failed on Dec 1, 2025
  395. andrewtoth commented at 3:12 PM on December 5, 2025: contributor

    I've generated some flamegraphs from perf data recorded during IBD for ~10k blocks between 850-900k for both master and this branch.

    The 4 worker threads can be seen clearly on the left of the graph. The binary search through short txids is barely noticeable, which confirms our approach to not use an unordered_set + salted hasher.

    Looking at the b-msghand (main) thread, we see ConnectBlock dominating it on master, while block serialization and tx hash computation become the dominant factors on this branch. The main thread calling ProcessInputInBackground on this branch indicates that work stealing is working correctly, and that there might be some speedup if the worker thread count were increased.

    I see these don't work well in the browser when hosted on github. If you right-click and download them, then open them in a browser, they are easier to inspect.

    perf_master perf_branch

  396. l0rinc commented at 4:58 PM on December 5, 2025: contributor

    The flames look impressive; my differential flames for all 900k blocks should also finish in a few days.

    Parallelism vs speedup on different platforms

    <img width="1268" height="910" alt="image" src="https://github.com/user-attachments/assets/b9a7538d-0e28-46bf-b1cc-6861cc459bd8" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
    mkdir -p "$LOG_DIR"; \
    COMMA_COMMITS=${COMMITS// /,}; \
    (echo ""; for c in $(echo $COMMITS); do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$LOG_DIR/rdx-$(echo "$COMMITS" | sed -E 's/([a-f0-9]{8})[a-f0-9]+ ?/\1-/g;s/-$//')-$STOP-$DBCACHE-appleclang.json" \
      --parameter-list COMMIT "$COMMA_COMMITS" \
      --prepare "killall -9 bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'height=0' \"$DATA_DIR\"/debug.log && grep -q 'Disabling script verification at block #1' \"$DATA_DIR\"/debug.log && grep -q \"height=$STOP\" \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
                  cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/debug-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
      "./build/bin/bitcoind -datadir=\"$DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        4188.146 s               [User: 7387.420 s, System: 932.198 s]
    
    Benchmark 2: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        3683.049 s               [User: 7346.612 s, System: 833.443 s]
    
    Benchmark 3: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        3483.915 s               [User: 7427.734 s, System: 815.722 s]
    
    Benchmark 4: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        3349.891 s               [User: 7531.310 s, System: 839.720 s]
    
    Benchmark 5: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        3402.258 s               [User: 7836.059 s, System: 1139.180 s]
    
    Benchmark 6: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        3399.448 s               [User: 8072.648 s, System: 1508.136 s]
    
    Benchmark 7: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        3404.973 s               [User: 8226.177 s, System: 1889.810 s]
    
    Benchmark 8: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        3398.617 s               [User: 8358.164 s, System: 2256.116 s]
    
    Relative speed comparison
            1.25          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.10          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.04          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.02          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.01          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.02          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.01          ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    <img width="1257" height="910" alt="image" src="https://github.com/user-attachments/assets/ff770b73-4ad5-4477-9f51-3d9f267d50d1" />

    <details> <summary>reindex-chainstate | 923319 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=923319; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 923319 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        17079.999 s               [User: 41791.411 s, System: 2436.225 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        15860.805 s               [User: 41236.804 s, System: 2345.765 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        15450.583 s               [User: 41354.937 s, System: 2364.782 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        15319.447 s               [User: 41812.192 s, System: 2357.605 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        15266.043 s               [User: 42357.261 s, System: 2514.361 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        15206.688 s               [User: 42723.482 s, System: 2511.710 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        15241.462 s               [User: 43303.245 s, System: 2591.484 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        15234.312 s               [User: 43964.600 s, System: 2819.547 s]
    
    Relative speed comparison
            1.12          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=923319 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    <img width="1273" height="911" alt="image" src="https://github.com/user-attachments/assets/2a05ed7f-fdc3-4373-9853-a04c18faf5fb" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    COMMITS="8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        18390.604 s               [User: 16855.672 s, System: 1349.628 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        17781.743 s               [User: 17041.495 s, System: 1368.022 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        17337.544 s               [User: 17490.866 s, System: 1413.850 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        17775.681 s               [User: 17832.946 s, System: 1436.415 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        17699.749 s               [User: 18151.899 s, System: 1476.080 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        17115.228 s               [User: 18467.571 s, System: 1506.336 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        17546.582 s               [User: 18532.849 s, System: 1551.217 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        17452.997 s               [User: 18550.132 s, System: 1546.017 s]
    
    Relative speed comparison
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    Edit:

    <img width="1272" height="912" alt="image" src="https://github.com/user-attachments/assets/29e79143-7ed0-4a32-9fca-3dae4548fd08" />

    <details> <summary>reindex-chainstate | 700000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD</summary>

    COMMITS="9a29b2d331eed5b4cbd6922f63e397b68ff12447 8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=700000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    9a29b2d331 Merge bitcoin/bitcoin#33857: doc: Add `x86_64-w64-mingw32ucrt` triplet to `depends/README.md`
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 700000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
      Time (abs ≡):        17037.553 s               [User: 26114.648 s, System: 2505.015 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        13967.390 s               [User: 27084.842 s, System: 2533.624 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        13030.059 s               [User: 27638.137 s, System: 2473.673 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        13077.949 s               [User: 27739.880 s, System: 2496.343 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        13051.649 s               [User: 27609.668 s, System: 2538.616 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        13287.758 s               [User: 27771.809 s, System: 2615.043 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        13308.250 s               [User: 27744.112 s, System: 2646.436 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        13436.808 s               [User: 27789.127 s, System: 2709.751 s]
    
    Benchmark 9: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        13430.790 s               [User: 27676.672 s, System: 2727.739 s]
    
    Relative speed comparison
            1.31          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
            1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.02          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.03          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    Edit2:

    <img width="1264" height="915" alt="image" src="https://github.com/user-attachments/assets/9f0bbaea-e742-48af-9ef6-73ce79bfacdc" />

    <details> <summary>reindex-chainstate | 600000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD</summary>

    COMMITS="9a29b2d331eed5b4cbd6922f63e397b68ff12447 8744e5a03e84eb407a861cd36fc30c2c5367169a 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f db9ec4d4e74a6286b3de713b47398013837c7749 e4bb647a1614bd9e6718f80a83d9fe998eb48f5f 36613ec98299411950520dd6361a96786607ed08 82cd3e294f3100d8f705d63135508e018efcb80f 114fef0f348b9a4d76b826585fd737886c87a6f1 ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd"; \
    STOP=600000; DBCACHE=450; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
    hyperfine \
      --sort command \
      --runs 1 \
      --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
      --parameter-list COMMIT ${COMMITS// /,} \
      --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
        ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
  --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                  cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
      "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    
    9a29b2d331 Merge bitcoin/bitcoin#33857: doc: Add `x86_64-w64-mingw32ucrt` triplet to `depends/README.md`
    8744e5a03e WORKER_THREADS{1}
    042dbfdc3f WORKER_THREADS{2}
    db9ec4d4e7 WORKER_THREADS{3}
    e4bb647a16 WORKER_THREADS{4}
    36613ec982 WORKER_THREADS{5}
    82cd3e294f WORKER_THREADS{6}
    114fef0f34 WORKER_THREADS{7}
    ae1589d7f9 WORKER_THREADS{8}
    
    reindex-chainstate | 600000 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
      Time (abs ≡):        28988.956 s               [User: 36830.542 s, System: 6768.571 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
      Time (abs ≡):        23269.261 s               [User: 38584.449 s, System: 7103.664 s]
    
    Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
      Time (abs ≡):        21678.210 s               [User: 39240.948 s, System: 7259.279 s]
    
    Benchmark 4: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
      Time (abs ≡):        21506.209 s               [User: 39363.857 s, System: 7569.643 s]
    
    Benchmark 5: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
      Time (abs ≡):        21428.512 s               [User: 39616.150 s, System: 7698.857 s]
    
    Benchmark 6: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
      Time (abs ≡):        21392.758 s               [User: 39653.354 s, System: 8054.084 s]
    
    Benchmark 7: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
      Time (abs ≡):        21395.545 s               [User: 39365.692 s, System: 8235.890 s]
    
    Benchmark 8: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
      Time (abs ≡):        21449.737 s               [User: 39321.387 s, System: 8314.124 s]
    
    Benchmark 9: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
      Time (abs ≡):        21505.684 s               [User: 39558.723 s, System: 8648.682 s]
    
    Relative speed comparison
            1.36          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 9a29b2d331eed5b4cbd6922f63e397b68ff12447)
            1.09          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 8744e5a03e84eb407a861cd36fc30c2c5367169a)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 042dbfdc3f2c2ea5f04dfa91ac8785a42d493c2f)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = db9ec4d4e74a6286b3de713b47398013837c7749)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = e4bb647a1614bd9e6718f80a83d9fe998eb48f5f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 36613ec98299411950520dd6361a96786607ed08)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 82cd3e294f3100d8f705d63135508e018efcb80f)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 114fef0f348b9a4d76b826585fd737886c87a6f1)
            1.01          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=600000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ae1589d7f96ae2e4bdaa86bf16a0006e38b093bd)
    

    </details>

    My conclusion (given that threads aren't free) is that 4 threads should suffice for now!

    Memory usage

    I've also measured the peak memory usage during reindex-chainstate. Unsurprisingly, the additional threads (4 threads * 8 MB stack = ~32 MB on glibc) and the reused internal caches (m_inputs.clear() does not free the allocated capacity, so the peak block input size will be held for the lifetime of the node) result in a measurable memory overhead (>100 MB peak extra). Once we're close to the finish line, we can discuss whether we should do anything about it (e.g., lower the default dbcache size, try to reduce memory usage in other ways to compensate, or just document the increase).
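
    As a standalone C++ sketch of the reused-capacity effect (illustrative only, not the PR's code): std::vector::clear() drops the elements but keeps the allocation, so a buffer sized for the largest block seen stays that large until explicitly released:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <utility>
    #include <vector>

    // Capacity retained after clear() vs. after swap-with-empty, for a
    // vector that once held n elements.
    static std::pair<std::size_t, std::size_t> capacities(std::size_t n)
    {
        std::vector<int> inputs(n);
        inputs.clear();                  // size -> 0, allocation kept
        const std::size_t after_clear = inputs.capacity();
        std::vector<int>{}.swap(inputs); // actually releases the allocation
        return {after_clear, inputs.capacity()};
    }

    int main()
    {
        const auto [after_clear, after_swap] = capacities(30'000);
        assert(after_clear >= 30'000); // clear() kept the peak-sized buffer
        assert(after_swap == 0);       // swap-with-empty freed it
    }
    ```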

    The peak memory of v30 was 744.4 MB (measured via assumeutxo):

        MB
    744.4^                            :
         |# :::: :  ::: ::  :::::  ::@:::::   ::::::::@@
         |# : :: :  ::: ::  : ::   ::@:: ::   ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: ::   ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :: ::::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :: : ::: ::: @
         |#:: :: :  ::: ::  : ::   ::@:: :::: ::: ::: @
         |#:: :: :::::: ::  : ::   ::@:: :::: ::: ::: @
         |#:: ::::: ::: ::::: :: ::::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ ::::::::@@::@::::::::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
         |#:: ::::: ::::::: : :: : ::@:: :::: ::: ::: @ : : : : @ ::@:::: :::::@::
       0 +----------------------------------------------------------------------->h
         0                                                                   1.603
    

    The peak memory of master is 819.6 MB (measured via assumeutxo; see #31645 (comment)):

        MB
    819.6^      ##
         | :    #       :      :  :           :
         | :  ::# :::@@:::::::::  :  @:::::: :::::::
         | :  : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :: : # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: ::  :  @:: ::  :::: :
         | :::: # :: @ ::: :: :::::  @:: ::  :::: :
         | :::: # :: @ ::: :: ::: :::@:: ::  :::: :
         | :::: # :: @ ::: :: ::: :: @:: ::  :::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: :
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : :::::::::::@::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
         | :::: # :: @ ::: :: ::: :: @:: :: ::::: : : : :::::: @::::@::::@:::::::@
       0 +----------------------------------------------------------------------->h
         0                                                                   1.484
    

    and the peak memory after this PR is 944.1 MB (measured via reindex-chainstate; we can't measure it via assumeutxo):

        MB
    944.1^                                                         #              
         |                       :@                    @           #              
         |                       :@                    @           #              
         |                       :@                    @      :    #              
         |     :::               :@     :         @    @   :  :  @ #        :     
         |     : :            :  :@     :    @    @    @ :::  :: @ #:       :     
         |     : :         :: : ::@     :   :@   :@    @ :::  :: @ #:: :    :     
         |  :  : :      :  : :: ::@:: : : :::@   :@   @@::::  :: @ #:: :    :     
         |  :  : :::    : :: :: ::@:  :@: : :@ : :@  :@@::::  :::@ #:: ::   : :   
         |  :  : ::::   : :: :::::@:  :@: : :@ : :@  :@@::::  :::@:#:: ::::@: ::  
         |  :  : ::::   :::: :::::@:  :@: : :@ : :@ ::@@::::  :::@:#:: ::::@: ::  
         | ::  : ::::::::::: :::::@: ::@: : :@:: :@ ::@@::::@ :::@:#:: ::::@: ::  
         | ::::: :::::: :::: :::::@: ::@: : :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
         | ::: : :::::: :::: :::::@: ::@::: :@::::@:::@@::::@::::@:#:::::::@::::::
       0 +----------------------------------------------------------------------->h
         0                                                                   95.25
    

    massif-0f3778fbfb03fc9083326e9cf62b3d3293a7f623-921129-450.txt

    Also, now that the whole process finishes in half the time, it's possible that the higher peak memory is mostly a result of different memory-intensive phases coinciding more often.

  397. andrewtoth force-pushed on Dec 7, 2025
  398. DrahtBot added the label CI failed on Dec 7, 2025
  399. DrahtBot commented at 1:43 AM on December 7, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task TSan: https://github.com/bitcoin/bitcoin/actions/runs/19996361340/job/57344491981</sub> <sub>LLM reason (✨ experimental): ThreadSanitizer data race detected in CCoinsViewCache::FetchCoin during coinsviewcacheasync_tests.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  400. andrewtoth force-pushed on Dec 7, 2025
  401. andrewtoth force-pushed on Dec 7, 2025
  402. DrahtBot removed the label CI failed on Dec 7, 2025
  403. DrahtBot added the label Needs rebase on Dec 11, 2025
  404. andrewtoth force-pushed on Dec 11, 2025
  405. DrahtBot removed the label Needs rebase on Dec 11, 2025
  406. TheBlueMatt commented at 4:21 PM on December 11, 2025: contributor

    > For now the async view uses a fixed worker thread count of 4. The workload is primarily I/O-bound on DB latency rather than CPU-bound, so 4 workers already hide most of the latency and it simplifies the implementation. If needed we can make this configurable or tie it to -par later.

    Probably makes sense to benchmark this kind of change in a cloud environment as well. There you'll likely see fixed, higher latency but more consistent as you push more IOPS, which I anticipate might result in substantially different results/optimal thread counts compared to physically attached flash.

  407. andrewtoth commented at 10:45 PM on December 11, 2025: contributor

    I've rebased due to #33602, and added several touchups. Thank you @l0rinc for the suggestions!

    Thank you also @l0rinc for your very thorough measurements. I think 4 threads is a decent choice for now, but as @TheBlueMatt suggests I will try to run benchmarks in a cloud environment with network-connected storage.

    I would like to reproduce the memory findings, but #33351 makes it a little difficult to determine exact numbers. I think running an IBD on each branch and confirming the max RSS would be a better indicator than running an assumeutxo load.
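
    For reference, the "max RSS" in question is what GNU time -v reports as "Maximum resident set size"; a minimal standalone sketch of the same number via POSIX getrusage (unrelated to the PR's code):

    ```cpp
    #include <cassert>
    #include <cstring>
    #include <sys/resource.h>
    #include <vector>

    int main()
    {
        // Touch ~64 MiB so the peak resident set size is clearly measurable.
        std::vector<char> buf(64 * 1024 * 1024);
        std::memset(buf.data(), 1, buf.size());

        rusage usage{};
        assert(getrusage(RUSAGE_SELF, &usage) == 0);
        // On Linux, ru_maxrss is the process's peak RSS in kilobytes --
        // the same figure /usr/bin/time -v prints.
        assert(usage.ru_maxrss >= 64 * 1024);
    }
    ```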

    > the additional threads (4 threads * 8 MB stack = ~32 MB on glibc), reused internal caches (m_inputs.clear() does not free the allocated memory capacity so the peak block input size will be held for the lifetime of the node) result in a measurable memory overhead (>100 MB peak extra).

    I am skeptical about this claim. An InputToFetch is 72 bytes, plus 8 bytes per txid stored. Let's be generous and round up to 100 bytes per input. The theoretical maximum number of inputs in a block is 1 MB / 41 bytes ≈ 24.3k; call it 30k. Let's be generous again and double that to 60k, since vectors double their capacity when they grow. That's 100 bytes * 60k = 6 MB, which plus the 32 MB for the thread stacks is only 38 MB. Where does the extra memory come from? Or am I not accounting for something big in my math here?

    Edit: Actually, double that, since the cacheCoins map also keeps its allocation. That would still only be a maximum of 12 MB (in reality much less) on top of the 32 MB.
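
    The arithmetic above, written out as a compile-time check (all constants are the rough estimates from this comment, not measured values):

    ```cpp
    #include <cstddef>

    // ~72 B per InputToFetch + ~8 B of txid storage, rounded up generously.
    constexpr std::size_t bytes_per_input = 100;
    // ~24.3k theoretical max inputs per block, rounded up to 30k, then
    // doubled to account for vector capacity doubling on growth.
    constexpr std::size_t max_inputs = 2 * 30'000;
    constexpr std::size_t queue_peak = bytes_per_input * max_inputs;
    // 4 worker threads * 8 MiB default glibc stack.
    constexpr std::size_t thread_stacks = 4 * 8 * 1024 * 1024;

    static_assert(queue_peak == 6'000'000);                 // 6 MB
    static_assert(queue_peak + thread_stacks < 40'000'000); // ~38 MB total

    int main() { return 0; }
    ```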

  408. l0rinc commented at 1:50 PM on December 12, 2025: contributor

    As mentioned on IRC yesterday:

    > We also observed that on the 16 GB system, runs with -dbcache values of 4 GB and higher were a lot slower than with -dbcache of 3 GB, and that an rpi5 with 16 GB of memory ran out of memory with -dbcache of 10 GB.

    My assumption was that it's caused by the UTXO set size getting closer to the total memory, so I ran it on the i9 and i7 servers, both of which have 64 GB of memory:

    <img width="1490" height="873" alt="image" src="https://github.com/user-attachments/assets/be18a827-bbeb-4031-b0cc-79f30aa37d45" />

    <details> <summary>reindex-chainstate | 900000 blocks | dbcache 100-7000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD</summary>

    for DBCACHE in 100 200 300 400 500 1000 2000 3000 4000 5000 6000 7000; do \
        COMMITS="f6acbef1084e34f126bf530df99e4ef6a11c38e8 eee2204d6f7117c5b39abaf47d7d329ff0951638"; \
        STOP=900000; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
          --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 100 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        23016.826 s               [User: 40886.840 s, System: 2758.769 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        15579.671 s               [User: 39232.678 s, System: 2491.046 s]
    
    Relative speed comparison
            1.48          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 200 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        21436.283 s               [User: 37890.294 s, System: 2736.881 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14707.903 s               [User: 35945.513 s, System: 2246.392 s]
    
    Relative speed comparison
            1.46          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 300 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        20189.990 s               [User: 35193.471 s, System: 2875.028 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14072.230 s               [User: 33903.334 s, System: 2159.474 s]
    
    Relative speed comparison
            1.43          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 400 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        19874.210 s               [User: 33637.992 s, System: 2759.377 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        14185.335 s               [User: 32711.873 s, System: 2197.601 s]
    
    Relative speed comparison
            1.40          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 500 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        19176.938 s               [User: 31471.188 s, System: 2511.966 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13856.071 s               [User: 31289.683 s, System: 2210.229 s]
    
    Relative speed comparison
            1.38          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 1000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        17434.420 s               [User: 25518.649 s, System: 1736.314 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13178.801 s               [User: 26366.210 s, System: 1707.001 s]
    
    Relative speed comparison
            1.32          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=1000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 2000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        16285.543 s               [User: 20987.530 s, System: 1073.606 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12830.954 s               [User: 22300.342 s, System: 1206.297 s]
    
    Relative speed comparison
            1.27          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=2000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 3000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15768.416 s               [User: 19226.843 s, System: 863.531 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12788.084 s               [User: 20314.162 s, System: 965.156 s]
    
    Relative speed comparison
            1.23          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 4000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15685.029 s               [User: 18706.301 s, System: 811.667 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12892.918 s               [User: 19746.475 s, System: 903.910 s]
    
    Relative speed comparison
            1.22          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=4000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 5000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15478.130 s               [User: 18161.854 s, System: 764.612 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        12924.056 s               [User: 19084.441 s, System: 852.527 s]
    
    Relative speed comparison
            1.20          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=5000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 6000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15563.687 s               [User: 17939.937 s, System: 754.868 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13059.707 s               [User: 18685.622 s, System: 808.353 s]
    
    Relative speed comparison
            1.19          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=6000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 7000 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        15630.441 s               [User: 17853.930 s, System: 776.871 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        13210.208 s               [User: 18681.925 s, System: 822.040 s]
    
    Relative speed comparison
            1.18          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=7000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    

    </details>

    <details> <summary>reindex-chainstate | 900000 blocks | dbcache 100-500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD</summary>

    for DBCACHE in 100 200 300 400 500; do \
        COMMITS="f6acbef1084e34f126bf530df99e4ef6a11c38e8 eee2204d6f7117c5b39abaf47d7d329ff0951638"; \
        STOP=900000; \
        CC=gcc; CXX=g++; \
        BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
        (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
        (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
        hyperfine \
          --sort command \
          --runs 1 \
          --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
          --parameter-list COMMIT ${COMMITS// /,} \
          --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
            cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
            ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
      --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                      cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
          "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 100 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        44039.358 s               [User: 40502.406 s, System: 3048.444 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        34822.469 s               [User: 43549.781 s, System: 2913.842 s]
    
    Relative speed comparison
            1.26          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 200 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        43276.275 s               [User: 37875.550 s, System: 3095.389 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        33394.767 s               [User: 39767.262 s, System: 2773.980 s]
    
    Relative speed comparison
            1.30          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=200 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 300 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        53339.879 s               [User: 37057.843 s, System: 3635.662 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        39372.213 s               [User: 37610.647 s, System: 2897.763 s]
    
    Relative speed comparison
            1.35          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=300 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 400 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        41460.250 s               [User: 33577.389 s, System: 3144.309 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        33277.494 s               [User: 35893.949 s, System: 2803.631 s]
    
    Relative speed comparison
            1.25          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=400 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    
    f6acbef108 Merge bitcoin/bitcoin#33764: ci: Add Windows + UCRT jobs for cross-compiling and native testing
    eee2204d6f validation: fetch block inputs via CCoinsViewCacheAsync during connection
    
    reindex-chainstate | 900000 blocks | dbcache 500 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
      Time (abs ≡):        39183.661 s               [User: 31615.517 s, System: 2853.531 s]
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
      Time (abs ≡):        32531.387 s               [User: 33908.272 s, System: 2720.594 s]
    
    Relative speed comparison
            1.20          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = f6acbef1084e34f126bf530df99e4ef6a11c38e8)
            1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=900000 -dbcache=500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = eee2204d6f7117c5b39abaf47d7d329ff0951638)
    

    </details>

    My conclusions from the above are:

    • It's not about the size of the UTXO set vs total memory;
    • This change performs better the lower the memory (we knew this already);
    • After 3 GB of dbcache there isn't any measurable speedup for some reason - with the PR it even gets slightly slower;
    • Master seems to behave similarly to the PR in this regard;
    • HDD isn't very stable, but the SSD is ridiculously predictable;
    • The PR performs better with 100 MB of dbcache than master with 7 GB;
    • The PR performs exactly the same with 1 GB and 7 GB.
  409. andrewtoth commented at 6:41 PM on December 12, 2025: contributor

    The PR performs better with 100 MB of dbcache than master with 7 GB;

    wat

  410. andrewtoth force-pushed on Dec 20, 2025
  411. in src/validation.cpp:3127 in 6db7f75f53 outdated
    3123 | @@ -3122,6 +3124,7 @@ bool Chainstate::ConnectTip(
    3124 |              if (state.IsInvalid())
    3125 |                  InvalidBlockFound(pindexNew, state);
    3126 |              LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    3127 | +            view.Reset();
    


    l0rinc commented at 11:49 AM on December 21, 2025:

    // local CCoinsViewCache goes out of scope

    This isn't true anymore as far as I can tell

  412. in src/test/fuzz/coinsviewcacheasync.cpp:42 in 6db7f75f53 outdated
      37 | +        .path = "",
      38 | +        .cache_bytes = 1_MiB,
      39 | +        .memory_only = true,
      40 | +    };
      41 | +    g_db.emplace(std::move(db_params), CoinsViewOptions{});
      42 | +    CCoinsViewCache cache{nullptr};
    


    l0rinc commented at 12:42 PM on December 21, 2025:

    hmmm, isn't this UB, aren't we getting lifetime problems here?

    nit: why are the tests separated from the implementation? They're part of the "feature", how can we review one fully without the other?


    andrewtoth commented at 3:00 PM on December 21, 2025:

    hmmm, isn't this UB, aren't we getting lifetime problems here?

    Nothing touches base while it is nullptr. We need to call a method to get UB, and we don't do that until we replace it with a valid pointer. Fuzzing would have revealed any UB by now.


    andrewtoth commented at 3:01 PM on December 21, 2025:

    why are the tests separated from the implementation?

    This seems to be a standard way to do this. Less cognitive load per commit. See e.g. #29415.


    l0rinc commented at 3:04 PM on December 21, 2025:

    Less cognitive load per commit

    Strongly disagree. We simply don't have enough information to review it properly and just skip to the next commit. So in a way, yes, less cognitive load - but not the good kind...


    l0rinc commented at 3:37 PM on December 21, 2025:

    Nothing touches base while it is nullptr

    It's not nullptr, it's a dangling pointer. The CoinsViewCacheAsync constructor takes a CCoinsViewCache& to a stack object, which gets destroyed after setup_threadpool_test exits - something like: https://godbolt.org/z/3e9qoPTP1

    The fix could simply be:

     std::optional<CoinsViewCacheAsync> g_async_cache{};
     std::optional<CCoinsViewDB> g_db{};
    +std::optional<CCoinsViewCache> g_cache{};
     
     static void setup_threadpool_test()
     {
    @@ -39,8 +40,8 @@
             .memory_only = true,
         };
         g_db.emplace(std::move(db_params), CoinsViewOptions{});
    -    CCoinsViewCache cache{nullptr};
    -    g_async_cache.emplace(cache, *g_db);
    +    g_cache.emplace(nullptr);
    +    g_async_cache.emplace(*g_cache, *g_db);
     }
    

    andrewtoth commented at 6:01 PM on December 21, 2025:

    Right, now I see. Yes, it is a dangling pointer, but again, UB is not triggered unless the pointer is dereferenced.


    andrewtoth commented at 3:21 PM on December 22, 2025:

    We can get rid of the dangling pointer by just doing this:

    diff --git a/src/test/fuzz/coinsviewcacheasync.cpp b/src/test/fuzz/coinsviewcacheasync.cpp
    index 77c378288e..8796c51926 100644
    --- a/src/test/fuzz/coinsviewcacheasync.cpp
    +++ b/src/test/fuzz/coinsviewcacheasync.cpp
    @@ -41,6 +41,7 @@ static void setup_threadpool_test()
         g_db.emplace(std::move(db_params), CoinsViewOptions{});
         CCoinsViewCache cache{nullptr};
         g_async_cache.emplace(cache, *g_db);
    +    g_async_cache->SetBackend(nullptr);
     }
     
     FUZZ_TARGET(coinsviewcacheasync, .init = setup_threadpool_test)
    

    andrewtoth commented at 3:00 AM on December 23, 2025:

    This is updated to have the same constructor interface as CCoinsViewCache, so fixed.

  413. in src/coinsviewcacheasync.h:240 in 6db7f75f53 outdated
     235 | +        }
     236 | +    }
     237 | +
     238 | +    ~CoinsViewCacheAsync() override
     239 | +    {
     240 | +        m_barrier.arrive_and_drop();
    


    l0rinc commented at 12:53 PM on December 21, 2025:

    we seem to be calling StopFetching everywhere except here - since we rely on m_inputs.empty() for releasing the barrier, it might be safer to do that here, too - what do you think? Or maybe a Flush() - not sure...


    l0rinc commented at 8:43 AM on December 25, 2025:

    Hmmm, if I just call StartFetching and let the destructor do its job, I get a hanging test:

    BOOST_AUTO_TEST_CASE(destructor_without_reset)
    {
        CCoinsViewDB db{{.path = "", .memory_only = true}, {}};
        CCoinsViewCache main_cache{&db};
        CoinsViewCacheAsync view{&main_cache};
        view.StartFetching(CreateBlock());
        // Destructor called WITHOUT Reset() or Flush()
    }
    

    adding an explicit stop to the destructor fixes it for me:

    ~CoinsViewCacheAsync() override
    {
        StopFetching();
        m_barrier.arrive_and_drop();
        for (auto& t : m_worker_threads) t.join();
    }
    

    If you think this is a misuse of the API, we could add an assert in the destructor instead.

  414. in src/coinsviewcacheasync.h:25 in 6db7f75f53
      20 | +#include <ranges>
      21 | +#include <thread>
      22 | +#include <utility>
      23 | +#include <vector>
      24 | +
      25 | +static constexpr int32_t WORKER_THREADS{8};
    


    l0rinc commented at 1:08 PM on December 21, 2025:

    Based on our previous measurements I think this should remain 4 - unless you have better data.

  415. in src/coinsviewcacheasync.h:86 in 6db7f75f53 outdated
      81 | +     * Similar to CCoinsViewCache::GetCoin, but it does not mutate internally.
      82 | +     * Therefore safe to call from any thread once inside the barrier.
      83 | +     */
      84 | +    std::optional<Coin> GetCoinWithoutMutating(const COutPoint& outpoint) const
      85 | +    {
      86 | +        if (auto coin{static_cast<CCoinsViewCache*>(base)->GetPossiblySpentCoinFromCache(outpoint)}) {
    


    l0rinc commented at 1:23 PM on December 21, 2025:

    this is still super-fishy to me, we have to fix some abstractions here first ...


    andrewtoth commented at 3:04 PM on December 21, 2025:

    We know we will always have a cache as the base here. This doesn't always hold for the base class; e.g., a CCoinsViewCache can have a CCoinsViewDB as its base.


    andrewtoth commented at 2:59 AM on December 23, 2025:

    Updated to use a FetchCoinWithoutMutating protected method.


    l0rinc commented at 4:47 PM on December 25, 2025:

    My understanding is that as long as the current cache and its descendants are CCoinsViewCache instances, we will try to get the outpoint from them - but as mentioned, I think that violates their interface and introduces unannounced unpredictability (i.e. it surprises the reviewer by assuming more locally than the interface promised).

    Not yet sure if that would be better, but if we introduced an additional CCoinsViewCache method to peek into the structure, we wouldn't need the iteration and casting, i.e. something like:

    std::optional<Coin> CCoinsViewBacked::PeekCoin(const COutPoint& outpoint) const { return base->PeekCoin(outpoint); }
    

    and

    std::optional<Coin> CCoinsViewCache::PeekCoin(const COutPoint& outpoint) const
    {
        if (auto it{cacheCoins.find(outpoint)}; it != cacheCoins.end()) {
            return it->second.coin.IsSpent() ? std::nullopt : std::optional{it->second.coin};
        }
        return base->PeekCoin(outpoint);
    }
    
  416. in src/coins.h:495 in ab1614473b outdated
     491 | @@ -484,7 +492,7 @@ class CCoinsViewCache : public CCoinsViewBacked
     492 |       * @note this is marked const, but may actually append to `cacheCoins`, increasing
     493 |       * memory usage.
     494 |       */
     495 | -    CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const;
     496 | +    virtual CCoinsMap::iterator FetchCoin(const COutPoint &outpoint) const;
    


    l0rinc commented at 1:25 PM on December 21, 2025:

    Doesn't this incur an additional hot-path virtual dispatch cost? I will carve this out to a separate commit and run a reindex-chainstate with minimal dbcache to force Flush & FetchCoin calls with and without virtual access to see if this is a valid concern.


    andrewtoth commented at 3:05 PM on December 21, 2025:

    I'm not sure why that matters. Our benchmarks show a very large speedup with this change?


    l0rinc commented at 6:48 PM on December 25, 2025:

    Measured it a few times and got very different results; I guess we can assume for now that this isn't a problem.


    andrewtoth commented at 7:43 PM on January 11, 2026:

    Got rid of this via #34165.

  417. in src/coinsviewcacheasync.h:14 in dc4c3f6cac outdated
       9 | +
      10 | +class CoinsViewCacheAsync : public CCoinsViewCache
      11 | +{
      12 | +public:
      13 | +    //! Reset state.
      14 | +    void Reset() noexcept
    


    l0rinc commented at 1:30 PM on December 21, 2025:

    Reset doesn't sound like an async property - would it make sense to make that a CCoinsViewCache method instead?


    andrewtoth commented at 3:24 PM on December 22, 2025:

    We would need a virtual method in the base CCoinsViewCache class, and then override it here because we have to clear our subclass members as well. But it would never be called through the base class anywhere. It might make the commits easier to follow though, since I could introduce it on the base class and use that to reuse just a CCoinsViewCache, then introduce CoinsViewCacheAsync later. I will experiment.


    l0rinc commented at 6:49 PM on December 25, 2025:

    The way you've added it is excellent, carving out a genuine sub-feature (which we could push as a separate PR)

  418. in src/coinsviewcacheasync.h:83 in b9edd77b49 outdated
      78 | @@ -67,6 +79,11 @@ class CoinsViewCacheAsync : public CCoinsViewCache
      79 |          if (i >= m_inputs.size()) [[unlikely]] return false;
      80 |  
      81 |          auto& input{m_inputs[i]};
      82 | +        // Inputs spending a coin from a tx earlier in the block won't be in the cache or db
      83 | +        if (std::ranges::binary_search(m_txids, input.outpoint.hash.ToUint256().GetUint64(0))) {
    


    l0rinc commented at 1:44 PM on December 21, 2025:

    b9edd77b4960f68afc761447e4e3372371be2143: this feature is nicely split out of the whole - but I'm missing a test in the commit that could help me debug it locally before I move on to the next commit.

  419. in src/test/coinsviewoverlay_tests.cpp:206 in 85c20a57d4 outdated
     201 | @@ -202,4 +202,20 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
     202 |      }
     203 |  }
     204 |  
     205 | +// Test that the main thread can make progress with no workers
     206 | +BOOST_AUTO_TEST_CASE(fetch_main_thread)
    


    l0rinc commented at 1:46 PM on December 21, 2025:

    can we add this to the commit that introduces the feature that this is validating?

  420. in src/coinsviewcacheasync.h:27 in 6db7f75f53 outdated
      23 | @@ -24,6 +24,19 @@
      24 |  
      25 |  static constexpr int32_t WORKER_THREADS{8};
      26 |  
      27 | +/**
    


    l0rinc commented at 1:46 PM on December 21, 2025:

    6db7f75f53bd89e4e9b019c6ecd0c31f43d0f219: comments aren't features - why not add them when CoinsViewCacheAsync is introduced, and extend the description with every added feature?

  421. in src/coinsviewcacheasync.h:127 in 6db7f75f53 outdated
     122 | +    {
     123 | +        // This assumes ConnectBlock accesses all inputs in the same order as they are added to m_inputs
     124 | +        // in StartFetching. Some outpoints are not accessed because they are created by the block, so we scan until we
     125 | +        // come across the requested input. We advance the tail since the input will be cached and not accessed through
     126 | +        // this method again.
     127 | +        for (const auto i : std::views::iota(m_input_tail, m_inputs.size())) [[likely]] {
    


    l0rinc commented at 1:49 PM on December 21, 2025:

    does the [[likely]] have an effect on the if below? As mentioned before, I think it has some weird effect when nested...


    andrewtoth commented at 2:54 PM on December 21, 2025:

    No, it does not affect anything other than the loop branch. It's not nested here.

  422. l0rinc commented at 1:56 PM on December 21, 2025: contributor

    I went through the changes quickly, planning on recreating everything locally to understand the constraints more fundamentally.

    I want to investigate a few more issues before I can do that (e.g. the virtual dispatch cost on the critical path, how removing noexcept and simpler siphash changes would affect the constraints, and whether we can clean up the coins area a bit more before we proceed).

    To reduce risk (since this is at the heart of the project), I want to continue carving out cleanup PRs that would derisk and simplify this one. I appreciate your patience and quick reaction time here.

  423. andrewtoth commented at 2:32 PM on December 21, 2025: contributor

    Did some benchmarks for cloud connected storage in AWS.

    I used 2 c6in.xlarge instances (4 vCPU, 8 GB RAM) with 800 GB gp3 volumes attached. The volumes were configured with 12500 IOPS and 391 MB/s throughput, which is the baseline for that instance type. The instance type was chosen because it had the highest baseline EBS throughput for that size class.

    I ran a reindex-chainstate up to block 921129 with this branch and master, and this branch was ~2.6x faster. 12h43m vs 33h20m.

    branch: 39139.06user 4836.53system 12:42:43elapsed 96%CPU (0avgtext+0avgdata 6748912maxresident)k 25383964712inputs+5043205272outputs (140197459major+76837294minor)pagefaults 0swaps

    master: 34586.03user 4147.82system 33:19:56elapsed 32%CPU (0avgtext+0avgdata 6931572maxresident)k 25758328464inputs+5040510928outputs (137873360major+81380971minor)pagefaults 0swaps

    On network-connected storage, master is completely dominated by the serial latency of fetching inputs one by one. It can't push past around 140 MB/s of read throughput, so it didn't get close to maxing out the disk. This branch easily hits the volume limits with 4 threads by parallelizing away this latency. After it completed, I ran it again with the gp3 volume throughput limit bumped to 1000 MB/s. It managed over 500 MB/s and completed in 10h01m - a ~3.3x speedup over master. Coincidentally, both the second run and master completed at almost the exact same time, which you can see in the graph below.

    <img width="2720" height="593" alt="Screenshot from 2025-12-20 23-17-52" src="https://github.com/user-attachments/assets/06c771b9-dbef-409a-9870-0647d39986c0" />

    Afterwards, I tested running with 8 and 12 threads, with both volumes bumped to 1000 MB/s. The 12-thread variant hit the throughput limit, but the 8-thread one did not. However, after 3 hours of bursting above the baseline, both got throttled back. These finished in 10h04m and 8h51m respectively - 3.3x and ~3.8x speedups - so they are not much faster with this instance type. But with a bigger instance with a higher EBS baseline and gp3 settings allowing more read throughput, they could be much faster.

    So, if you have a lot of sustained read throughput available, it might make sense to bump to many more threads. @TheBlueMatt

    The below graph shows this branch but with 12 worker threads (called master in the graph) and 8 worker threads (called branch in the graph). After 3 hours of bursting above baseline they get throttled back to ~390 MB/s. <img width="2720" height="593" alt="Screenshot from 2025-12-21 09-27-01" src="https://github.com/user-attachments/assets/ce9f24d2-d416-44bd-99d7-41921a869928" />

    Also, the max RSS reported by /usr/bin/time in the output above shows that master actually has the higher max RSS: 6931572k vs 6748912k. That would imply an RSS of 6.9 GB vs 6.7 GB, which doesn't really make sense to me. Both were using the default dbcache, so I'm not entirely sure how these values are computed. However, it seems there is less memory pressure on this branch than on master. @l0rinc

  424. andrewtoth force-pushed on Dec 21, 2025
  425. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >40% faster IBD
    validation: fetch block inputs on parallel threads 3x faster IBD
    on Dec 22, 2025
  426. andrewtoth force-pushed on Dec 23, 2025
  427. andrewtoth force-pushed on Dec 23, 2025
  428. DrahtBot added the label CI failed on Dec 23, 2025
  429. DrahtBot commented at 3:06 AM on December 23, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task test max 6 ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/20450003058/job/58760941958</sub> <sub>LLM reason (✨ experimental): Compilation error: parameter 'base' shadows inherited member in CoinsViewCacheAsync, causing CI failure.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  430. andrewtoth force-pushed on Dec 23, 2025
  431. andrewtoth force-pushed on Dec 23, 2025
  432. DrahtBot removed the label CI failed on Dec 23, 2025
  433. in src/validation.cpp:3117 in 14f1b79138 outdated
    3113 | @@ -3113,7 +3114,7 @@ bool Chainstate::ConnectTip(
    3114 |      LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3115 |               Ticks<MillisecondsDouble>(time_2 - time_1));
    3116 |      {
    3117 | -        CCoinsViewCache view(&CoinsTip());
    3118 | +        auto& view{*m_coins_views->m_connect_block_view};
    


    l0rinc commented at 11:39 AM on December 24, 2025:

    14f1b7913861d95e3ffeda658d45b0b88e7019e7:

    I love this separation, we could extract it to a new PR since it kinda' makes sense on its own - carving out a small but self-sufficient portion of this PR. It could also help us in future refactors (e.g. preallocating the coins cache maps to avoid resizes).


    Could we add a Start here as well, which could assert that this was in a clean state at the beginning (as a sanity check that all return paths are calling it)?

    void CCoinsViewCache::Start()
    {
        Assert(cacheCoins.empty());
        Assert(cachedCoinsUsage == 0);
        Assert(m_sentinel.second.Next() == &m_sentinel);
        Assert(m_sentinel.second.Prev() == &m_sentinel);
    
        SetBestBlock(base->GetBestBlock());
    }
    

    Besides the symmetry with Reset, it could help with reusing this same view for the other /*will_reuse_cache=*/false call sites. The assertion would ensure that the two cannot accidentally run at the same time (proving that we can reuse the same instance).

    • currently applied to Chainstate::ConnectTip, but could be added (here or in a separate cleanup PR) to Chainstate::DisconnectTip and TestBlockValidity cleanly
    • CVerifyDB::VerifyDB and Chainstate::ReplayBlocks & Chainstate::RollforwardBlock need to hold more than 1 block, not sure it's worth reusing the cache for those - though parallelization would be welcome in both cases...

    We could also add Reset() and Start() to the constructor, but that would require fixing the mentioned UB in #34124 (comment)


    Note: we could access this through a dedicated method instead like we do with other similar ones:

    CoinsViewCacheAsync& ConnectBlockView() EXCLUSIVE_LOCKS_REQUIRED(::cs_main)
    {
        AssertLockHeld(::cs_main);
        Assert(m_coins_views);
        return *Assert(m_coins_views->m_connect_block_view);
    }
    

    andrewtoth commented at 7:34 PM on December 28, 2025:

    Hmm doing SetBestBlock(base->GetBestBlock()); seems like a behavior change which I don't want to do here. Reset is basically returning the state of CCoinsViewCache to a fresh copy, so it is identical behavior-wise to what we had before.

    If we don't call SetBestBlock, then I don't see a benefit to including a Start method.


    l0rinc commented at 7:37 PM on December 28, 2025:

    seems like a behavior change which I don't want to do here

    Don't we have a race condition otherwise because of the lazy init + setter?


    andrewtoth commented at 7:41 PM on December 28, 2025:

    a race condition

    I don't think so? Only the base's cacheCoins is accessed via multiple threads, not hashBlock. So it is safe to mutate hashBlock on the main thread.

  434. in src/coins.cpp:284 in 14f1b79138 outdated
     276 | @@ -275,6 +277,13 @@ bool CCoinsViewCache::Sync()
     277 |      return fOk;
     278 |  }
     279 |  
     280 | +void CCoinsViewCache::Reset() noexcept
     281 | +{
     282 | +    cacheCoins.clear();
     283 | +    cachedCoinsUsage = 0;
     284 | +    hashBlock.SetNull();
    


    l0rinc commented at 12:15 PM on December 24, 2025:

    14f1b7913861d95e3ffeda658d45b0b88e7019e7: we're not setting view.SetBestBlock in Chainstate::ConnectTip in this commit - but I think we should, to avoid leaving the lazy getter which is in a race with the setter. Since this is done in a multithreaded code, I think we should set it deterministically and remove the lazy init. Especially since Flush/Sync access hashBlock directly...


    andrewtoth commented at 7:42 PM on December 29, 2025:

    the lazy getter which is in a race with the setter

    Can you describe this race? The current behavior creates a CCoinsViewCache with a null hashBlock and passes it to ConnectBlock. This behavior now resets the CoinsViewCacheAsync hashBlock to null and passes it to ConnectBlock. We don't modify any behavior inside ConnectBlock. hashBlock is not accessed in multithreaded code.

  435. in src/coins.cpp:256 in 14f1b79138 outdated
     252 | @@ -253,11 +253,13 @@ bool CCoinsViewCache::Flush(bool will_reuse_cache) {
     253 |      auto cursor{CoinsViewCacheCursor(m_sentinel, cacheCoins, /*will_erase=*/true)};
     254 |      bool fOk = base->BatchWrite(cursor, hashBlock);
     255 |      if (fOk) {
     256 | -        cacheCoins.clear();
     257 |          if (will_reuse_cache) {
    


    l0rinc commented at 10:28 AM on December 26, 2025:

    we can likely get rid of the will_reuse_cache now that we have a reusable cache that we can reset - will attempt in a follow-up


    andrewtoth commented at 8:57 PM on January 3, 2026:

    Done in #34164.

  436. andrewtoth force-pushed on Dec 26, 2025
  437. andrewtoth force-pushed on Dec 26, 2025
  438. DrahtBot added the label CI failed on Dec 26, 2025
  439. DrahtBot commented at 4:41 PM on December 26, 2025: contributor


    🚧 At least one of the CI tasks failed. <sub>Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20525757781/job/58968699149</sub> <sub>LLM reason (✨ experimental): Lint failure due to trailing whitespace in src/test/fuzz/coinscache_sim.cpp:59.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  440. andrewtoth force-pushed on Dec 26, 2025
  441. andrewtoth force-pushed on Dec 26, 2025
  442. DrahtBot removed the label CI failed on Dec 26, 2025
  443. andrewtoth commented at 8:19 PM on December 26, 2025: contributor

    Thank you again @l0rinc for your review. I've taken most of your suggestions. Reset() is now on the base CCoinsViewCache class and GetPossiblySpentCoinFromCache has been replaced with a protected FetchCoinWithoutMutating.

    I've removed the new fuzz harness and instead integrated the CoinsViewCacheAsync to our coins_view and coinscache_sim fuzz targets. So, we can fuzz them as we would a CCoinsViewCache and make sure the new subclass behaves the same as before. I have been fuzzing the three new targets for a while now.

    virtual dispatch cost on critical path

    I don't see why this needs investigating. Both our benchmarks show large speedups along the critical path, so even if there is a minor increased cost here it is a net benefit.

    how removing noexcept and simpler siphash changes would affect the constraints

    These seem like good ideas to investigate, but I don't see how they are applicable to the change proposed here. Can you elaborate?

    whether we can clean up the coins area a bit more before we proceed

    That is a worthy goal. I have refactored to make the changes to coins more straightforward.

  444. in src/test/fuzz/coins_view.cpp:91 in 738c40a566 outdated
      84 | @@ -85,6 +85,11 @@ void TestCoinsView(FuzzedDataProvider& fuzzed_data_provider, CCoinsView& backend
      85 |                  if (is_db && best_block.IsNull()) best_block = uint256::ONE;
      86 |                  coins_view_cache.SetBestBlock(best_block);
      87 |              },
      88 | +            [&] {
      89 | +                coins_view_cache.Reset();
      90 | +                // Set best block hash to non-null to satisfy the assertion in CCoinsViewDB::BatchWrite().
      91 | +                if (is_db) coins_view_cache.SetBestBlock(uint256::ONE);
    


    l0rinc commented at 7:20 AM on December 27, 2025:

    we only need this for caches that actually write to db, so we might as well:

                    if (!will_reuse_cache && is_db) coins_view_cache.SetBestBlock(uint256::ONE);
    
  445. in src/test/coins_tests.cpp:1124 in 738c40a566 outdated
    1120 | @@ -1121,4 +1121,28 @@ BOOST_AUTO_TEST_CASE(ccoins_emplace_duplicate_keeps_usage_balanced)
    1121 |      BOOST_CHECK(cache.AccessCoin(outpoint) == coin1);
    1122 |  }
    1123 |  
    1124 | +BOOST_AUTO_TEST_CASE(ccoins_reset)
    


    l0rinc commented at 7:28 AM on December 27, 2025:

    we could extend this with idempotency checks - especially if we add the mentioned Start method:

    BOOST_AUTO_TEST_CASE(ccoins_start)
    {
        test_only_CheckFailuresAreExceptionsNotAborts mock_checks{};
    
        CCoinsView root;
        CCoinsViewCacheTest cache{&root};
    
        // Start fails if state wasn't reset
        cache.Start();
        cache.EmplaceCoinInternalDANGER({Txid::FromUint256(m_rng.rand256()), m_rng.rand32()}, {});
        BOOST_CHECK_THROW(cache.Start(), NonFatalCheckError);
    
        // Resetting allows start again
        cache.Reset();
        cache.Start();
    
        // Reset and Start are idempotent
        cache.Reset();
        cache.Reset();
        cache.Start();
        cache.Start();
    }
    

    andrewtoth commented at 8:39 PM on December 28, 2025:

    Added in #34164.

  446. in src/coins.cpp:278 in 738c40a566 outdated
     274 | @@ -275,6 +275,13 @@ bool CCoinsViewCache::Sync()
     275 |      return fOk;
     276 |  }
     277 |  
     278 | +void CCoinsViewCache::Reset() noexcept
    


    l0rinc commented at 7:42 AM on December 27, 2025:

    the constructor should likely call this reset at the beginning, so this should likely adjust m_sentinel.second as well, something like:

    CCoinsViewCache::CCoinsViewCache(CCoinsView* baseIn, bool deterministic) :
        CCoinsViewBacked(baseIn), m_deterministic(deterministic),
        cacheCoins(0, SaltedOutpointHasher(/*deterministic=*/deterministic), CCoinsMap::key_equal{}, &m_cache_coins_memory_resource)
    {
        CCoinsViewCache::Reset();
        Start();
    }
    
    void CCoinsViewCache::Start()
    {
        Assert(cacheCoins.empty());
        Assert(cachedCoinsUsage == 0);
        Assert(m_sentinel.second.Next() == &m_sentinel);
        Assert(m_sentinel.second.Prev() == &m_sentinel);
    
        SetBestBlock(base->GetBestBlock());
    }
    
    void CCoinsViewCache::Reset() noexcept
    {
        cacheCoins.clear();
        cachedCoinsUsage = 0;
        hashBlock.SetNull();
        m_sentinel.second.SelfRef(m_sentinel);
    }
    

    andrewtoth commented at 8:58 PM on January 3, 2026:

    I don't think we need to add a Start() method to the cache. I'd rather not touch the hashBlock behavior in this PR; it can be cleaned up in a parallel PR. No multithreaded code touches it here.

  447. in src/validation.cpp:3136 in 738c40a566 outdated
    3132 | @@ -3131,8 +3133,9 @@ bool Chainstate::ConnectTip(
    3133 |                   Ticks<MillisecondsDouble>(time_3 - time_2),
    3134 |                   Ticks<SecondsDouble>(m_chainman.time_connect_total),
    3135 |                   Ticks<MillisecondsDouble>(m_chainman.time_connect_total) / m_chainman.num_blocks_total);
    3136 | -        bool flushed = view.Flush(/*will_reuse_cache=*/false); // local CCoinsViewCache goes out of scope
    3137 | +        bool flushed = view.Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
    


    l0rinc commented at 7:45 AM on December 27, 2025:

    👍 for the new comment

  448. in src/coinsviewcacheasync.h:177 in b9ecb3c9ba outdated
     172 | +        m_txids.clear();
     173 | +    }
     174 | +
     175 | +public:
     176 | +    //! Fetch all block inputs.
     177 | +    void StartFetching(const CBlock& block) noexcept
    


    l0rinc commented at 9:11 AM on December 27, 2025:

    Is this idempotent, are we sure all state is reset after the previous block? Could we call a reset here or an assert to make sure we're not accidentally inheriting anything from a previous (failed?) fetch?


    andrewtoth commented at 7:22 PM on December 28, 2025:

    It's not idempotent. We need to call Reset/Flush/Sync/SetBackend on the cache before calling this again. There is an Assume(m_inputs.empty()); check. I'm not sure we want to call Reset here, since calling this before Reset is an error and we should crash.


    andrewtoth commented at 7:50 PM on January 11, 2026:
  449. in src/test/coinsviewcacheasync_tests.cpp:134 in b9ecb3c9ba outdated
     129 | +    PopulateView(block, main_cache);
     130 | +    CoinsViewCacheAsync view{&main_cache};
     131 | +    for (auto i{0}; i < 3; ++i) {
     132 | +        view.StartFetching(block);
     133 | +        CheckCache(block, view);
     134 | +        view.Reset();
    


    l0rinc commented at 9:13 AM on December 27, 2025:

    what if we forget to call Reset() between two StartFetching calls?


    andrewtoth commented at 7:19 PM on December 28, 2025:

    Bad things. We need to call Reset before the block is destroyed.


    andrewtoth commented at 7:50 PM on January 11, 2026:

    I have updated StartFetching to call StopFetching first. This way you can fetch two blocks without resetting the cache (useful for VerifyDB). I've also added a RAII control object that is returned from StartFetching and calls StopFetching when it goes out of scope. This is much safer than having to remember to call one of the stopping methods. It also confines the Reset concern to the base CCoinsViewCache only.

    This is bound to the lifetime of the block as well, so we have static analysis that will ensure we don't keep fetching after the block is destroyed (causing UB).


    l0rinc commented at 8:27 PM on January 11, 2026:

    That's definitely better. Added some comments in #31132#pullrequestreview-3648380425 and #31132 (review) that might make this even more lightweight.

  450. in src/coinsviewcacheasync.h:183 in b9ecb3c9ba outdated
     178 | +    {
     179 | +        Assume(m_inputs.empty());
     180 | +        // Loop through the inputs of the block and set them in the queue. Also construct the set of txids to filter.
     181 | +        for (const auto& tx : block.vtx | std::views::drop(1)) [[likely]] {
     182 | +            for (const auto& input : tx->vin) [[likely]] m_inputs.emplace_back(input.prevout);
     183 | +            m_txids.emplace_back(tx->GetHash().ToUint256().GetUint64(0));
    


    l0rinc commented at 9:21 AM on December 27, 2025:

    This is internal, so we don't necessarily need endian conversion here (would be skipped on most popular platforms anyway), but we could simplify a few of these regardless:

                m_txids.emplace_back(ReadLE64(tx->GetHash().begin()));
    

    andrewtoth commented at 7:39 PM on December 29, 2025:

    Is this simpler? GetUint64 calls ReadLE64 internally. This would require another import for ReadLE64 as well. We're just skipping the conversion to uint256.


    l0rinc commented at 8:51 PM on December 29, 2025:

    yes, it's simpler, we're skipping a call and a conversion and it's shorter - fewer moving parts. If you don't like it, resolve this comment.


    andrewtoth commented at 3:30 PM on January 3, 2026:

    You're right. Taken, thanks!

  451. andrewtoth force-pushed on Dec 28, 2025
  452. andrewtoth force-pushed on Dec 28, 2025
  453. DrahtBot added the label CI failed on Dec 28, 2025
  454. l0rinc commented at 6:01 PM on December 28, 2025: contributor

    While I was reviewing you pushed two new versions, so let me add my half-baked comments in the meantime. I'm also experimenting with adding the features in smaller steps in https://github.com/l0rinc/bitcoin/pull/79/commits - a separate resettable and reusable cache (applied to other temp places as well), introducing a single-threaded fetcher at first, changing it to newly created threads next, and optimizing it via a barrier-guarded thread pool in a follow-up. It's definitely not done yet, but I want to make sure we have progress; I'd appreciate it if you could take a look and see what we can use here from these ideas. I think we should be able to extract the cache-reuse commit to a dedicated PR.

  455. l0rinc commented at 7:49 AM on December 31, 2025: contributor

    Now that I have access to a Windows benchmarking server, managed to run a few rounds of reindex-chainstate with default 450 dbcache until tip.

    Edit: I previously posted some measurements for the PR, but it turns out Windows didn't actually check out the new commit (some chmod leftovers), so I was measuring just variance.

    Edit2: reran the PR separately, seems we maintain the speedup we were hoping for:

    <img width="1448" height="845" alt="image" src="https://github.com/user-attachments/assets/8f189bf8-1d67-4e9d-a163-aa5a3958e848" />

    <details> <summary>Details</summary>

    for DBCACHE in 450; do \
      COMMITS="7f295e1d9b44c225c823242c1f04239f46fb27a6 0827d5d363d68f38feff89124347e9914de83cfa"; \
      STOP=927719; \
      BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
      (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
      (echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | SSD"; echo "") && \
      hyperfine \
        --sort command \
        --runs 2 \
        --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
        --parameter-list COMMIT ${COMMITS// /,} \
        --prepare "killall -9 bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git clean -fxd; git reset --hard {COMMIT} && \
          cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
          ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
        --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log; \
                    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
        "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0";
    done
    
    0827d5d363 validation: fetch inputs on parallel threads
    
    reindex-chainstate | 927719 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | SSD
    
    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7f295e1d9b44c225c823242c1f04239f46fb27a6)
      Time (mean ± σ):     119997.373 s ± 2661.660 s    [User: 85751.035 s, System: 34713.420 s]
      Range (min … max):   118115.295 s … 121879.451 s    2 runs
    
    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 0827d5d363d68f38feff89124347e9914de83cfa)
      Time (mean ± σ):     48412.140 s ± 1510.148 s    [User: 86485.276 s, System: 21483.316 s]
      Range (min … max):   47344.305 s … 49479.976 s    2 runs
    

    </details>


    Edit: I tried a native executable built with clang and couldn't reproduce any speedup vs master that way. Couldn't yet make gcc work.

  456. maflcko removed the label CI failed on Jan 1, 2026
  457. maflcko commented at 11:37 AM on January 2, 2026: member

    Now that I have access to a Windows benchmarking server, managed to run a few rounds

    It looks like this is run inside WSL (Linux) and compiled for Linux. I wonder if this is representative of real end-user performance, since end users normally run native Windows (.exe) binaries?

  458. DrahtBot added the label Needs rebase on Jan 3, 2026
  459. andrewtoth force-pushed on Jan 3, 2026
  460. DrahtBot removed the label Needs rebase on Jan 3, 2026
  461. andrewtoth force-pushed on Jan 3, 2026
  462. andrewtoth force-pushed on Jan 3, 2026
  463. DrahtBot added the label CI failed on Jan 3, 2026
  464. andrewtoth force-pushed on Jan 3, 2026
  465. DrahtBot removed the label CI failed on Jan 3, 2026
  466. andrewtoth commented at 9:22 PM on January 3, 2026: contributor

    Measured the performance at tip. This branch is ~64% faster connecting newly seen blocks than master.

    |        | Node 1       | Node 2       | Average      |
    |--------|--------------|--------------|--------------|
    | master | 273.80ms/blk | 319.19ms/blk | 296.50ms/blk |
    | branch | 179.89ms/blk | 181.45ms/blk | 180.67ms/blk |

    I ran 5 t3.small AWS instances with 20 GB gp2 EBS volumes attached, all pruned to 550, with the exact same blocks, chainstate, and mempool.dat uploaded to them. Two nodes ran master and two ran this branch. These 4 nodes connected only to the 5th node, which itself connected to a trusted node outside the VPC. Using debug=bench, we can see the cumulative block connection speed in the debug logs. These are linked in the table above.

    Edit: There was a network outage with the gateway node for 12 hours, and on connection all nodes caught up. This skews results. Restarted the nodes and will get more data.

  467. andrewtoth force-pushed on Jan 5, 2026
  468. andrewtoth force-pushed on Jan 5, 2026
  469. DrahtBot added the label CI failed on Jan 5, 2026
  470. DrahtBot removed the label CI failed on Jan 5, 2026
  471. andrewtoth force-pushed on Jan 9, 2026
  472. andrewtoth force-pushed on Jan 11, 2026
  473. in src/bench/coinsviewcacheasync.cpp:41 in 396f784f8f
      36 | +    }
      37 | +    chainstate.ForceFlushStateToDisk();
      38 | +    CoinsViewCacheAsync async_cache{&coins_tip};
      39 | +
      40 | +    bench.run([&] {
      41 | +        const auto fetch_control{async_cache.StartFetching(block)};
    


    l0rinc commented at 7:41 PM on January 11, 2026:

    I have mixed feelings about these new RAII unused variables. We already have structures like these where e.g. the locks are only applied in a given scope - if we think this automatic cleanup is better, can we do something like that instead?

    <details> <summary>WITH_BLOCK_INPUTS_FETCHING prototype</summary>

    diff --git a/src/bench/coinsviewcacheasync.cpp b/src/bench/coinsviewcacheasync.cpp
    index aa6c9c4cd7..8b7dcc5505 100644
    --- a/src/bench/coinsviewcacheasync.cpp
    +++ b/src/bench/coinsviewcacheasync.cpp
    @@ -38,7 +38,7 @@ static void CoinsViewCacheAsyncBenchmark(benchmark::Bench& bench)
         CoinsViewCacheAsync async_cache{&coins_tip};
     
         bench.run([&] {
    -        const auto fetch_control{async_cache.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(async_cache, block);
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& in : tx->vin) {
                     const auto have{async_cache.HaveCoin(in.prevout)};
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    index 779bb05633..db471bff40 100644
    --- a/src/coinsviewcacheasync.h
    +++ b/src/coinsviewcacheasync.h
    @@ -12,6 +12,7 @@
     #include <primitives/transaction.h>
     #include <tinyformat.h>
     #include <util/check.h>
    +#include <util/macros.h>
     #include <util/threadnames.h>
     
     #include <algorithm>
    @@ -298,4 +299,8 @@ public:
         }
     };
     
    +//! Helper macro to start background fetching of all inputs in a block for the current scope.
    +#define WITH_BLOCK_INPUTS_FETCHING(view, block) \
    +    [[maybe_unused]] const auto UNIQUE_NAME(fetch_control_) = (view).StartFetching(block)
    +
     #endif // BITCOIN_COINSVIEWCACHEASYNC_H
    diff --git a/src/test/coinsviewcacheasync_tests.cpp b/src/test/coinsviewcacheasync_tests.cpp
    index 8e0020de9c..06b0dd3fc6 100644
    --- a/src/test/coinsviewcacheasync_tests.cpp
    +++ b/src/test/coinsviewcacheasync_tests.cpp
    @@ -109,7 +109,7 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_db)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             // Check that no coins have been moved up to main cache from db
             for (const auto& tx : block.vtx) {
    @@ -129,7 +129,7 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_cache)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             view.Reset();
         }
    @@ -147,7 +147,7 @@ BOOST_AUTO_TEST_CASE(fetch_no_double_spend)
         PopulateView(block, main_cache, /*spent=*/true);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     const auto& c{view.AccessCoin(in.prevout)};
    @@ -167,7 +167,7 @@ BOOST_AUTO_TEST_CASE(fetch_no_inputs)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     const auto& c{view.AccessCoin(in.prevout)};
    @@ -191,7 +191,7 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
         main_cache.EmplaceCoinInternalDANGER(COutPoint{Txid::FromUint256(uint256::ZERO), 0}, std::move(coin));
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             const auto& accessed_coin{view.AccessCoin(outpoint)};
             BOOST_CHECK(!accessed_coin.IsSpent());
             view.Reset();
    @@ -207,7 +207,7 @@ BOOST_AUTO_TEST_CASE(fetch_main_thread)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache, /*deterministic=*/false, /*num_workers=*/0};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, block);
             CheckCache(block, view);
             view.Reset();
         }
    diff --git a/src/test/fuzz/coins_view.cpp b/src/test/fuzz/coins_view.cpp
    index 3f03ec90f0..aaffe98b88 100644
    --- a/src/test/fuzz/coins_view.cpp
    +++ b/src/test/fuzz/coins_view.cpp
    @@ -384,7 +384,7 @@ FUZZ_TARGET(coins_view_async, .init = initialize_coins_view)
         CCoinsView backend_coins_view;
         g_async_cache->SetBackend(backend_coins_view);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    +    WITH_BLOCK_INPUTS_FETCHING(*g_async_cache, block);
         TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
         g_async_cache->Reset();
     }
    @@ -402,7 +402,7 @@ FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
         g_async_cache->SetBackend(backend_coins_view);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    +    WITH_BLOCK_INPUTS_FETCHING(*g_async_cache, block);
         TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         g_async_cache->Reset();
    diff --git a/src/test/fuzz/coinscache_sim.cpp b/src/test/fuzz/coinscache_sim.cpp
    index c635525a24..aa304aab80 100644
    --- a/src/test/fuzz/coinscache_sim.cpp
    +++ b/src/test/fuzz/coinscache_sim.cpp
    @@ -408,7 +408,7 @@ FUZZ_TARGET(coinscache_sim, .init = setup_coinscache_sim)
                             for (auto& async_cache : g_async_caches) {
                                 if (async_cache.use_count() > 1) continue;
                                 async_cache->SetBackend(*top_cache());
    -                            const auto fetch_control{async_cache->StartFetching(data.block)};
    +                            WITH_BLOCK_INPUTS_FETCHING(*async_cache, data.block);
                                 caches.emplace_back(async_cache);
                                 break;
                             }
    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..7b304b6f5b 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -3100,7 +3100,7 @@ bool Chainstate::ConnectTip(
                  Ticks<MillisecondsDouble>(time_2 - time_1));
         {
             auto& view{*m_coins_views->m_connect_block_view};
    -        const auto fetch_control{view.StartFetching(*block_to_connect)};
    +        WITH_BLOCK_INPUTS_FETCHING(view, *block_to_connect);
             bool rv = ConnectBlock(*block_to_connect, state, pindexNew, view);
             if (m_chainman.m_options.signals) {
                 m_chainman.m_options.signals->BlockChecked(block_to_connect, state);
    

    </details>


    Alternatively (I like this one a lot more), what if we wrapped the existing view itself (making FetchControl a proxy for the view, accessed neatly through the -> and * operators) to make it obvious why we need it in the scope but not outside it? This would indicate that there's state we don't want to touch, but that there's a start/stop layer that is still needed. It would also enable calling Reset automatically (which would already call StopFetching).

    <details> <summary>FetchControl proxy prototype</summary>

    diff --git a/src/bench/coinsviewcacheasync.cpp b/src/bench/coinsviewcacheasync.cpp
    index aa6c9c4cd7..9b66a0e953 100644
    --- a/src/bench/coinsviewcacheasync.cpp
    +++ b/src/bench/coinsviewcacheasync.cpp
    @@ -38,14 +38,13 @@ static void CoinsViewCacheAsyncBenchmark(benchmark::Bench& bench)
         CoinsViewCacheAsync async_cache{&coins_tip};
     
         bench.run([&] {
    -        const auto fetch_control{async_cache.StartFetching(block)};
    +        auto view{async_cache.StartFetching(block)};
             for (const auto& tx : block.vtx | std::views::drop(1)) {
                 for (const auto& in : tx->vin) {
    -                const auto have{async_cache.HaveCoin(in.prevout)};
    +                const auto have{view->HaveCoin(in.prevout)};
                     assert(have);
                 }
             }
    -        async_cache.Reset();
         });
     }
     
    diff --git a/src/coinsviewcacheasync.h b/src/coinsviewcacheasync.h
    index 779bb05633..dc3f51255b 100644
    --- a/src/coinsviewcacheasync.h
    +++ b/src/coinsviewcacheasync.h
    @@ -96,9 +96,11 @@ class CoinsViewCacheAsync : public CCoinsViewCache
     {
     public:
         /**
    -     * RAII-style controller that guarantees fetching is stopped when it goes out of scope.
    +     * RAII-style controller that guarantees fetching is stopped and the view is reset when it goes out of scope.
          * Returned by StartFetching() and bound to the lifetime of the block.
          * Non-copyable and non-movable to prevent scope escape.
    +     *
    +     * Provides access to the view through operator-> and operator*.
          */
         class FetchControl
         {
    @@ -113,10 +115,13 @@ public:
             FetchControl(FetchControl&&) = delete;
             FetchControl& operator=(FetchControl&&) = delete;
     
    -        ~FetchControl()
    -        {
    -            m_cache.StopFetching();
    -        }
    +        CoinsViewCacheAsync& operator*() noexcept { return m_cache; }
    +        const CoinsViewCacheAsync& operator*() const noexcept { return m_cache; }
    +
    +        CoinsViewCacheAsync* operator->() noexcept { return &m_cache; }
    +        const CoinsViewCacheAsync* operator->() const noexcept { return &m_cache; }
    +
    +        ~FetchControl() { m_cache.Reset(); }
         };
     
     private:
    @@ -228,7 +233,7 @@ private:
         }
     
     public:
    -    //! Start fetching all block inputs and return RAII guard that stops fetching on destruction.
    +    //! Start fetching all block inputs and return RAII guard that resets the view on destruction.
         [[nodiscard]] FetchControl StartFetching(const CBlock& block LIFETIMEBOUND) noexcept
         {
             StopFetching();
    diff --git a/src/test/coinsviewcacheasync_tests.cpp b/src/test/coinsviewcacheasync_tests.cpp
    index 8e0020de9c..39232112b7 100644
    --- a/src/test/coinsviewcacheasync_tests.cpp
    +++ b/src/test/coinsviewcacheasync_tests.cpp
    @@ -109,15 +109,14 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_db)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
             // Check that no coins have been moved up to main cache from db
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
                     BOOST_CHECK(!main_cache.HaveCoinInCache(in.prevout));
                 }
             }
    -        view.Reset();
         }
     }
     
    @@ -129,9 +128,8 @@ BOOST_AUTO_TEST_CASE(fetch_inputs_from_cache)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    -        view.Reset();
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
         }
     }
     
    @@ -147,16 +145,15 @@ BOOST_AUTO_TEST_CASE(fetch_no_double_spend)
         PopulateView(block, main_cache, /*spent=*/true);
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        auto async_view{view.StartFetching(block)};
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
    -                const auto& c{view.AccessCoin(in.prevout)};
    +                const auto& c{async_view->AccessCoin(in.prevout)};
                     BOOST_CHECK(c.IsSpent());
                 }
             }
             // Coins are not added to the view, even though they exist unspent in the parent db
    -        BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    -        view.Reset();
    +        BOOST_CHECK_EQUAL(async_view->GetCacheSize(), 0);
         }
     }
     
    @@ -167,15 +164,14 @@ BOOST_AUTO_TEST_CASE(fetch_no_inputs)
         CCoinsViewCache main_cache{&db};
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    +        auto async_view{view.StartFetching(block)};
             for (const auto& tx : block.vtx) {
                 for (const auto& in : tx->vin) {
    -                const auto& c{view.AccessCoin(in.prevout)};
    +                const auto& c{async_view->AccessCoin(in.prevout)};
                     BOOST_CHECK(c.IsSpent());
                 }
             }
    -        BOOST_CHECK_EQUAL(view.GetCacheSize(), 0);
    -        view.Reset();
    +        BOOST_CHECK_EQUAL(async_view->GetCacheSize(), 0);
         }
     }
     
    @@ -191,10 +187,9 @@ BOOST_AUTO_TEST_CASE(access_non_input_coin)
         main_cache.EmplaceCoinInternalDANGER(COutPoint{Txid::FromUint256(uint256::ZERO), 0}, std::move(coin));
         CoinsViewCacheAsync view{&main_cache};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        const auto& accessed_coin{view.AccessCoin(outpoint)};
    +        auto async_view{view.StartFetching(block)};
    +        const auto& accessed_coin{async_view->AccessCoin(outpoint)};
             BOOST_CHECK(!accessed_coin.IsSpent());
    -        view.Reset();
         }
     }
     
    @@ -207,9 +202,8 @@ BOOST_AUTO_TEST_CASE(fetch_main_thread)
         PopulateView(block, main_cache);
         CoinsViewCacheAsync view{&main_cache, /*deterministic=*/false, /*num_workers=*/0};
         for (auto i{0}; i < 3; ++i) {
    -        const auto fetch_control{view.StartFetching(block)};
    -        CheckCache(block, view);
    -        view.Reset();
    +        auto async_view{view.StartFetching(block)};
    +        CheckCache(block, *async_view);
         }
     }
     
    diff --git a/src/test/fuzz/coins_view.cpp b/src/test/fuzz/coins_view.cpp
    index 3f03ec90f0..f3df7b19a9 100644
    --- a/src/test/fuzz/coins_view.cpp
    +++ b/src/test/fuzz/coins_view.cpp
    @@ -384,9 +384,8 @@ FUZZ_TARGET(coins_view_async, .init = initialize_coins_view)
         CCoinsView backend_coins_view;
         g_async_cache->SetBackend(backend_coins_view);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    -    TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
    -    g_async_cache->Reset();
    +    auto async_view{g_async_cache->StartFetching(block)};
    +    TestCoinsView(fuzzed_data_provider, *async_view, backend_coins_view, /*is_db=*/false);
     }
     
     FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
    @@ -402,8 +401,7 @@ FUZZ_TARGET(coins_view_stacked, .init = initialize_coins_view)
         g_async_cache->SetBackend(backend_coins_view);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
         CBlock block{BuildRandomBlock(fuzzed_data_provider)};
    -    const auto fetch_control{g_async_cache->StartFetching(block)};
    -    TestCoinsView(fuzzed_data_provider, *g_async_cache, backend_coins_view, /*is_db=*/false);
    +    auto async_view{g_async_cache->StartFetching(block)};
    +    TestCoinsView(fuzzed_data_provider, *async_view, backend_coins_view, /*is_db=*/false);
         TestCoinsView(fuzzed_data_provider, backend_coins_view, db_coins_view, /*is_db=*/true);
    -    g_async_cache->Reset();
     }
    diff --git a/src/test/fuzz/coinscache_sim.cpp b/src/test/fuzz/coinscache_sim.cpp
    index c635525a24..4090697b8e 100644
    --- a/src/test/fuzz/coinscache_sim.cpp
    +++ b/src/test/fuzz/coinscache_sim.cpp
    @@ -407,8 +407,8 @@ FUZZ_TARGET(coinscache_sim, .init = setup_coinscache_sim)
                             // Find an unused async cache from the pool
                             for (auto& async_cache : g_async_caches) {
                                 if (async_cache.use_count() > 1) continue;
    +                            async_cache->Reset();
                                 async_cache->SetBackend(*top_cache());
    -                            const auto fetch_control{async_cache->StartFetching(data.block)};
                                 caches.emplace_back(async_cache);
                                 break;
                             }
    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..57108d05ed 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -3099,9 +3099,8 @@ bool Chainstate::ConnectTip(
         LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
                  Ticks<MillisecondsDouble>(time_2 - time_1));
         {
    -        auto& view{*m_coins_views->m_connect_block_view};
    -        const auto fetch_control{view.StartFetching(*block_to_connect)};
    -        bool rv = ConnectBlock(*block_to_connect, state, pindexNew, view);
    +        auto view{m_coins_views->m_connect_block_view->StartFetching(*block_to_connect)};
    +        bool rv = ConnectBlock(*block_to_connect, state, pindexNew, *view);
             if (m_chainman.m_options.signals) {
                 m_chainman.m_options.signals->BlockChecked(block_to_connect, state);
             }
    @@ -3109,7 +3108,6 @@ bool Chainstate::ConnectTip(
                 if (state.IsInvalid())
                     InvalidBlockFound(pindexNew, state);
                 LogError("%s: ConnectBlock %s failed, %s\n", __func__, pindexNew->GetBlockHash().ToString(), state.ToString());
    -            view.Reset();
                 return false;
             }
             time_3 = SteadyClock::now();
    @@ -3119,8 +3117,7 @@ bool Chainstate::ConnectTip(
                      Ticks<MillisecondsDouble>(time_3 - time_2),
                      Ticks<SecondsDouble>(m_chainman.time_connect_total),
                      Ticks<MillisecondsDouble>(m_chainman.time_connect_total) / m_chainman.num_blocks_total);
    -        view.Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
    -        view.Reset();
    +        view->Flush(/*will_reuse_cache=*/false); // No need to reallocate since it only has capacity for 1 block
         }
         const auto time_4{SteadyClock::now()};
         m_chainman.time_flush += time_4 - time_3;
    

    </details>


    andrewtoth commented at 8:45 PM on January 11, 2026:

    What is the benefit of the macro? I don't see a problem with using it as suggested, but not sure what it is giving us.

    Re proxy - we don't want to call Reset every time we go out of scope. Reset is only needed when we want to return the cache to its initial state. Consider VerifyDB, where we could use it to fetch 6 blocks in a row but we don't want to clear its state each time.

    diff --git a/src/validation.cpp b/src/validation.cpp
    index f1168729d2..bcd933fa78 100644
    --- a/src/validation.cpp
    +++ b/src/validation.cpp
    @@ -4697,7 +4697,7 @@ VerifyDBResult CVerifyDB::VerifyDB(
         }
         nCheckLevel = std::max(0, std::min(4, nCheckLevel));
         LogInfo("Verifying last %i blocks at level %i", nCheckDepth, nCheckLevel);
    -    CCoinsViewCache coins(&coinsview);
    +    CoinsViewCacheAsync coins(&coinsview);
         CBlockIndex* pindex;
         CBlockIndex* pindexFailure = nullptr;
         int nGoodTransactions = 0;
    @@ -4799,6 +4799,7 @@ VerifyDBResult CVerifyDB::VerifyDB(
                     LogError("Verification error: ReadBlock failed at %d, hash=%s", pindex->nHeight, pindex->GetBlockHash().ToString());
                     return VerifyDBResult::CORRUPTED_BLOCK_DB;
                 }
    +            const auto fetch_control{coins.StartFetching(block)};
                 if (!chainstate.ConnectBlock(block, state, pindex, coins)) {
                     LogError("Verification error: found unconnectable block at %d, hash=%s (%s)", pindex->nHeight, pindex->GetBlockHash().ToString(), state.ToString());
                     return VerifyDBResult::CORRUPTED_BLOCK_DB;
    

    What this gives us is a guarantee that we will stop fetching before exceeding the lifetime of the block.
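
    A minimal sketch of that lifetime guarantee (hypothetical Block/FetchGuard types, not the PR's classes): C++ destroys automatic objects in reverse order of construction, so a guard declared after the block is always destroyed first, stopping the workers while the block is still alive.

    ```cpp
    #include <cassert>
    #include <string>
    #include <vector>

    // Events recorded at destruction time, to observe the ordering.
    static std::vector<std::string> g_events;

    struct Block {
        ~Block() { g_events.push_back("block destroyed"); }
    };

    struct FetchGuard {
        ~FetchGuard() { g_events.push_back("fetching stopped"); }
    };

    // The guard is declared after the block, so on scope exit it is
    // destroyed first: fetching stops before the block goes away.
    void ConnectScope()
    {
        Block block;
        FetchGuard guard;
    }
    ```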


    l0rinc commented at 9:07 PM on January 11, 2026:

    Wouldn't we declare and start fetching at a higher level so that the 6 blocks are all in the same scope?


    andrewtoth commented at 9:14 PM on January 11, 2026:

    We start fetching as soon as we get the block, and the block is destroyed when we exit the scope. I'm not sure what you mean.


    andrewtoth commented at 3:35 PM on January 12, 2026:

    Looking closer at VerifyDB, I don't think this would be useful there. All blocks are disconnected in the same cache without flushing, so all utxos will already be in the cache and no lookups will occur. Maybe it makes sense to just Reset, and then we can get rid of those Reset calls everywhere... Will look into this, thanks!


    andrewtoth commented at 7:22 PM on January 14, 2026:

    Updated to use a controller that returns a handle that dereferences to the cache. When the handle is destroyed it resets the cache.
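
    Roughly, the handle works like this (a hedged sketch with a placeholder Cache type; the real FetchControl wraps CoinsViewCacheAsync): it forwards operator*/operator-> to the cache and calls Reset() when destroyed.

    ```cpp
    #include <cassert>

    // Hypothetical stand-in for the async cache; only Reset() matters here.
    struct Cache {
        int size{0};
        void Reset() { size = 0; }
    };

    // Handle returned by a StartFetching-style call: dereferences to the
    // cache, and resets it when the handle goes out of scope.
    class Handle {
        Cache& m_cache;
    public:
        explicit Handle(Cache& cache) : m_cache{cache} {}
        Handle(const Handle&) = delete;
        Handle& operator=(const Handle&) = delete;
        Cache& operator*() noexcept { return m_cache; }
        Cache* operator->() noexcept { return &m_cache; }
        ~Handle() { m_cache.Reset(); }
    };
    ```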

  474. in src/coinsviewcacheasync.h:234 in 396f784f8f
     229 | +
     230 | +public:
     231 | +    //! Start fetching all block inputs and return RAII guard that stops fetching on destruction.
     232 | +    [[nodiscard]] FetchControl StartFetching(const CBlock& block LIFETIMEBOUND) noexcept
     233 | +    {
     234 | +        StopFetching();
    


    l0rinc commented at 7:57 PM on January 11, 2026:

    Now that this is automatic, we could rather assert that it's stopped, since that would be a programming error, right?


    andrewtoth commented at 8:33 PM on January 11, 2026:

    It's not automatic. Consider:

    const auto fetch_control{view.StartFetching(block)};
    const auto fetch_control2{view.StartFetching(block)};
    

    l0rinc commented at 8:35 PM on January 11, 2026:

    yes, shouldn't that fail with an assertion error instead of the second spitting in the soup of the first?


    andrewtoth commented at 8:48 PM on January 11, 2026:

    I don't think so, it's a valid use of the API. The first fetching will be stopped and the second started. It's just inefficient, but perfectly safe.


    l0rinc commented at 9:08 PM on January 11, 2026:

    Can you come up with a valid use case? It seems like an error to me...


    andrewtoth commented at 7:23 PM on January 14, 2026:

    Changed back to an assertion.
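
    The shape of that check, sketched with an assumed minimal class (not the PR's actual code): starting a second fetch while one is active is treated as a programming error rather than silently stopping the first.

    ```cpp
    #include <cassert>

    class AsyncCache {
        bool m_fetching{false};
    public:
        void StartFetching()
        {
            assert(!m_fetching); // caller must stop/reset between fetches
            m_fetching = true;
        }
        void Reset() { m_fetching = false; }
        bool Fetching() const { return m_fetching; }
    };
    ```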

  475. in src/coinsviewcacheasync.h:86 in 396f784f8f
      81 | + *
      82 | + *       After workers start:
      83 | + *
      84 | + *                                        Worker 2            Worker 0  Worker 3  Worker 1  m_input_head
      85 | + *                                           │                   │         │         │         │
      86 | + *                                           ▼                   ▼         ▼         ▼         ▼
    


    l0rinc commented at 7:58 PM on January 11, 2026:

    Lol, cool ascii art - though it's off by one space :p

     *                                       Worker 2            Worker 0  Worker 3  Worker 1  m_input_head
     *                                          │                   │         │         │         │
     *                                          ▼                   ▼         ▼         ▼         ▼
    
  476. in src/test/fuzz/coinscache_sim.cpp:411 in 396f784f8f
     408 | +                    if (provider.ConsumeBool()) {
     409 | +                        // Find an unused async cache from the pool
     410 | +                        for (auto& async_cache : g_async_caches) {
     411 | +                            if (async_cache.use_count() > 1) continue;
     412 | +                            async_cache->SetBackend(*top_cache());
     413 | +                            const auto fetch_control{async_cache->StartFetching(data.block)};
    


    l0rinc commented at 8:19 PM on January 11, 2026:

    What is the purpose here of starting a fetch, adding it to a vector and resetting it immediately?


    andrewtoth commented at 8:36 PM on January 11, 2026:

    Not much, but it exercises the StartFetching/StopFetching paths. There's not much more we can do here since the fetching is now bound to the scope. Fuzzing the methods while we are still fetching happens in coins_view.cpp fuzz harness.


    andrewtoth commented at 7:23 PM on January 14, 2026:

    The latest version does not return a fetch control object here, so we can continue fetching in the background while exercising different methods.

  477. l0rinc changes_requested
  478. l0rinc commented at 8:26 PM on January 11, 2026: contributor

    I like the new cleanup changes and the ASCII art, I only had time and patience to quickly go over it, hope the comments are useful.

  479. andrewtoth force-pushed on Jan 14, 2026
  480. l0rinc commented at 11:05 AM on January 19, 2026: contributor

    It looks like this is run inside WSL (Linux) and compiled for Linux

    Took me longer than anticipated, but we finally have our first native GCC Windows measurement - after a few failed previous attempts using clang or older gcc versions.

    <img width="2148" height="1400" alt="image" src="https://github.com/user-attachments/assets/f699bc10-fd34-486c-80a6-582007f33288" />

    Results: 29% faster with dbcache=450, 14.5% faster with dbcache=4500 (932239 blocks, native .exe, MinGW GCC 15.2.0)

    <details> <summary>2026-01-17 | reindex-chainstate | 932239 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15</summary>

    for DBCACHE in 450 4500; do \
      COMMITS="ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb b3cb5bb90a41af4199dde17946e5aa9b3cd72db6"; \
      STOP=932239; \
      HOST=x86_64-w64-mingw32; \
      XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
      BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
      WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
      export PATH="$XPACK/bin:$PATH"; \
      mkdir -p "$LOG_DIR"; \
      (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
      (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
      hyperfine \
        --sort command \
        --runs 1 \
        --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
        --parameter-list COMMIT ${COMMITS// /,} \
        --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
          make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
          cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
          ninja -C build bitcoind -j\$(nproc) && \
          ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
        --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
          cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
        "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
    done
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-17 | reindex-chainstate | 932239 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        37260.585 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        28819.823 s               [User: 0.002 s, System: 0.000 s]
    
    Relative speed comparison
            1.29          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-18 | reindex-chainstate | 932239 blocks | dbcache 4500 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        29746.920 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        25974.137 s               [User: 0.002 s, System: 0.000 s]
    
    Relative speed comparison
            1.15          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    

    </details>

    <details> <summary>Earlier attempts</summary>

    Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 10
    Minutes           : 33
    Seconds           : 40
    Milliseconds      : 162
    Ticks             : 380201620408
    TotalDays         : 0.440048171768519
    TotalHours        : 10.5611561224444
    TotalMinutes      : 633.669367346667
    TotalSeconds      : 38020.1620408
    TotalMilliseconds : 38020162.0408
    

    and

     Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 11
    Minutes           : 11
    Seconds           : 40
    Milliseconds      : 78
    Ticks             : 403000786529
    TotalDays         : 0.466436095519676
    TotalHours        : 11.1944662924722
    TotalMinutes      : 671.667977548333
    TotalSeconds      : 40300.0786529
    TotalMilliseconds : 40300078.6529
    

    and

    win@WIN-A2EHOAU4JET:/mnt/my_storage/bitcoin$ git log -1
    commit 7f295e1d9b44c225c823242c1f04239f46fb27a6 (HEAD, l0rinc/master, master)
    Merge: 5e7931af35 fa4cb13b52
    Author: merge-script <fanquake@gmail.com>
    Date:   Fri Dec 19 16:56:02 2025 +0000
    
    Measure-Command { C:\my_storage\bitcoin-win64\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927729 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 9
    Minutes           : 48
    Seconds           : 15
    Milliseconds      : 866
    Ticks             : 352958669523
    TotalDays         : 0.408516978614583
    TotalHours        : 9.80440748675
    TotalMinutes      : 588.264449205
    TotalSeconds      : 35295.8669523
    TotalMilliseconds : 35295866.9523
    

    and this is v30 with official release:

    Measure-Command { C:\my_storage\bitcoin_bins\bitcoin-30.0\bin\bitcoind.exe -datadir=C:\my_storage\BitcoinData -stopatheight=927719 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 }
    
    Days              : 0
    Hours             : 9
    Minutes           : 54
    Seconds           : 34
    Milliseconds      : 482
    Ticks             : 356744820316
    TotalDays         : 0.412899097587963
    TotalHours        : 9.90957834211111
    TotalMinutes      : 594.574700526667
    TotalSeconds      : 35674.4820316
    TotalMilliseconds : 35674482.0316
    

    </details>


    re-checked pruned IBD - this still seems to be bandwidth bound, so the difference is more modest:

    <details> <summary>18% faster - 2026-01-18 | pruned IBD | 932239 blocks | dbcache 450 | pruning 550 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD </summary>

    COMMITS="22bde74d1d8f861323eabb8dc60401bbf1226544 13d32ed39cf869eb64faf8f489c53f38806a6c29"; \
    STOP=932239; DBCACHE=450; PRUNE=550; \
    CC=gcc; CXX=g++; \
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/ShallowBitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "$(date -I) | pruned IBD | ${STOP} blocks | dbcache ${DBCACHE} | pruning ${PRUNE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") && \
    hyperfine \
    --sort command \
    --runs 2 \
    --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall -9 bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git clean -fxd; git reset --hard {COMMIT} && \
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=Release && ninja -C build bitcoind -j2 && \
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -prune=$PRUNE -printtoconsole=0; sleep 20" \
    --conclude "killall bitcoind || true; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'Disabling script verification at block #1' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
    cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -prune=$PRUNE -printtoconsole=0"
     22bde74d1d Merge bitcoin-core/gui#924: Show an error message if the restored wallet name is empty
     13d32ed39c validation: fetch inputs on parallel threads

     2026-01-18 | pruned IBD | 932239 blocks | dbcache 450 | pruning 550 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD

     Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/ShallowBitcoinData -stopatheight=932239 -dbcache=450 -blocksonly -prune=550 -printtoconsole=0 (COMMIT = 22bde74d1d8f861323eabb8dc60401bbf1226544)
       Time (mean ± σ):     33025.250 s ±  368.595 s    [User: 73666.506 s, System: 4901.220 s]
       Range (min … max):   32764.613 s … 33285.886 s    2 runs

     Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/ShallowBitcoinData -stopatheight=932239 -dbcache=450 -blocksonly -prune=550 -printtoconsole=0 (COMMIT = 13d32ed39cf869eb64faf8f489c53f38806a6c29)
       Time (mean ± σ):     27953.899 s ±  205.665 s    [User: 72179.265 s, System: 4704.044 s]
       Range (min … max):   27808.472 s … 28099.327 s    2 runs

    </details>

    <details> <summary>same measurements with `reindex-chainstate` for dbcache of 3 and 12 GB</summary>

     for DBCACHE in 3000 12000; do \
       COMMITS="ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb b3cb5bb90a41af4199dde17946e5aa9b3cd72db6"; \
       STOP=932239; \
       HOST=x86_64-w64-mingw32; \
       XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
       BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
       WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
       export PATH="$XPACK/bin:$PATH"; \
       mkdir -p "$LOG_DIR"; \
       (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
       (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
       hyperfine \
         --sort command \
         --runs 1 \
         --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
         --parameter-list COMMIT ${COMMITS// /,} \
         --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
           make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
           cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
           ninja -C build bitcoind -j\$(nproc) && \
           ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
         --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
           cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
         "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
     done
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-19 | reindex-chainstate | 932239 blocks | dbcache 3000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        30456.010 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        26194.317 s               [User: 0.006 s, System: 0.000 s]
    
    Relative speed comparison
            1.16          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    
    ab233255d4 Merge bitcoin/bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
    b3cb5bb90a validation: fetch inputs on parallel threads
    
    2026-01-19 | reindex-chainstate | 932239 blocks | dbcache 12000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
      Time (abs ≡):        29192.227 s               [User: 0.002 s, System: 0.000 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
      Time (abs ≡):        26041.974 s               [User: 0.003 s, System: 0.000 s]
    
    Relative speed comparison
            1.12          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ab233255d444ccf6ffe4a45cb02bfc3e5fb71bdb)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b3cb5bb90a41af4199dde17946e5aa9b3cd72db6)
    

    </details>


    <details> <summary>WORKER_THREADS{8} is slower</summary>

     for DBCACHE in 3000 12000; do   COMMITS="363e525d8da3c6c495191cb92d8eaf5dbeaeddf5";   STOP=932239;   HOST=x86_64-w64-mingw32;   XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2";   BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs";   WIN_DATA_DIR='C:\\my_storage\\BitcoinData';   export PATH="$XPACK/bin:$PATH";   mkdir -p "$LOG_DIR";   (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) &&   (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") &&   hyperfine     --sort command     --runs 1     --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json"     --parameter-list COMMIT ${COMMITS// /,}     --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
          make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
          cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
          ninja -C build bitcoind -j\$(nproc) && \
          ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log"     --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
          cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log"     "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; done
    
    363e525d8d WORKER_THREADS{8}
    
    2026-01-21 | reindex-chainstate | 932239 blocks | dbcache 3000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=3000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 363e525d8da3c6c495191cb92d8eaf5dbeaeddf5)
      Time (abs ≡):        26386.395 s               [User: 0.000 s, System: 0.001 s]
    
    363e525d8d WORKER_THREADS{8}
    
    2026-01-21 | reindex-chainstate | 932239 blocks | dbcache 12000 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15
    
    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=932239 -dbcache=12000 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 363e525d8da3c6c495191cb92d8eaf5dbeaeddf5)
      Time (abs ≡):        26401.538 s               [User: 0.002 s, System: 0.000 s]
    

    </details>

  481. fanquake commented at 4:51 PM on January 22, 2026: member

    Note that this needs a rebase:

    /root/ci_scratch/src/bench/coinsviewcacheasync.cpp:51:71: error: macro ‘BENCHMARK’ passed 2 arguments, but takes just 1
       51 | BENCHMARK(CoinsViewCacheAsyncBenchmark, benchmark::PriorityLevel::HIGH);
          |                                                                       ^
    In file included from /root/ci_scratch/src/bench/coinsviewcacheasync.cpp:5:
    /root/ci_scratch/src/bench/bench.h:68:9: note: macro ‘BENCHMARK’ defined here
       68 | #define BENCHMARK(n) \
          |         ^~~~~~~~~
    
  482. DrahtBot added the label CI failed on Jan 22, 2026
  483. DrahtBot commented at 5:26 PM on January 22, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/21006803006/job/61174454902</sub> <sub>LLM reason (✨ experimental): Compilation failed due to BENCHMARK macro usage: it is invoked with two arguments, but the macro defined takes only one, causing a build error.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  484. andrewtoth force-pushed on Jan 22, 2026
  485. andrewtoth force-pushed on Jan 22, 2026
  486. andrewtoth force-pushed on Jan 22, 2026
  487. DrahtBot removed the label CI failed on Jan 23, 2026
  488. willcl-ark commented at 2:10 PM on January 27, 2026: member

    Benchcoin Full IBD Results (to block 930,000)

    Benchmark run: https://github.com/bitcoin-dev-tools/benchcoin/pull/178

    | dbcache | master (2778eb4) | PR (fc72fca) | Δ |
    | --- | --- | --- | --- |
    | 450 MB | 323 min | 260 min | -19.5% |
    | 32000 MB | 266 min | 246 min | -7.5% |

    Configuration:

    • -prune=200GB
    • AMD Ryzen 7 7700 8-Core, 64GB RAM, NVMe SSD
    • 1Gbit network to dedicated seed node

    PR commits: 95ee2d60c217aa2ccf37ed1e5951ea91fdf403d9^..fc72fca292d995de07d98f12dfc4164478826b1f

    Seems like we do indeed get a nice speedup, especially with the default dbcache :)

  489. achow101 referenced this in commit 6750744eb3 on Jan 29, 2026
  490. DrahtBot added the label Needs rebase on Jan 30, 2026
  491. andrewtoth force-pushed on Jan 30, 2026
  492. andrewtoth force-pushed on Jan 30, 2026
  493. DrahtBot added the label CI failed on Jan 30, 2026
  494. DrahtBot commented at 1:54 AM on January 30, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task MSan: https://github.com/bitcoin/bitcoin/actions/runs/21501458749/job/61948501131</sub> <sub>LLM reason (✨ experimental): Compilation failed in src/bench/coinsviewcacheasync.cpp due to undeclared identifier CoinsViewCacheAsyncController, causing the build to abort.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  495. andrewtoth force-pushed on Jan 30, 2026
  496. DrahtBot removed the label Needs rebase on Jan 30, 2026
  497. DrahtBot removed the label CI failed on Jan 30, 2026
  498. in src/coinsviewcacheasync.h:27 in 77c0df7b59
      22 | +#include <ranges>
      23 | +#include <thread>
      24 | +#include <utility>
      25 | +#include <vector>
      26 | +
      27 | +static constexpr int32_t WORKER_THREADS{4};
    


    HowHsu commented at 8:34 AM on February 10, 2026:

    Hi @andrewtoth

    Have you tried a WORKER_THREADS value bigger than this? I have this question because the workloads are not CPU intensive but IO intensive.


    sedited commented at 8:38 AM on February 10, 2026:

    This was extensively benchmarked, see l0rinc's comments #31132 (comment) and #31132 (comment) and the discussions further up in this PR.


    andrewtoth commented at 3:14 PM on February 10, 2026:

    Also see #31132 (comment), where higher thread count indeed correlates with a big speed increase. A system with high IO latency coupled with high IO bandwidth will see the most benefit from this PR in general, and the most benefit from increasing the thread count.

    We decided to keep it simple for now and use a static 4 threads. More threads will translate to higher memory usage of course. I think we can investigate making this configurable in a follow-up if there is interest.

  499. sedited commented at 2:16 PM on February 10, 2026: contributor

    When reindexing on my current system this PR consistently does not perform faster, and I'm getting the impression that it might actually be slower. This is on a system with 32 virtual cores, heaps of RAM, and a fast NVMe drive. I'd also be curious in general what the performance looks like with maxed out dbcache.

  500. l0rinc commented at 3:12 PM on February 10, 2026: contributor

    I'd also be curious in general what the performance looks like with maxed out dbcache.

    With a big enough dbcache we won't really have any disk activity since all the inputs are still in memory, see: #31132 (comment) The parallelization still results in some modest speedups in most cases because of the parallel temporary cache filling and SipHash calculation and map interactions, but that's not the main goal of the change. There are also some fixed costs that we wouldn't need to do if we knew that everything is in memory so some minor regression is expected for max-memory dbcache. Also note that adding more and more memory isn't necessarily faster after a while, e.g. 30 GB dbcache isn't usually faster than 5 GB - sometimes it's even slower, most likely because a larger hashmap spreads entries across more memory, causing more cache misses on the random UTXO lookups.


    Can you please share your measurements so that we can try to reproduce them? This is mainly meant for default or low dbcache since the in-memory cache size matters a lot less after the cache warming.

  501. fanquake commented at 3:18 PM on February 10, 2026: member

    I also haven't seen any speedup running a "real world" sync with this branch, i.e. Guix build the branch, and then run from scratch on a reasonable (16 core, 32GB) machine. IBD time seems the same as master.

  502. andrewtoth commented at 3:23 PM on February 10, 2026: contributor

    When reindexing on my current system this PR consistently does not perform faster, and I'm getting the impression that it might actually be slower.

    IBD time seems the same as master.

    +1 on sharing the commands you are running and the times you are getting.

    I did IBD as well and saw consistently better performance on a machine with locally connected NVMe drive, 16 vcores 32GB RAM #31132 (comment).

    Cores and RAM should not really be a factor with this change. The main bottleneck is higher IO latency. So, for a directly connected NVMe drive you should not see as big of an increase compared to network connected storage. I think most users of this software run it in a cloud environment.

  503. l0rinc commented at 3:28 PM on February 10, 2026: contributor

    IBD time seems the same as master.

    I also noticed that doing actual IBD compared to just a -reindex-chainstate often shows a less dramatic speedup since validation wasn't the main bottleneck in the first place (likely bandwidth was). With the average (100Mbps) global internet speed just downloading the blockchain would take 16 hours.

  504. sedited commented at 4:44 PM on February 10, 2026: contributor

    I re-ran three interleaved runs of ./bitcoind -signet -stopatheight=290000 -reindex-chainstate. This PR averages 8:40, master 7:50.

    Edit: Re-running on mainnet too, but will obviously take a while:

    On an AMD Ryzen 9 9950X3D 16-Core Processor, NVMe drive, and heaps of RAM

    Baseline:

    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate: 3:11:48.66 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=10000: 2:30:11.60 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=30000: 2:10:11.60 total

    This PR:

    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate: 2:09:11.83 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=10000: 2:05:03.33 total
    • ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate -dbcache=30000: 2:03:49.21 total

    Also checked again what happens when more workers (16) are added and the gains are indeed marginal: ./build_dev_mode_clang/bin/bitcoind -nowallet -reindex-chainstate 2:07:13.48 total

    So I guess the slowdown I perceived before is just higher dbcache mattering less over time.

  505. l0rinc commented at 4:54 PM on February 10, 2026: contributor

    Thanks, let me retry the latest push (I never tested signet though)

  506. andrewtoth commented at 4:57 PM on February 10, 2026: contributor

    @sedited thanks! I don't think this PR will perform better than master on signet. Blocks on signet seem to have <100 txs with mostly single inputs. The overhead of collecting the inputs and then releasing threads to start fetching them is likely not recouped when fetching so few inputs.

    Also, the size of the chainstate leveldb is very small compared to mainnet. Fetching inputs in series really starts to degrade around block 800,000 when the utxo set is much larger on mainnet. Roughly 90% of the sync time on network connected storage was for 800k to tip.

  507. l0rinc commented at 2:24 PM on February 11, 2026: contributor

    Retried validation of the latest version on Windows; the 30% speedup with default dbcache still reproduces

    <details> <summary>2026-02-10 | reindex-chainstate | 933339 blocks | dbcache 450 | WIN-A2EHOAU4JET | x86_64 | Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz | 8 cores | 31Gi RAM | win64-gcc15</summary>

    > for DBCACHE in 450; do \
    >   COMMITS="5401e673d56198f2c0bad366581e70d5d9cd765c 77c0df7b59ff5a3a77d37e77145f1a157e05db19"; \
    >   STOP=933339; \
    >   HOST=x86_64-w64-mingw32; \
    " && >   XPACK="/home/win/xpack-mingw-w64-gcc-15.2.0-2"; \
    (date -I) |>   BASE_DIR="/mnt/c/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
    >   WIN_DATA_DIR='C:\\my_storage\\BitcoinData'; \
    | win64>   export PATH="$XPACK/bin:$PATH"; \
    >   mkdir -p "$LOG_DIR"; \
    >   (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    >   (echo "" && echo "$(date -I) | reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | win64-gcc15"; echo "") && \
    >   hyperfine \
    >     --sort command \
    >     --runs 1 \
    >     --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-win64.json" \
    >     --parameter-list COMMIT ${COMMITS// /,} \
    >     --prepare "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; rm -f ./build/bin/bitcoind.exe; rm -f $DATA_DIR/debug.log; git clean -fxd -e depends/built -e depends/sources -e depends/$HOST; git reset --hard {COMMIT} && \
    >       make -C depends HOST=$HOST NO_QT=1 NO_ZMQ=1 CC=\"$XPACK/bin/x86_64-w64-mingw32-gcc\" CXX=\"$XPACK/bin/x86_64-w64-mingw32-g++\" -j\$(nproc) && \
    >       cmake -B build -G Ninja --toolchain depends/$HOST/toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF && \
    >       ninja -C build bitcoind -j\$(nproc) && \
    >       ./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20; rm -f $DATA_DIR/debug.log" \
    >     --conclude "taskkill.exe /IM bitcoind.exe /F 2>/dev/null; sleep 5; grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log && grep 'Bitcoin Core version' $DATA_DIR/debug.log | grep -q "$(printf %.12s {COMMIT})"; \
    >       cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-\$(date +%s).log" \
    >     "./build/bin/bitcoind.exe -datadir=\"$WIN_DATA_DIR\" -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"; \
    > done
    

    </details>

    5401e673d5 Merge bitcoin/bitcoin#33604: p2p: Allow block downloads from peers without snapshot block after assumeutxo validation
    77c0df7b59 validation: fetch inputs on parallel threads

    Benchmark 1: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 5401e673d56198f2c0bad366581e70d5d9cd765c)
      Time (abs ≡):        37691.648 s               [User: 0.000 s, System: 0.001 s]
    
    Benchmark 2: ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 77c0df7b59ff5a3a77d37e77145f1a157e05db19)
      Time (abs ≡):        28752.722 s               [User: 0.003 s, System: 0.000 s]
    
    Relative speed comparison
            1.31          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 5401e673d56198f2c0bad366581e70d5d9cd765c)
            1.00          ./build/bin/bitcoind.exe -datadir="C:\\my_storage\\BitcoinData" -stopatheight=933339 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 77c0df7b59ff5a3a77d37e77145f1a157e05db19)
    
  508. sedited referenced this in commit 4a05825a3f on Feb 11, 2026
  509. rustaceanrob commented at 2:10 PM on February 17, 2026: contributor

    I tested this with a simple time-based script. Note that real is the wall-clock time of the reindex.

    <details> <summary>A/B test on Linux systems</summary>

    #!/usr/bin/env bash
    set -euo pipefail
    
    SRC_DIR="${SRC_DIR:-$HOME/bitcoin}"
    COMMIT_A="${COMMIT_A:-5401e673d56198f2c0bad366581e70d5d9cd765c}"
    COMMIT_B="${COMMIT_B:-77c0df7b59ff5a3a77d37e77145f1a157e05db19}"
    STOP="${STOP:-930000}"
    DBCACHE="${DBCACHE:-450}"
    DATA_DIR="${DATA_DIR:-$HOME/.bitcoin}"
    JOBS="${JOBS:-$(nproc)}"
    
    git reset --hard $COMMIT_A
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF
    ninja -C build bitcoind -j $JOBS
    time ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -daemon=0
    
    git reset --hard $COMMIT_B
    cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_GUI=OFF -DWITH_ZMQ=OFF -DBUILD_TESTS=OFF -DBUILD_BENCH=OFF
    ninja -C build bitcoind -j $JOBS
    time ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -daemon=0
    

    </details>

    Results on my first machine:

    <details> <summary>Machine specifications</summary>

    $ lscpu
    Architecture:             x86_64
      CPU op-mode(s):         32-bit, 64-bit
      ...
    CPU(s):                   16
      ...
      Model name:             AMD Ryzen 7 7700 8-Core Processor
    

    </details>

    Before

    real	223m59.233s
    user	435m25.461s
    sys	52m39.860s
    

    After

    real	144m43.344s
    user	429m27.149s
    sys	38m21.716s
    

    Results on my second machine:

    <details> <summary>Machine specifications</summary>

    $ lscpu
    Architecture:             x86_64
      CPU op-mode(s):         32-bit, 64-bit
      ...
    CPU(s):                   16
      ...
    Vendor ID:                GenuineIntel
      Model name:             13th Gen Intel(R) Core(TM) i5-1340P
    

    </details>

    Before

    real	316m47.537s
    user	770m3.013s
    sys	39m59.231s
    

    After

    real	236m34.064s
    user	804m33.719s
    sys	36m46.623s
    
  510. in src/coinsviewcacheasync.h:61 in bc9d1e7ee4
      56 | +     * collision of an input being spent having the same first 8 bytes as a txid of a tx elsewhere in the block,
      57 | +     * the input will not be fetched in the background. The input will still be fetched later on the main thread.
      58 | +     * Using a sorted vector and binary search lookups is a performance improvement. It is faster than
      59 | +     * using std::unordered_set with salted hash or std::set.
      60 | +     */
      61 | +    std::vector<uint64_t> m_txids{};
    


    sipa commented at 9:18 PM on February 18, 2026:

    Could these be salted hashes instead of the first 8 bytes of txids? I'm slightly concerned this could enable deliberate performance degradation using a $2^{32}$ collision search on txids.


    l0rinc commented at 9:26 PM on February 18, 2026:

    What if we randomly add a shift x instead of always using the first few bytes - we'd take [x, x+8] instead?

    Edit: rehashing these kinda' sounds like it would defeat the purpose. What if we did 8 random bytes instead (e.g. shuffle 0..31 and load the first 8 indices)


    andrewtoth commented at 1:38 PM on February 19, 2026:

    Here are the measurements that led us to use a sorted vector + binary search #31132 (review). Note the graph on the left is more important since that's done on the main thread. The right side is mostly done on worker threads so is not as important. Obviously the sorted vector of short txids is the best option. @l0rinc it sounds like you're trying to reinvent siphash.

    What if we removed the sorted vector for now and used a salted unordered_set? That would probably make it easier to review, since we don't have to think about collisions at all. We could introduce a performance improvement for this in a follow-up.
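
    For reference, the filter shape being compared here can be sketched roughly as follows. This is a hypothetical illustration (the class and method names are not the PR's API): collect the 64-bit quick hashes of all txids created in the block, sort once, and answer membership queries with binary search.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch of a sorted-vector membership filter; names are
// hypothetical, not taken from the PR.
class TxidFilter
{
    std::vector<uint64_t> m_txids;

public:
    explicit TxidFilter(std::vector<uint64_t> txids) : m_txids(std::move(txids))
    {
        std::sort(m_txids.begin(), m_txids.end());
    }

    // A false positive only means the input is skipped by the background fetch
    // and fetched later on the main thread, so correctness is unaffected.
    bool Contains(uint64_t quick_hash) const
    {
        return std::binary_search(m_txids.begin(), m_txids.end(), quick_hash);
    }
};
```

    Sorting once up front costs O(n log n), after which each lookup is O(log n) over a contiguous array, which is friendlier to the CPU cache than chasing unordered_set buckets.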


    sipa commented at 1:49 PM on February 19, 2026:

    I think we can use something much weaker than a hash here, if we assume (1) the inputs are cryptographic hashes already (txids are) and (2) the attacker does not get to observe our secret salt in any way (not even through timing leaks - which may be the case here because by the time they observe it, they've already succeeded).

    class QuickHashHasher
    {
        uint64_t m_key[4];
    
    public:
        QuickHashHasher() noexcept
        {
            FastRandomContext rng;
            for (int i = 0; i < 4; ++i) m_key[i] = rng.rand64();
        }
    
        uint64_t operator()(const uint256& hash_input) noexcept
        {
            return (hash_input.GetUint64(0) ^ m_key[0]) +
                   (hash_input.GetUint64(1) ^ m_key[1]) +
                   (hash_input.GetUint64(2) ^ m_key[2]) +
                   (hash_input.GetUint64(3) ^ m_key[3]);
        }
    };
    

    So my suggestion would be to use the current approach, but instead of ReadLE64(txid.begin()), use QuickHashHasher m_hasher; ... m_hasher(txid) .... Would that be acceptably fast? If so, I think I can write up a better formal argument why this is sufficient.


    andrewtoth commented at 4:07 AM on February 20, 2026:

    Adding a benchmark to @l0rinc's code at #31132 (review), I added the above quick hash and used it on each txid before adding to the vector. It did not slow down the vector creation + sorting and even showed a slight speedup. Lookups were essentially the same speed as well.


    andrewtoth commented at 1:57 AM on February 23, 2026:

    Thanks @sipa, I added this and made you a co-author. I did XOR accumulation instead of addition since it was triggering the overflow UB in CI.


    sipa commented at 2:05 AM on February 23, 2026:

    @andrewtoth Sadly, that doesn't work, because now the salt has no impact on which pairs form collisions, so the attacker can find those in 2^32 work again. To see why, let t[4] be the txid and s[4] be the salts, then you're computing (t[0] ^ s[0]) ^ (t[1] ^ s[1]) ^ (t[2] ^ s[2]) ^ (t[3] ^ s[3]), which can be rearranged as (t[0] ^ t[1] ^ t[2] ^ t[3]) ^ (s[0] ^ s[1] ^ s[2] ^ s[3]), so collisions just depend on the xoring of the txid qwords.

    Adding a ubsan suppression should suffice; there is nothing actually UB about uint64_t overflow, it's just our sanitizer that "helpfully" warns about some perfectly legal but suspicious things.
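
    The cancellation argument can be checked mechanically. A minimal sketch, using simplified stand-ins for the real hasher (not the PR's code):

```cpp
#include <array>
#include <cstdint>

using Words = std::array<uint64_t, 4>; // the four qwords of a txid

// XOR accumulation: the salt words cancel out of any collision condition.
uint64_t XorFold(const Words& t, const Words& s)
{
    return (t[0] ^ s[0]) ^ (t[1] ^ s[1]) ^ (t[2] ^ s[2]) ^ (t[3] ^ s[3]);
}

// Addition keeps the salt mixed in; unsigned wrap-around is well defined.
uint64_t AddFold(const Words& t, const Words& s)
{
    return (t[0] ^ s[0]) + (t[1] ^ s[1]) + (t[2] ^ s[2]) + (t[3] ^ s[3]);
}
```

    Two txids whose qwords XOR to the same value (e.g. the same qwords in a different order) collide under XorFold for every possible salt, while under AddFold the set of colliding pairs depends on the secret salt.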


    andrewtoth commented at 2:32 AM on February 23, 2026:

    Aha thanks yes I get it! I reverted and added a ubsan suppression instead.


    andrewtoth commented at 11:33 PM on April 11, 2026:

    @sipa @l0rinc For the same reason an attacker can't create quick hash collisions, they also can't create bucket collisions in an unordered_set. This lets us store the uint64_t quick hash directly in an unordered_set<uint64_t> rather than storing the full Txid with a salted hash. Benchmarks with this method show roughly the same construction time but much faster lookups.

  511. ryanofsky referenced this in commit ee2065fdea on Feb 20, 2026
  512. DrahtBot added the label Needs rebase on Feb 20, 2026
  513. andrewtoth force-pushed on Feb 23, 2026
  514. andrewtoth force-pushed on Feb 23, 2026
  515. DrahtBot added the label CI failed on Feb 23, 2026
  516. DrahtBot commented at 1:01 AM on February 23, 2026: contributor

    <!--85328a0da195eb286784d51f73fa0af9-->

    🚧 At least one of the CI tasks failed. <sub>Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/22288974266/job/64472680743</sub> <sub>LLM reason (✨ experimental): Linker error: undefined reference to util::TraceThread prevents bitcoin-chainstate from linking.</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  517. andrewtoth force-pushed on Feb 23, 2026
  518. andrewtoth force-pushed on Feb 23, 2026
  519. andrewtoth commented at 1:55 AM on February 23, 2026: contributor

    Rebased due to #34165 and the ThreadPool in #33689.

    The CoinsViewOverlay is now used for parallel input fetching. It also takes a shared_ptr<ThreadPool> instead of managing threads manually.

    Now instead of managing threads via a std::barrier, we just spawn tasks each time we start a block and wait for the futures to complete.

    This also lets us pass in a global thread pool for tests and fuzzing, so we can recreate CoinsViewOverlays quickly without having to spawn and teardown the threads each iteration.
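
    The lifecycle described above looks roughly like this sketch, with std::async standing in for the shared ThreadPool since its exact interface isn't shown in this comment:

```cpp
#include <future>
#include <vector>

// Hypothetical shape of the per-block task lifecycle: spawn one fetch task per
// worker when a block starts, and drain all futures before any mutating
// operation (Flush/Sync/SetBackend/Reset) can touch the cache.
struct FetchTasks {
    std::vector<std::future<void>> m_futures;

    void StartFetching(int workers)
    {
        for (int i = 0; i < workers; ++i) {
            m_futures.push_back(std::async(std::launch::async, [] {
                // Workers would claim inputs via an atomic counter and
                // fetch their coins from the backing view here.
            }));
        }
    }

    void StopFetching()
    {
        for (auto& f : m_futures) f.wait();
        m_futures.clear();
    }
};
```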

    A variation of the QuickHashHasher suggested in #31132 (review) is now used for the txid filter.

    The benchmark was dropped for this PR. There is already a lot to review as it is, and #34320 contains basically the same benchmark. It just needs to add a StartFetching once this is merged.

  520. DrahtBot removed the label Needs rebase on Feb 23, 2026
  521. andrewtoth force-pushed on Feb 23, 2026
  522. andrewtoth force-pushed on Feb 24, 2026
  523. andrewtoth force-pushed on Feb 24, 2026
  524. DrahtBot removed the label CI failed on Feb 24, 2026
  525. andrewtoth force-pushed on Feb 24, 2026
  526. DrahtBot added the label Needs rebase on Feb 26, 2026
  527. sedited commented at 10:21 AM on March 7, 2026: contributor

    Benched this some more:

    The hetzner box I used is x86_64 and has 8 vCPU cores. The node is also configured to prune (thought that might be interesting because of the slightly increased IO) and uses default dbcache.

    The RockPro64 (which I think is the platform used by nodl and the ronin dojo) is from the original 2018 series, has an NVMe SSD, and active cooling. The node running on it is configured with 1GB of dbcache. The Bitseed is the Qotom Q190N-S01 with 4GB of RAM and no active cooling. Its original HDD died years ago; it now has a SATA SSD. The node is running with default dbcache.

    The two low-power nodes run at home, and the connection I have is not that stable, so I did not want to do repeated IBD runs. I did one IBD run from a local connection on the RockPro64, which clocked in just shy of 35 hours.

    All nodes had -stopatheight=930000.

    | benchmark | base | branch |
    | --- | --- | --- |
    | IBD on hetzner | 13h 32m | 9h 57m |
    | reindex-chainstate RockPro64 | 67h 47m | 32h 24m |
    | reindex-chainstate Bitseed | 108h 48m | 60h 59m |
  528. andrewtoth force-pushed on Mar 8, 2026
  529. DrahtBot removed the label Needs rebase on Mar 8, 2026
  530. andrewtoth commented at 12:26 AM on March 9, 2026: contributor

    Rebased due to #34562. Also split the changes into more atomic commits that should be easier to review.

  531. sedited referenced this in commit 524aa1e533 on Mar 11, 2026
  532. DrahtBot added the label Needs rebase on Mar 11, 2026
  533. andrewtoth force-pushed on Mar 11, 2026
  534. DrahtBot removed the label Needs rebase on Mar 11, 2026
  535. andrewtoth commented at 2:35 PM on March 12, 2026: contributor

    Rebased due to #34576. All split out PRs have been merged. The diff seems a lot more manageable now.

    Thank you everyone for your benchmarks.

    I think this is ready for more review.

  536. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads 3x faster IBD
    validation: fetch block inputs on parallel threads
    on Mar 12, 2026
  537. murchandamus commented at 5:40 PM on March 12, 2026: member

    Concept ACK

    I’ve been loosely following this PR in the context of doing outreach. I have read many complaints about IBD being slow, especially for microcomputers and node in the box setups, which often come with external drives. The preliminary results described in comments on this PR sound promising.

    I don’t think I have valuable code review to add here, but conceptually this seems worthwhile, especially because we can anticipate that a lot of users will be switching out hard drives soon, if they are running a node with a full copy of the blockchain and had a 1 TB drive.

  538. in src/coins.h:744 in 551050628c outdated
     744 |  
     745 | +    std::shared_ptr<ThreadPool> m_thread_pool;
     746 | +    std::vector<std::future<void>> m_futures{};
     747 | +
     748 | +protected:
     749 | +    void Reset() noexcept override {
    


    hodlinator commented at 7:52 PM on March 12, 2026:

    nit:

        void Reset() noexcept override
        {
    
  539. in src/coins.h:1 in 612f420811 outdated


    hodlinator commented at 7:57 PM on March 12, 2026:

    612f420 validation: collect block inputs in CoinsViewOverlay before ConnectBlock:

    I wonder if we could instead have the first 2 commits squashed together and also add reading from m_inputs so that we have a working but non-parallelized implementation?

    Edit: My own attempt at this: https://github.com/hodlinator/bitcoin/tree/pr/31132_suggestions


    andrewtoth commented at 5:41 PM on March 13, 2026:

    Will do this :+1:.


    andrewtoth commented at 4:58 PM on March 22, 2026:

    Done, added you as a co-author.

  540. in src/coins.h:693 in 551050628c outdated
     693 | +
     694 | +        if (auto coin{base->PeekCoin(input.outpoint)}) [[likely]] input.coin.emplace(std::move(*coin));
     695 | +        // We need release here, so writing coin in the line above happens before the main thread acquires.
     696 | +        input.ready.test_and_set(std::memory_order_release);
     697 | +        input.ready.notify_one();
     698 | +        return true;
    


    hodlinator commented at 8:27 PM on March 12, 2026:

    Would it be more correct to return false for the last input to prevent ProcessInput() from being called one extra time?

            return i < m_inputs.size() - 1;
    

    (We will already have returned at the top if m_inputs.size() == 0).


    andrewtoth commented at 3:11 PM on March 13, 2026:

    This would only have an effect on a single thread. The other 3 threads would all call ProcessInput() an extra time.

  541. in src/coins.h:736 in 551050628c outdated
     736 | +            if (input.coin) [[likely]] return std::move(*input.coin);
     737 | +            // If we get here, then this block has missing or spent inputs or there is a txid quick hash collision.
     738 | +            break;
     739 | +        }
     740 | +
     741 | +        // We will only get here for BIP30 checks, txid quick hash collisions or a block with missing or spent inputs.
    


    hodlinator commented at 8:47 PM on March 12, 2026:

    We dropped the coinbase tx (& indirectly input) in StartFetching(), which shouldn't be spendable yet for another 100 blocks anyway, but tests try to access it so technically it could be included in the comment (invalid blocks may try to access it too).

            // We will only get here for BIP30 checks, txid quick hash collisions,
            // a block with missing or spent inputs, or attempts to look up coinbase inputs.
    

    andrewtoth commented at 3:14 PM on March 13, 2026:

    This comment only reflects production code. There is also a unit test to specifically access a coin that is not in the block, and of course fuzz tests will be able to hit here as well. Would it be helpful to clarify these are the only reasons in non-test code?

    invalid blocks may try to access it too

    I think it's more specific to say a block with missing or spent inputs. Other types of invalid blocks will not get here.

    Edit: Actually, a block spending its own coinbase outputs would get here too. Not sure that's possible though...

    Edit again: Ok, I'm pretty sure it's not possible to construct a segwit block that spends from its own coinbase. The merkle root of the witness data in the coinbase creates a circular dependency on the tx spending the coinbase outputs. However, it could be possible for a pre-segwit legacy block. Not sure if we need to spell that out in the comment here?

  542. hodlinator commented at 10:43 PM on March 12, 2026: contributor

    Concept ACK 551050628c0a4e17a72180353888ddeeab7e4030

    Been looking forward to this optimization landing. It is unfortunate that it's only gotten Concept ACKs so far. From briefly looking at the approach, it seems fairly straightforward. I don't have a solid grasp of the edge cases and the surrounding pieces of the puzzle, though.

  543. willcl-ark commented at 11:33 AM on March 13, 2026: member

    I'm concept ACK here, and plan to review more soon.

    I was trying to enumerate the new assumptions that this change would bring, as changing the db (even only the access mechanism) has the potential to lead to consensus bugs (e.g. the 2013 chain split) if done incorrectly.

    The main change here is of course using multiple threads to read from LevelDB. This is a well-supported and widely used LevelDB use case (as I understand it, this is how Chrome and many other LevelDB users use it). However, we now rely on the correctness of this part of the LevelDB implementation, which we did not before. I wonder if we could fuzz multi-threaded reads from LevelDB (or if it's already being done) to try and give more assurance around this new assumption?

    If for example there were a LevelDB bug triggered only under concurrent read load (a corrupted read, stale cache entry, a race in the table cache eviction etc.) an upgraded node could get different coin data than an un-upgraded node. That may result in a chain split. IMO this is the main question we have to be able to assure ourselves of in here (this could also be an argument for adding a config option to disable parallel fetch, if we cannot assure ourselves enough?)

    I have taken a look at some of the levelDB code (which is reasonably new to me outside of tweaking various params) to try and get a better understanding of how it works under the hood under concurrent reads (I was curious whether our threads were just saturating a single pipeline more, or actually executing fully in parallel):

    My initial read is that an internal mutex is held while grabbing a memtable reference (along with pointers to the current db state), which is very fast, and then released before any real expensive work is done. The internal block cache uses 16 independent shards, each with its own mutex held only during O(1) hash table operations. So with 4 worker threads, if we read from different SST files we run fully in parallel. We only contend if we hash to the same shard (a 1/16 chance), and even then we only block on the mutex very briefly. So we are doing genuine parallel reads.

    Most of the other "potential problems" I was trying to consider, I feel, were pretty much quashed by the fact that we do not change the fact we still hold cs_main for the duration of the parallel fetch, as before, and so really the main variable is the leveldb concurrent access.

    I have observed significant speedup benchmarking this, and it feels valuable enough to consider taking IMO, once we get good-enough assurances from the levelDB side of things.

  544. andrewtoth commented at 5:40 PM on March 13, 2026: contributor

    Thanks for your reviews @murchandamus, @hodlinator, and @willcl-ark.

    we do now rely on the correctness of this part of the levelDB implementation, which we did not before

    This is not entirely accurate. We rely on this correctness for our indexes. Concurrent getrawtransaction or getblockfilter calls using txindex or blockfilterindex do concurrent levelDB reads. We just haven't used this for chainstate reads.

    I wonder if we could fuzz multi-threaded reads from levelDB

    I pushed a commit to add a coins_view_stacked fuzz harness. This creates a stack of views similar to what we use in production: a CoinsViewOverlay -> CCoinsViewCache -> CCoinsViewDB stack using an in-memory levelDB. The fuzzer first works on CCoinsViewCache -> CCoinsViewDB by themselves to populate the levelDB, then works on the overlay on top of the main cache to perform concurrent reads through to levelDB, and afterwards works on the cache and db to flush any data from the main cache down to the db.

    I built this harness before and fuzzed with it, and am fuzzing with it now. I'm not sure why I removed it when rebasing at some point.

  545. DrahtBot added the label CI failed on Mar 13, 2026
  546. andrewtoth commented at 10:28 PM on March 16, 2026: contributor

    I collected additional steady-state data.

    I ran four nodes on AWS t2.small instances (1 vCPU, 2 GB RAM) with -prune=550 -debug=bench and 20 GB gp2 EBS volumes. They ran from 1 Jan–3 Mar 2026 (blocks 930,301–939,173). All four started from the same chainstate, block files, and mempool.dat. They all connected to a single gateway node in the same VPS, which itself connected only to two outside trusted nodes. Two nodes ran master and two ran this branch. The log files are attached to this comment.

    On average, the branch nodes were 23.7% faster at connecting blocks (25.1 ms per block). Although that is a modest improvement overall, worst-case block connection times were much better on the branch. The table below lists the 20 slowest blocks (by average connect time across the four nodes), with an average speedup of 2.87×, or about 11.7 seconds per block for that set.

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Average Speedup |
    |------|--------|--------|---------|---------|---------|---------|-----------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 5 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 6 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 7 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 8 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 9 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 10 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 11 | 930311 | 6,915 | 2.6 s | 2.1 s | 11.5 s | 11.8 s | 5.04x |
    | 12 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 13 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 14 | 930336 | 7,847 | 2.9 s | 2.6 s | 10.2 s | 10.9 s | 3.85x |
    | 15 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 16 | 933330 | 3,868 | 2.5 s | 3.2 s | 9.6 s | 8.6 s | 3.17x |
    | 17 | 930312 | 8,812 | 1.4 s | 2.6 s | 8.6 s | 11.3 s | 5.05x |
    | 18 | 930308 | 7,194 | 3.0 s | 3.0 s | 6.6 s | 9.6 s | 2.68x |
    | 19 | 939046 | 9,468 | 4.1 s | 4.1 s | 7.0 s | 6.8 s | 1.68x |
    | 20 | 930339 | 6,554 | 2.8 s | 3.1 s | 4.3 s | 9.7 s | 2.39x |
    | Avg | | | 6.4 s | 6.1 s | 20.0 s | 15.9 s | 2.87x |

    Why the improvement shows up in the worst cases

    If every transaction in a block was added to the mempool after the last cache flush, this change has little effect, because their inputs are already in the cache. After the cache is flushed due to memory limits, however, those inputs are evicted. This can happen regardless of when the transactions entered the mempool. So if large consolidation transactions are in the mempool and the cache then flushes, when those transactions are mined, all their inputs must be fetched from disk. With typical single-digit millisecond latency for network-attached storage, a block with many inputs can easily spend tens of seconds just fetching UTXOs. This effect is illustrated in the chart in the description of #28233.

    The same pattern affects blocks where some transactions were never in the mempool. When missing transactions are fetched to complete a compact block, their inputs will not be in the cache before entering ConnectBlock. That further slows block connection when it is already suboptimal, for example when many transactions are non-standard.

    For the same reason, -blocksonly nodes will also see a significant steady-state speedup.

    branch1.log.gz branch2.log.gz master1.log.gz master2.log.gz

  547. murchandamus commented at 11:10 PM on March 16, 2026: member

    The table below lists the 20 slowest blocks (by average connect time across the four nodes), with an average speedup of 2.87×, or about 11.7 seconds per block for that set.

    Reading that, I became curious. What would the 20 slowest blocks by the average “branch” time and “master” time look like in comparison? Is it largely the same set, or are there perhaps some cases in which the performance shifts one way or the other?

  548. andrewtoth commented at 11:28 PM on March 16, 2026: contributor

    @murchandamus Here are the longest average block times for the branch and master nodes independently. It is largely the same set, but the order is slightly different.

    | Rank | Branch Height | Branch Average Time | Master Height | Master Average Time |
    |------|---------------|---------------------|---------------|---------------------|
    | 1 | 935502 | 17.1 s | 935502 | 46.2 s |
    | 2 | 939086 | 14.4 s | 936879 | 43.3 s |
    | 3 | 935500 | 12.4 s | 935500 | 39.3 s |
    | 4 | 936879 | 11.0 s | 939086 | 34.2 s |
    | 5 | 930335 | 8.5 s | 939021 | 29.1 s |
    | 6 | 939021 | 7.6 s | 930335 | 23.5 s |
    | 7 | 930364 | 6.2 s | 930334 | 14.9 s |
    | 8 | 936669 | 5.8 s | 934760 | 14.6 s |
    | 9 | 930347 | 5.7 s | 930338 | 14.2 s |
    | 10 | 939024 | 5.6 s | 936669 | 11.9 s |
    | 11 | 934760 | 5.2 s | 930311 | 11.6 s |
    | 12 | 929609 | 4.5 s | 930336 | 10.6 s |
    | 13 | 930777 | 4.4 s | 930312 | 10.0 s |
    | 14 | 939046 | 4.1 s | 933330 | 9.1 s |
    | 15 | 930357 | 4.0 s | 930333 | 8.8 s |
    | 16 | 931291 | 4.0 s | 939024 | 8.2 s |
    | 17 | 930333 | 3.9 s | 930308 | 8.1 s |
    | 18 | 930334 | 3.8 s | 930310 | 7.7 s |
    | 19 | 930338 | 3.6 s | 930364 | 7.5 s |
    | 20 | 930305 | 3.4 s | 930313 | 7.2 s |
  549. murchandamus commented at 11:35 PM on March 16, 2026: member

    Oh, I was thinking the same table as above, but selected by the times of the branch or master. I thought it might be interesting to see what the speed-up factor was on blocks that are slow for the branch vs the speed-up factor for blocks that are slow for master, and might be interesting if the overlap isn’t complete. I figured you might have a script that produces the table already. If it’s too much work (because you did this manually), don’t worry—my comment was just from random curiosity inspired by your data dump.

  550. andrewtoth commented at 11:51 PM on March 16, 2026: contributor

    I couldn't exactly parse your request, so I put it into the LLM and it came up with this :) Interestingly, there are a few branch-slow blocks that are slower than master. All master-slow blocks are slower than the branch though.

    931291 seems to be a major outlier. It is 2x slower than master, and it has the typical pattern of a lot of large very low fee consolidation transactions. This one should be a lot faster than master. According to the block audit on mempool.space, this tx with >1200 inputs was confirmed in that block after being seen 7 seconds earlier. So, my theory is that the master nodes accepted the transaction into their mempool right before they saw the block, while the branch nodes did not yet see it.

    Top 20 by average branch time (branch-slow blocks)

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Speedup (m/b) |
    |------|--------|--------|---------|---------|---------|---------|---------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 5 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 6 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 7 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 8 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 9 | 930347 | 8,532 | 5.8 s | 5.6 s | 4.0 s | 4.1 s | 0.71x |
    | 10 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 11 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 12 | 929609 | 7,048 | 4.5 s | 4.5 s | 4.9 s | 5.0 s | 1.10x |
    | 13 | 930777 | 7,240 | 4.4 s | 4.4 s | 4.5 s | 4.7 s | 1.04x |
    | 14 | 939046 | 9,468 | 4.1 s | 4.1 s | 7.0 s | 6.8 s | 1.68x |
    | 15 | 930357 | 7,258 | 3.9 s | 4.2 s | 3.5 s | 3.5 s | 0.87x |
    | 16 | 931291 | 11,867 | 4.0 s | 4.0 s | 1.9 s | 1.9 s | 0.48x |
    | 17 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 18 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 19 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 20 | 930305 | 7,033 | 3.4 s | 3.4 s | 4.4 s | 6.6 s | 1.62x |

    Top 20 by average master time (master-slow blocks)

    | Rank | Height | Txins | branch1 | branch2 | master1 | master2 | Speedup (m/b) |
    |------|--------|--------|---------|---------|---------|---------|---------------|
    | 1 | 935502 | 11,740 | 23.4 s | 10.9 s | 76.0 s | 16.4 s | 2.69x |
    | 2 | 936879 | 11,539 | 10.5 s | 11.4 s | 43.4 s | 43.3 s | 3.95x |
    | 3 | 935500 | 10,462 | 16.2 s | 8.6 s | 44.4 s | 34.2 s | 3.18x |
    | 4 | 939086 | 17,118 | 11.7 s | 17.0 s | 49.4 s | 19.0 s | 2.38x |
    | 5 | 939021 | 7,373 | 8.1 s | 7.2 s | 27.4 s | 30.7 s | 3.80x |
    | 6 | 930335 | 8,381 | 8.5 s | 8.4 s | 22.1 s | 25.0 s | 2.78x |
    | 7 | 930334 | 8,843 | 4.1 s | 3.5 s | 14.2 s | 15.5 s | 3.90x |
    | 8 | 934760 | 11,920 | 3.0 s | 7.5 s | 14.1 s | 15.0 s | 2.77x |
    | 9 | 930338 | 6,616 | 3.0 s | 4.2 s | 14.3 s | 14.1 s | 3.92x |
    | 10 | 936669 | 11,195 | 5.9 s | 5.7 s | 11.8 s | 11.9 s | 2.06x |
    | 11 | 930311 | 6,915 | 2.6 s | 2.1 s | 11.5 s | 11.8 s | 5.04x |
    | 12 | 930336 | 7,847 | 2.9 s | 2.6 s | 10.2 s | 10.9 s | 3.85x |
    | 13 | 930312 | 8,812 | 1.4 s | 2.6 s | 8.6 s | 11.3 s | 5.05x |
    | 14 | 933330 | 3,868 | 2.5 s | 3.2 s | 9.6 s | 8.6 s | 3.17x |
    | 15 | 930333 | 7,224 | 2.6 s | 5.2 s | 7.0 s | 10.7 s | 2.29x |
    | 16 | 939024 | 9,719 | 4.8 s | 6.3 s | 8.7 s | 7.7 s | 1.47x |
    | 17 | 930308 | 7,194 | 3.0 s | 3.0 s | 6.6 s | 9.6 s | 2.68x |
    | 18 | 930310 | 7,681 | 1.2 s | 1.0 s | 7.6 s | 7.8 s | 7.09x |
    | 19 | 930364 | 9,102 | 6.9 s | 5.6 s | 9.4 s | 5.7 s | 1.21x |
    | 20 | 930313 | 7,913 | 1.3 s | 1.0 s | 4.7 s | 9.8 s | 6.17x |
  551. DrahtBot removed the label CI failed on Mar 18, 2026
  552. andrewtoth force-pushed on Mar 22, 2026
  553. andrewtoth commented at 5:04 PM on March 22, 2026: contributor

    Addressed comments by @hodlinator to rework the commit progression. The first commit is now a fully complete standalone change that fetches all block inputs into a vector before ConnectBlock on a single thread, and scans this vector in FetchCoinFromBase instead of looking up the coin from base.

    The next commits add performance improvements:

    • cache last looked up input in m_input_tail so we don't scan entire vector each lookup (9783ff481fd922cfa59c980046b0491c0241fd83)
    • filter inputs that are created earlier in the block so we don't look them up from disk (150f052ef4d057c45f8a85904be92a6c6fb1418c, 5ec61b1e9c879fa30e15d907c61f2953a140f567)

    Then the next few commits make it safe for parallel lookups:

    • introduce a ready flag in case input is not yet fetched (8eae22f493138f6ff0a4e07020a3036df98e0413)
    • stop fetching whenever any method is called that will mutate base (3e5cdee07720f841b8ae2538f556ee1bb5cb5bc0)

    Then the threadpool is added (f15dd38be78a89139b308e7a7682979adf3b0e0b) and finally used (1ef7474d19cb720d526e29cdabaa85f8f79c9d5f)

    The rest of the commits add documentation, unit tests, and fuzz harness updates.

  554. ryanofsky commented at 9:32 PM on April 7, 2026: contributor

    Concept ACK. Change seems worthwhile and surprisingly not that complicated. One thing I was wondering was about how 4 worker threads were chosen. I see some testing was done #31132 (comment) but if optimal number depends on the type of storage device, maybe it should be configurable.

  555. DrahtBot added the label Needs rebase on Apr 13, 2026
  556. validation: collect block inputs in CoinsViewOverlay before ConnectBlock
    Introduce CoinsViewOverlay::StartFetching, which maps all input prevouts of a
    block to a new m_inputs vector of InputToFetch elements. It returns a
    ResetGuard whose lifetime is bound to the block, as is the lifetime of the
    InputToFetch elements.
    
    Introduce StopFetching to clear the m_inputs vector.
    CCoinsViewCache::Reset is made virtual and is overridden in CoinsViewOverlay.
    StopFetching is called on Reset, so the InputToFetch objects will not
    exceed the lifetime of the block.
    
    Introduce ProcessInput to fetch the utxo of an individual input in m_inputs.
    Each caller fetches the input at m_input_head and increments it, so each call
    will fetch the next input in the queue.
    
    Fetch coins from the m_inputs vector in FetchCoinFromBase by scanning all inputs
    until we discover the input with the correct outpoint.
    
    This is designed deliberately so multiple threads can call ProcessInput independently.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
    4203d58656
  557. andrewtoth force-pushed on Apr 14, 2026
  558. DrahtBot added the label CI failed on Apr 15, 2026
  559. DrahtBot commented at 12:09 AM on April 15, 2026: contributor


    🚧 At least one of the CI tasks failed. <sub>Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/24427713628/job/71365320458</sub> <sub>LLM reason (✨ experimental): CI failed because CTest reported a segmentation fault (SIGSEGV) in validation_block_tests (test 327).</sub>

    <details><summary>Hints</summary>

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

    </details>

  560. coins: track last accessed input using m_input_tail
    Inputs are accessed by ConnectBlock in the same order as they
    are created in StartFetching (excepting BIP30 checks).
    We can use this information, as well as the fact that CoinsViewOverlay
    caches coins accessed via FetchCoinFromBase, to skip scanning
    over previously accessed coins.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    4731016c81
  561. coins: introduce QuickHashHasher
    Collapses a 32-byte Txid into a uint64_t, using 4 random uint64_ts.
    Used in place of a hash function as a performance improvement.
    
    Co-authored-by: Pieter Wuille <pieter@wuille.net>
    02756fc5e1
  562. coins: filter inputs spending outputs of same block in ProcessInput
    This is a performance improvement, because we can skip checking on disk
    that the input does not exist.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    e40042eda3
  563. coins: add ready flag to InputToFetch
    Prepares for ProcessInput to be called from multiple threads.
    
    This flag acts as a memory fence around InputToFetch::coin. There is no lock
    guarding reads and writes of the coin field.
    Instead we use the flag's release/acquire semantics to ensure that when the
    main thread reads the coin it will have happened after a worker thread has
    finished writing it.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    39fe4975a3
  564. coins: stop fetching before mutating base
    Prepares for ProcessInput to be called from multiple threads.
    
    ProcessInput reads from base, so for ProcessInput to be safe to call in
    parallel on separate threads, base must not be mutated while fetching is
    in progress. Flush, Sync, and SetBackend can modify base, so we override
    them and call StopFetching before delegating to the base class.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    f6a868595a
  565. validation: add -inputfetchthreads configuration option
    Add a configuration option for the number of worker threads used for
    parallel UTXO input fetching during block connection.
    
    Default is 4 threads, max is 15, 0 disables parallel fetching.
    e56373fc2d
  566. coins: introduce thread pool in CoinsViewOverlay
    Prepares for ProcessInput to be called from multiple threads.
    
    Introduce a ThreadPool shared pointer to CoinsViewOverlay. A pool managed
    externally can be passed in the constructor.
    
    A global thread pool is used in fuzz harnesses since iterations can happen
    faster than the OS can create and tear down thread pools.
    This can cause a memory leak when fuzzing.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    0188760a85
  567. coins: fetch inputs in parallel
    Leverages the thread pool to fetch inputs on multiple threads, while the overlay
    serves inputs on the main thread.
    
    This is a performance improvement over blocking the main thread to fetch inputs.
    
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    5a34853872
  568. doc: update CoinsViewOverlay docstring to describe parallel fetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    34b931df5f
  569. test: add unit tests for CoinsViewOverlay::StartFetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    ff6a56335f
  570. fuzz: update harnesses to cover CoinsViewOverlay::StartFetching
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    Co-authored-by: sedited <seb.kung@gmail.com>
    1dd2f0fa06
  571. fuzz: add coins_view_stacked fuzz harness to test concurrent leveldb reads cfbff4cd70
  572. andrewtoth force-pushed on Apr 15, 2026
  573. DrahtBot removed the label Needs rebase on Apr 15, 2026
  574. DrahtBot removed the label CI failed on Apr 15, 2026
  575. andrewtoth commented at 2:26 PM on April 15, 2026: contributor

    Rebased due to #34124.

    Added a new commit to add a configuration option to set the number of input fetcher threads -inputfetchthreads e56373fc2d5ee4c617f7bca0e63b0e82e9bbed0d. Default is 4, maximum is 15 like script validation threads, and 0 disables input fetching on threads other than main. Addresses suggestions in #31132 (comment) (thanks @ryanofsky) and #31132 (comment) (thanks @willcl-ark).

    Uses an unordered_set for storing and looking up the quick hashes of txids, instead of a sorted vector with binary search lookups. This is faster than the previous approach, and the same collision-resistance property of the quick hash that avoids collisions between txids and prevout hashes also makes it safe from bucket-filling attacks. See discussion #31132 (review).

    Fixes an issue with the fuzz harnesses using -fork with certain fuzzers (thanks @furszy).

    git range-diff e98d36715eace5ee54a10f2931adcbbc5f6b0a15..62e4ec4bf38e4f22eed3b1015036105b2efa000a 976985eccd546a95e38973b854ccc6589e8afc74..cfbff4cd70092d5b53bf4f1dee3df84b4961a51c

    One thing I was wondering was about how 4 worker threads were chosen. @ryanofsky there are more measurements here with different benchmarks for different values of threads #31132 (comment). Most systems will benefit from more threads, but a few do not and even show slight degradation if more than 4 threads are chosen. @l0rinc and I decided that 4 was a sane conservative default that showed significant speedup across all systems benchmarked. However, #31132 (comment) shows much better performance with more threads on network connected storage. I decided to add a configuration option.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-28 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me