validation: fetch block inputs on parallel threads >20% faster IBD #31132

pull andrewtoth wants to merge 5 commits into bitcoin:master from andrewtoth:threaded-inputs changing 11 files +638 −1
  1. andrewtoth commented at 2:40 pm on October 22, 2024: contributor

    When fetching inputs in ConnectBlock, each input is fetched from the cache in series. A cache miss means a round trip to the disk db to fetch the outpoint and insert it into the cache. Since the db is locked from being written during ConnectTip, we can fetch all block inputs missing from the cache in parallel on multiple threads before entering ConnectBlock. Using this strategy resulted in a 10% faster IBD.

    Doing IBD with 16 vcores from a local peer with default settings, stopping at height 850k:

    Mean [s] Min [s] Max [s] Relative
    branch 17065.138 ± 117.439 16982.096 17148.181 1.00
    master 18731.509 ± 94.142 18731.509 18864.646 1.10

    For later blocks this change makes block connection even faster. Doing an assumeutxo from block 840k to 850k with 15 worker threads, this change is 26% faster. With just a single worker thread, this same benchmark is 6% faster. Benchmark and flame graph with 15 worker threads Benchmark and flame graph with 1 worker thread

    I have fuzzed for over 500 million iterations with the provided fuzz harness with no issues.

    This approach is heavily inspired by CCheckQueue, but we could not easily reuse it since it only checks for validity and doesn’t allow us to store results. So, this PR creates a new InputFetcher that loops through all inputs of a block on the main thread and adds their outpoints to a shared vector. After writing, the main thread and worker threads assign ranges of outpoints from the vector and fetch them from the db, and then push the resulting coins onto a thread local vector. Once the threads have finished reading all inputs, the main thread loops through all thread local vectors and inserts the results into the cache.

    This PR uses the -par value for the number of threads, which defaults to the number of vcores on the machine or 15 whichever is fewer. This is the same value used for CCheckQueue, so any users that specifically have the multi threaded validation disabled by using -par=1 will also have this feature disabled. This also means the maximum number of input fetching threads is capped at 15.

    Since InputFetcher::FetchInputs is blocking, a follow-up can update this to share the thread pool between CCheckQueue and InputFetcher.

  2. DrahtBot commented at 2:40 pm on October 22, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK l0rinc, ismaelsadeeq, Raimo33

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32317 (kernel: Separate UTXO set access from validation functions by TheCharlatan)
    • #29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Validation on Oct 22, 2024
  4. andrewtoth force-pushed on Oct 22, 2024
  5. DrahtBot commented at 2:45 pm on October 22, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/31894441286

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  6. DrahtBot added the label CI failed on Oct 22, 2024
  7. andrewtoth renamed this:
    validation: fetch block inputs parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads ~17% faster IBD
    on Oct 22, 2024
  8. andrewtoth force-pushed on Oct 22, 2024
  9. andrewtoth force-pushed on Oct 22, 2024
  10. andrewtoth force-pushed on Oct 22, 2024
  11. DrahtBot removed the label CI failed on Oct 22, 2024
  12. in src/inputfetcher.h:140 in e9e23b59f8 outdated
    146+        : m_batch_size(batch_size)
    147+    {
    148+        m_worker_threads.reserve(worker_thread_count);
    149+        for (size_t n = 0; n < worker_thread_count; ++n) {
    150+            m_worker_threads.emplace_back([this, n]() {
    151+                util::ThreadRename(strprintf("inputfetch.%i", n));
    


    l0rinc commented at 10:26 am on October 23, 2024:
    Q: Is this a leftover a hack for non-owning LevelDB threads, or is this really the best way to name threads in a cross-platform way?

    andrewtoth commented at 1:49 pm on October 23, 2024:
    Unsure, copied from CScriptCheck. If the state of the art of thread naming has advanced since that was written, please let me know!

    sipa commented at 1:54 pm on October 23, 2024:
    The C++ standard library does as far as I know have no way of renaming threads at all. src/util/threadnames.{h,cpp} is our wrapper around the various platform-dependent ways of doing so on supported systems.

    l0rinc commented at 2:03 pm on October 23, 2024:
    Thank you, please resolve the comment.
  13. in src/inputfetcher.h:124 in e9e23b59f8 outdated
    184+                    continue;
    185+                }
    186+
    187+                buffer.emplace_back(outpoint);
    188+                if (buffer.size() == m_batch_size) {
    189+                    Add(std::move(buffer));
    


    l0rinc commented at 11:29 am on October 23, 2024:
    We’re mostly creating the buckets randomly here, so each thread will need access to basically all of the keys. Since we have an idea of how LevelDB works here (i.e. Sorted String Table), we could likely improve cache locality (would likely be most beneficial on HDDs) and minimize lock contention by splitting the reads by sorted transactions instead.

    andrewtoth commented at 1:46 pm on October 23, 2024:

    I don’t think there is any lock contention here if we are doing multithreaded reading?

    I also think what you’re suggesting would add a lot more complexity to this PR, when this is “good enough”.


    l0rinc commented at 2:02 pm on October 23, 2024:
    This might be as simple as sorting by tx before we create the buckets.

    andrewtoth commented at 2:12 pm on October 23, 2024:
    If a benchmark shows that it is better, then great!
  14. in src/inputfetcher.h:90 in e9e23b59f8 outdated
    68+        std::vector<std::pair<COutPoint, Coin>> pairs{};
    69+        do {
    70+            std::vector<COutPoint> outpoints{};
    71+            outpoints.reserve(m_batch_size);
    72+            {
    73+                WAIT_LOCK(m_mutex, lock);
    


    l0rinc commented at 11:30 am on October 23, 2024:
    I’m wondering if we really need to (b)lock here or whether we could we create a read-only snapshot instead and avoid stalling?

    andrewtoth commented at 1:34 pm on October 23, 2024:
    This is blocking so we can access the queue of shared outpoints that we need to fetch from. It is not blocking for LevelDB, we access the db once we are out of the critical section.

    l0rinc commented at 2:04 pm on October 23, 2024:
    As mentioned before, why do we need shared outpoints here?

    andrewtoth commented at 2:15 pm on October 23, 2024:
    The main thread adds all outpoints to a global vector, which all workers will fetch their work from.

    andrewtoth commented at 0:56 am on December 4, 2024:
    We no longer need to block on the shared outpoints vector. We write to it once in the main thread before notifying the other threads and then only read from it afterwards.
  15. in src/inputfetcher.h:38 in e9e23b59f8 outdated
    24+ * onto the queue, where they are fetched by N worker threads. The resulting
    25+ * coins are pushed onto another queue after they are read from disk. When
    26+ * the main is done adding outpoints, it starts writing the results of the read
    27+ * queue to the cache.
    28+ */
    29+class InputFetcher
    


    l0rinc commented at 11:39 am on October 23, 2024:
    I know it’s not trivial request, but can we add a test for this class which fetches everything in parallel and sequentially and assert that the result is equivalent? And preferably also a benchmark, like we have it for https://github.com/bitcoin/bitcoin/blob/master/src/bench/checkqueue.cpp. I would gladly help here, if needed.

    andrewtoth commented at 1:35 pm on October 23, 2024:
    Yes, I can add these but I am waiting for some more conceptual support.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Added tests and benchmark. The test has random parameters, one of which would be end up having a single worker thread.

    andrewtoth commented at 8:59 pm on November 16, 2024:
    Also added fuzz harness
  16. in src/inputfetcher.h:145 in e9e23b59f8 outdated
    140+    }
    141+
    142+
    143+public:
    144+    //! Create a new input fetcher
    145+    explicit InputFetcher(size_t batch_size, size_t worker_thread_count) noexcept
    


    l0rinc commented at 11:46 am on October 23, 2024:
    For consistency (see: explicit CCheckQueue(unsigned int batch_size, int worker_threads_num)) and simplicity (m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)}, and to follow modern C++ directions where sizes seem to be preferred as signed values, see: #30927 (review)), please consider making these int(s) instead.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Done.
  17. in src/validation.cpp:6257 in e9e23b59f8 outdated
    6247@@ -6243,6 +6248,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6248 
    6249 ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6250     : m_script_check_queue{/*batch_size=*/128, options.worker_threads_num},
    6251+      m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)},
    


    l0rinc commented at 12:07 pm on October 23, 2024:

    Unlike the script checks, these fetches aren’t CPU bound, there is no reason to provide the number of CPUs as the number of parallels threads. I don’t know if we care about HDD performance here or not, but we can likely find a multiplier that makes this better for both SSD and HDD.

    Quoting from https://pkolaczk.github.io/disk-parallelism:

    It was surprising to me that even 64 threads, which are far more than the number of CPU cores (4 physical, 8 virtual), still improved the performance. I guess that with requests of such a small size to such a fast storage, you need to submit really many of them to keep the SSD busy.

    If we can provide a benchmark for this usecase we can likely find an optimal multiplier here - I won’t nack but this part is very important for me.


    andrewtoth commented at 1:43 pm on October 23, 2024:

    Adding more threads will require more memory, which is one reason to not use many more.

    I did a benchmark using 64 threads on the same 16 vcore machine, and it was slightly slower :/


    l0rinc commented at 2:00 pm on October 23, 2024:
    4x may be too much to begin with, but 1.5-2x sounds plausible, I’ll help with benchmarking this once my current batches finish.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Added a benchmark to experiment with these.
  18. in src/inputfetcher.h:123 in e9e23b59f8 outdated
    183+                if (cache.HaveCoinInCache(outpoint)) {
    184+                    continue;
    185+                }
    186+
    187+                buffer.emplace_back(outpoint);
    188+                if (buffer.size() == m_batch_size) {
    


    l0rinc commented at 12:10 pm on October 23, 2024:
    Would it be possible to create the batch sizes dynamically? Since the number of missing values differs for every block (and every dbcache size), it may not make more sense to calculate the optimal split instead of using the random value of 128. Coroutines might alleviate this problem.

    andrewtoth commented at 1:42 pm on October 23, 2024:
    I’m not sure it would warrant the complexity I think this batch size is “good enough” for now. In a follow up we could maybe add ways to set this with configs to experiment if there really is more optimal settings.

    andrewtoth commented at 3:05 pm on November 7, 2024:
    I changed the batch size to be number of workers.
  19. in src/inputfetcher.h:132 in e9e23b59f8 outdated
    192+                }
    193+            }
    194+            txids.insert(tx->GetHash());
    195+        }
    196+
    197+        Add(std::move(buffer));
    


    l0rinc commented at 12:11 pm on October 23, 2024:
    Do we always have leftovers or will this process the last batch twice (or process an empty one) if the batch happens to be divisible by batch_size?

    andrewtoth commented at 1:37 pm on October 23, 2024:
    It won’t process twice, but it could pass in an empty vector, which is ignored if you look at Add implementation.
  20. in src/inputfetcher.h:70 in e9e23b59f8 outdated
    60+
    61+    std::vector<std::thread> m_worker_threads;
    62+    bool m_request_stop GUARDED_BY(m_mutex){false};
    63+
    64+    /** Internal function that does the fetching from disk. */
    65+    void Loop() noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 12:35 pm on October 23, 2024:

    We’re basically mimicking RocksDB’s MultiGet here - but prewarming the cache instead in separate get requests, since we can’t really access LevelDB’s internals.

    Since splitting into buckets isn’t trivial and since MultiGet seems to rely on C++20 coroutines (which wasn’t available in 2012 when CCheckQueue was written), I’m wondering how much simpler this fetching would be if we had lightweight suspendible threads instead: https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html#multiget


    andrewtoth commented at 1:40 pm on October 23, 2024:

    I think it would be similar in complexity, we would still need all the locking mechanisms to prevent multithreaded access.

    What would really be great is if we had a similar construction to Rust’s std::sync::mpsc.


    l0rinc commented at 1:58 pm on October 23, 2024:

    Can you tell me why we need to prevent multithreaded access exactly? We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread, right?

    How would mpsc solve this better? Do you think we need work stealing to make it perfectly parallel? Wouldn’t coroutines already achieve the same?


    sipa commented at 2:07 pm on October 23, 2024:
    I haven’t yet experimented with them, but as far as I understand it, coroutines are just programming paradigm, not magic; they don’t do anything of their own, besides making things that were already possible easier to write. In particular, you still need a thread pool or some mechanism for scheduling how to run them,

    andrewtoth commented at 2:09 pm on October 23, 2024:

    We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread

    If the vectors are thread local, then how can the main thread access them at the end to write them? We also want to be writing throughout while the workers are fetching, not just at the end.

    How would mpsc solve this better?

    Instead of each worker thread having a local queue of results, which they then append to the global results queue, they could just push each result to the channel individually. The main thread could just pull results off the channel as they arrive, instead of waiting to be awoken by a worker thread that appended all its results to the global queue.

    work stealing

    That is a concept for async rust, or std::async::mpsc. We can do all this without introducing an async runtime. But, this is getting off topic.


    l0rinc commented at 3:15 pm on October 23, 2024:

    coroutines are just programming paradigm, not magic

    That’s also what I was counting on! :D

    In RocksDB they have high and low priority work (I assume that’s just added to the front or the back of a background work deque) – this could align well with @furszy’s suggestion for mixing different kinds of background work units.

    I haven’t used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime. Combined with an appropriate scheduling mechanism (such as a thread pool), we could maximize both CPU and IO usage, if I’m not mistaken. Instead of each thread handling just one task, it could suspend a coroutine while waiting on IO (e.g., a database fetch) and resume it later, effectively maximizing CPU and IO work without needing to know the exact details of the work.

    If the vectors are thread local

    The vector would still be global, but each thread would only access a single bucket (i.e. global vector of vectors, with each thread from the pool writing only to vector[thread_id], which contains a vector of fetched coins). When all the work is finished, we’d iterate over the global vector and merge the results into the cache on a single thread. As mentioned, sorting the outpoints before fetching could help improve data locality and reduce lock contention, and the coroutines above would help with work stealing, ensuring that all threads finish roughly at the same time.

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.


    sipa commented at 3:28 pm on October 23, 2024:

    I haven’t used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime.

    That needs async I/O, and is unrelated to coroutines, as far as I understand it. Coroutines just help with keeping track of what to do when the reads come back inside rocksdb.

    As long as LevelDB (or whatever database engine we use) internally does not use async I/O, there will be one (waiting) thread per parallel outstanding read request from the database.


    HowHsu commented at 4:06 pm on June 28, 2025:

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.

    As far as I know, the advantage of coroutines over threads is faster context switching, since it doesn’t go through the operating system kernel. This advantage only becomes apparent under extremely high concurrency, such as hundreds of thousands of concurrent tasks. Using coroutines does not eliminate the need for synchronization mechanisms where they are inherently required.

  21. l0rinc changes_requested
  22. l0rinc commented at 12:44 pm on October 23, 2024: contributor

    Concept ACK

    I’m still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I’d be interested in how coroutines would perform here instead of trying to find the best batching size manually.

  23. furszy commented at 2:25 pm on October 23, 2024: member

    Cool idea.

    Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks? The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

    This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

  24. andrewtoth commented at 2:48 pm on October 23, 2024: contributor

    implementing a general-purpose thread pool shared among the validation checks?

    Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

    #26966 introduces such structure inside https://github.com/bitcoin/bitcoin/commit/401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

    Concept ACK!

  25. l0rinc commented at 6:12 pm on October 24, 2024: contributor

    Finished benching on a HDD until 860k on Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

    0Summary
    1'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
    2 1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=
    30'
    

    f278ca4ec3 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours) e9e23b59f8 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)


    So likely on HDD we shouldn’t use so many threads, apparently it slows down IBD. Maybe we could add a new config option (iothreads or iothreadmultiplier or something). The defaults should likely depend on whether it’s an SSD or HDD.


    Edit:

    0"command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    1"times": [39993.343777768874],
    2
    3"command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    4"times": [40929.84310861388],
    

    I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

    0"command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    1"times": [40579.00445769842],
    

    So it’s a tiny bit faster than before (surprisingly stable for an actual IBD with real peers), but still slower-than/same-as before, so not sure why it’s not faster.


    Edit:

    Running it on a HDD with a low dbcache value reproduces the original result:

    0hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    
    08207d372b2 validation: fetch block inputs in parallel
    192fc718592 coins: allow emplacing non-dirty coins internally
    2Summary
    3  'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
    4    1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
    
  26. andrewtoth commented at 6:31 pm on October 24, 2024: contributor

    So likely on HDD we shouldn’t use so many threads, apparently it slows down IBD.

    I’m not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It also is syncing from untrusted network peers, so there is some variance which could also account for the 2% difference.

  27. andrewtoth force-pushed on Oct 24, 2024
  28. andrewtoth force-pushed on Oct 24, 2024
  29. DrahtBot commented at 7:25 pm on October 24, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32027275494

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  30. DrahtBot added the label CI failed on Oct 24, 2024
  31. DrahtBot removed the label CI failed on Oct 25, 2024
  32. andrewtoth force-pushed on Oct 26, 2024
  33. andrewtoth force-pushed on Oct 26, 2024
  34. DrahtBot added the label CI failed on Oct 26, 2024
  35. DrahtBot commented at 6:52 pm on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32107893176

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  36. andrewtoth force-pushed on Oct 26, 2024
  37. andrewtoth force-pushed on Oct 26, 2024
  38. andrewtoth force-pushed on Oct 26, 2024
  39. andrewtoth force-pushed on Oct 26, 2024
  40. andrewtoth force-pushed on Oct 27, 2024
  41. andrewtoth force-pushed on Oct 27, 2024
  42. andrewtoth marked this as a draft on Oct 27, 2024
  43. andrewtoth force-pushed on Oct 27, 2024
  44. andrewtoth force-pushed on Oct 27, 2024
  45. andrewtoth force-pushed on Oct 27, 2024
  46. andrewtoth force-pushed on Oct 27, 2024
  47. andrewtoth force-pushed on Oct 27, 2024
  48. andrewtoth force-pushed on Oct 27, 2024
  49. andrewtoth force-pushed on Oct 27, 2024
  50. andrewtoth force-pushed on Nov 7, 2024
  51. andrewtoth force-pushed on Nov 7, 2024
  52. DrahtBot removed the label CI failed on Nov 7, 2024
  53. andrewtoth force-pushed on Nov 7, 2024
  54. andrewtoth marked this as ready for review on Nov 7, 2024
  55. andrewtoth commented at 3:05 pm on November 7, 2024: contributor

    @furszy I tried to switch to using a shared threadpool, but it is much slower that way. We need a way to have shared state between threads for this, instead of just scheduling tasks. I suppose the generic threadpool is great for scheduling independent tasks like indexing an individual block, but for quickly pulling outpoints off a shared vector it is not optimized well.

    From #29386:

    I just noticed the comment in the code:

    For each thread a thread stack needs to be allocated. By default on Linux, threads take up 8MiB for the thread stack on a 64-bit system, and 4MiB in a 32-bit system.

    Only 8MiB of Virtual Memory is allocated, which doesn’t really mean anything. Due to CoW mechanism, only the parts of stack that are being used will be allocated as Physical Memory which is the one that actually matters.

    So, I don’t think it matters much to have an extra threadpool owned by the input fetcher.

    I think this is ready for more review. I also added tests and a benchmark.

  56. andrewtoth force-pushed on Nov 7, 2024
  57. andrewtoth force-pushed on Nov 9, 2024
  58. andrewtoth commented at 3:28 pm on November 13, 2024: contributor
    For later blocks where cache misses are much more common, this change has an even bigger impact. This benchmark report shows a 40% speedup measuring from blocks 840k to 850k. Also, compare flamegraphs of master and this branch, where the latter has 15 worker threads fetching coins from disk. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11798124132/index.html
  59. andrewtoth commented at 3:35 am on November 16, 2024: contributor
    Even with just 2 worker threads, there is significant (~30%) speed improvement for syncing recent blocks. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11865650166/index.html
  60. andrewtoth force-pushed on Nov 16, 2024
  61. andrewtoth force-pushed on Nov 16, 2024
  62. andrewtoth force-pushed on Nov 16, 2024
  63. andrewtoth force-pushed on Nov 16, 2024
  64. DrahtBot commented at 7:45 pm on November 16, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33086747731

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  65. DrahtBot added the label CI failed on Nov 16, 2024
  66. andrewtoth force-pushed on Nov 16, 2024
  67. andrewtoth force-pushed on Nov 16, 2024
  68. DrahtBot removed the label CI failed on Nov 16, 2024
  69. andrewtoth force-pushed on Nov 17, 2024
  70. in src/test/fuzz/inputfetcher.cpp:36 in 2bd5f0f03b outdated
    27+    CBlock block;
    28+    Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
    29+
    30+    const auto txs{fuzzed_data_provider.ConsumeIntegralInRange<uint32_t>(1,
    31+        std::numeric_limits<uint32_t>::max())};
    32+    for (uint32_t i{0}; i < txs; ++i) {
    


    dergoegge commented at 2:16 pm on November 18, 2024:

    This will create very long running inputs (e.g. txs = std::numeric_limits<uint32_t>::max()).

    0    LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), N) {
    

    or

    0    LIMITED_WHILE(fuzzed_data_provider.remaining_bytes(), N) {
    

    andrewtoth commented at 5:04 pm on November 18, 2024:
    Thanks, done!
  71. in src/test/fuzz/inputfetcher.cpp:41 in 2bd5f0f03b outdated
    31+        std::numeric_limits<uint32_t>::max())};
    32+    for (uint32_t i{0}; i < txs; ++i) {
    33+        CMutableTransaction tx;
    34+
    35+        const auto inputs{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
    36+        for (uint32_t j{0}; j < inputs; ++j) {
    


    dergoegge commented at 2:17 pm on November 18, 2024:
    Same as above, this will create long running inputs and maybe even run out of memory?
  72. andrewtoth force-pushed on Nov 18, 2024
  73. andrewtoth force-pushed on Nov 18, 2024
  74. DrahtBot commented at 3:34 pm on November 18, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33143571653

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  75. DrahtBot added the label CI failed on Nov 18, 2024
  76. andrewtoth force-pushed on Nov 18, 2024
  77. andrewtoth force-pushed on Nov 18, 2024
  78. andrewtoth force-pushed on Nov 18, 2024
  79. andrewtoth force-pushed on Nov 18, 2024
  80. andrewtoth force-pushed on Nov 18, 2024
  81. andrewtoth force-pushed on Nov 18, 2024
  82. andrewtoth force-pushed on Nov 18, 2024
  83. DrahtBot removed the label CI failed on Nov 18, 2024
  84. andrewtoth force-pushed on Nov 19, 2024
  85. in src/coins.h:420 in 3a4af55071 outdated
    415@@ -416,13 +416,14 @@ class CCoinsViewCache : public CCoinsViewBacked
    416     void AddCoin(const COutPoint& outpoint, Coin&& coin, bool possible_overwrite);
    417 
    418     /**
    419-     * Emplace a coin into cacheCoins without performing any checks, marking
    420-     * the emplaced coin as dirty.
    421+     * Emplace a coin into cacheCoins without performing any checks, optionally
    422+     * marking the emplaced coin as dirty.
    


    TheCharlatan commented at 2:38 pm on November 19, 2024:
    Should this rather say “optionally marking the emplaced coin as not dirty”, since the default is always dirty?

    andrewtoth commented at 6:53 pm on November 20, 2024:

    I’m not sure that’s the best though, since we do not mark a coin as not dirty. That is the default state.

    What about “marking the coin as dirty unless set_dirty is set to false”?


    TheCharlatan commented at 7:15 pm on November 20, 2024:
    That sounds good to me :+1:

    andrewtoth commented at 9:56 pm on November 20, 2024:
    Done.
  86. andrewtoth force-pushed on Nov 19, 2024
  87. andrewtoth force-pushed on Nov 20, 2024
  88. andrewtoth force-pushed on Nov 20, 2024
  89. DrahtBot added the label CI failed on Nov 20, 2024
  90. DrahtBot commented at 6:46 pm on November 20, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33279820062

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  91. andrewtoth force-pushed on Nov 20, 2024
  92. DrahtBot removed the label CI failed on Nov 20, 2024
  93. andrewtoth force-pushed on Nov 21, 2024
  94. DrahtBot commented at 5:12 pm on November 21, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33335042693

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  95. DrahtBot added the label CI failed on Nov 21, 2024
  96. andrewtoth force-pushed on Nov 21, 2024
  97. DrahtBot removed the label CI failed on Nov 21, 2024
  98. andrewtoth force-pushed on Nov 22, 2024
  99. andrewtoth force-pushed on Nov 24, 2024
  100. in src/test/inputfetcher_tests.cpp:55 in 3c201bcffc outdated
    50+    const auto cores{GetNumCores()};
    51+    const auto num_txs{m_rng.randrange(cores * 10)};
    52+    const auto block{CreateBlock(num_txs)};
    53+    const auto batch_size{m_rng.randrange<int32_t>(block.vtx.size() * 2)};
    54+    const auto worker_threads{m_rng.randrange(cores * 2)};
    55+    InputFetcher fetcher{batch_size, worker_threads};
    


    ismaelsadeeq commented at 8:31 pm on November 26, 2024:
    In 3c201bcffc1d7e382e8afa9a88750a4c261c1cf8 “tests: add inputfetcher tests” You can set this up in InputFetcherTest so that you don’t have to repeat it in the rest of the tests.

    andrewtoth commented at 0:58 am on December 4, 2024:
    Done.
  101. in src/bench/inputfetcher.cpp:15 in 2349ac7d60 outdated
    10+#include <primitives/block.h>
    11+#include <serialize.h>
    12+#include <streams.h>
    13+#include <util/time.h>
    14+
    15+static constexpr auto QUEUE_BATCH_SIZE{128};
    


    ismaelsadeeq commented at 8:33 pm on November 26, 2024:
    In 2349ac7d6071746a80223358bce0d5e556b277d7 “bench: add inputfetcher bench” How did you select this batch size?

    andrewtoth commented at 0:58 am on December 4, 2024:
    This is the hardcoded batch size used in CheckQueue. Not sure why that was selected, but I deferred to previous choices.

    l0rinc commented at 3:45 pm on September 29, 2025:
    I would prefer retesting those assumptions (I don’t even think we need a batch here)

    andrewtoth commented at 7:06 pm on October 3, 2025:
    Removed the batch size 🎉
  102. in src/bench/inputfetcher.cpp:18 in 2349ac7d60 outdated
    14+
    15+static constexpr auto QUEUE_BATCH_SIZE{128};
    16+static constexpr auto DELAY{2ms};
    17+
    18+//! Simulates a DB by adding a delay when calling GetCoin
    19+class DelayedCoinsView : public CCoinsView
    


    ismaelsadeeq commented at 8:47 pm on November 26, 2024:
    In 2349ac7d6071746a80223358bce0d5e556b277d7 “bench: add inputfetcher bench” nit: will be nice if we have block413567 input’s data that we can read so that we dont have to simulate this.

    andrewtoth commented at 0:59 am on December 4, 2024:
    We’re reading the previous outpoints of that block’s inputs, which are in many other previous blocks. So, not sure this is feasible.
  103. ismaelsadeeq commented at 8:55 pm on November 26, 2024: member

    Concept ACK This is nice. Although I have not yet benchmarked this branch, I also like @furszy’s idea of having a general-purpose thread pool.

    I just have one test improvement comment, question and a nit after first pass of the PR

  104. DrahtBot added the label Needs rebase on Dec 4, 2024
  105. andrewtoth force-pushed on Dec 4, 2024
  106. andrewtoth commented at 1:02 am on December 4, 2024: contributor

    Rebased. Since #30039 reading inputs is much faster, so the effect of this is somewhat less significant (17% -> 10%). It’s still a significant speedup though so still worth it. Especially for worst case where the cache is completely empty, like on startup or right after it gets flushed due to size.

    It is also refactored significantly. The main thread now writes everything before notifying threads, and then joins in working. This lets us do significantly less work in the critical section and parallelize more checks.

  107. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads 10% faster IBD
    on Dec 4, 2024
  108. andrewtoth force-pushed on Dec 4, 2024
  109. DrahtBot commented at 1:29 am on December 4, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33884531020

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  110. DrahtBot added the label CI failed on Dec 4, 2024
  111. andrewtoth force-pushed on Dec 4, 2024
  112. DrahtBot removed the label Needs rebase on Dec 4, 2024
  113. DrahtBot removed the label CI failed on Dec 4, 2024
  114. DrahtBot added the label Needs rebase on Dec 5, 2024
  115. andrewtoth force-pushed on Dec 5, 2024
  116. DrahtBot removed the label Needs rebase on Dec 5, 2024
  117. in src/validation.cpp:3176 in b2da764446 outdated
    3194@@ -3195,6 +3195,8 @@ bool Chainstate::ConnectTip(BlockValidationState& state, CBlockIndex* pindexNew,
    3195     LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3196              Ticks<MillisecondsDouble>(time_2 - time_1));
    3197     {
    3198+        m_chainman.GetInputFetcher().FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    


    l0rinc commented at 9:26 am on May 11, 2025:

    Can we let the objects do the job instead of querying their internals and doing it ourselves (“tell, don’t ask”):

    0        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), blockConnecting);
    

    andrewtoth commented at 2:20 pm on October 2, 2025:
    I tried to mimic the script validation like GetCheckQueue. But, I guess this is different enough. Will change next time I push.

    andrewtoth commented at 0:00 am on October 4, 2025:
    Done.
  118. DrahtBot added the label CI failed on Jun 2, 2025
  119. maflcko commented at 7:02 am on June 10, 2025: member
    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe it is a different limit hit?
  120. HowHsu commented at 4:42 pm on June 26, 2025: none
    Hi folks, this looks great, since if all the prevout coins of all transactions of a block are loaded in advance, then the optimization in #32791 makes sense.
  121. TheCharlatan commented at 9:09 am on September 17, 2025: contributor
    What’s the status here?
  122. andrewtoth force-pushed on Sep 17, 2025
  123. andrewtoth force-pushed on Sep 17, 2025
  124. andrewtoth force-pushed on Sep 17, 2025
  125. andrewtoth commented at 8:38 pm on September 17, 2025: contributor

    Looks like the CI started failing, due to too many threads being launched in the functional tests with that parallelism? As the threads may open files, this could be hitting the max open files limit? Or maybe it is a different limit hit?

    Thanks, I added -par=1 to all nodes spawned in features_proxy.py in 6980852416040bdddf111df3cea3ec50639f010a. That test spawns lots of nodes and block validation is not relevant to it.

    What’s the status here?

    Rebased to fix silent conflicts and added the fix for features_proxy.py.

  126. andrewtoth commented at 10:57 pm on September 21, 2025: contributor

    I benchmarked the latest branch with default dbcache up to 912683. Results are a speedup of 14% - 5:07 vs 5:49.

    Command Mean [s] Min [s] Max [s] Relative
    echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 18430.672 ± 19.856 18416.631 18444.712 1.00
    echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 20937.219 ± 62.635 20892.929 20981.508 1.14 ± 0.00
  127. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads 10% faster IBD
    validation: fetch block inputs on parallel threads >10% faster IBD
    on Sep 21, 2025
  128. maflcko removed the label CI failed on Sep 22, 2025
  129. andrewtoth commented at 5:18 pm on September 23, 2025: contributor

    Did the same benchmark with 5000 dbcache and there is a 6% speedup :rocket: - 4:27 vs 4:44. Even with far fewer cache misses this change is still a benefit, and will continue to improve block connection speed as the blockchain and utxo set get bigger.

    Command Mean [s] Min [s] Max [s] Relative
    echo 688c03597afb0b76077f1ffc4608eef19481056e && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000 16021.047 ± 5.892 16016.881 16025.213 1.00
    echo 1444ed855f438f1270104fca259ce61b99ed5cdb && /usr/bin/time ./build/bin/bitcoind -printtoconsole=0 -connect=192.168.2.171 -stopatheight=912683 -dbcache=5000 17057.947 ± 42.032 17028.226 17087.668 1.06 ± 0.00
  130. in test/functional/feature_proxy.py:141 in 688c03597a outdated
    135@@ -136,6 +136,9 @@ def setup_nodes(self):
    136         if self.have_unix_sockets:
    137             args[5] = ['-listen', f'-proxy=unix:{socket_path}']
    138             args[6] = ['-listen', f'-onion=unix:{socket_path}']
    139+        # Keep validation threads low to avoid CI thread/pid limits.
    140+        # Ensure even empty arg lists get '-par=1'.
    141+        args = [a + ['-par=1'] if a else ['-par=1'] for a in args]
    


    maflcko commented at 8:16 pm on September 24, 2025:

    seems a bit odd to have the number of nodes in a test influence whether or not the test has to be edited to remove or add -par=1 everywhere. Would it not be easier to just globally set -par=2 for all funtional tests?

     0diff --git a/test/functional/test_framework/util.py b/test/functional/test_framework/util.py
     1index e5a5938f07..42bb213dd3 100644
     2--- a/test/functional/test_framework/util.py
     3+++ b/test/functional/test_framework/util.py
     4@@ -459,6 +459,7 @@ def write_config(config_path, *, n, chain, extra_config="", disable_autoconnect=
     5         f.write("printtoconsole=0\n")
     6         f.write("natpmp=0\n")
     7         f.write("shrinkdebugfile=0\n")
     8+        f.write("par=2\n")
     9         # To improve SQLite wallet performance so that the tests don't timeout, use -unsafesqlitesync
    10         f.write("unsafesqlitesync=1\n")
    11         if disable_autoconnect:
    

    andrewtoth commented at 9:01 pm on September 24, 2025:
    Yes, I wondered if that would be more invasive to other tests though.

    maflcko commented at 7:42 am on September 25, 2025:
    It disables the auto-detection for all functional tests by default, which I can’t really find a downside to. Also, it removes idle “spam” threads while debugging (gdb and other tools will display less script check threads), which also seems beneficial to have?

    andrewtoth commented at 4:09 pm on September 27, 2025:
    Makes sense. Done here #33485.
  131. in src/inputfetcher.h:134 in 688c03597a outdated
    140+                        m_in_flight_outpoints_count -= m_last_outpoint_index;
    141+                        m_last_outpoint_index = 0;
    142+                        break;
    143+                    }
    144+                }
    145+            } catch (const std::runtime_error&) {
    


    l0rinc commented at 2:31 pm on September 29, 2025:
    nit: is there anything in the error that we may want to log?

    andrewtoth commented at 6:04 pm on October 11, 2025:
    Added a log.
  132. in src/inputfetcher.h:107 in 688c03597a
    102+                while (m_last_outpoint_index == 0) {
    103+                    if ((is_main_thread && m_in_flight_outpoints_count == 0) || m_request_stop) {
    104+                        return;
    105+                    }
    106+                    ++m_idle_worker_count;
    107+                    cond.wait(lock);
    


    l0rinc commented at 2:33 pm on September 29, 2025:
    I haven’t reviewed it in detail but was wondering why we need locking here, it should be possible to do most of this lock free (especially if we sort the keys first so that threads are more likely to access different regions). I have started reviewing and testing it in detail, but to have some progress I’m sharing my observations as I go along

    andrewtoth commented at 2:21 pm on October 2, 2025:

    I’ve updated to use semaphores instead of mutex. That should be more efficient.

    especially if we sort the keys first so that threads are more likely to access different regions

    I don’t understand what this has to do with being lock free.


    l0rinc commented at 2:40 pm on October 2, 2025:

    I don’t understand what this has to do with being lock free.

    We may have fewer file system locks if the threads are accessing different regions

    I’ve updated to use semaphores instead of mutex

    I will review that in more detail soon, probably next week.


    andrewtoth commented at 2:47 pm on October 2, 2025:

    We may have fewer file system locks if the threads are accessing different regions

    Ok, but that is not the same as this InputFetcher construction being lock free.


    l0rinc commented at 2:54 pm on October 2, 2025:
    No, that’s orthogonal, it’s another area where we could possibly reduce contention.
  133. in src/inputfetcher.h:82 in 688c03597a
    77+    const CCoinsViewCache* m_cache{nullptr};
    78+
    79+    std::vector<std::thread> m_worker_threads;
    80+    bool m_request_stop GUARDED_BY(m_mutex){false};
    81+
    82+    //! Internal function that does the fetching from disk.
    


    l0rinc commented at 2:35 pm on September 29, 2025:
    instead of the comment, can we express this in the method name?

    andrewtoth commented at 2:22 pm on October 2, 2025:
    I updated the thread name to ThreadLoop, which just does the loop. There is another function now, FetchInputsOnThread, that fetches for each block until finished.
  134. in src/inputfetcher.h:84 in 688c03597a outdated
    74+    //! DB coins view to fetch from.
    75+    const CCoinsView* m_db{nullptr};
    76+    //! The cache to check if we already have this input.
    77+    const CCoinsViewCache* m_cache{nullptr};
    78+
    79+    std::vector<std::thread> m_worker_threads;
    


    l0rinc commented at 2:38 pm on September 29, 2025:

    I have tried std::jthread in l0rinc/bitcoin@6afe2e8 (#40) but it seems the CI’s libc++ doesn’t provide it

    Q: it’s just the second commit and we’re already doing the fetching on multiple threads. Can we add a single-threaded input fetcher first and add multithreading only as the very last step?


    andrewtoth commented at 3:55 pm on October 4, 2025:

    Can we add a single-threaded input fetcher first and add multithreading only as the very last step?

    Done.


    andrewtoth commented at 2:09 pm on October 14, 2025:

    I have tried std::jthread

    I looked at this, but it doesn’t really add anything to this implementation. We could have a std::stop_token for each thread, but we would have to request_stop() each jthread before releasing the semaphore in the destructor anyway. So it doesn’t let us remove the destructor, and saves a line for not having to declare m_request_stop. I don’t think it’s worth it to use jthreads here.


    l0rinc commented at 2:49 pm on October 14, 2025:
    I couldn’t get it to work on CI anyway
  135. in src/inputfetcher.h:73 in 688c03597a
    68+     */
    69+    int32_t m_in_flight_outpoints_count GUARDED_BY(m_mutex){0};
    70+    //! The number of worker threads that are waiting on m_worker_cv
    71+    int32_t m_idle_worker_count GUARDED_BY(m_mutex){0};
    72+    //! The maximum number of outpoints to be assigned in one batch
    73+    const int32_t m_batch_size;
    


    l0rinc commented at 2:42 pm on September 29, 2025:
    what if instead of locking we just iterate every nth element (where n is the thread index), implicitly dividing the input into n buckets without locking. Each thread would work on a distinct set of values - we can pre-filter for existing values on a single thread before forking off. This won’t have work stealing, but we can likely assume uniform distribution and the solution would be trivial and lock free.

    andrewtoth commented at 2:24 pm on October 2, 2025:
    Prefiltering on the main thread is too slow, it’s faster if we do the filtering in parallel. So, we still need to have a smaller batch size because then work will not be divided evenly. One thread could get all cache misses while the others all have cached inputs.

    l0rinc commented at 2:40 pm on October 2, 2025:
    Not sure why that’s problematic, we don’t have to have perfect parallelism, it seems to me we can assume uniform distribution - it’s fine if there are outliers if that makes the code simpler (which I think it should, it could even eliminate most locks, since the jobs are basically completely independent)
  136. in src/inputfetcher.h:80 in 688c03597a outdated
    72+    //! The maximum number of outpoints to be assigned in one batch
    73+    const int32_t m_batch_size;
    74+    //! DB coins view to fetch from.
    75+    const CCoinsView* m_db{nullptr};
    76+    //! The cache to check if we already have this input.
    77+    const CCoinsViewCache* m_cache{nullptr};
    


    l0rinc commented at 2:44 pm on September 29, 2025:
    could we pre-filter on a single tread and send the results to the fetcher instead? That way we can also decide not to do multi-threaded access for small sets (we can experiment with the values, but we can probably start with set size < nproc should be done on a single thread).

    andrewtoth commented at 2:25 pm on October 2, 2025:
    Prefiltering on the main thread is too slow. It is several milliseconds to check every input in large blocks whether they exist in the cache.
  137. in src/inputfetcher.h:1 in 688c03597a outdated
    0@@ -0,0 +1,246 @@
    1+// Copyright (c) 2024-present The Bitcoin Core developers
    


    l0rinc commented at 2:45 pm on September 29, 2025:

    nit: the curse of long review queues

    0// Copyright (c) 2025-present The Bitcoin Core developers
    

    andrewtoth commented at 2:25 pm on October 2, 2025:
    Done.
  138. in src/coins.h:432 in 688c03597a outdated
    430+     * NOT FOR GENERAL USE. Used when loading coins from a UTXO snapshot, and
    431+     * in the InputFetcher.
    432      * @sa ChainstateManager::PopulateAndValidateSnapshot()
    433      */
    434-    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin);
    435+    void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty = true);
    


    l0rinc commented at 2:48 pm on September 29, 2025:

    to peel away the preparatory commits, it would simplify review to extract these into tiny, focused PRs - to have some progress, since this PR is in review for some time, but it’s a very good change that I’d like to have some progress on.

    nit: I understand the default param is meant to make the diff smaller, but it doesn’t help with understanding the effect of the change, to see where this is used and what we’re changing


    andrewtoth commented at 4:09 pm on October 15, 2025:
    I think I can just drop this first commit entirely. We don’t actually care to not set the coins we fetch as dirty. In the happy path, all these coins will be spent immediately after ConnectBlock, so they will be set to dirty anyways. In the unhappy path where the valid proof-of-work block is found to be invalid, the dirty coins we added will just cause the coins to be overwritten by the same data in the db at the next flush.
  139. in src/coins.cpp:114 in 688c03597a outdated
    109@@ -110,10 +110,15 @@ void CCoinsViewCache::AddCoin(const COutPoint &outpoint, Coin&& coin, bool possi
    110            (bool)it->second.coin.IsCoinBase());
    111 }
    112 
    113-void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin) {
    114-    cachedCoinsUsage += coin.DynamicMemoryUsage();
    115+void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
    116+    const auto mem_usage{coin.DynamicMemoryUsage()};
    


    l0rinc commented at 2:49 pm on September 29, 2025:

    af8a366bd6a08d9362e69a89b0b89b5c94eb63ca I had something similar in https://github.com/bitcoin/bitcoin/pull/32313/files#diff-f0ed73d62dae6ca28ebd3045e5fc0d5d02eaaacadb4c2a292985a3fbd7e1c77cR254

    Can you please explain in the commit message why this change is necessary?


    andrewtoth commented at 3:59 pm on October 3, 2025:
    Added some explanation in the commit message. Please let me know if it makes it more clear.
  140. in src/validation.cpp:6271 in 688c03597a outdated
    6267@@ -6266,6 +6268,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6268 
    6269 ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6270     : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6271+      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 2:50 pm on September 29, 2025:
    I have tested this with different par values and surprisingly it barely had any effect. Is it because of the locking?

    andrewtoth commented at 2:27 pm on October 2, 2025:
    I believe this is resolved.
  141. in src/coins.cpp:118 in af8a366bd6 outdated
    115+void CCoinsViewCache::EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty) {
    116+    const auto mem_usage{coin.DynamicMemoryUsage()};
    117     auto [it, inserted] = cacheCoins.try_emplace(std::move(outpoint), std::move(coin));
    118-    if (inserted) CCoinsCacheEntry::SetDirty(*it, m_sentinel);
    119+    if (inserted) {
    120+        cachedCoinsUsage += mem_usage;
    


    l0rinc commented at 2:57 pm on September 29, 2025:

    this seems like a change in behavior, but it assumed that the coin was never in the cache - though if (inserted) didn’t help with understanding this, so it likely isn’t.

    Is that still something that we can assume - hence the “danger”, right? And if it’s always inserted, does the insertion guard still make sense? It’s a bit even more confusing now :/


    andrewtoth commented at 2:33 pm on October 2, 2025:
    This is dangerous because it doesn’t check for freshness or if already inserted. It is meant to bulk load new utxos from the assume utxo set. Since assume utxo assumes the utxo set is currently empty, the coins would always be inserted. This is repurposed here to bulk load utxos from the db directly into the cache. However, an invalid block could be mined which spends an already spent utxo that is in the cache but has not been synced to the db yet. In that case, the insertion will fail here. There is a unit test specifically for this scenario.
  142. in src/inputfetcher.h:25 in 912f26b81e outdated
    19+#include <unordered_set>
    20+#include <vector>
    21+
    22+/**
    23+ * Input fetcher for fetching inputs from the CoinsDB and inserting
    24+ * into the CoinsTip.
    


    l0rinc commented at 2:59 pm on September 29, 2025:
    0 * Helper for fetching inputs from the CoinsDB and inserting into the CoinsTip.
    

    andrewtoth commented at 3:59 pm on October 3, 2025:
    Done.
  143. in src/inputfetcher.h:27 in 912f26b81e outdated
    22+/**
    23+ * Input fetcher for fetching inputs from the CoinsDB and inserting
    24+ * into the CoinsTip.
    25+ *
    26+ * The main thread loops through the block and writes all input prevouts to a
    27+ * global vector. It then wakes all workers and starts working as well. Each
    


    l0rinc commented at 3:00 pm on September 29, 2025:
    do we need to write to a global vector or can we safely iterate the prevouts directly from each thread?

    andrewtoth commented at 2:36 pm on October 2, 2025:
    We iterate the prevouts directly from the block now. However, we store the tx index and vin index in a global vector now. This way we can flatten the inputs instead of having to scan the txs to see how many inputs they have.
  144. in src/inputfetcher.h:83 in 912f26b81e outdated
    78+
    79+    std::vector<std::thread> m_worker_threads;
    80+    bool m_request_stop GUARDED_BY(m_mutex){false};
    81+
    82+    //! Internal function that does the fetching from disk.
    83+    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 3:04 pm on September 29, 2025:
    Q: do we really need the main thread to be part of this? I expect this to be disk bound and not CPU restricted, we should be able to go beyond nproc, so it should be safe to leave the main thread out of this as far as I can tell…

    andrewtoth commented at 2:37 pm on October 2, 2025:
    Not sure, would have to benchmark this. I have updated the functions though to make the main thread’s entrance clearer.

    l0rinc commented at 2:43 pm on October 2, 2025:
    My benchmarks so far indicate the opposite: after 3-4 threads there is no benefit to the parallelization (either on SSD or HDD). I will remeasure your new changes after you give me the 👍
  145. in src/inputfetcher.h:121 in 912f26b81e outdated
    125+                    // block, it won't be in the cache yet but it also won't be
    126+                    // in the db either.
    127+                    if (m_txids.contains(outpoint.hash)) {
    128+                        continue;
    129+                    }
    130+                    if (m_cache->HaveCoinInCache(outpoint)) {
    


    l0rinc commented at 3:28 pm on September 29, 2025:
    as mentioned before I think it should be safe to pre-filter on a single thread instead

    andrewtoth commented at 2:38 pm on October 2, 2025:
    It is definitely safe to do, since all access would be on main thread. It is also safe to do from parallel threads if we don’t write until all threads are done reading, which is what this PR does. Prefiltering is slow though (several milliseconds) so is better to do in parallel.

    l0rinc commented at 2:45 pm on October 2, 2025:
    But prefiltering would allow sorting, which should untangle the threads. The threads will access the same files (which are more likely to be different from the files the other threads are requesting), so they may profit from cache locality if the OS supports it - that’s why I suggested giving it a try.
  146. in src/inputfetcher.h:118 in 912f26b81e outdated
    122+                for (auto i{end_index - local_batch_size}; i < end_index; ++i) {
    123+                    const auto& outpoint{m_outpoints[i]};
    124+                    // If an input spends an outpoint from earlier in the
    125+                    // block, it won't be in the cache yet but it also won't be
    126+                    // in the db either.
    127+                    if (m_txids.contains(outpoint.hash)) {
    


    l0rinc commented at 3:31 pm on September 29, 2025:
    what if this ends up on different threads, i.e. a spend from an earlier outpoint processed on a different thread? Wouldn’t we take care of those automatically? We can likely skip all values that are not found, since we will revalidate everything after this cache warming call - we just have to document that it’s theoretically possible that some values won’t be in the cache after this call (though the internal spends should be added, just in a different thread, right?).

    andrewtoth commented at 2:40 pm on October 2, 2025:
    I’m not sure I understand this. The m_txids set is computed on the main thread, and is only read from multiple threads. If we didn’t do this we would try to fetch non-existent outputs from the db, which would be much slower.
  147. in src/inputfetcher.h:127 in 912f26b81e outdated
    131+                        continue;
    132+                    }
    133+                    if (auto coin{m_db->GetCoin(outpoint)}; coin) {
    134+                        local_pairs.emplace_back(outpoint, std::move(*coin));
    135+                    } else {
    136+                        // Missing an input. This block will fail validation.
    


    l0rinc commented at 3:32 pm on September 29, 2025:
    do we really care about this, it’s not our job here to validate, just fetch whatever we can, the validation will happen after this pre-warming.

    andrewtoth commented at 2:41 pm on October 2, 2025:
    We don’t really care, but it would be good to not continue doing work here if we know it’s pointless. This just exits early. No validation is happening.

    l0rinc commented at 2:50 pm on October 2, 2025:
    I think I would prefer a less opinionated version, as long as it’s still correct. No need to optimize for the consensus failure speed in my opinion, I would prefer simpler code for a change as risky as this one.
  148. in src/inputfetcher.h:151 in 912f26b81e outdated
    157+
    158+    //! Create a new input fetcher
    159+    explicit InputFetcher(int32_t batch_size, int32_t worker_thread_count) noexcept
    160+        : m_batch_size(batch_size)
    161+    {
    162+        if (worker_thread_count < 1) {
    


    l0rinc commented at 3:34 pm on September 29, 2025:

    what’s the reason for allowing negative worker_thread_count? In other cases I think it was used to signal how many CPUs to reserve, but that doesn’t seem to be the case here, and since we’re claming to min of 0, consider:

    0        if (worker_thread_count == 0) {
    

    andrewtoth commented at 3:58 pm on October 3, 2025:
    Done.
  149. in src/inputfetcher.h:172 in 912f26b81e outdated
    187+    void FetchInputs(CCoinsViewCache& cache,
    188+                     const CCoinsView& db,
    189+                     const CBlock& block) noexcept
    190+        EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    191+    {
    192+        if (m_worker_threads.empty() || block.vtx.size() <= 1) {
    


    l0rinc commented at 3:37 pm on September 29, 2025:

    Can we maybe do something like this instead?

    0        if (block.vtx.size() < m_worker_threads.size()) {
    

    andrewtoth commented at 2:42 pm on October 2, 2025:
    This is to not enter if there is only a coinbase tx, since it has no inputs to fetch. If there were 2 txs, and the second has 1000 inputs, we would still want to enter here.
  150. in src/inputfetcher.h:152 in 912f26b81e outdated
    193+            return;
    194+        }
    195+
    196+        // Set the db and cache to use for this block.
    197+        m_db = &db;
    198+        m_cache = &cache;
    


    l0rinc commented at 3:38 pm on September 29, 2025:
    can we avoid mutating the state in a multithreaded class for safety? It’s easier to follow along knowing that the class is immutable and the state is passed along…

    andrewtoth commented at 2:43 pm on October 2, 2025:
    I don’t think we can do that. We need to set these here for other threads to read. These are only read from other threads, never written to. We also only read from other threads after the main thread has released the counting_semaphore, so we know the pointers are synced across the threads.

    l0rinc commented at 2:48 pm on October 2, 2025:
    I really dislike that, will try to come up with a lock-free version later (maybe next week)
  151. in src/test/inputfetcher_tests.cpp:51 in c705c6f1f1 outdated
    46+
    47+        return block;
    48+    }
    49+
    50+public:
    51+    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    


    l0rinc commented at 3:40 pm on September 29, 2025:
    can we add these tests before the multithreading change - having a single-threaded InputFetcher first, adding tests and benchmarks after and doing the actual multithreading as a very last step. That would construct the whole scenario in smaller steps, proving that every change is safe

    andrewtoth commented at 3:57 pm on October 4, 2025:
    Done.
  152. in src/test/inputfetcher_tests.cpp:55 in c705c6f1f1 outdated
    50+public:
    51+    explicit InputFetcherTest(const ChainType chainType = ChainType::MAIN,
    52+                             TestOpts opts = {})
    53+        : BasicTestingSetup{chainType, opts}
    54+    {
    55+        SeedRandomForTest(SeedRand::ZEROS);
    


    l0rinc commented at 3:41 pm on September 29, 2025:
    I understand why benchmarks need predictability, but wouldn’t we want variance for tests?

    andrewtoth commented at 6:04 pm on October 11, 2025:
    Changed to FIXED_SEED.
  153. in src/test/inputfetcher_tests.cpp:170 in c705c6f1f1 outdated
    166+
    167+class ThrowCoinsView : public CCoinsView
    168+{
    169+    std::optional<Coin> GetCoin(const COutPoint& outpoint) const override
    170+    {
    171+        throw std::runtime_error("database error");
    


    l0rinc commented at 3:44 pm on September 29, 2025:
    consider std::terminate

    andrewtoth commented at 2:44 pm on October 2, 2025:
    Err, we want to throw a runtime error here to test the try/catch in the inputfetcher.
  154. in src/test/fuzz/inputfetcher.cpp:148 in 1faf0595a5 outdated
    145+            // Check any newly added coins in the cache are the same as the db
    146+            const auto& coin{cache.AccessCoin(outpoint)};
    147+            assert(!coin.IsSpent());
    148+            assert(coin.fCoinBase == (*maybe_coin).fCoinBase);
    149+            assert(coin.nHeight == (*maybe_coin).nHeight);
    150+            assert(coin.out == (*maybe_coin).out);
    


    l0rinc commented at 3:47 pm on September 29, 2025:
    0            assert(coin.fCoinBase == maybe_coin->fCoinBase);
    1            assert(coin.nHeight == maybe_coin->nHeight);
    2            assert(coin.out == maybe_coin->out);
    

    andrewtoth commented at 6:04 pm on October 11, 2025:
    Done!
  155. l0rinc commented at 4:16 pm on September 29, 2025: contributor

    I have re-reviewed the changes again lightly and did quite a few benchmarks on different platforms. There were a lot of surprises, see my measurements:

     0COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
     1STOP=915961; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
     5hyperfine \
     6  --sort command \
     7  --runs 1 \
     8  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
     9  --parameter-list COMMIT ${COMMITS// /,} \
    10  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    11    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    12    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    13  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log" \
    14  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    15
    16
    17688c03597a validation: fetch block inputs in parallel
    18af8a366bd6 coins: allow emplacing non-dirty coins internally
    19
    20Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    21  Time (abs ):        29732.695 s               [User: 60441.083 s, System: 5856.247 s]
    22
    23Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    24  Time (abs ):        37896.082 s               [User: 60968.810 s, System: 7062.414 s]
    25
    26Relative speed comparison
    27        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    28        1.27          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    29
    30---
    31
    32Retested it separately with:
    

    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && time ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -connect=rpi5-16-3.local

    cat ../BitcoinData/debug.log | egrep ‘height=0|height=916000’ 2025-09-25T17:03:06Z UpdateTip: new best=000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f height=0 version=0x00000001 log2_work=32.000022 tx=1 date=‘2009-01-03T18:15:05Z’ progress=0.000000 cache=0.3MiB(0txo) 2025-09-26T01:02:56Z UpdateTip: new best=000000000000000000003ca9748080f4c3d1230ba9fa4bed66be6ded05f9b6e6 height=916000 version=0x2000e000 log2_work=95.840381 tx=1246369867 date=‘2025-09-23T07:22:08Z’ progress=0.998966 cache=367.7MiB(2821413txo) 7h 59m 50s

    0
    1</details>
    2
    3Doing the same on an Intel i9 with SSD shows similar results
    4
    5<details>
    6<summary>i9 with SSD, IBD from real peers/reindex-chainstate seem is 24%/25% faster for default memory, done in 6h/3.5h</summary>
    

    COMMITS=“688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca”;
    STOP=915961; DBCACHE=450;
    CC=gcc; CXX=g++;
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs";
    (echo “”; for c in $COMMITS; do git fetch -q origin $c && git log -1 –pretty=’%h %s’ $c || exit 1; done; echo “”) &&
    hyperfine
    –sort command
    –runs 1
    –export-json “$BASE_DIR/rdx-$(sed -E ’s/(\w{8})\w+ ?/\1-/g;s/-$//’«<"$COMMITS”)-$STOP-$DBCACHE-$CC.json"
    –parameter-list COMMIT ${COMMITS// /,}
    –prepare “killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset –hard &&
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 &&
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20”
    –cleanup “cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log”
    “COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0”

    688c03597a validation: fetch block inputs in parallel af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e) Time (abs ≡): 12698.166 s [User: 33794.242 s, System: 3015.471 s]

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca) Time (abs ≡): 15928.708 s [User: 28382.232 s, System: 2308.299 s]

    Relative speed comparison 1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e) 1.25 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)

    0
    1and
    

    COMMITS=“688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca”;
    STOP=916000; DBCACHE=450;
    CC=gcc; CXX=g++;
    BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs";
    (echo “”; for c in $COMMITS; do git fetch -q origin $c && git log -1 –pretty=’%h %s’ $c || exit 1; done; echo “”) &&
    hyperfine
    –sort command
    –runs 3
    –export-json “$BASE_DIR/ibd-$(sed -E ’s/(\w{8})\w+ ?/\1-/g;s/-$//’«<"$COMMITS”)-$STOP-$DBCACHE-$CC.json"
    –parameter-list COMMIT ${COMMITS// /,}
    –prepare “killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset –hard &&
    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 &&
    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20”
    –cleanup “cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log &&
    grep -q ‘height=0’ $DATA_DIR/debug.log && grep -q ‘height=$STOP’ $DATA_DIR/debug.log”
    “COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0”

    688c03597a validation: fetch block inputs in parallel af8a366bd6 coins: allow emplacing non-dirty coins internally

    Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e) Time (mean ± σ): 21484.108 s ± 1187.956 s [User: 42976.944 s, System: 4356.289 s] Range (min … max): 20112.390 s … 22175.559 s 3 runs

    Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca) Time (mean ± σ): 26589.393 s ± 1171.370 s [User: 36011.245 s, System: 3193.496 s] Range (min … max): 25607.731 s … 27886.055 s 3 runs

    Relative speed comparison 1.00 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e) 1.24 ± 0.09 COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)

     0
     1</details>
     2
     3Increasing the memory decreases the difference:
     4
     5<details>
     6<summary>i9 reindex-chainstate seem is ~9% faster for default memory, done in 3.3h</summary>
     7
     8COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; STOP=915961; DBCACHE=4500; CC=gcc; CXX=g++; BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; (echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && hyperfine   --sort command   --runs 1   --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json"   --parameter-list COMMIT ${COMMITS// /,}   --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
     9    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    10    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20"   --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log"   "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    11
    12688c03597a validation: fetch block inputs in parallel
    13af8a366bd6 coins: allow emplacing non-dirty coins internally
    14
    15Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    16  Time (abs ):        11801.704 s               [User: 20216.598 s, System: 1181.879 s]
    17
    18Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    19  Time (abs ):        12916.432 s               [User: 17150.579 s, System: 747.711 s]
    20
    21Relative speed comparison
    22        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    23        1.09          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=915961 -dbcache=4500 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    Note that the difference between small and big dbcache is also shrunk from 23% to 7.5%!

    Checked the same on an i7 with hdd, it seems the speedup is best on non-rotating disks, maybe we could consider reducing the parallelism in those cases:

     0COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
     1STOP=916000; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
     5hyperfine \
     6  --sort command \
     7  --runs 1 \
     8  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
     9  --parameter-list COMMIT ${COMMITS// /,} \
    10  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    11    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    12    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    13  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    14             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    15  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    16
    17688c03597a validation: fetch block inputs in parallel
    18af8a366bd6 coins: allow emplacing non-dirty coins internally
    19
    20Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    21  Time (abs ):        35766.853 s               [User: 39688.514 s, System: 2853.808 s]
    22
    23Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    24  Time (abs ):        41355.517 s               [User: 35667.321 s, System: 2872.506 s]
    25
    26Relative speed comparison
    27        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blockso$ly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    28        1.16          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    and

     0COMMITS="688c03597afb0b76077f1ffc4608eef19481056e af8a366bd6a08d9362e69a89b0b89b5c94eb63ca"; \
     1STOP=916000; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done; echo "") && \
     5hyperfine \
     6  --sort command \
     7  --runs 1 \
     8  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
     9  --parameter-list COMMIT ${COMMITS// /,} \
    10  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    11    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    12    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    13  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    14             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    15  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    16
    17688c03597a validation: fetch block inputs in parallel
    18af8a366bd6 coins: allow emplacing non-dirty coins internally
    19
    20Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    21  Time (abs ):        35766.853 s               [User: 39688.514 s, System: 2853.808 s]
    22
    23Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    24  Time (abs ):        41355.517 s               [User: 35667.321 s, System: 2872.506 s]
    25
    26Relative speed comparison
    27        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blockso$ly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    28        1.16          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    

    Checking the same on my M4 max laptop was the most surprising:

     0STOP=916000; DBCACHE=450; \
     1DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
     2hyperfine \
     3  --sort command \
     4  --runs 1 \
     5  --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
     6  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
     7    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
     8    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
     9  --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    10  "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    11
    12Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    13  Time (abs ≡):        20658.825 s               [User: 19679.918 s, System: 4763.490 s]
    14Benchmark 1b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    15  Time (abs ≡):        20186.312 s               [User: 19481.126 s, System: 4716.728 s]
    16
    17Benchmark 2: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    18  Time (abs ≡):        7131.178 s               [User: 17244.133 s, System: 12850.427 s]
    19Benchmark 2b: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    20  Time (abs ≡):        7180.762 s               [User: 17430.360 s, System: 12949.731 s
    21
    22Relative speed comparison
    23        2.90          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = af8a366bd6a08d9362e69a89b0b89b5c94eb63ca)
    24        1.00          ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    

    It was hard to believe this was true, so I re-ran it a few times, and it was consistent.

    I have tried -par=32 on my laptop as well - exactly the same speed:

     0STOP=916000; DBCACHE=450; \
     1DATA_DIR="/Users/lorinc/Library/Application\ Support/Bitcoin"; \
     2hyperfine \
     3  --sort command \
     4  --runs 1 \
     5  --parameter-list COMMIT 688c03597afb0b76077f1ffc4608eef19481056e \
     6  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
     7    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
     8    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
     9  --cleanup "grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    10  "./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32"
    11
    12Benchmark 1: ./build/bin/bitcoind -datadir=/Users/lorinc/Library/Application\ Support/Bitcoin -stopatheight=916000 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 -par=32 (COMMIT = 688c03597afb0b76077f1ffc4608eef19481056e)
    13  Time (abs ≡):        7109.626 s               [User: 17210.848 s, System: 12938.964 s]
    

    note, the commit had:

    0      m_input_fetcher{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, 10 * MAX_SCRIPTCHECK_THREADS)},
    
  156. l0rinc changes_requested
  157. l0rinc commented at 4:17 pm on September 29, 2025: contributor
  158. in src/inputfetcher.h:86 in 688c03597a
    81+
    82+    //! Internal function that does the fetching from disk.
    83+    void Loop(int32_t index, bool is_main_thread = false) noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    84+    {
    85+        auto local_batch_size{0};
    86+        auto end_index{0};
    


    l0rinc commented at 5:41 pm on September 29, 2025:
    I think we should add exact types here to make sure calculations like end_index - local_batch_size can’t underflow

    andrewtoth commented at 2:44 pm on October 2, 2025:
    I rewrote this part, these are gone now.
  159. in src/bench/inputfetcher.cpp:41 in 688c03597a outdated
    42+    DelayedCoinsView db(DELAY);
    43+    CCoinsViewCache cache(&db);
    44+
    45+    // The main thread should be counted to prevent thread oversubscription, and
    46+    // to decrease the variance of benchmark results.
    47+    const auto worker_threads_num{GetNumCores() - 1};
    


    l0rinc commented at 5:44 pm on September 29, 2025:
    I’m a bit conflicted here: this way we’re all measuring something slightly different - which is especially problematic since the work here isn’t even CPU bound. What if we did a min of npcu and 4?
  160. Raimo33 commented at 2:23 pm on October 1, 2025: contributor
    Concept ACK
  161. andrewtoth force-pushed on Oct 2, 2025
  162. andrewtoth force-pushed on Oct 2, 2025
  163. andrewtoth force-pushed on Oct 2, 2025
  164. andrewtoth force-pushed on Oct 2, 2025
  165. andrewtoth force-pushed on Oct 2, 2025
  166. andrewtoth commented at 3:47 pm on October 2, 2025: contributor

    Updated the input fetcher significantly:

    • uses counting_semaphores to synchronize threads instead of mutex + condvar.
    • stores tx + vin indexes in global vector instead of copying the COutPoints. The COutPoints are read from a global CBlock pointer.
    • The fetch queue counter is an atomic int instead of a mutex guarded int.
  167. andrewtoth force-pushed on Oct 3, 2025
  168. andrewtoth force-pushed on Oct 3, 2025
  169. andrewtoth force-pushed on Oct 3, 2025
  170. andrewtoth commented at 5:33 pm on October 3, 2025: contributor
    Removed m_batch_size. Each thread now increments the atomic counter by 1.
  171. andrewtoth force-pushed on Oct 3, 2025
  172. l0rinc commented at 0:53 am on October 4, 2025: contributor
    The latest version seems very promising, I like that the algorithms is getting simpler. I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost - since we already have everything in the cache, so it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that.
  173. andrewtoth force-pushed on Oct 4, 2025
  174. l0rinc commented at 6:52 pm on October 7, 2025: contributor

    Compared it against master on a Raspberry Pi 5 synchronizing from real peers for realism, ran it twice for good measure until 917000 blocks with dbcache 450:

    This isn’t the latest version of the PR, but should likely be representative anyway.

     0COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
     1STOP=917000; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
     5(echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
     6hyperfine \
     7  --sort command \
     8  --runs 1 \
     9  --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    10  --parameter-list COMMIT ${COMMITS// /,} \
    11  --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    12    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    13    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
    14  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    15             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    16  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    17
    18a8f9a80675 validation: fetch block inputs in parallel
    19f0dc19f168 coins: allow emplacing non-dirty coins internally
    20
    21IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    22
    23Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
    24  Time (abs ):        47485.682 s               [User: 79615.847 s, System: 9374.261 s]
    25 
    26Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    27  Time (abs ):        56374.354 s               [User: 78807.079 s, System: 10196.290 s]
    28 
    29Relative speed comparison
    30        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
    31        1.19          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    
     0COMMITS="a8f9a806751b5755bdec5b096186f70c0bfddcfa f0dc19f16826f68ef482acfb7b24e8bb7168fc51"; \
     1STOP=917000; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
     5(echo "" && echo "IBD | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
     6hyperfine \
     7  --sort command \
     8  --runs 1 \
     9  --export-json "$BASE_DIR/ibd-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    10  --parameter-list COMMIT ${COMMITS// /,} \
    11  --prepare "killall bitcoind 2>/dev/null; rm -rf $DATA_DIR/*; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    12    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    13    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=1 -printtoconsole=0; sleep 20" \
    14  --cleanup "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    15             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    16  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -blocksonly -printtoconsole=0"
    17
    18a8f9a80675 validation: fetch block inputs in parallel
    19f0dc19f168 coins: allow emplacing non-dirty coins internally
    20
    21IBD | 917000 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    22
    23Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
    24  Time (abs ):        45907.874 s               [User: 81006.258 s, System: 10039.919 s]
    25 
    26Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    27  Time (abs ):        55612.464 s               [User: 81830.349 s, System: 11913.754 s]
    28 
    29Relative speed comparison
    30        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = a8f9a806751b5755bdec5b096186f70c0bfddcfa)
    31        1.21          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=917000 -dbcache=450 -blocksonly -printtoconsole=0 (COMMIT = f0dc19f16826f68ef482acfb7b24e8bb7168fc51)
    

    The variance between the runs shows 1% for master and 3% difference for the PR indicating that we’re likely nearing the network bandwidth limitations.

  175. andrewtoth force-pushed on Oct 11, 2025
  176. andrewtoth force-pushed on Oct 11, 2025
  177. andrewtoth force-pushed on Oct 14, 2025
  178. andrewtoth commented at 8:28 pm on October 14, 2025: contributor

    I noticed that for small dbcache it has a very noticeable effect, but for very high dbcache this seems to add an extra cost - since we already have everything in the cache, so it just does useless work. I wonder if we could enable this fetching only after the very first time we Flush and erase, since it cannot help in any way before that. @l0rinc There is already quite a lot to review here, and your benchmarks (and mine) show very promising results. So, I would prefer to keep this idea as a follow-up. We can do isolated benchmarks with your suggested change afterwards and propose an improvement accordingly.

  179. l0rinc commented at 9:04 pm on October 14, 2025: contributor
    I’m fine with doing that in a follow-up if you think it’s too complicated (though it’s likely quite simple, we can just track the very first cache miss and always prefetch after that - that heuristic would even survive node restarts. Maybe we need to skip the Bip30 values though, but it’s just a heuristic anyway). But I don’t yet see this PR as close to being final yet - do you? I still want to review it thoroughly, I don’t think we should ossify yet :)
  180. DrahtBot added the label Needs rebase on Oct 15, 2025
  181. andrewtoth force-pushed on Oct 16, 2025
  182. andrewtoth force-pushed on Oct 16, 2025
  183. DrahtBot removed the label Needs rebase on Oct 16, 2025
  184. andrewtoth commented at 2:07 am on October 16, 2025: contributor
    Rebased due to conflicts. Removed the first commit. We don’t need to modify EmplaceCoinInternalDANGER, we can just insert the coins as dirty into the cache. They will be set dirty when they are spent anyways. If a block fails validation inside ConnectBlock, then the dirty coins will just rewrite the same value to the db on the next flush or sync.
  185. in src/inputfetcher.h:35 in 063946d6bd outdated
    30+private:
    31+    /**
    32+     * The flattened indexes to each input in the block. The first item in the
    33+     * pair is the index of the tx, and the second is the index of the vin.
    34+     */
    35+    std::vector<std::pair<size_t, size_t>> m_inputs{};
    


    l0rinc commented at 11:56 pm on October 16, 2025:

    As far as I understood the tx index and the vin index cannot exceed the limits of a uint32_t, see https://github.com/bitcoin/bitcoin/blob/e744fd1249bf9577274614eaf3997bf4bbb612ff/src/primitives/transaction.h#L32

    Please consider std::vector<std::pair<uint32_t, uint32_t>> instead, which would likely halve the memory footprint. Based on https://godbolt.org/z/Wb918bWaM it seems to me this layout allows modern compilers to coalesce the two member accesses into a single, optimal 64-bit load from memory.

    Packing it into a single uint64_t seems to be the best, but that’s completely unreadable, so I’d go with the above std::pair<uint32_t, uint32_t>.


    andrewtoth commented at 1:59 pm on October 28, 2025:
    Done, using uint32_t in the Input struct now, not sure if it will have any effect in the struct though.
  186. in src/validation.cpp:3142 in 64de911053 outdated
    3136@@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137     LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138              Ticks<MillisecondsDouble>(time_2 - time_1));
    3139     {
    3140+        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    3141+
    3142         CCoinsViewCache view(&CoinsTip());
    


    l0rinc commented at 2:02 pm on October 18, 2025:

    I have played with this to see if we can construct the new cache layer before the fetcher, so that it populates the new layer instead of reading & writing to the old one - seemed like a cleaner separation.

    But unfortunately we would need access to both in-memory cache layers inside FetchInputs in that case - to read for presence from the old cache and to write the newly fetched values to the new one (which will be flushed together with the new outputs when existing this scope).

    But given that we’re adding these missing entries only to spend them a moment later in the other cache layer (and to avoid having all of the missing ones require two-hop-lookups) we should still try reading from the stable cache and writing to the temporary one.

    Collecting the missing inputs to a separate cache would also help with benchmarking and testing since the underlying cache would only be modified once block connection finishes - while the complete diff would be in the top cache layer. We also shouldn’t add the entries to the cache if the block fails validation.


    andrewtoth commented at 1:30 pm on October 28, 2025:
    Done.
  187. in src/test/inputfetcher_tests.cpp:41 in 64de911053 outdated
    36+
    37+        Txid prevhash{Txid::FromUint256(uint256(1))};
    38+
    39+        for (auto i{1}; i < num_txs; ++i) {
    40+            CMutableTransaction tx;
    41+            const auto txid{m_rng.randbool() ? Txid::FromUint256(uint256(i)) : prevhash};
    


    l0rinc commented at 4:14 pm on October 18, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    Maybe we can mix it up a bit by something like

    0if (m_rng.randbool()) {
    1    prevhash = tx.GetHash(); // TODO This can theoretically simulate double spends
    2}
    

    andrewtoth commented at 2:04 pm on October 28, 2025:

    Does this mean the spent tx is never processed on the same thread currently?

    I don’t think that’s what it means. The same thread could fetch two inputs in a row.

    Maybe we can mix it up a bit by something like

    So we can use the same prevhash in difference txs? What is the benefit of this?

  188. in src/test/inputfetcher_tests.cpp:152 in 64de911053 outdated
    148+}
    149+
    150+BOOST_FIXTURE_TEST_CASE(fetch_no_inputs, InputFetcherTest)
    151+{
    152+    const auto& block{getBlock()};
    153+    for (auto i{0}; i < 3; ++i) {
    


    l0rinc commented at 4:34 pm on October 18, 2025:
    What’s the point of the loop in these, is there a state we’re changing in each iteration?

    andrewtoth commented at 1:31 pm on October 28, 2025:
    The InputFetcher is stateful, so this is making sure previous state does not leak into the next fetch phase.
  189. in src/test/fuzz/inputfetcher.cpp:49 in 64de911053 outdated
    42+{
    43+public:
    44+    std::optional<Coin> GetCoin(const COutPoint&) const override
    45+    {
    46+        abort();
    47+    }
    


    l0rinc commented at 4:52 pm on October 18, 2025:

    nit:

    0    std::optional<Coin> GetCoin(const COutPoint&) const override { std::abort(); }
    

    andrewtoth commented at 2:00 pm on October 28, 2025:
    Done.
  190. in src/test/fuzz/inputfetcher.cpp:26 in 64de911053 outdated
    21+class DbCoinsView : public CCoinsView
    22+{
    23+private:
    24+    DbMap& m_map;
    25+
    26+public:
    


    l0rinc commented at 4:53 pm on October 18, 2025:

    nit: in test code I’d strive for simpler code instead of needlessly “safe”

    0struct DbCoinsView : CCoinsView
    1{
    2    DbMap& m_map;
    

    andrewtoth commented at 2:00 pm on October 28, 2025:
    Done.
  191. in src/bench/inputfetcher.cpp:29 in 64de911053 outdated
    24+    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
    25+
    26+    std::optional<Coin> GetCoin(const COutPoint&) const override
    27+    {
    28+        UninterruptibleSleep(m_delay);
    29+        return Coin{};
    


    l0rinc commented at 4:54 pm on October 18, 2025:
    GetCoin shouldn’t return spent entries

    andrewtoth commented at 2:00 pm on October 28, 2025:
    Done.
  192. in src/bench/inputfetcher.cpp:24 in 64de911053 outdated
    19+{
    20+private:
    21+    std::chrono::milliseconds m_delay;
    22+
    23+public:
    24+    DelayedCoinsView(std::chrono::milliseconds delay) : m_delay(delay) {}
    


    l0rinc commented at 4:54 pm on October 18, 2025:
    this seems overly general to me, I think we can inline the delay for now

    andrewtoth commented at 2:00 pm on October 28, 2025:
    Done.
  193. in src/inputfetcher.h:153 in 64de911053 outdated
    148+            return;
    149+        }
    150+
    151+        m_db = &db;
    152+        m_cache = &cache;
    153+        m_block = &block;
    


    l0rinc commented at 4:55 pm on October 18, 2025:
    As mentioned before, I really dislike these lines, a fetcher shouldn’t change the internal state (especially since they’re const). Since we need a continuously running threads, can we package these to avoid state mutations? This would likely be solved by sending these off to a ThreadPool instance.

    andrewtoth commented at 1:36 pm on October 28, 2025:
    Indeed, this would be gracefully solved with #33689. We could pass all state into the worker threads via lambda capture.
  194. in src/test/fuzz/inputfetcher.cpp:120 in 64de911053 outdated
    115+
    116+            prevhash = tx.GetHash();
    117+            block.vtx.push_back(MakeTransactionRef(tx));
    118+        }
    119+
    120+        fetcher.FetchInputs(cache, db, block);
    


    l0rinc commented at 5:00 pm on October 18, 2025:

    We shouldn’t test this with invalid (empty) blocks:

    0        if (block.vtx.empty()) continue;
    1        fetcher.FetchInputs(cache, db, block);
    

    andrewtoth commented at 1:37 pm on October 28, 2025:
    Why not? The InputFetcher should not make assumptions about the structure of the block being passed in.

    l0rinc commented at 4:30 pm on October 28, 2025:
    Even consensus invalid ones? Or did I misunderstand the context here?

    andrewtoth commented at 4:38 pm on October 28, 2025:
    Yeah, InputFetcher doesn’t know if a block is consensus valid or not yet. It hasn’t passed ConnectBlock yet before entering here.
  195. in src/bench/inputfetcher.cpp:39 in 64de911053 outdated
    34+
    35+static void InputFetcherBenchmark(benchmark::Bench& bench)
    36+{
    37+    DataStream stream{benchmark::data::block413567};
    38+    CBlock block;
    39+    stream >> TX_WITH_WITNESS(block);
    


    l0rinc commented at 5:08 pm on October 18, 2025:

    nit:

    0    CBlock block;
    1    DataStream{benchmark::data::block413567} >> TX_WITH_WITNESS(block);
    

    andrewtoth commented at 2:00 pm on October 28, 2025:
    Done.
  196. in src/bench/inputfetcher.cpp:40 in 64de911053 outdated
    40+
    41+    DelayedCoinsView db(DELAY);
    42+    CCoinsViewCache cache(&db);
    43+
    44+    // The main thread should be counted to prevent thread oversubscription, and
    45+    // to decrease the variance of benchmark results.
    


    l0rinc commented at 5:09 pm on October 18, 2025:
    “should be counted” is the reason for the “- 1”?
  197. in src/bench/inputfetcher.cpp:42 in 64de911053 outdated
    42+    CCoinsViewCache cache(&db);
    43+
    44+    // The main thread should be counted to prevent thread oversubscription, and
    45+    // to decrease the variance of benchmark results.
    46+    const auto worker_threads_num{GetNumCores() - 1};
    47+    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    


    l0rinc commented at 5:12 pm on October 18, 2025:

    nit: if we keep the processor count here (which kinda’ makes the benchmark measure different things on different platforms, but I don’t really have a better idea, unless we assume that even the simplest machines where this matters (e.g. rpi4) already have at least 4 threads - and since this isn’t even CPU bound, we shouldn’t pretend that it does):

    0    const auto worker_threads_num{size_t(GetNumCores() - 1)};
    1    const InputFetcher fetcher{worker_threads_num};
    

    or

    0    const InputFetcher fetcher{/*max_thread_count=*/4};
    

    andrewtoth commented at 1:38 pm on October 28, 2025:
    Why don’t we want different benchmarks on different machines? All benchmarks are subtly different depending on the host machine.

    l0rinc commented at 4:32 pm on October 28, 2025:
    Yeah, but we want to understand where the differences are coming from, otherwise we’d have the “faster-on-my-machine” syndrome. If you disagree, just resolve the issue.
  198. in src/bench/inputfetcher.cpp:50 in 64de911053 outdated
    45+    // to decrease the variance of benchmark results.
    46+    const auto worker_threads_num{GetNumCores() - 1};
    47+    InputFetcher fetcher{static_cast<size_t>(worker_threads_num)};
    48+
    49+    bench.run([&] {
    50+        const auto ok{cache.Flush()};
    


    l0rinc commented at 5:17 pm on October 18, 2025:
    Doesn’t this change the behavior of the benchmark after the first iteration?

    andrewtoth commented at 1:39 pm on October 28, 2025:
    Fixed in latest iteration.
  199. in src/validation.cpp:6286 in 64de911053 outdated
    6266@@ -6265,6 +6267,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6267 
    6268 ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6269     : m_script_check_queue{/*batch_size=*/128, std::clamp(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    6270+      m_input_fetcher{std::clamp<size_t>(options.worker_threads_num, 0, MAX_SCRIPTCHECK_THREADS)},
    


    l0rinc commented at 5:25 pm on October 18, 2025:

    I don’t understand what 0 threads means. It likely means turn off prefetching.

    I would consider it to be a lot more intuitive if this started with 1 (and not be bound by the number of unrelated MAX_SCRIPTCHECK_THREADS since it’s not script related and not even CPU bound).

    Note: chainstatemanager_snapshot_init creates it with 0 workers by default, not sure it’s intended


    andrewtoth commented at 1:41 pm on October 28, 2025:
    0 threads means it is turned off, yes. If -par=1 is configured, we want to pass 0 to input fetcher to disable prefetching. Also if options.worker_threads_num was negative for some reason. Do you have a suggestion of how this can be made more clear?
  200. in src/inputfetcher.h:119 in 64de911053 outdated
    108+                }
    109+                if (m_cache->HaveCoinInCache(outpoint)) {
    110+                    continue;
    111+                }
    112+                if (auto coin{m_db->GetCoin(outpoint)}; coin) {
    113+                    m_coins[thread_index].emplace_back(outpoint, std::move(*coin));
    


    l0rinc commented at 5:38 pm on October 18, 2025:
    It seems to me that since the original m_coins is never written by different threads, we shouldn’t have a false sharing problem here - right?

    andrewtoth commented at 1:43 pm on October 28, 2025:
    Possibly. This is no longer relevant in the current implementation.
  201. in src/inputfetcher.h:84 in 64de911053 outdated
    73+    const CBlock* m_block{nullptr};
    74+
    75+    std::vector<std::thread> m_worker_threads;
    76+    std::counting_semaphore<> m_start_semaphore{0};
    77+    std::counting_semaphore<> m_complete_semaphore{0};
    78+    std::atomic<bool> m_request_stop{false};
    


    l0rinc commented at 5:41 pm on October 18, 2025:

    This style of dynamic work-stealing seems too complicated for such a uniform problem - static slicing would likely be a lot simpler and I would expect it to perform equally well.

    I also realize that some of these are needed to avoid recreating the threads for every call. Currently the parallelism and the thread recreation are done in a single commit - could we first implement multithreading with thread recreation, and do the thread reuse as a separate concern (though a ThreadPool with dedicated test)?


    andrewtoth commented at 2:05 pm on October 28, 2025:
    It should be simpler now. I’m not sure if this comment is still valid though with the current approach. I agree if we already had a ThreadPool this would be much cleaner.
  202. in src/test/inputfetcher_tests.cpp:131 in 64de911053 outdated
    129+
    130+        // Add all inputs as spent already in cache
    131+        for (const auto& tx : block.vtx) {
    132+            for (const auto& in : tx->vin) {
    133+                auto outpoint{in.prevout};
    134+                Coin coin{}; // Not setting nValue implies spent
    


    l0rinc commented at 6:09 pm on October 18, 2025:
    0                Coin coin{};
    1                assert(coin.IsSpent());
    

    andrewtoth commented at 2:01 pm on October 28, 2025:
    Done.
  203. andrewtoth force-pushed on Oct 19, 2025
  204. andrewtoth commented at 3:26 pm on October 19, 2025: contributor
    Updated to use std::barrier for the completion synchronization instead of acquiring a semaphore for each thread, as suggested by @l0rinc .
  205. andrewtoth force-pushed on Oct 19, 2025
  206. andrewtoth force-pushed on Oct 19, 2025
  207. andrewtoth force-pushed on Oct 19, 2025
  208. andrewtoth force-pushed on Oct 19, 2025
  209. andrewtoth force-pushed on Oct 19, 2025
  210. in src/inputfetcher.h:96 in a07936a62c
    91+            m_start_semaphore.acquire();
    92+            if (m_request_stop.load(std::memory_order_relaxed)) {
    93+                return;
    94+            }
    95+            Work(thread_index);
    96+            [[maybe_unused]] const auto arrival_token{m_complete_barrier.arrive()};
    


    l0rinc commented at 1:23 am on October 21, 2025:

    Would this suffice?

    0            (void)m_complete_barrier.arrive();
    
  211. in src/inputfetcher.h:190 in a07936a62c
    185+        m_input_counter.store(0, std::memory_order_relaxed);
    186+        m_start_semaphore.release(m_worker_threads.size());
    187+
    188+        // Have the main thread work too before we wait for other threads
    189+        Work(m_worker_threads.size());
    190+        m_complete_barrier.arrive_and_wait();
    


    l0rinc commented at 1:29 am on October 21, 2025:

    What’s the reason for not doing the completion here instead of the OnCompletion workaround?

    0        m_complete_barrier.arrive_and_wait();
    1        for (auto& coins : m_coins) {
    2            for (auto& [outpoint, coin] : coins) {
    3                m_cache->EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin));
    4            }
    5            coins.clear();
    6        }
    
  212. in src/inputfetcher.h:149 in a07936a62c
    144+        }
    145+    }
    146+
    147+public:
    148+    explicit InputFetcher(size_t worker_thread_count) noexcept
    149+        : m_complete_barrier{static_cast<int32_t>(worker_thread_count + 1), OnCompletionWrapper{this}}
    


    l0rinc commented at 1:30 am on October 21, 2025:

    std::ptrdiff_t seems more appropriate here: https://en.cppreference.com/w/cpp/thread/barrier/barrier.html

    0    explicit InputFetcher(size_t worker_thread_count) noexcept : m_complete_barrier{std::ptrdiff_t(worker_thread_count + 1)}
    
  213. in src/inputfetcher.h:149 in a07936a62c outdated
    160+    }
    161+
    162+    //! Fetch all block inputs from db, and insert into cache.
    163+    void FetchInputs(CCoinsViewCache& cache, const CCoinsView& db, const CBlock& block) noexcept
    164+    {
    165+        if (block.vtx.size() <= 1 || m_worker_threads.size() == 0) {
    


    l0rinc commented at 2:21 pm on October 21, 2025:
    wouldn’t m_worker_threads.size() == 0 be an error?

    andrewtoth commented at 1:44 pm on October 28, 2025:
    No, we can have no worker threads, for instance if started with -par=1. In that case just disable prefetching.
  214. in src/bench/inputfetcher.cpp:18 in a07936a62c outdated
    13+#include <util/time.h>
    14+
    15+static constexpr auto DELAY{2ms};
    16+
    17+//! Simulates a DB by adding a delay when calling GetCoin
    18+class DelayedCoinsView : public CCoinsView
    


    l0rinc commented at 2:24 pm on October 21, 2025:

    I personally would favor going for simpler code as opposed to going for theoretically better encapsulation, similarly to current inputfetcher_tests.cpp:

    0struct DelayedCoinsView : CCoinsView
    

    andrewtoth commented at 2:01 pm on October 28, 2025:
    Done.
  215. in src/bench/inputfetcher.cpp:28 in a07936a62c outdated
    27+    {
    28+        UninterruptibleSleep(m_delay);
    29+        return Coin{};
    30+    }
    31+
    32+    bool BatchWrite(CoinsViewCacheCursor&, const uint256&) override { return true; }
    


    l0rinc commented at 2:25 pm on October 21, 2025:

    We can add better assertions if we count the iterator size here:

    0bool BatchWrite(CoinsViewCacheCursor& cursor, const uint256&) override
    1{
    2    for (auto it{cursor.Begin()}; it != cursor.End(); it = cursor.NextAndMaybeErase(*it)) {
    3        m_write_count++;
    4    }
    5    return true;
    6}
    

    So the bench could be (without flush which makes the bench runs equivalent):

    0bench.run([&] {
    1    CCoinsViewCache block_cache{&cache};
    2    fetcher.FetchInputs(cache, block_cache, db, block);
    3    assert(db.m_write_count == 0 && cache.GetCacheSize() == 0 && block_cache.GetCacheSize() == 4599);
    4});
    

    andrewtoth commented at 1:49 pm on October 28, 2025:
    When do we do a batch write though in this example?

    l0rinc commented at 4:28 pm on October 28, 2025:
    Originally we flushed, so if we still want to, we may want to extend this to assert that behavior. Or avoid flushing and simplify the bench. Or add the flushing behavior above to a test, etc.

    andrewtoth commented at 4:37 pm on October 28, 2025:
    We avoid flushing now.
  216. in src/validation.cpp:3140 in a07936a62c
    3136@@ -3137,6 +3137,8 @@ bool Chainstate::ConnectTip(
    3137     LogDebug(BCLog::BENCH, "  - Load block from disk: %.2fms\n",
    3138              Ticks<MillisecondsDouble>(time_2 - time_1));
    3139     {
    3140+        m_chainman.FetchInputs(CoinsTip(), CoinsDB(), *block_to_connect);
    


    l0rinc commented at 2:30 pm on October 21, 2025:

    Given that we already create the actual temporary top cache here, it would be great if we could separate the reading and writing (read from old/big in-memory cache and write to the temporary small one):

    0CCoinsViewCache* cache{&CoinsTip()};
    1CCoinsViewCache new_cache{cache};
    2m_chainman.FetchInputs(*cache, new_cache, CoinsDB(), *block_to_connect);
    

    This would also mean that the actually read changes are read after that from the top-layer cache instead of always requiring two hops (assuming many missing ones). It also makes sense to not add inputs from blocks that fail validation - which we’d be getting for free this way.


    andrewtoth commented at 1:50 pm on October 28, 2025:
    Done.
  217. l0rinc commented at 3:32 pm on October 21, 2025: contributor

    I love the new structure, it’s a lot easier to track the progress compared to previous versions. It’s also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet! It also pairs well with other Siphash related changes that are in the pipeline.

    I have reimplemented most of it locally to make sure I can do a meaningful review, see https://github.com/l0rinc/bitcoin/pull/47 for my attempt. It’s not finished, I’m still experimenting with different alternatives to make sure we can make this as simple and useful as possible (e.g. using barriers, filtering and sorting before fetch, adding dedicated ThreadPool, splitting reading and writing caches etc.), but wanted to publish my observations so far.

    My remaining biggest concern is that the threadpool should definitely be untangled from the InputFetcher logic (I hate concurrent mutability), it’s an independent concern mixed with non-trivial fetcher logic. We also shouldn’t parallelize based on CPU in the first place, I don’t see why we’d do that. And the ThreadPool part still needs independent tests.

    I have tried fetching everything on a single thread and the same with sorted UTXOs and it does seem to be ~6% faster on an SSD (I’d expected even bigger difference on HDD, still measuring that) - but I don’t have a final solution that’s faster than everything else (since I could only solve it by single-threaded filtering which is slow).

    0b6ccd542fd single-threaded NO sort
    1c50a6bd981 single-threaded + sorted fetch
    2
    3reindex-chainstate | 700000 blocks | dbcache 100 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD
    4 
    5Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = b6ccd542fdc371d3ecd90164169e3d5d7c60e82d)
    6  Time (abs ≡):        11322.014 s               [User: 20431.572 s, System: 1375.995 s]
    7 
    8Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=700000 -dbcache=100 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = c50a6bd9815ebb2e0d86d5e6b679da5ac021b796)
    9  Time (abs ≡):        10646.049 s               [User: 19231.516 s, System: 1268.395 s]
    
  218. andrewtoth commented at 4:55 pm on October 21, 2025: contributor

    Thank you for your review @l0rinc!

    I love the new structure, it’s a lot easier to track the progress compared to previous versions. It’s also measurably faster than previous versions, we seem to be nearing ~20% - sweeeet!

    :rocket:

    we can make this as simple and useful as possible

    What else do you have in mind for this being useful? It has a very focused purpose in my view. I would love to make it as simple as possible.

    using barriers

    Done :)

    filtering and sorting before fetch

    Can you expand on why we would want this? Prefiltering on the main thread was always slower than filtering in parallel when I was testing this idea.

    adding dedicated ThreadPool

    Yes, see #26966. Hopefully we can pull that out and it can be used here. See https://github.com/andrewtoth/bitcoin/tree/test-thread-pool.

    splitting reading and writing caches

    I’m not sure what you mean by this? There is only one cache, and we read from it concurrently, then write to it on a single thread. Can you expand on how we can split the cache into two? Also, why would we want separate caches?

    My remaining biggest concern is that the threadpool should definitely be untangled from the InputFetcher logic

    Is this not accomplished by ThreadLoop, Work and OnCompletion functions? The Work function has no thread pool logic at all, it is completely independent and can be run by one thread or many. Do you have any concrete suggestions, or specific codepaths that are tangled that are concerning?

    I hate concurrent mutability

    I’m not sure I understand this. There is no concurrent mutability in my implementation. That would be undefined behavior. Can you point out the data members that are being concurrently mutated and how? Perhaps we have different definitions of concurrent mutability.

    We also shouldn’t parallelize based on CPU in the first place, I don’t see why we’d do that.

    This is following the logic of the check queue. Ideally we could reuse both threads in the same threadpool in the future. Do you have any concrete recommendations on how we many threads we should run? Using the number of threads per CPU is yielding a 20% speed boost, so it seems like a sane enough choice.

  219. l0rinc commented at 6:02 pm on October 24, 2025: contributor

    Let me summarize our offline discussions:

    Cache hierarchy

    During block connection we’re adding an extra temporary in-memory dbcache layer on top so that whatever happens during block connection doesn’t end up polluting the big dbcache or leveldb. I think we should take advantage of this and use the temporary top layer to collect the missing inputs there:

    • if block validation fails we can just throw it out
    • it’s easier to test and benchmark, we can rerun the same operation without changing the underlying state
    • the missing values will all be fetched from the top layer now, avoiding two-hop lookups
    • since the threads aren’t reading from the temp cache and aren’t writing to the big cache, we can copy the needed inputs from the big in-memory cache contents on the main thread as well to the temporary top layer (without locking) while the other threads are fetching from the DB (since those are IO bound, this being CPU bound). That would create a dedicated dbcache per blocks which we could eventually flush independently.

    This might need some workarounds, but it does enable new cache invalidation opportunities.

    Sorted fetch

    Since LevelDB writes (via the CDBBatch and MemTable) are already sorted and result in a significant speedup compared to inserting the values one-by-one (though likely this isn’t just the effect of sorting), I thought it would make sense to experiment with sorted fetches as well. I have added an extra fetch (similarly to this PR, see https://github.com/l0rinc/bitcoin/commit/b72f67d4a88495a0222cbb9ae825daa6ea38e4df) before calling ConnectBlock so that the cache is pre-warmed, but on a single thread. This is the baseline, in the next commit after gathering the missing values I’m sorting them and doing the fetch from the db one-by-one, but in a sorted order.

    The results indicate that sorted fetching is a lot faster, especially on lower-end devices (i7 with HDD and Rpi5 with SSD).

     0COMMITS="7d27af98c7cf858b5ab5a02e64f89a857cc53172 ccc748b05858a9ebeb375cc1f3e7426698394470 ead8da2b33117807e27a66f814cf11cdf676d194"; \
     1STOP=919191; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
     5(echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
     6hyperfine \
     7  --sort command \
     8  --runs 1 \
     9  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    10  --parameter-list COMMIT ${COMMITS// /,} \
    11  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    12    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    13    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    14  --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    15             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    16  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    17
    187d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz
    19ccc748b058 coins: prefetch inputs on single thread
    20ead8da2b33 coins: sorted input prefetch
    21
    22reindex-chainstate | 919191 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    23
    24Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
    25  Time (abs ):        42271.083 s               [User: 66842.872 s, System: 8003.776 s]
    26 
    27Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
    28  Time (abs ):        40776.988 s               [User: 67394.771 s, System: 7130.144 s]
    29 
    30Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
    31  Time (abs ):        38435.516 s               [User: 64537.357 s, System: 6716.634 s]
    32 
    33Relative speed comparison
    34        1.10          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)
    35        1.06          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)
    36        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)
    

    (interestingly it seems this is even faster than master by a lot, we’re not yet sure what’s causing that since the fetches are still single-threaded)

    Similar speedup on HDD:

     0STOP=919191; DBCACHE=450; \                                                                                                                   
     1CC=gcc; CXX=g++; \                                                                                                                                                                  
     2BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                                           
     3(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                                      
     4(echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cor
     5es | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || e
     6cho HDD)"; echo "") &&\                                                                                                                                                             
     7hyperfine \                                  
     8  --sort command \                           
     9  --runs 1 \                                 
    10  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                                      
    11  --parameter-list COMMIT ${COMMITS// /,} \                                               
    12  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                                  
    13    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                                  
    14    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                                        
    15  --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                                   
    16             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                                
    17  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"                         
    187d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                             
    19ccc748b058 coins: prefetch inputs on single thread                                        
    20ead8da2b33 coins: sorted input prefetch                                                   
    21
    22reindex-chainstate | 919191 blocks | dbcache 450 | i7-hdd | x86_64 | Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz | 8 cores | 62Gi RAM | ext4 | HDD                                      
    23
    24Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                                        
    25  Time (abs ):        42897.195 s               [User: 38613.661 s, System: 3154.561 s]                                                                                            
    26                                             
    27Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                                               
    28  Time (abs ):        42015.404 s               [User: 40096.242 s, System: 3180.038 s]                                                                                            
    29                                             
    30Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                                               
    31  Time (abs ):        40176.361 s               [User: 38614.897 s, System: 3047.461 s]                                                                                            
    32                                             
    33Relative speed comparison                    
    34        1.07          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                               
    35        1.05          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                               
    36        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)   
    

    On a very powerful i9 with very performant SSD sorting isn’t faster than master, but it’s still a lot faster than random fetching:

     0STOP=919191; DBCACHE=450; \                                                                                                      
     1CC=gcc; CXX=g++; \                                                                                                                                                     
     2BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \                                                                              
     3(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \                                                         
     4(echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\                                  
     5hyperfine \                              
     6  --sort command \                       
     7  --runs 1 \                             
     8  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \                                                         
     9  --parameter-list COMMIT ${COMMITS// /,} \                                        
    10  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \                                     
    11    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \                                                     
    12    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \                                                           
    13  --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \                                                                                      
    14             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \                                                                   
    15  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"            
    16
    177d27af98c7 Merge bitcoin/bitcoin#33461: ci: add Valgrind fuzz                      
    18ccc748b058 coins: prefetch inputs on single thread                                 
    19ead8da2b33 coins: sorted input prefetch                                            
    20
    21reindex-chainstate | 919191 blocks | dbcache 450 | i9-ssd | x86_64 | Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz | 16 cores | 62Gi RAM | xfs | SSD                        
    22
    23Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)                    
    24  Time (abs ):        20383.488 s               [User: 38259.799 s, System: 2702.188 s]                                                                               
    25                                         
    26Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)                    
    27  Time (abs ):        21213.645 s               [User: 39728.945 s, System: 2709.999 s]                                                                               
    28                                         
    29Benchmark 3: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)                    
    30  Time (abs ):        20476.751 s               [User: 38448.254 s, System: 2597.600 s]                                                                               
    31                                         
    32Relative speed comparison                
    33        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 7d27af98c7cf858b5ab5a02e64f89a857cc53172)           
    34        1.04          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ccc748b05858a9ebeb375cc1f3e7426698394470)           
    35        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = ead8da2b33117807e27a66f814cf11cdf676d194)  
    

    Theoretically this can also be affected by LevelDB’s internal options.block_cache and options.block_size and iteroptions.fill_cache (currently experimenting with different values to see if it makes any difference). This is why I have suggested trying a read-only snapshot per thread, if they’re reading sorted values the previous path could theoretically be caches which can help avoid a few lookups.

    Separate ThreadPool

    I think we should be able to use a single barrier for starting and stopping, assuming that we either want to use all of them or none of them, something like: https://github.com/l0rinc/bitcoin/commit/67e79e041b524eb1b0039ff39d4fc5b235d89b8f#diff-7ad6c5646dfd3749d2634861dfaf98e8c6fbc041a8e8eded973ff39929acffd7 or #33689 This would also help with completely separating the multithreaded part from the Inputfetcher which would also allow testing that critical behavior separately. This would also help with minimizing the mutable state which could theoretically allow concurrent calls to swap out the work from under the started threads (so likely the m_task in my example should be atomic, haven’t given it enough thought).

    Since sorted-fetching currently needs a preprocessing step, we could use these available threads for filtering operations as well (since single-threaded filtering is slow, as proven above) - we could filter on all threads for missing, sort the missing on main thread and spread out to all threads for db fetching - leaving the main thread to do other things (since db fetching isn’t really CPU bound, no need to involve the main thread), such as the mentioned outer-layer warming from the main in-memory dbcache. And alternative to the “fetch-missing & sort & get” on a single thread is to try “fetch-all & sort & filter & get” instead which is more parallelizable. Since everything in the outer-layer will be spent anyway, in the future maybe we could even flush them on a background thread to db (given that we’ve just copied the items to the temp cache, after successful flush we could just remove them from the main cache) - which would likely simplify the main dbcache’s dirty and spent behaviors, but that’s outside of the scope of this PR but could provide extra motivation for these changes).

  220. andrewtoth force-pushed on Oct 27, 2025
  221. andrewtoth force-pushed on Oct 27, 2025
  222. andrewtoth commented at 1:58 am on October 27, 2025: contributor

    Thank you @l0rinc for your detailed review and suggestions! I have taken some of them. The input fetcher has now been redesigned.

    • Coins are written to the ephemeral cache that is created just to be used in ConnectBlock, instead of the main cache. This requires a new method in CCoinsViewCache - GetPossiblySpentCoinFromCache. Since we write to an empty cache instead of CoinsTip(), we could insert a Coin that exists in the db but is spent in CoinsTip(). Previously that insertion would fail since we were inserting again into CoinsTip() which would not overwrite the spent coin.
    • Because of this we can safely write to the ephemeral cache on the main thread, since the worker threads will be reading from a different cache. So the main thread does not do any fetching, but it writes fetched Coins to the cache in parallel as workers fetch them. It is a lock-free MPSC queue. The workers and main thread synchronize on an atomic Status member of each input. The workers set it to READY while the main thread spins on it until it is no longer WAITING. This is a substantial speed improvement over the previous version, and it also lets us insert Coins from the main cache into the ephemeral cache faster. So this change speeds up block connection even when there are no cache misses, and makes the workers utilize more CPU rather than just IO.
    • A single barrier is used to synchronize the threads. Both the main and worker threads call arrive_and_wait before they begin their work and after completion.

    I removed the last commit, and instead add the InputFetcher as multi threaded initially. It didn’t make sense to split it since the worker threads and main threads do different things now.

    I haven’t been able to effectively utilize a sorting strategy. Sorting the inputs before fetching doesn’t seem to have a benefit in the multi threaded approach. The overhead of copying the COutPoints and sorting them was always slower. I think the parallel fetching dominates any speedup achieved by sorted fetching anyways.

  223. andrewtoth force-pushed on Oct 27, 2025
  224. l0rinc commented at 9:48 pm on October 27, 2025: contributor

    Looks like I forgot to push the comments last time - that explains your surprise. I pushed all of them, many are out of date, please just resolve them. Sorry for the confusion.

    Since I’m reposting these old comments (since commented from the main GitHub view), lemme’ post a recently finished Rpi4 benchmark showing a 31% speedup for an older push \:D/

     0COMMITS="063946d6bd78035276d12e070a208d84492ac5cd 64de91105312d36dadb5f71ec01fc6af9b14da69"; \
     1STOP=919191; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
     5(echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
     6hyperfine \
     7  --sort command \
     8  --runs 1 \
     9  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    10  --parameter-list COMMIT ${COMMITS// /,} \
    11  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    12    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    13    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    14  --conclude "cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log && \
    15             grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    16  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    17
    18063946d6bd coins: add inputfetcher
    1964de911053 coins: fetch coins on parallel threads
    20
    21reindex-chainstate | 919191 blocks | dbcache 450 | rpi4-8-1 | aarch64 | Cortex-A72 | 4 cores | 7.6Gi RAM | ext4 | SSD
    22
    23Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
    24  Time (abs ):        329713.086 s               [User: 172834.637 s, System: 87207.916 s]
    25 
    26Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
    27  Time (abs ):        251976.600 s               [User: 177923.005 s, System: 116804.580 s]
    28 
    29Relative speed comparison
    30        1.31          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 063946d6bd78035276d12e070a208d84492ac5cd)
    31        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=919191 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 64de91105312d36dadb5f71ec01fc6af9b14da69)
    
  225. andrewtoth force-pushed on Oct 28, 2025
  226. l0rinc commented at 8:40 pm on October 29, 2025: contributor

    I’d say it’s time to update the PR description:

     0COMMITS="cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85 2aa510348143521a14146e41b5cf87cb3e60b29e"; \
     1STOP=921129; DBCACHE=450; \
     2CC=gcc; CXX=g++; \
     3BASE_DIR="/mnt/my_storage"; DATA_DIR="$BASE_DIR/BitcoinData"; LOG_DIR="$BASE_DIR/logs"; \
     4(echo ""; for c in $COMMITS; do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
     5(echo "" && echo "reindex-chainstate | ${STOP} blocks | dbcache ${DBCACHE} | $(hostname) | $(uname -m) | $(lscpu | grep 'Model name' | head -1 | cut -d: -f2 | xargs) | $(nproc) cores | $(free -h | awk '/^Mem:/{print $2}') RAM | $(df -T $BASE_DIR | awk 'NR==2{print $2}') | $(lsblk -no ROTA $(df --output=source $BASE_DIR | tail -1) | grep -q 0 && echo SSD || echo HDD)"; echo "") &&\
     6hyperfine \
     7  --sort command \
     8  --runs 1 \
     9  --export-json "$BASE_DIR/rdx-$(sed -E 's/(\w{8})\w+ ?/\1-/g;s/-$//'<<<"$COMMITS")-$STOP-$DBCACHE-$CC.json" \
    10  --parameter-list COMMIT ${COMMITS// /,} \
    11  --prepare "killall bitcoind 2>/dev/null; rm -f $DATA_DIR/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
    12    cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo -DENABLE_IPC=OFF && ninja -C build bitcoind -j2 && \
    13    ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=1000 -printtoconsole=0; sleep 20" \
    14  --conclude "killall bitcoind || true; sleep 5; cp $DATA_DIR/debug.log $LOG_DIR/debug-{COMMIT}-$(date +%s).log; \
    15              grep -q 'height=0' $DATA_DIR/debug.log && grep -q 'height=$STOP' $DATA_DIR/debug.log" \
    16  "COMPILER=$CC ./build/bin/bitcoind -datadir=$DATA_DIR -stopatheight=$STOP -dbcache=$DBCACHE -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0"
    17
    18cb0fdfdf37 coins: add inputfetcher
    192aa5103481 validation: fetch block inputs via InputFetcher before connecting
    20
    21reindex-chainstate | 921129 blocks | dbcache 450 | rpi5-16-2 | aarch64 | Cortex-A76 | 4 cores | 15Gi RAM | ext4 | SSD
    22
    23Benchmark 1: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
    24  Time (abs ):        40539.887 s               [User: 69358.879 s, System: 7393.185 s]
    25 
    26Benchmark 2: COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    27  Time (abs ):        32768.672 s               [User: 69495.022 s, System: 11880.553 s]
    28 
    29Relative speed comparison
    30        1.24          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = cb0fdfdf3704d5ffe6ccc634de6fdba6b7b57a85)
    31        1.00          COMPILER=gcc ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=921129 -dbcache=450 -reindex-chainstate -blocksonly -connect=0 -printtoconsole=0 (COMMIT = 2aa510348143521a14146e41b5cf87cb3e60b29e)
    

    With this change we’re getting a 9 hours full reindex-chainstate on an rpi5 with default 450MB memory. Wow!

  227. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads >10% faster IBD
    validation: fetch block inputs on parallel threads >20% faster IBD
    on Oct 29, 2025
  228. coins: add inputfetcher
    Co-authored-by: l0rinc <pap.lorinc@gmail.com>
    201110c6b3
  229. tests: add inputfetcher tests 47cffceec3
  230. bench: add inputfetcher bench 823ffd4ea9
  231. fuzz: add inputfetcher fuzz harness c79b6d2c3c
  232. validation: fetch block inputs via InputFetcher before connecting 1ebf2f02e6
  233. andrewtoth force-pushed on Oct 31, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-10-31 12:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me