validation: fetch block inputs on parallel threads 10% faster IBD #31132

pull andrewtoth wants to merge 6 commits into bitcoin:master from andrewtoth:threaded-inputs changing 11 files +666 −6
  1. andrewtoth commented at 2:40 pm on October 22, 2024: contributor

    When fetching inputs in ConnectBlock, each input is fetched from the cache in series. A cache miss means a round trip to the disk db to fetch the outpoint and insert it into the cache. Since the db is locked from being written during ConnectTip, we can fetch all block inputs missing from the cache in parallel on multiple threads before entering ConnectBlock. Using this strategy resulted in a 10% faster IBD.

    Doing IBD with 16 vcores from a local peer with default settings, stopping at height 850k:

    Command   Mean [s]               Min [s]     Max [s]     Relative
    branch    17065.138 ± 117.439    16982.096   17148.181   1.00
    master    18731.509 ± 94.142     18731.509   18864.646   1.10

    For later blocks, this change makes block connection even faster. Doing an assumeutxo sync from block 840k to 850k with 15 worker threads, this change is 26% faster. With just a single worker thread, this same benchmark is 6% faster. See the benchmark and flame graph with 15 worker threads, and the benchmark and flame graph with 1 worker thread.

    I have fuzzed for over 500 million iterations with the provided fuzz harness with no issues.

    This approach is heavily inspired by CCheckQueue, but we could not easily reuse it since it only checks for validity and doesn’t allow us to store results. So, this PR creates a new InputFetcher that loops through all inputs of a block on the main thread and adds their outpoints to a shared vector. Once the vector is written, the main thread and worker threads each claim ranges of outpoints from it, fetch the corresponding coins from the db, and push the results onto a thread-local vector. Once the threads have finished reading all inputs, the main thread loops through all the thread-local vectors and inserts the results into the cache.
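
    A minimal sketch of that flow (illustrative only: the real implementation lives in src/inputfetcher.h, handles batching and thread wake-up, and lets the main thread join in the fetching; the exact signatures here, including the set_dirty parameter, are assumptions based on the discussion below):

        // Sketch: collect missing outpoints once, fetch them on N threads into
        // per-thread result vectors, then merge into the cache on one thread.
        // The real fetcher also skips inputs created by earlier txs in the same
        // block. Assumes workers >= 1.
        #include <coins.h>
        #include <primitives/block.h>

        #include <thread>
        #include <utility>
        #include <vector>

        void FetchBlockInputs(CCoinsViewCache& cache, const CCoinsView& db,
                              const CBlock& block, int workers)
        {
            std::vector<COutPoint> outpoints;
            for (const auto& tx : block.vtx) {
                if (tx->IsCoinBase()) continue;
                for (const auto& in : tx->vin) {
                    if (!cache.HaveCoinInCache(in.prevout)) outpoints.push_back(in.prevout);
                }
            }

            // Each thread reads a disjoint slice of the shared, now read-only vector.
            std::vector<std::vector<std::pair<COutPoint, Coin>>> results(workers);
            std::vector<std::thread> threads;
            for (int t = 0; t < workers; ++t) {
                threads.emplace_back([&, t] {
                    for (size_t i = t; i < outpoints.size(); i += workers) {
                        Coin coin;
                        if (db.GetCoin(outpoints[i], coin)) {
                            results[t].emplace_back(outpoints[i], std::move(coin));
                        }
                    }
                });
            }
            for (auto& th : threads) th.join();

            // Single-threaded merge into the cache before ConnectBlock runs.
            for (auto& vec : results) {
                for (auto& [outpoint, coin] : vec) {
                    cache.EmplaceCoinInternalDANGER(std::move(outpoint), std::move(coin),
                                                    /*set_dirty=*/false);
                }
            }
        }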

    This PR uses the -par value for the number of threads, which defaults to the number of vcores on the machine or 15, whichever is fewer. This is the same value used for CCheckQueue, so any users that specifically have multithreaded validation disabled by using -par=1 will also have this feature disabled. This also means the maximum number of input fetching threads is capped at 15.

    Since InputFetcher::FetchInputs is blocking, a follow-up can update this to share the thread pool between CCheckQueue and InputFetcher.

  2. DrahtBot commented at 2:40 pm on October 22, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31132.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK l0rinc, ismaelsadeeq

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label Validation on Oct 22, 2024
  4. andrewtoth force-pushed on Oct 22, 2024
  5. DrahtBot commented at 2:45 pm on October 22, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/31894441286

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  6. DrahtBot added the label CI failed on Oct 22, 2024
  7. andrewtoth renamed this:
    validation: fetch block inputs parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads ~17% faster IBD
    on Oct 22, 2024
  8. andrewtoth force-pushed on Oct 22, 2024
  9. andrewtoth force-pushed on Oct 22, 2024
  10. andrewtoth force-pushed on Oct 22, 2024
  11. DrahtBot removed the label CI failed on Oct 22, 2024
  12. in src/inputfetcher.h:173 in e9e23b59f8 outdated
    146+        : m_batch_size(batch_size)
    147+    {
    148+        m_worker_threads.reserve(worker_thread_count);
    149+        for (size_t n = 0; n < worker_thread_count; ++n) {
    150+            m_worker_threads.emplace_back([this, n]() {
    151+                util::ThreadRename(strprintf("inputfetch.%i", n));
    


    l0rinc commented at 10:26 am on October 23, 2024:
    Q: Is this a leftover hack for non-owning LevelDB threads, or is this really the best way to name threads in a cross-platform way?

    andrewtoth commented at 1:49 pm on October 23, 2024:
    Unsure, copied from CScriptCheck. If the state of the art of thread naming has advanced since that was written, please let me know!

    sipa commented at 1:54 pm on October 23, 2024:
    The C++ standard library does as far as I know have no way of renaming threads at all. src/util/threadnames.{h,cpp} is our wrapper around the various platform-dependent ways of doing so on supported systems.

    l0rinc commented at 2:03 pm on October 23, 2024:
    Thank you, please resolve the comment.
  13. in src/inputfetcher.h:124 in e9e23b59f8 outdated
    184+                    continue;
    185+                }
    186+
    187+                buffer.emplace_back(outpoint);
    188+                if (buffer.size() == m_batch_size) {
    189+                    Add(std::move(buffer));
    


    l0rinc commented at 11:29 am on October 23, 2024:
    We’re mostly creating the buckets randomly here, so each thread will need access to basically all of the keys. Since we have an idea of how LevelDB works here (i.e. Sorted String Table), we could likely improve cache locality (would likely be most beneficial on HDDs) and minimize lock contention by splitting the reads by sorted transactions instead.

    andrewtoth commented at 1:46 pm on October 23, 2024:

    I don’t think there is any lock contention here if we are doing multithreaded reading?

    I also think what you’re suggesting would add a lot more complexity to this PR, when this is “good enough”.


    l0rinc commented at 2:02 pm on October 23, 2024:
    This might be as simple as sorting by tx before we create the buckets.

    andrewtoth commented at 2:12 pm on October 23, 2024:
    If a benchmark shows that it is better, then great!
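
    For illustration, the pre-sorting suggested above could be as small as the following sketch (not code from the PR, which keeps block order):

        // COutPoint's operator< orders by txid and then by output index, so a plain
        // sort groups lookups that land on neighbouring LevelDB keys before the
        // outpoints are sliced into per-thread batches.
        std::sort(outpoints.begin(), outpoints.end());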
  14. in src/inputfetcher.h:90 in e9e23b59f8 outdated
    68+        std::vector<std::pair<COutPoint, Coin>> pairs{};
    69+        do {
    70+            std::vector<COutPoint> outpoints{};
    71+            outpoints.reserve(m_batch_size);
    72+            {
    73+                WAIT_LOCK(m_mutex, lock);
    


    l0rinc commented at 11:30 am on October 23, 2024:
    I’m wondering if we really need to (b)lock here or whether we could create a read-only snapshot instead and avoid stalling?

    andrewtoth commented at 1:34 pm on October 23, 2024:
    This is blocking so we can access the queue of shared outpoints that we need to fetch from. It is not blocking for LevelDB, we access the db once we are out of the critical section.

    l0rinc commented at 2:04 pm on October 23, 2024:
    As mentioned before, why do we need shared outpoints here?

    andrewtoth commented at 2:15 pm on October 23, 2024:
    The main thread adds all outpoints to a global vector, which all workers will fetch their work from.

    andrewtoth commented at 0:56 am on December 4, 2024:
    We no longer need to block on the shared outpoints vector. We write to it once in the main thread before notifying the other threads and then only read from it afterwards.
  15. in src/inputfetcher.h:33 in e9e23b59f8 outdated
    24+ * onto the queue, where they are fetched by N worker threads. The resulting
    25+ * coins are pushed onto another queue after they are read from disk. When
    26+ * the main is done adding outpoints, it starts writing the results of the read
    27+ * queue to the cache.
    28+ */
    29+class InputFetcher
    


    l0rinc commented at 11:39 am on October 23, 2024:
    I know it’s not a trivial request, but can we add a test for this class which fetches everything in parallel and sequentially and asserts that the results are equivalent? And preferably also a benchmark, like we have it for https://github.com/bitcoin/bitcoin/blob/master/src/bench/checkqueue.cpp. I would gladly help here, if needed.

    andrewtoth commented at 1:35 pm on October 23, 2024:
    Yes, I can add these but I am waiting for some more conceptual support.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Added tests and benchmark. The test has random parameters, one combination of which ends up having a single worker thread.

    andrewtoth commented at 8:59 pm on November 16, 2024:
    Also added fuzz harness
  16. in src/inputfetcher.h:145 in e9e23b59f8 outdated
    140+    }
    141+
    142+
    143+public:
    144+    //! Create a new input fetcher
    145+    explicit InputFetcher(size_t batch_size, size_t worker_thread_count) noexcept
    


    l0rinc commented at 11:46 am on October 23, 2024:
    For consistency (see: explicit CCheckQueue(unsigned int batch_size, int worker_threads_num)) and simplicity (m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)}, and to follow modern C++ directions where sizes seem to be preferred as signed values, see: #30927 (review)), please consider making these int(s) instead.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Done.
  17. in src/validation.cpp:6257 in e9e23b59f8 outdated
    6247@@ -6243,6 +6248,7 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
    6248 
    6249 ChainstateManager::ChainstateManager(const util::SignalInterrupt& interrupt, Options options, node::BlockManager::Options blockman_options)
    6250     : m_script_check_queue{/*batch_size=*/128, options.worker_threads_num},
    6251+      m_input_fetcher{/*batch_size=*/128, static_cast<size_t>(options.worker_threads_num)},
    


    l0rinc commented at 12:07 pm on October 23, 2024:

    Unlike the script checks, these fetches aren’t CPU bound, so there is no reason to provide the number of CPUs as the number of parallel threads. I don’t know if we care about HDD performance here or not, but we can likely find a multiplier that makes this better for both SSD and HDD.

    Quoting from https://pkolaczk.github.io/disk-parallelism:

    It was surprising to me that even 64 threads, which are far more than the number of CPU cores (4 physical, 8 virtual), still improved the performance. I guess that with requests of such a small size to such a fast storage, you need to submit really many of them to keep the SSD busy.

    If we can provide a benchmark for this usecase we can likely find an optimal multiplier here - I won’t nack but this part is very important for me.


    andrewtoth commented at 1:43 pm on October 23, 2024:

    Adding more threads will require more memory, which is one reason to not use many more.

    I did a benchmark using 64 threads on the same 16 vcore machine, and it was slightly slower :/


    l0rinc commented at 2:00 pm on October 23, 2024:
    4x may be too much to begin with, but 1.5-2x sounds plausible, I’ll help with benchmarking this once my current batches finish.

    andrewtoth commented at 3:06 pm on November 7, 2024:
    Added a benchmark to experiment with these.
  18. in src/inputfetcher.h:123 in e9e23b59f8 outdated
    183+                if (cache.HaveCoinInCache(outpoint)) {
    184+                    continue;
    185+                }
    186+
    187+                buffer.emplace_back(outpoint);
    188+                if (buffer.size() == m_batch_size) {
    


    l0rinc commented at 12:10 pm on October 23, 2024:
    Would it be possible to create the batch sizes dynamically? Since the number of missing values differs for every block (and every dbcache size), it may make more sense to calculate the optimal split instead of using the arbitrary value of 128. Coroutines might alleviate this problem.

    andrewtoth commented at 1:42 pm on October 23, 2024:
    I’m not sure it would warrant the complexity. I think this batch size is “good enough” for now. In a follow-up we could maybe add config options to experiment with whether there really are more optimal settings.

    andrewtoth commented at 3:05 pm on November 7, 2024:
    I changed the batch size to be number of workers.
  19. in src/inputfetcher.h:132 in e9e23b59f8 outdated
    192+                }
    193+            }
    194+            txids.insert(tx->GetHash());
    195+        }
    196+
    197+        Add(std::move(buffer));
    


    l0rinc commented at 12:11 pm on October 23, 2024:
    Do we always have leftovers, or will this process the last batch twice (or process an empty one) if the input count happens to be divisible by batch_size?

    andrewtoth commented at 1:37 pm on October 23, 2024:
    It won’t process twice, but it could pass in an empty vector, which is ignored if you look at Add implementation.
  20. in src/inputfetcher.h:64 in e9e23b59f8 outdated
    60+
    61+    std::vector<std::thread> m_worker_threads;
    62+    bool m_request_stop GUARDED_BY(m_mutex){false};
    63+
    64+    /** Internal function that does the fetching from disk. */
    65+    void Loop() noexcept EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
    


    l0rinc commented at 12:35 pm on October 23, 2024:

    We’re basically mimicking RocksDB’s MultiGet here - but prewarming the cache instead in separate get requests, since we can’t really access LevelDB’s internals.

    Since splitting into buckets isn’t trivial and since MultiGet seems to rely on C++20 coroutines (which wasn’t available in 2012 when CCheckQueue was written), I’m wondering how much simpler this fetching would be if we had lightweight suspendible threads instead: https://rocksdb.org/blog/2022/10/07/asynchronous-io-in-rocksdb.html#multiget


    andrewtoth commented at 1:40 pm on October 23, 2024:

    I think it would be similar in complexity, we would still need all the locking mechanisms to prevent multithreaded access.

    What would really be great is if we had a similar construction to Rust’s std::sync::mpsc.


    l0rinc commented at 1:58 pm on October 23, 2024:

    Can you tell me why we need to prevent multithreaded access exactly? We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread, right?

    How would mpsc solve this better? Do you think we need work stealing to make it perfectly parallel? Wouldn’t coroutines already achieve the same?


    sipa commented at 2:07 pm on October 23, 2024:
    I haven’t yet experimented with them, but as far as I understand it, coroutines are just a programming paradigm, not magic; they don’t do anything of their own, besides making things that were already possible easier to write. In particular, you still need a thread pool or some mechanism for scheduling how to run them.

    andrewtoth commented at 2:09 pm on October 23, 2024:

    We could collect the values to different vectors, each one accessed only by a single thread and merge them into the cache at the end on a single thread

    If the vectors are thread local, then how can the main thread access them at the end to write them? We also want to be writing throughout while the workers are fetching, not just at the end.

    How would mpsc solve this better?

    Instead of each worker thread having a local queue of results, which they then append to the global results queue, they could just push each result to the channel individually. The main thread could just pull results off the channel as they arrive, instead of waiting to be awoken by a worker thread that appended all its results to the global queue.

    work stealing

    That is a concept for async rust, or std::async::mpsc. We can do all this without introducing an async runtime. But, this is getting off topic.


    l0rinc commented at 3:15 pm on October 23, 2024:

    coroutines are just programming paradigm, not magic

    That’s also what I was counting on! :D

    In RocksDB they have high and low priority work (I assume that’s just added to the front or the back of a background work deque) – this could align well with @furszy’s suggestion for mixing different kinds of background work units.

    I haven’t used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime. Combined with an appropriate scheduling mechanism (such as a thread pool), we could maximize both CPU and IO usage, if I’m not mistaken. Instead of each thread handling just one task, it could suspend a coroutine while waiting on IO (e.g., a database fetch) and resume it later, effectively maximizing CPU and IO work without needing to know the exact details of the work.

    If the vectors are thread local

    The vector would still be global, but each thread would only access a single bucket (i.e. global vector of vectors, with each thread from the pool writing only to vector[thread_id], which contains a vector of fetched coins). When all the work is finished, we’d iterate over the global vector and merge the results into the cache on a single thread. As mentioned, sorting the outpoints before fetching could help improve data locality and reduce lock contention, and the coroutines above would help with work stealing, ensuring that all threads finish roughly at the same time.

    Is there anything prohibiting us from doing something like this to minimize synchronization and lock contention during the fetch phase? I understand some synchronization would still be needed during the merge, but this could help reduce global locks and unnecessary synchronization throughout the process.


    sipa commented at 3:28 pm on October 23, 2024:

    I haven’t used the C++ variant of coroutines either, but my thinking was that since they can theoretically yield execution when waiting for IO (and resume later), this would allow threads to focus on other tasks in the meantime.

    That needs async I/O, and is unrelated to coroutines, as far as I understand it. Coroutines just help with keeping track of what to do when the reads come back inside rocksdb.

    As long as LevelDB (or whatever database engine we use) internally does not use async I/O, there will be one (waiting) thread per parallel outstanding read request from the database.

  21. l0rinc changes_requested
  22. l0rinc commented at 12:44 pm on October 23, 2024: contributor

    Concept ACK

    I’m still missing tests and benchmarks here and I think we need to find better default values for SSD and HDD parallelism, and I’d be interested in how coroutines would perform here instead of trying to find the best batching size manually.

  23. furszy commented at 2:25 pm on October 23, 2024: member

    Cool idea.

    Since the inputs fetcher call is blocking, instead of creating a new set of worker threads, what do you think about re-using the existing script validation ones (or any other unused worker threads) by implementing a general-purpose thread pool shared among the validation checks? The script validation checks and the inputs fetching mechanism are never done concurrently because you need the inputs in order to verify the scripts. So, they could share workers.

    This should be benchmarked because it might add some overhead but, #26966 introduces such structure inside 401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

  24. andrewtoth commented at 2:48 pm on October 23, 2024: contributor

    implementing a general-purpose thread pool shared among the validation checks?

    Nice, yes that would be great! That would simplify this PR a lot if it could just schedule tasks on worker threads and receive the responses, instead of implementing all the sync code itself.

    #26966 introduces such structure inside https://github.com/bitcoin/bitcoin/commit/401f21bfd72f32a28147677af542887518a4dbff, which we could pull off and use for validation.

    Concept ACK!

  25. l0rinc commented at 6:12 pm on October 24, 2024: contributor

    Finished benching on an HDD up to height 860k on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz, CPU = 8:

    Summary
      'COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0' ran
        1.02 times faster than 'COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0'

    f278ca4ec3 coins: allow emplacing non-dirty coins internally (39993.343777768874 seconds = 11.1 hours)
    e9e23b59f8 validation: fetch block inputs in parallel (40929.84310861388 seconds = 11.3 hours)


    So likely on HDD we shouldn’t use so many threads, apparently it slows down IBD. Maybe we could add a new config option (iothreads or iothreadmultiplier or something). The defaults should likely depend on whether it’s an SSD or HDD.


    Edit:

    0"command": "COMMIT=f278ca4ec3f0a90c285e640f1a270869ca594d20 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    1"times": [39993.343777768874],
    2
    3"command": "COMMIT=e9e23b59f8eedb8dfae75aa660328299fba92b50 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    4"times": [40929.84310861388],
    

    I have retried the same with half the parallelism (rebased, but no other change in the end, otherwise the results would be hard to interpret):

    0"command": "COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=10000 -printtoconsole=0",
    1"times": [40579.00445769842],
    

    So it’s a tiny bit faster than the previous threaded run (surprisingly stable for an actual IBD with real peers), but still slower than or on par with master, so not sure why it’s not faster.


    Edit:

    Running it on a HDD with a low dbcache value reproduces the original result:

    hyperfine --runs 1 --show-output --export-json /mnt/my_storage/ibd_full-threaded-inputs-3.json --parameter-list COMMIT 92fc718592be55812b2c73a3bf57599fc81425fa,8207d372b2fac24af0f8999b30e71e88d40b3a13 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'

    8207d372b2 validation: fetch block inputs in parallel
    92fc718592 coins: allow emplacing non-dirty coins internally
    Summary
      'COMMIT=8207d372b2fac24af0f8999b30e71e88d40b3a13 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0' ran
        1.16 times faster than 'COMMIT=92fc718592be55812b2c73a3bf57599fc81425fa ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=860000 -dbcache=1000 -printtoconsole=0'
  26. andrewtoth commented at 6:31 pm on October 24, 2024: contributor

    So likely on HDD we shouldn’t use so many threads, apparently it slows down IBD.

    I’m not sure we can conclude that from your benchmark. It used a very high dbcache setting, which makes the effect of this change less important. It also is syncing from untrusted network peers, so there is some variance which could also account for the 2% difference.

  27. andrewtoth force-pushed on Oct 24, 2024
  28. andrewtoth force-pushed on Oct 24, 2024
  29. DrahtBot commented at 7:25 pm on October 24, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32027275494

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  30. DrahtBot added the label CI failed on Oct 24, 2024
  31. DrahtBot removed the label CI failed on Oct 25, 2024
  32. andrewtoth force-pushed on Oct 26, 2024
  33. andrewtoth force-pushed on Oct 26, 2024
  34. DrahtBot added the label CI failed on Oct 26, 2024
  35. DrahtBot commented at 6:52 pm on October 26, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32107893176

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  36. andrewtoth force-pushed on Oct 26, 2024
  37. andrewtoth force-pushed on Oct 26, 2024
  38. andrewtoth force-pushed on Oct 26, 2024
  39. andrewtoth force-pushed on Oct 26, 2024
  40. andrewtoth force-pushed on Oct 27, 2024
  41. andrewtoth force-pushed on Oct 27, 2024
  42. andrewtoth marked this as a draft on Oct 27, 2024
  43. andrewtoth force-pushed on Oct 27, 2024
  44. andrewtoth force-pushed on Oct 27, 2024
  45. andrewtoth force-pushed on Oct 27, 2024
  46. andrewtoth force-pushed on Oct 27, 2024
  47. andrewtoth force-pushed on Oct 27, 2024
  48. andrewtoth force-pushed on Oct 27, 2024
  49. andrewtoth force-pushed on Oct 27, 2024
  50. andrewtoth force-pushed on Nov 7, 2024
  51. andrewtoth force-pushed on Nov 7, 2024
  52. DrahtBot removed the label CI failed on Nov 7, 2024
  53. andrewtoth force-pushed on Nov 7, 2024
  54. andrewtoth marked this as ready for review on Nov 7, 2024
  55. andrewtoth commented at 3:05 pm on November 7, 2024: contributor

    @furszy I tried to switch to using a shared threadpool, but it is much slower that way. We need a way to have shared state between threads for this, instead of just scheduling tasks. I suppose the generic threadpool is great for scheduling independent tasks like indexing an individual block, but for quickly pulling outpoints off a shared vector it is not optimized well.

    From #29386:

    I just noticed the comment in the code:

    For each thread a thread stack needs to be allocated. By default on Linux, threads take up 8MiB for the thread stack on a 64-bit system, and 4MiB in a 32-bit system.

    Only 8MiB of Virtual Memory is allocated, which doesn’t really mean anything. Due to CoW mechanism, only the parts of stack that are being used will be allocated as Physical Memory which is the one that actually matters.

    So, I don’t think it matters much to have an extra threadpool owned by the input fetcher.

    I think this is ready for more review. I also added tests and a benchmark.

  56. andrewtoth force-pushed on Nov 7, 2024
  57. andrewtoth force-pushed on Nov 9, 2024
  58. andrewtoth commented at 3:28 pm on November 13, 2024: contributor
    For later blocks, where cache misses are much more common, this change has an even bigger impact. This benchmark report shows a 40% speedup measured from block 840k to 850k. Also, compare the flame graphs of master and this branch, where the latter has 15 worker threads fetching coins from disk. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11798124132/index.html
  59. andrewtoth commented at 3:35 am on November 16, 2024: contributor
    Even with just 2 worker threads, there is significant (~30%) speed improvement for syncing recent blocks. https://bitcoin-dev-tools.github.io/benchcoin/results/pr-19/11865650166/index.html
  60. andrewtoth force-pushed on Nov 16, 2024
  61. andrewtoth force-pushed on Nov 16, 2024
  62. andrewtoth force-pushed on Nov 16, 2024
  63. andrewtoth force-pushed on Nov 16, 2024
  64. DrahtBot commented at 7:45 pm on November 16, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33086747731

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  65. DrahtBot added the label CI failed on Nov 16, 2024
  66. andrewtoth force-pushed on Nov 16, 2024
  67. andrewtoth force-pushed on Nov 16, 2024
  68. DrahtBot removed the label CI failed on Nov 16, 2024
  69. andrewtoth force-pushed on Nov 17, 2024
  70. in src/test/fuzz/inputfetcher.cpp:36 in 2bd5f0f03b outdated
    27+    CBlock block;
    28+    Txid prevhash{Txid::FromUint256(ConsumeUInt256(fuzzed_data_provider))};
    29+
    30+    const auto txs{fuzzed_data_provider.ConsumeIntegralInRange<uint32_t>(1,
    31+        std::numeric_limits<uint32_t>::max())};
    32+    for (uint32_t i{0}; i < txs; ++i) {
    


    dergoegge commented at 2:16 pm on November 18, 2024:

    This will create very long running inputs (e.g. txs = std::numeric_limits<uint32_t>::max()).

        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), N) {
    

    or

        LIMITED_WHILE(fuzzed_data_provider.remaining_bytes(), N) {
    

    andrewtoth commented at 5:04 pm on November 18, 2024:
    Thanks, done!
  71. in src/test/fuzz/inputfetcher.cpp:41 in 2bd5f0f03b outdated
    31+        std::numeric_limits<uint32_t>::max())};
    32+    for (uint32_t i{0}; i < txs; ++i) {
    33+        CMutableTransaction tx;
    34+
    35+        const auto inputs{fuzzed_data_provider.ConsumeIntegral<uint32_t>()};
    36+        for (uint32_t j{0}; j < inputs; ++j) {
    


    dergoegge commented at 2:17 pm on November 18, 2024:
    Same as above, this will create long running inputs and maybe even run out of memory?
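
    Applied to the loops above, the bounded construction could look roughly like this (the limits and the chaining of prevhash are illustrative, not the harness's final code):

        LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 4096) { // bounded number of txs
            CMutableTransaction tx;
            LIMITED_WHILE(fuzzed_data_provider.ConsumeBool(), 1024) { // bounded inputs per tx
                tx.vin.emplace_back(COutPoint{prevhash, fuzzed_data_provider.ConsumeIntegral<uint32_t>()});
            }
            block.vtx.push_back(MakeTransactionRef(tx));
            prevhash = block.vtx.back()->GetHash();
        }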
  72. andrewtoth force-pushed on Nov 18, 2024
  73. andrewtoth force-pushed on Nov 18, 2024
  74. DrahtBot commented at 3:34 pm on November 18, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33143571653

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  75. DrahtBot added the label CI failed on Nov 18, 2024
  76. andrewtoth force-pushed on Nov 18, 2024
  77. andrewtoth force-pushed on Nov 18, 2024
  78. andrewtoth force-pushed on Nov 18, 2024
  79. andrewtoth force-pushed on Nov 18, 2024
  80. andrewtoth force-pushed on Nov 18, 2024
  81. andrewtoth force-pushed on Nov 18, 2024
  82. andrewtoth force-pushed on Nov 18, 2024
  83. DrahtBot removed the label CI failed on Nov 18, 2024
  84. andrewtoth force-pushed on Nov 19, 2024
  85. in src/coins.h:420 in 3a4af55071 outdated
    415@@ -416,13 +416,14 @@ class CCoinsViewCache : public CCoinsViewBacked
    416     void AddCoin(const COutPoint& outpoint, Coin&& coin, bool possible_overwrite);
    417 
    418     /**
    419-     * Emplace a coin into cacheCoins without performing any checks, marking
    420-     * the emplaced coin as dirty.
    421+     * Emplace a coin into cacheCoins without performing any checks, optionally
    422+     * marking the emplaced coin as dirty.
    


    TheCharlatan commented at 2:38 pm on November 19, 2024:
    Should this rather say “optionally marking the emplaced coin as not dirty”, since the default is always dirty?

    andrewtoth commented at 6:53 pm on November 20, 2024:

    I’m not sure that’s the best though, since we do not mark a coin as not dirty. That is the default state.

    What about “marking the coin as dirty unless set_dirty is set to false”?


    TheCharlatan commented at 7:15 pm on November 20, 2024:
    That sounds good to me :+1:

    andrewtoth commented at 9:56 pm on November 20, 2024:
    Done.
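
    With that wording, the declaration would read roughly as follows (the default value and comment text are assumed from this discussion, not quoted from the commit):

        /**
         * Emplace a coin into cacheCoins without performing any checks, marking
         * the coin as dirty unless set_dirty is set to false.
         */
        void EmplaceCoinInternalDANGER(COutPoint&& outpoint, Coin&& coin, bool set_dirty = true);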
  86. andrewtoth force-pushed on Nov 19, 2024
  87. andrewtoth force-pushed on Nov 20, 2024
  88. andrewtoth force-pushed on Nov 20, 2024
  89. DrahtBot added the label CI failed on Nov 20, 2024
  90. DrahtBot commented at 6:46 pm on November 20, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33279820062

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  91. andrewtoth force-pushed on Nov 20, 2024
  92. DrahtBot removed the label CI failed on Nov 20, 2024
  93. andrewtoth force-pushed on Nov 21, 2024
  94. DrahtBot commented at 5:12 pm on November 21, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33335042693

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  95. DrahtBot added the label CI failed on Nov 21, 2024
  96. andrewtoth force-pushed on Nov 21, 2024
  97. DrahtBot removed the label CI failed on Nov 21, 2024
  98. andrewtoth force-pushed on Nov 22, 2024
  99. andrewtoth force-pushed on Nov 24, 2024
  100. in src/test/inputfetcher_tests.cpp:55 in 3c201bcffc outdated
    50+    const auto cores{GetNumCores()};
    51+    const auto num_txs{m_rng.randrange(cores * 10)};
    52+    const auto block{CreateBlock(num_txs)};
    53+    const auto batch_size{m_rng.randrange<int32_t>(block.vtx.size() * 2)};
    54+    const auto worker_threads{m_rng.randrange(cores * 2)};
    55+    InputFetcher fetcher{batch_size, worker_threads};
    


    ismaelsadeeq commented at 8:31 pm on November 26, 2024:
    In 3c201bcffc1d7e382e8afa9a88750a4c261c1cf8 “tests: add inputfetcher tests” You can set this up in InputFetcherTest so that you don’t have to repeat it in the rest of the tests.

    andrewtoth commented at 0:58 am on December 4, 2024:
    Done.
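
    A hedged sketch of what that shared setup could look like, reusing the test file's CreateBlock helper (member names are illustrative; the committed fixture may differ):

        struct InputFetcherTest : BasicTestingSetup {
            const int m_cores{GetNumCores()};
            const int m_worker_threads{m_rng.randrange(m_cores * 2)};
            const CBlock m_block{CreateBlock(m_rng.randrange(m_cores * 10))};
        };

        BOOST_FIXTURE_TEST_SUITE(inputfetcher_tests, InputFetcherTest)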
  101. in src/bench/inputfetcher.cpp:15 in 2349ac7d60 outdated
    10+#include <primitives/block.h>
    11+#include <serialize.h>
    12+#include <streams.h>
    13+#include <util/time.h>
    14+
    15+static constexpr auto QUEUE_BATCH_SIZE{128};
    


    ismaelsadeeq commented at 8:33 pm on November 26, 2024:
    In 2349ac7d6071746a80223358bce0d5e556b277d7 “bench: add inputfetcher bench” How did you select this batch size?

    andrewtoth commented at 0:58 am on December 4, 2024:
    This is the hardcoded batch size used in CheckQueue. Not sure why that was selected, but I deferred to previous choices.
  102. in src/bench/inputfetcher.cpp:19 in 2349ac7d60 outdated
    14+
    15+static constexpr auto QUEUE_BATCH_SIZE{128};
    16+static constexpr auto DELAY{2ms};
    17+
    18+//! Simulates a DB by adding a delay when calling GetCoin
    19+class DelayedCoinsView : public CCoinsView
    


    ismaelsadeeq commented at 8:47 pm on November 26, 2024:
    In 2349ac7d6071746a80223358bce0d5e556b277d7 “bench: add inputfetcher bench” nit: it would be nice if we had block413567's input data to read so that we don't have to simulate this.

    andrewtoth commented at 0:59 am on December 4, 2024:
    We’re reading the previous outpoints of that block’s inputs, which are in many other previous blocks. So, not sure this is feasible.
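
    For context, the simulated DB in the bench boils down to roughly the following (a sketch assuming the bool-returning CCoinsView::GetCoin interface; the committed class may differ):

        #include <coins.h>
        #include <util/time.h>

        #include <chrono>
        #include <map>

        //! Simulates a DB by adding a delay when calling GetCoin
        class DelayedCoinsView : public CCoinsView
        {
            std::map<COutPoint, Coin> m_coins;
            std::chrono::milliseconds m_delay;

        public:
            explicit DelayedCoinsView(std::chrono::milliseconds delay) : m_delay{delay} {}

            //! Populate the simulated backing store (bench setup only).
            void AddCoin(const COutPoint& outpoint, Coin coin) { m_coins[outpoint] = std::move(coin); }

            bool GetCoin(const COutPoint& outpoint, Coin& coin) const override
            {
                UninterruptibleSleep(m_delay); // stand-in for a disk round trip
                const auto it{m_coins.find(outpoint)};
                if (it == m_coins.end()) return false;
                coin = it->second;
                return true;
            }
        };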
  103. ismaelsadeeq commented at 8:55 pm on November 26, 2024: member

    Concept ACK. This is nice. Although I have not yet benchmarked this branch, I also like @furszy’s idea of having a general-purpose thread pool.

    I just have one test improvement comment, a question, and a nit after a first pass of the PR.

  104. DrahtBot added the label Needs rebase on Dec 4, 2024
  105. andrewtoth force-pushed on Dec 4, 2024
  106. andrewtoth commented at 1:02 am on December 4, 2024: contributor

    Rebased. Since #30039, reading inputs is much faster, so the effect of this change is somewhat less significant (17% -> 10%). It’s still a significant speedup, though, so still worth it, especially for the worst case where the cache is completely empty, like on startup or right after it gets flushed due to size.

    It has also been refactored significantly. The main thread now writes everything before notifying the other threads, and then joins in the fetching. This lets us do significantly less work in the critical section and parallelize more checks.

  107. andrewtoth renamed this:
    validation: fetch block inputs on parallel threads ~17% faster IBD
    validation: fetch block inputs on parallel threads 10% faster IBD
    on Dec 4, 2024
  108. andrewtoth force-pushed on Dec 4, 2024
  109. DrahtBot commented at 1:29 am on December 4, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33884531020

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  110. DrahtBot added the label CI failed on Dec 4, 2024
  111. andrewtoth force-pushed on Dec 4, 2024
  112. DrahtBot removed the label Needs rebase on Dec 4, 2024
  113. DrahtBot removed the label CI failed on Dec 4, 2024
  114. DrahtBot added the label Needs rebase on Dec 5, 2024
  115. coins: allow emplacing non-dirty coins internally 07d30787ef
  116. coins: add inputfetcher d73eacc109
  117. tests: add inputfetcher tests cb5188c198
  118. bench: add inputfetcher bench 054b7830e5
  119. fuzz: add inputfetcher fuzz harness e8c038c9f6
  120. validation: fetch block inputs in parallel b2da764446
  121. andrewtoth force-pushed on Dec 5, 2024
  122. DrahtBot removed the label Needs rebase on Dec 5, 2024
