furszy
commented at 1:36 pm on January 25, 2023:
member
The current procedure for building the block filter index processes filters one at a time,
reading blocks, undo data, and previous headers from disk sequentially.
This PR introduces a new mechanism to perform the work concurrently, dividing the filter
generation workload among a pool of workers whose size can be configured by the user.
This significantly speeds up index construction.
The same concurrent processing model has been applied to the transactions index as well.
The newly introduced init flag -indexworkers=<n> enables the concurrent sync
behavior, where “n” is the number of worker threads spawned at startup to create ranges
of block filters during the initial sync process. The worker pool is destroyed once the
initial sync completes.
Note: by default, the parallelized sync process is not enabled.
Now the juicy part:
On my computer, with the node in debug mode and in IBD, with -indexworkers=4, the
block filter index generation took less than an hour, while on master the sync took more than 7 hours.
Important Note:
As the access to the block data on disk is protected by cs_main, this new feature runs substantially
faster when the node is not in IBD.
Testing Notes:
Sync your node without any index.
Restart the node with one of the indexes (-blockfilterindex or -txindex) and -connect=0 (to sync only the index, without running the net/validation threads. Since threads won’t be competing for cs_main, this will give you a more accurate result).
You’ll see a “[index name] is enabled at height [height]” log entry once it finishes. Then it’s just a matter of subtracting the index startup time from the “index synced” log time.
Keep in mind that threads are shared among all indexes you start. So if you run both indexes at the same time, your benchmark results cannot be compared against single-index runs.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Conflicts
Reviewers, this pull request conflicts with the following ones:
#33689 (http: replace WorkQueue and single threads handling for ThreadPool by furszy)
#31308 (ci, iwyu: Treat warnings as errors for specific directories by hebasto)
#31260 (scripted-diff: Type-safe settings retrieval by ryanofsky)
#29770 (index: Check all necessary block data is available before starting to sync by fjahr)
#29700 (kernel, refactor: return error status on all fatal errors by ryanofsky)
#17783 (common: Disallow calling IsArgSet() on ALLOW_LIST options by ryanofsky)
#17581 (refactor: Remove settings merge reverse precedence code by ryanofsky)
#17580 (refactor: Add ALLOW_LIST flags and enforce usage in CheckArgFlags by ryanofsky)
#17493 (util: Forbid ambiguous multiple assignments in config file by ryanofsky)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
LLM Linter (✨ experimental)
Possible typos and grammar issues:
a posterior -> later [“a posterior” is incorrect here; use “later” (or “posterior”) to convey a subsequent check]
DrahtBot added the label
UTXO Db and Indexes
on Jan 25, 2023
furszy force-pushed
on Jan 25, 2023
furszy force-pushed
on Jan 25, 2023
furszy force-pushed
on Jan 25, 2023
Sjors
commented at 2:38 pm on January 26, 2023:
member
Cool, will take it for a spin…
furszy force-pushed
on Jan 28, 2023
furszy
commented at 3:18 pm on January 28, 2023:
member
Cool @Sjors, just pushed a small update. Found a little bug.
Going to add an important note to the PR description (because otherwise testing results will vary a lot):
As the access to the block data on disk is protected by cs_main, this new feature runs substantially
faster when the node is not in IBD.
(where “substantially” here means a full index sync, with 5 workers, in less than 20 minutes on my computer).
furszy force-pushed
on Jan 28, 2023
mzumsande
commented at 10:49 pm on January 29, 2023:
contributor
It’s probably much easier suggested than done, but did you attempt to implement parallelization in a more general way so that other indices could benefit from it as well? On first glance, txindex, and the indices suggested in PRs (#24539, #26951) seem to be parallelizable as well (not coinstatsindex though).
furszy
commented at 1:37 pm on January 30, 2023:
member
It’s probably much easier suggested than done, but did you attempt to implement parallelization in a more general way so that other indices could benefit from it as well?
Yeah, that is part of the plan. I started with the block filter index because it requires special treatment that txindex parallelization does not: block/undo data reading and block filter creation can be parallelized, but writing must be done sequentially due to the need to link filter headers to their predecessors to create the filter chain on disk.
My idea was to start reviewing this one, so the process gets as clean as possible, and then move forward with the generalization step. It’s usually more natural to abstract processes when the specific cases are well-defined.
furszy force-pushed
on Jan 30, 2023
w0xlt
commented at 4:42 pm on January 30, 2023:
contributor
Concept ACK
Perhaps the parallelization and index logic could be kept separate, so it could be reused in other indexes.
furszy force-pushed
on Jan 30, 2023
furszy force-pushed
on Jan 30, 2023
Sjors
commented at 1:47 pm on February 7, 2023:
member
Building the index on (AMD Ryzen 7950X, blocks stored on SSD):
master @ fe86616bb4ad0c4296d34299bc2e2f0fca1fe936: 35'15" (mostly 1 alternating CPU thread)
this PR (rebased)
n=8: 5'20" (uses about 8 of 32 CPU threads as expected)
n=32: 4'26" (pleasantly close to 100% CPU usage with a dip every 10 seconds, but it drops to only 1 CPU in the last minute or two)
I made sure to not load any wallets and disabled other indexes.
I didn’t test if the index was correct.
I wonder if, for users without this index, it would be faster to generate the index, rescan the wallet and then delete it again. Combined with #26951 you would only have to generate filters up to the age of the wallet (IIUC, cc @pstratem).
Note to self for future benchmarks: -txindex takes 1:07'14", -coinstatsindex takes 3.5 hours.
furszy
commented at 2:18 pm on February 7, 2023:
member
Could also give it a run rebased on top of #27006.
On master, the index initial sync is slower when the node is in IBD because the index thread has to compete for access to block data on disk through cs_main acquisition.
I didn’t test if the index was correct.
The PR contains a test verifying it.
Side note:
I’m working on generalizing the parallelization flow so other indexes, like txindex and #26951, can make use of it too.
DrahtBot added the label
Needs rebase
on Feb 17, 2023
furszy force-pushed
on Feb 21, 2023
furszy force-pushed
on Feb 21, 2023
DrahtBot removed the label
Needs rebase
on Feb 21, 2023
furszy force-pushed
on Feb 22, 2023
furszy force-pushed
on Feb 27, 2023
furszy renamed this:
index: blockfilter initial sync speedup, parallelize process
index: blockfilter and txindex initial sync speedup, parallelize process
on Feb 27, 2023
furszy renamed this:
index: blockfilter and txindex initial sync speedup, parallelize process
index: blockfilter initial sync speedup, parallelize process
on Feb 27, 2023
furszy force-pushed
on Feb 27, 2023
furszy
commented at 12:40 pm on February 28, 2023:
member
PR updated; most of the implementation has changed.
What’s new:
Decreased ThreadSync cs_main lock contention.
Removed CBlockIndex access from the child indexes internals.
Implemented generic workers pool.
Introduced a last-header cache for the block filter index, avoiding disk reads on every newly processed block.
Enabled parallel sync on the tx index.
Important Note:
The worker pool spawned by the -indexworkers init arg is shared among all enabled indexes that support parallel sync.
The implementation uses std::any mainly to simplify the patch set; the base-class-template form of it requires a larger set of changes.
Side note: at first glance, and without going too deep into the coinstats index implementation, I would say that it could also be parallelized. But I will leave it for a follow-up so as not to keep expanding the PR size.
Future (doesn’t need to be included here, just mentioning so the path is clear):
I’m working on decoupling the initial sync logic into a separate structure, so indexes subscribe to events instead of reading blocks from disk by themselves.
DrahtBot added the label
Needs rebase
on May 11, 2023
DrahtBot removed the label
Needs rebase
on May 11, 2023
furszy force-pushed
on May 11, 2023
DrahtBot added the label
Needs rebase
on Jun 12, 2023
furszy force-pushed
on Jun 21, 2023
DrahtBot removed the label
Needs rebase
on Jun 21, 2023
DrahtBot added the label
Needs rebase
on Jun 30, 2023
furszy force-pushed
on Jun 30, 2023
DrahtBot removed the label
Needs rebase
on Jun 30, 2023
DrahtBot added the label
CI failed
on Jun 30, 2023
DrahtBot removed the label
CI failed
on Jul 4, 2023
DrahtBot added the label
Needs rebase
on Jul 6, 2023
furszy force-pushed
on Aug 9, 2023
DrahtBot removed the label
Needs rebase
on Aug 9, 2023
DrahtBot added the label
CI failed
on Aug 29, 2023
furszy force-pushed
on Aug 29, 2023
DrahtBot
commented at 8:34 pm on September 4, 2023:
contributor
Could you mark this as draft while CI is red?
furszy force-pushed
on Sep 4, 2023
DrahtBot
commented at 5:47 am on September 6, 2023:
contributor
It looks like tsan failed, but there is no log, when there should be a log. Maybe it was accidentally removed by #27667 ?
DrahtBot added the label
Needs rebase
on Sep 12, 2023
furszy force-pushed
on Sep 12, 2023
DrahtBot removed the label
CI failed
on Sep 12, 2023
DrahtBot removed the label
Needs rebase
on Sep 12, 2023
DrahtBot added the label
Needs rebase
on Oct 2, 2023
furszy force-pushed
on Oct 18, 2023
DrahtBot removed the label
Needs rebase
on Oct 18, 2023
DrahtBot added the label
CI failed
on Oct 18, 2023
maflcko
commented at 9:16 am on October 25, 2023:
member
CI is still red. Also, how would this work with AU?
furszy force-pushed
on Oct 27, 2023
furszy force-pushed
on Oct 27, 2023
DrahtBot removed the label
CI failed
on Oct 27, 2023
Sjors
commented at 11:09 am on November 7, 2023:
member
That PR currently does not cleanly cherry-pick on top of this PR, nor can I (trivially) rebase this PR on top of it. Happy to try if you can make a branch.
I just tried it again, deleting the blockfilterindex and rebuilding it. My impression is that it’s going slower than before and I’m not seeing much CPU activity.
I also noticed the shutdown takes a very long time, with the indexer (?) threads sticking around for many minutes:
in
src/index/blockfilterindex.h:46
in
003f076a15 (outdated)
@@ -42,6 +42,9 @@ class BlockFilterIndex final : public BaseIndex
     /** cache of block hash to filter header, to avoid disk access when responding to getcfcheckpt. */
     std::unordered_map<uint256, uint256, FilterHeaderHasher> m_headers_cache GUARDED_BY(m_cs_headers_cache);

+    // Last computed header to avoid disk reads at every new block.
+    uint256 last_header{};
TheCharlatan
commented at 12:58 pm on November 24, 2023:
Nit: Might take this opportunity to add the <uint256.h> header?
TheCharlatan
commented at 3:32 pm on November 24, 2023:
If I run IWYU locally, it reports the following headers as missing:
#include <cstddef>    // for size_t
#include <algorithm>  // for max
#include <atomic>     // for atomic
#include <functional> // for function
#include <memory>     // for make_shared
#include <stdexcept>  // for runtime_error
#include <utility>    // for move, swap
#include <vector>     // for vector
Because I initially thought of pushing extra work into the thread pool queue so that the originator thread, once it finishes calculating its own tasks, can take on the remaining workload during its active wait. But I ended up implementing it differently to re-use the same piece of code for the single-thread approach (you can check it by looking for “// Otherwise, this is an active-wait, so we process blocks until all workers finish.”).
Still, I think that having a method to process tasks manually makes sense for a general thread pool implementation. But I could remove it if you feel strongly about it.
andrewtoth
commented at 1:00 pm on October 26, 2024:
This would make sense if we wanted to reuse this for CCheckQueue. Then, we could just loop and ProcessTask on the main thread once when we call Wait.
in
src/util/threadpool.h:85
in
46e74257f2 (outdated)
+            m_work_queue.pop();
+        }
+
+        // Execute the task without the lock
+        WITH_REVERSE_LOCK(wait_lock, task());
+    }
TheCharlatan
commented at 12:37 pm on November 25, 2023:
I don’t think it would be necessary for the tasks themselves, since you already demonstrate in the tests how to retrieve exceptions, but I think some kind of exception handling and information logging similar to that of util::TraceThread would still be nice. Did you choose not to use TraceThread on purpose?
I don’t think it would be necessary for the tasks themselves, since you already demonstrate in the tests how to retrieve exceptions, but I think some kind of exception handling and information logging similar to that of util::TraceThread would still be nice. Did you choose not to use TraceThread on purpose?
I probably wasn’t aware of TraceThread when this was implemented. Changing it. Thanks!
in
src/test/threadpool_tests.cpp:76
in
46e74257f2 (outdated)
+    // 6) Busy workers, help them by processing tasks from outside.
+
+    // Test case 1, submit tasks and verify completion.
+    {
+        int num_workers = 3;
+        int num_tasks = 50;
TheCharlatan
commented at 12:44 pm on November 25, 2023:
Could use this test to stress test the pool a bit with more tasks?
Could use this test to stress test the pool a bit with more tasks?
What about creating a fuzz test instead? I’m not sure about the benefit of adding more tasks here. I can only foresee other developers complaining about increased unit test times.
TheCharlatan
commented at 3:19 pm on September 16, 2025:
I made a small fuzz test now. If you think it is useful, and want to add it:
Will check it out, thanks! I was having fun with the code first :).
brunoerg
commented at 11:29 am on September 17, 2025:
@TheCharlatan I tried to run this target and got 0 exec/s, which is pretty bad. I was expecting poor performance due to the number of tasks and the pool overhead, btw.
in
src/test/threadpool_tests.cpp:29
in
46e74257f2 (outdated)
TheCharlatan
commented at 12:54 pm on November 25, 2023:
contributor
This all looks pretty promising. I left some feedback before continuing with the latter half of the PR.
in
src/index/base.h:29
in
7b2d4d471a (outdated)
 namespace Consensus {
 struct Params;
 }

+/** Number of concurrent jobs during the initial sync process */
+const int16_t INDEX_WORKERS_COUNT = 0;
TheCharlatan
commented at 4:22 pm on November 25, 2023:
+    }
+
     // Start threads
-    for (auto index : node.indexes) if (!index->StartBackgroundSync()) return false;
+    for (auto index : node.indexes) {
+        // todo: Only provide thread pool to indexes that supports parallel sync
TheCharlatan
commented at 4:46 pm on November 25, 2023:
In commit 7b2d4d471ae251d4c22184dda26b73993a65eff8: Could the AllowParallelSync method be moved to this commit?
Nitty nit: Keep the order of workers_count and work_chunk consistent?
What do you mean? Set workers_count first, then work_chunk, always in the same order?
in
src/index/base.cpp:247
in
727be9d500 (outdated)
+        it_start = WITH_LOCK(::cs_main, return NextSyncBlock(it_end, m_chainstate->m_chain));
+    }
+}

+// If we have only one block to process, run it directly.
+// Otherwise, this is an active-wait, so we process blocks until all workers finish.
TheCharlatan
commented at 10:38 pm on November 25, 2023:
Would “so we also process blocks in this thread until all workers finish.” be more accurate?
TheCharlatan
commented at 10:52 pm on November 25, 2023:
In commit de69506c090f530f34d78478db39898bd4e2cbba: I think the test introduction should be squashed into the following commit, or do you want to demonstrate something here first?
In commit de69506: I think the test introduction should be squashed into the following commit, or do you want to demonstrate something here first?
hmm, will squash them. I don’t remember why I split them (2 years ago).
TheCharlatan
commented at 11:01 pm on November 25, 2023:
contributor
Concept ACK.
Done with my first pass, still want to think some of the approaches here over a bit.
in
src/init.cpp:2117
in
373b0bb077 (outdated)
+    if (node.args->IsArgSet("-indexworkers")) {
+        int index_workers = node.args->GetIntArg("-indexworkers", INDEX_WORKERS_COUNT);
+        if (index_workers < 0 || index_workers > 100) return InitError(_("Invalid -indexworkers arg"));
+
+        thread_pool = std::make_shared<ThreadPool>();
+        thread_pool->Start(index_workers);
TheCharlatan
commented at 4:08 pm on November 26, 2023:
Since the constructor and Start are always called in succession in this patch set, you could make ThreadPool more RAII styled if the Start method were removed and instead placed in the constructor. Could also make sense to do the same with the destructor and Stop.
DrahtBot added the label
Needs rebase
on Dec 14, 2023
furszy
commented at 3:22 pm on February 5, 2024:
member
Focus is on #28955, which contains a good number of commits decoupled from this PR.
Will come back here after it.
achow101 referenced this in commit
0b96a1925e
on Mar 20, 2024
furszy force-pushed
on Mar 20, 2024
DrahtBot removed the label
Needs rebase
on Mar 20, 2024
DrahtBot added the label
CI failed
on Mar 21, 2024
DrahtBot
commented at 1:36 am on March 21, 2024:
contributor
🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.
Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
Leave a comment here, if you need help tracking down a confusing failure.
Sjors
commented at 5:44 pm on March 21, 2024:
member
Running it again, let’s see how quick it is…
I might set indexworkers=32 in my config file, which is great for the initial sync and if it needs to do a big catch-up. But does it cause much overhead when it’s up to date? Maybe the threads should spin down if there’s not much work.
What happens when there are multiple readBlockFromDisk calls around the same time? I don’t see any (obvious) thread locking happening in CAutoFile. Since I keep block files on a spinning disk (-blocksdir), I wonder if that potentially slows things down - compared to fetching one file in a single uninterrupted operation.
So far (block 200K) my spinning disk is making a ton of noise and CPU activity is negligible.
It took 4 hours and 23 minutes. That’s an improvement over the 5 hours 46 minutes without: #28955 (comment)
Note that last year I tested with SSD - which resulted in a 16x improvement #26966 (comment). This time I used a spinning disk. CPU activity was negligible all the way.
furszy
commented at 5:54 pm on March 21, 2024:
member
Running it again, let’s see how quick it is…
Since I might set indexworkers=32 in my config file, which is great for the initial sync and if it needs to do a big catchup. But does it cause much overhead when it’s up to date? Maybe the threads should spin down if there’s not much work.
Yeah, sure. The thread pool can be destroyed once the initial sync of all indexes finishes (once an index is synced, it starts receiving blocks through the validation signals and no longer uses the initial sync workers).
I’m currently tackling theCharlatan’s feedback, and thinking about some improvements. Will add this change too on the next push.
DrahtBot removed the label
CI failed
on Mar 21, 2024
furszy force-pushed
on Mar 23, 2024
furszy
commented at 1:34 pm on March 23, 2024:
member
Thanks for the in-depth review theCharlatan! Most comments were tackled. And thanks for testing Sjors!
Now that we’ve reached this point (after #28955 merge), I’m rethinking and polishing the design. I’m not totally convinced about the current implementation anymore. We’ve grown a lot since this was implemented two years ago.
Other than that, Sjors had a nice idea that I want to try out (or at least design this in a way so that the idea can be implemented in isolation in the future): instead of dividing the work based on block ranges, it can be divided based on block file ranges. This would minimize the needle movement on spinning disks.
DrahtBot added the label
CI failed
on Mar 23, 2024
DrahtBot
commented at 2:10 pm on March 23, 2024:
contributor
🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.
Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
Leave a comment here, if you need help tracking down a confusing failure.
Sjors
commented at 9:06 am on March 25, 2024:
member
Instead of dividing the work based on block ranges, it can be divided based on block file ranges. This would minimize the needle movement on spinning disks.
It’s possible that this can be achieved with block ranges too. But you have to make sure only one block range is read at any given time, i.e. the other threads should wait while a disk read is in progress. I suspect the problem lies in having 32 threads trying to read different things at the same time, with the operating system fetching a few kilobytes here, a few kilobytes there, etc. Though I haven’t measured this.
DrahtBot removed the label
CI failed
on Apr 4, 2024
DrahtBot
commented at 7:54 am on April 5, 2024:
contributor
🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.
Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot removed the label
CI failed
on Apr 5, 2024
DrahtBot
commented at 8:37 am on April 5, 2024:
contributor
🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.
Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot removed the label
CI failed
on Apr 5, 2024
DrahtBot added the label
CI failed
on Apr 5, 2024
furszy force-pushed
on Apr 7, 2024
DrahtBot
commented at 5:19 pm on April 17, 2024:
contributor
Are you still working on this? If not, this could be moved to draft for as long as the tsan CI is failing.
furszy marked this as a draft
on Apr 18, 2024
DrahtBot added the label
Needs rebase
on Apr 30, 2024
furszy force-pushed
on May 1, 2024
DrahtBot removed the label
Needs rebase
on May 1, 2024
DrahtBot removed the label
CI failed
on May 1, 2024
DrahtBot added the label
CI failed
on Jun 14, 2024
DrahtBot
commented at 2:16 pm on June 14, 2024:
contributor
🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the
documentation.
Possibly this is due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot added the label
Needs rebase
on Aug 5, 2024
furszy renamed this:
index: blockfilter initial sync speedup, parallelize process
index: initial sync speedup, parallelize process
on Aug 7, 2024
furszy force-pushed
on Aug 14, 2024
furszy force-pushed
on Aug 14, 2024
DrahtBot removed the label
Needs rebase
on Aug 14, 2024
hebasto added the label
Needs CMake port
on Aug 16, 2024
maflcko removed the label
Needs CMake port
on Aug 29, 2024
DrahtBot added the label
Needs rebase
on Sep 2, 2024
furszy force-pushed
on Oct 1, 2024
DrahtBot removed the label
Needs rebase
on Oct 1, 2024
in
src/util/threadpool.h:72
in
81dcee9464 (outdated)
+        {
+            // Wait for the task or until the stop flag is set
+            m_condition.wait(wait_lock, [&]() EXCLUSIVE_LOCKS_REQUIRED(cs_work_queue) { return m_interrupt || !m_work_queue.empty(); });
+
+            // If stopped, exit worker.
+            if (m_interrupt && m_work_queue.empty()) {
andrewtoth
commented at 2:54 pm on October 23, 2024:
If we receive an interrupt, we don’t also want to wait for the work queue to be empty, right?
andrewtoth
commented at 4:52 pm on October 27, 2024:
I tried this and it causes memory errors, since the remaining futures will have a dangling ref to m_condition.
In commit “util: introduce general purpose thread pool” (a382501d798c02a4dc042bc0641ce7b99c2db40f)
I tried this and it causes memory errors, since the remaining futures will have a dangling ref to m_condition.
To be clear, you tried dropping the m_work_queue.empty() condition and that didn’t work, so the current code is correct? If so we could probably mark this thread resolved (if I am not misinterpreting).
andrewtoth
commented at 10:12 pm on April 9, 2025:
Err, I still think we would want to abandon the work queue if we are interrupted instead of waiting for it to finish, no? My naive approach of not waiting for the queue to be empty does not work though.
Hmm, sorry for the very very late response @andrewtoth. I missed this message completely.
I still think we would want to abandon the work queue if we are interrupted instead of waiting for it to finish, no? My naive approach of not waiting for the queue to be empty does not work though.
Yeah, I don’t think that’s safe. Other threads might be waiting on the tasks’ futures to complete, so exiting without notifying them would leave them blocked forever.
What we could do (and I think you mentioned this elsewhere) is keep track of all the tasks’ promises internally and fail them with an interruption error/exception if the thread pool gets interrupted. In this way we also avoid lingering objects.
andrewtoth
commented at 3:02 pm on October 23, 2024:
Can we modify this to be more generic and return a type? And add logic to collect all returned values into a shared vector which can then be atomically swapped out by an observer? Possibly not in this PR, but if this will be split out into a generic thread pool.
Can we modify this to be more generic and return a type? And add logic to collect all returned values into a shared vector which can then be atomically swapped out by an observer? Possibly not in this PR, but if this will be split out into a generic thread pool.
We could keep track of the task futures’ promises, if that’s what you’re referring to. In other words, this void() function is just a wrapper that executes a generic function which sets the result inside the caller’s future.
andrewtoth
commented at 6:11 pm on November 18, 2024:
contributor
I’m using the ThreadPool here in #31132 as a cherry-picked commit, modulo changing ThreadPool() {} to ThreadPool() = default;. Perhaps we could pull this out to a separate PR since it would be useful for both changes.
One request for the ThreadPool would be to track in-flight tasks being executed. That way we could write tests that ensure all tasks have been completed before continuing, even if we don’t have access to the futures.
DrahtBot added the label
Needs rebase
on Jan 22, 2025
furszy force-pushed
on Feb 25, 2025
DrahtBot removed the label
CI failed
on Feb 25, 2025
DrahtBot removed the label
Needs rebase
on Feb 25, 2025
in
src/test/threadpool_tests.cpp:53
in
349b09983d (outdated)
+
+BOOST_AUTO_TEST_CASE(threadpool_basic)
+{
+    // Test Cases
+    // 1) Submit tasks and verify completion.
+    // 2) Maintain all threads busy except one.
yancyribbens
commented at 10:37 pm on February 26, 2025:
// 2) Maintain all busy threads except one.
DrahtBot added the label
Needs rebase
on Mar 12, 2025
in
src/index/base.cpp:223
in
c4723fb985 (outdated)
-    if (!CustomAppend(block_info)) {
-        FatalErrorf("%s: Failed to write block %s to index database",
-                    __func__, pindex->GetBlockHash().ToString());
-        return;
-    }
+    if (!ProcessBlock(pindex)) break; // error logged internally
In commit “index: remove CBlockIndex access from node internals” (c4723fb)
It seems like this should be returning, not breaking. Would be good to change or clarify with a comment
Yeah, good catch!
I think in the end it doesn’t matter much because ProcessBlock aborts the node on failure, but still, this would have logged an extra “index enabled at height” line during shutdown.
In commit “index: remove CBlockIndex access from node internals” (c4723fb9857c624ac0e1e034dc6c29d7821f5b4a)
I’m not sure this is safe. It looks like if block height is 0 this is taking a reference to a temporary object that will go out scope.
In any case I think this could be simplified to const CBlockUndo& block_undo{*Assert(block.undo_data)} because it looks like pointer will be set above even if height is 0.
In commit “index: remove CBlockIndex access from node internals” (c4723fb)
I’m not sure this is safe. It looks like if block height is 0 this is taking a reference to a temporary object that will go out scope.
In any case I think this could be simplified to const CBlockUndo& block_undo{*Assert(block.undo_data)} because it looks like pointer will be set above even if height is 0.
In commit “util: introduce general purpose thread pool” (a382501d798c02a4dc042bc0641ce7b99c2db40f)
I don’t think it makes sense for m_interrupt to use the CThreadInterrupt class because the ThreadPool class already has its own mutex and condition variable and it would be wasteful to introduce more. I think you could just replace CThreadInterrupt with bool here, and replace m_interrupt(); with m_interrupt = true and replace m_interrupt.reset() with m_interrupt = false while holding the mutex.
In commit “util: introduce general purpose thread pool” (a382501)
I don’t think it makes sense for m_interrupt to use the CThreadInterrupt class because the ThreadPool class already has its own mutex and condition variable and it would be wasteful to introduce more. I think you could just replace CThreadInterrupt with bool here, and replace m_interrupt(); with m_interrupt = true and replace m_interrupt.reset() with m_interrupt = false while holding the mutex.
sure, done as suggested. Thanks!
in
src/sync.h:302
in
a382501d79 (outdated)
@@ -299,6 +299,7 @@ inline MutexType* MaybeCheckNotHeld(MutexType* m) LOCKS_EXCLUDED(m) LOCK_RETURNED(m)
 //! The above is detectable at compile-time with the -Wreturn-local-addr flag in
 //! gcc and the -Wreturn-stack-address flag in clang, both enabled by default.
 #define WITH_LOCK(cs, code) (MaybeCheckNotHeld(cs), [&]() -> decltype(auto) { LOCK(cs); code; }())
+#define WITH_REVERSE_LOCK(cs, code) ([&]() -> decltype(auto) { REVERSE_LOCK(cs); code; }())
In commit “util: introduce general purpose thread pool” (a382501)
This seems ok, but it is only used in one place, where WITH_REVERSE_LOCK is not much simpler than plain REVERSE_LOCK. Could consider dropping it.
Sure. I think I did it this way to be very explicit about the code that will be executed without the lock, but yeah, we could achieve the same outcome with another set of brackets too.
In commit “util: introduce general purpose thread pool” (a382501d798c02a4dc042bc0641ce7b99c2db40f)
Not important, but would suggest simplifying naming and just calling these members:
Mutex m_mutex;
std::condition_variable m_cv;
The cs_ prefix is an older convention that comes from Windows code, and as long as this class is going to have one mutex there isn’t really a reason to use a more complicated name.
In commit “index: implement index parallel sync” (bc0e5211e8a914e80e68e9c700b846a4cc3ef95b)
This doesn’t seem like a great use of shared_ptr because it makes the shutdown sequence more complicated than it needs to be. I think it would be clearer if, instead of taking std::shared_ptr<ThreadPool> references, they just used ThreadPool& and a new std::unique_ptr<ThreadPool> m_index_threads member was added to NodeContext. This way the threads could be stopped with an explicit reset() call instead of shutting down more unpredictably when the last index is destroyed.
Sounds good. Done as suggested. Thanks!
in
src/index/base.cpp:242
in
bc0e5211e8outdated
238+
239+ if (parallel_sync_enabled) {
240+ const int max_blocks_to_sync = m_tasks_per_worker * m_thread_pool->WorkersCount() + m_tasks_per_worker; // extra 'm_tasks_per_worker' due the active-wait.
241+ const int tip_height = WITH_LOCK(cs_main, return m_chainstate->m_chain.Height());
242+ const int remaining_blocks = tip_height - pindex_next->nHeight;
243+ work_chunk = remaining_blocks > max_blocks_to_sync ? m_tasks_per_worker : remaining_blocks / (m_thread_pool->WorkersCount() + 1);
In commit “index: implement index parallel sync” (bc0e5211e8a914e80e68e9c700b846a4cc3ef95b)
IMO, this would be clearer if it were called blocks_per_worker or blocks_per_chunk instead of tasks_per_worker. Not knowing that a task is a block made the code harder to understand when initially reading it.
sure. Done as suggested.
in
src/index/base.cpp:238
in
bc0e5211e8outdated
234+ int work_chunk = 1;
235+ int workers_count = 0;
236+ std::vector<std::future<std::vector<std::any>>> futures;
237+ const CBlockIndex* it_start = pindex_next;
238+
239+ if (parallel_sync_enabled) {
In commit “index: implement index parallel sync” (bc0e5211e8a914e80e68e9c700b846a4cc3ef95b)
Would be really helpful to have a comment saying how this works at a high level. Would suggest something like “If parallel sync is enabled, use WorkersCount()+1 threads (including the current thread) to each process block ranges of up to m_tasks_per_worker blocks. The blocks in each range are processed in sequence by calling the index’s CustomProcessBlock method which returns std::any values that are collected into vectors. As the threads finish their work, the std::any values are processed in order by calling the index’s CustomPostProcessBlocks method, and the process repeats until no blocks are remaining to be processed and post-processed.”
I guess at a high level this seems reasonable, but perhaps too rigid. Like if there are 3 threads processing block ranges 1-10, 11-20, 21-30, and the first 2 threads finish while the third thread is slow. Why should the loop need to wait for the third thread before beginning to process blocks 31-50 and there are two idle threads doing nothing?
It seems like this idleness could be avoided by moving the CustomPostProcessBlocks calls into the worker threads. So that whenever each worker thread finishes processing blocks, it then opportunistically calls CustomPostProcessBlocks to post-process any blocks that are available (given the ordering constraint for post-processing). This way all the worker threads would continuously have work to do, and I suspect the resulting code might be simpler too since there would just be a single phase of work, not alternating Processing/PostProcessing phases.
Would be really helpful to have a comment saying how this works at a high level. Would suggest something like “If parallel sync is enabled, use WorkersCount()+1 threads (including the current thread) to each process block ranges of up to m_tasks_per_worker blocks. The blocks in each range are processed in sequence by calling the index’s CustomProcessBlock method which returns std::any values that are collected into vectors. As the threads finish their work, the std::any values are processed in order by calling the index’s CustomPostProcessBlocks method, and the process repeats until no blocks are remaining to be processed and post-processed.”
Sure done!
I guess at a high level this seems reasonable, but perhaps too rigid. Like if there are 3 threads processing block ranges 1-10, 11-20, 21-30, and the first 2 threads finish while the third thread is slow. Why should the loop need to wait for the third thread before beginning to process blocks 31-50 and there are two idle threads doing nothing?
It seems like this idleness could be avoided by moving the CustomPostProcessBlocks calls into the worker threads. So that whenever each worker thread finishes processing blocks, it then opportunistically calls CustomPostProcessBlocks to post-process any blocks that are available (given the ordering constraint for post-processing). This way all the worker threads would continuously have work to do, and I suspect the resulting code might be simpler too since there would just be a single phase of work, not alternating Processing/PostProcessing phases.
Spent a few days implementing this suggestion. It started out small but turned into a larger change than I initially expected. That said, I liked the direction and felt it was worth the extra effort. The process runs faster now.
Let me know what you think.
Design-wise, I kept everything within the Sync method, but could also encapsulate it into a separate class if preferred.
ryanofsky approved
ryanofsky
commented at 9:50 pm on April 9, 2025:
contributor
Approach ACK 349b09983d994cb46faeed12b123ae2269c6c516 and I reviewed most of the code. Seems like a nice design and good approach. Plan to finish reviewing later.
furszy force-pushed
on Jun 5, 2025
furszy
commented at 8:02 pm on June 5, 2025:
member
Thanks for the review, andrewtoth and ryanofsky!
Addressed most of the suggestions, but not all yet. Will finish the rest soon.
DrahtBot removed the label
Needs rebase
on Jun 5, 2025
DrahtBot added the label
CI failed
on Jun 5, 2025
DrahtBot
commented at 10:09 pm on June 5, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task TSan, depends, gui: https://github.com/bitcoin/bitcoin/runs/43571812170
LLM reason (✨ experimental): The CI failure is caused by a segmentation fault occurring in CBlockIndex::GetBlockPos().
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
furszy force-pushed
on Jun 5, 2025
furszy force-pushed
on Jun 6, 2025
DrahtBot removed the label
CI failed
on Jun 6, 2025
furszy
commented at 9:04 pm on June 6, 2025:
member
Decoupled part of this work inside #32694 - combining it with part of #24230.
achow101 referenced this in commit
19765dca19
on Jun 12, 2025
DrahtBot added the label
Needs rebase
on Jun 12, 2025
furszy force-pushed
on Jun 13, 2025
DrahtBot removed the label
Needs rebase
on Jun 13, 2025
in
src/index/blockfilterindex.cpp:295
in
53bc8b9663outdated
writings -> writing [incorrect verb form in "Error writings filters"]
DrahtBot added the label
CI failed
on Jun 17, 2025
DrahtBot removed the label
CI failed
on Jun 17, 2025
DrahtBot added the label
CI failed
on Jun 20, 2025
DrahtBot
commented at 0:34 am on June 20, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task tidy: https://github.com/bitcoin/bitcoin/runs/44013275482
LLM reason (✨ experimental): The CI failed due to compilation errors caused by an incorrect macro invocation in the source code.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
furszy force-pushed
on Jun 20, 2025
DrahtBot removed the label
CI failed
on Jun 20, 2025
furszy force-pushed
on Jun 23, 2025
furszy force-pushed
on Jun 23, 2025
furszy
commented at 7:15 pm on June 23, 2025:
member
Updated based on the feedback. Thanks!
I believe I’ve addressed all the comments, but let me know if I missed anything.
furszy marked this as ready for review
on Jun 23, 2025
DrahtBot added the label
CI failed
on Jun 23, 2025
DrahtBot
commented at 8:34 pm on June 23, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task TSan, depends, gui: https://github.com/bitcoin/bitcoin/runs/44630110161
LLM reason (✨ experimental): Data race detected in operator delete caused the failure of threadpool_tests.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
furszy force-pushed
on Jun 23, 2025
furszy force-pushed
on Jun 24, 2025
furszy force-pushed
on Jun 24, 2025
furszy force-pushed
on Jun 24, 2025
in
src/index/base.cpp:249
in
a3e7d97b89outdated
249+// CustomPostProcessBlocks. This continues until all blocks have been fully processed and committed.
250+//
251+// Reorgs are detected and handled before syncing begins, ensuring the index starts aligned with the active chain.
252 void BaseIndex::Sync()
253 {
254+ if (m_synced) return; // we are sync, nothing to do
“// we are sync, nothing to do” → “// we are synced, nothing to do” [‘sync’ is incorrect here; use past participle ‘synced’]
“// Log progress every often” → “// Log progress every so often” [missing ‘so’ in the idiom]
“// Commit changes every often” → “// Commit changes every so often” [same idiomatic error]
DrahtBot removed the label
CI failed
on Jun 25, 2025
furszy force-pushed
on Jun 25, 2025
DrahtBot added the label
CI failed
on Jun 25, 2025
furszy force-pushed
on Jun 26, 2025
furszy force-pushed
on Jun 26, 2025
furszy force-pushed
on Jun 26, 2025
furszy force-pushed
on Jun 26, 2025
DrahtBot removed the label
CI failed
on Jun 26, 2025
in
src/index/base.h:144
in
f6b7da2493outdated
140@@ -130,6 +141,26 @@ class BaseIndex : public CValidationInterface
141 /// Update the internal best block index as well as the prune lock.
142 void SetBestBlockIndex(const CBlockIndex* block);
143
144+ /// If 'AllowParallelSync()' retrieves true, 'ProcessBlock()' will run concurrently in batches.
retrieves -> returns [‘retrieves true’ is awkward; standard terminology is ‘returns true’]
will be already be logged -> will already be logged [duplicate “be”]
achow101 referenced this in commit
528f79f010
on Jul 8, 2025
DrahtBot added the label
Needs rebase
on Jul 8, 2025
furszy force-pushed
on Jul 8, 2025
DrahtBot removed the label
Needs rebase
on Jul 8, 2025
DrahtBot added the label
Needs rebase
on Jul 14, 2025
Sjors
commented at 5:28 pm on July 14, 2025:
member
DrahtBot removed the label
Needs rebase
on Jul 14, 2025
DrahtBot added the label
CI failed
on Jul 15, 2025
DrahtBot
commented at 1:26 am on July 15, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task TSan, depends, gui: https://github.com/bitcoin/bitcoin/runs/45962526520
LLM reason (✨ experimental): The CI failure is caused by a data race detected in operator delete, leading to a crash during threadpool_tests.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
furszy force-pushed
on Jul 15, 2025
furszy force-pushed
on Jul 15, 2025
DrahtBot removed the label
CI failed
on Jul 15, 2025
Sjors
commented at 8:48 am on July 17, 2025:
member
furszy
commented at 1:48 pm on July 17, 2025:
member
I tried to use this to boost the silent payment indexer, but I don’t know what I’m doing :-) Sjors#96
@Sjors, see https://github.com/furszy/bitcoin-core/commits/2025_bip352_blind_fix.
Note: I only spent a few minutes with it and the tests seem to pass (I’m partially afk these days). Let me know how it goes and I can check it in detail next week.
The first commit there (413ce51bf10326a6c56bd4250f9a9f19fde44ed4) is merely a code improvement + cleanup, because you don’t need to re-read the undo data inside the child class anymore since #32694.
And the last commit (b8883fa1fd76fbf3c3d3c10f702bea24abb42dc9) enables it by overriding CustomProcessBlock() (same as it was implemented for the tx index parallelization e8add40fbfbda9955f9b1cf998732ca03787eb72).
in
src/util/threadpool.h:44
in
7d984c3b2doutdated
39+ std::function<void()> task;
40+ {
41+ // Wait for the task or until the stop flag is set
42+ m_cv.wait(wait_lock,[&]() EXCLUSIVE_LOCKS_REQUIRED(m_mutex) { return m_interrupt.load() || !m_work_queue.empty(); });
43+
44+ // If stopped, exit worker.
ismaelsadeeq
commented at 8:44 pm on July 21, 2025:
In “util: introduce general purpose thread pool” 7d984c3b2dca085ef7f49d21568b06c7b89c7807
Hmm I can imagine that this can be blocking when you want to stop instantly; but I guess there is a reason why you did not return here and then empty the work queue in Stop.
Hmm I can imagine that this can be blocking when you want to stop instantly; but I guess there is a reason why you did not return here and then empty the work queue in Stop.
Yes. We need to fulfill all promises so there are no dangling futures waiting for the worker to finish executing the task.
In the future, we could avoid this by tracking all the promises and triggering a “shutdown” exception during stop.
in
src/util/threadpool.h:49
in
7d984c3b2doutdated
44+ // If stopped, exit worker.
45+ if (m_interrupt && m_work_queue.empty()) {
46+ return;
47+ }
48+
49+ // Pop the task
ismaelsadeeq
commented at 8:45 pm on July 21, 2025:
In “util: introduce general purpose thread pool” 7d984c3b2dca085ef7f49d21568b06c7b89c7807
I think this and other verbose comments can be removed; they are quite obvious.
What might need comment is Submit due to the abstraction there.
518@@ -517,6 +519,7 @@ void SetupServerArgs(ArgsManager& argsman, bool can_listen_ipc)
519 strprintf("Maintain an index of compact filters by block (default: %s, values: %s).", DEFAULT_BLOCKFILTERINDEX, ListBlockFilterTypes()) +
520 " If <type> is not supplied or if <type> = 1, indexes for all known types are enabled.",
521 ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
522+ argsman.AddArg("-indexworkers=<n>", strprintf("Number of worker threads spawned for the indexes initial sync process (default: %d).", INDEX_WORKERS_COUNT), ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
ismaelsadeeq
commented at 8:56 pm on July 21, 2025:
In " init: provide thread pool to indexes " 82fa9d29653b445118fc2d03e2ced520a5d4c7dc
Define the max and then use it here and in checking the bounds
Since the maximum number of threads for indexes should depend on how many threads Core has at runtime and the number of available processors, I don’t think hardcoding it here is the best approach. I agree with adding it on the error side: #26966 (review)
Wonder if it should be specified that threadpool is shared among all indexers (that support multithreading)
Done as suggested.
in
src/index/base.h:32
in
82fa9d2965outdated
22 namespace interfaces {
23 class Chain;
24 } // namespace interfaces
25
26+/** Number of concurrent jobs during the initial sync process */
27+static constexpr int16_t INDEX_WORKERS_COUNT = 0;
ismaelsadeeq
commented at 8:56 pm on July 21, 2025:
In " init: provide thread pool to indexes " 82fa9d29653b445118fc2d03e2ced520a5d4c7dc
Parallel sync is disabled by default. We’re currently not tracking the number of threads spawned by Core, so I chose not to make assumptions here (don’t want indexes threads competing with net/validation ones). It’s safer to let users specify the appropriate number for their setup. In the future, we could improve this by adding thread tracking object/mechanism and picking up the best number for their machine.
ismaelsadeeq
commented at 9:54 pm on July 22, 2025:
Makes sense. I think we can make the bound not just an arbitrary number but tie it to the number of cores on the machine.
That will prevent a footgun whereby users spawn more threads than the machine can handle, leading to degraded performance due to lots of context switching.
ismaelsadeeq
commented at 9:03 pm on July 21, 2025:
member
Concept ACK, did a quick pass through this.
Will do in-depth review soon
Now the juicy part:
What is the step to reproduce your result?
Side note to self could be useful to try benchmarking this using benchkit so that anyone can reproduce using the yaml config?
furszy
commented at 5:58 pm on July 23, 2025:
member
What is the step to reproduce your result?
Sync your node without any index.
Restart the node with block filter or txindex enabled and let it run (you could also set -connect=0 to sync only the index, without running the net/validation threads. Since threads won’t be competing for cs_main, this will give you a more accurate result).
You’ll see a “[index name] is enabled at height [height]” log entry once it finishes. Then it’s just a matter of subtracting the index startup time from the “index synced” log time.
Unfortunately, there’s no “stop at block” option like we have for chain sync, so this process is a bit more manual and has some variances depending on where/how you run it. But you will see an overall significant speedup anyway.
Note: I should update my results in the PR description. Those were compiled three years ago on a small VPS, and we’ve introduced many changes since then.
Why not just +1 in each task like you do in the next test block?
Hmm, good question. I was probably not only testing that all tasks were executed, but also that each of them was executed only once (if they were all doing the same, it would be hard to know if they were all executed). Or.. maybe I just wanted to mention Gauss somewhere in our code.
I did this one 3 years ago so.. it is hard to know. But it doesn’t hurt to have it.
in
src/test/threadpool_tests.cpp:94
in
d2138bef76outdated
88+
89+ for (auto& fut : futures) fut.wait();
90+ BOOST_CHECK_EQUAL(counter.load(), num_tasks);
91+
92+ blocker.set_value();
93+ for (auto& t : blocking_tasks) t.wait(); // ensure blocking tasks finish too
would there be any benefit here to incrementing the counter at the end of each blocking task and check that they executed properly when unblocked?
It’s subtle but we’re doing the same with the wait here. The blocking_tasks vector contains the futures whose promises are set only when the worker finishes executing the task (meaning it has run all its code).
in
src/test/threadpool_tests.cpp:202
in
d2138bef76outdated
170+ for (int i = 0; i < num_tasks; i++) {
171+ threadPool.Submit([&counter]() {
172+ counter.fetch_add(1);
173+ });
174+ }
175+ std::this_thread::sleep_for(std::chrono::milliseconds{100});
is this sleep to give tasks a chance to execute if the blocking breaks?
Good eye.
IIRC, I added it to wait until the workers actually get blocked. Otherwise, the thread pool queue size would be greater than the expected value (because the workers had not yet consumed the blocking tasks).
Still, we could remove this by adding some ready_promises, as done in test 2. A bit more code, but less fragile.
in
src/test/threadpool_tests.cpp:209
in
d2138bef76outdated
177+
178+ // Now process manually
179+ for (int i = 0; i < num_tasks; i++) {
180+ threadPool.ProcessTask();
181+ }
182+ BOOST_CHECK_EQUAL(counter.load(), num_tasks);
Could anything be gained by checking the counter after each ProcessTask()?
I don’t think so. We could maybe improve it by making the task return something and checking the futures’ promises, but I’m not totally convinced it will add much value.
in
src/init.cpp:2170
in
b360102619outdated
2164@@ -2160,7 +2165,19 @@ bool StartIndexBackgroundSync(NodeContext& node)
2165 }
2166 }
2167
2168+ if (node.args->IsArgSet("-indexworkers")) {
2169+ int index_workers = node.args->GetIntArg("-indexworkers", INDEX_WORKERS_COUNT);
2170+ if (index_workers < 0 || index_workers > MAX_INDEX_WORKERS_COUNT) return InitError(Untranslated(strprintf("Invalid -indexworkers arg. Must be a number in-between 1 and %d", MAX_INDEX_WORKERS_COUNT)));
What’s the benefit of defining this as a lambda instead of just moving the code inside func_worker? It doesn’t seem to be called anywhere else …?
It was like that at first, but I found it harder to follow and wanted to simplify func_worker as much as possible.
You can try moving it back in and you will surely see what I’m referring to.
in
src/index/base.h:147
in
76aa1c2da9outdated
142@@ -138,6 +143,26 @@ class BaseIndex : public CValidationInterface
143 /// Update the internal best block index as well as the prune lock.
144 void SetBestBlockIndex(const CBlockIndex* block);
145
146+ /// If 'AllowParallelSync()' returns true, 'ProcessBlock()' will run concurrently in batches.
147+ /// The 'std::any' result will be passed to 'PostProcessBlocks()' so the index can process
The corresponding line in the serial/legacy processing path asserts the block data (I don’t care much about that) but I do think you should keep the variable for the filter type if ever one day a new filter index is added:
142@@ -138,6 +143,26 @@ class BaseIndex : public CValidationInterface
143 /// Update the internal best block index as well as the prune lock.
144 void SetBestBlockIndex(const CBlockIndex* block);
145
146+ /// If 'AllowParallelSync()' returns true, 'ProcessBlock()' will run concurrently in batches.
147+ /// The 'std::any' result will be passed to 'PostProcessBlocks()' so the index can process
148+ /// async result batches in a synchronous fashion (if required).
149+ [[nodiscard]] virtual std::any CustomProcessBlock(const interfaces::BlockInfo& block_info) {
150+ // If parallel sync is enabled, the child class must implement this method.
151+ if (AllowParallelSync()) return std::any();
I wonder if this condition should be a bit more attention-getting, since it should never execute right? Either log something or Assume() for the benefit of future index developers?
I wonder if this condition should be a bit more attention-getting, since it should never execute right? Either log something or Assume() for the benefit of future index developers?
Hmm, or we could remove this line and call CustomAppend by default. Then it would be up to the child index to decide whether it needs sequential ordering during parallel sync or not. This way, the tx and BIP352 indexes would require fewer lines of code.
in
src/util/threadpool.h:76
in
d2138bef76outdated
71+ if (!m_workers.empty()) throw std::runtime_error("Thread pool already started");
72+ m_interrupt.store(false); // Reset
73+
74+ // Create workers
75+ for (int i = 0; i < num_workers; i++) {
76+ m_workers.emplace_back(&util::TraceThread, "threadpool_worker_" + util::ToString(i), [this] { WorkerThread(); });
I’d like to see ThreadPool reused in the codebase (for example, as http workers) which makes me think the class should also have a custom name property for logging and process monitoring. (e.g. index_worker_thread_1 and http_worker_thread_1)
Built and tested on macos/arm64 as well as Debian/x86. Left several questions and suggestions. I really like ThreadPool as a util and look forward to using it for http workers as well, to share the code.
I’m currently rebuilding block filter and tx indexes with this branch to compare against master, which after 48 hours was only about 70% complete…
DrahtBot requested review from w0xlt
on Jul 25, 2025
DrahtBot requested review from TheCharlatan
on Jul 25, 2025
DrahtBot requested review from ryanofsky
on Jul 25, 2025
DrahtBot requested review from ismaelsadeeq
on Jul 25, 2025
furszy force-pushed
on Jul 25, 2025
pinheadmz
commented at 11:01 am on July 27, 2025:
member
Did a rough benchmark test. Froze a full node at height 906551 by restarting with -noconnect and also -txindex -blockfilterindex. First test was from master, after 48 hours I aborted the process after reaching only:
This is bike shedding now so ok to ignore – but when we set thread names on Linux they are truncated to 15 characters. So I dunno I guess “thread” and “worker” are redundant in the name of a thread?
So I dunno I guess “thread” and “worker” are redundant in the name of a thread?
Sure, done. Now the previous “indexes_threadpool_worker_[num]” will be “indexes_pool_[num]”.
pinheadmz approved
pinheadmz
commented at 2:53 pm on July 29, 2025:
member
re-ACK 9bba5ef0cc5b6890eba2be3a6ed429c4fea5a28f
Changes since last review are minimal responses to my own review suggestions. Built on macos/arm64 and debian/x86. Ran functional and unit tests, ran with -indexworkers=16 on mainnet fullnode with >900000 blocks
DashCoreAutoGuix referenced this in commit
55c5a2a348
on Jul 31, 2025
furszy force-pushed
on Aug 7, 2025
DrahtBot added the label
CI failed
on Aug 8, 2025
DrahtBot
commented at 0:19 am on August 8, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task TSan, depends, no gui: https://github.com/bitcoin/bitcoin/runs/47637510992
LLM reason (✨ experimental): The CI failure is due to the threadpool_tests timing out.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
furszy force-pushed
on Aug 8, 2025
furszy force-pushed
on Aug 8, 2025
DrahtBot removed the label
CI failed
on Aug 8, 2025
DashCoreAutoGuix referenced this in commit
3ee82f3b2b
on Aug 9, 2025
ismaelsadeeq
commented at 11:19 am on August 12, 2025:
member
I ran the node on mainnet with -noconnect -txindex -blockfilterindex -indexworkers=4
Although I was not able to later reproduce the crash, it seems to happen intermittently. cc @furszy
furszy force-pushed
on Aug 12, 2025
furszy
commented at 8:34 pm on August 12, 2025:
member
Although I was not able to later reproduce the crash it seems to be happen intermittently
Thanks for the report! Check it now if you can. There was a very subtle bug where the worker threads might have accessed an index Sync() local variable after its destruction.
DrahtBot added the label
Needs rebase
on Aug 19, 2025
furszy force-pushed
on Aug 20, 2025
DrahtBot added the label
CI failed
on Aug 20, 2025
DrahtBot
commented at 11:26 am on August 20, 2025:
contributor
🚧 At least one of the CI tasks failed.
Task MSan, depends: https://github.com/bitcoin/bitcoin/runs/48475050585
LLM reason (✨ experimental): The CI failure is caused by a test timeout due to the index not reaching the expected height within the allowed time.
Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being
incompatible with the current code in the target branch). If so, make sure to rebase on the latest
commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the
affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot removed the label
Needs rebase
on Aug 20, 2025
furszy force-pushed
on Aug 30, 2025
furszy force-pushed
on Aug 30, 2025
DrahtBot removed the label
CI failed
on Aug 30, 2025
furszy force-pushed
on Sep 1, 2025
furszy force-pushed
on Sep 17, 2025
furszy force-pushed
on Sep 17, 2025
furszy force-pushed
on Sep 18, 2025
in
src/util/threadpool.h:160
in
9dbea51ab5outdated
andrewtoth
commented at 5:09 pm on October 6, 2025:
If we wait() or get() on the returned future, does that imply a release/acquire memory ordering or a relaxed memory ordering? I can’t seem to find out what this means for other non-atomic memory that was written on the worker thread before the task is completed.
If we wait() or get() on the returned future, does that imply a release/acquire memory ordering or a relaxed memory ordering? I can’t seem to find out what this means for other non-atomic memory that was written on the worker thread before the task is completed.
For the returned value, wait()/get() provide release/acquire semantics. All updates performed by the task should be visible after get() returns.
Now, if the task modifies other objects that might be accessed concurrently by threads that don’t wait on the future, I’m pretty sure you need to explicitly synchronize access to those objects.
Maybe show an example of how you’re using it, and we could reason about it.
andrewtoth
commented at 7:10 pm on October 6, 2025:
andrewtoth
commented at 7:22 pm on October 6, 2025:
All updates performed by the task should be visible after get() returns.
Hmm so in that case, then the diff I linked above is correct? The main thread is the one waiting on the future, and the non-synchronized vector is modified only by the worker thread. So the change to the vector should be visible on the main thread.
For the returned value, wait()/get() provide release/acquire semantics.
So you mean not for the returned value, but for any thread that calls get()/wait()? Having the return value be visible is sufficient with just a relaxed memory ordering.
So I’m confused whether other memory modified by the thread will be visible (release/acquire), or just the returned value (relaxed).
Sure, so I tried updating #31132 to use the thread pool here. It makes the change simpler if we were to already have this thread pool merged. However, the threads write to non-synchronized shared vectors, which the main thread will read, so we need to make sure the writes happen before the reads. I initially stored the futures and waited on them like this: andrewtoth@c256f1b#diff-3eff97d24109f473292b38a6495c5b51e8bda9f4bff9b691cf255968ba80d85dR151-R164. Since there is no happens-before guarantee though, I think we still need a completion semaphore that each thread releases instead.
Have you tried something like (haven’t tried it yet, will give it a run tomorrow):
```diff
diff --git a/src/inputfetcher.h b/src/inputfetcher.h
--- a/src/inputfetcher.h	(revision c256f1b457cbd5b900aa34703eb5853d2449bcde)
+++ b/src/inputfetcher.h	(date 1759781954442)
@@ -43,15 +43,6 @@
      */
     std::atomic<size_t> m_input_counter{0};
 
-    /**
-     * The vector of vectors of outpoint:coin pairs.
-     * Each thread writes the coins it fetches to the vector at its thread
-     * index. This way multiple threads can write concurrently to different
-     * vectors in a thread safe way. After all threads are finished, the main
-     * thread can loop through all vectors and write the coins to the cache.
-     */
-    std::vector<std::vector<std::pair<COutPoint, Coin>>> m_coins{};
-
     /**
      * The set of txids of all txs in the block being fetched.
      * This is used to filter out inputs that are created in the block,
@@ -69,15 +60,15 @@
     const size_t m_worker_thread_count;
     ThreadPool m_thread_pool{"inputfetcher"};
 
-    void Work(size_t thread_index) noexcept
+    std::vector<std::pair<COutPoint, Coin>> Work(size_t thread_index) noexcept
     {
         const auto inputs_count{m_inputs.size()};
-        auto& coins{m_coins[thread_index]};
+        std::vector<std::pair<COutPoint, Coin>> coins;
         try {
             while (true) {
                 const auto input_index{m_input_counter.fetch_add(1, std::memory_order_relaxed)};
                 if (input_index >= inputs_count) {
-                    return;
+                    return {};
                 }
                 const auto [tx_index, vin_index] = m_inputs[input_index];
                 const auto& outpoint{m_block->vtx[tx_index]->vin[vin_index].prevout};
@@ -96,7 +87,7 @@
                     // Missing an input. This block will fail validation.
                     // Skip remaining inputs.
                     m_input_counter.store(inputs_count, std::memory_order_relaxed);
-                    return;
+                    return {};
                 }
             }
         } catch (const std::runtime_error&) {
@@ -104,6 +95,8 @@
             // Skip remaining inputs.
             m_input_counter.store(inputs_count, std::memory_order_relaxed);
         }
+
+        return coins;
     }
 
 public:
@@ -116,10 +109,6 @@
             return;
         }
         m_thread_pool.Start(worker_thread_count);
-        m_coins.reserve(worker_thread_count + 1);
-        for (size_t n{0}; n < worker_thread_count + 1; ++n) {
-            m_coins.emplace_back();
-        }
     }
 
     //! Fetch all block inputs from db, and insert into cache.
@@ -148,25 +137,30 @@
 
         // Set the input counter and wake threads.
         m_input_counter.store(0, std::memory_order_relaxed);
-        std::vector<std::future<void>> futures;
+        std::vector<std::future<std::vector<std::pair<COutPoint, Coin>>>> futures;
         futures.reserve(m_worker_thread_count);
         for (size_t n{0}; n < m_worker_thread_count; ++n) {
             futures.emplace_back(m_thread_pool.Submit([this, n]() {
-                Work(n);
+                return Work(n);
             }));
         }
 
         // Have the main thread work too while we wait for other threads
-        Work(m_worker_thread_count);
+        std::vector<std::vector<std::pair<COutPoint, Coin>>> coins;
+        coins.reserve(m_worker_thread_count + 1);
+        coins[m_worker_thread_count] = Work(m_worker_thread_count);
 
         // Wait for all worker threads to complete
-        for (const auto& future : futures) {
-            future.wait();
+        for (size_t i = 0; i < futures.size(); i++) {
+            coins[i] = futures[i].get();
         }
 
         // At this point all threads are done writing to m_coins, so we can
        // safely read from it and insert the fetched coins into the cache.
-        for (auto& thread_coins : m_coins) {
+        for (auto& thread_coins : coins) {
+            if (thread_coins.empty()) {
+                // TODO: report failure..
+            }
             for (auto&& [outpoint, coin] : thread_coins) {
                 cache.EmplaceCoinInternalDANGER(std::move(outpoint),
                                                 std::move(coin),
```
andrewtoth
commented at 8:24 pm on October 6, 2025:
It went well, thanks! 0030dc5
Sorry for hijacking the PR :)
hehe np, I owe you a few reviews anyway.
in src/util/threadpool.h:43 in 0afb05cad2 (outdated)
38+ WAIT_LOCK(m_mutex, wait_lock);
39+ for (;;) {
40+ std::function<void()> task;
41+ {
42+ // Wait only if needed; avoid sleeping when a new task was submitted while we were processing another one.
43+ if (!m_interrupt.load() && m_work_queue.empty()) {
andrewtoth
commented at 4:11 pm on October 7, 2025:
Any reason we can’t use memory_order_relaxed on loads and stores of m_interrupt?
I recall briefly thinking about it, but going down that path would mean guarding all m_interrupt accesses with m_mutex too. And that seemed like an unnecessary overhead for methods like Submit that should be performing a lightweight interruption check before submission.
andrewtoth
commented at 8:28 pm on October 7, 2025:
going down that path would mean guarding all m_interrupt accesses with m_mutex too
Hmm I don’t see why we would need to do that? If we wanted to make m_interrupt a non-atomic bool we would need to do that. But if it’s atomic we can use relaxed since it’s just a flag and not acting as a fence to any other non-atomic memory?
Hmm, but what about the visibility of the flag update?
If all m_interrupt loads were relaxed, Submit() could read a stale false value even after Stop() set it to true, right?
This would allow a task to be enqueued during shutdown, leaving a lingering task that never gets executed, which would hang the caller on the future’s get()/wait() call forever and stall the shutdown procedure.
andrewtoth
commented at 9:16 pm on October 7, 2025:
I don’t see how memory ordering solves that? Consider this exact line where the Submit thread is suspended:
```cpp
auto Submit(T task) -> std::future<decltype(task())>
{
    if (m_workers.empty() || m_interrupt.load()) throw std::runtime_error("No active workers; cannot accept new tasks");
    // ---> Stop() is called right here on another thread, setting m_interrupt to true,
    //      and all threads exit before this thread continues execution
    using TaskType = std::packaged_task<decltype(task())()>;
```
I think we need to say instead that all public methods must be called on the same thread. That way we can use relaxed ordering, since Submit and Stop cannot be called by different threads so this won’t happen.
andrewtoth
commented at 10:40 pm on October 7, 2025:
If all m_interrupt loads were relaxed, Submit() could read a stale false value even after Stop() set it to true, right?
I don’t think it could? Once a write to true becomes visible to a thread, it cannot be read again as false until something writes false to it again. This is independent of memory ordering. Relaxed memory ordering can reorder when writes to different data become visible to a thread. For instance:
```cpp
atomic<bool> x{false}, y{false};

// thread 1 stores true to x first then y
x.store(true, relaxed);
y.store(true, relaxed);

// thread 2
y.load(relaxed); // Could be true
x.load(relaxed); // Could still be false even if y above is already visible as true
y.load(relaxed); // Cannot be false again after already reading true above
```
Of course this is only relevant if there are other threads that will be reading and writing. Since the same thread will be calling Stop, Start, and Submit, there is no need to be concerned with this.
The only reason m_interrupt needs to be atomic is because it is also being read from worker threads, and those are reading it while synchronized with m_mutex.
andrewtoth
commented at 11:28 pm on October 7, 2025:
Thinking about this more, since m_workers is not synchronized it implies Stop and Submit must be called on the same thread. Therefore m_workers.empty() is sufficient to test if the pool is stopped since Stop() must complete before a subsequent Submit(). So m_interrupt is redundant in if (m_workers.empty() || m_interrupt.load()), and m_interrupt can just be made a regular bool guarded by m_mutex. Start would have to set it with the mutex locked, but that is infrequent compared to all the loads done in Submit and WorkerThread.
Nice thread of thoughts! Will try to go over the comments.
Regarding Submit being called only from the initial controller thread:
Right now, we are actually submitting tasks from threads that are not the initial one. Each index starts its own “background sync” thread that is responsible for computing the tasks submitted to the thread pool.
I don’t think it’s reasonable to expect a generic thread pool to only accept submissions from a single thread. We must be able to submit tasks from any thread. I’ve adapted the code to reflect this.
Regarding the Start and Stop restriction to the initial controller thread:
Agree. In fact, calling Stop from a worker thread would actually deadlock. I’ve added documentation clearly expressing this requirement.
Could extend this and check the thread id during shutdown too, but it seemed like overkill.
Regarding Submit after shutdown:
Yeah, good eye there. I think the simplest solution is to move the interruption check inside the mutex guard, just before enqueuing the task. This is not really going to happen in practice anyway, and even if it does, the extra overhead of creating and destroying the wrapper is minimal since the app will be shutting down.
Thanks for the thread of thoughts! I have updated the class to reflect all of this.
andrewtoth
commented at 3:24 pm on October 8, 2025:
Ok, but now we also need std::vector<std::thread> m_workers; guarded by m_mutex. Since it can be written by Stop and also read in Submit. So it also needs to be locked at the end of Stop and in WorkersCount.
True. Pushed. Thanks!
in src/init.cpp:536 in 99a4d06154 (outdated)
@@ -530,6 +533,7 @@ void SetupServerArgs(ArgsManager& argsman, bool can_listen_ipc)
533 strprintf("Maintain an index of compact filters by block (default: %s, values: %s).", DEFAULT_BLOCKFILTERINDEX, ListBlockFilterTypes()) +
534 " If <type> is not supplied or if <type> = 1, indexes for all known types are enabled.",
535 ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
536+ argsman.AddArg("-indexworkers=<n>", strprintf("Number of worker threads spawned for the initial index synchronization (default: %d). These threads are shared across all indexes", INDEX_WORKERS_COUNT), ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
mzumsande
commented at 8:17 pm on October 7, 2025:
I tested this for signet on two computers, and while parallel sync (5 threads) was a great speedup on an SSD, I also observed a slowdown on an HDD compared to master (by a factor of 2).
Presumably that’s because reading the blocks from disk is the main bottleneck on an HDD, and with parallel indexing there is a lot of jumping back and forth, increasing seek time.
Should it be mentioned in the -indexworkers help that it is not advisable to use this option on a HDD?
andrewtoth
commented at 11:50 am on October 8, 2025:
This would require m_workers to be synchronized somehow, since it is written in Stop and Start but read in Submit.
Edit: actually this will work, since m_workers won’t be written until after the signal inside the last task. But, if Stop was called manually before waiting for the signal it could cause a deadlock.
Is there value in making the ThreadPool safe for recursive task submission? We might not be able to implement certain divide and conquer algorithms, but we might not need to.
andrewtoth
commented at 1:23 pm on October 8, 2025:
I don’t think so. In that case, should your test have some warnings that this can be unsafe if the thread pool is stopped or destroyed before all calls to Submit have returned?
We can modify the test to make the parent task wait for the child task to complete before returning; Stop waits for all workers to complete before clearing the workers vector. That is, of course, if @furszy determines that there is any point to a recursive test at all, since we are not designing for that.
Yeah, I found the test worthwhile. Taken. Thanks for sharing! I had something similar on the fuzz test.
Allowing task submission from any thread should be a must for me. It otherwise defeats the purpose of having a shared thread pool if you can only submit tasks from a single spot.
This also pushed me further. Added another test for the case where the pool is about to shut down while workers are all busy and a thread is concurrently submitting a new task.
in src/util/threadpool.h:76 in 0afb05cad2 (outdated)
48+ // If stopped and no work left, exit worker
49+ if (m_interrupt.load() && m_work_queue.empty()) {
50+ return;
51+ }
52+
53+ task = std::move(m_work_queue.front());
Is it not beneficial for a generic ThreadPool to support assigning tasks in batches to threads? I expect this to be useful when submitting many small tasks.
I don’t think that’s the thread pool’s responsibility. Generic thread pools are designed to execute tasks efficiently, but they don’t dictate their granularity (mainly because they don’t know the content of the task). If batching is beneficial, the caller should handle it before submission (that’s actually what I’m doing here for the indexes).
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 8, 2025
in src/util/threadpool.h:124 in e780f6fdf3 (outdated)
119+ */
120+ void Stop() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
121+ {
122+ std::vector<std::thread> threads_to_join;
123+ // Notify workers and join them.
124+ // Note: even when m_interrupt is atomic, it must be modified while holding the same mutex
andrewtoth
commented at 3:57 pm on October 8, 2025:
Oops, updated. Thanks.
Also moved the condition variable comment above the member declaration (this caused a nasty bug that we shouldn’t forget).
furszy force-pushed
on Oct 8, 2025
in src/util/threadpool.h:154 in 3c4c3cbe09 (outdated)
146+ using TaskType = std::packaged_task<decltype(task())()>;
147+ auto ptr_task = std::make_shared<TaskType>(std::move(task));
148+ std::future<decltype(task())> future = ptr_task->get_future();
149+ {
150+ LOCK(m_mutex);
151+ if (m_workers.empty() || m_interrupt) {
andrewtoth
commented at 4:07 pm on October 8, 2025:
These 2 checks are redundant since they are both set atomically. Should just pick one for clarity.
andrewtoth
commented at 4:10 pm on October 8, 2025:
Actually, we must use the worker count check, since we can Start with 0 worker threads but still set interrupt to false, so this would deadlock with only the interrupt check.
Maybe guard against starting with 0 threads and then pick either?
I’m tempted to leave it as is, mainly because we might want to add Interrupt() and Resume() methods in the future that don’t kill the worker threads. We might want to clear the queue due to an early failure in one of our tasks without having to recreate the threads.
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 8, 2025
furszy force-pushed
on Oct 9, 2025
furszy force-pushed
on Oct 9, 2025
furszy
commented at 6:41 pm on October 9, 2025:
member
Updated with a fuzz test case from @TheCharlatan. Thanks!
brunoerg
commented at 5:13 pm on October 10, 2025:
contributor
mzumsande
commented at 7:22 pm on October 10, 2025:
I don’t understand the need for the sanity cleanup. Shouldn’t the logic from WorkerThread guarantee that m_work_queue is empty at this point, so that we could just assert that here?
Yeah, right now it is not really necessary. Have explained the context and the rationale here: #26966 (review)
in src/test/threadpool_tests.cpp:52 in 3943630a4c (outdated)
46+}
47+
48+BOOST_AUTO_TEST_CASE(threadpool_basic)
49+{
50+ // Test Cases
51+ // 1) Submit tasks and verify completion.
mzumsande
commented at 8:33 pm on October 10, 2025:
in src/util/threadpool.h:167 in 3943630a4c (outdated)
169+
170+ /**
171+ * @brief Execute a single queued task synchronously.
172+ * Removes one task from the queue and executes it on the calling thread.
173+ */
174+ void ProcessTask() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
mzumsande
commented at 8:45 pm on October 10, 2025:
Is this meant to be used by tests only, or are there possible other use cases? If it’s the latter, it might be helpful to return a bool that indicates whether a task was executed or not (empty queue).
@@ -2252,7 +2259,19 @@ bool StartIndexBackgroundSync(NodeContext& node)
2259         }
2260     }
2261
2262+    if (node.args->IsArgSet("-indexworkers")) {
2263+        int index_workers = node.args->GetIntArg("-indexworkers", INDEX_WORKERS_COUNT);
2264+        if (index_workers < 0 || index_workers > MAX_INDEX_WORKERS_COUNT) return InitError(Untranslated(strprintf("Invalid -indexworkers arg. Must be a number between 0 and %d", MAX_INDEX_WORKERS_COUNT)));
mzumsande
commented at 9:00 pm on October 10, 2025:
-indexworkers=0 is allowed according to the doc, but it will trigger the assert num_workers > 0 in the ThreadPool. Shouldn’t create a ThreadPool in this case.
in src/util/threadpool.h:154 in 3943630a4c (outdated)
149+ */
150+ template<class T> EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
151+ auto Submit(T task) -> std::future<decltype(task())>
152+ {
153+ using TaskType = std::packaged_task<decltype(task())()>;
154+ auto ptr_task = std::make_shared<TaskType>(std::move(task));
ryanofsky
commented at 9:08 pm on October 10, 2025:
In commit “util: introduce general purpose thread pool” (3943630a4cea56b040377b33476803745055510c)
Would be nice to avoid shared_ptr here. Apparently it is needed because std::function objects are required to be copyable and packaged tasks aren’t copyable. Could fix this by avoiding std::function which would also simplify the Submit() function:
```diff
--- a/src/util/threadpool.h
+++ b/src/util/threadpool.h
@@ -47,7 +47,7 @@ class ThreadPool {
 private:
     std::string m_name;
     Mutex m_mutex;
-    std::queue<std::function<void()>> m_work_queue GUARDED_BY(m_mutex);
+    std::queue<std::packaged_task<void()>> m_work_queue GUARDED_BY(m_mutex);
     std::condition_variable m_cv;
     // Note: m_interrupt must be modified while holding the same mutex used by threads waiting on the condition variable.
     // This ensures threads blocked on m_cv reliably observe the change and proceed correctly without missing signals.
@@ -59,7 +59,7 @@
     {
         WAIT_LOCK(m_mutex, wait_lock);
         for (;;) {
-            std::function<void()> task;
+            std::packaged_task<void()> task;
             {
                 // Wait only if needed; avoid sleeping when a new task was submitted while we were processing another one.
                 if (!m_interrupt && m_work_queue.empty()) {
@@ -135,7 +135,7 @@
         {
             // Sanity cleanup: release any std::function captured shared_ptrs
             LOCK(m_mutex);
-            std::queue<std::function<void()>> empty;
+            std::queue<std::packaged_task<void()>> empty;
             m_work_queue.swap(empty);
         }
         // Note: m_interrupt is left true until next Start()
@@ -147,21 +147,17 @@
      * Enqueues a callable to be executed by one of the worker threads.
      * Returns a `std::future` that can be used to retrieve the task’s result.
      */
-    template<class T> EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
-    auto Submit(T task) -> std::future<decltype(task())>
+    template<class F> EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
+    auto Submit(F&& fn)
     {
-        using TaskType = std::packaged_task<decltype(task())()>;
-        auto ptr_task = std::make_shared<TaskType>(std::move(task));
-        std::future<decltype(task())> future = ptr_task->get_future();
+        std::packaged_task task{std::forward<F>(fn)};
+        auto future{task.get_future()};
         {
             LOCK(m_mutex);
             if (m_workers.empty() || m_interrupt) {
                 throw std::runtime_error("No active workers; cannot accept new tasks");
             }
-            m_work_queue.emplace([ptr_task]() mutable {
-                (*ptr_task)();
-                ptr_task.reset(); // Explicitly release packaged_task and the stored function obj.
-            });
+            m_work_queue.emplace(std::move(task));
         }
         m_cv.notify_one();
         return future;
@@ -173,7 +169,7 @@
      */
     void ProcessTask() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
     {
-        std::function<void()> task;
+        std::packaged_task<void()> task;
         {
             LOCK(m_mutex);
             if (m_work_queue.empty()) return;
```
ha, that’s awesome. The queue’s std::packaged_task<void()> change broke my mind at first, because the task obviously has a different return value..
I see the trick now: the packaged_task<R()> is wrapped into another packaged_task<void()>, which forwards the call internally. That’s elegant.
in src/util/threadpool.h:139 in 3943630a4c (outdated)
ryanofsky
commented at 9:29 pm on October 10, 2025:
In commit “util: introduce general purpose thread pool” (3943630a4cea56b040377b33476803745055510c)
I’m a little unclear what this swap is trying to do. I would expect the point of swapping into a temporary vector would be to destroy the callables without holding m_mutex (since they could own resources and take time to destroy). But it looks like the empty vector is destroyed while m_mutex is still locked, so that isn’t happening. Also I don’t understand why m_mutex is locked twice in this function with a notify in between. I would expect it to look more like:
I’m a little unclear what this swap is trying to do.
Also I don’t understand why m_mutex is locked twice in this function with a notify in between
The idea behind the swap was more about ensuring the queue is empty after joining the threads, mainly to avoid any lingering future that would block callers indefinitely. That’s why I used the “sanity” wording there.
I did it after joining the threads because we’re currently waiting for all pending tasks to complete before stopping the worker loop. If we cleaned the queue prior to joining, all futures would throw a future_error - which could be fine too, but we’d probably want to throw something different, like a custom TaskInterruptedException, and adapt the submitter code to handle possible interruptions.
A bit more context, I added this sanity swap after finding #26966 (review). But it is not really necessary anymore with the latest changes. Could also just drop it.
I would expect the point of swapping into a temporary vector would be to destroy the callables without holding m_mutex (since they could own resources and take time to destroy)
That’s an oversight, yeah. Thanks!
mzumsande
commented at 9:35 pm on October 10, 2025:
contributor
Reviewed only the first two commits so far.
It could make sense to introduce the threadpool in an extra PR first, since it is very generic and can be used for multiple things, in case there are some reviewers who don’t feel comfortable ACKing the pretty involved index changes but would want to ACK the threadpool changes. (Just a suggestion, not necessary for my sake, since I happen to know the index code.)
in src/test/threadpool_tests.cpp:12 in 3943630a4c (outdated)
nit: since it’s strongly typed we don’t need to include the type name in the variable anymore
in src/util/threadpool.h:46 in 3943630a4c (outdated)
40+ *
41+ * - `Stop()` prevents further task submission and wakes all worker threads.
42+ * Workers finish processing all remaining queued tasks before exiting,
43+ * guaranteeing that no caller waits forever on a pending future.
44+ */
45+class ThreadPool {
I think it might be clearer if this commit came later in the series, after the groundwork is established. In my experience, it’s often easier to review when the problem definition is presented before the solution.
As it stands, this commit introduces code whose purpose isn’t immediately clear without reading ahead to future commits.
I’d like to echo the concern other reviewers raised about separating the thread pool from the script parallelization work. Since script parallelization is already quite complex, would you consider starting with a simpler 2-threaded implementation in this PR, and introducing the general-purpose thread pool in a separate PR?
Given the review bandwidth challenges this PR has faced over the years, breaking it into smaller, more focused pieces might help reviewers feel more confident providing ACKs.
I’d like to echo the concern other reviewers raised about separating the thread pool from the script parallelization work. Since script parallelization is already quite complex, would you consider starting with a simpler 2-threaded implementation in this PR, and introducing the general-purpose thread pool in a separate PR?
Adding 2 threads or n threads represents the exact same work. The code complexity comes from the two-step procedure that allows sequential dumps after parallel block data construction (a restriction that comes from the filters headers-chain structure). It is not related to the number of threads introduced; that is simply controlled by the thread pool introduced in the first commit.
I could split the thread pool into another PR. That would be easy to do. I’m just concerned about getting into a high-level reviewer discussion with not much substance. Such simple PRs tend to allow that. One can easily lose focus and end up discussing nuances endlessly.
andrewtoth
commented at 5:52 pm on October 15, 2025:
Such simple PRs
The thread pool, tests, and fuzz harness are 560 lines. More than half the additions to this PR. I don’t think it’s fair to say that would be a simple PR.
I don’t think the fuzz and tests are hard to grasp, but fair. I should have said that it is simple in comparison to the other changes, which require further contextual knowledge and not just multi-threading.
We can try splitting the thread pool. I’m definitely not opposed to it, just have some concerns.
The review effort would be the same for 2 or nproc threads, but the implications would be limited with the former - we cannot have e.g. complete deadlock if at most two threads are stuck on indexing. And the recent failures on CI indicate that deadlock concerns aren’t unfounded.
That’s not what happened. Tasks do not depend on each other (they actually never did); they run in isolation, so threads cannot deadlock with each other. They are not waiting for any resource held by other threads.
The issue on the CI was not related to the number of threads; they were all finishing and exiting. It would have occurred with 2 as well. I was just not accounting for the fact that tasks can finish at the same time in the new approach, so the opportunistic post-processing procedure was not triggered.
in src/index/txindex.h:33 in 7868fae75d (outdated)
@@ -29,13 +29,19 @@ class TxIndex final : public BaseIndex
29
30     BaseIndex::DB& GetDB() const override;
31
32+    std::any CustomProcessBlock(const interfaces::BlockInfo& block) override {
33+        return CustomAppend(block);
Parallel run:
2025-10-16T13:58:05Z initload thread exit
2025-10-16T13:58:35Z Syncing txindex with block chain from height 189999
2025-10-16T13:59:12Z Syncing txindex with block chain from height 211999
2025-10-16T13:59:42Z Syncing txindex with block chain from height 223999
2025-10-16T14:00:13Z Syncing txindex with block chain from height 235999
….
2025-10-16T14:16:01Z Syncing txindex with block chain from height 373999
2025-10-16T14:16:41Z Syncing txindex with block chain from height 377999
2025-10-16T14:17:37Z Syncing txindex with block chain from height 379999
2025-10-16T14:18:11Z Syncing txindex with block chain from height 381999
———
Sequential run, ~48 minutes:
2025-10-16T14:21:09Z initload thread exit
2025-10-16T14:21:40Z Syncing txindex with block chain from height 135999
2025-10-16T14:22:10Z Syncing txindex with block chain from height 157999
2025-10-16T14:22:43Z Syncing txindex with block chain from height 179999
2025-10-16T14:23:18Z Syncing txindex with block chain from height 185999
2025-10-16T14:23:49Z Syncing txindex with block chain from height 191999
….
2025-10-16T15:08:25Z Syncing txindex with block chain from height 378999
2025-10-16T15:08:56Z Syncing txindex with block chain from height 379999
2025-10-16T15:09:59Z Syncing txindex with block chain from height 381999
in src/test/threadpool_tests.cpp:22 in 3943630a4c (outdated)
17+ for (size_t i = 0; i < futures.size(); ++i) {
18+ if (futures[i].wait_for(TIMEOUT_SECS) != std::future_status::ready) {
19+ throw std::runtime_error("Timeout waiting for: " + context + ", task index " + util::ToString(i));
20+ }
21+ }
22+}
I’m finding this commit difficult to review in depth due to its complexity. Like before, I’m seeing the solution (out-of-order processing with task tracking and opportunistic post-processing) before experiencing the problem it solves.
Could we break this down into focused, smaller commits, where low-risk changes are separated from high-risk ones and the commits tell a story, so the pain is experienced before a solution is proposed?
l0rinc
commented at 8:51 pm on October 12, 2025:
contributor
I started a detailed review of this PR and I’m enthusiastic about the concept - it addresses a real problem and the potential performance gains are significant.
Strong Concept ACK!
However, I’ve encountered some concerns that make me hesitant to continue reviewing the current implementation in depth:
The txindex parallelization appears to be broken (showing 3-13% slowdowns rather than speedups), and this went unnoticed for years
The complexity of the changes makes thorough review very challenging
A shared thread pool introduces additional complexity when IO-bound and CPU-bound tasks compete for the same resources - this typically requires work-stealing mechanisms or separate pools to prevent CPU starvation.
Reproducer
Not sure how other reviewers have reproduced these results, but I needed something reliable that I can run on different platforms with different numbers of threads and storage types.
I have automated it to make it easier to replicate on different platforms, this is a variation of the script I used:
Given that this PR has struggled to attract reviewers over the past two years, I think we need a different approach.
This is a risky change that touches critical index infrastructure - let’s do it in tiny, focused steps instead.
I’d like to propose keeping this PR open as a draft tracking PR that contains the full end-to-end implementation for reference and discussion. Meanwhile, we can merge the work through a series of smaller, focused PRs:
Start with a minimal implementation: single index (blockfilterindex), hardcoded 2 threads, minimal configurability
Get that reviewed, tested, and merged - let’s see what users think
Then add the thread pool infrastructure in a separate PR (if still needed - we might find simpler per-index solutions work better)
Then add txindex and coinstatsindex (and eventually chainstate) support once we have confidence in the approach (and fix the current implementation issues).
I’m happy to help in any way I can and I don’t want to discourage you, but I strongly disagree with the current direction, so that’s an Approach NACK from me.
Breaking this into smaller, more manageable chunks would significantly improve the chances of getting this important work merged - let me know how I can help!
C++20 coroutines (with work-stealing threadpool, suspending on I/O operations) might be worth investigating for this scenario - they could provide a cleaner way to handle the async I/O patterns here. I haven't used them in C++ myself, so I'm not sure if they're a good fit, but I'd be happy to experiment with them if there's interest.
andrewtoth
commented at 1:48 pm on October 13, 2025:
contributor
I also measured the txindex with -indexworkers=15 on a 32 vcpu machine and it was slightly slower than master.
I guess this makes sense since each thread needs to write the index data to leveldb, and the actual work done by the CPU is minimal (serializing then computing offsets). So each thread is waiting for the other threads to finish writing.
For blockfilter the filter computation is more intensive, and each filter is written to a separate file; only after the filter is written is its location written to the shared leveldb.
Edit: Just tried again on latest push f13033233d3f2953f179612ac0582b653ac629da on a node synced to 913866:
branch (f13033233d3f2953f179612ac0582b653ac629da) - time was 90 minutes 41 seconds
master (becf1500131805bd6a47486cd5bc5bdb55839211) - time was 84 minutes 42 seconds
So master was ~7% faster than this PR with 5 workers on an SSD.
Maybe we don’t enable parallel sync for txindex?
Note: #30039 was merged since many of the earlier benchmarks were made. That greatly speeds up the txindex creation.
furszy force-pushed
on Oct 13, 2025
furszy
commented at 3:48 pm on October 13, 2025:
member
Haven’t read the latest comments yet (will do soon), but I’ve been working on an overhaul of the approach this weekend. I got enlightened after last week’s review.
With the latest code on Signet, I synced the block filter index up to block height 240593 on an SSD (following the testing steps in the PR description). Results:
furszy
commented at 3:06 pm on October 14, 2025:
member
I also measured the txindex with -indexworkers=15 on a 32 vcpu machine and it was slightly slower than master. I guess this makes sense since each thread needs to write the index data to leveldb, and the actual work done by the CPU is minimal (serializing then computing offsets). So each thread is waiting for the other threads to finish writing. For blockfilter the filter computation is more intensive, and then each filter is written to a separate file. Only after the filter is written the location of the filter is written to the shared leveldb.
Edit: Just tried again on latest push f130332 on a node synced to 913866:
branch (f130332) - time was 90 minutes 41 seconds master (becf150) - time was 84 minutes 42 seconds
So master was ~7% faster than this PR with 5 workers on an SSD.
Maybe we don’t enable parallel sync for txindex?
I’m still working on the current approach but synced the tx index on the same environment as before; bare hardware running on an SSD, synchronizing Signet up to block height 240593 (following the testing steps in the PR description). Results:
Parallel Mode (5 workers): ~5 minutes.
2025-10-13T19:54:55Z initload thread exit
2025-10-13T19:55:28Z Syncing txindex with block chain from height 149999
2025-10-13T19:55:59Z Syncing txindex with block chain from height 197999
2025-10-13T19:56:32Z Syncing txindex with block chain from height 224999
2025-10-13T19:57:19Z Syncing txindex with block chain from height 225999
2025-10-13T19:58:13Z Syncing txindex with block chain from height 231999
2025-10-13T19:58:57Z Syncing txindex with block chain from height 236999
2025-10-13T19:59:09Z txindex is enabled at height 240593
Sequential Mode: 13 minutes.
2025-10-13T20:00:02Z initload thread exit
2025-10-13T20:00:34Z Syncing txindex with block chain from height 64999
2025-10-13T20:01:04Z Syncing txindex with block chain from height 84999
2025-10-13T20:01:35Z Syncing txindex with block chain from height 130999
2025-10-13T20:02:05Z Syncing txindex with block chain from height 182999
2025-10-13T20:02:36Z Syncing txindex with block chain from height 192999
2025-10-13T20:03:16Z Syncing txindex with block chain from height 200999
2025-10-13T20:03:47Z Syncing txindex with block chain from height 212999
2025-10-13T20:04:20Z Syncing txindex with block chain from height 223999
2025-10-13T20:05:21Z Syncing txindex with block chain from height 226999
2025-10-13T20:05:54Z Syncing txindex with block chain from height 227999
2025-10-13T20:06:29Z Syncing txindex with block chain from height 228999
2025-10-13T20:06:59Z Syncing txindex with block chain from height 229999
2025-10-13T20:07:39Z Syncing txindex with block chain from height 230999
2025-10-13T20:08:39Z Syncing txindex with block chain from height 231999
2025-10-13T20:09:22Z Syncing txindex with block chain from height 232999
2025-10-13T20:10:09Z Syncing txindex with block chain from height 233999
2025-10-13T20:10:48Z Syncing txindex with block chain from height 234999
2025-10-13T20:11:38Z Syncing txindex with block chain from height 236999
2025-10-13T20:12:14Z Syncing txindex with block chain from height 237999
2025-10-13T20:13:02Z Syncing txindex with block chain from height 239999
2025-10-13T20:13:06Z txindex is enabled at height 240593
———
Furthermore, this is something I haven’t pushed but this is from the batching DB writes branch:
Parallel Mode (5 workers), batching db writes: ~2 minutes 30 seconds
2025-10-13T20:30:54Z initload thread exit
2025-10-13T20:31:25Z Syncing txindex with block chain from height 181999
2025-10-13T20:31:56Z Syncing txindex with block chain from height 217999
2025-10-13T20:32:29Z Syncing txindex with block chain from height 225999
2025-10-13T20:33:14Z Syncing txindex with block chain from height 230999
2025-10-13T20:33:50Z Syncing txindex with block chain from height 236999
2025-10-13T20:33:54Z txindex is enabled at height 240593
So I’d guess it’s not only related to disk access; the difference might also be tied to your CPU virtualization technology, @andrewtoth?
Moreover, I mentioned this somewhere earlier in this PR, but it probably got lost over the years: this PR also prepares the ground for batching DB writes across blocks, which will improve sequential runs as well and benefit every user.
Also, parallelization is disabled by default. I don’t think it’s worth removing it for the tx index just because some users might not be able to take advantage of it. It seems to me that the benefits for those who can run it greatly outweigh the limitations for those who can’t (they’re not losing anything with this PR, since sequential mode is still available).
furszy force-pushed
on Oct 14, 2025
furszy force-pushed
on Oct 14, 2025
l0rinc
commented at 8:17 pm on October 14, 2025:
contributor
what’s the reason for the frequent pushes here recently?
furszy
commented at 8:44 pm on October 14, 2025:
member
what’s the reason for the frequent pushes here recently?
I wrote it above, #26966 (comment). I’m working on a few changes that improve parallelism across indexes and the structure of the code; they work properly locally but are not stable on the CI yet. I will update the PR with the modifications once finished.
furszy force-pushed
on Oct 15, 2025
furszy force-pushed
on Oct 15, 2025
furszy
commented at 6:20 pm on October 15, 2025:
member
Way forward
Given that this PR has struggled to attract reviewers over the past two years, I think we need a different approach.
This is a risky change that touches critical index infrastructure - let’s do it in tiny, focused steps instead.
I’d like to propose keeping this PR open as a draft tracking PR that contains the full end-to-end implementation for reference and discussion. Meanwhile, we can merge the work through a series of smaller, focused PRs:
Start with a minimal implementation: single index (blockfilterindex), hardcoded 2 threads, minimal configurability
Get that reviewed, tested, and merged - let’s see what users think
Then add the thread pool infrastructure in a separate PR (if still needed - we might find simpler per-index solutions work better)
Then add txindex and coinstatsindex (and eventually chainstate) support once we have confidence in the approach (and fix the current implementation issues).
For the “hardcoded 2 threads” step: as I wrote above, adding 2 threads or n threads represents exactly the same work. The code complexity comes from the two-step procedure that allows sequential dumps after parallel block data construction (a restriction imposed by the filters’ headers-chain structure; the headers are linked to each other). It is not related to the number of threads introduced, which is simply controlled by the thread pool introduced in the first commit. That point simply doesn’t make sense to me.
Furthermore, describing this as “critical index infrastructure” is not entirely accurate. We are not dealing with consensus-related code here; this code lives at a different level. Existing tests already verify both its behavior and outcomes, and the previous sequential sync behavior is retained. This PR introduces a new optional feature which, in the worst-case scenario, can run slower than sequential sync if it is badly configured by the user (something that could also happen with other options, such as setting a high number of block signature verification threads).
Also, about the “Given that this PR has struggled to attract reviewers over the past two years”:
I don’t think that is fair to say at this point. This PR has gotten quite good reviews lately and momentum has been ramping up nicely. That’s mainly why I ended up modifying part of it and simplifying some workflows (an update will come in the next comment).
I think the problem in past years was mainly the lack of urgency for this new feature, plus the fact that reviewing it requires understanding not only what the indexes are and how they are composed, but also the underlying structure of the block filters chain.
I think it is just a matter of coordination between people who have worked on this code before and anyone else who wants to spend time understanding it deeply. We are happily starting to get the ball rolling in that direction now.
furszy
commented at 6:27 pm on October 15, 2025:
member
Now that the CI is happy again. Update regarding the introduced changes:
Before, we had queues at two different levels: one at the thread pool level and another at the index level. The initial rationale was to granularly control the way in which the process interrupts them when needed: not draining the index’s pending_tasks queue as soon as the first task fails to process or a shutdown is triggered. But, after talking with mzumsande last week, I realized that this was over-engineering for not much gain and was harming parallelism across indexes: top-level index queues were racing with each other to take control of the underlying worker threads.
By unifying these two queues, we let the thread pool decide how and when to handle tasks on its own, with no index interference at all. This allows for much better parallelism among any piece of code sharing the same workers (not just indexes) and also for further improvements down the road, like swapping a single spot for a lock-free structure or prioritizing certain tasks at the worker level (the latter was l0rinc’s idea from two weeks ago).
So, in short, the code should be cleaner and run faster now, using fewer mutexes. There is also a possible follow-up improvement: moving the final post-processing round to the last task being executed (in case more than one task finishes at the same time near the end of sync). I didn’t do it here because it comes with extra code complexity that I don’t think is worth adding to this PR; for the same reason, I don’t think parallelizing the coinstats index here is a good idea, nor batching DB writes. Better to go in steps, as the gains introduced here are already a solid step forward.
furszy force-pushed
on Oct 17, 2025
furszy
commented at 3:42 pm on October 21, 2025:
member
Small update: per discussion, I will push the thread pool in an isolated PR in the coming week, replacing the current http server WorkQueue and all its related code. It is an easy dedup that adds test coverage to an area that was never unit- or fuzz-tested, and it gives the repo a first user of the class so that others can adopt it.
Also, I’m investigating the tx index slowdown with @andrewtoth, trying to understand the differences between our setups. Worst-case scenario, we can disable parallel sync for it (will update accordingly in the coming week).
That being said, this refactoring enables batching DB writes as well, which will speed up the tx index anyway. But I prefer to do that in a follow-up to avoid expanding this PR further.
Last note: I reworked the large commit, splitting it into smaller chunks. We can come back to this one after (or while) the http server thread-handling PR moves forward. But I will focus on moving that one first, per feedback.
index: split block processing into process and post-process phases
No behavior changes.
This separation lays the groundwork for parallelizing the block data
digest phase, while preserving sequential consistency for indexes
that require it before dumping the data to disk.
And it will also allow us to batch database and file writes across
multiple blocks.
Essentially, it introduces a two-phase block processing mechanism using
'CustomProcessBlock()' and 'CustomPostProcessBlocks()':
1) 'CustomProcessBlock()' will be in charge of digesting block data and
returning a custom result object per index.
The goal is to encapsulate all procedures that can safely run
concurrently here.
2) 'CustomPostProcessBlocks()' receives all digested objects in order,
allowing the index to perform any subsequent steps that require sequentiality.
For example, the block filter index requires sequentiality in the headers-chain
construction, as each record depends on its predecessor hash.
Important note: at present, the code executes entirely synchronously.
b8048e8c0a
index: add support for processing batches of blocks
No behavior changes. The index still processes one block at a time.
The code was refactored to allow handling multiple blocks in the
processing phase.
Introducing the concept of a "Task" which merely represents a range
of blocks and their resulting digests.
In the next commit, we will take advantage of this to process batches
of blocks concurrently.
5eaf553d04
Index: introduce SyncContext - move logging & last locator timers
Move progress related timers (logging and locator write) into a shared
SyncContext struct with atomic members. This does not change behavior
but prepares the code for safe multi-threaded execution in the next
commit.
275d0aee75
index: encapsulate Task processing code into RunTask
No-behavior changes.
This refactor moves the existing processing logic from BaseIndex::Sync
into BaseIndex::RunTask, simplifying the main sync loop and laying
groundwork for parallelization of block processing in the upcoming commit.
The goal is to have BaseIndex::RunTask running concurrently.
66237215d1
util: introduce general purpose thread pool
83db5cd4ab
init: provide thread pool to indexes
And add option to customize thread pool workers count
4525317af8
index: implement index parallel sync
This introduces parallel sync for the indexes initial sync,
distributing the workload across multiple threads.
When enabled, the chain is divided into fixed-size block ranges called
"tasks". Worker threads consume tasks concurrently, calling
CustomProcessBlock over their assigned range, storing results in a
shared context while opportunistically batching and dumping the collected
information to disk sequentially (when needed).
Since large reorgs are improbable during initial sync (headers-chain PoW
dictates the valid chain), reorgs are detected only before syncing begins
and once it completes. Any new blocks connected during the process are
caught up sequentially at the end. This, together with the fact that we
no longer depend on an intermediate "index best block" that might be out
of sync with the index m_best_block_index, allows us to remove the
index_reorg_crash test, as it is no longer possible to call Rewind on a
block that is not the index tip.
72562c5163
index: enable block filter index parallel sync
It also adds coverage for initial sync from a particular block,
mimicking a node restart.
This is a metadata mirror of the GitHub repository
bitcoin/bitcoin.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2025-10-31 15:13 UTC
This site is hosted by @0xB10C More mirrored repositories can be found on mirror.b10c.me