This PR:
- makes
CCheckQueueRAII-styled - gets rid of the global
scriptcheckqueue - fixes #25448
The previous attempt was in #18731.
CCheckQueue RAII-styled (attempt 2)
#26762
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For detailed information about the code coverage, see the test coverage report.
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | TheCharlatan, martinus, achow101 |
| Approach ACK | w0xlt |
| Stale ACK | ajtowns |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
174@@ -187,24 +175,16 @@ class CCheckQueue
175 }
176 }
177
178- //! Stop all of the worker threads.
179- void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
180+ ~CCheckQueue()
181 {
182 WITH_LOCK(m_mutex, m_request_stop = true);
243@@ -244,7 +244,6 @@ void Shutdown(NodeContext& node)
244 // CScheduler/checkqueue, scheduler and load block thread.
245 if (node.scheduler) node.scheduler->stop();
246 if (node.chainman && node.chainman->m_load_block.joinable()) node.chainman->m_load_block.join();
247- StopScriptCheckWorkerThreads();
I don’t think this change makes anything worse, but Shutdown() is pretty fragile.
133@@ -134,22 +134,10 @@ class CCheckQueue
134 Mutex m_control_mutex;
135
136 //! Create a new check queue
137- explicit CCheckQueue(unsigned int nBatchSizeIn)
138- : nBatchSize(nBatchSizeIn)
FWIW, I would probably have broken this up into multiple commits:
scriptcheckqueue into a class (and move its params out of init.cpp)StopWorkerThreads() calls to wherever the queue goes out of scope (~ChainStateManager?)worker_threads_num and just calls StartWorkerThreads()StartWorkerThreads with the new constructorStopWorkerThreads() from the destructor only1491@@ -1492,6 +1492,7 @@ Chainstate::Chainstate(
1492 ChainstateManager& chainman,
1493 std::optional<uint256> from_snapshot_blockhash)
1494 : m_mempool(mempool),
1495+ m_script_check_queue{std::make_unique<CCheckQueue<CScriptCheck>>(/*batch_size=*/128, chainman.m_options.worker_threads_num)},
m_script_check_queue be part of ChainstateManager instead, so it remains a single global queue?
142@@ -143,6 +143,9 @@ class CCheckQueue
143 }
144 }
145
146+ CCheckQueue(const CCheckQueue&) = delete;
147+ CCheckQueue& operator=(const CCheckQueue&) = delete;
Updated 2d248d6b80147e5f9f62f4c543d4dbb10c0d95c7 -> 5a19c3962c8702ba3b17f986524e1eb5de049c0e (pr26762.01 -> pr26762.02, diff):
40+ if (par_value <= 0) {
41+ // -par=0 means autodetect (number of cores - 1 script threads)
42+ // -par=-n means "leave n cores free" (number of cores - n - 1 script threads)
43+ par_value += GetNumCores();
44+ }
45+ opts.worker_threads_num = std::clamp(par_value - 1, 0, MAX_SCRIPTCHECK_THREADS);
- 1
5299@@ -5307,10 +5300,17 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
5300 return std::move(opts);
5301 }
5302
5303-ChainstateManager::ChainstateManager(Options options) : m_options{Flatten(std::move(options))} {}
5304+ChainstateManager::ChainstateManager(Options options)
5305+ : m_script_check_queue{/*nBatchSizeIn=*/128},
5306+ m_options{Flatten(std::move(options))}
5307+{
5308+ m_script_check_queue.StartWorkerThreads(m_options.worker_threads_num);
worker_threads_num > 0 (since StartScriptCheckWorkerThreads would not be called unless script_threads >= 1). I think this is okay, since in that case the function only sets some variables that will never be used and skips over a loop.
CCheckQueue constructor start worker threads” commit.
utACK f0209248e1baea41aae72147c973f1f5aa27c97b
Seems reasonable to me. Not 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now…
Updated f0209248e1baea41aae72147c973f1f5aa27c97b -> 5a7932f395c675fad332cbcd0498bb9fefcb33e0 (pr26762.03 -> pr26762.04, diff):
kernel::ChainstateManagerOpts::worker_threads_numNot 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now…
Actually, it is how the #25448 has been fixed now.
add a default initializer for
kernel::ChainstateManagerOpts::worker_threads_num
Yikes. Work on adding cppcoreguidelines-pro-type-member-init` to clang-tidy maybe?
reACK 5a7932f395c675fad332cbcd0498bb9fefcb33e0
35@@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains
36
37 if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value};
38
39+ int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
max_tip_age and also move the MAX_SCRIPTCHECK_THREADS to chainstatemanager_opts.h.
This is not consistent with how the other options are handled. This overrides the value of the passed in option even if no argument was passed in. I would handle this the same way as
max_tip_age…
I’m not sure about this change considering this PR scope. The semantic of the “-par” and “-maxtipage” options are quite different.
… and also move the
MAX_SCRIPTCHECK_THREADStochainstatemanager_opts.h.
Thanks! Updated.
35@@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains
36
37 if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value};
38
39+ int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
40+ if (par_value <= 0) {
par_value were still called script_threads. I also find script_threads a better name than par_value.
Yikes. Work on adding
cppcoreguidelines-pro-type-member-init` to clang-tidy maybe?
Well, it fires too many false warnings for our code base (for example, for CTxOut::CTxOut() constructor).
Updated 5a7932f395c675fad332cbcd0498bb9fefcb33e0 -> f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 (pr26762.04 -> pr26762.05):
173@@ -181,24 +174,16 @@ class CCheckQueue
174 }
175 }
176
177- //! Stop all of the worker threads.
178- void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
179+ ~CCheckQueue()
180 {
181- WITH_LOCK(m_mutex, m_request_stop = true);
182+ m_request_stop = true;
I think that removing the lock here could cause a deadlock.
m_request_stop is set to true and then notify_all() is called; but when there is a worker thread currently standing inside the while (queue.empty() && !m_request_stop) { loop but before the cond.wait(lock); it will not receive the notification and then proceed to wait in the cond indefinitely.
That deadlock was not possible before when the m_mutex had to be acquired to change m_request_stop because then it was guaranteed that all threads were either waiting on the condition variable or waiting outside that loop to get the lock
memory_order_seq_cst memory order for std::atomic<bool> m_request_stop guarantee the correct behaviour?
The only thing guaranteed is the order; so m_request_stop will be set to true before notify_all is called. But there is no guarantee where the other threads currently are because there is now no lock anymore. E.g. a worker thread could be currently here:
0while (queue.empty() && !m_request_stop) {
1 if (fMaster && nTodo == 0) { // <-------- Worker thread waiting here due to context switch
2 nTotal--;
3 bool fRet = fAllOk;
4 // reset the status for new work later
5 fAllOk = true;
6 // return the current status
7 return fRet;
8 }
9 nIdle++;
10 cond.wait(lock); // wait
11 nIdle--;
12}
Now while the worker thread is not running for whatever reason (context switch, high CPU load, …) the main thread runs, calls m_request_stop = true, then m_worker_cv.notify_all();, and then hangs in the t.join() loop.
Eventually the worker thread will get its share of the CPU and continues running, and will run right into the cond.wait(lock); without checking m_request_stop. There it won’t receive the notification any more because by now it’s too late; notifications have already been sent. So the worker thread will now wait.
The result is that the master hangs in the join() loop, and the worker waits for notification => deadlock.
145- nTotal = 0;
146- fAllOk = true;
147- }
148- assert(m_worker_threads.empty());
149- for (int n = 0; n < threads_num; ++n) {
150+ for (int n = 0; n < worker_threads_num; ++n) {
m_worker_threads.reserve(worker_threads_num); before the loop
140@@ -155,6 +141,13 @@ class CCheckQueue
141 }
142 }
143
144+ // Since this class manages its own resources, which is a thread
145+ // pool `m_worker_threads`, copy operations are not appropriate.
146+ CCheckQueue(const CCheckQueue&) = delete;
147+ CCheckQueue& operator=(const CCheckQueue&) = delete;
148+ CCheckQueue(CCheckQueue&&) = default;
= delete. I think std::mutex is not movable so these will be deleted anyways, but it feels safer to just mark them as deleted too. There will be threads running throughout the lifetime of CCheckQueue, so moving the object while something might be going on or threads waiting on a lock wouldn’t be safe
Updated f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 -> d462e3da7fc6be75269e88928fd80fc98c405474 (pr26762.05 -> pr26762.07, diff):
249@@ -250,7 +250,6 @@ int main(int argc, char* argv[])
250 // dereferencing and UB.
251 scheduler.stop();
nit, just a thought: maybe it would make sense to have a method like signalStop() or similar in CCheckQueue that only sets m_request_stop to true, and call this as soon as we know we don’t need the queue’s result any more. That way the threads will stop processing and stop wasting CPU while shutting down.
On the other hand the queue will usually pretty quickly stop working unless it has lots of work items, so I guess in most cases this wouldn’t make any difference.
-assumevalid=0 so script checks are done, stopping the process a few times with Ctrl+C, all good; shutdown is still quick. -par still works too. Also did a rebase with --exec "make -j18 check", so each commit individually builds & all tests work.
make check
16@@ -17,9 +17,9 @@
17
18 #include <common/system.h>
19 #include <interfaces/node.h>
20+#include <node/chainstatemanager_args.h>
352@@ -362,7 +353,7 @@ BOOST_AUTO_TEST_CASE(test_CheckQueue_FrozenCleanup)
353 /** Test that CCheckQueueControl is threadsafe */
354 BOOST_AUTO_TEST_CASE(test_CheckQueueControl_Locks)
355 {
356- auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE);
357+ auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE, SCRIPT_CHECK_THREADS);
16@@ -17,6 +17,7 @@
17 #include <mapport.h>
18 #include <net.h>
19 #include <netbase.h>
20+#include <node/chainstatemanager_args.h>
21 #include <txdb.h> // for -dbcache defaults
22 #include <util/string.h>
23 #include <validation.h> // For DEFAULT_SCRIPTCHECK_THREADS