This PR:
- makes
CCheckQueue
RAII-styled - gets rid of the global
scriptcheckqueue
- fixes #25448
The previous attempt was in #18731.
CCheckQueue
RAII-styled (attempt 2)
#26762
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For detailed information about the code coverage, see the test coverage report.
See the guideline for information on the review process.
Type | Reviewers |
---|---|
ACK | TheCharlatan, martinus, achow101 |
Approach ACK | w0xlt |
Stale ACK | ajtowns |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
174@@ -187,24 +175,16 @@ class CCheckQueue
175 }
176 }
177
178- //! Stop all of the worker threads.
179- void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
180+ ~CCheckQueue()
181 {
182 WITH_LOCK(m_mutex, m_request_stop = true);
243@@ -244,7 +244,6 @@ void Shutdown(NodeContext& node)
244 // CScheduler/checkqueue, scheduler and load block thread.
245 if (node.scheduler) node.scheduler->stop();
246 if (node.chainman && node.chainman->m_load_block.joinable()) node.chainman->m_load_block.join();
247- StopScriptCheckWorkerThreads();
I don’t think this change makes anything worse, but Shutdown()
is pretty fragile.
133@@ -134,22 +134,10 @@ class CCheckQueue
134 Mutex m_control_mutex;
135
136 //! Create a new check queue
137- explicit CCheckQueue(unsigned int nBatchSizeIn)
138- : nBatchSize(nBatchSizeIn)
FWIW, I would probably have broken this up into multiple commits:
scriptcheckqueue
into a class (and move its params out of init.cpp)StopWorkerThreads()
calls to wherever the queue goes out of scope (~ChainStateManager
?)worker_threads_num
and just calls StartWorkerThreads()
StartWorkerThreads
with the new constructorStopWorkerThreads()
from the destructor only1491@@ -1492,6 +1492,7 @@ Chainstate::Chainstate(
1492 ChainstateManager& chainman,
1493 std::optional<uint256> from_snapshot_blockhash)
1494 : m_mempool(mempool),
1495+ m_script_check_queue{std::make_unique<CCheckQueue<CScriptCheck>>(/*batch_size=*/128, chainman.m_options.worker_threads_num)},
m_script_check_queue
be part of ChainstateManager
instead, so it remains a single global queue?
142@@ -143,6 +143,9 @@ class CCheckQueue
143 }
144 }
145
146+ CCheckQueue(const CCheckQueue&) = delete;
147+ CCheckQueue& operator=(const CCheckQueue&) = delete;
Updated 2d248d6b80147e5f9f62f4c543d4dbb10c0d95c7 -> 5a19c3962c8702ba3b17f986524e1eb5de049c0e (pr26762.01 -> pr26762.02, diff):
40+ if (par_value <= 0) {
41+ // -par=0 means autodetect (number of cores - 1 script threads)
42+ // -par=-n means "leave n cores free" (number of cores - n - 1 script threads)
43+ par_value += GetNumCores();
44+ }
45+ opts.worker_threads_num = std::clamp(par_value - 1, 0, MAX_SCRIPTCHECK_THREADS);
- 1
5299@@ -5307,10 +5300,17 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts)
5300 return std::move(opts);
5301 }
5302
5303-ChainstateManager::ChainstateManager(Options options) : m_options{Flatten(std::move(options))} {}
5304+ChainstateManager::ChainstateManager(Options options)
5305+ : m_script_check_queue{/*nBatchSizeIn=*/128},
5306+ m_options{Flatten(std::move(options))}
5307+{
5308+ m_script_check_queue.StartWorkerThreads(m_options.worker_threads_num);
worker_threads_num > 0
(since StartScriptCheckWorkerThreads
would not be called unless script_threads >= 1
). I think this is okay, since in that case the function only sets some variables that will never be used and skips over a loop.
CCheckQueue
constructor start worker threads” commit.
utACK f0209248e1baea41aae72147c973f1f5aa27c97b
Seems reasonable to me. Not 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now…
Updated f0209248e1baea41aae72147c973f1f5aa27c97b -> 5a7932f395c675fad332cbcd0498bb9fefcb33e0 (pr26762.03 -> pr26762.04, diff):
kernel::ChainstateManagerOpts::worker_threads_num
Not 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now…
Actually, it is how the #25448 has been fixed now.
add a default initializer for
kernel::ChainstateManagerOpts::worker_threads_num
Yikes. Work on adding cppcoreguidelines-pro-type-member-init
` to clang-tidy maybe?
reACK 5a7932f395c675fad332cbcd0498bb9fefcb33e0
35@@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains
36
37 if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value};
38
39+ int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
max_tip_age
and also move the MAX_SCRIPTCHECK_THREADS
to chainstatemanager_opts.h
.
This is not consistent with how the other options are handled. This overrides the value of the passed in option even if no argument was passed in. I would handle this the same way as
max_tip_age
…
I’m not sure about this change considering this PR scope. The semantic of the “-par” and “-maxtipage” options are quite different.
… and also move the
MAX_SCRIPTCHECK_THREADS
tochainstatemanager_opts.h
.
Thanks! Updated.
35@@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains
36
37 if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value};
38
39+ int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
40+ if (par_value <= 0) {
par_value
were still called script_threads
. I also find script_threads
a better name than par_value
.
Yikes. Work on adding
cppcoreguidelines-pro-type-member-init
` to clang-tidy maybe?
Well, it fires too many false warnings for our code base (for example, for CTxOut::CTxOut()
constructor).
Updated 5a7932f395c675fad332cbcd0498bb9fefcb33e0 -> f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 (pr26762.04 -> pr26762.05):
173@@ -181,24 +174,16 @@ class CCheckQueue
174 }
175 }
176
177- //! Stop all of the worker threads.
178- void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex)
179+ ~CCheckQueue()
180 {
181- WITH_LOCK(m_mutex, m_request_stop = true);
182+ m_request_stop = true;
I think that removing the lock here could cause a deadlock.
m_request_stop
is set to true and then notify_all()
is called; but when there is a worker thread currently standing inside the while (queue.empty() && !m_request_stop) {
loop but before the cond.wait(lock);
it will not receive the notification and then proceed to wait in the cond
indefinitely.
That deadlock was not possible before when the m_mutex
had to be acquired to change m_request_stop
because then it was guaranteed that all threads were either waiting on the condition variable or waiting outside that loop to get the lock
memory_order_seq_cst
memory order for std::atomic<bool> m_request_stop
guarantee the correct behaviour?
The only thing guaranteed is the order; so m_request_stop
will be set to true
before notify_all
is called. But there is no guarantee where the other threads currently are because there is now no lock anymore. E.g. a worker thread could be currently here:
0while (queue.empty() && !m_request_stop) {
1 if (fMaster && nTodo == 0) { // <-------- Worker thread waiting here due to context switch
2 nTotal--;
3 bool fRet = fAllOk;
4 // reset the status for new work later
5 fAllOk = true;
6 // return the current status
7 return fRet;
8 }
9 nIdle++;
10 cond.wait(lock); // wait
11 nIdle--;
12}
Now while the worker thread is not running for whatever reason (context switch, high CPU load, …) the main thread runs, calls m_request_stop = true
, then m_worker_cv.notify_all();
, and then hangs in the t.join()
loop.
Eventually the worker thread will get its share of the CPU and continues running, and will run right into the cond.wait(lock);
without checking m_request_stop
. There it won’t receive the notification any more because by now it’s too late; notifications have already been sent. So the worker thread will now wait.
The result is that the master hangs in the join()
loop, and the worker waits for notification => deadlock.
145- nTotal = 0;
146- fAllOk = true;
147- }
148- assert(m_worker_threads.empty());
149- for (int n = 0; n < threads_num; ++n) {
150+ for (int n = 0; n < worker_threads_num; ++n) {
m_worker_threads.reserve(worker_threads_num);
before the loop
140@@ -155,6 +141,13 @@ class CCheckQueue
141 }
142 }
143
144+ // Since this class manages its own resources, which is a thread
145+ // pool `m_worker_threads`, copy operations are not appropriate.
146+ CCheckQueue(const CCheckQueue&) = delete;
147+ CCheckQueue& operator=(const CCheckQueue&) = delete;
148+ CCheckQueue(CCheckQueue&&) = default;
= delete
. I think std::mutex
is not movable so these will be deleted anyways, but it feels safer to just mark them as deleted too. There will be threads running throughout the lifetime of CCheckQueue
, so moving the object while something might be going on or threads waiting on a lock wouldn’t be safe
Updated f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 -> d462e3da7fc6be75269e88928fd80fc98c405474 (pr26762.05 -> pr26762.07, diff):
249@@ -250,7 +250,6 @@ int main(int argc, char* argv[])
250 // dereferencing and UB.
251 scheduler.stop();
nit, just a thought: maybe it would make sense to have a method like signalStop()
or similar in CCheckQueue
that only sets m_request_stop
to true, and call this as soon as we know we don’t need the queue’s result any more. That way the threads will stop processing and stop wasting CPU while shutting down.
On the other hand the queue will usually pretty quickly stop working unless it has lots of work items, so I guess in most cases this wouldn’t make any difference.
-assumevalid=0
so script checks are done, stopping the process a few times with Ctrl+C, all good; shutdown is still quick. -par
still works too. Also did a rebase with --exec "make -j18 check"
, so each commit individually builds & all tests work.
make check
16@@ -17,9 +17,9 @@
17
18 #include <common/system.h>
19 #include <interfaces/node.h>
20+#include <node/chainstatemanager_args.h>
352@@ -362,7 +353,7 @@ BOOST_AUTO_TEST_CASE(test_CheckQueue_FrozenCleanup)
353 /** Test that CCheckQueueControl is threadsafe */
354 BOOST_AUTO_TEST_CASE(test_CheckQueueControl_Locks)
355 {
356- auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE);
357+ auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE, SCRIPT_CHECK_THREADS);
16@@ -17,6 +17,7 @@
17 #include <mapport.h>
18 #include <net.h>
19 #include <netbase.h>
20+#include <node/chainstatemanager_args.h>
21 #include <txdb.h> // for -dbcache defaults
22 #include <util/string.h>
23 #include <validation.h> // For DEFAULT_SCRIPTCHECK_THREADS