This PR:
- makes
CCheckQueueRAII-styled - gets rid of the global
scriptcheckqueue - fixes #25448
The previous attempt was in #18731.
<!--e57a25ab6845829454e8d69fc972939a-->
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
<!--006a51241073e994b41acfe9ec718e94-->
For detailed information about the code coverage, see the test coverage report.
<!--021abf342d371248e50ceaed478a90ca-->
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | TheCharlatan, martinus, achow101 |
| Approach ACK | w0xlt |
| Stale ACK | ajtowns |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
<!--174a7506f384e20aa4161008e828411d-->
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
Approach ACK
174 | @@ -187,24 +175,16 @@ class CCheckQueue 175 | } 176 | } 177 | 178 | - //! Stop all of the worker threads. 179 | - void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex) 180 | + ~CCheckQueue() 181 | { 182 | WITH_LOCK(m_mutex, m_request_stop = true);
clang turns off thread safety analysis in constructors and destructors, so I think if locking is needed it would be better to keep a separate (private?) function, that the destructor calls when necessary to maximise compile time checking.
243 | @@ -244,7 +244,6 @@ void Shutdown(NodeContext& node) 244 | // CScheduler/checkqueue, scheduler and load block thread. 245 | if (node.scheduler) node.scheduler->stop(); 246 | if (node.chainman && node.chainman->m_load_block.joinable()) node.chainman->m_load_block.join(); 247 | - StopScriptCheckWorkerThreads();
I don't think this change makes anything worse, but Shutdown() is pretty fragile.
133 | @@ -134,22 +134,10 @@ class CCheckQueue 134 | Mutex m_control_mutex; 135 | 136 | //! Create a new check queue 137 | - explicit CCheckQueue(unsigned int nBatchSizeIn) 138 | - : nBatchSize(nBatchSizeIn)
FWIW, I would probably have broken this up into multiple commits:
scriptcheckqueue into a class (and move its params out of init.cpp)StopWorkerThreads() calls to wherever the queue goes out of scope (~ChainStateManager?)worker_threads_num and just calls StartWorkerThreads()StartWorkerThreads with the new constructorStopWorkerThreads() from the destructor only1491 | @@ -1492,6 +1492,7 @@ Chainstate::Chainstate( 1492 | ChainstateManager& chainman, 1493 | std::optional<uint256> from_snapshot_blockhash) 1494 | : m_mempool(mempool), 1495 | + m_script_check_queue{std::make_unique<CCheckQueue<CScriptCheck>>(/*batch_size=*/128, chainman.m_options.worker_threads_num)},
This seems to be doubling the number of script check workers if the assumeutxo snapshot chainstate is allocated; shouldn't m_script_check_queue be part of ChainstateManager instead, so it remains a single global queue?
142 | @@ -143,6 +143,9 @@ class CCheckQueue 143 | } 144 | } 145 | 146 | + CCheckQueue(const CCheckQueue&) = delete; 147 | + CCheckQueue& operator=(const CCheckQueue&) = delete;
Should apply rule-of-5 and default the move operations, no? (EDIT: also, move the explanation of why these are deleted from the commit message into the file itself?)
Concept ACK
Updated 2d248d6b80147e5f9f62f4c543d4dbb10c0d95c7 -> 5a19c3962c8702ba3b17f986524e1eb5de049c0e (pr26762.01 -> pr26762.02, diff):
Rebased 5a19c3962c8702ba3b17f986524e1eb5de049c0e -> f0209248e1baea41aae72147c973f1f5aa27c97b (pr26762.02 -> pr26762.03) due to the conflict with #26251.
40 | + if (par_value <= 0) { 41 | + // -par=0 means autodetect (number of cores - 1 script threads) 42 | + // -par=-n means "leave n cores free" (number of cores - n - 1 script threads) 43 | + par_value += GetNumCores(); 44 | + } 45 | + opts.worker_threads_num = std::clamp(par_value - 1, 0, MAX_SCRIPTCHECK_THREADS);
Missing the comment explaining the - 1
5299 | @@ -5307,10 +5300,17 @@ static ChainstateManager::Options&& Flatten(ChainstateManager::Options&& opts) 5300 | return std::move(opts); 5301 | } 5302 | 5303 | -ChainstateManager::ChainstateManager(Options options) : m_options{Flatten(std::move(options))} {} 5304 | +ChainstateManager::ChainstateManager(Options options) 5305 | + : m_script_check_queue{/*nBatchSizeIn=*/128}, 5306 | + m_options{Flatten(std::move(options))} 5307 | +{ 5308 | + m_script_check_queue.StartWorkerThreads(m_options.worker_threads_num);
Previously this was conditional on worker_threads_num > 0 (since StartScriptCheckWorkerThreads would not be called unless script_threads >= 1). I think this is okay, since in that case the function only sets some variables that will never be used and skips over a loop.
This code will gone in the "refactor: Make CCheckQueue constructor start worker threads" commit.
utACK f0209248e1baea41aae72147c973f1f5aa27c97b
Seems reasonable to me. Not 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now...
Updated f0209248e1baea41aae72147c973f1f5aa27c97b -> 5a7932f395c675fad332cbcd0498bb9fefcb33e0 (pr26762.03 -> pr26762.04, diff):
kernel::ChainstateManagerOpts::worker_threads_numNot 100% confident over delaying stopping the threads until the destructor; but also about equally unsure about how it works now...
Actually, it is how the #25448 has been fixed now.
add a default initializer for
kernel::ChainstateManagerOpts::worker_threads_num
Yikes. Work on adding cppcoreguidelines-pro-type-member-init` to clang-tidy maybe?
reACK 5a7932f395c675fad332cbcd0498bb9fefcb33e0
35 | @@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains 36 | 37 | if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value}; 38 | 39 | + int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS);
This is not consistent with how the other options are handled. This overrides the value of the passed in option even if no argument was passed in. I would handle this the same way as max_tip_age and also move the MAX_SCRIPTCHECK_THREADS to chainstatemanager_opts.h.
This is not consistent with how the other options are handled. This overrides the value of the passed in option even if no argument was passed in. I would handle this the same way as
max_tip_age...
I'm not sure about this change considering this PR scope. The semantic of the "-par" and "-maxtipage" options are quite different.
... and also move the
MAX_SCRIPTCHECK_THREADStochainstatemanager_opts.h.
Thanks! Updated.
35 | @@ -34,6 +36,16 @@ std::optional<bilingual_str> ApplyArgsManOptions(const ArgsManager& args, Chains 36 | 37 | if (auto value{args.GetIntArg("-maxtipage")}) opts.max_tip_age = std::chrono::seconds{*value}; 38 | 39 | + int par_value = args.GetIntArg("-par", DEFAULT_SCRIPTCHECK_THREADS); 40 | + if (par_value <= 0) {
Nit: This diff would be more move only if par_value were still called script_threads. I also find script_threads a better name than par_value.
Concept ACK
Yikes. Work on adding
cppcoreguidelines-pro-type-member-init` to clang-tidy maybe?
Well, it fires too many false warnings for our code base (for example, for CTxOut::CTxOut() constructor).
Updated 5a7932f395c675fad332cbcd0498bb9fefcb33e0 -> f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 (pr26762.04 -> pr26762.05):
173 | @@ -181,24 +174,16 @@ class CCheckQueue 174 | } 175 | } 176 | 177 | - //! Stop all of the worker threads. 178 | - void StopWorkerThreads() EXCLUSIVE_LOCKS_REQUIRED(!m_mutex) 179 | + ~CCheckQueue() 180 | { 181 | - WITH_LOCK(m_mutex, m_request_stop = true); 182 | + m_request_stop = true;
I think that removing the lock here could cause a deadlock.
m_request_stop is set to true and then notify_all() is called; but when there is a worker thread currently standing inside the while (queue.empty() && !m_request_stop) { loop but before the cond.wait(lock); it will not receive the notification and then proceed to wait in the cond indefinitely.
That deadlock was not possible before when the m_mutex had to be acquired to change m_request_stop because then it was guaranteed that all threads were either waiting on the condition variable or waiting outside that loop to get the lock
Doesn't the default memory_order_seq_cst memory order for std::atomic<bool> m_request_stop guarantee the correct behaviour?
The only thing guaranteed is the order; so m_request_stop will be set to true before notify_all is called. But there is no guarantee where the other threads currently are because there is now no lock anymore. E.g. a worker thread could be currently here:
while (queue.empty() && !m_request_stop) {
if (fMaster && nTodo == 0) { // <-------- Worker thread waiting here due to context switch
nTotal--;
bool fRet = fAllOk;
// reset the status for new work later
fAllOk = true;
// return the current status
return fRet;
}
nIdle++;
cond.wait(lock); // wait
nIdle--;
}
Now while the worker thread is not running for whatever reason (context switch, high CPU load, ...) the main thread runs, calls m_request_stop = true, then m_worker_cv.notify_all();, and then hangs in the t.join() loop.
Eventually the worker thread will get its share of the CPU and continues running, and will run right into the cond.wait(lock); without checking m_request_stop. There it won't receive the notification any more because by now it's too late; notifications have already been sent. So the worker thread will now wait.
The result is that the master hangs in the join() loop, and the worker waits for notification => deadlock.
145 | - nTotal = 0; 146 | - fAllOk = true; 147 | - } 148 | - assert(m_worker_threads.empty()); 149 | - for (int n = 0; n < threads_num; ++n) { 150 | + for (int n = 0; n < worker_threads_num; ++n) {
nit: might want to add m_worker_threads.reserve(worker_threads_num); before the loop
140 | @@ -155,6 +141,13 @@ class CCheckQueue 141 | } 142 | } 143 | 144 | + // Since this class manages its own resources, which is a thread 145 | + // pool `m_worker_threads`, copy operations are not appropriate. 146 | + CCheckQueue(const CCheckQueue&) = delete; 147 | + CCheckQueue& operator=(const CCheckQueue&) = delete; 148 | + CCheckQueue(CCheckQueue&&) = default;
I'd explicitly set the move operations to = delete. I think std::mutex is not movable so these will be deleted anyways, but it feels safer to just mark them as deleted too. There will be threads running throughout the lifetime of CCheckQueue, so moving the object while something might be going on or threads waiting on a lock wouldn't be safe
Updated f1370b2c1586f7fe487d9f17ee53bcd9b87a9f23 -> d462e3da7fc6be75269e88928fd80fc98c405474 (pr26762.05 -> pr26762.07, diff):
249 | @@ -250,7 +250,6 @@ int main(int argc, char* argv[])
250 | // dereferencing and UB.
251 | scheduler.stop();
nit, just a thought: maybe it would make sense to have a method like signalStop() or similar in CCheckQueue that only sets m_request_stop to true, and call this as soon as we know we don't need the queue's result any more. That way the threads will stop processing and stop wasting CPU while shutting down.
On the other hand the queue will usually pretty quickly stop working unless it has lots of work items, so I guess in most cases this wouldn't make any difference.
Code review ACK d462e3da7fc6be75269e88928fd80fc98c405474, as far as I can say now deadlocks are not possible any more when shutting down. I started an initial sync with -assumevalid=0 so script checks are done, stopping the process a few times with Ctrl+C, all good; shutdown is still quick. -par still works too. Also did a rebase with --exec "make -j18 check", so each commit individually builds & all tests work.
Rebased d462e3da7fc6be75269e88928fd80fc98c405474 -> 52129e335cbc68ac5d863f283f7d1a328ce79581 (pr26762.07 -> pr26762.08) due to the conflict with #27419.
re-ack 52129e335cbc68ac5d863f283f7d1a328ce79581
Rebased 52129e335cbc68ac5d863f283f7d1a328ce79581 -> 2076d846cc917cbafe61937a99b7867067011341 (pr26762.08 -> pr26762.09) due to the conflict with #25977.
Rebased 2076d846cc917cbafe61937a99b7867067011341 -> b023c26eb8eda4c7f80ad2e7ebe1fb046e87d2ee (pr26762.09 -> pr26762.10) due to the conflict with #27636.
Rebased b023c26eb8eda4c7f80ad2e7ebe1fb046e87d2ee -> e89273e80765219ff545ee5c26a257aef7d69f9c (pr26762.10 -> pr26762.11) due to the conflict with #27576.
Rebased e89273e80765219ff545ee5c26a257aef7d69f9c -> b7ebc999ee373f0429939ac0bdb8e38fb37404aa (pr26762.11 -> pr26762.12) due to the conflict with #27861.
Rebased b7ebc999ee373f0429939ac0bdb8e38fb37404aa -> 6e079015707188ac15405662ce396dd837da569a (pr26762.12 -> pr26762.13) due to the conflict with #27607.
Rebased 6e079015707188ac15405662ce396dd837da569a -> 2235765f7fd946d3b08fd9e109584967407e46d1 (pr26762.13 -> pr26762.14) due to the conflict with #28048.
code review ACK 2235765f7fd946d3b08fd9e109584967407e46d1, also rebased & built & tested each individual commit with make check
Rebased 2235765f7fd946d3b08fd9e109584967407e46d1 -> 5b3ea5fa2e7f6dc1c9161ed8b74c9be4bd1e92dd (pr26762.14 -> pr26762.15) due to the conflict with #27596.
16 | @@ -17,9 +17,9 @@ 17 | 18 | #include <common/system.h> 19 | #include <interfaces/node.h> 20 | +#include <node/chainstatemanager_args.h>
Nit: fix include order.
ACK 5b3ea5fa2e7f6dc1c9161ed8b74c9be4bd1e92dd
ACK 5b3ea5fa2e7
ACK 5b3ea5fa2e7f6dc1c9161ed8b74c9be4bd1e92dd
352 | @@ -362,7 +353,7 @@ BOOST_AUTO_TEST_CASE(test_CheckQueue_FrozenCleanup) 353 | /** Test that CCheckQueueControl is threadsafe */ 354 | BOOST_AUTO_TEST_CASE(test_CheckQueueControl_Locks) 355 | { 356 | - auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE); 357 | + auto queue = std::make_unique<Standard_Queue>(QUEUE_BATCH_SIZE, SCRIPT_CHECK_THREADS);
This is a change in behavior for the test in 9cf89f7a5b81197e38f58b24be0793b28fe41477?
16 | @@ -17,6 +17,7 @@ 17 | #include <mapport.h> 18 | #include <net.h> 19 | #include <netbase.h> 20 | +#include <node/chainstatemanager_args.h> 21 | #include <txdb.h> // for -dbcache defaults 22 | #include <util/string.h> 23 | #include <validation.h> // For DEFAULT_SCRIPTCHECK_THREADS
5b3ea5fa2e7f6dc1c9161ed8b74c9be4bd1e92dd: Comment is now wrong, no?
lgtm