refactor: make checkqueue manage the threads by itself (also removed some boost dependencies) #14464

pull ken2812221 wants to merge 3 commits into bitcoin:master from ken2812221:drop-boost-cond changing 9 files +156 −112
  1. ken2812221 commented at 1:36 pm on October 11, 2018: contributor

    This PR removes use of the following object in checkqueue

    • boost::condition_variable
    • boost::mutex
    • boost::unique_lock

    Instead of using boost::thread_group::interrupt_all to trigger boost::thread_interrupted exception, it should call InterruptScriptCheck() or scriptcheckqueue.Interrupt() to interrupt the checkqueue thread.

    Also, it adds three functions:

    • StartScriptCheck()
    • InterruptScriptCheck()
    • StopScriptCheck()

    for clearer logic for script check threads.

  2. fanquake added the label Refactoring on Oct 11, 2018
  3. ken2812221 force-pushed on Oct 11, 2018
  4. ken2812221 force-pushed on Oct 11, 2018
  5. in test/lint/lint-includes.sh:69 in 5bb441ce2e outdated
    67@@ -68,7 +68,6 @@ EXPECTED_BOOST_INCLUDES=(
    68     boost/test/unit_test.hpp
    69     boost/thread.hpp
    70     boost/thread/condition_variable.hpp
    


    practicalswift commented at 1:52 pm on October 11, 2018:
    According to PR title boost/thread/condition_variable.hpp should go as well? :-)

    ken2812221 commented at 1:54 pm on October 11, 2018:
    I only remove them in checkqueue, the another place to use it is in scheduler.
  6. practicalswift commented at 1:52 pm on October 11, 2018: contributor

    Concept ACK

    Towards a Boost free future! :-)

  7. ken2812221 force-pushed on Oct 11, 2018
  8. ken2812221 force-pushed on Oct 11, 2018
  9. in src/checkqueue.h:65 in 0e9a6f40e8 outdated
    61@@ -62,17 +62,19 @@ class CCheckQueue
    62     //! The maximum number of elements to be processed in one batch
    63     unsigned int nBatchSize;
    64 
    65+    std::atomic<bool> fInterrupted;
    


    Empact commented at 2:43 pm on October 11, 2018:
    nit: could just call this interrupted
  10. ken2812221 force-pushed on Oct 11, 2018
  11. ken2812221 force-pushed on Oct 11, 2018
  12. ken2812221 force-pushed on Oct 11, 2018
  13. in src/checkqueue.h:103 in 84b0fcdf65 outdated
     97@@ -96,7 +98,11 @@ class CCheckQueue
     98                         return fRet;
     99                     }
    100                     nIdle++;
    101+                    if (interrupted)
    102+                        return false;
    103                     cond.wait(lock); // wait
    104+                    if (interrupted)
    105+                        return false;
    


    Empact commented at 3:18 pm on October 11, 2018:

    If an if only has a single-statement then-clause, it can appear on the same line as the if, without braces. In every other case, braces are required, and the then and else clauses must appear correctly indented on a new line.

    https://github.com/bitcoin/bitcoin/blob/master/doc/developer-notes.md#coding-style-c


    ken2812221 commented at 3:26 pm on October 11, 2018:
    clang-format doesn’t catch this. Done manually.
  14. ken2812221 force-pushed on Oct 11, 2018
  15. ken2812221 renamed this:
    rafactor: Remove use of boost::condition_variable and boost::mutex in checkqueue
    refactor: Remove use of boost::condition_variable and boost::mutex in checkqueue
    on Oct 11, 2018
  16. ken2812221 force-pushed on Oct 11, 2018
  17. in src/validation.h:265 in 76c233a135 outdated
    259@@ -260,8 +260,12 @@ bool LoadBlockIndex(const CChainParams& chainparams) EXCLUSIVE_LOCKS_REQUIRED(cs
    260 bool LoadChainTip(const CChainParams& chainparams);
    261 /** Unload database information */
    262 void UnloadBlockIndex();
    263-/** Run an instance of the script checking thread */
    264-void ThreadScriptCheck();
    265+/** Start script checking thread */
    266+void StartScriptCheck();
    267+/** Interrupt all script checking thread */
    


    ch4ot1c commented at 7:48 pm on October 11, 2018:
    Pluralize to threads
  18. ch4ot1c commented at 7:48 pm on October 11, 2018: contributor
    utACK
  19. ken2812221 force-pushed on Oct 11, 2018
  20. ken2812221 force-pushed on Oct 11, 2018
  21. ken2812221 force-pushed on Oct 11, 2018
  22. ken2812221 renamed this:
    refactor: Remove use of boost::condition_variable and boost::mutex in checkqueue
    refactor: Remove boost in checkqueue
    on Oct 12, 2018
  23. ken2812221 force-pushed on Oct 12, 2018
  24. sipa commented at 7:51 pm on October 12, 2018: member
    utACK 9aacef638f0c2159930640762daf524942052a9d
  25. in src/checkqueue.h:170 in 9aacef638f outdated
    165+    {
    166+        {
    167+            std::lock_guard<std::mutex> lock(mutex);
    168+            interrupted = true;
    169+        }
    170+        condMaster.notify_all();
    


    donaloconnor commented at 10:07 pm on October 12, 2018:
    tiny nit: only one master thread so maybe condMaster.notify_one() is more consistent. No difference to code though.
  26. ken2812221 force-pushed on Oct 13, 2018
  27. in src/test/checkqueue_tests.cpp:151 in 68e50c8779 outdated
    147@@ -149,9 +148,9 @@ typedef CCheckQueue<FrozenCleanupCheck> FrozenCleanup_Queue;
    148 static void Correct_Queue_range(std::vector<size_t> range)
    149 {
    150     auto small_queue = MakeUnique<Correct_Queue>(QUEUE_BATCH_SIZE);
    151-    boost::thread_group tg;
    152+    std::vector<std::thread> tg;
    


    practicalswift commented at 2:00 pm on October 13, 2018:
    nit: Request vector capacity of nScriptCheckThreads before loop?

    JeremyRubin commented at 0:19 am on October 16, 2018:
    This is good practice, but it’s test code so it doesn’t really matter all too much.
  28. in src/test/transaction_tests.cpp:465 in 68e50c8779 outdated
    461@@ -462,12 +462,13 @@ BOOST_AUTO_TEST_CASE(test_big_witness_transaction) {
    462 
    463     // check all inputs concurrently, with the cache
    464     PrecomputedTransactionData txdata(tx);
    465-    boost::thread_group threadGroup;
    466+    std::vector<std::thread> threadGroup;
    


    practicalswift commented at 2:01 pm on October 13, 2018:
    nit: Same here. Request a capacity of 20? :-)
  29. fanquake added this to the "In progress" column in a project

  30. in src/checkqueue.h:102 in 68e50c8779 outdated
     97@@ -96,7 +98,9 @@ class CCheckQueue
     98                         return fRet;
     99                     }
    100                     nIdle++;
    101+                    if (interrupted) return false;
    102                     cond.wait(lock); // wait
    


    JeremyRubin commented at 0:26 am on October 16, 2018:

    Add the condition as a callback to wait.

    e.g., cond.wait(lock, [&] { return interrupted || !queue.empty(); })

  31. JeremyRubin commented at 0:35 am on October 16, 2018: contributor

    Weak concept Ack.

    Necessary step to get rid of boost, but conflicts with some unmerged work (https://github.com/bitcoin/bitcoin/pull/9938, https://github.com/JeremyRubin/bitcoin/pull/1) which does the same thing.

    Overall, this appears to be correct.

    Before merge I’d like to see some benchmarks, the checkqueue code is very performance sensitive, sometimes the boost condition variables and mutexs difference from std’s can be substantial. (I’m not kidding – the time to wake-up and condition variable sleep behavior is one of the major performance issues with this code).

  32. ken2812221 force-pushed on Oct 16, 2018
  33. ken2812221 commented at 4:26 pm on October 16, 2018: contributor
    Addressed comments
  34. ken2812221 force-pushed on Oct 16, 2018
  35. ken2812221 force-pushed on Oct 16, 2018
  36. ken2812221 force-pushed on Oct 16, 2018
  37. ken2812221 force-pushed on Oct 16, 2018
  38. in src/init.cpp:1274 in d0107034fe outdated
    1200@@ -1199,12 +1201,6 @@ bool AppInitMain()
    1201     InitSignatureCache();
    1202     InitScriptExecutionCache();
    1203 
    1204-    LogPrintf("Using %u threads for script verification\n", nScriptCheckThreads);
    


    sipa commented at 7:16 am on October 17, 2018:
    I was expecting to see this replaced with a StartScriptCheck(); line; why is that not necessary?

    ken2812221 commented at 2:43 pm on October 17, 2018:
    I think I forgot to add it. Fixed now.
  39. ken2812221 force-pushed on Oct 17, 2018
  40. ken2812221 commented at 2:18 am on October 18, 2018: contributor

    @JeremyRubin benchmark result on my Ubuntu 18.04 virtual machine: master 816fab9ccae568612d5ed90378b4587256925a1e

    0# Benchmark, evals, iterations, total, min, max, median
    1CCheckQueueSpeedPrevectorJob, 100, 1400, 83.1845, 0.000461328, 0.000692877, 0.000668382
    

    This PR 625c55a70c376af9a94a2bb118c39685d3b82b33

    0# Benchmark, evals, iterations, total, min, max, median
    1CCheckQueueSpeedPrevectorJob, 100, 1400, 66.268, 0.000451829, 0.000491473, 0.000473479
    
  41. JeremyRubin commented at 3:38 am on October 18, 2018: contributor

    That’s great, thanks for checking that – much bigger change than I expected it to be.

    Looking at the numbers, it seems not only to be faster on the microbenchmark, but more consistent.

    This merits running two additional tests, if you have energy for it:

    1. Time to reindex (with assumevalid up to last month or something so it doesn’t take forever
    2. Two nodes (one with this PR one without) connected behind a third node (on latest release), with debug bench set. Collect the logs on ConnectBlock.
      • Also collect a general system profile of which processes are using what on the cpu.

    The second test is important because one thing I noticed when working on this code previously is that it was trivially easy to get speedups on this code at the expense of general performance. E.g., getting rid of condition variables (and just spinning until jobs came in) was much faster, but meant that the OS scheduler would be giving our CheckQueue workers run-time when there was nothing to do, impacting other processes on the computer (most likely because of causing more cache invalidation, rather than the impact of losing a few cycles of a process switch). Hence my suspicion: the 20% speedup is because the standard condition variables are not using system sleep (which plays better with the OS). This can actually negatively impact the process too as our main thread might be getting context switched out while we are trying to add jobs or something rather than waiting for notify.

  42. in src/checkqueue.h:100 in 440e02a3e2 outdated
     98                         // return the current status
     99                         return fRet;
    100                     }
    101                     nIdle++;
    102-                    cond.wait(lock); // wait
    103+                    cond.wait(lock, [&] { return interrupted || !queue.empty() || fMaster && nTodo == 0; }); // wait
    


    JeremyRubin commented at 3:42 am on October 18, 2018:

    I think you could assert(! (fMaster && nTodo == 0)) here.

    If fMaster were ever true here, we would deadlock.

    Hence, there is no point to check that…

  43. DrahtBot commented at 10:02 am on October 20, 2018: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  44. ken2812221 force-pushed on Oct 20, 2018
  45. ken2812221 force-pushed on Oct 20, 2018
  46. ken2812221 force-pushed on Oct 20, 2018
  47. ken2812221 force-pushed on Oct 20, 2018
  48. ken2812221 force-pushed on Oct 20, 2018
  49. ken2812221 force-pushed on Oct 20, 2018
  50. ken2812221 force-pushed on Oct 20, 2018
  51. in src/checkqueue.h:101 in 02aae28c9d outdated
     98                         // return the current status
     99                         return fRet;
    100+                    } else {
    101+                        nIdle++;
    102+                        while (!(interrupted || !queue.empty() || (fMaster && nTodo == 0))) cond.wait(lock);
    103+                        if (interrupted) return false;
    


    ken2812221 commented at 2:26 pm on October 24, 2018:
    After tested a few times, return false or somthing else? seem to make the node have to reindex next time.

    JeremyRubin commented at 7:16 am on October 25, 2018:

    The issue could be that when you interrupt, you don’t want to make fMaster return false causing the block to be marked false (which could be written into cache somehow).

    Is interrupt guaranteed to wait until the current job is done? Making it guaranteed to wait probably could help fix that behavior… One way of doing that is to separate out when you wake up if there’s an active job or not, and don’t allow to quit when there is an active job until it’s done.


    JeremyRubin commented at 7:22 am on October 25, 2018:
    I’d recommend adapting the quitting logic from https://github.com/JeremyRubin/bitcoin/blob/aa847431e266fb08289edbac3665ac7c93a81b8a/src/checkqueue.h if you want a reference, which should be more or less correct (albeit with a lot of extensions you can ignore).
  52. ken2812221 renamed this:
    refactor: Remove boost in checkqueue
    [WIP] refactor: Remove boost in checkqueue
    on Oct 24, 2018
  53. JeremyRubin commented at 8:32 am on October 25, 2018: contributor

    This version is not correct either if I’m correct about the block getting flushed with validity cached during shutdown.

    Now you might incorrectly cache that a block was valid when it may not have been.

    You need to finish processing the block, then shut down. Or, if a shutdown is triggered, abandon the block in the master (you could make a trivalent return: false, true, shutdown).

  54. JeremyRubin commented at 8:36 am on October 25, 2018: contributor
    Actually I think that’s not right, I think this version will just hang because nTodo will be != 0 and queue.empty() will be true, so the master would sleep and never get woken?
  55. ken2812221 commented at 3:31 am on October 26, 2018: contributor
    @JeremyRubin master would never wait/sleep if interrupted is true. It would keep looping until nTodo == 0. Oh, I know what you mean.
  56. JeremyRubin commented at 6:24 am on October 26, 2018: contributor
    Your best bet IMO is to have a CheckQueueControl mutex which is separate from the CheckQueue mutex. Make interrupting get both…
  57. ken2812221 force-pushed on Oct 26, 2018
  58. ken2812221 force-pushed on Oct 26, 2018
  59. ken2812221 force-pushed on Oct 26, 2018
  60. ken2812221 force-pushed on Oct 26, 2018
  61. ken2812221 force-pushed on Oct 26, 2018
  62. ken2812221 force-pushed on Oct 26, 2018
  63. ken2812221 force-pushed on Oct 26, 2018
  64. ken2812221 force-pushed on Oct 26, 2018
  65. ken2812221 force-pushed on Oct 26, 2018
  66. ken2812221 commented at 2:46 pm on October 26, 2018: contributor

    @JeremyRubin How about now? I think the logic here would be more clear. It would keep checking until finish. If there are new tasks added into queue after the worker threads ended, it would be only master worker checking the tasks unfortunately. However I think this is better than interrupt the master thread. Alternatively, we can detach these worker threads.

    I am going to test this to see if there is anything broken.

  67. ken2812221 force-pushed on Oct 26, 2018
  68. ken2812221 force-pushed on Oct 26, 2018
  69. JeremyRubin commented at 11:53 pm on October 26, 2018: contributor

    I still think it’s best to make control of the checkqueue (as opposed to participation) protected by a mutex.

    Easiest to audit/review.

    But do let me know if that fixes the corruption issue…

  70. ken2812221 commented at 0:01 am on October 27, 2018: contributor

    I still think it’s best to make control of the checkqueue (as opposed to participation) protected by a mutex.

    There already have ControlMutex to ensure that there is only one CCheckQueueControl to access the CCheckQueue.

    Edit: I don’t think it’s necessary to interrupt the master queue though. We can wait it to return the check result.

    The second thought, if we detach the worker thread, then that is not necessary to interrupt the checkqueue. So I start to suspect that if the Interrupt is necessary here.

  71. JeremyRubin commented at 8:29 am on October 27, 2018: contributor
    I am saying that a simple change to make is to have Interrupt take the ControlMutex.
  72. ken2812221 force-pushed on Oct 27, 2018
  73. ken2812221 force-pushed on Oct 27, 2018
  74. ken2812221 force-pushed on Oct 27, 2018
  75. ken2812221 force-pushed on Oct 27, 2018
  76. ken2812221 commented at 12:42 pm on October 27, 2018: contributor

    I am saying that a simple change to make is to have Interrupt take the ControlMutex.

    That would cause recursive lock in test_bitcoin. Also, I don’t think that it’s necessary to lock ControlMutex. Can you explain why it’s necessary for Interrupt?

    But do let me know if that fixes the corruption issue…

    Seems the node won’t have to reindex after restarting now.

  77. ken2812221 renamed this:
    [WIP] refactor: Remove boost in checkqueue
    refactor: Remove boost in checkqueue
    on Oct 27, 2018
  78. JeremyRubin commented at 9:31 pm on October 27, 2018: contributor
    Where’s the recursive lock?
  79. ken2812221 commented at 10:53 pm on October 27, 2018: contributor
    In test_bitcoin, it constuct CCheckQueueControl and call Interrupt in same thread
  80. JeremyRubin commented at 11:06 pm on October 27, 2018: contributor
    Interrupt should be called in the destructor of the test which should happen after any queuecontrols are destructed though?
  81. ken2812221 force-pushed on Oct 28, 2018
  82. ken2812221 commented at 5:05 am on October 28, 2018: contributor

    Interrupt should be called in the destructor of the test which should happen after any queuecontrols are destructed though?

    I think it would be better to make the checkqueue to control their worker threads instead of calling Thread() in somewhere.

  83. ken2812221 force-pushed on Oct 28, 2018
  84. ken2812221 force-pushed on Oct 28, 2018
  85. ken2812221 force-pushed on Oct 28, 2018
  86. ken2812221 force-pushed on Oct 28, 2018
  87. ken2812221 force-pushed on Oct 30, 2018
  88. ken2812221 force-pushed on Oct 30, 2018
  89. ken2812221 force-pushed on Oct 30, 2018
  90. ken2812221 force-pushed on Oct 30, 2018
  91. ken2812221 force-pushed on Oct 30, 2018
  92. ken2812221 force-pushed on Oct 30, 2018
  93. ken2812221 force-pushed on Oct 30, 2018
  94. ken2812221 force-pushed on Oct 30, 2018
  95. ken2812221 force-pushed on Oct 30, 2018
  96. ken2812221 force-pushed on Oct 30, 2018
  97. ken2812221 force-pushed on Oct 30, 2018
  98. ken2812221 force-pushed on Oct 31, 2018
  99. ken2812221 force-pushed on Nov 5, 2018
  100. ken2812221 force-pushed on Nov 9, 2018
  101. ken2812221 force-pushed on Nov 22, 2018
  102. ken2812221 force-pushed on Nov 22, 2018
  103. ken2812221 force-pushed on Nov 22, 2018
  104. ken2812221 renamed this:
    refactor: Remove boost in checkqueue
    refactor: make checkqueue manage the threads by itself (also removed some boost dependencies)
    on Nov 23, 2018
  105. DrahtBot added the label Needs rebase on Dec 13, 2018
  106. ken2812221 force-pushed on Dec 17, 2018
  107. DrahtBot removed the label Needs rebase on Dec 17, 2018
  108. DrahtBot added the label Needs rebase on Dec 29, 2018
  109. ken2812221 force-pushed on Dec 29, 2018
  110. fanquake deleted a comment on Jan 3, 2019
  111. fanquake removed the label Needs rebase on Jan 3, 2019
  112. fanquake deleted a comment on Jan 3, 2019
  113. DrahtBot added the label Needs rebase on Jan 21, 2019
  114. ken2812221 force-pushed on Jan 22, 2019
  115. DrahtBot removed the label Needs rebase on Jan 22, 2019
  116. MarcoFalke commented at 8:21 pm on March 6, 2019: member
    Could try a git rebase -Xignore-all-space df36ddf9ce8a4dcf0d7f324e57a13abb6cf6a57c?
  117. DrahtBot added the label Needs rebase on Mar 6, 2019
  118. refactor: Make CCheckQueue interruptible
    This commit add a Interrupt function for CCheckQueue that it can handle interrupt by itself instead of relying on boost thread interrupt
    d998ca40e1
  119. refactor: Drop boost in CCheckQueue
    replace boost::mutex with debuggable Mutex
    replace boost::condition_variable with std::condition_variable
    add const specifier to fMaster and nBatchSize
    add clang thread safety attributes
    move init value of member of CCheckQueue from constructor to definition
    8a6e1a9531
  120. refactor: Make CCheckQueue worker threads be owned by itself
    Manage CCheckQueue workder threads by itself. Do not expose the threads outside the object.
    f8145436ad
  121. ken2812221 force-pushed on Mar 7, 2019
  122. DrahtBot removed the label Needs rebase on Mar 7, 2019
  123. ken2812221 closed this on Mar 15, 2019

  124. practicalswift commented at 6:10 am on March 15, 2019: contributor
    @ken2812221 Why the close? :-)
  125. ken2812221 commented at 6:20 am on March 15, 2019: contributor
    The CI was failed and I don’t have time to find the bugs out recently, so I closed it, I might reopen this when I has more time working on this.
  126. MarcoFalke added the label Up for grabs on Mar 15, 2019
  127. jnewbery moved this from the "In progress" to the "Later" column in a project

  128. laanwj referenced this in commit b386d37360 on Jan 25, 2021
  129. sidhujag referenced this in commit af159bf7a7 on Jan 25, 2021
  130. MarcoFalke removed the label Up for grabs on Jan 30, 2021
  131. DrahtBot locked this on Aug 16, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-03 13:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me