Lock-Free CheckQueue #9938

pull JeremyRubin wants to merge 3 commits into bitcoin:master from JeremyRubin:PR-lockfree-checkqueue changing 6 files +199 −152
  1. JeremyRubin commented at 12:20 pm on March 7, 2017: contributor

    TL;DR: This PR introduces a new hopefully easy-to-review Lock-Free CheckQueue algorithm. It’s really fast!

    Summary & Problem Statement

    In Bitcoin-Qt 0.8, parallel script validation was introduced to improve the speed at which a block could be validated. The basic way in which this worked is that the master thread creates a task for every script of each input in a transaction being checked, and enqueues it to a work queue. After adding all tasks, the master also becomes a worker. Each worker thread claims a number of tasks from the queue (based on how many tasks are available and how many workers are available) and then runs each task, reporting if any errors were encountered. The core approach has not changed significantly since 0.8. The current approach is deficient for four main reasons which these changes address over 2 major commits (and 1 minor one introducing API change only, and 2 others covered in previous PRs).

    Deficiencies in Current CheckQueue & Solutions

    1. Memory Instability:

    Each task must be moved to three different memory locations during its lifetime:

    1. the place where the master creates the task.
    2. the shared work queue.
    3. the worker-local work queue. This also makes it difficult to write multithreaded code, because memory is modified more than once during validation.

    Solution:

    The new algorithm uses stable memory: during block validation, enough memory for the worst case number of checks is allocated, and then tasks are directly placed and read out of that memory. Instead, each worker thread claims a pointer to the task. See 89a1f93

    2. Lock-Heavy algorithm:

    In the original algorithm, each worker thread and the master contend for a single lock for enqueuing and dequeuing tasks. This is highly inefficient, each worker spends significant amount of time acquiring the lock to be able to dequeue a task, and the master spends time waiting to enqueue tasks.

    Solution:

    The new algorithm is Lock-Free; during block validation no locks are taken for enqueuing or dequeuing tasks, only atomic memory is read or written. The use of a different piece of atomic memory for enqueuing and most dequeuing operations means that the master and worker threads infrequently interact during validation, meaning low contention. See 6e24aa8

    3. Sleeping during validation:

    In the original algorithm, while waiting for a task to be enqueued by the master, each thread sleeps. This means that the OS’s scheduler might choose to pre-empt the thread, thrashing it’s cache with another process’s memory. Furthermore, the latency for a wake up operation is high, as it requires the operating system to awaken a thread.

    Solution:

    the new algorithm only sleeps before and after block validation, and busy-spins waiting for tasks until the master joins and there are no more tasks available (then sleeps). See 6e24aa8

    4. Arbitrary batch size selection:

    The original batch size algorithm was selected as an experimental tuning for reasonable performance. If the batch size is too small, workers spend too much time getting a new task and not enough time running the tasks. The problem with any form of batching is that you may run into an imbalanced case where one worker has a batch (say of size five) and another worker has no tasks. In order to improve that, a batch size of one is optimal.

    Solution:

    The new algorithm uses a batch size of one, so that all threads finish their work at as close to the same time as possible. See 6e24aa8.

    Reviewing this PR

    Code Organization

    In writing this PR, I used many small-step iterations to get to the final design. While I am fairly confident that the commits in this PR should be easily reviewable, I have left the unsquashed version available here for anyone to reference https://github.com/JeremyRubin/bitcoin/tree/PR-lockfree-checkqueue-unsquashed

    Testing

    #9497 introduces a fairly comprehensive set of tests, which these changes pass, as well as the earlier CheckQueue. See #9497 for details on the conditions tested.

    Performance

    Test results are from a 2011 MacBook Pro 13", 2.7 GHz Intel Core i7, 8 GB 1333 MHz DDR3.

    MicroBenchmark Performance (2.9 to 5.7X faster)

    #9498 introduced two new microbenchmarks for the CheckQueue. One which tests tasks which are empty, and another which tests tasks that do a little bit of work. These are microbenchmarks, so they need to be taken with a grain of salt.

    0#Before (7ff4a538a8682cdf02a4bcd6f15499c841001b73)
    1#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
    2CCheckQueueSpeed,896,0.001128047704697,0.001315370202065,0.001167648338846,3038807,3543440,3145483
    3CCheckQueueSpeedPrevectorJob,320,0.002763845026493,0.003494121134281,0.003255434334278,7445473,9415834,8770003
    4
    5#After (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
    6#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
    7CCheckQueueSpeed,5120,0.000198634807020,0.000226773321629,0.000202708039433,535092,610897,546067
    8CCheckQueueSpeedPrevectorJob,896,0.000990316271782,0.001982234418392,0.001142452071820,2667680,5339862,3077721
    

    So we see for trivial jobs, it’s about 5.7X faster, and for more involved jobs, about 2.9X faster.

    Test Performance (10X faster)

    I have observed very nice performance improvements when it comes to how long it takes to run the checkqueue_tests.

    0#before (08e4e1ea89427a2594415d0b37011692a5109c39)
    1$ time ./test_bitcoin -t checkqueue_tests
    2./test/test_bitcoin -t checkqueue_tests  12.39s user 58.12s system 116% cpu 1:00.31 total
    3#after (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
    4$ time ./test_bitcoin -t checkqueue_tests
    5./test/test_bitcoin -t checkqueue_tests  3.43s user 1.40s system 78% cpu 6.180 total
    

    So we see about a 10x performance improvement here.

    Cross Platform

    It would be good to have reviewers comment with cross platform performance testing, as some of the lock free instructions may be slower on some platforms than others, and I don’t have access to a large enough swath of machines. If you’re reviewing performance, you may find the benchdiff.py tool I wrote useful https://github.com/JeremyRubin/bitcoin/commit/14aa19a35cbf0cff742f36a2c9ca00af162918ee.

    Real World/Simulated Numbers

    I don’t have great real-world-node numbers yet for these changes, but I’ll wait for others (e.g., @morcos) to report back on those.

    edit: I have run a simulation on a month of block validation on a 4-core machine and seen no difference in aggregate performance. I’ll poke around to see if the design can be tweaked a little bit for some advantage with fewer cores, but the more important simulations are for machines with >7 cores or multiple cpus, where the contention relieved by this PR becomes more significant.

    Notes

    This builds on #9497 (and #9495). There were a couple minor nits on those PR’s outstanding, but I figure they can be addressed here later with a squashme (rather than having to squash those, get them re-reviewed, and then issue this PR). An earlier (and worse) attempt at a similar design can be seen here for comparison #8464.

    Acknowledgements

    Thanks to @morcos and @sdaftuar for supporting and helping in the development of this work, and to various others (@TheBlueMatt, @theuni, and others) for review in feedback at various stages of this project.

  2. laanwj added the label Validation on Mar 7, 2017
  3. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    36+
    37+struct FakeCheckCheckCompletion {
    38+    static std::atomic<size_t> n_calls;
    39+    bool operator()()
    40+    {
    41+        ++n_calls;
    


    TheBlueMatt commented at 9:39 pm on March 8, 2017:
    To make the test expose more parallellism, should probably use release/acquire here, no?
  4. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    94+    {
    95+        fake_allocated_memory += b;
    96+    };
    97+    ~MemoryCheck(){
    98+        fake_allocated_memory -= b;
    99+    
    


    TheBlueMatt commented at 9:42 pm on March 8, 2017:
    nit: whitespace at end of line (and stray line, it looks like).
  5. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    222+
    223+    for (size_t i = 0; i < 1001; ++i) {
    224+        CCheckQueueControl<FailingCheck> control(fail_queue.get());
    225+        size_t remaining = i;
    226+        while (remaining) {
    227+            size_t r = GetRand(10);
    


    TheBlueMatt commented at 9:45 pm on March 8, 2017:
    nit: something > 10 might test parallellism a bit more, no? (and in a few other places)
  6. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    260+                vChecks.resize(100, false);
    261+                vChecks[99] = end_fails;
    262+                control.Add(vChecks);
    263+            }
    264+            bool r =control.Wait();
    265+            BOOST_REQUIRE(r || end_fails);
    


    TheBlueMatt commented at 9:49 pm on March 8, 2017:
    I think you meant to have an XOR here.

    JeremyRubin commented at 6:15 pm on March 27, 2017:
    Yes.
  7. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    358+        control.Wait(); // Hangs here
    359+    });
    360+    {
    361+        std::unique_lock<std::mutex> l(FrozenCleanupCheck::m);
    362+        // Wait until the queue has finished all jobs and frozen
    363+        FrozenCleanupCheck::cv.wait(l, [](){return FrozenCleanupCheck::nFrozen == 1;});
    


    TheBlueMatt commented at 10:13 pm on March 8, 2017:
    nit: maybe close the “l” scope after this line so that the try_lock()s happen without FrozenCleanupCheck::m?
  8. in src/test/checkqueue_tests.cpp: in 08e4e1ea89 outdated
    335+    }
    336+    tg.interrupt_all();
    337+    tg.join_all();
    338+}
    339+
    340+// Test that a new verification cannot occur until all checks 
    


    TheBlueMatt commented at 10:14 pm on March 8, 2017:
    nit: whitespace at EOL here.
  9. in src/checkqueue.h: in 89a1f939d2 outdated
    40@@ -41,7 +41,6 @@ class CCheckQueue
    41 
    42     //! The queue of elements to be processed.
    


    TheBlueMatt commented at 10:32 pm on March 8, 2017:
    I think you meant to remove/move this comment as well?
  10. in src/checkqueue.h: in 89a1f939d2 outdated
    139@@ -145,18 +140,25 @@ class CCheckQueue
    140         return Loop(true);
    141     }
    142 
    143+    typename std::vector<T>::iterator check_mem;
    


    TheBlueMatt commented at 10:36 pm on March 8, 2017:
    Nit: can these be private?
  11. in src/checkqueue.h: in 89a1f939d2 outdated
    221+    {
    222+        if (pqueue != NULL) {
    223+            check_mem.emplace_back(std::forward<Args>(args)...);
    224+        }
    225+    }
    226+    void Flush(size_t s)
    


    TheBlueMatt commented at 10:38 pm on March 8, 2017:
    Can you add some documentation for Flush() (and how it interacts with the two versions of Add()) here?
  12. in src/checkqueue.h: in 89a1f939d2 outdated
    139@@ -145,18 +140,25 @@ class CCheckQueue
    140         return Loop(true);
    141     }
    142 
    143+    typename std::vector<T>::iterator check_mem;
    144+    typename std::vector<T>::iterator check_mem_top;
    145+    typename std::vector<T>::iterator check_mem_bottom;
    146+    void Setup(typename std::vector<T>::iterator check_mem_in) 
    


    TheBlueMatt commented at 10:39 pm on March 8, 2017:
    nit: whitespace at EOL here.
  13. in src/checkqueue.h: in 6e24aa818b outdated
    77+    T* check_mem {nullptr};
    78+
    79+    /** The begin and end offsets into check_mem. 128 bytes of padding is
    80+     * inserted before and after check_mem_top to eliminate false sharing*/
    81+    std::atomic<uint32_t> check_mem_bot {0};
    82+    unsigned char _padding[128];
    


    TheBlueMatt commented at 10:46 pm on March 8, 2017:
    Maybe ifdef hardware_destructive_interference_size use it?
  14. in src/checkqueue.h: in 6e24aa818b outdated
    184+                // practice in case T uses a lot of memory.
    185                 auto t = T();
    186-                checks_iterator->swap(t);
    187-                std::advance(checks_iterator, 1);
    188+                pT->swap(t);
    189+                // Can't reveal result until after swapped, otherwise
    


    TheBlueMatt commented at 10:57 pm on March 8, 2017:
    Huh? Sure we can, we just cant decrement nAwake until we’ve swap()ed. I assume you had a previous version with an early termination for !fAllOk?

    JeremyRubin commented at 2:11 am on March 9, 2017:

    Yes; you’re correct. Earlier version was like this.

    Also termination does still occur “early”, all the actual checks are skipped (they are just stilled dequeued). I suppose I could make it abort (may actually be nice to check if it aborted before calling checkinputs… but maybe that’s best for another PR).

    I’ll refactor to something which converges more quickly on abort (e.g., setting !fMasterPresent && !fAllOk)


    TheBlueMatt commented at 5:26 pm on March 9, 2017:
    Its probably better to not have quick abort and have less between-thread contention, no?

    JeremyRubin commented at 6:27 pm on March 9, 2017:

    I think the solution i came up with is pretty trivial (I’ll push it later today).

     0fOk = fAllOk.load(std::memory_order_relaxed);
     1if (fOk) {
     2    T t();
     3    pT->swap(t);
     4} else {
     5    fAllOk.store(false, std::memory_order_relaxed);
     6    fMasterPresent = false;
     7    // try to consume all values as quickly as possible
     8    while (top_cache > bottom_cache &&
     9                    !check_mem_bot.compare_exchange_weak( bottom_cache, top_cache)) {
    10            top_cache = check_mem_top.load();
    11    }
    12}
    

    Especially since now, fAllOk is never written really now.


    TheBlueMatt commented at 8:07 pm on March 9, 2017:
    Looks like too just complexity? Just dont bother exiting early, we can take a performance hit of very little on an invalid block, I think. Just dont bother writing to fAllOk if fOk?
  15. in src/checkqueue.h: in 6e24aa818b outdated
    68+    std::atomic<bool> fMasterPresent;
    69 
    70     //! Whether we're shutting down.
    71     bool fQuit;
    72 
    73     //! The maximum number of elements to be processed in one batch
    


    TheBlueMatt commented at 11:05 pm on March 8, 2017:
    Looks like, in the current version, you can remove nBatchSize as well (though I’m surprised its not a performance gain to batch operations?

    JeremyRubin commented at 2:08 am on March 9, 2017:
    Yes, absolutely. I didn’t change it to minimize changeset, but it no longer has a use.
  16. TheBlueMatt commented at 11:08 pm on March 8, 2017: member
    Generally looks good. I’m curious if you’ve benchmarked the lock-free version vs just dropping the last commit (possibly after removing the batching, if that was actually a loss). Given the contention on fAllOk (which could be trivially removed since you dont quit early on !fAllOk, btw) and check_mem_top/check_mem_bot, I’m surprised there is much of a win to be had on the last commit.
  17. CheckQueue: Add size constants to validation.{h,cpp}.
    Update code to pass MAX_SCRIPTCHECKS_PER_BLOCK to CCheckQueueControl
    0232c07c82
  18. CheckQueue: Make the checkqueue's checks "memory-stable". Before this commit,
    checks were stored in three separate vectors. First, a vector was made locally
    to pass to a call to CCheckQueueControl::Add. In that call, checks were added to
    the main CCheckQueue memory, then each thread had to locally (per batch) put the
    check into a new, local vector. This means the location of the Check moved 2
    times. In this patch, the check is created in a buffer associated with the
    CCheckQueueControl. Instead of moving the check, a counter is incremented to
    keep track of which threads have claimed which checks.
    
    Also updates benchmark to use new in-place CheckQueue API
    12e9cd472d
  19. CheckQueue: This commit refactors the CheckQueue to be written "lock free" style
    to improve the concurrency on multicore machines.  In effect, it eliminates the
    use of any "critical section" during block validation.
    39397bb40c
  20. JeremyRubin force-pushed on Mar 27, 2017
  21. JeremyRubin referenced this in commit b36ad42d0a on Mar 27, 2017
  22. in src/checkqueue.h:119 in 39397bb40c
    175-            }
    176-            // execute work
    177-            for (unsigned int i = 0; i < nNow && fOk; i++) {
    178-                fOk = (*checks_iterator)();
    179+                // execute work
    180+                fOk &= fOk && (*pT)();
    


    theuni commented at 8:54 pm on March 27, 2017:

    Why not:

    0fOk &= fOk && T(std::move(*pT))()
    

    and do away with all of the swap business?

  23. JeremyRubin referenced this in commit 5004e46063 on Mar 29, 2017
  24. JeremyRubin referenced this in commit 9cdb246e81 on Aug 9, 2017
  25. JeremyRubin referenced this in commit 8c2f4b8882 on Aug 9, 2017
  26. sipa referenced this in commit 424be03305 on Oct 12, 2017
  27. JeremyRubin commented at 6:07 pm on March 6, 2018: contributor

    Going to close this for now; it is a good target for future work if someone wants to take another crack at it.

    I have a number of further relaxations/tweaks (which need more review) to this PR available here https://github.com/JeremyRubin/bitcoin/pull/1 if anyone is taking it over, they should review those too.

  28. JeremyRubin closed this on Mar 6, 2018

  29. ryanofsky commented at 6:34 pm on March 6, 2018: member
    Should be tagged up for grabs
  30. fanquake added the label Up for grabs on Mar 6, 2018
  31. HashUnlimited referenced this in commit d83b93f000 on Mar 12, 2018
  32. PastaPastaPasta referenced this in commit 67513d3df3 on Dec 22, 2019
  33. PastaPastaPasta referenced this in commit b85ffe17e5 on Jan 2, 2020
  34. PastaPastaPasta referenced this in commit 84a3603b87 on Jan 4, 2020
  35. PastaPastaPasta referenced this in commit 84ab39d400 on Jan 12, 2020
  36. PastaPastaPasta referenced this in commit 962becc9eb on Jan 12, 2020
  37. PastaPastaPasta referenced this in commit 69991c3764 on Jan 12, 2020
  38. PastaPastaPasta referenced this in commit 4d0475b592 on Jan 12, 2020
  39. PastaPastaPasta referenced this in commit 920e3e709d on Jan 12, 2020
  40. PastaPastaPasta referenced this in commit 40f559ebbb on Jan 12, 2020
  41. PastaPastaPasta referenced this in commit 1e2ab7633f on Jan 16, 2020
  42. jonatack commented at 8:55 am on January 11, 2021: member
    Thanks @JeremyRubin for the really nice exposition of the issues in this PR description, found it helpful for reviewing #18710.
  43. laanwj referenced this in commit b386d37360 on Jan 25, 2021
  44. sidhujag referenced this in commit af159bf7a7 on Jan 25, 2021
  45. adamjonas removed the label Up for grabs on Feb 26, 2021
  46. ckti referenced this in commit 72d0e25cdd on Mar 28, 2021
  47. gades referenced this in commit 482f526902 on May 1, 2022
  48. DrahtBot locked this on Aug 18, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-18 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me