TL;DR: This PR introduces a new, hopefully easy-to-review, lock-free CheckQueue algorithm. It’s really fast!
Summary & Problem Statement
In Bitcoin-Qt 0.8, parallel script validation was introduced to improve the speed at which a block could be validated. The basic way this works is that the master thread creates a task for every script of each input in a transaction being checked and enqueues it to a work queue. After adding all tasks, the master also becomes a worker. Each worker thread claims a number of tasks from the queue (based on how many tasks and how many workers are available) and then runs each task, reporting whether any errors were encountered. The core approach has not changed significantly since 0.8. The current approach is deficient for four main reasons, which these changes address over 2 major commits (plus 1 minor commit introducing an API change only, and 2 others covered in previous PRs).
Deficiencies in Current CheckQueue & Solutions
1. Memory Instability:
Each task must be moved to three different memory locations during its lifetime:
- the place where the master creates the task.
- the shared work queue.
- the worker-local work queue.
This also makes it difficult to write correct multithreaded code, because each task’s memory is modified more than once during validation.
Solution:
The new algorithm uses stable memory: during block validation, enough memory for the worst-case number of checks is allocated up front, and tasks are placed into and read directly out of that memory, never moving; each worker thread instead claims a pointer to a task. See 89a1f93
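To make the idea concrete, here is a minimal sketch of a stable-memory task pool. All names (`StableTaskPool`, `Task`, `Add`, `Claim`) are illustrative, not the PR’s actual identifiers, and the real queue in 89a1f93 is considerably more involved:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical task type standing in for a script check.
struct Task { int input_index; bool ok; };

class StableTaskPool {
    std::vector<Task> slab;              // allocated once for the worst case
    std::atomic<size_t> next_to_add{0};
    std::atomic<size_t> next_to_claim{0};
public:
    explicit StableTaskPool(size_t worst_case) : slab(worst_case) {}

    // Master: construct the task in place; it never moves afterwards.
    void Add(Task t) { slab[next_to_add.fetch_add(1)] = t; }

    // Worker: claim a *pointer* into the slab instead of copying the task.
    // Simplified: assumes a claim never races past a concurrent Add.
    Task* Claim() {
        size_t i = next_to_claim.fetch_add(1);
        if (i >= next_to_add.load()) return nullptr;  // nothing available
        return &slab[i];
    }
};
```

Because the slab never reallocates during validation, pointers handed to workers stay valid for the whole block.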
2. Lock-Heavy algorithm:
In the original algorithm, each worker thread and the master contend for a single lock for enqueuing and dequeuing tasks. This is highly inefficient: each worker spends a significant amount of time acquiring the lock just to dequeue a task, and the master spends time waiting to enqueue tasks.
Solution:
The new algorithm is lock-free: during block validation no locks are taken for enqueuing or dequeuing tasks; only atomic memory is read or written. Using different pieces of atomic memory for enqueuing and for most dequeuing operations means the master and worker threads rarely touch the same memory during validation, keeping contention low. See 6e24aa8
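The separate-atomics idea can be sketched as follows. This is an illustration, not the PR’s code: the producer advances one counter, consumers advance another, and the two live on different cache lines so the common paths don’t ping-pong a line between cores:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Illustrative sketch: distinct atomics for enqueue and dequeue progress.
struct LockFreeCursor {
    alignas(64) std::atomic<size_t> added{0};    // written only by the master
    alignas(64) std::atomic<size_t> claimed{0};  // advanced by workers

    // Master publishes n newly constructed tasks (no lock taken).
    void Publish(size_t n) { added.fetch_add(n, std::memory_order_release); }

    // Worker claims the next task index, if one has been published.
    bool TryClaim(size_t& out) {
        size_t c = claimed.load(std::memory_order_relaxed);
        while (c < added.load(std::memory_order_acquire)) {
            if (claimed.compare_exchange_weak(c, c + 1)) {
                out = c;
                return true;
            }
            // compare_exchange_weak reloaded c on failure; retry.
        }
        return false;  // nothing published yet
    }
};
```

Workers only contend with each other on `claimed`; the master’s writes to `added` stay out of that contention entirely.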
3. Sleeping during validation:
In the original algorithm, while waiting for a task to be enqueued by the master, each thread sleeps. This means that the OS’s scheduler might choose to pre-empt the thread, thrashing its cache with another process’s memory. Furthermore, the latency of a wake-up is high, as it requires the operating system to wake the thread.
Solution:
The new algorithm sleeps only before and after block validation, and busy-spins waiting for tasks until the master joins and no more tasks are available (only then does it sleep). See 6e24aa8
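The spin-wait pattern looks roughly like the sketch below (flag names are made up for illustration; the actual worker loop in 6e24aa8 is integrated with the claim logic above):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Illustrative flags: in the real design this state lives in the queue.
std::atomic<bool> work_ready{false};
std::atomic<bool> round_done{false};

// A worker busy-spins on atomic state instead of blocking on a condition
// variable, so the OS never parks the thread mid-validation (keeping its
// cache warm and making "wake-up" latency a few cycles, not a syscall).
int SpinWorker() {
    int processed = 0;
    for (;;) {
        if (round_done.load(std::memory_order_acquire)) break;  // sleep only here
        if (work_ready.exchange(false)) ++processed;  // run one task
        // otherwise: loop again immediately (busy-spin, no OS sleep)
    }
    return processed;
}
```

The trade-off is burned CPU while idle, which is why spinning is confined to the window when a block is actually being validated.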
4. Arbitrary batch size selection:
The original batch size was chosen experimentally as a tuning for reasonable performance. If the batch size is too small, workers spend too much time fetching new tasks and not enough time running them. But any form of batching risks an imbalanced case where one worker holds a batch (say, of size five) while another worker has no tasks at all. To eliminate that imbalance, a batch size of one is optimal.
Solution:
The new algorithm uses a batch size of one, so that all threads finish their work at as close to the same time as possible. See 6e24aa8.
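The imbalance argument can be made concrete with a toy model (my own illustration, assuming every task costs one time unit): workers greedily grab batches, and the block finishes only when the slowest worker does, so leftover batches stretch the critical path.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model: returns the finishing time of the slowest worker when
// `tasks` unit-cost tasks are handed out in chunks of size `batch`.
int Makespan(int tasks, int workers, int batch) {
    std::vector<int> load(workers, 0);
    while (tasks > 0) {
        int take = std::min(batch, tasks);
        // The least-loaded worker grabs the next batch.
        auto it = std::min_element(load.begin(), load.end());
        *it += take;
        tasks -= take;
    }
    return *std::max_element(load.begin(), load.end());
}
```

For example, 6 tasks on 2 workers with a batch size of 5 finish at time 5 (one worker does 5, the other 1), while a batch size of 1 finishes at time 3. Batch size one is only affordable because, with the lock-free claim path, grabbing a single task is cheap.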
Reviewing this PR
Code Organization
In writing this PR, I used many small-step iterations to get to the final design. While I am fairly confident that the commits in this PR should be easily reviewable, I have left the unsquashed version available for anyone to reference: https://github.com/JeremyRubin/bitcoin/tree/PR-lockfree-checkqueue-unsquashed
Testing
#9497 introduces a fairly comprehensive set of tests, which these changes pass, as well as the earlier CheckQueue. See #9497 for details on the conditions tested.
Performance
Test results are from a 2011 MacBook Pro 13", 2.7 GHz Intel Core i7, 8 GB 1333 MHz DDR3.
MicroBenchmark Performance (2.9 to 5.7X faster)
#9498 introduced two new microbenchmarks for the CheckQueue. One which tests tasks which are empty, and another which tests tasks that do a little bit of work. These are microbenchmarks, so they need to be taken with a grain of salt.
#Before (7ff4a538a8682cdf02a4bcd6f15499c841001b73)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,896,0.001128047704697,0.001315370202065,0.001167648338846,3038807,3543440,3145483
CCheckQueueSpeedPrevectorJob,320,0.002763845026493,0.003494121134281,0.003255434334278,7445473,9415834,8770003

#After (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
CCheckQueueSpeed,5120,0.000198634807020,0.000226773321629,0.000202708039433,535092,610897,546067
CCheckQueueSpeedPrevectorJob,896,0.000990316271782,0.001982234418392,0.001142452071820,2667680,5339862,3077721
So we see that trivial jobs run about 5.7X faster, and more involved jobs about 2.9X faster.
Test Performance (10X faster)
I have observed a very nice improvement in the time it takes to run the checkqueue_tests.
#before (08e4e1ea89427a2594415d0b37011692a5109c39)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests 12.39s user 58.12s system 116% cpu 1:00.31 total
#after (6e24aa818be4b494fc1809a7ca3ee568e253deb6)
$ time ./test_bitcoin -t checkqueue_tests
./test/test_bitcoin -t checkqueue_tests 3.43s user 1.40s system 78% cpu 6.180 total
So we see about a 10x performance improvement here.
Cross Platform
It would be good to have reviewers comment with cross-platform performance testing, as some of the lock-free instructions may be slower on some platforms than others, and I don’t have access to a large enough swath of machines. If you’re reviewing performance, you may find the benchdiff.py tool I wrote useful: https://github.com/JeremyRubin/bitcoin/commit/14aa19a35cbf0cff742f36a2c9ca00af162918ee.
Real World/Simulated Numbers
I don’t have great real-world-node numbers yet for these changes, but I’ll wait for others (e.g., @morcos) to report back on those.
edit: I have run a simulation of a month of block validation on a 4-core machine and seen no difference in aggregate performance. I’ll poke around to see if the design can be tweaked a little for some advantage with fewer cores, but the more important simulations are for machines with >7 cores or multiple CPUs, where the contention relieved by this PR becomes more significant.
Notes
This builds on #9497 (and #9495). There were a couple of minor nits outstanding on those PRs, but I figure they can be addressed here later with a squashme (rather than having to squash those, get them re-reviewed, and then issue this PR). An earlier (and worse) attempt at a similar design can be seen for comparison at #8464.
Acknowledgements
Thanks to @morcos and @sdaftuar for supporting and helping in the development of this work, and to various others (@TheBlueMatt, @theuni, and more) for review and feedback at various stages of this project.