checkqueue: implement a new scriptcheck worker pool with atomic variables #32791

pull HowHsu wants to merge 7 commits into bitcoin:master from HowHsu:checkqueue_atomic changing 10 files +275 −20
  1. HowHsu commented at 12:58 pm on June 21, 2025: none
    TL;DR
    This patchset proposes a new design of scriptcheck worker pool with
    atomic variables to leverage more cpu resource on a machine which the
    current mutex version cannot. Achieve about 45% boost in performance.
    
    Main Idea:
    
    The current version of worker pool use mutex to do the synchronization
    between worker threads, which makes the performance downgrade after
    worker_threads_num >= 14. One root cause is the high contention on the
    queue's mutex among worker threads. Workers repeatedly sleep and be woken
    up in short time, leading to heavy context switching.
    The idea is to use atomic variables to protect the queue, workers
    acquire it and increase it with a batch_size(a small number). This
    significantly reduces context switching.
    
    Implementation
    
    To implement it, below changes are made:(enabled in atomic mode)
    - atomic<int> nNext is used to indicate the next available indice in the
      queue
    - atomic<int> nTodo2 is used to indicate the number of tasks left
    - nBatchSize is the batch size of tasks everytime a worker consumes
      it should be small to keep workload balance among workers
    - m_mutex is still used, to protect m_result and conditional vars
    - fAtomic to indicate which mode is enabled
    - about the queue:
    
      - the queue is used as queue, not stacks any more
      - the queue is 'add only', which means only adding tasks to it, don't
        remove used tasks.
    
        the above two are a must to make this idea achievable, and the
        second sacrifices space
    
      - start the workers only after all the tasks have been added to the
        queue
    
        this is a trade-off to make the implementaion simple, a worker can
        compare nNext + nBatchSize to queue.size() to find if there are
        tasks undone in the queue, otherwise another atomic<int> tail is
        needed, and race condition may exist in this way (need more
        investigation). From my test, it's not 100% to say this is a
        'trade-off' since it also brings something good, pros && cons:
        - pros:
            1. no contention among producer and consumers when adding tasks
               to the queue
            2. the adding period is very short compared to the verification
               period
        - cons:
            1. the workers start later
    
    For details, please refer to the worker handler code
    
    Tests:
    
    - Numbers
      - For blocks/s numbers
        `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
      - For cpu usage numbers
        `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`
    
      - Some data is missing because the machine was billed by time, and there wasn't
        enough time to complete all measurements.
      - Some cells contain a range instead of a single value due to high fluctuations
      - The numbers represent the average of multiple test runs.
    
               |     mutex verion       |   atomic version
    nr_workers | blocks/s  | cpu usage  | blocks/s  | cpu usage2
    -----------|-----------|------------|-----------|-------------
    0          | 4.22      | 100%       | 4.2       | 100%
    1          | 8.23      | 198%       | 8         | 191.3%
    2          | 12.08     | 293.7%     | 11.6      | 276%
    3          | 16        | 386.7%     | 14.9      | 354.5%
    4          | 19.8      | 476.3%     |           |
    5          | 23.12     | 565.4%     |           |
    6          | 27.3      | 654%       | 23        | 560%
    7          | 28.5      | 738%       |           |
    8          | 33.8      | 809%       |           |
    9          | 33.5      | 868%       | 28.5      | 730%
    10         | 37.4      | 910%       | 30.5      | 777%
    11         | 33~40     | 880%~928%  | 32        | 825%
    12         | 22~39     | 860%~925%  | 32.5      | 830%
    13         | 27.8~37   | 700%~913%  | 34.8      | 928%
    15         | 24.8~38.9 | 640%~719%  | 38        | 1000%
    19         | 18~23     | 370%~510%  | 41        | 1161%
    22         | 7~18      | 270%~400%  |           |
    25         | 6.4       | 260%       | 46        | 1400%
    30         | 5         | 270%       | 49        | 1560%
    60         | 5         | 280%       | 50~58     | 2100%
    
    - Env:
     A virtual machine on cloud
       Architecture:          x86_64
       CPU op-mode(s):        32-bit, 64-bit
       Address sizes:         46 bits physical, 48 bits virtual
       Byte Order:            Little Endian
       CPU(s):                64
       On-line CPU(s) list:   0-63
       Vendor ID:             GenuineIntel
       Model name:            Intel Xeon Processor (Cascadelake)
       CPU family:            6
       Model:                 85
       Thread(s) per core:    2
       Core(s) per socket:    16
       Socket(s):             2
       Stepping:              6
       BogoMIPS:              5985.93
     Virtualization features:
       Hypervisor vendor:     KVM
       Virtualization type:   full
     Caches (sum of all):
       L1d:                   2 MiB (64 instances)
       L1i:                   2 MiB (64 instances)
       L2:                    128 MiB (32 instances)
       L3:                    32 MiB (2 instances)
     NUMA:
       NUMA node(s):          2
       NUMA node0 CPU(s):     0-31
       NUMA node1 CPU(s):     32-63
    
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    
  2. DrahtBot commented at 12:59 pm on June 21, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32791.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32699 (docs: adds correct updated documentation links by Zeegaths)
    • #32692 (checkqueue: set MAX_SCRIPTCHECK_THREADS to nCores - 1 by HowHsu)
    • #31132 (validation: fetch block inputs on parallel threads 10% faster IBD by andrewtoth)
    • #28690 (build: Introduce internal kernel library by TheCharlatan)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label CI failed on Jun 21, 2025
  4. DrahtBot commented at 2:21 pm on June 21, 2025: contributor

    🚧 At least one of the CI tasks failed. Task no wallet, libbitcoinkernel: https://github.com/bitcoin/bitcoin/runs/44527955551 LLM reason (✨ experimental): The CI failed due to a linker error: undefined reference to GetNumCores(), indicating a missing implementation or linkage of this function.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  5. Muaytie666 commented at 7:49 am on June 22, 2025: none
    98b1db9283b5c04775309fa5dc2161fc597b355a
  6. HowHsu force-pushed on Jun 22, 2025
  7. HowHsu commented at 3:16 pm on June 22, 2025: none
    hmmm…Ok, I realized that this may only be suitable for when all the preout Coin are in CCoinsViewCache, otherwise the Add() is not fast due to I/O there.
  8. HowHsu force-pushed on Jun 23, 2025
  9. HowHsu force-pushed on Jun 23, 2025
  10. HowHsu force-pushed on Jun 23, 2025
  11. HowHsu force-pushed on Jun 23, 2025
  12. HowHsu force-pushed on Jun 23, 2025
  13. HowHsu force-pushed on Jun 24, 2025
  14. HowHsu force-pushed on Jun 24, 2025
  15. HowHsu force-pushed on Jun 25, 2025
  16. HowHsu force-pushed on Jun 25, 2025
  17. HowHsu force-pushed on Jun 26, 2025
  18. HowHsu commented at 3:28 pm on June 26, 2025: none

    Hi folks, any hints on how to see node logs of failed CI tests, It’s not easy to create all the environements including OS and libs and to run it locally:

    [11:09:03.884]  test 2025-06-26T15:09:03.311909Z TestFramework (ERROR): Test failed. Test logging available at /ci_container_base/ci/scratch/test_runner/test_runner_₿🏃20250626_150635/wallet_fundrawtransaction_252/test_framework.log  [11:09:03.884]  test 2025-06-26T15:09:03.312354Z TestFramework (ERROR):  [11:09:03.884]  test 2025-06-26T15:09:03.312746Z TestFramework (ERROR): Hint: Call /ci_container_base/test/functional/combine_logs.py ‘/ci_container_base/ci/scratch/test_runner/test_runner🏃_20250626_150635/wallet_fundrawtransaction_252’ to consolidate all logs  [11:09:03.884]  test 2025-06-26T15:09:03.312954Z TestFramework (ERROR):  [11:09:03.884]  test 2025-06-26T15:09:03.313070Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.  [11:09:03.884]  test 2025-06-26T15:09:03.313299Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues  [11:09:03.884]  test 2025-06-26T15:09:03.313452Z TestFramework (ERROR):  [11:09:03.884] [11:09:03.884]  node1 stderr libc++abi: terminating  [11:09:03.884] [11:09:03.884] TEST | STATUS | DURATION [11:09:03.884] [11:09:03.884] mempool_ephemeral_dust.py | ✓ Passed | 76 s [11:09:03.884] mempool_persist.py | ✓ Passed | 25 s [11:09:03.884] mining_getblocktemplate_longpoll.py | ✓ Passed | 69 s [11:09:03.884] p2p_opportunistic_1p1c.py | ✓ Passed | 140 s [11:09:03.884] p2p_segwit.py | ✓ Passed | 113 s [11:09:03.884] wallet_conflicts.py | ✓ Passed | 63 s [11:09:03.884] wallet_fundrawtransaction.py | ✖ Failed | 2 s [11:09:03.884]  [11:09:03.884] ALL | ✖ Failed | 488 s (accumulated) [11:09:03.884] Runtime: 143 s

  19. maflcko commented at 3:42 pm on June 26, 2025: member

    Hi folks, any hints on how to see node logs of failed CI tests, It’s not easy to create all the environements including OS and libs and to run it locally:

    You can follow the " View more details on Cirrus CI " link: https://cirrus-ci.com/task/6753181050339328

    It says system error, so you are likely launching enough threads to run into a system limit.

  20. maflcko commented at 3:43 pm on June 26, 2025: member
    For the CI docs, see https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts, but you’ll likely have to run that on a 64 core system to recreate the system error
  21. HowHsu commented at 4:29 pm on June 26, 2025: none

    For the CI docs, see https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts, but you’ll likely have to run that on a 64 core system to recreate the system error

    Thanks, maflcko. The thing is this patchset is to extend numbers of scriptcheck threads to nCores-1, so I guess it is bound to fail the CI tests. What should I do now.

  22. HowHsu commented at 4:38 pm on June 26, 2025: none

    Hint: Call /ci_container_base/test/functional/combine_logs.py ‘/ci_container_base/ci/scratch/test_runner/test_runner_₿_🏃_20250626_150635/wallet_fundrawtransaction_252’

    In my local run, I can do something like this to see the bitcoind log (node log as I said), but how can I see the bitcoind log on Cirrus CI, seems no way to do that since that is not exposed.

  23. maflcko commented at 4:43 pm on June 26, 2025: member

    is not exposed.

    it is: See " View more details on Cirrus CI " on the checks page

  24. HowHsu requested review from Muaytie666 on Jun 26, 2025
  25. HowHsu commented at 4:53 pm on June 26, 2025: none

    is not exposed.

    it is: See " View more details on Cirrus CI " on the checks page

    I’m so sorry, there is node logs, I missed them, thanks again.

  26. HowHsu commented at 5:00 pm on June 26, 2025: none
    From the CI log, seems too much threads causes failure, I’ll temporarily set a smaller number limitation of scriptcheck workers to sort of confirm it.
  27. HowHsu force-pushed on Jun 27, 2025
  28. HowHsu force-pushed on Jun 27, 2025
  29. HowHsu force-pushed on Jun 28, 2025
  30. DrahtBot removed the label CI failed on Jun 28, 2025
  31. HowHsu force-pushed on Jun 28, 2025
  32. HowHsu force-pushed on Jun 28, 2025
  33. HowHsu force-pushed on Jun 28, 2025
  34. HowHsu force-pushed on Jun 28, 2025
  35. rest: fetch spent transaction outputs by blockhash
    Today, it is possible to fetch a block's spent prevouts in order to
    build an external index by using the `/rest/block/HASH.json` endpoint.
    However, its performance is low due to JSON serialization overhead.
    
    We can significantly optimize it by adding a new REST endpoint, using
    a binary response format:
    
    ```
    $ BLOCKHASH=00000000000000000002a7c4c1e48d76c5a37902165a270156b7a8d72728a054
    
    $ ab -k -c 1 -n 100 http://localhost:8332/rest/block/$BLOCKHASH.json
    Document Length:        13278152 bytes
    Requests per second:    3.53 [#/sec] (mean)
    Time per request:       283.569 [ms] (mean)
    
    $ ab -k -c 1 -n 10000 http://localhost:8332/rest/spentoutputs/$BLOCKHASH.bin
    Document Length:        195591 bytes
    Requests per second:    254.47 [#/sec] (mean)
    Time per request:       3.930 [ms] (mean)
    ```
    
    Currently, this PR is being used and tested by Bindex:
    
     * https://github.com/romanz/bindex-rs
    
    This PR would allow to improve the performance of external indexers
    such as electrs, ElectrumX, Fulcrum and Blockbook:
    
     * https://github.com/romanz/electrs (also https://github.com/Blockstream/electrs and https://github.com/mempool/electrs)
     * https://github.com/spesmilo/electrumx
     * https://github.com/cculianu/Fulcrum
     * https://github.com/trezor/blockbook
    2171ee9427
  36. doc: add release notes for #32540 f7690a7c0a
  37. checkqueue: implement a new scriptcheck worker pool with atomic variables
    TL;DR
    This patchset proposes a new design of scriptcheck worker pool with
    atomic variables to leverage more cpu resource on a machine which the
    current mutex version cannot. Achieve about 45% boost in performance.
    
    Main Idea:
    
    The current version of worker pool use mutex to do the synchronization
    between worker threads, which makes the performance downgrade after
    worker_threads_num >= 14. One root cause is the high contention on the
    queue's mutex among worker threads. Workers repeatedly sleep and be woken
    up in short time, leading to heavy context switching.
    The idea is to use atomic variables to protect the queue, workers
    acquire it and increase it with a batch_size(a small number). This
    significantly reduces context switching.
    
    Implementation
    
    To implement it, below changes are made:(enabled in atomic mode)
    - atomic<int> nNext is used to indicate the next available indice in the
      queue
    - atomic<int> nTodo2 is used to indicate the number of tasks left
    - nBatchSize is the batch size of tasks everytime a worker consumes
      it should be small to keep workload balance among workers
    - m_mutex is still used, to protect m_result and conditional vars
    - fAtomic to indicate which mode is enabled
    - about the queue:
    
      - the queue is used as queue, not stacks any more
      - the queue is 'add only', which means only adding tasks to it, don't
        remove used tasks.
    
        the above two are a must to make this idea achievable, and the
        second sacrifices space
    
      - start the workers only after all the tasks have been added to the
        queue
    
        this is a trade-off to make the implementaion simple, a worker can
        compare nNext + nBatchSize to queue.size() to find if there are
        tasks undone in the queue, otherwise another atomic<int> tail is
        needed, and race condition may exist in this way (need more
        investigation). From my test, it's not 100% to say this is a
        'trade-off' since it also brings something good, pros && cons:
        - pros:
            1. no contention among producer and consumers when adding tasks
               to the queue
            2. the adding period is very short compared to the verification
               period
        - cons:
            1. the workers start later
    
    For details, please refer to the worker handler code
    
    Tests:
    
    - Numbers
      - For blocks/s numbers
        `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
      - For cpu usage numbers
        `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`
    
      - Some data is missing because the machine was billed by time, and there wasn't
        enough time to complete all measurements.
      - Some cells contain a range instead of a single value due to high fluctuations
      - The numbers represent the average of multiple test runs.
    
               |     mutex verion       |   atomic version
    nr_workers | blocks/s  | cpu usage  | blocks/s  | cpu usage2
    -----------|-----------|------------|-----------|-------------
    0          | 4.22      | 100%       | 4.2       | 100%
    1          | 8.23      | 198%       | 8         | 191.3%
    2          | 12.08     | 293.7%     | 11.6      | 276%
    3          | 16        | 386.7%     | 14.9      | 354.5%
    4          | 19.8      | 476.3%     |           |
    5          | 23.12     | 565.4%     |           |
    6          | 27.3      | 654%       | 23        | 560%
    7          | 28.5      | 738%       |           |
    8          | 33.8      | 809%       |           |
    9          | 33.5      | 868%       | 28.5      | 730%
    10         | 37.4      | 910%       | 30.5      | 777%
    11         | 33~40     | 880%~928%  | 32        | 825%
    12         | 22~39     | 860%~925%  | 32.5      | 830%
    13         | 27.8~37   | 700%~913%  | 34.8      | 928%
    15         | 24.8~38.9 | 640%~719%  | 38        | 1000%
    19         | 18~23     | 370%~510%  | 41        | 1161%
    22         | 7~18      | 270%~400%  |           |
    25         | 6.4       | 260%       | 46        | 1400%
    30         | 5         | 270%       | 49        | 1560%
    60         | 5         | 280%       | 50~58     | 2100%
    
    - Env:
     A virtual machine on cloud
       Architecture:          x86_64
       CPU op-mode(s):        32-bit, 64-bit
       Address sizes:         46 bits physical, 48 bits virtual
       Byte Order:            Little Endian
       CPU(s):                64
       On-line CPU(s) list:   0-63
       Vendor ID:             GenuineIntel
       Model name:            Intel Xeon Processor (Cascadelake)
       CPU family:            6
       Model:                 85
       Thread(s) per core:    2
       Core(s) per socket:    16
       Socket(s):             2
       Stepping:              6
       BogoMIPS:              5985.93
     Virtualization features:
       Hypervisor vendor:     KVM
       Virtualization type:   full
     Caches (sum of all):
       L1d:                   2 MiB (64 instances)
       L1i:                   2 MiB (64 instances)
       L2:                    128 MiB (32 instances)
       L3:                    32 MiB (2 instances)
     NUMA:
       NUMA node(s):          2
       NUMA node0 CPU(s):     0-31
       NUMA node1 CPU(s):     32-63
    
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    ed58de0bf7
  38. checkqueue: mark and quit fast when error result is raised
    Use a atomic<bool> fResultDone to indicate the result is out to be
    something error, thus all workers can quit immediately. This is to
    boost the performance in script verification error case.
    
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    03303b74b7
  39. checkqueue: set max number of scriptcheck threads to GetNumCores()
    The current max number of scriptcheck threads is MAX_SCRIPTCHECK_THREADS = 14
    due to the limitation of the current mechanism, release it to GetNumCores() as
    we have atomic mode of scriptcheck worker pool.
    
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    128d88db22
  40. checkqueue: fix: add common/system.cpp to libbitcoinkernel
    GetNumCores() is now used in checkqueue related code, thus it has to be
    added to libbitcoinkernel to keep everything ok
    
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    50e8013d12
  41. tmp: this is to test the CI failure
    Signed-off-by: Hao Xu <hao.xu@linux.dev>
    8ed42044f8
  42. HowHsu force-pushed on Jun 28, 2025
  43. HowHsu commented at 1:25 pm on June 29, 2025: none

    hmmm…Ok, I realized that this may only be suitable for when all the preout Coin are in CCoinsViewCache, otherwise the Add() is not fast due to I/O there.

    With #31132 , this one makes sense again, need to test them together in all-cache-miss case

  44. DrahtBot added the label Needs rebase on Jun 30, 2025
  45. DrahtBot commented at 1:56 pm on June 30, 2025: contributor
    🐙 This pull request conflicts with the target branch and needs rebase.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-07-02 03:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me