crypto: optimize SipHash Write() method with chunked processing #33325

pull Raimo33 wants to merge 3 commits into bitcoin:master from Raimo33:siphash-write-chunked changing 1 files +45 −9
  1. Raimo33 commented at 6:52 pm on September 6, 2025: contributor

    Summary

    The current default Write() implementation of Siphash uses a byte-by-byte approach to iterate the span. This results in significant overhead for large inputs due to repeated bounds checking and span manipulations, without any help from the compiler.

    This PR aims at optimizing Siphash by replacing byte-by-byte processing in CSipHasher::Write() with an optimized chunked approach that processes data in 8-byte aligned blocks when possible.

    Details

    The new implementation is divided in 3 stages that process:

    1. initial unaligned bytes to reach an 8-byte boundary
    2. aligned 8-byte chunks directly using memcpy for efficiency
    3. remaining bytes at the end

    every change was thoroughly tested and benchmarked to avoid overfitting, but replicating is welcomed and encouraged.

    Benchmarks

    0taskset -c 1 ./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_old.csv --min-time=60000
    

    Before:

    ns/op op/s err% total benchmark
    12,983,090.72 77.02 0.1% 66.00 GCSFilterConstruct

    After:

    ns/op op/s err% total benchmark
    11,155,751.42 89.64 0.1% 65.99 GCSFilterConstruct

    compared to master:

    • GCSFilterConstruct +16% faster
  2. DrahtBot added the label Utils/log/libs on Sep 6, 2025
  3. DrahtBot commented at 6:52 pm on September 6, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33325.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Approach NACK l0rinc

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. l0rinc commented at 7:55 pm on September 6, 2025: contributor

    This seems like a good idea to me, I will review this in more detail soon.

    To set your expectations, these fundamental changes can take a long time to review - and we need very good reasons to risk introducing a potential error and making the code more complicated for a speedup that the users might never notice.

    I like that you’re explaining it in a lot of detail in the commit message, it’s not a trivial change! We may be able to use more C++20 alignment-specific methods there (e.g. like the std::assume_aligned we used in https://github.com/bitcoin/bitcoin/pull/31144/files#diff-f26d4597a7f5a5d4aa30053032d49427b402ef4fd7a1b3194cb75b2551d58ca4R44) - but I haven’t gone deep yet, I could be wrong here, I just glanced at it so far.

    In the meantime, please check if there’s any way to split this into even smaller commits to make it even easier for reviewers, check if the current tests cover each branch (e.g. the big-endian ones), different architectures, etc.

    But as far as I can tell this doesn’t affect most caching scenarios since we have dedicated siphash methods for those. To be sure, I can run some reindexes once the change is settled to make sure it doesn’t make it slower at least.

  5. Raimo33 force-pushed on Sep 6, 2025
  6. Raimo33 commented at 1:44 am on September 7, 2025: contributor
    I just realized this was attempted in the past. merged in Bitcoin knots 19. restored to match Core’s implementation in knots 29. 🤔. #18014
  7. Raimo33 commented at 2:15 am on September 7, 2025: contributor

    We may be able to use more C++20 alignment-specific methods there (e.g. like the std::assume_aligned we used in

    I tried with

    0- uint64_t chunk;
    1- std::memcpy(&chunk, ptr, 8);
    2+ auto aligned_ptr = std::assume_aligned<8>(ptr);
    3+ uint64_t chunk = *reinterpret_cast<const uint64_t *>(aligned_ptr);
    

    but no luck. performance is identical. memcpy is extremely optimized.

  8. Raimo33 force-pushed on Sep 7, 2025
  9. refactor: replace span with pointer 7dde00f23d
  10. optimize: handle unaligned bytes separately 720045094b
  11. crypto: optimize SipHash Write() method with chunked processing
    Replace byte-by-byte processing in CSipHasher::Write() with an optimized
    chunked approach that processes data in 8-byte aligned blocks when possible.
    
    The previous implementation processed input data one byte at a time using
    data.front() and data.subspan(1), which resulted in significant overhead
    for large inputs due to repeated bounds checking and span manipulations.
    
    The new implementation:
    - Handles initial unaligned bytes to reach an 8-byte boundary
    - Processes aligned 8-byte chunks directly using memcpy for efficiency
    - Handles remaining bytes at the end
    
    This optimization provides substantial performance improvements across
    Bitcoin operations that rely on SipHash:
    - GCSFilterConstruct: 16% improvement
    - WalletIsMineDescriptors: 11.6% improvement
    - WalletIsMineMigratedDescriptors: 59.0% improvement
    
    These improvements are particularly beneficial for wallet operations and block filter construction.
    
    ./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_old.csv --min-time=60000
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |              214.55 |        4,660,983.40 |    0.2% |     65.89 | `AddrManSelect`
    |       12,983,090.72 |               77.02 |    0.1% |     66.00 | `GCSFilterConstruct`
    |              100.29 |        9,971,046.61 |    0.0% |     66.02 | `WalletIsMineDescriptors`
    |              115.42 |        8,664,379.92 |    0.0% |     66.02 | `WalletIsMineMigratedDescriptors`
    
    ./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_new.csv --min-time=60000
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |              210.60 |        4,748,271.82 |    0.1% |     65.93 | `AddrManSelect`
    |       11,155,751.42 |               89.64 |    0.1% |     65.99 | `GCSFilterConstruct`
    |               89.87 |       11,126,702.73 |    0.0% |     66.01 | `WalletIsMineDescriptors`
    |               72.67 |       13,761,145.85 |    0.0% |     66.01 | `WalletIsMineMigratedDescriptors`
    5ba5198c45
  12. Raimo33 force-pushed on Sep 17, 2025
  13. Raimo33 commented at 5:28 pm on September 17, 2025: contributor

    In the meantime, please check if there’s any way to split this into even smaller commits to make it even easier for reviewers

    I’ve just split the PR into 3 commits, diff should now be simpler

  14. luke-jr commented at 1:33 am on September 18, 2025: member

    I just realized this was attempted in the past. merged in Bitcoin knots 19. restored to match Core’s implementation in knots 29. 🤔.

    #18014 is still in Knots 29…

  15. l0rinc commented at 1:46 am on September 18, 2025: contributor
    @luke-jr where is this general writer needed? We already have separate specialized siphash implementation for the critical usecases…
  16. l0rinc commented at 10:13 am on October 23, 2025: contributor

    I have looked into this again.

    I have printed the actual span sizes during runtime (IBD or reindex or reindex chainstate) and it seems to me most of the span sizes are tiny:

    The benchmarks however use bigger values of different distribution (e.g. 16, 22, 23, 25, 32, 34). They do show some speedup for your mentioned cases:

    0for commit in 36e40417de3fcaa776fb7bb8ebf3a23db4266572 7dde00f23d335fb5bf5835ebb811793b17b5b987 720045094b805f37bd4792f8ca49c09fd55ac256 5ba5198c45625486a5be3210709155abdc432b83; do \
    1git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
    2rm -rf build && \
    3cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo >/dev/null 2>&1 && \
    4cmake --build build -j$(nproc) >/dev/null 2>&1 && \
    5for i in 1 2; do \
    6  build/bin/bench_bitcoin -filter='WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect' -min-time=5000; \
    7done; \
    8done
    

    36e40417de Merge bitcoin-core/gui#884: Fix compatibility with -debuglogfile command-line option

    ns/op op/s err% total benchmark
    70.33 14,218,742.61 0.5% 5.54 AddrManSelect
    3,766,398.94 265.51 0.2% 5.35 GCSFilterConstruct
    30.02 33,315,107.96 0.7% 5.48 WalletIsMineDescriptors
    29.95 33,390,477.67 0.1% 5.51 WalletIsMineMigratedDescriptors
    ns/op op/s err% total benchmark
    70.39 14,205,833.32 0.5% 5.52 AddrManSelect
    3,766,733.09 265.48 0.1% 5.35 GCSFilterConstruct
    30.62 32,655,356.28 0.1% 5.29 WalletIsMineDescriptors
    30.78 32,486,484.45 0.5% 5.30 WalletIsMineMigratedDescriptors

    7dde00f23d refactor: replace span with pointer

    ns/op op/s err% total benchmark
    68.89 14,516,583.77 0.3% 5.50 AddrManSelect
    3,737,364.51 267.57 0.1% 5.34 GCSFilterConstruct
    30.63 32,650,170.36 0.1% 5.26 WalletIsMineDescriptors
    30.31 32,994,438.71 0.1% 5.27 WalletIsMineMigratedDescriptors
    ns/op op/s err% total benchmark
    68.59 14,579,049.65 0.5% 5.52 AddrManSelect
    3,732,675.34 267.90 0.1% 5.34 GCSFilterConstruct
    29.68 33,689,750.87 0.3% 5.50 WalletIsMineDescriptors
    29.74 33,625,670.63 0.1% 5.52 WalletIsMineMigratedDescriptors

    720045094b optimize: handle unaligned bytes separately

    ns/op op/s err% total benchmark
    69.58 14,372,223.10 0.7% 5.52 AddrManSelect
    3,748,722.62 266.76 0.1% 5.34 GCSFilterConstruct
    30.05 33,278,839.89 0.2% 5.25 WalletIsMineDescriptors
    30.25 33,062,965.52 0.1% 5.51 WalletIsMineMigratedDescriptors
    ns/op op/s err% total benchmark
    69.59 14,370,240.65 0.8% 5.54 AddrManSelect
    3,747,784.27 266.82 0.1% 5.35 GCSFilterConstruct
    30.94 32,322,539.81 0.1% 5.28 WalletIsMineDescriptors
    30.21 33,099,540.82 0.1% 5.50 WalletIsMineMigratedDescriptors

    5ba5198c45 crypto: optimize SipHash Write() method with chunked processing

    ns/op op/s err% total benchmark
    69.11 14,468,871.77 0.1% 5.52 AddrManSelect
    3,157,427.53 316.71 0.2% 5.29 GCSFilterConstruct
    23.14 43,223,027.78 0.1% 5.52 WalletIsMineDescriptors
    23.11 43,276,911.75 0.2% 5.51 WalletIsMineMigratedDescriptors
    ns/op op/s err% total benchmark
    68.83 14,528,112.71 1.0% 5.55 AddrManSelect
    3,154,843.34 316.97 0.2% 5.26 GCSFilterConstruct
    23.08 43,320,920.01 0.2% 5.51 WalletIsMineDescriptors

    But then again, they’re not run that often and run with tiny values and are not performance critical code and the resulting code is a lot more complicated:

    Benchmark μs saved/op Speedup
    GCSFilterConstruct 610.38 +19.3%
    WalletIsMineMigratedDescriptors 0.00725 +31.4%
    WalletIsMineDescriptors 0.00721 +31.2%
    AddrManSelect 0.00139 +2.0%

    So having this considerable complication (e.g. where did the big-endian condition come from in your code, how did you check that, where is it explained, why not use spans, why are pointers better, etc) for an artificial speedup of 0.6ms for data that isn’t representative, that’s an Approach NACK from me, while keeping the Concept ACK if you can simplify it considerably.

  17. Raimo33 commented at 8:58 am on October 24, 2025: contributor
    Thank you for the realistic benchmarks, you’re right, the diff is currently too complex. Marking as draft for now.
  18. Raimo33 marked this as a draft on Oct 24, 2025
  19. Raimo33 closed this on Oct 24, 2025

  20. Raimo33 deleted the branch on Oct 24, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-10-31 06:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me