crypto: optimize SipHash Write() method with chunked processing

Raimo33 commented at 6:52 pm on September 6, 2025: contributor

Summary

The current default Write() implementation of Siphash uses a byte-by-byte approach to iterate the span. This results in significant overhead for large inputs due to repeated bounds checking and span manipulations, without any help from the compiler.

This PR aims at optimizing Siphash by replacing byte-by-byte processing in CSipHasher::Write() with an optimized chunked approach that processes data in 8-byte aligned blocks when possible.

Details

The new implementation is divided in 3 stages that process:

initial unaligned bytes to reach an 8-byte boundary
aligned 8-byte chunks directly using memcpy for efficiency
remaining bytes at the end

every change was thoroughly tested and benchmarked to avoid overfitting, but replicating is welcomed and encouraged.

Benchmarks

0taskset -c 1 ./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_old.csv --min-time=60000

Before:

ns/op	op/s	err%	total	benchmark
12,983,090.72	77.02	0.1%	66.00	`GCSFilterConstruct`

After:

ns/op	op/s	err%	total	benchmark
11,155,751.42	89.64	0.1%	65.99	`GCSFilterConstruct`

compared to master:

GCSFilterConstruct +16% faster

DrahtBot added the label Utils/log/libs on Sep 6, 2025

DrahtBot commented at 6:52 pm on September 6, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33325.

Reviews

See the guideline for information on the review process.

Type	Reviewers
Approach NACK	l0rinc

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

l0rinc commented at 7:55 pm on September 6, 2025: contributor

This seems like a good idea to me, I will review this in more detail soon.

To set your expectations, these fundamental changes can take a long time to review - and we need very good reasons to risk introducing a potential error and making the code more complicated for a speedup that the users might never notice.

I like that you’re explaining it in a lot of detail in the commit message, it’s not a trivial change! We may be able to use more C++20 alignment-specific methods there (e.g. like the std::assume_aligned we used in https://github.com/bitcoin/bitcoin/pull/31144/files#diff-f26d4597a7f5a5d4aa30053032d49427b402ef4fd7a1b3194cb75b2551d58ca4R44) - but I haven’t gone deep yet, I could be wrong here, I just glanced at it so far.

In the meantime, please check if there’s any way to split this into even smaller commits to make it even easier for reviewers, check if the current tests cover each branch (e.g. the big-endian ones), different architectures, etc.

But as far as I can tell this doesn’t affect most caching scenarios since we have dedicated siphash methods for those. To be sure, I can run some reindexes once the change is settled to make sure it doesn’t make it slower at least.

Raimo33 force-pushed on Sep 6, 2025

Raimo33 commented at 1:44 am on September 7, 2025: contributor

I just realized this was attempted in the past. merged in Bitcoin knots 19. restored to match Core’s implementation in knots 29. 🤔. #18014

Raimo33 commented at 2:15 am on September 7, 2025: contributor

We may be able to use more C++20 alignment-specific methods there (e.g. like the std::assume_aligned we used in

I tried with

0- uint64_t chunk;
1- std::memcpy(&chunk, ptr, 8);
2+ auto aligned_ptr = std::assume_aligned<8>(ptr);
3+ uint64_t chunk = *reinterpret_cast<const uint64_t *>(aligned_ptr);

but no luck. performance is identical. memcpy is extremely optimized.

Raimo33 force-pushed on Sep 7, 2025

refactor: replace span with pointer 7dde00f23d

optimize: handle unaligned bytes separately 720045094b

crypto: optimize SipHash Write() method with chunked processing

Replace byte-by-byte processing in CSipHasher::Write() with an optimized
chunked approach that processes data in 8-byte aligned blocks when possible.

The previous implementation processed input data one byte at a time using
data.front() and data.subspan(1), which resulted in significant overhead
for large inputs due to repeated bounds checking and span manipulations.

The new implementation:
- Handles initial unaligned bytes to reach an 8-byte boundary
- Processes aligned 8-byte chunks directly using memcpy for efficiency
- Handles remaining bytes at the end

This optimization provides substantial performance improvements across
Bitcoin operations that rely on SipHash:
- GCSFilterConstruct: 16% improvement
- WalletIsMineDescriptors: 11.6% improvement
- WalletIsMineMigratedDescriptors: 59.0% improvement

These improvements are particularly beneficial for wallet operations and block filter construction.

./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_old.csv --min-time=60000

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              214.55 |        4,660,983.40 |    0.2% |     65.89 | `AddrManSelect`
|       12,983,090.72 |               77.02 |    0.1% |     66.00 | `GCSFilterConstruct`
|              100.29 |        9,971,046.61 |    0.0% |     66.02 | `WalletIsMineDescriptors`
|              115.42 |        8,664,379.92 |    0.0% |     66.02 | `WalletIsMineMigratedDescriptors`

./bin/bench_bitcoin -filter="(WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect)" -output-csv=bench_new.csv --min-time=60000

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|              210.60 |        4,748,271.82 |    0.1% |     65.93 | `AddrManSelect`
|       11,155,751.42 |               89.64 |    0.1% |     65.99 | `GCSFilterConstruct`
|               89.87 |       11,126,702.73 |    0.0% |     66.01 | `WalletIsMineDescriptors`
|               72.67 |       13,761,145.85 |    0.0% |     66.01 | `WalletIsMineMigratedDescriptors`

5ba5198c45

Raimo33 force-pushed on Sep 17, 2025

Raimo33 commented at 5:28 pm on September 17, 2025: contributor

In the meantime, please check if there’s any way to split this into even smaller commits to make it even easier for reviewers

I’ve just split the PR into 3 commits, diff should now be simpler

luke-jr commented at 1:33 am on September 18, 2025: member

I just realized this was attempted in the past. merged in Bitcoin knots 19. restored to match Core’s implementation in knots 29. 🤔.

#18014 is still in Knots 29…

l0rinc commented at 1:46 am on September 18, 2025: contributor

@luke-jr where is this general writer needed? We already have separate specialized siphash implementation for the critical usecases…

l0rinc commented at 10:13 am on October 23, 2025: contributor

I have looked into this again.

I have printed the actual span sizes during runtime (IBD or reindex or reindex chainstate) and it seems to me most of the span sizes are tiny:

The benchmarks however use bigger values of different distribution (e.g. 16, 22, 23, 25, 32, 34). They do show some speedup for your mentioned cases:

0for commit in 36e40417de3fcaa776fb7bb8ebf3a23db4266572 7dde00f23d335fb5bf5835ebb811793b17b5b987 720045094b805f37bd4792f8ca49c09fd55ac256 5ba5198c45625486a5be3210709155abdc432b83; do \
1git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && git log -1 --pretty='%h %s' && \
2rm -rf build && \
3cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=RelWithDebInfo >/dev/null 2>&1 && \
4cmake --build build -j$(nproc) >/dev/null 2>&1 && \
5for i in 1 2; do \
6  build/bin/bench_bitcoin -filter='WalletIsMineMigratedDescriptors|WalletIsMineDescriptors|GCSFilterConstruct|AddrManSelect' -min-time=5000; \
7done; \
8done

36e40417de Merge bitcoin-core/gui#884: Fix compatibility with -debuglogfile command-line option

ns/op	op/s	err%	total	benchmark
70.33	14,218,742.61	0.5%	5.54	`AddrManSelect`
3,766,398.94	265.51	0.2%	5.35	`GCSFilterConstruct`
30.02	33,315,107.96	0.7%	5.48	`WalletIsMineDescriptors`
29.95	33,390,477.67	0.1%	5.51	`WalletIsMineMigratedDescriptors`

ns/op	op/s	err%	total	benchmark
70.39	14,205,833.32	0.5%	5.52	`AddrManSelect`
3,766,733.09	265.48	0.1%	5.35	`GCSFilterConstruct`
30.62	32,655,356.28	0.1%	5.29	`WalletIsMineDescriptors`
30.78	32,486,484.45	0.5%	5.30	`WalletIsMineMigratedDescriptors`

7dde00f23d refactor: replace span with pointer

ns/op	op/s	err%	total	benchmark
68.89	14,516,583.77	0.3%	5.50	`AddrManSelect`
3,737,364.51	267.57	0.1%	5.34	`GCSFilterConstruct`
30.63	32,650,170.36	0.1%	5.26	`WalletIsMineDescriptors`
30.31	32,994,438.71	0.1%	5.27	`WalletIsMineMigratedDescriptors`

ns/op	op/s	err%	total	benchmark
68.59	14,579,049.65	0.5%	5.52	`AddrManSelect`
3,732,675.34	267.90	0.1%	5.34	`GCSFilterConstruct`
29.68	33,689,750.87	0.3%	5.50	`WalletIsMineDescriptors`
29.74	33,625,670.63	0.1%	5.52	`WalletIsMineMigratedDescriptors`

720045094b optimize: handle unaligned bytes separately

ns/op	op/s	err%	total	benchmark
69.58	14,372,223.10	0.7%	5.52	`AddrManSelect`
3,748,722.62	266.76	0.1%	5.34	`GCSFilterConstruct`
30.05	33,278,839.89	0.2%	5.25	`WalletIsMineDescriptors`
30.25	33,062,965.52	0.1%	5.51	`WalletIsMineMigratedDescriptors`

ns/op	op/s	err%	total	benchmark
69.59	14,370,240.65	0.8%	5.54	`AddrManSelect`
3,747,784.27	266.82	0.1%	5.35	`GCSFilterConstruct`
30.94	32,322,539.81	0.1%	5.28	`WalletIsMineDescriptors`
30.21	33,099,540.82	0.1%	5.50	`WalletIsMineMigratedDescriptors`

5ba5198c45 crypto: optimize SipHash Write() method with chunked processing

ns/op	op/s	err%	total	benchmark
69.11	14,468,871.77	0.1%	5.52	`AddrManSelect`
3,157,427.53	316.71	0.2%	5.29	`GCSFilterConstruct`
23.14	43,223,027.78	0.1%	5.52	`WalletIsMineDescriptors`
23.11	43,276,911.75	0.2%	5.51	`WalletIsMineMigratedDescriptors`

ns/op	op/s	err%	total	benchmark
68.83	14,528,112.71	1.0%	5.55	`AddrManSelect`
3,154,843.34	316.97	0.2%	5.26	`GCSFilterConstruct`
23.08	43,320,920.01	0.2%	5.51	`WalletIsMineDescriptors`

But then again, they’re not run that often and run with tiny values and are not performance critical code and the resulting code is a lot more complicated:

Benchmark	μs saved/op	Speedup
GCSFilterConstruct	610.38	+19.3%
WalletIsMineMigratedDescriptors	0.00725	+31.4%
WalletIsMineDescriptors	0.00721	+31.2%
AddrManSelect	0.00139	+2.0%

So having this considerable complication (e.g. where did the big-endian condition come from in your code, how did you check that, where is it explained, why not use spans, why are pointers better, etc) for an artificial speedup of 0.6ms for data that isn’t representative, that’s an Approach NACK from me, while keeping the Concept ACK if you can simplify it considerably.

Raimo33 commented at 8:58 am on October 24, 2025: contributor

Thank you for the realistic benchmarks, you’re right, the diff is currently too complex. Marking as draft for now.

Raimo33 marked this as a draft on Oct 24, 2025

Raimo33 closed this on Oct 24, 2025

Raimo33 deleted the branch on Oct 24, 2025

crypto: optimize SipHash Write() method with chunked processing #33325

Summary

Details

Benchmarks

Code Coverage & Benchmarks

Reviews