Reopening #33325 as a draft.
Summary
The current default Write() implementation of SipHash iterates over the span byte by byte. For large inputs this adds significant overhead from repeated bounds checks and span manipulation, which the compiler does not optimize away.
This PR optimizes SipHash by replacing the byte-by-byte processing in CSipHasher::Write() with a chunked approach that processes data in 8-byte aligned blocks when possible.
Details
The new implementation is divided into three stages, which process:
- initial unaligned bytes to reach an 8-byte boundary
- aligned 8-byte chunks directly using memcpy for efficiency
- remaining bytes at the end
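The three stages above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `WordSink`, `Compress()`, `WriteBytewise()` and `WriteChunked()` are hypothetical names, `Compress()` only records each completed 64-bit word instead of running the SipHash rounds, and "8-byte boundary" is assumed to mean the hasher's internal byte count rather than the memory address of the input. The `memcpy` load also assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the hasher state: it records every completed
// 64-bit word instead of mixing it into the SipHash state.
struct WordSink {
    std::vector<uint64_t> words; // completed 8-byte words, in input order
    uint64_t tmp = 0;            // partially filled word
    size_t count = 0;            // total number of bytes written so far

    void Compress(uint64_t w) { words.push_back(w); }

    // Baseline: consume one byte per iteration.
    void WriteBytewise(const unsigned char* data, size_t len) {
        for (size_t i = 0; i < len; ++i) {
            tmp |= uint64_t{data[i]} << (8 * (count % 8));
            if (++count % 8 == 0) { Compress(tmp); tmp = 0; }
        }
    }

    // Chunked variant following the three stages described above.
    void WriteChunked(const unsigned char* data, size_t len) {
        assert(data != nullptr || len == 0);

        // Stage 1: finish a partially filled word left by an earlier call.
        while (len > 0 && count % 8 != 0) {
            tmp |= uint64_t{*data} << (8 * (count % 8));
            ++data;
            --len;
            if (++count % 8 == 0) { Compress(tmp); tmp = 0; }
        }

        // Stage 2: consume whole 8-byte chunks with a single load each.
        // Assumption: little-endian host, so memcpy yields the same word
        // the byte-wise loop would assemble.
        while (len >= 8) {
            uint64_t w;
            std::memcpy(&w, data, 8);
            Compress(w);
            data += 8;
            len -= 8;
            count += 8;
        }

        // Stage 3: buffer the trailing bytes (fewer than 8) for the next call.
        for (size_t i = 0; i < len; ++i) {
            tmp |= uint64_t{data[i]} << (8 * (count % 8));
            ++count;
        }
    }
};
```

Feeding both variants the same input split at arbitrary offsets should leave them with identical `words`, `tmp` and `count`, which is the property an equivalence test for the real change would also need to hold.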
Every change was thoroughly tested and benchmarked to avoid overfitting, but independent replication is welcome and encouraged.
Benchmarks
taskset -c 1 ./bin/bench_bitcoin -filter="(GCSFilterConstruct)" --min-time=60000
Before:
| ns/op | op/s | err% | total | benchmark | 
|---|---|---|---|---|
| 12,983,090.72 | 77.02 | 0.1% | 66.00 | GCSFilterConstruct | 
After:
| ns/op | op/s | err% | total | benchmark | 
|---|---|---|---|---|
| 11,155,751.42 | 89.64 | 0.1% | 65.99 | GCSFilterConstruct | 
Compared to master:
- GCSFilterConstruct: +16% op/s (89.64 vs 77.02), or about 14% less time per operation