This change is part of [IBD] - Tracking PR for speeding up Initial Block Download
Summary
Block obfuscation is currently done byte-by-byte. This PR batches the XOR into 64-bit primitives to speed up obfuscating larger memory regions. This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.
Since this obfuscation is optional, the speedup measured here depends on whether the key is a random value or obfuscation is completely turned off (i.e. XOR-ing with 0).
Changes in testing, benchmarking and implementation
- Added new tests comparing randomized inputs against a trivial implementation and performing roundtrip checks with random chunks.
- An additional benchmark checks the effect of short-circuiting XOR when the key is zero, ensuring no speed regression occurs when the obfuscation feature is disabled.
- Migrated remaining `std::vector<std::byte>(8)` values to `uint64_t`.
Reproducer and assembly
Memory alignment is handled via `std::memcpy`, optimized out on tested platforms (see https://godbolt.org/z/P4cWx91Kv):
- Clang (x86-64) - 128-bit SIMD (pxor), 256-bit unroll (4×64-bit)
- GCC (x86-64) - 64-bit XOR (QWORD), 128-bit unroll (2×64-bit)
- RISC-V (32-bit) - 64-bit via 32-bit registers, no unroll, byte-by-byte load/store
- s390x (big-endian) - 64-bit XOR (xc), 512-bit unroll (8×64-bit)
Endianness
The only endianness issue was with bit rotation, intended to realign the key if obfuscation halted before full key consumption. Elsewhere, memory is read, processed, and written back in the same endianness, preserving byte order. Since CI lacks a big-endian machine, testing was done locally via Docker.
brew install podman pigz
softwareupdate --install-rosetta
podman machine init
podman machine start
docker run --platform linux/s390x -it ubuntu:latest /bin/bash
  apt update && apt install -y git build-essential cmake ccache pkg-config libevent-dev libboost-dev libssl-dev libsqlite3-dev && \
  cd /mnt && git clone https://github.com/bitcoin/bitcoin.git && cd bitcoin && git remote add l0rinc https://github.com/l0rinc/bitcoin.git && git fetch --all && git checkout l0rinc/optimize-xor && \
  cmake -B build && cmake --build build --target test_bitcoin -j$(nproc) && \
  ./build/bin/test_bitcoin --run_test=streams_tests
Measurements (micro benchmarks and full IBDs)
cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
&& cmake --build build -j$(nproc) \
&& build/bin/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
The 860k block profile contains a lot of very big arrays (96'233 distinct sizes, the largest being 3'992'470 bytes) - a big departure from the previous 400k and 700k blocks (1500 distinct sizes, the largest being 9319 bytes).
The performance characteristics are also quite different, now that we have more and bigger byte arrays:
C++ compiler: AppleClang 16.0.0.16000026
Before:
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.00 | 1,000,913,427.27 | 0.7% | 10.20 | AutoFileXor |
0.85 | 1,173,442,964.60 | 0.2% | 11.16 | XorHistogram |
After:
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
0.09 | 11,204,183,007.86 | 0.6% | 11.08 | AutoFileXor |
0.15 | 6,459,482,269.06 | 0.3% | 10.97 | XorHistogram |
i.e. processing data with representative histograms is ~11x/5.5x faster (obfuscation disabled/enabled) with Clang.
C++ compiler: GNU 13.2.0
Before:
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
1.87 | 535,253,389.72 | 0.0% | 9.20 | 3.45 | 2.669 | 1.03 | 0.1% | 11.02 | AutoFileXor |
1.70 | 587,844,715.57 | 0.0% | 9.35 | 5.41 | 1.729 | 1.05 | 1.7% | 10.95 | XorHistogram |
After:
ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark |
---|---|---|---|---|---|---|---|---|---|
0.59 | 1,706,433,032.76 | 0.1% | 0.00 | 0.00 | 0.620 | 0.00 | 1.8% | 11.01 | AutoFileXor |
0.47 | 2,145,375,849.71 | 0.0% | 0.95 | 1.48 | 0.642 | 0.20 | 9.6% | 10.93 | XorHistogram |
i.e. processing data with representative histograms is ~3.2x/3.5x faster (obfuscation disabled/enabled) with GCC.
Before:
ns/op | op/s | err% | total | benchmark |
---|---|---|---|---|
2,237,168.64 | 446.99 | 0.3% | 10.91 | ReadBlockFromDiskTest |
748,837.59 | 1,335.40 | 0.2% | 10.68 | ReadRawBlockFromDiskTest |
After:
ns/op | op/s | err% | total | benchmark |
---|---|---|---|---|
1,827,436.12 | 547.21 | 0.7% | 10.95 | ReadBlockFromDiskTest |
49,276.48 | 20,293.66 | 0.2% | 10.99 | ReadRawBlockFromDiskTest |
Also visible on https://corecheck.dev/bitcoin/bitcoin/pulls/31144