Summary
This PR optimizes the FindByte method by using memchr instead of std::find. memchr implementations are typically vectorized and read the buffer in chunks, and this change takes advantage of that. While std::find is the more idiomatic C++ choice, it is suboptimal for searching single bytes here, because it compares them one at a time instead of exploiting SIMD.
One could argue that this is a concern for the C++ standard library maintainers rather than for Bitcoin Core, but since it shows a ~5x improvement in the existing FindByte benchmark, I think it's worth including.
Benchmarks
<details>
secp256k1 configure summary
===========================
Build artifacts:
library type ........................ Static
Optional modules:
ECDH ................................ OFF
ECDSA pubkey recovery ............... ON
extrakeys ........................... ON
schnorrsig .......................... ON
musig ............................... ON
ElligatorSwift ...................... ON
Parameters:
ecmult window size .................. 15
ecmult gen table size ............... 86 KiB
Optional features:
assembly ............................ x86_64
external callbacks .................. OFF
Optional binaries:
benchmark ........................... OFF
noverify_tests ...................... OFF
tests ............................... OFF
exhaustive tests .................... OFF
ctime_tests ......................... OFF
examples ............................ OFF
Cross compiling ....................... FALSE
API visibility attributes ............. ON
Valgrind .............................. ON
Preprocessor defined macros ........... ECMULT_WINDOW_SIZE=15 COMB_BLOCKS=43 COMB_TEETH=6 USE_ASM_X86_64=1 VALGRIND
C compiler ............................ GNU 13.3.0, /usr/bin/cc
CFLAGS ................................
Compile options ....................... -Wall -pedantic -Wcast-align -Wcast-align=strict -Wextra -Wnested-externs -Wno-long-long -Wno-overlength-strings -Wno-unused-function -Wshadow -Wstrict-prototypes -Wundef
Build type:
- CMAKE_BUILD_TYPE ................... Release
- CFLAGS ............................. -O2 -g
- LDFLAGS for executables ............
- LDFLAGS for shared libraries .......
Configure summary
=================
Executables:
bitcoin ............................. OFF
bitcoind ............................ ON
bitcoin-node (multiprocess) ......... ON
bitcoin-qt (GUI) .................... OFF
bitcoin-gui (GUI, multiprocess) ..... OFF
bitcoin-cli ......................... OFF
bitcoin-tx .......................... OFF
bitcoin-util ........................ OFF
bitcoin-wallet ...................... OFF
bitcoin-chainstate (experimental) ... OFF
libbitcoinkernel (experimental) ..... OFF
kernel-test (experimental) .......... OFF
Optional features:
wallet support ...................... OFF
external signer ..................... OFF
ZeroMQ .............................. OFF
IPC ................................. ON
USDT tracing ........................ OFF
QR code (GUI) ....................... OFF
DBus (GUI) .......................... OFF
Tests:
test_bitcoin ........................ OFF
test_bitcoin-qt ..................... OFF
bench_bitcoin ....................... OFF
fuzz binary ......................... OFF
Cross compiling ....................... FALSE
C++ compiler .......................... GNU 13.3.0, /usr/bin/c++
CMAKE_BUILD_TYPE ...................... Release
Preprocessor defined macros ...........
C++ compiler flags .................... -O2 -std=c++20 -fPIC -fno-extended-identifiers -fdebug-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fmacro-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fstack-reuse=none -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -Wstack-protector -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wall -Wextra -Wformat -Wformat-security -Wvla -Wredundant-decls -Wdate-time -Wduplicated-branches -Wduplicated-cond -Wlogical-op -Woverloaded-virtual -Wsuggest-override -Wimplicit-fallthrough -Wunreachable-code -Wbidi-chars=any -Wundef -Wno-unused-parameter
Linker flags .......................... -O2 -fstack-reuse=none -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wl,-z,relro -Wl,-z,now -Wl,-z,separate-code -fPIE -pie
</details>
taskset -c 1 ./bin/bench_bitcoin -filter="(FindByte|LoadExternalBlockFile)" --min-time=10000
Before:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 53.20 | 18,796,833.40 | 0.0% | 11.00 | FindByte
| 22,499,431.11 | 44.45 | 0.2% | 10.90 | LoadExternalBlockFile
After:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 10.38 | 96,365,031.03 | 0.0% | 10.99 | FindByte
| 22,128,903.67 | 45.19 | 0.3% | 10.96 | LoadExternalBlockFile
I also ran a reindex benchmark up to block 300'000, and it shows a slight improvement of ~1.2%.
<details>
CMD ["hyperfine", \
"--runs", "3", \
"--setup", "pyperf system tune; bitcoind -datadir=. -stopatheight=1 || true", \
"--prepare", "rm -rf chainstate/", \
"--cleanup", "pyperf system reset", \
"bitcoind -datadir=. -listen=0 -dnsseed=0 -fixedseeds=0 -printtoconsole=0 -blocksonly=1 -reindex -stopatheight=300000 -dbcache=4096"]
</details>
Before:
Time (mean ± σ): 2097.363 s ± 18.306 s [User: 5859.220 s, System: 62.772 s]
Range (min … max): 2079.740 s … 2116.283 s 3 runs
After:
Time (mean ± σ): 2072.158 s ± 29.275 s [User: 5857.330 s, System: 63.515 s]
Range (min … max): 2046.102 s … 2103.836 s 3 runs
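As a quick check on the ~1.2% figure, the relative change in mean wall-clock time between the two hyperfine results above works out as:

$$
\frac{2097.363\,\mathrm{s} - 2072.158\,\mathrm{s}}{2097.363\,\mathrm{s}} = \frac{25.205}{2097.363} \approx 1.2\%
$$

The 25.2 s difference is of the same order as the reported standard deviations (18.3 s and 29.3 s, 3 runs each), so the reindex number is best read as indicative.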