Summary
This PR optimizes the FindByte method by using memchr instead of std::find. This takes advantage of the optimizations found in typical memchr implementations, primarily vectorized, chunked reads. While std::find is the more idiomatic C++ choice, it is suboptimal for searching for a single byte here, because the bytes are compared one at a time instead of exploiting SIMD.
One could argue that this is not a concern of Bitcoin Core but rather of libc++ maintainers, but since it shows a roughly 5x improvement in the existing FindByte benchmark (53.20 ns/op down to 10.38 ns/op), I think it’s worth including.
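As a rough illustration of the idea (not the actual diff in this PR), the two approaches can be sketched as standalone helpers. The names FindOffsetStd and FindOffsetMemchr are hypothetical and exist only for this example; both return the offset of the first matching byte, or len if the byte is not present.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical helper (illustration only): baseline using std::find.
// Returns the offset of the first `needle` in [data, data + len), or `len` if absent.
inline std::size_t FindOffsetStd(const std::byte* data, std::size_t len, std::byte needle)
{
    const std::byte* it = std::find(data, data + len, needle);
    return static_cast<std::size_t>(it - data);
}

// Hypothetical helper (illustration only): same contract, using memchr.
inline std::size_t FindOffsetMemchr(const std::byte* data, std::size_t len, std::byte needle)
{
    if (len == 0) return 0; // avoid handing a possibly-null pointer to memchr
    // memchr takes the value as an int and compares it as an unsigned char.
    const void* hit = std::memchr(data, std::to_integer<int>(needle), len);
    if (hit == nullptr) return len;
    return static_cast<std::size_t>(static_cast<const std::byte*>(hit) - data);
}
```

Both helpers should return the same result for any input; whether the memchr variant is actually faster depends on the compiler, the standard library and the libc in use, which is why the benchmarks below matter.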
Benchmarks
secp256k1 configure summary
===========================
Build artifacts:
 library type ........................ Static
Optional modules:
 ECDH ................................ OFF
 ECDSA pubkey recovery ............... ON
 extrakeys ........................... ON
 schnorrsig .......................... ON
 musig ............................... ON
 ElligatorSwift ...................... ON
Parameters:
 ecmult window size .................. 15
 ecmult gen table size ............... 86 KiB
Optional features:
 assembly ............................ x86_64
 external callbacks .................. OFF
Optional binaries:
 benchmark ........................... OFF
 noverify_tests ...................... OFF
 tests ............................... OFF
 exhaustive tests .................... OFF
 ctime_tests ......................... OFF
 examples ............................ OFF

Cross compiling ....................... FALSE
API visibility attributes ............. ON
Valgrind .............................. ON
Preprocessor defined macros ........... ECMULT_WINDOW_SIZE=15 COMB_BLOCKS=43 COMB_TEETH=6 USE_ASM_X86_64=1 VALGRIND
C compiler ............................ GNU 13.3.0, /usr/bin/cc
CFLAGS ................................
Compile options ....................... -Wall -pedantic -Wcast-align -Wcast-align=strict -Wextra -Wnested-externs -Wno-long-long -Wno-overlength-strings -Wno-unused-function -Wshadow -Wstrict-prototypes -Wundef
Build type:
 - CMAKE_BUILD_TYPE ................... Release
 - CFLAGS ............................. -O2 -g
 - LDFLAGS for executables ............
 - LDFLAGS for shared libraries .......

Configure summary
=================
Executables:
 bitcoin ............................. OFF
 bitcoind ............................ ON
 bitcoin-node (multiprocess) ......... ON
 bitcoin-qt (GUI) .................... OFF
 bitcoin-gui (GUI, multiprocess) ..... OFF
 bitcoin-cli ......................... OFF
 bitcoin-tx .......................... OFF
 bitcoin-util ........................ OFF
 bitcoin-wallet ...................... OFF
 bitcoin-chainstate (experimental) ... OFF
 libbitcoinkernel (experimental) ..... OFF
 kernel-test (experimental) .......... OFF
Optional features:
 wallet support ...................... OFF
 external signer ..................... OFF
 ZeroMQ .............................. OFF
 IPC ................................. ON
 USDT tracing ........................ OFF
 QR code (GUI) ....................... OFF
 DBus (GUI) .......................... OFF
Tests:
 test_bitcoin ........................ OFF
 test_bitcoin-qt ..................... OFF
 bench_bitcoin ....................... OFF
 fuzz binary ......................... OFF

Cross compiling ....................... FALSE
C++ compiler .......................... GNU 13.3.0, /usr/bin/c++
CMAKE_BUILD_TYPE ...................... Release
Preprocessor defined macros ...........
C++ compiler flags .................... -O2 -std=c++20 -fPIC -fno-extended-identifiers -fdebug-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fmacro-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fstack-reuse=none -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -Wstack-protector -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wall -Wextra -Wformat -Wformat-security -Wvla -Wredundant-decls -Wdate-time -Wduplicated-branches -Wduplicated-cond -Wlogical-op -Woverloaded-virtual -Wsuggest-override -Wimplicit-fallthrough -Wunreachable-code -Wbidi-chars=any -Wundef -Wno-unused-parameter
Linker flags .......................... -O2 -fstack-reuse=none -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wl,-z,relro -Wl,-z,now -Wl,-z,separate-code -fPIE -pie
taskset -c 1 ./bin/bench_bitcoin -filter="(FindByte|LoadExternalBlockFile)" --min-time=10000
Before:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 53.20 | 18,796,833.40 | 0.0% | 11.00 | FindByte |
| 22,499,431.11 | 44.45 | 0.2% | 10.90 | LoadExternalBlockFile |
After:
| ns/op | op/s | err% | total | benchmark |
|---|---|---|---|---|
| 10.38 | 96,365,031.03 | 0.0% | 10.99 | FindByte |
| 22,128,903.67 | 45.19 | 0.3% | 10.96 | LoadExternalBlockFile |
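I’ve also run a reindex benchmark up to block 300,000. It shows a slight improvement of ~1.2% in mean wall-clock time, although with only 3 runs the difference is within the measured standard deviation.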
CMD ["hyperfine", \
 "--runs", "3", \
 "--setup", "pyperf system tune; bitcoind -datadir=. -stopatheight=1 || true", \
 "--prepare", "rm -rf chainstate/", \
 "--cleanup", "pyperf system reset", \
 "bitcoind -datadir=. -listen=0 -dnsseed=0 -fixedseeds=0 -printtoconsole=0 -blocksonly=1 -reindex -stopatheight=300000 -dbcache=4096"]
Before:
 Time (mean ± σ): 2097.363 s ± 18.306 s [User: 5859.220 s, System: 62.772 s]
 Range (min … max): 2079.740 s … 2116.283 s 3 runs
After:
 Time (mean ± σ): 2072.158 s ± 29.275 s [User: 5857.330 s, System: 63.515 s]
 Range (min … max): 2046.102 s … 2103.836 s 3 runs