streams: replace `std::find` with `memchr` (5x improvement) #34044

pull Raimo33 wants to merge 2 commits into bitcoin:master from Raimo33:optimize-findbyte changing 1 files +8 −6
  1. Raimo33 commented at 5:34 PM on December 10, 2025: contributor

    Summary

    This PR optimizes the FindByte method by using memchr instead of std::find. This takes advantage of the underlying optimizations that come with memchr, primarily vectorized chunked reads. While std::find is more standard and modern, it is suboptimal for iterating single bytes as they're iterated 1 by 1 instead of exploiting SIMD.

    One could argue that this is not a concern of Bitcoin Core but rather of libc++ mantainers, but since it shows 5x improvement in existing benchmarks, I think it's worth including.

    Benchmarks

    <details>

    secp256k1 configure summary
    ===========================
    Build artifacts:
      library type ........................ Static
    Optional modules:
      ECDH ................................ OFF
      ECDSA pubkey recovery ............... ON
      extrakeys ........................... ON
      schnorrsig .......................... ON
      musig ............................... ON
      ElligatorSwift ...................... ON
    Parameters:
      ecmult window size .................. 15
      ecmult gen table size ............... 86 KiB
    Optional features:
      assembly ............................ x86_64
      external callbacks .................. OFF
    Optional binaries:
      benchmark ........................... OFF
      noverify_tests ...................... OFF
      tests ............................... OFF
      exhaustive tests .................... OFF
      ctime_tests ......................... OFF
      examples ............................ OFF
    
    Cross compiling ....................... FALSE
    API visibility attributes ............. ON
    Valgrind .............................. ON
    Preprocessor defined macros ........... ECMULT_WINDOW_SIZE=15 COMB_BLOCKS=43 COMB_TEETH=6 USE_ASM_X86_64=1 VALGRIND
    C compiler ............................ GNU 13.3.0, /usr/bin/cc
    CFLAGS ................................ 
    Compile options ....................... -Wall -pedantic -Wcast-align -Wcast-align=strict -Wextra -Wnested-externs -Wno-long-long -Wno-overlength-strings -Wno-unused-function -Wshadow -Wstrict-prototypes -Wundef
    Build type:
     - CMAKE_BUILD_TYPE ................... Release
     - CFLAGS ............................. -O2 -g 
     - LDFLAGS for executables ............ 
     - LDFLAGS for shared libraries ....... 
    
    
    
    Configure summary
    =================
    Executables:
      bitcoin ............................. OFF
      bitcoind ............................ ON
      bitcoin-node (multiprocess) ......... ON
      bitcoin-qt (GUI) .................... OFF
      bitcoin-gui (GUI, multiprocess) ..... OFF
      bitcoin-cli ......................... OFF
      bitcoin-tx .......................... OFF
      bitcoin-util ........................ OFF
      bitcoin-wallet ...................... OFF
      bitcoin-chainstate (experimental) ... OFF
      libbitcoinkernel (experimental) ..... OFF
      kernel-test (experimental) .......... OFF
    Optional features:
      wallet support ...................... OFF
      external signer ..................... OFF
      ZeroMQ .............................. OFF
      IPC ................................. ON
      USDT tracing ........................ OFF
      QR code (GUI) ....................... OFF
      DBus (GUI) .......................... OFF
    Tests:
      test_bitcoin ........................ OFF
      test_bitcoin-qt ..................... OFF
      bench_bitcoin ....................... OFF
      fuzz binary ......................... OFF
    
    Cross compiling ....................... FALSE
    C++ compiler .......................... GNU 13.3.0, /usr/bin/c++
    CMAKE_BUILD_TYPE ...................... Release
    Preprocessor defined macros ........... 
    C++ compiler flags .................... -O2 -std=c++20 -fPIC -fno-extended-identifiers -fdebug-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fmacro-prefix-map=/home/claudio/Desktop/bitcoinknots/src=. -fstack-reuse=none -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=3 -Wstack-protector -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wall -Wextra -Wformat -Wformat-security -Wvla -Wredundant-decls -Wdate-time -Wduplicated-branches -Wduplicated-cond -Wlogical-op -Woverloaded-virtual -Wsuggest-override -Wimplicit-fallthrough -Wunreachable-code -Wbidi-chars=any -Wundef -Wno-unused-parameter
    Linker flags .......................... -O2 -fstack-reuse=none -fstack-protector-all -fcf-protection=full -fstack-clash-protection -Wl,-z,relro -Wl,-z,now -Wl,-z,separate-code -fPIE -pie
    
    

    </details>

    taskset -c 1 ./bin/bench_bitcoin -filter="(FindByte|LoadExternalBlockFile)" --min-time=10000
    

    Before:

    | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 53.20 | 18,796,833.40 | 0.0% | 11.00 | FindByte | 22,499,431.11 | 44.45 | 0.2% | 10.90 | LoadExternalBlockFile

    After:

    | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 10.38 | 96,365,031.03 | 0.0% | 10.99 | FindByte | 22,128,903.67 | 45.19 | 0.3% | 10.96 | LoadExternalBlockFile

    I've also ran a reindex benchmark up to block 300'000 and it shows a slight improvement of ~1.2%

    <details>

    CMD ["hyperfine", \
        "--runs", "3", \
        "--setup", "pyperf system tune; bitcoind -datadir=. -stopatheight=1 || true", \
        "--prepare", "rm -rf chainstate/", \
        "--cleanup", "pyperf system reset", \
        "bitcoind -datadir=. -listen=0 -dnsseed=0 -fixedseeds=0 -printtoconsole=0 -blocksonly=1 -reindex -stopatheight=300000 -dbcache=4096"]
    

    </details>

    before:

      Time (mean ± σ): 2097.363 s ± 18.306 s    [User: 5859.220 s, System: 62.772 s]
      Range (min … max): 2079.740 s … 2116.283 s    3 runs
    

    after:

      Time (mean ± σ): 2072.158 s ± 29.275 s    [User: 5857.330 s, System: 63.515 s]
      Range (min … max): 2046.102 s … 2103.836 s    3 runs
    
  2. DrahtBot commented at 5:34 PM on December 10, 2025: contributor

    <!--e57a25ab6845829454e8d69fc972939a-->

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    <!--006a51241073e994b41acfe9ec718e94-->

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34044.

    <!--021abf342d371248e50ceaed478a90ca-->

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK Ataraxia009, l0rinc

    If your review is incorrectly listed, please copy-paste <code>&lt;!--meta-tag:bot-skip--&gt;</code> into the comment that the bot should ignore.

    <!--5faf32d7da4f0f540f40219e4f7537a3-->

  3. Ataraxia009 commented at 6:30 PM on December 10, 2025: none

    Concept ACK

    'memchr' seems like a better alternative here

  4. in src/streams.h:602 in 4115c99575
     598 | @@ -599,6 +599,7 @@ class BufferedFile
     599 |      {
     600 |          // For best performance, avoid mod operation within the loop.
     601 |          size_t buf_offset{size_t(m_read_pos % uint64_t(vchBuf.size()))};
     602 | +        const unsigned char target{std::to_integer<unsigned char>(byte)};
    


    l0rinc commented at 6:58 PM on December 10, 2025:

    I don't think this needs manual hoisting - it's loop-invariant so the compiler will likely do it. Inling it will reduce the diff. And I don't think we need the ugly to_integer here: https://godbolt.org/z/7MWzTbMdb

  5. in src/streams.h:616 in 4115c99575
     614 | +            const std::byte* start{vchBuf.data() + buf_offset};
     615 | +            const void* hit{memchr(start, target, len)};
     616 | +            const size_t inc{hit ? size_t(static_cast<const std::byte*>(hit) - start) : len};
     617 |              m_read_pos += inc;
     618 |              if (inc < len) break;
     619 |              buf_offset += inc;
    


    l0rinc commented at 7:57 PM on December 10, 2025:

    the old implementation is quite confusing.

    buf_offset += inc is only executed when inc >= len - and given it can't be greater, it's basically:

    buf_offset += len;
    

    which makes a lot more sense.


    m_read_pos += inc; is either incremented by the distance of the find or by len, so if we do the hit check earlier, we can simplify this as well:

    const auto len{std::min<size_t>(vchBuf.size() - buf_offset, nSrcPos - m_read_pos)};
    const auto* start{vchBuf.data() + buf_offset};
    if (const auto* hit{(std::byte*)memchr(start, int(byte), len)}) {
        m_read_pos += std::distance(start, hit);
        return;
    }
    m_read_pos += len;
    buf_offset += len;
    

    Part of this cleanup could be applied in a preceding commit.


    And about the FindByte benchmark - I can reproduce the speedup, however, it checks only 200 bytes (while in reality we have 8 million) and it only checks the very last position (while in reality it's a circular buffer, so the magic bytes can likely be anywhere).

    So only checking when we find something at the very end is similar to when we don't actually find the value, so we should exercise more positions, since the old version is faster when the value is very early in the byte sequence since the latter has a setup cost:

    rm -rf build && cmake -B build -DBUILD_BENCH=ON -DENABLE_IPC=OFF -DCMAKE_BUILD_TYPE=Release && \
    cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='FindByte' -min-time=10000
    

    Before: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 0.93 | 1,079,082,894.27 | 0.3% | 11.02 | FindByte

    After: | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 1.98 | 505,687,246.16 | 0.0% | 11.00 | FindByte


    We could do an iteration with a stride of ~8kb, something like:

    static void FindByte(benchmark::Bench& bench)
    {
        const auto testing_setup{MakeNoLogFileContext<const BasicTestingSetup>(ChainType::REGTEST)};
        constexpr size_t file_size{2 * MAX_BLOCK_SERIALIZED_SIZE};
        constexpr size_t found_index{file_size - 1};
        constexpr std::byte target{1};
    
        AutoFile file{fsbridge::fopen(testing_setup->m_path_root / "streams_tmp", "w+b")};
        std::array<std::byte, file_size> data{}; // file_size * [std::byte{0}]
        data[found_index] = target;
        file << data;
        file.seek(0, SEEK_SET);
        BufferedFile bf{file, /*nBufSize=*/file_size + 1, /*nRewindIn=*/file_size};
    
        bench.run([&] {
            for (size_t start{0}; start < file_size; start += (8 << 10)) {
                bf.SetPos(start);
                bf.FindByte(target);
                assert(bf.GetPos() == found_index);
            }
        });
        assert(file.fclose() == 0);
    }
    

    which shows that on average the new solution is indeed ~6x faster:

    Before:

    | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 409,425,187.50 | 2.44 | 0.4% | 8.69 | FindByte

    After (current PR):

    | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 61,892,369.81 | 16.16 | 0.8% | 11.07 | FindByte

    After (above suggestion):

    | ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 61,744,208.31 | 16.20 | 0.3% | 11.13 | FindByte

    I have pushed the new benchmark with 32kb stride to https://github.com/bitcoin/bitcoin/pull/34046

  6. l0rinc commented at 9:24 PM on December 10, 2025: contributor

    Concept ACK

    FindByte’s only production use seems to be scanning block files during -reindex to find the first byte of the network magic in a circular buffer. Because of documented historical bugs the data may be jumbled a bit, it's why we need to be able to find the beginning. It's not something we expect users to do often, but this should speed it up a tiny bit. Left a few suggestions to make it more readable as well to make it easier to sell :)

    I would be curious whether a -reindex -stopafterblockimport speedup is measurable - since my understanding is that the magic is usually at the beginning anyway, so it's also possible this ends up slowing down the average case.

    I haven't checked, but LoadExternalBlockFile bench should also exercise this change - and it puts the change into perspective.

  7. Raimo33 force-pushed on Dec 11, 2025
  8. l0rinc commented at 12:39 PM on December 11, 2025: contributor

    it doesn't help that you rebase on a Knots base - the first few commits are already merged on Core, please fix that.

  9. in src/streams.h:612 in 246086cdad outdated
     612 | -            const auto it_find{std::find(it_start, it_end, byte)};
     613 | -            if (it_find != it_end) {
     614 | -                m_read_pos += std::distance(it_start, it_find);
     615 | +            const auto* start{vchBuf.data() + buf_offset};
     616 | +            const auto* hit{(std::byte*)memchr(start, int(byte), len)};
     617 | +            if (hit != nullptr) {
    


    l0rinc commented at 12:41 PM on December 11, 2025:

    you could inline hit to the if condition


    Raimo33 commented at 12:44 PM on December 11, 2025:

    messier imo

  10. in src/streams.h:611 in 6c04cbffda outdated
     606 | @@ -607,8 +607,9 @@ class BufferedFile
     607 |                  Fill();
     608 |              }
     609 |              const size_t len{std::min<size_t>(vchBuf.size() - buf_offset, nSrcPos - m_read_pos)};
     610 | -            const auto it_start{vchBuf.begin() + buf_offset};
     611 | -            const auto it_find{std::find(it_start, it_start + len, byte)};
     612 | +            const auto it_start{vchBuf.cbegin() + buf_offset};
     613 | +            const auto it_end{it_start + len};
    


    l0rinc commented at 12:42 PM on December 11, 2025:

    why extract end if we're getting rid of it in the next commit?


    Raimo33 commented at 12:45 PM on December 11, 2025:

    I've thought about it as well. but it makes it easier to review. smaller diff


    l0rinc commented at 12:47 PM on December 11, 2025:

    how so, how is it easier to review multiple conflicting changes? Same for the it_start change - it's completely removed in a next commit.

  11. Raimo33 force-pushed on Dec 11, 2025
  12. DrahtBot added the label CI failed on Dec 11, 2025
  13. refactor: more readable increment in `FindByte` 06d4e79f62
  14. streams: replace std::find with memchr
    before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               53.20 |       18,796,833.40 |    0.0% |     11.00 | `FindByte`
    |       22,499,431.11 |               44.45 |    0.2% |     10.90 | `LoadExternalBlockFile`
    
    after:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               10.38 |       96,365,031.03 |    0.0% |     10.99 | `FindByte`
    |       22,128,903.67 |               45.19 |    0.3% |     10.96 | `LoadExternalBlockFile`
    830024e142
  15. Raimo33 force-pushed on Dec 11, 2025
  16. DrahtBot removed the label CI failed on Dec 11, 2025
  17. Raimo33 commented at 6:57 PM on December 12, 2025: contributor

    I would be curious whether a -reindex -stopafterblockimport speedup is measurable - since my understanding is that the magic is usually at the beginning anyway, so it's also possible this ends up slowing down the average case.

    I've ran a reindex benchmark, see updated PR description. I'm not sure if ~1.2% is an irrelevant gain, but I argue a 5-6x improvement on the FindByte method is significant and should be considered for merging regardless, even for possible future use of this method.

  18. l0rinc commented at 11:08 AM on December 14, 2025: contributor

    I ran a reindex (without chainstate) 3 times for the whole mainchain - there is no measurable speedup here (it's even 1-2% slower than before, though that's most likely just noise):

    <details> <summary>reindex | M4-Max.local | arm64 | Apple M4 Max | 16 cores | 64.0GiB RAM | SSD | macOS 26.1 25B78 | Apple clang version 17.0.0 (clang-1700.4.4.1)</summary>

    COMMITS="938d7aacabd0bb3784bb3e529b1ed06bb2891864 e24701fe5522ac9b0eaeacc67bd16e11555a6020"; \
    DATA_DIR="$HOME/Library/Application Support/Bitcoin"; LOG_DIR="$HOME/bitcoin-reindex-logs"; \
    mkdir -p "$LOG_DIR"; \
    COMMA_COMMITS=${COMMITS// /,}; \
    (echo ""; for c in $(echo $COMMITS); do git fetch -q origin $c && git log -1 --pretty='%h %s' $c || exit 1; done) && \
    (echo "" && echo "reindex | $(hostname) | $(uname -m) | $(sysctl -n machdep.cpu.brand_string) | $(nproc) cores | $(printf '%.1fGiB' "$(( $(sysctl -n hw.memsize)/1024/1024/1024 ))") RAM | SSD | $(sw_vers -productName) $(sw_vers -productVersion) $(sw_vers -buildVersion) | $(xcrun clang --version | head -1)"; echo "") && \
    hyperfine \
      --sort command \
      --runs 3 \
      --export-json "$LOG_DIR/reindex-$(echo "$COMMITS" | sed -E 's/([a-f0-9]{8})[a-f0-9]* ?/\1-/g;s/-$//')-appleclang.json" \
      --parameter-list COMMIT "$COMMA_COMMITS" \
      --prepare "killall -9 bitcoind 2>/dev/null || true; rm -f \"$DATA_DIR\"/debug.log; git checkout {COMMIT}; git clean -fxd; git reset --hard && \
        cmake -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo && ninja -C build bitcoind -j2" \
      --conclude "killall bitcoind 2>/dev/null || true; sleep 5; grep -q 'Stopping after block import' \"$DATA_DIR\"/debug.log || { echo 'debug.log assertions failed'; exit 1; }; \
                  cp \"$DATA_DIR\"/debug.log \"$LOG_DIR\"/reindex-{COMMIT}-\$(date +%s).log 2>/dev/null || true" \
      "./build/bin/bitcoind -datadir=\"$DATA_DIR\" -reindex -stopafterblockimport -printtoconsole=0"
    
    938d7aacab Merge bitcoin/bitcoin#33657: rest: allow reading partial block data from storage
    e24701fe55 streams: replace std::find with memchr
    

    </details>

    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -reindex -stopafterblockimport -printtoconsole=0 (COMMIT = 938d7aacabd0bb3784bb3e529b1ed06bb2891864)
      Time (mean ± σ):     17116.255 s ± 287.737 s    [User: 18550.915 s, System: 3721.215 s]
      Range (min … max):   16793.539 s … 17346.051 s    3 runs
    
    Benchmark 1: ./build/bin/bitcoind -datadir="/Users/lorinc/Library/Application Support/Bitcoin" -reindex -stopafterblockimport -printtoconsole=0 (COMMIT = e24701fe5522ac9b0eaeacc67bd16e11555a6020)
      Time (mean ± σ):     17344.195 s ± 113.196 s    [User: 18589.419 s, System: 3727.864 s]
      Range (min … max):   17247.068 s … 17468.509 s    3 runs
    

    It can still be a useful change, but I don't think it's fair to say it speeds up anything.

  19. sipa commented at 1:35 PM on December 14, 2025: member

    I wouldn't expect any change, because unless you're starting from a pre-0.8 node block files, or badly corrupted ones, it'll always be the first byte that matches.

  20. l0rinc commented at 1:38 PM on December 14, 2025: contributor

    It's also what I expected, as mentioned above, the current solution is worse for that case

  21. Raimo33 commented at 1:53 PM on December 14, 2025: contributor

    understood. std::find makes more sense then. Something to keep in mind for the future though.

  22. Raimo33 closed this on Dec 14, 2025

  23. l0rinc referenced this in commit 35386df9fe on Dec 14, 2025
  24. l0rinc commented at 5:07 PM on December 14, 2025: contributor

    Something to keep in mind for the future though.

    ~I have pushed a change to remove the benchmark that lead to the confusion: #34046 (comment)~ reverted

  25. maflcko commented at 8:28 AM on December 15, 2025: member

    Something to keep in mind for the future though.

    I have pushed a change to remove the benchmark that lead to the confusion: #34046 (comment)

    I think it could be better to remove the recovery logic. An upgrade from 0.8 does not seem like a use-case that any user will ever need in the future. Also, I don't see what kind of corruption this could possibly be able to recover, given that it can't progress past the first corrupt block anyway?

  26. l0rinc commented at 9:00 AM on December 15, 2025: contributor

    I think it could be better to remove the recovery logic

    I will push a PR for that, let's see what others think


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-21 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me