optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD #31539

pull l0rinc wants to merge 3 commits into bitcoin:master from l0rinc:l0rinc/buffered-block-read-write changing 3 files +77 −2
  1. l0rinc commented at 11:09 am on December 19, 2024: contributor

    An alternative implementation of #31551, where the block is read into memory in chunks, resulting in similar gains, but is slightly more complicated, but uses less memory.


    This is a precursor to #31144.

    Currently, obfuscation operations are performed byte-by-byte during serialization. Buffering the reads allows batching these operations (implemented in #31144) and improves file access efficiency by reducing fread calls and associated locking overhead. Testing with various buffer sizes showed that 16 KiB provides the best performance.

    For writes, buffering enables batched obfuscations directly on the internal buffer, avoiding the need to copy input spans for obfuscation. Xor key offsets are calculated based on the file position, and the batched obfuscation is applied before writing to disk.

    BufferedFile was avoided for both reads and writes due to its unrelated operations and potential deprecation.


    2 full IBD runs against master for 870k blocks (seeded from real nodes) indicates a 6% total speedup.

     0hyperfine --runs 2 --export-json /mnt/my_storage/ibd-xor-buffered.json --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,6853b2740851befffa3ca0b24d94212ed1e48d66 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0'
     1
     2Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
     3  Time (mean ± σ):     40216.674 s ± 113.132 s    [User: 51496.289 s, System: 3541.340 s]
     4  Range (min … max):   40136.678 s … 40296.671 s    2 runs
     5
     6Benchmark 2: COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
     7  Time (mean ± σ):     37964.015 s ± 624.115 s    [User: 49086.037 s, System: 3375.072 s]
     8  Range (min … max):   37522.699 s … 38405.331 s    2 runs
     9
    10Summary
    11  COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0 ran
    12    1.06 ± 0.02 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
    

    Microbenchmarks show a ~23%/290% speedup (the followup XOR batching improves this further):

    cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake –build build -j$(nproc) && build/src/bench/bench_bitcoin -filter=‘ReadBlockBench|SaveBlockBench’ -min-time=10000

    Before:

    ns/op op/s err% total benchmark
    2,288,264.16 437.01 0.2% 11.00 ReadBlockBench
    5,260,934.43 190.08 1.2% 11.08 SaveBlockBench

    After optimization: Buffer serialization reads in UndoRead and ReadBlock (23% faster read):

    ns/op op/s err% total benchmark
    1,847,259.64 541.34 0.2% 11.03 ReadBlockBench

    After optimization: Buffer serialization writes in UndoWrite and WriteBlock (290% faster write):

    ns/op op/s err% total benchmark
    1,804,208.61 554.26 1.4% 10.89 SaveBlockBench
  2. DrahtBot commented at 11:09 am on December 19, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31539.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    No conflicts as of last run.

  3. l0rinc renamed this:
    optimization: buffer reads/writes in [undo]blocks
    optimization: buffer reads/writes in [undo]block [de]serialization
    on Dec 19, 2024
  4. l0rinc force-pushed on Dec 21, 2024
  5. l0rinc renamed this:
    optimization: buffer reads/writes in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization
    on Dec 21, 2024
  6. l0rinc force-pushed on Dec 21, 2024
  7. l0rinc renamed this:
    optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization
    on Dec 21, 2024
  8. l0rinc renamed this:
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD
    on Dec 21, 2024
  9. bitcoin deleted a comment on Jan 4, 2025
  10. ryanofsky referenced this in commit 5d6f6fd00d on Jan 22, 2025
  11. DrahtBot added the label Needs rebase on Jan 22, 2025
  12. l0rinc force-pushed on Jan 22, 2025
  13. l0rinc force-pushed on Jan 22, 2025
  14. DrahtBot removed the label Needs rebase on Jan 22, 2025
  15. l0rinc force-pushed on Jan 22, 2025
  16. l0rinc marked this as ready for review on Jan 22, 2025
  17. l0rinc commented at 9:37 am on January 23, 2025: contributor

    Not sure what the failure means:

    100% tests passed, 0 tests failed out of 138

    Error: Process completed with exit code 143.

  18. maflcko commented at 10:09 am on January 23, 2025: member

    Not sure what the failure means:

    100% tests passed, 0 tests failed out of 138

    Error: Process completed with exit code 143.

    The unit tests are different from the functional tests. I guess the failure is due to a GHA VM corruption or OOM?

  19. l0rinc marked this as a draft on Jan 23, 2025
  20. l0rinc force-pushed on Jan 23, 2025
  21. l0rinc commented at 5:39 pm on January 23, 2025: contributor
    I’ll push a few more drafts to find out why CI is crashing - bear with me.
  22. l0rinc force-pushed on Jan 23, 2025
  23. maflcko commented at 7:11 pm on January 23, 2025: member
    The unit tests are not failing. This can be seen in the output (100% tests passed, 0 tests failed out of 138). So skipping them should have no effect.
  24. maflcko commented at 8:14 pm on January 23, 2025: member
    My recommendation would be to measure the asan CI max memory usage before and after this change. Maybe it is not largely different, but just the last drip to tip an OOM?
  25. l0rinc commented at 8:47 pm on January 23, 2025: contributor

    You may be right, though I don’t yet see how this memory saving option keeps crashing after the rebase (it was passing before), while the memory hogging alternative is passing. https://github.com/bitcoin-dev-tools/benchcoin/pull/123#issuecomment-2610155151 shows the before/after memory to be the same.

    I suspected a stuck file or process or infinite loop…

    asan CI max memory usage

    I don’t yet know how to run that on a Mac (and my linux boxes are running IBDs for now). I’ll push a few more rounds here to try to understand which part is causing the problem - and why only for ASan + LSan + UBSan + integer, if they won’t reveal anything, I’ll investigate how to replicate the CI behavior locally.

  26. l0rinc force-pushed on Jan 23, 2025
  27. optimization: Buffer serialization reads in `UndoRead` and `ReadBlock`
    The Obfuscation (XOR) operations are currently done byte-by-byte during serialization, buffering the reads will enable batching the obfuscation operations later (not yet done here).
    
    Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).
    
    Buffered reading was done by first exhausting the buffer, and if more data is still needed, reading directly into the destination (to avoid copying into the buffer when we have huge destination Spans, such as the many megabyte vectors in later blocks), and lastly refilling the buffer completely.
    
    Running the `ReadBlockBench` benchmarks with different buffer sizes indicated that 16 KiB is the optimal buffer size (roughly 25% faster than master).
    
    Testing was done by randomly reading data from the same file with `AutoFile` and `BufferedReadOnlyFile` and making sure that the exact same data is read.
    
    `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).
    
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,288,264.16 |              437.01 |    0.2% |     11.00 | `ReadBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,847,259.64 |              541.34 |    0.2% |     11.03 | `ReadBlockBench`
    f8b10a4863
  28. optimization: Buffer serialization writes in `SaveBlockUndo` and `SaveBlock`
    Similarly to the serialization reads, buffered writes will enable batched xor calculations - especially since currently we need to copy the write inputs Span to do the obfuscation on it, batching will enable doing the xor on the internal buffer instead.
    
    All write operations are delegated to `AutoFile`.
    Xor key offsets are also calculated based on where we are in the underlying file.
    To avoid the buffered write of 4096 in `AutoFile.write`, we're disabling obfuscation for the underlying `m_dest` and doing it ourselves for the whole buffer (instead of byte-by-byte) before writing it to file.
    We can't obfuscate the write's src directly since it would mutate the method's input (for very big writes it would avoid copying), but we can fill our own buffer and xor all of that safely.
    
    Running the `SaveBlockBench` benchmarks with different buffer sizes indicated that 1 MiB was the optimal buffer size.
    
    `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).
    
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        5,260,934.43 |              190.08 |    1.2% |     11.08 | `SaveBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,804,208.61 |              554.26 |    1.4% |     10.89 | `SaveBlockBench`
    b1b55a2c23
  29. l0rinc force-pushed on Jan 23, 2025
  30. Revert "optimization: Buffer serialization writes in `SaveBlockUndo` and `SaveBlock`"
    This reverts commit b1b55a2c2363f94e7266d31b7e2d818372bd5ce4.
    9f8a93e0f6
  31. maflcko commented at 10:50 am on January 24, 2025: member

    I’ll investigate how to replicate the CI behavior locally

    It may be sufficient to just copy-paste the cxx flags into a normal build. In the future a cmake preset could be added, see also #31199 (comment)

  32. l0rinc commented at 11:07 am on January 24, 2025: contributor

    It looks like the problem was with BufferedWriteOnlyFile - if needed, I will investigate further.

    Given that there is an alternative PR to this (reading the whole block into memory (instead of smaller chunks) - which we’re already doing in other places, so shouldn’t be controversial), I’ll close this PR. If the memory allocations come up in that, I’ll consider reopening.

  33. l0rinc closed this on Jan 24, 2025


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 21:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me