optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD #31539

pull l0rinc wants to merge 8 commits into bitcoin:master from l0rinc:l0rinc/buffered-block-read-write changing 10 files +310 −157
  1. l0rinc commented at 11:09 am on December 19, 2024: contributor

    An alternative implementation of #31551, where the block is read into memory in chunks, resulting in more modest gains, is slightly more complicated, but uses less memory.


    This is a follow-up to #31490 (first few commits duplicated here, hence a draft) and a precursor to #31144.

    Currently, obfuscation operations are performed byte-by-byte during serialization. Buffering the reads allows batching these operations (implemented in #31144) and improves file access efficiency by reducing fread calls and associated locking overhead. Testing with various buffer sizes showed that 16 KiB provides the best performance.

    For writes, buffering enables batched obfuscations directly on the internal buffer, avoiding the need to copy input spans for obfuscation. Xor key offsets are calculated based on the file position, and the batched obfuscation is applied before writing to disk.

    BufferedFile was avoided for both reads and writes due to its unrelated operations and potential deprecation.


    2 full IBD runs against master for 870k blocks (seeded from real nodes) indicates a 6% total speedup.

     0hyperfine --runs 2 --export-json /mnt/my_storage/ibd-xor-buffered.json --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,6853b2740851befffa3ca0b24d94212ed1e48d66 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0'
     1
     2Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
     3  Time (mean ± σ):     40216.674 s ± 113.132 s    [User: 51496.289 s, System: 3541.340 s]
     4  Range (min … max):   40136.678 s … 40296.671 s    2 runs
     5 
     6Benchmark 2: COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
     7  Time (mean ± σ):     37964.015 s ± 624.115 s    [User: 49086.037 s, System: 3375.072 s]
     8  Range (min … max):   37522.699 s … 38405.331 s    2 runs
     9 
    10Summary
    11  COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0 ran
    12    1.06 ± 0.02 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
    

    Microbenchmarks show a ~23%/290% speedup (the followup XOR batching improves this further):

    cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake –build build -j$(nproc) && build/src/bench/bench_bitcoin -filter=‘ReadBlockFromDiskBench|SaveBlockToDiskBench’ -min-time=10000

    Before:

    ns/op op/s err% total benchmark
    2,288,264.16 437.01 0.2% 11.00 ReadBlockFromDiskBench
    5,260,934.43 190.08 1.2% 11.08 SaveBlockToDiskBench

    After optimization: Buffer serialization reads in UndoReadFromDisk and ReadBlockFromDisk (23% faster read):

    ns/op op/s err% total benchmark
    1,847,259.64 541.34 0.2% 11.03 ReadBlockFromDiskBench

    After optimization: Buffer serialization writes in UndoWriteToDisk and WriteBlockToDisk (290% faster write):

    ns/op op/s err% total benchmark
    1,804,208.61 554.26 1.4% 10.89 SaveBlockToDiskBench
  2. DrahtBot commented at 11:09 am on December 19, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31539.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31551 (optimization: bulk reads(27%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD by l0rinc)
    • #31490 (refactor: inline UndoWriteToDisk and WriteBlockToDisk to reduce serialization calls by l0rinc)
    • #31144 (optimization: batch XOR operations 12% faster IBD by l0rinc)
    • #29307 (util: explicitly close all AutoFiles that have been written by vasild)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. l0rinc renamed this:
    optimization: buffer reads/writes in [undo]blocks
    optimization: buffer reads/writes in [undo]block [de]serialization
    on Dec 19, 2024
  4. l0rinc force-pushed on Dec 21, 2024
  5. l0rinc renamed this:
    optimization: buffer reads/writes in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization
    on Dec 21, 2024
  6. bench: add SaveBlockToDiskBench a20191c567
  7. refactor,blocks: inline `UndoWriteToDisk`
    `UndoWriteToDisk` wasn't really extracting a meaningful subset of the `WriteUndoDataForBlock` functionality, it's tied closely to the only caller (needs the header size twice, recalculated undo serializes size, returns multiple branches, modifies parameter, needs documentation).
    
    The inlined code should only differ in these parts (modernization will be done in other commits):
    * renamed `_pos` to `pos` in `WriteUndoDataForBlock` to match the parameter name;
    * inlined `hashBlock` parameter usage into `hasher << block.pprev->GetBlockHash()`;
    * changed `return false` to `return FatalError`.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    f259d14cca
  8. refactor,blocks: inline `WriteBlockToDisk`
    `WriteBlockToDisk` wasn't really extracting a meaningful subset of the `SaveBlockToDisk` functionality, it's tied closely to the only caller (needs the header size twice, recalculated block serializes size, returns multiple branches, mutates parameter).
    
    The inlined code should only differ in these parts (modernization will be done in other commits):
    * renamed `blockPos` to `pos` in `SaveBlockToDisk` to match the parameter name;
    * changed `return false` to `return FlatFilePos()`.
    
    Also removed remaining references to `SaveBlockToDisk`.
    
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    e76130d635
  9. refactor,blocks: cache block serialized size for consecutive calls
    For consistency `UNDO_DATA_DISK_OVERHEAD` was also extracted to avoid the constant's ambiguity.
    85ef8558b0
  10. refactor,blocks: remove costly asserts
    When the behavior was changes in a previous commit (caching `GetSerializeSize` and avoiding `AutoFile.tell`), asserts were added to make sure the behavior was kept - to make sure reviewers and CI validates it.
    We can safely remove them now.
    
    Co-authored-by: Anthony Towns <aj@erisian.com.au>
    c509c62db0
  11. scripted-diff: rename block and undo functions for consistency
    Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
    
    -BEGIN VERIFY SCRIPT-
    sed -i \
        -e 's/\bSaveBlockToDisk\b/SaveBlock/g' \
        -e 's/\bWriteUndoDataForBlock\b/SaveBlockUndo/g' \
        $(git ls-files)
    -END VERIFY SCRIPT-
    ed8ba94b86
  12. optimization: Buffer serialization reads in `UndoReadFromDisk` and `ReadBlockFromDisk`
    The Obfuscation (XOR) operations are currently done byte-by-byte during serialization, buffering the reads will enable batching the obfuscation operations later (not yet done here).
    
    Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).
    
    Buffered reading was done by first exhausting the buffer, and if more data is still needed, reading directly into the destination (to avoid copying into the buffer when we have huge destination Spans, such as the many megabyte vectors in later blocks), and lastly refilling the buffer completely.
    
    Running the `ReadBlockFromDiskTest` benchmarks with different buffer sizes indicated that 16 KiB is the optimal buffer size (roughly 25% faster than master).
    
    Testing was done by randomly reading data from the same file with `AutoFile` and `BufferedReadOnlyFile` and making sure that the exact same data is read.
    
    `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).
    
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,288,264.16 |              437.01 |    0.2% |     11.00 | `ReadBlockFromDiskBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,847,259.64 |              541.34 |    0.2% |     11.03 | `ReadBlockFromDiskBench`
    d3d3955c99
  13. optimization: Buffer serialization writes in `SaveBlockUndo` and `SaveBlock`
    Similarly to the serialization reads, buffered writes will enable batched xor calculations - especially since currently we need to copy the write inputs Span to do the obfuscation on it, batching will enable doing the xor on the internal buffer instead.
    
    All write operations are delegated to `AutoFile`.
    Xor key offsets are also calculated based on where we are in the underlying file.
    To avoid the buffered write of 4096 in `AutoFile.write`, we're disabling obfuscation for the underlying `m_dest` and doing it ourselves for the whole buffer (instead of byte-by-byte) before writing it to file.
    We can't obfuscate the write's src directly since it would mutate the method's input (for very big writes it would avoid copying), but we can fill our own buffer and xor all of that safely.
    
    `BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).
    
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        5,260,934.43 |              190.08 |    1.2% |     11.08 | `SaveBlockToDiskBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,804,208.61 |              554.26 |    1.4% |     10.89 | `SaveBlockToDiskBench`
    1d99994da8
  14. l0rinc force-pushed on Dec 21, 2024
  15. l0rinc renamed this:
    optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization
    on Dec 21, 2024
  16. l0rinc renamed this:
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization
    optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD
    on Dec 21, 2024
  17. bitcoin deleted a comment on Jan 4, 2025


l0rinc DrahtBot


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-21 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me