optimization: speed up block serialization #31868

pull l0rinc wants to merge 5 commits into bitcoin:master from l0rinc:lorinc/block-serialization-optimizations changing 6 files +150 −80
  1. l0rinc commented at 4:48 pm on February 14, 2025: contributor

    This PR contains a few different optimizations I found by IBD profiling and via the newly added block serialization benchmarks.

    The commits merge similar (de)serialization methods and separate them internally with if constexpr, similar to how it has been done here before. This enabled further SizeComputer optimizations as well.

    In addition, single-byte writes are very common: they are used for every (u)int8_t, std::byte, and bool, and for the first byte of every VarInt, which every (pre)Vector also needs. It therefore makes sense to bypass the generalized serialization infrastructure where it isn’t needed:

    • AutoFile write no longer needs to allocate a 4k buffer for a single byte;
    • VectorWriter and DataStream avoid memcpy/insert calls.

    DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

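    Before (AppleClang 16.0.0.16000026):

    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          936,285.45 |            1,068.05 |    0.1% |     11.01 | `DeserializeBlock`
    |          194,330.04 |            5,145.88 |    0.2% |     10.97 | `SerializeBlock`
    |           12,215.05 |           81,866.19 |    0.0% |     11.00 | `SizeComputerBlock`

    After (AppleClang 16.0.0.16000026):

    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          888,859.82 |            1,125.04 |    0.4% |     10.87 | `DeserializeBlock`
    |          168,502.88 |            5,934.62 |    0.1% |     10.99 | `SerializeBlock`
    |           10,200.88 |           98,030.75 |    0.1% |     11.00 | `SizeComputerBlock`

    • DeserializeBlock: 5.3% faster
    • SerializeBlock: 15.3% faster
    • SizeComputerBlock: 19.7% faster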
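
    The percentages above are consistent with dividing the "before" ns/block by the "after" ns/block and subtracting one; that reading is an assumption, since the PR does not state the formula. A minimal sketch, with `SpeedupPercent` as a hypothetical helper name:

```cpp
// Percentage by which the "after" run is faster than the "before" run,
// computed from ns/block timings (assumed definition: before/after - 1).
constexpr double SpeedupPercent(double before_ns, double after_ns)
{
    return (before_ns / after_ns - 1.0) * 100.0;
}
```

    For example, `SpeedupPercent(936285.45, 888859.82)` comes out at roughly 5.3, matching the DeserializeBlock figure.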


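    Before (GNU 13.3.0):

    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,447,243.87 |              224.86 |    0.0% |   53,689,737.58 |   15,966,336.86 |  3.363 |   2,409,315.46 |    0.5% |     11.01 | `DeserializeBlock`
    |          869,833.14 |            1,149.65 |    0.0% |    8,015,883.90 |    3,123,013.80 |  2.567 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           26,535.51 |           37,685.36 |    0.0% |      225,261.03 |       95,278.40 |  2.364 |      53,037.03 |    0.6% |     11.00 | `SizeComputerBlock`

    After (GNU 13.3.0):

    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,460,428.52 |              224.19 |    0.0% |   53,692,507.13 |   16,015,347.97 |  3.353 |   2,410,105.48 |    0.5% |     11.01 | `DeserializeBlock`
    |          567,042.65 |            1,763.54 |    0.0% |    7,386,775.59 |    2,035,613.84 |  3.629 |   1,385,368.57 |    0.5% |     11.01 | `SerializeBlock`
    |           25,728.56 |           38,867.32 |    0.0% |      172,750.03 |       92,366.64 |  1.870 |      42,131.03 |    1.7% |     11.00 | `SizeComputerBlock`

    • DeserializeBlock: same speed
    • SerializeBlock: 53.3% faster
    • SizeComputerBlock: 3.1% faster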


    While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

    COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
    STOP_HEIGHT=880000; DBCACHE=10000; \
    C_COMPILER=gcc; CXX_COMPILER=g++; \
    hyperfine \
    --export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
    --runs 3 \
    --parameter-list COMMIT ${COMMITS// /,} \
    --prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
    --cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
    "COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"

    Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
      Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
      Range (min … max):   33283.439 s … 34229.026 s    3 runs

    Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
      Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
      Range (min … max):   32888.211 s … 33253.706 s    3 runs

    Summary
      COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
        1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    
  2. bench: measure block (size)serialization speed
    `SizeComputer` is a special serializer that returns the exact final size of the serialized content.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          936,285.45 |            1,068.05 |    0.1% |     11.01 | `DeserializeBlock`
    |          194,330.04 |            5,145.88 |    0.2% |     10.97 | `SerializeBlock`
    |           12,215.05 |           81,866.19 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,447,243.87 |              224.86 |    0.0% |   53,689,737.58 |   15,966,336.86 |  3.363 |   2,409,315.46 |    0.5% |     11.01 | `DeserializeBlock`
    |          869,833.14 |            1,149.65 |    0.0% |    8,015,883.90 |    3,123,013.80 |  2.567 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           26,535.51 |           37,685.36 |    0.0% |      225,261.03 |       95,278.40 |  2.364 |      53,037.03 |    0.6% |     11.00 | `SizeComputerBlock`
    cbb8ff7211
  3. refactor: reduce template bloat in primitive serialization
    Merged multiple template methods into a single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can still be optimized because of C++20 constexpr conditions).
    This unifies related methods that were previously tied together only by similar signatures, and enables `SizeComputer` optimizations later.
    f6c414c722
  4. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` 7223807fac
  5. optimization: Add single byte write
    Single-byte writes are very common: they are used for every (u)int8_t, std::byte, and bool, and for the first byte of every VarInt, which every (pre)Vector also needs.
    It makes sense to bypass the generalized serialization infrastructure where it isn't needed:
    * AutoFile write no longer needs to allocate a 4k buffer for a single byte;
    * `VectorWriter` and `DataStream` avoid memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          934,120.45 |            1,070.53 |    0.2% |     11.01 | `DeserializeBlock`
    |          170,719.27 |            5,857.57 |    0.1% |     10.99 | `SerializeBlock`
    |           12,048.40 |           82,998.58 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,433,835.04 |              225.54 |    0.0% |   53,688,481.60 |   15,918,730.23 |  3.373 |   2,409,056.47 |    0.5% |     11.01 | `DeserializeBlock`
    |          563,663.10 |            1,774.11 |    0.0% |    7,386,775.59 |    2,023,525.77 |  3.650 |   1,385,368.57 |    0.5% |     11.00 | `SerializeBlock`
    |           27,351.60 |           36,560.93 |    0.1% |      225,261.03 |       98,209.77 |  2.294 |      53,037.03 |    0.9% |     11.00 | `SizeComputerBlock`
    2d8a85cea7
  6. optimization: merge SizeComputer specializations + add new ones
    Endianness doesn't affect the final size, so we can skip the byte-order handling for `SizeComputer`.
    We can fold the previous calls into the existing method with `if constexpr`, short-circuiting the existing logic when only the serialized size is needed.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          888,859.82 |            1,125.04 |    0.4% |     10.87 | `DeserializeBlock`
    |          168,502.88 |            5,934.62 |    0.1% |     10.99 | `SerializeBlock`
    |           10,200.88 |           98,030.75 |    0.1% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,460,428.52 |              224.19 |    0.0% |   53,692,507.13 |   16,015,347.97 |  3.353 |   2,410,105.48 |    0.5% |     11.01 | `DeserializeBlock`
    |          567,042.65 |            1,763.54 |    0.0% |    7,386,775.59 |    2,035,613.84 |  3.629 |   1,385,368.57 |    0.5% |     11.01 | `SerializeBlock`
    |           25,728.56 |           38,867.32 |    0.0% |      172,750.03 |       92,366.64 |  1.870 |      42,131.03 |    1.7% |     11.00 | `SizeComputerBlock`
    a7db42f17b
  7. DrahtBot commented at 4:48 pm on February 14, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31682 (optimization: speed up CheckBlock input checks (duplicate detection & nulls) by l0rinc)
    • #31551 (optimization: bulk reads(32%)/writes(298%) in [undo]block [de]serialization, ~6% faster IBD by l0rinc)
    • #31519 (refactor: Use std::span over Span by maflcko)
    • #31144 (optimization: batch XOR operations 12% faster IBD by l0rinc)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. in src/streams.h:86 in 2d8a85cea7 outdated
    82@@ -83,6 +83,16 @@ class VectorWriter
    83         }
    84         nPos += src.size();
    85     }
    86+    void write(std::byte val)
    


    theuni commented at 5:13 pm on February 14, 2025:

    These are nice optims, but as I mentioned here:

    In the future we could potentially add specializations for types with compile-time-known sizes via concepts.

    Presumably some of the streams could benefit from static extents, but that’s waaay overkill for here.

    I think it makes sense to wait for the std::span replacement (#31519) to do this, that way we can specialize for any static extent instead which should compile down to nothing: https://compiler-explorer.com/z/97aY3bnK8


    l0rinc commented at 8:33 pm on February 14, 2025:

    Absolutely, I already have other optimization ideas in mind after that’s merged.

    The reviewers can decide the preferred merge order, I don’t mind rebasing or doing it in multiple PRs - there’s a lot of work left with serialization anyway.

  9. DrahtBot added the label CI failed on Feb 16, 2025
  10. DrahtBot removed the label CI failed on Feb 16, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-02-22 06:12 UTC
