[IBD] specialize block serialization #31868

pull l0rinc wants to merge 6 commits into bitcoin:master from l0rinc:lorinc/block-serialization-optimizations changing 11 files +240 −104
  1. l0rinc commented at 4:48 pm on February 14, 2025: contributor

    This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

    Summary

    This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

    The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

    Context

    Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt’s first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn’t needed:

    • AutoFile write doesn’t need to allocate 4k buffer for a single byte now;
    • VectorWriter and DataStream avoids memcpy/insert calls;
    • CSHA256::Write can avoid memcpy.

    DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

    Measurements

    Before:

    ns/block block/s err% total benchmark
    195,610.62 5,112.20 0.3% 11.00 SerializeBlock
    12,061.83 82,906.19 0.1% 11.01 SizeComputerBlock

    After:

    ns/block block/s err% total benchmark
    174,569.19 5,728.39 0.6% 10.89 SerializeBlock
    10,241.16 97,645.21 0.0% 11.00 SizeComputerBlock

    SerializeBlock - ~12.% faster SizeComputerBlock - ~17.7% faster


    Before:

    ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
    867,857.55 1,152.26 0.0% 8,015,883.90 3,116,099.08 2.572 1,517,035.87 0.5% 10.81 SerializeBlock
    30,928.27 32,332.88 0.0% 221,683.03 111,055.84 1.996 53,037.03 0.8% 11.03 SizeComputerBlock

    After:

    ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
    615,000.56 1,626.01 0.0% 8,015,883.64 2,208,340.88 3.630 1,517,035.62 0.5% 10.56 SerializeBlock
    25,676.76 38,945.72 0.0% 159,390.03 92,202.10 1.729 42,131.03 0.9% 11.00 SizeComputerBlock

    SerializeBlock - ~41.1% faster SizeComputerBlock - ~20.4% faster


    While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

     0COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
     1STOP_HEIGHT=880000; DBCACHE=10000; \
     2C_COMPILER=gcc; CXX_COMPILER=g++; \
     3hyperfine \
     4--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
     5--runs 3 \
     6--parameter-list COMMIT ${COMMITS// /,} \
     7--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
     8--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
     9"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
    10Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    11  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
    12  Range (min … max):   33283.439 s … 34229.026 s    3 runs
    13 
    14Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    15  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
    16  Range (min … max):   32888.211 s … 33253.706 s    3 runs
    17 
    18Summary
    19  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    20    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    
  2. DrahtBot commented at 4:48 pm on February 14, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. in src/streams.h:86 in 2d8a85cea7 outdated
    82@@ -83,6 +83,16 @@ class VectorWriter
    83         }
    84         nPos += src.size();
    85     }
    86+    void write(std::byte val)
    


    theuni commented at 5:13 pm on February 14, 2025:

    These are nice optims, but as I mentoned here:

    In the future we could potentially add specializations for types with compile-time-known sizes via concepts.

    Presumably some of the streams could benefit from static extents, but that’s waaay overkill for here.

    I think it makes sense to wait for the std::span replacement (#31519) to do this, that way we can specialize for any static extent instead which should compile down to nothing: https://compiler-explorer.com/z/97aY3bnK8


    l0rinc commented at 8:33 pm on February 14, 2025:

    Absolutely, I already have other optimization ideas in mind after that’s merged.

    The reviewers can decide the preferred merge order, I don’t mind rebasing or doing it in multiple PRs - there’s a lot of work left with serialization anyway.


    l0rinc commented at 0:01 am on March 10, 2025:
    Rebased on top of #31519 and experimented with static extents - the speed is not the same as with bare std::byte parameters, but close enough and the code is more generalized. Drafting until the span PR is merged - suggestions for further investigations are welcome.

    l0rinc commented at 11:45 am on March 20, 2025:
    Now that #31519 was merged, I’ve rebased an moved it out of draft.
  4. DrahtBot added the label CI failed on Feb 16, 2025
  5. DrahtBot removed the label CI failed on Feb 16, 2025
  6. l0rinc force-pushed on Mar 9, 2025
  7. l0rinc commented at 11:59 pm on March 9, 2025: contributor
    Drafting until #31519 is merged, as recommended in #31868 (review)
  8. l0rinc marked this as a draft on Mar 9, 2025
  9. l0rinc force-pushed on Mar 10, 2025
  10. DrahtBot added the label CI failed on Mar 10, 2025
  11. DrahtBot commented at 0:09 am on March 10, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38458137649

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  12. l0rinc force-pushed on Mar 10, 2025
  13. l0rinc force-pushed on Mar 10, 2025
  14. DrahtBot removed the label CI failed on Mar 10, 2025
  15. l0rinc renamed this:
    optimization: speed up block serialization
    [IBD] Specialize block serialization
    on Mar 12, 2025
  16. l0rinc renamed this:
    [IBD] Specialize block serialization
    [IBD] specialize block serialization
    on Mar 12, 2025
  17. fanquake referenced this in commit a799415d84 on Mar 17, 2025
  18. DrahtBot added the label Needs rebase on Mar 20, 2025
  19. bench: measure block (size)serialization speed
    The SizeComputer is a special serializer which returns what the exact final size will be of the serialized content.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
    |           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`
    beb903b88e
  20. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` 4790ae0c30
  21. refactor: reduce template bloat in primitive serialization
    Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions).
    This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later
    c9a69f9088
  22. refactor: add explicit static extent to spans 26a400c254
  23. optimization: merge SizeComputer specializations + add new ones
    Endianness doesn't affect the final size, we can skip it for `SizeComputer`.
    We can `if constexpr` previous calls into existing method, short-circuiting existing logic when we only need their serialized sizes.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
    |           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    2c136648dd
  24. optimization: add single byte writes
    Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
    It makes sense to avoid the generalized serialization infrastructure that isn't needed:
    * AutoFile write doesn't need to allocate 4k buffer for a single byte now;
    * `VectorWriter` and `DataStream` avoids memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
    |           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    681aad4fdf
  25. l0rinc force-pushed on Mar 20, 2025
  26. l0rinc marked this as ready for review on Mar 20, 2025
  27. DrahtBot removed the label Needs rebase on Mar 20, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-03-31 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me