[IBD] specialize block serialization #31868

pull l0rinc wants to merge 6 commits into bitcoin:master from l0rinc:lorinc/block-serialization-optimizations changing 12 files +231 −107
  1. l0rinc commented at 4:48 pm on February 14, 2025: contributor

    This change is part of [IBD] - Tracking PR for speeding up Initial Block Download


    This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don’t like all the specializations.


    Summary

    This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

    The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

    Context

    Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt’s first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn’t needed:

    • AutoFile write doesn’t need to allocate 4k buffer for a single byte now;
    • VectorWriter and DataStream avoids memcpy/insert calls;
    • CSHA256::Write can avoid memcpy.

    DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

    Measurements

    Before:

    ns/block block/s err% total benchmark
    195,610.62 5,112.20 0.3% 11.00 SerializeBlock
    12,061.83 82,906.19 0.1% 11.01 SizeComputerBlock

    After:

    ns/block block/s err% total benchmark
    174,569.19 5,728.39 0.6% 10.89 SerializeBlock
    10,241.16 97,645.21 0.0% 11.00 SizeComputerBlock

    SerializeBlock - ~12.% faster SizeComputerBlock - ~17.7% faster


    Before:

    ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
    867,857.55 1,152.26 0.0% 8,015,883.90 3,116,099.08 2.572 1,517,035.87 0.5% 10.81 SerializeBlock
    30,928.27 32,332.88 0.0% 221,683.03 111,055.84 1.996 53,037.03 0.8% 11.03 SizeComputerBlock

    After:

    ns/block block/s err% ins/block cyc/block IPC bra/block miss% total benchmark
    615,000.56 1,626.01 0.0% 8,015,883.64 2,208,340.88 3.630 1,517,035.62 0.5% 10.56 SerializeBlock
    25,676.76 38,945.72 0.0% 159,390.03 92,202.10 1.729 42,131.03 0.9% 11.00 SizeComputerBlock

    SerializeBlock - ~41.1% faster SizeComputerBlock - ~20.4% faster


    While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

     0COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
     1STOP_HEIGHT=880000; DBCACHE=10000; \
     2C_COMPILER=gcc; CXX_COMPILER=g++; \
     3hyperfine \
     4--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
     5--runs 3 \
     6--parameter-list COMMIT ${COMMITS// /,} \
     7--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
     8--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
     9"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
    10Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    11  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
    12  Range (min … max):   33283.439 s … 34229.026 s    3 runs
    13 
    14Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    15  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
    16  Range (min … max):   32888.211 s … 33253.706 s    3 runs
    17 
    18Summary
    19  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
    20    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
    
  2. DrahtBot commented at 4:48 pm on February 14, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32296 (refactor: reenable implicit-integer-sign-change check for serialize.h by l0rinc)
    • #32043 ([IBD] Tracking PR for speeding up Initial Block Download by l0rinc)
    • #31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)
    • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
    • #28792 (Embed default ASMap as binary dump header file by fjahr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible typos and grammar issues:

    nullpt → nullptr

  3. in src/streams.h:86 in 2d8a85cea7 outdated
    82@@ -83,6 +83,16 @@ class VectorWriter
    83         }
    84         nPos += src.size();
    85     }
    86+    void write(std::byte val)
    


    theuni commented at 5:13 pm on February 14, 2025:

    These are nice optims, but as I mentoned here:

    In the future we could potentially add specializations for types with compile-time-known sizes via concepts.

    Presumably some of the streams could benefit from static extents, but that’s waaay overkill for here.

    I think it makes sense to wait for the std::span replacement (#31519) to do this, that way we can specialize for any static extent instead which should compile down to nothing: https://compiler-explorer.com/z/97aY3bnK8


    l0rinc commented at 8:33 pm on February 14, 2025:

    Absolutely, I already have other optimization ideas in mind after that’s merged.

    The reviewers can decide the preferred merge order, I don’t mind rebasing or doing it in multiple PRs - there’s a lot of work left with serialization anyway.


    l0rinc commented at 0:01 am on March 10, 2025:
    Rebased on top of #31519 and experimented with static extents - the speed is not the same as with bare std::byte parameters, but close enough and the code is more generalized. Drafting until the span PR is merged - suggestions for further investigations are welcome.

    l0rinc commented at 11:45 am on March 20, 2025:
    Now that #31519 was merged, I’ve rebased an moved it out of draft.
  4. DrahtBot added the label CI failed on Feb 16, 2025
  5. DrahtBot removed the label CI failed on Feb 16, 2025
  6. l0rinc force-pushed on Mar 9, 2025
  7. l0rinc commented at 11:59 pm on March 9, 2025: contributor
    Drafting until #31519 is merged, as recommended in #31868 (review)
  8. l0rinc marked this as a draft on Mar 9, 2025
  9. l0rinc force-pushed on Mar 10, 2025
  10. DrahtBot added the label CI failed on Mar 10, 2025
  11. DrahtBot commented at 0:09 am on March 10, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38458137649

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  12. l0rinc force-pushed on Mar 10, 2025
  13. l0rinc force-pushed on Mar 10, 2025
  14. DrahtBot removed the label CI failed on Mar 10, 2025
  15. l0rinc renamed this:
    optimization: speed up block serialization
    [IBD] Specialize block serialization
    on Mar 12, 2025
  16. l0rinc renamed this:
    [IBD] Specialize block serialization
    [IBD] specialize block serialization
    on Mar 12, 2025
  17. fanquake referenced this in commit a799415d84 on Mar 17, 2025
  18. DrahtBot added the label Needs rebase on Mar 20, 2025
  19. l0rinc force-pushed on Mar 20, 2025
  20. l0rinc marked this as ready for review on Mar 20, 2025
  21. DrahtBot removed the label Needs rebase on Mar 20, 2025
  22. in src/serialize.h:262 in c9a69f9088 outdated
    256-template<typename Stream> inline void Serialize(Stream& s, int64_t a ) { ser_writedata64(s, a); }
    257-template<typename Stream> inline void Serialize(Stream& s, uint64_t a) { ser_writedata64(s, a); }
    258+template <typename Stream, ByteOrIntegral T> void Serialize(Stream& s, T a)
    259+{
    260+    if constexpr (sizeof(T) == 1) {
    261+        ser_writedata8(s, static_cast<uint8_t>(a));   // (u)int8_t or std::byte or bool
    


    TheCharlatan commented at 8:59 am on April 17, 2025:

    In commit c9a69f9088340df88017752e1016670141b6ad74:

    Looking at the concept, could this also be unsigned char?


    l0rinc commented at 3:46 pm on April 17, 2025:

    Yes:

    typedef unsigned char uint8_t;

  23. in src/serialize.h:56 in 2c136648dd outdated
    47@@ -48,6 +48,16 @@ static const unsigned int MAX_VECTOR_ALLOCATE = 5000000;
    48 struct deserialize_type {};
    49 constexpr deserialize_type deserialize {};
    50 
    51+class SizeComputer;
    52+
    53+//! Check if type contains a stream by seeing if it has a GetStream() method.
    54+template<typename T>
    55+concept ContainsStream = requires(T t) { t.GetStream(); };
    56+
    


    TheCharlatan commented at 9:58 am on April 17, 2025:
    Nit: Maybe add a description here too along the lines of: Check if type contains a SizeComputer by seeing if the return type of T’s GetStream() method is a SizeComputer.
  24. l0rinc force-pushed on Apr 17, 2025
  25. TheCharlatan commented at 11:58 am on April 17, 2025: contributor
    This looks good, but I am not really reproducing any of the performance changes on my machine yet. Maybe my laptop is just too unreliable though.
  26. l0rinc commented at 12:28 pm on April 17, 2025: contributor

    but I am not really reproducing any of the performance changes on my machine yet

    I just removed the single-byte specializations, they were quite ugly, will have to add it back somehow later - will draft it until I do that, thanks for checking!

  27. l0rinc marked this as a draft on Apr 17, 2025
  28. l0rinc force-pushed on Apr 17, 2025
  29. l0rinc commented at 12:50 pm on April 17, 2025: contributor
    @TheCharlatan, I’ve added back the single-byte optimizations (simplified slightly after rebase), your feedback is appreciated.
  30. l0rinc marked this as ready for review on Apr 17, 2025
  31. in src/crypto/sha256.cpp:726 in 5317366a83 outdated
    722@@ -723,6 +723,21 @@ CSHA256& CSHA256::Write(const unsigned char* data, size_t len)
    723     }
    724     return *this;
    725 }
    726+CSHA256& CSHA256::Write(unsigned char data)
    


    TheCharlatan commented at 1:58 pm on April 17, 2025:
    I’m not sure about all these crypto changes. Will this be noticable at all? Why only do it for the sha256 hasher? Maybe do these once/if the other single byte steam writer changes are accepted.
  32. TheCharlatan commented at 2:10 pm on April 17, 2025: contributor
    This looks ok, albeit that I’m seeing more modest performance improvements on my end.
  33. maflcko commented at 3:40 pm on April 17, 2025: member

    While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

    If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

    Also, what code part is hitting this during IBD, given that reads and writes are now buffered?

  34. l0rinc marked this as a draft on Apr 17, 2025
  35. l0rinc commented at 3:47 pm on April 17, 2025: contributor

    what code part is hitting this during IBD

    During serialization we have very many 1 byte writes, this change avoids the heavy allocations. But drafting until I have time to properly remeasure everything after the recent merge.

  36. maflcko commented at 3:50 pm on April 17, 2025: member
    I understand that there are many 1-byte writes, but they go into the BufferedReader/Writer, which is not modified here. The Buffered* then passes them to AutoFile, which is modified here, but never receives 1-byte writes anymore
  37. bench: measure block (size)serialization speed
    Measure both full block serialization and size computation via `SizeComputer`.
    `SizeComputer` returns the exact final size of the serialized content without writing any bytes.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
    |           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`
    9d3c943d55
  38. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` ffc615edeb
  39. refactor: reduce template bloat in primitive serialization
    Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions).
    This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later
    e00ce6ed3c
  40. refactor: add explicit static extent to spans cbdb759d0e
  41. optimization: merge `SizeComputer` specializations and add new overloads
    Endianness doesn’t affect final size, so skip it in `SizeComputer`.
    Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
    |           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    f40e3487d5
  42. optimization: add single byte writes
    Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
    It makes sense to avoid the generalized serialization infrastructure that isn't needed:
    * AutoFile write doesn't need to allocate 4k buffer for a single byte now;
    * `VectorWriter` and `DataStream` avoids memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
    |           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    073e28b1e9
  43. l0rinc force-pushed on Apr 19, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-04-27 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me