[IBD] specialize block serialization

l0rinc commented at 4:48 pm on February 14, 2025: contributor

This change is part of [IBD] - Tracking PR for speeding up Initial Block Download

This PR is drafted until I remeasure everything after the recent merges and I need to find a way to simplify the 1 byte writes more nicely, I don’t like all the specializations.

Summary

This PR contain a few different optimization I found by IBD profiling, and via the newly added block seralization benchmarks. It also takes advantage of the recently merged std::span changes enabling propagating static extents.

The commits merge similar (de)serialization methods, and separates them internally with if constexpr - similarly to how it has been done here before. This enabled further SizeComputer optimizations as well.

Context

Other than these, since single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt’s first byte which is also needed for every (pre)Vector), it makes sense to avoid the generalized serialization infrastructure that isn’t needed:

AutoFile write doesn’t need to allocate 4k buffer for a single byte now;
VectorWriter and DataStream avoids memcpy/insert calls;
CSHA256::Write can avoid memcpy.

DeserializeBlock is dominated by the hash calculations so the optimizations barely affect it.

Measurements

Before:

ns/block	block/s	err%	total	benchmark
195,610.62	5,112.20	0.3%	11.00	`SerializeBlock`
12,061.83	82,906.19	0.1%	11.01	`SizeComputerBlock`

After:

ns/block	block/s	err%	total	benchmark
174,569.19	5,728.39	0.6%	10.89	`SerializeBlock`
10,241.16	97,645.21	0.0%	11.00	`SizeComputerBlock`

SerializeBlock - ~12.% faster SizeComputerBlock - ~17.7% faster

Before:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
867,857.55	1,152.26	0.0%	8,015,883.90	3,116,099.08	2.572	1,517,035.87	0.5%	10.81	`SerializeBlock`
30,928.27	32,332.88	0.0%	221,683.03	111,055.84	1.996	53,037.03	0.8%	11.03	`SizeComputerBlock`

After:

ns/block	block/s	err%	ins/block	cyc/block	IPC	bra/block	miss%	total	benchmark
615,000.56	1,626.01	0.0%	8,015,883.64	2,208,340.88	3.630	1,517,035.62	0.5%	10.56	`SerializeBlock`
25,676.76	38,945.72	0.0%	159,390.03	92,202.10	1.729	42,131.03	0.9%	11.00	`SizeComputerBlock`

SerializeBlock - ~41.1% faster SizeComputerBlock - ~20.4% faster

While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

 0COMMITS="05314bde0b06b820225f10c6529b5afae128ff81 1cd94ec2511874ec68b92db34ad7ec7d9534fed1"; \
 1STOP_HEIGHT=880000; DBCACHE=10000; \
 2C_COMPILER=gcc; CXX_COMPILER=g++; \
 3hyperfine \
 4--export-json "/mnt/my_storage/ibd-${COMMITS// /-}-${STOP_HEIGHT}-${DBCACHE}-${C_COMPILER}.json" \
 5--runs 3 \
 6--parameter-list COMMIT ${COMMITS// /,} \
 7--prepare "killall bitcoind || true; rm -rf /mnt/my_storage/BitcoinData/*; git checkout {COMMIT}; git clean -fxd; git reset --hard; cmake -B build -DCMAKE_BUILD_TYPE=Release -DENABLE_WALLET=OFF -DCMAKE_C_COMPILER=$C_COMPILER -DCMAKE_CXX_COMPILER=$CXX_COMPILER && cmake --build build -j$(nproc) --target bitcoind && ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=1 -printtoconsole=0 || true" \
 8--cleanup "cp /mnt/my_storage/BitcoinData/debug.log /mnt/my_storage/logs/debug-{COMMIT}-$(date +%s).log || true" \
 9"COMPILER=$C_COMPILER COMMIT={COMMIT} ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=$STOP_HEIGHT -dbcache=$DBCACHE -prune=550 -printtoconsole=0"
10Benchmark 1: COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
11  Time (mean ± σ):     33647.918 s ± 508.655 s    [User: 71503.409 s, System: 4404.899 s]
12  Range (min … max):   33283.439 s … 34229.026 s    3 runs
13 
14Benchmark 2: COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0
15  Time (mean ± σ):     33062.491 s ± 183.335 s    [User: 71246.532 s, System: 4318.490 s]
16  Range (min … max):   32888.211 s … 33253.706 s    3 runs
17 
18Summary
19  COMPILER=gcc COMMIT=1cd94ec2511874ec68b92db34ad7ec7d9534fed1 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0 ran
20    1.02 ± 0.02 times faster than COMPILER=gcc COMMIT=05314bde0b06b820225f10c6529b5afae128ff81 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=880000 -dbcache=10000 -prune=550 -printtoconsole=0

DrahtBot commented at 4:48 pm on February 14, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31868.

Reviews

See the guideline for information on the review process. A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#33026 (test, refactor: Embedded ASmap selected preparatory work by fjahr)
#32729 (test,refactor: extract script template helpers & widen sigop count coverage by l0rinc)
#31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)
#28792 (Embed default ASMap as binary dump header file by fjahr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

in src/streams.h:86 in 2d8a85cea7 outdated

82@@ -83,6 +83,16 @@ class VectorWriter
83         }
84         nPos += src.size();
85     }
86+    void write(std::byte val)

theuni commented at 5:13 pm on February 14, 2025:

These are nice optims, but as I mentoned here:

In the future we could potentially add specializations for types with compile-time-known sizes via concepts.

Presumably some of the streams could benefit from static extents, but that’s waaay overkill for here.

I think it makes sense to wait for the std::span replacement (#31519) to do this, that way we can specialize for any static extent instead which should compile down to nothing: https://compiler-explorer.com/z/97aY3bnK8

l0rinc commented at 8:33 pm on February 14, 2025:

Absolutely, I already have other optimization ideas in mind after that’s merged.

The reviewers can decide the preferred merge order, I don’t mind rebasing or doing it in multiple PRs - there’s a lot of work left with serialization anyway.

l0rinc commented at 0:01 am on March 10, 2025:

Rebased on top of #31519 and experimented with static extents - the speed is not the same as with bare std::byte parameters, but close enough and the code is more generalized. Drafting until the span PR is merged - suggestions for further investigations are welcome.

l0rinc commented at 11:45 am on March 20, 2025:

Now that #31519 was merged, I’ve rebased an moved it out of draft.

DrahtBot added the label CI failed on Feb 16, 2025

DrahtBot removed the label CI failed on Feb 16, 2025

l0rinc force-pushed on Mar 9, 2025

l0rinc commented at 11:59 pm on March 9, 2025: contributor

Drafting until #31519 is merged, as recommended in #31868 (review)

l0rinc marked this as a draft on Mar 9, 2025

l0rinc force-pushed on Mar 10, 2025

DrahtBot added the label CI failed on Mar 10, 2025

DrahtBot commented at 0:09 am on March 10, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38458137649

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc force-pushed on Mar 10, 2025

DrahtBot removed the label CI failed on Mar 10, 2025

l0rinc renamed this:
~~optimization: speed up block serialization~~
[IBD] Specialize block serialization
on Mar 12, 2025

l0rinc renamed this:
~~[IBD] Specialize block serialization~~
[IBD] specialize block serialization
on Mar 12, 2025

fanquake referenced this in commit a799415d84 on Mar 17, 2025

DrahtBot added the label Needs rebase on Mar 20, 2025

l0rinc force-pushed on Mar 20, 2025

l0rinc marked this as ready for review on Mar 20, 2025

DrahtBot removed the label Needs rebase on Mar 20, 2025

in src/serialize.h:261 in c9a69f9088 outdated

256-template<typename Stream> inline void Serialize(Stream& s, int64_t a ) { ser_writedata64(s, a); }
257-template<typename Stream> inline void Serialize(Stream& s, uint64_t a) { ser_writedata64(s, a); }
258+template <typename Stream, ByteOrIntegral T> void Serialize(Stream& s, T a)
259+{
260+    if constexpr (sizeof(T) == 1) {
261+        ser_writedata8(s, static_cast<uint8_t>(a));   // (u)int8_t or std::byte or bool

TheCharlatan commented at 8:59 am on April 17, 2025:

In commit c9a69f9088340df88017752e1016670141b6ad74:

Looking at the concept, could this also be unsigned char?

l0rinc commented at 3:46 pm on April 17, 2025:

Yes:

typedef unsigned char uint8_t;

in src/serialize.h:56 in 2c136648dd outdated

47@@ -48,6 +48,16 @@ static const unsigned int MAX_VECTOR_ALLOCATE = 5000000;
48 struct deserialize_type {};
49 constexpr deserialize_type deserialize {};
50 
51+class SizeComputer;
52+
53+//! Check if type contains a stream by seeing if it has a GetStream() method.
54+template<typename T>
55+concept ContainsStream = requires(T t) { t.GetStream(); };
56+

TheCharlatan commented at 9:58 am on April 17, 2025:

Nit: Maybe add a description here too along the lines of: Check if type contains a SizeComputer by seeing if the return type of T’s GetStream() method is a SizeComputer.

l0rinc force-pushed on Apr 17, 2025

TheCharlatan commented at 11:58 am on April 17, 2025: contributor

This looks good, but I am not really reproducing any of the performance changes on my machine yet. Maybe my laptop is just too unreliable though.

l0rinc commented at 12:28 pm on April 17, 2025: contributor

but I am not really reproducing any of the performance changes on my machine yet

I just removed the single-byte specializations, they were quite ugly, will have to add it back somehow later - will draft it until I do that, thanks for checking!

l0rinc marked this as a draft on Apr 17, 2025

l0rinc force-pushed on Apr 17, 2025

l0rinc commented at 12:50 pm on April 17, 2025: contributor

@TheCharlatan, I’ve added back the single-byte optimizations (simplified slightly after rebase), your feedback is appreciated.

l0rinc marked this as ready for review on Apr 17, 2025

in src/crypto/sha256.cpp:724 in 5317366a83 outdated

722@@ -723,6 +723,21 @@ CSHA256& CSHA256::Write(const unsigned char* data, size_t len)
723     }
724     return *this;
725 }
726+CSHA256& CSHA256::Write(unsigned char data)

TheCharlatan commented at 1:58 pm on April 17, 2025:

I’m not sure about all these crypto changes. Will this be noticable at all? Why only do it for the sha256 hasher? Maybe do these once/if the other single byte steam writer changes are accepted.

TheCharlatan commented at 2:10 pm on April 17, 2025: contributor

This looks ok, albeit that I’m seeing more modest performance improvements on my end.

maflcko commented at 3:40 pm on April 17, 2025: member

While this wasn’t the main motivation for the change, IBD on Ubuntu/GCC on SSD with i9 indicates a 2% speedup as well:

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

Also, what code part is hitting this during IBD, given that reads and writes are now buffered?

l0rinc marked this as a draft on Apr 17, 2025

l0rinc commented at 3:47 pm on April 17, 2025: contributor

what code part is hitting this during IBD

During serialization we have very many 1 byte writes, this change avoids the heavy allocations. But drafting until I have time to properly remeasure everything after the recent merge.

maflcko commented at 3:50 pm on April 17, 2025: member

I understand that there are many 1-byte writes, but they go into the BufferedReader/Writer, which is not modified here. The Buffered* then passes them to AutoFile, which is modified here, but never receives 1-byte writes anymore

l0rinc force-pushed on Apr 19, 2025

TheCharlatan commented at 3:43 pm on May 1, 2025: contributor

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I’d expect outside of just pure block validation.

maflcko commented at 1:15 pm on May 4, 2025: member

If IBD speedup is not the main motivation, what else is the motivation, given that the title mentions [IBD]?

I think block serialization being fast could be useful for future kernel users. Reading and de-serializing blocks is one of the common operations I’d expect outside of just pure block validation.

Ok, seems fine. It would be good to list the main motivation(s) in the OP. Currently, it only says IBD and that IBD isn’t the main motivation. However, I don’t understand how this could have any effect at all, from looking at the code, see #31868 (comment).

l0rinc commented at 1:49 pm on May 4, 2025: contributor

I will work on this more in the upcoming weeks, drafted it for now since I don’t like the current single-byte specializations, have to find a better way. Suggestions are welcome.

maflcko commented at 2:46 pm on May 4, 2025: member

My guess is that the concept specialization here at most could have an effect on xor’d IO? This is because anything not xor’d won’t need a buffer to (un)do the xor before reading or writing.

Thus, it just seems easier to use the existing BufferedReader/Writer in places that could benefit from it?

However, for block serialization they are already used.

Again, no objection here, but it would be good to explain:

What is the goal of the changes here, and what real-world usage do they affect
How is it achieved and how do the benchmarks represent the real-world usage

Right now:

It only states IBD as motivation and also that IBD isn’t the motivation. Also the IBD speedup measurement seems outdated.
The added benchmark is about test-only code paths (DataStream) or p2p code paths (VectorWriter). It is unclear what the AutoFile changes are about. Also, it is unclear how this could possibly affect the kernel, given that the kernel doesn’t call into test-only or p2p-only code paths and instead uses BufferedReader/Writer.

DrahtBot added the label CI failed on May 16, 2025

DrahtBot added the label Needs rebase on May 16, 2025

l0rinc force-pushed on Jul 29, 2025

l0rinc commented at 11:34 pm on July 29, 2025: contributor

Rebased to solve conflict, the PR is still in draft mode until I remeasure its effect on IBD - and I need to find a nicer way to do the single-byte specializations, since they have a measurable effect but make the code super-ugly… Suggestions are welcome. In the meantime I have split out the first refactor into #33093.

DrahtBot removed the label Needs rebase on Jul 30, 2025

DrahtBot removed the label CI failed on Jul 30, 2025

achow101 referenced this in commit 8712e074bb on Jul 31, 2025

bench: measure block (size)serialization speed

Measure both full block serialization and size computation via `SizeComputer`.
`SizeComputer` returns the exact final size of the serialized content without writing any bytes.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
|           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
|           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`

e4bf6e5fad

refactor: reduce template bloat in primitive serialization

Merged multiple template methods into single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can be optimized because of C++20 constexpr conditions).
This unifies related methods that were only bound before by similar signatures - and enables `SizeComputer` optimizations later

80397c0c7b

refactor: add explicit static extent to spans f75a494a4b

optimization: merge `SizeComputer` specializations and add new overloads

Endianness doesn’t affect final size, so skip it in `SizeComputer`.
Fold existing overloads into one implementation, short‑circuiting logic when only the serialized size is needed.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
|           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`

4b80568081

optimization: add single byte writes

Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure that isn't needed:
* AutoFile write doesn't need to allocate 4k buffer for a single byte now;
* `VectorWriter` and `DataStream` avoids memcpy/insert calls.

> cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000

> C compiler ............................ AppleClang 16.0.0.16000026

|            ns/block |             block/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
|           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`

> C++ compiler .......................... GNU 13.3.0

|            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
|           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`

f4cb5591a4