optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD

l0rinc commented at 11:09 am on December 19, 2024: contributor

An alternative implementation of #31551, where the block is read into memory in chunks, resulting in similar gains, but is slightly more complicated, but uses less memory.

This is a precursor to #31144.

Currently, obfuscation operations are performed byte-by-byte during serialization. Buffering the reads allows batching these operations (implemented in #31144) and improves file access efficiency by reducing fread calls and associated locking overhead. Testing with various buffer sizes showed that 16 KiB provides the best performance.

For writes, buffering enables batched obfuscations directly on the internal buffer, avoiding the need to copy input spans for obfuscation. Xor key offsets are calculated based on the file position, and the batched obfuscation is applied before writing to disk.

BufferedFile was avoided for both reads and writes due to its unrelated operations and potential deprecation.

2 full IBD runs against master for 870k blocks (seeded from real nodes) indicates a 6% total speedup.

 0hyperfine --runs 2 --export-json /mnt/my_storage/ibd-xor-buffered.json --parameter-list COMMIT d73f37dda221835b5109ede1b84db2dc7c4b74a1,6853b2740851befffa3ca0b24d94212ed1e48d66 --prepare 'rm -rf /mnt/my_storage/BitcoinData/* && git checkout {COMMIT} && git clean -fxd && git reset --hard && cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_UTIL=OFF -DBUILD_TX=OFF -DBUILD_TESTS=OFF -DENABLE_WALLET=OFF -DINSTALL_MAN=OFF && cmake --build build -j$(nproc)' 'COMMIT={COMMIT} ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0'
 1
 2Benchmark 1: COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
 3  Time (mean ± σ):     40216.674 s ± 113.132 s    [User: 51496.289 s, System: 3541.340 s]
 4  Range (min … max):   40136.678 s … 40296.671 s    2 runs
 5
 6Benchmark 2: COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0
 7  Time (mean ± σ):     37964.015 s ± 624.115 s    [User: 49086.037 s, System: 3375.072 s]
 8  Range (min … max):   37522.699 s … 38405.331 s    2 runs
 9
10Summary
11  COMMIT=6853b2740851befffa3ca0b24d94212ed1e48d66 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0 ran
12    1.06 ± 0.02 times faster than COMMIT=d73f37dda221835b5109ede1b84db2dc7c4b74a1 ./build/src/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=870000 -printtoconsole=0

Microbenchmarks show a ~23%/290% speedup (the followup XOR batching improves this further):

cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake –build build -j$(nproc) && build/src/bench/bench_bitcoin -filter=‘ReadBlockBench|SaveBlockBench’ -min-time=10000

Before:

ns/op	op/s	err%	total	benchmark
2,288,264.16	437.01	0.2%	11.00	`ReadBlockBench`
5,260,934.43	190.08	1.2%	11.08	`SaveBlockBench`

After optimization: Buffer serialization reads in UndoRead and ReadBlock (23% faster read):

ns/op	op/s	err%	total	benchmark
1,847,259.64	541.34	0.2%	11.03	`ReadBlockBench`

After optimization: Buffer serialization writes in UndoWrite and WriteBlock (290% faster write):

ns/op	op/s	err%	total	benchmark
1,804,208.61	554.26	1.4%	10.89	`SaveBlockBench`

DrahtBot commented at 11:09 am on December 19, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31539.

Reviews

See the guideline for information on the review process. A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

l0rinc renamed this:
~~optimization: buffer reads/writes in [undo]blocks~~
optimization: buffer reads/writes in [undo]block [de]serialization
on Dec 19, 2024

l0rinc force-pushed on Dec 21, 2024

l0rinc renamed this:
~~optimization: buffer reads/writes in [undo]block [de]serialization~~
optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization
on Dec 21, 2024

l0rinc force-pushed on Dec 21, 2024

l0rinc renamed this:
~~optimization: buffer reads(23%)/writes(19%) in [undo]block [de]serialization~~
optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization
on Dec 21, 2024

l0rinc renamed this:
~~optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization~~
optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD
on Dec 21, 2024

bitcoin deleted a comment on Jan 4, 2025

ryanofsky referenced this in commit 5d6f6fd00d on Jan 22, 2025

DrahtBot added the label Needs rebase on Jan 22, 2025

l0rinc force-pushed on Jan 22, 2025

DrahtBot removed the label Needs rebase on Jan 22, 2025

l0rinc force-pushed on Jan 22, 2025

l0rinc marked this as ready for review on Jan 22, 2025

l0rinc commented at 9:37 am on January 23, 2025: contributor

Not sure what the failure means:

100% tests passed, 0 tests failed out of 138

…

Error: Process completed with exit code 143.

maflcko commented at 10:09 am on January 23, 2025: member

Not sure what the failure means:

100% tests passed, 0 tests failed out of 138

…

Error: Process completed with exit code 143.

The unit tests are different from the functional tests. I guess the failure is due to a GHA VM corruption or OOM?

l0rinc marked this as a draft on Jan 23, 2025

l0rinc force-pushed on Jan 23, 2025

l0rinc commented at 5:39 pm on January 23, 2025: contributor

I’ll push a few more drafts to find out why CI is crashing - bear with me.

l0rinc force-pushed on Jan 23, 2025

maflcko commented at 7:11 pm on January 23, 2025: member

The unit tests are not failing. This can be seen in the output (100% tests passed, 0 tests failed out of 138). So skipping them should have no effect.

maflcko commented at 8:14 pm on January 23, 2025: member

My recommendation would be to measure the asan CI max memory usage before and after this change. Maybe it is not largely different, but just the last drip to tip an OOM?

l0rinc commented at 8:47 pm on January 23, 2025: contributor

You may be right, though I don’t yet see how this memory saving option keeps crashing after the rebase (it was passing before), while the memory hogging alternative is passing. https://github.com/bitcoin-dev-tools/benchcoin/pull/123#issuecomment-2610155151 shows the before/after memory to be the same.

I suspected a stuck file or process or infinite loop…

asan CI max memory usage

I don’t yet know how to run that on a Mac (and my linux boxes are running IBDs for now). I’ll push a few more rounds here to try to understand which part is causing the problem - and why only for ASan + LSan + UBSan + integer, if they won’t reveal anything, I’ll investigate how to replicate the CI behavior locally.

l0rinc force-pushed on Jan 23, 2025

optimization: Buffer serialization reads in `UndoRead` and `ReadBlock`

The Obfuscation (XOR) operations are currently done byte-by-byte during serialization, buffering the reads will enable batching the obfuscation operations later (not yet done here).

Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing those from memory) is also a bit faster (likely because of fewer native fread calls or less locking).

Buffered reading was done by first exhausting the buffer, and if more data is still needed, reading directly into the destination (to avoid copying into the buffer when we have huge destination Spans, such as the many megabyte vectors in later blocks), and lastly refilling the buffer completely.

Running the `ReadBlockBench` benchmarks with different buffer sizes indicated that 16 KiB is the optimal buffer size (roughly 25% faster than master).

Testing was done by randomly reading data from the same file with `AutoFile` and `BufferedReadOnlyFile` and making sure that the exact same data is read.

`BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        2,288,264.16 |              437.01 |    0.2% |     11.00 | `ReadBlockBench`

After:
|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        1,847,259.64 |              541.34 |    0.2% |     11.03 | `ReadBlockBench`

f8b10a4863

optimization: Buffer serialization writes in `SaveBlockUndo` and `SaveBlock`

Similarly to the serialization reads, buffered writes will enable batched xor calculations - especially since currently we need to copy the write inputs Span to do the obfuscation on it, batching will enable doing the xor on the internal buffer instead.

All write operations are delegated to `AutoFile`.
Xor key offsets are also calculated based on where we are in the underlying file.
To avoid the buffered write of 4096 in `AutoFile.write`, we're disabling obfuscation for the underlying `m_dest` and doing it ourselves for the whole buffer (instead of byte-by-byte) before writing it to file.
We can't obfuscate the write's src directly since it would mutate the method's input (for very big writes it would avoid copying), but we can fill our own buffer and xor all of that safely.

Running the `SaveBlockBench` benchmarks with different buffer sizes indicated that 1 MiB was the optimal buffer size.

`BufferedFile` wasn't used here because it does too many unrelated operations (and might be removed in the future).

Before:

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        5,260,934.43 |              190.08 |    1.2% |     11.08 | `SaveBlockBench`

After:
|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|        1,804,208.61 |              554.26 |    1.4% |     10.89 | `SaveBlockBench`

b1b55a2c23

l0rinc force-pushed on Jan 23, 2025

Revert "optimization: Buffer serialization writes in `SaveBlockUndo` and `SaveBlock`"

This reverts commit b1b55a2c2363f94e7266d31b7e2d818372bd5ce4.

9f8a93e0f6

maflcko commented at 10:50 am on January 24, 2025: member

I’ll investigate how to replicate the CI behavior locally

It may be sufficient to just copy-paste the cxx flags into a normal build. In the future a cmake preset could be added, see also #31199 (comment)

l0rinc commented at 11:07 am on January 24, 2025: contributor

It looks like the problem was with BufferedWriteOnlyFile - if needed, I will investigate further.

Given that there is an alternative PR to this (reading the whole block into memory (instead of smaller chunks) - which we’re already doing in other places, so shouldn’t be controversial), I’ll close this PR. If the memory allocations come up in that, I’ll consider reopening.

l0rinc closed this on Jan 24, 2025

optimization: buffer reads(23%)/writes(290%) in [undo]block [de]serialization, 6% faster IBD #31539

Code Coverage & Benchmarks

Reviews

Conflicts