During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.
Summary: >20% full IBD speedup
We don't have much low-hanging fruit anymore, but big speed improvements can still be achieved through many small, focused changes. Many optimization opportunities hide in consensus-critical code - this tracking PR provides justification for why those should also be considered. The unmerged changes here collectively achieve a >20% speedup for full IBD (measured over multiple real runs up to block 886'000 with a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours with this PR.
Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5
Related issues:
PRs included here (in review priority order):
Changing trends
The UTXO count and average block size have drastically increased in the past few years, providing a better overall picture of how Bitcoin behaves under real load. <img width="500" alt="image" src="https://github.com/user-attachments/assets/d257b864-d9de-4f9d-b61b-acc6835f384f" /> Profiling IBD, given these circumstances, revealed many new optimization opportunities.
Similar efforts in the past years
There were many efforts to make sure Bitcoin Core remains performant in light of these new trends, a few recent notable examples include:
- #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
- #28358 - allow the full UTXO set to fit into memory
- #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
- #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
- #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
- #30326 - favor the happy path for cache misses (~2% IBD speedup)
- #30884 - Windows regression fix
Reliable macro benchmarks
The measurements were done on a dedicated Hetzner auction box running the latest Ubuntu, with a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration of multiple NVMe drives (~1.4TB of fast storage in total). A lower-end i7 with an HDD was sometimes used for comparison.
To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result statistically), producing reliable numbers even when individual measurements varied. When hyperfine indicated that the measurements were too noisy, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first ran bitcoind until block 1, and only then started the actual benchmark until block 886'000.
The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):
- @andrewtoth in #31144 (comment) - 9% speedup with GCC
- @hodlinator in #31144 (comment) - 11.9% speedup with GCC
- @mlori in #31144 (comment) - 12% faster with GCC
- @Sjors in #31144 (comment) - 3% speedup with Clang
Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP, not yet used here.
Current changes (in order of importance, reviews and reproducers are welcome):
To visualize the effect of each commit, we plotted the block height progression from the produced debug.log files (taken from the last run of each commit, so the totals can differ slightly from the normalized averages shown below):
<details> <summary>debug.log visualizer</summary>
```python
import os
import re
import sys
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}
    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False
        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    # Skip everything before the warm-up run's "Shutdown: done" marker
                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue
                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue
                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp
                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)
        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)
    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))
    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))
    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))
    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)
    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()
    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))
        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}
        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}
            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()
            ratios = []
            base_times = []
            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]
                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)
            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)
        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)
        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()
        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")
    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
```
</details>
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/a43b43da-209c-4736-b1ef-d4d57f838d74" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/2d43d867-0c9e-4daf-9d47-bd1148c48b55" />
Baseline
Base commit was 88debb3e42.
<details> <summary>8.59 hour IBD time</summary>
```
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs
```
</details>
<details> <summary>7.91 hour IBD time</summary>
```
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs
```
We can serialize the blocks and undo data to any Stream that implements the appropriate read/write methods.
AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching the serialization in memory first and reading/writing the whole buffer to disk at once is measurably faster (likely because of fewer native fread calls and less locking, as observed by @martinus in a similar change).
Differential flame graphs indicate that the before/after speed change comes from fewer AutoFile reads and writes:
</details>
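The batching idea can be sketched in Python (illustrative only - the actual change is C++ code around AutoFile; the function names here are made up): accumulate the serialized parts in one in-memory buffer and hand the OS a single large write instead of many small ones.

```python
import io
import tempfile

def write_parts_unbatched(f, parts):
    # one write call per part - analogous to many small AutoFile writes
    for p in parts:
        f.write(p)

def write_parts_batched(f, parts):
    # append into an in-memory buffer first, then issue a single large write
    buf = io.BytesIO()
    for p in parts:
        buf.write(p)
    f.write(buf.getvalue())

# same bytes end up on disk either way; only the number of I/O calls differs
parts = [bytes([i % 256]) * 100 for i in range(1000)]
with tempfile.TemporaryFile() as f:
    write_parts_batched(f, parts)
    f.seek(0)
    data = f.read()
assert data == b"".join(parts)
```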
<details> <summary>7.60 hour IBD time</summary>
```
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs
```
</details>
Block obfuscation is currently done byte by byte; this PR batches the XOR into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable batches.
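A minimal Python sketch of the word-batched XOR (the 8-byte key width matches the obfuscation key's size; everything else is illustrative - the real change operates on C++ memory spans):

```python
KEY = bytes(range(8))  # illustrative 8-byte key

def xor_bytewise(data: bytes, key: bytes = KEY) -> bytes:
    # one XOR per byte - mirrors the current byte-by-byte approach
    return bytes(b ^ key[i % 8] for i, b in enumerate(data))

def xor_word_batched(data: bytes, key: bytes = KEY) -> bytes:
    # XOR eight bytes at a time as a 64-bit word, then handle the tail
    k64 = int.from_bytes(key, "little")
    out = bytearray()
    cut = len(data) - len(data) % 8
    for off in range(0, cut, 8):
        w = int.from_bytes(data[off:off + 8], "little") ^ k64
        out += w.to_bytes(8, "little")
    # cut is a multiple of 8, so the key stays aligned for the tail bytes
    out += bytes(b ^ key[(cut + i) % 8] for i, b in enumerate(data[cut:]))
    return bytes(out)

msg = bytes(range(256)) * 3 + b"tail"
assert xor_word_batched(msg) == xor_bytewise(msg)
```

In C++ the same trick applies the key as a `uint64_t` per iteration instead of eight separate byte XORs, letting the compiler use full-width registers.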
<details> <summary>7.50 hour IBD time</summary>
```
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs
```
</details>
When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush. While a hidden -dbbatchsize config option exists to modify this value, this PR introduces dynamic calculation of the batch size based on the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before write).
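The kind of heuristic involved can be sketched as follows (the constants, fraction, and function name here are made up for illustration and are not taken from the PR):

```python
# Illustrative only: scale the LevelDB write-batch size with the available
# cache, clamped to sane bounds. The PR's actual formula may differ.
MiB = 1024 * 1024

def dynamic_batch_size(dbcache_bytes: int,
                       floor: int = 16 * MiB,
                       ceil: int = 256 * MiB,
                       fraction: float = 0.10) -> int:
    return max(floor, min(ceil, int(dbcache_bytes * fraction)))

assert dynamic_batch_size(5000 * MiB) == 256 * MiB  # big cache hits the ceiling
assert dynamic_batch_size(100 * MiB) == 16 * MiB    # small cache hits the floor
```

Larger batches amortize the constant per-batch overhead; the clamp keeps peak flush memory bounded regardless of `-dbcache`.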
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/0a99e32e-6a9b-481e-a08a-7216c82fe722" />
Note that this PR mainly optimizes a critical section of IBD (memory to disk dump) - even if the effect on overall speed is modest: <img width="1000" alt="image" src="https://github.com/user-attachments/assets/8b56674b-b3e3-43cf-a19b-574e66948e72" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/ce56cfba-a59f-4360-a6d7-2cc3e74959a3" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/e346414c-b009-47c5-92ff-a264b1e2c6c4" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/4db30a70-1ca9-401c-8c9c-2ddecd0d7516" />
<details> <summary>7.41 hour IBD time</summary>
```
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs
```
</details>
The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enabled further SizeComputer optimizations as well.
Besides that, since single-byte writes are extremely common (every (u)int8_t, std::byte, and bool, plus the first byte of every VarInt, which in turn prefixes every (pre)vector), it makes sense to bypass the generalized serialization infrastructure for them.
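The SizeComputer idea - running the same serialization code against a counter instead of a real buffer - can be sketched in Python (the CompactSize encoding below is Bitcoin's real scheme; the class names are illustrative stand-ins for the C++ types):

```python
class SizeComputer:
    """Counts bytes without storing them - the size-only 'stream'."""
    def __init__(self):
        self.size = 0
    def write(self, data: bytes):
        self.size += len(data)

class BufferWriter:
    """A real in-memory stream with the same write interface."""
    def __init__(self):
        self.buf = bytearray()
    def write(self, data: bytes):
        self.buf += data

def ser_compact_size(stream, n: int):
    # Bitcoin's CompactSize encoding: a single byte covers n < 253,
    # which is why the single-byte fast path matters so much
    if n < 253:
        stream.write(bytes([n]))
    elif n <= 0xFFFF:
        stream.write(b"\xfd" + n.to_bytes(2, "little"))
    elif n <= 0xFFFFFFFF:
        stream.write(b"\xfe" + n.to_bytes(4, "little"))
    else:
        stream.write(b"\xff" + n.to_bytes(8, "little"))

# the same serialization routine drives both targets
sc, bw = SizeComputer(), BufferWriter()
ser_compact_size(sc, 252)
ser_compact_size(bw, 252)
assert sc.size == len(bw.buf) == 1
```

In the C++ code the dispatch between the two targets happens at compile time via if constexpr, so the size-only path carries no runtime branching cost.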
<details> <summary>7.31 hour IBD time</summary>
```
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs
```
</details>
CheckBlock's latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.
This PR improves performance and maintainability by introducing the following changes:
- Simplified checks for the most common cases (1 or 2 inputs; 70-90% of transactions have a single input).
- Optimized the general case by replacing `std::set` with a sorted `std::vector` for improved locality.
- Simplified null prevout checks from linear to constant time.
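The sorted-vector idea for duplicate-input detection can be sketched in Python (illustrative; the real code operates on C++ COutPoint values):

```python
def has_duplicate_inputs(prevouts):
    # sort once into contiguous memory and compare neighbors - better cache
    # locality than inserting every prevout into a tree-based std::set
    s = sorted(prevouts)
    return any(a == b for a, b in zip(s, s[1:]))

# prevouts modeled as (txid, output index) pairs
assert not has_duplicate_inputs([("tx1", 0), ("tx1", 1), ("tx2", 0)])
assert has_duplicate_inputs([("tx1", 0), ("tx2", 0), ("tx1", 0)])
```

Sorting is O(n log n) either way, but the vector variant avoids per-element allocations and pointer chasing, which dominates for the small input counts typical of real transactions.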
<details> <summary>7.25 hour IBD time</summary>
```
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs
```
</details>
The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.
Hashing a uint256 key happens so often that a specialized SipHashUint256Extra was extracted for it. The constant salting operations were already hoisted out of the hot path in the general case; this PR adjusts the main specialization similarly.
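The hoisting pattern is analogous to caching a hash midstate. A toy Python analogue using SHA-256 (not SipHash - purely to show precomputing the constant salted part once and reusing it per key):

```python
import hashlib

SALT = b"\x01" * 8                # stand-in for the per-node random salt
_midstate = hashlib.sha256(SALT)  # constant part, hashed exactly once

def hash_key(key: bytes) -> bytes:
    h = _midstate.copy()          # resume from the cached midstate
    h.update(key)                 # only the per-key work happens here
    return h.digest()

# identical to hashing salt + key from scratch, minus the repeated salt work
assert hash_key(b"k") == hashlib.sha256(SALT + b"k").digest()
```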
The current prevector size of 28 bytes (chosen to fill the sizeof(CScript) aligned size) was introduced in 2015 (https://github.com/bitcoin/bitcoin/pull/6914) before SegWit and TapRoot.
However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.
The core trade-off of this change is eliminating heap allocations for common 29-36 byte scripts at the cost of an 8-byte-larger base memory footprint for every CScript object (while still respecting the peak memory usage defined by -dbcache).
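The arithmetic behind the trade-off, as a quick check (the output-script sizes are the standard ones; the new inline capacity of 36 = 28 + 8 follows from the quoted 8-byte increase):

```python
INLINE_OLD = 28      # current prevector direct storage
INLINE_NEW = 28 + 8  # proposed: 8 more bytes of direct storage

# standard scriptPubKey sizes in bytes per output type
scripts = {"P2PKH": 25, "P2SH": 23, "P2WPKH": 22, "P2WSH": 34, "P2TR": 34}

heap_old = [n for n, size in scripts.items() if size > INLINE_OLD]
heap_new = [n for n, size in scripts.items() if size > INLINE_NEW]

assert heap_old == ["P2WSH", "P2TR"]  # these spill to the heap today
assert heap_new == []                 # all common scripts fit inline after
```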
Other similar efforts waiting for review or revival (not included in this tracking PR):
- #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
- #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
- #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
- #32128 - draft PR showcasing a few other possible caching speedups
This PR is meant to stay in draft (not to be merged directly) and will continually change based on comments received here and in the individual PRs. Comments, reproducers, and high-level discussions are welcome here; code reviews should be done in the individual PRs.