[IBD] Tracking PR for speeding up Initial Block Download #32043

pull l0rinc wants to merge 24 commits into bitcoin:master from l0rinc:l0rinc/IBD-optimizations changing 29 files +985 −326
  1. l0rinc commented at 4:20 pm on March 12, 2025: contributor

    During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.

    Summary: 18% full IBD speedup

There isn't much low-hanging fruit left, but big speed improvements can still be achieved through many small, focused changes. Many optimization opportunities are hiding in consensus-critical code - this tracking PR provides justification for why those should also be considered. The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured over multiple real runs until block 886'000 using a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours with this PR.

    Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

    PRs included here (in review priority order):

The UTXO count and average block size have drastically increased in the past few years, providing a better overall picture of how Bitcoin behaves under real load. Profiling IBD under these circumstances revealed many new optimization opportunities.

    Similar efforts in the past years

There were many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

    • #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
    • #28358 - allow the full UTXO set to fit into memory
    • #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
    • #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
    • #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
    • #30326 - favor the happy path for cache misses (~2% IBD speedup)
    • #30884 - Windows regression fix

    Reliable macro benchmarks

The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration with multiple NVMe drives (total ~1.4TB fast storage) - a dedicated Hetzner Auction box running the latest Ubuntu. Sometimes a lower-end i7 with an HDD was used for comparison.

To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result statistically), producing reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark. To reduce the instability of headers synchronization and peer acquisition, we first started bitcoind until block 1, followed by the actual benchmarks until block 886'000.

    The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):

Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP and not yet used here.

    Current changes (in order of importance, reviews and reproducers are welcome):

Plotting the per-block performance from the produced debug.log files (taken from the last run for each commit, so the curves can differ slightly from the normalized averages quoted below) visualizes the effect of each commit:

```
import os
import sys
import re
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}

    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False

        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue

                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue

                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp

                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))

    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))

    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))

    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()

    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))

        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}

        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}

            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()

            ratios = []
            base_times = []

            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]

                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)

            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)

        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
```
    

    Baseline

    Base commit was 88debb3e42.

```
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs
```
    

```
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs
```
    

We can serialize the blocks and undo data to any Stream which implements the appropriate read/write methods. AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching the data in memory first and reading/writing it to disk in larger chunks is measurably faster (likely because of fewer native fread calls and less locking, as observed by @martinus in a similar change).

Differential flame graphs indicate that the before/after speed change comes from fewer AutoFile reads and writes.
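As a rough sketch of the batching idea on the write side (a hypothetical helper, not the PR's exact code - `WriteBlockBatched` is a made-up name):

```
// Serialize into a presized in-memory buffer first, then hit the disk once,
// instead of letting each field trickle through AutoFile's small writes.
void WriteBlockBatched(AutoFile& fileout, const CBlock& block)
{
    DataStream buf;
    buf.reserve(GetSerializeSize(TX_WITH_WITNESS(block))); // SizeComputer pass
    buf << TX_WITH_WITNESS(block);    // many small writes, all in memory
    fileout.write(MakeByteSpan(buf)); // one large write through the OS cache
}
```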


```
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs
```
    

Block obfuscation is currently done byte-by-byte; this PR batches the XOR into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.

    obfuscation calls during IBD without batching
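A minimal sketch of the batching (assuming a little-endian host and a buffer starting at key offset 0; the real `util::Xor` also rotates the key for unaligned offsets and specializes the tail):

```
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <span>

void XorBatched(std::span<std::byte> data, uint64_t key)
{
    size_t i{0};
    for (; i + 8 <= data.size(); i += 8) { // bulk of the work: 64 bits at a time
        uint64_t chunk;
        std::memcpy(&chunk, data.data() + i, 8); // compilers turn this into a plain load
        chunk ^= key;
        std::memcpy(data.data() + i, &chunk, 8);
    }
    for (; i < data.size(); ++i) { // at most 7 trailing bytes
        data[i] ^= static_cast<std::byte>(key >> (8 * (i % 8)));
    }
}
```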


```
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs
```
    

The final UTXO set is written to disk in batches to avoid a gigantic memory spike at flush time. There is already a -dbbatchsize config option to change this value; this PR only adjusts the default. By increasing the default batch size, we can reduce overhead from repeated compaction cycles, minimize the constant overhead per batch, and achieve more sequential writes.
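The change itself is essentially a one-line default bump (a sketch; the constant lives in src/txdb.h, verify the exact name against master):

```
//! -dbbatchsize default (bytes): raised from 16 MiB to 64 MiB
static const int64_t nDefaultDbBatchSize = 64 << 20;
```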

Note that this PR mainly optimizes a critical section of IBD (the memory-to-disk dump) - even if the effect on overall speed is modest:


```
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs
```
    

The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enables further SizeComputer optimizations as well.

Beyond that, since single-byte writes are used very often (for every (u)int8_t, std::byte, or bool, and for every VarInt's first byte - which is also needed for every (pre)Vector), it makes sense to bypass the generalized serialization infrastructure for them.
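A minimal sketch of the fast path (the `write_byte` name and the writer class are illustrative, not the actual API):

```
#include <cstddef>
#include <cstdint>
#include <span>
#include <vector>

class VectorWriterLike
{
    std::vector<std::byte> m_data;

public:
    void write(std::span<const std::byte> src) { m_data.insert(m_data.end(), src.begin(), src.end()); }
    void write_byte(std::byte b) { m_data.push_back(b); } // no span, no range insert
};

template <typename Stream>
void ser_writedata8(Stream& s, uint8_t obj)
{
    s.write_byte(std::byte{obj}); // instead of s.write(<span of one byte>)
}
```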


```
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs
```
    

    CheckBlock’s latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

    This PR improves performance and maintainability by introducing the following changes:

    • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
    • Optimized the general case by replacing std::set with a sorted std::vector for improved locality (see the sketch after this list).
    • Simplified null prevout checks from linear to constant time.
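A sketch of the sorted-vector duplicate check (the helper name is hypothetical; `CTxIn`/`COutPoint` are the real types from primitives/transaction.h):

```
#include <algorithm>
#include <vector>

#include <primitives/transaction.h> // CTxIn, COutPoint

// Contiguous storage sorts and compares faster than a node-based std::set.
bool HasDuplicateInputs(const std::vector<CTxIn>& vin)
{
    std::vector<COutPoint> prevouts;
    prevouts.reserve(vin.size()); // pre-sized: one allocation, no rebalancing
    for (const auto& in : vin) prevouts.push_back(in.prevout);
    std::sort(prevouts.begin(), prevouts.end());
    return std::adjacent_find(prevouts.begin(), prevouts.end()) != prevouts.end();
}
```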

```
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs
```
    

The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.

Hashing a uint256 key happens so often that a specialized implementation, SipHashUint256Extra, was extracted for it. The constant salting operations were already precomputed in the general case; this PR adjusts the main specialization similarly.
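A sketch of the cached-state idea (the class shape is taken from the commit below; SipHash round details omitted):

```
#include <cstdint>

#include <uint256.h>

// Cache the salted initial SipHash state once per salt, instead of
// re-deriving it from k0/k1 on every call.
class Uint256ExtraSipHasher
{
    const uint64_t m_v0, m_v1, m_v2, m_v3; // precomputed constants ^ salt

public:
    Uint256ExtraSipHasher(uint64_t k0, uint64_t k1)
        : m_v0{0x736f6d6570736575ULL ^ k0},
          m_v1{0x646f72616e646f6dULL ^ k1},
          m_v2{0x6c7967656e657261ULL ^ k0},
          m_v3{0x7465646279746573ULL ^ k1} {}

    // Runs the usual SipRounds over the uint256 words plus the extra value,
    // starting from the cached v0..v3 instead of recomputing the XORs.
    uint64_t operator()(const uint256& val, uint32_t extra) const;
};
```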


    Other similar efforts waiting for reviews or revives (not included in this tracking PR):

    • #31132 - pre-warms the in-memory cache on multiple threads (10% IBD speedup for small in-memory caches)
    • #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
    • #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
    • #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
    • #32128 - draft PR showcasing a few other possible caching speedups

This PR is meant to stay in draft (not to be merged directly) and to change continually based on comments received here and in the sub-PRs. Comments, reproducers, and high-level discussions are welcome here - code reviews should rather be done in the individual PRs.

  2. DrahtBot commented at 4:20 pm on March 12, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK jonatack

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31868 ([IBD] specialize block serialization by l0rinc)
    • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
    • #31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)
    • #31551 ([IBD] batch block reads/writes during AutoFile serialization by l0rinc)
    • #31519 (refactor: Use std::span over Span by maflcko)
    • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
    • #30442 ([IBD] precalculate SipHash constant salt calculations by l0rinc)
    • #30214 (refactor: Improve assumeutxo state representation by ryanofsky)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] - Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  4. l0rinc renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  5. ryanofsky commented at 5:14 pm on March 12, 2025: contributor

    Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without needing to necessarily repeat it in the individual PRs.

Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell, the individual optimizations do not seem very complicated and seem like they should be justified.


    One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:

  6. laanwj added the label Block storage on Mar 12, 2025
  7. laanwj added the label P2P on Mar 12, 2025
  8. optimization: Bulk serialization reads in `UndoRead` and `ReadBlock`
The obfuscation (XOR) operations are currently done byte-by-byte during serialization; buffering the reads will enable batching the obfuscation operations later (not yet done here).
    
Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing them from memory) is also a bit faster (likely because of fewer native fread calls or less locking).
    
Since `ReadBlock[Undo]` is called with the file position set after the [undo]block size, we have to start by backtracking 4 bytes to be able to read the expected size first.
    As a consequence, the `FlatFilePos pos` parameter in `ReadBlock` is copied now.
    
`HashVerifier` was included in the try/catch so that the `undo_size` serialization is covered as well, since the try is about `Deserialize` errors. This is why the final checksum verification was also included in the try.
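A rough sketch of the resulting read path (names approximate, not the PR's literal code; assumes the 4-byte size prefix written right before the block data):

```
// Back up over the size field, read the expected size, then do one bulk
// read and deserialize from memory instead of many small fread calls.
pos.nPos -= 4; // `pos` points past the size prefix, hence the copied parameter
auto filein{OpenBlockFile(pos, /*fReadOnly=*/true)};
uint32_t blk_size;
filein >> blk_size;
std::vector<uint8_t> mem(blk_size);
filein.read(MakeWritableByteSpan(mem)); // single buffered read
DataStream ds{mem};
ds >> TX_WITH_WITNESS(block); // deserialize from the in-memory buffer
```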
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ReadBlockBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,289,743.62 |              436.73 |    0.3% |     11.03 | `ReadBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,724,703.14 |              579.81 |    0.4% |     11.06 | `ReadBlockBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    Before:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        7,786,309.20 |              128.43 |    0.0% |   70,832,812.80 |   23,803,523.16 |  2.976 |   5,073,002.56 |    0.4% |     10.72 | `ReadBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        6,272,557.28 |              159.42 |    0.0% |   63,251,231.42 |   19,739,780.92 |  3.204 |   3,589,886.66 |    0.3% |     10.57 | `ReadBlockBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    edb2575fb6
  9. Add `AutoFile::write_large` for batching obfuscation operations
Instead of copying the data and doing the XOR in a 4096-byte array, we do it directly on the input.
    
A `DataStream` constructor was also added to enable presized serialization and writing in a single call.
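A hedged sketch of what such a method can look like (member names assumed from AutoFile's current shape, not copied from the PR):

```
// Obfuscate the caller's buffer in place and write it with one fwrite,
// instead of staging 4096-byte obfuscated copies first.
void AutoFile::write_large(Span<std::byte> src)
{
    util::Xor(src, m_xor, m_position ? *m_position : 0); // in place: src is consumed
    if (std::fwrite(src.data(), 1, src.size(), m_file) != src.size()) {
        throw std::ios_base::failure("AutoFile::write_large: write failed");
    }
    if (m_position) *m_position += src.size();
}
```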
    e18c96a3fe
  10. optimization: Bulk serialization writes in `SaveBlockUndo` and `SaveBlock`
Similarly to the serialization reads, buffered writes enable batched XOR calculations - especially since we currently need to copy the input Span to do the obfuscation on it; batching lets us do the XOR on the internal buffer instead.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaveBlockBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        5,267,613.94 |              189.84 |    1.0% |     11.05 | `SaveBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,767,367.40 |              565.81 |    1.6% |     10.86 | `SaveBlockBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    Before:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,128,530.90 |              242.22 |    3.8% |   19,358,001.33 |    8,601,983.31 |  2.250 |   3,079,334.76 |    0.4% |     10.64 | `SaveBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        3,130,556.05 |              319.43 |    4.7% |   17,305,378.56 |    6,457,946.37 |  2.680 |   2,579,854.87 |    0.3% |     10.83 | `SaveBlockBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    23ed684c6a
  11. log: unify error messages for (read/write)[undo]block
    Co-authored-by: maflcko <6399679+maflcko@users.noreply.github.com>
    d0a86b343d
  12. test: Compare util::Xor with randomized inputs against simple impl
    Since production code only uses keys of length 8, we're not testing with other values anymore
    34afcc90c0
  13. bench: Make Xor benchmark more representative
To make the benchmarks representative, I've collected the write-vector sizes during IBD for every invocation of `util::Xor` until 860k blocks and used them as the basis for the micro-benchmarks, reproducing a similar distribution with random data (taking the 1000 most frequent sizes and making sure the very big ones are also covered).
    
And even though we already have serialization tests, `AutoFileXor` was added to serialize 1 MB via the provided key_bytes.
This was used to test the effect of disabling obfuscation.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.07 |      937,527,289.88 |    0.4% |     10.24 | `AutoFileXor`
    |                0.87 |    1,149,859,017.49 |    0.3% |     10.80 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.87 |      535,253,389.72 |    0.0% |            9.20 |            3.45 |  2.669 |           1.03 |    0.1% |     11.02 | `AutoFileXor`
    |                1.70 |      587,844,715.57 |    0.0% |            9.35 |            5.41 |  1.729 |           1.05 |    1.7% |     10.95 | `XorHistogram`
    8ce0670506
  14. optimization: Xor 64 bits together instead of byte-by-byte
    `util::Xor` method was split out into more focused parts:
* one which assumes that the `uint64_t` key is properly aligned, doing the first few xors as 64 bits (the memcpy is eliminated in most compilers), and the last iteration is optimized for 8/16/32 bytes.
    * an unaligned `uint64_t` key with a `key_offset` parameter which is rotated to accommodate the data (adjusting for endianness).
    * a legacy `std::vector<std::byte>` key with an asserted 8 byte size, converted to `uint64_t`.
    
Note that the default statement alone would pass the tests, but it would be very slow, since the 1-, 2- and 4-byte versions wouldn't be specialized by the compiler - hence the switch.
    
    Asserts were added throughout the code to make sure every such vector has length 8, since in the next commit we're converting all of them to `uint64_t`.
    
    refactor: Migrate fixed-size obfuscation end-to-end from `std::vector<std::byte>` to `uint64_t`
    
    Since `util::Xor` accepts `uint64_t` values, we're eliminating any repeated vector-to-uint64_t conversions going back to the loading/saving of these values (we're still serializing them as vectors, but converting as soon as possible to `uint64_t`). This is the reason the tests still generate vector values and convert to `uint64_t` later instead of generating it directly.
    
We also short-circuit `Xor` calls with 0 key values early to avoid unnecessary calculations (e.g. `MakeWritableByteSpan`) - even if we assume that Xor is never called with a 0 key.
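In code, the up-front conversion boils down to a fragment like this (a sketch; `key_bytes` stands for the legacy `std::vector<std::byte>` key):

```
// Convert the legacy 8-byte vector key to uint64_t once, up front:
assert(key_bytes.size() == 8);
uint64_t key;
std::memcpy(&key, key_bytes.data(), 8); // native byte order, as the 64-bit Xor expects
```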
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                0.09 |   10,799,585,470.46 |    1.3% |     11.00 | `AutoFileXor`
    |                0.14 |    7,144,743,097.97 |    0.2% |     11.01 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                0.59 |    1,706,433,032.76 |    0.1% |            0.00 |            0.00 |  0.620 |           0.00 |    1.8% |     11.01 | `AutoFileXor`
    |                0.47 |    2,145,375,849.71 |    0.0% |            0.95 |            1.48 |  0.642 |           0.20 |    9.6% |     10.93 | `XorHistogram`
    
    ----
    
    A few other benchmarks that seem to have improved as well (tested with Clang only):
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,237,168.64 |              446.99 |    0.3% |     10.91 | `ReadBlockFromDiskTest`
    |          748,837.59 |            1,335.40 |    0.2% |     10.68 | `ReadRawBlockFromDiskTest`
    
    After:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,827,436.12 |              547.21 |    0.7% |     10.95 | `ReadBlockFromDiskTest`
    |           49,276.48 |           20,293.66 |    0.2% |     10.99 | `ReadRawBlockFromDiskTest`
    bb9cb81607
  15. bench: measure block (size)serialization speed
The SizeComputer is a special serializer which returns the exact final size of the serialized content.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          936,285.45 |            1,068.05 |    0.1% |     11.01 | `DeserializeBlock`
    |          194,330.04 |            5,145.88 |    0.2% |     10.97 | `SerializeBlock`
    |           12,215.05 |           81,866.19 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,447,243.87 |              224.86 |    0.0% |   53,689,737.58 |   15,966,336.86 |  3.363 |   2,409,315.46 |    0.5% |     11.01 | `DeserializeBlock`
    |          869,833.14 |            1,149.65 |    0.0% |    8,015,883.90 |    3,123,013.80 |  2.567 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           26,535.51 |           37,685.36 |    0.0% |      225,261.03 |       95,278.40 |  2.364 |      53,037.03 |    0.6% |     11.00 | `SizeComputerBlock`
    99b2c2a862
  16. refactor: reduce template bloat in primitive serialization
Merged multiple template methods into a single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can still be optimized thanks to C++20 constexpr conditions).
This unifies related methods that were previously bound only by similar signatures - and enables `SizeComputer` optimizations later.
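The rough shape of such a merged method (an illustrative sketch; `SerWriteDataLE` is a made-up name, and the `htole*` helpers stand for the usual endian conversions from compat/endian.h):

```
#include <cstdint>
#include <span>
#include <type_traits>

template <typename Stream, typename T>
void SerWriteDataLE(Stream& s, T obj)
{
    static_assert(std::is_unsigned_v<T>);
    // One method instead of separate ser_writedata8/16/32/64 overloads;
    // the compiler prunes the dead branches, so there is no runtime cost.
    if constexpr (sizeof(T) == 2)      obj = htole16(obj);
    else if constexpr (sizeof(T) == 4) obj = htole32(obj);
    else if constexpr (sizeof(T) == 8) obj = htole64(obj);
    s.write(std::as_bytes(std::span{&obj, 1}));
}
```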
    794180e8f8
  17. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` 028c006541
  18. optimization: Add single byte write
Single-byte writes are used very often (for every (u)int8_t, std::byte or bool, and for every VarInt's first byte - which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure where it isn't needed:
* AutoFile write no longer needs to allocate a 4k buffer for a single byte;
* `VectorWriter` and `DataStream` avoid memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          934,120.45 |            1,070.53 |    0.2% |     11.01 | `DeserializeBlock`
    |          170,719.27 |            5,857.57 |    0.1% |     10.99 | `SerializeBlock`
    |           12,048.40 |           82,998.58 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,433,835.04 |              225.54 |    0.0% |   53,688,481.60 |   15,918,730.23 |  3.373 |   2,409,056.47 |    0.5% |     11.01 | `DeserializeBlock`
    |          563,663.10 |            1,774.11 |    0.0% |    7,386,775.59 |    2,023,525.77 |  3.650 |   1,385,368.57 |    0.5% |     11.00 | `SerializeBlock`
    |           27,351.60 |           36,560.93 |    0.1% |      225,261.03 |       98,209.77 |  2.294 |      53,037.03 |    0.9% |     11.00 | `SizeComputerBlock`
    c02600b8e1
  19. optimization: merge SizeComputer specializations + add new ones
Endianness doesn't affect the final size, so we can skip the byte swapping for `SizeComputer`.
We can `if constexpr` previous calls into the existing method, short-circuiting the existing logic when we only need the serialized sizes.
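Sketched on top of the merged method from the earlier commit (again illustrative; `SizeComputer::seek` is the size-advancing call in serialize.h):

```
template <typename Stream, typename T>
void SerWriteDataLE(Stream& s, T obj)
{
    if constexpr (std::is_same_v<Stream, SizeComputer>) {
        s.seek(sizeof(T)); // byte order doesn't change the size - skip the swap
    } else {
        if constexpr (sizeof(T) == 2)      obj = htole16(obj);
        else if constexpr (sizeof(T) == 4) obj = htole32(obj);
        else if constexpr (sizeof(T) == 8) obj = htole64(obj);
        s.write(std::as_bytes(std::span{&obj, 1}));
    }
}
```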
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          888,859.82 |            1,125.04 |    0.4% |     10.87 | `DeserializeBlock`
    |          168,502.88 |            5,934.62 |    0.1% |     10.99 | `SerializeBlock`
    |           10,200.88 |           98,030.75 |    0.1% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,460,428.52 |              224.19 |    0.0% |   53,692,507.13 |   16,015,347.97 |  3.353 |   2,410,105.48 |    0.5% |     11.01 | `DeserializeBlock`
    |          567,042.65 |            1,763.54 |    0.0% |    7,386,775.59 |    2,035,613.84 |  3.629 |   1,385,368.57 |    0.5% |     11.01 | `SerializeBlock`
    |           25,728.56 |           38,867.32 |    0.0% |      172,750.03 |       92,366.64 |  1.870 |      42,131.03 |    1.7% |     11.00 | `SizeComputerBlock`
    c072498305
  20. test: validate duplicate detection in `CheckTransaction`
    The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs.
    
    This update to the tests ensures that:
    * Accurate detection of duplicates: Beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
    * Consistency across methods: Both sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
    * Robust ordering and equality relations: The function maintains expected behavior for ordering and equality checks.
    
Using randomized testing with shuffled inputs (to avoid any residual ordering bias), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms behavior identical to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.
    
To make sure the new branches in the follow-up commits will be covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and also against an alternative duplicate-detection approach). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
    b07cdbe542
  21. bench: measure `CheckBlock` speed separately from serialization
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          372,743.63 |            2,682.81 |    1.1% |     10.99 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,304,694.54 |              302.60 |    0.5% |     11.05 | `DuplicateInputs`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        1,096,261.84 |              912.19 |    0.1% |    7,963,390.88 |    3,487,375.26 |  2.283 |   1,266,941.00 |    1.8% |     11.03 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        8,366,309.48 |              119.53 |    0.0% |   23,865,177.67 |   26,620,160.23 |  0.897 |   5,972,887.41 |    4.0% |     10.78 | `DuplicateInputs`
    99fe67e132
  22. bench: add `ProcessTransactionBench` to measure `CheckBlock` in context
    The newly introduced `ProcessTransactionBench` incorporates multiple steps in the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction validation context.
    
    Previous microbenchmarks, such as DeserializeAndCheckBlockTest and DuplicateInputs, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |            9,585.10 |          104,328.55 |    0.1% |     11.03 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           56,199.57 |           17,793.73 |    0.1% |      229,263.01 |      178,766.31 |  1.282 |      15,509.97 |    0.5% |     10.91 | `ProcessTransactionBench`
    452baf49e1
  23. optimization: move duplicate checks outside of coinbase branch
`IsCoinBase` means a single input with a NULL prevout, so it makes sense to restrict the duplicate check to non-coinbase transactions only.
The behavior is the same as before, except that single-input transactions aren't checked for duplicates anymore (~70-90% of cases, see https://transactionfee.info/charts/transactions-1in).
I've added braces to the conditions and loops to simplify review of the follow-up commits.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          335,917.12 |            2,976.92 |    1.3% |     11.01 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,286,337.42 |              304.29 |    1.1% |     10.90 | `DuplicateInputs`
    |            9,561.02 |          104,591.35 |    0.2% |     11.02 | `ProcessTransactionBench`
    45f8cda9bb
  24. optimization: simplify duplicate checks for trivial inputs
No need to create a set to check for duplicates in two-input transactions.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          314,137.30 |            3,183.32 |    1.2% |     11.04 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,220,592.73 |              310.50 |    1.3% |     10.92 | `DuplicateInputs`
    |            9,425.98 |          106,089.77 |    0.3% |     11.00 | `ProcessTransactionBench`
    2a5df20ca9
  25. optimization: replace tree with sorted vector
A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks.
It's also simpler (and therefore more reliable) than a sorted set, and it causes less memory fragmentation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          181,922.54 |            5,496.85 |    0.2% |     10.98 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          997,739.30 |            1,002.27 |    1.0% |     10.94 | `DuplicateInputs`
    |            9,449.28 |          105,828.15 |    0.3% |     10.99 | `ProcessTransactionBench`
    
    Co-authored-by: Pieter Wuille <pieter@wuille.net>
    ce6840f701
  26. optimization: look for NULL prevouts in the sorted values
For the two-input case we simply check both, as we did for equality.

For the general case, we take advantage of the sorting, making invalid-value detection constant time instead of linear in the worst case.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          179,971.00 |            5,556.45 |    0.3% |     11.02 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          963,177.98 |            1,038.23 |    1.7% |     10.92 | `DuplicateInputs`
    |            9,410.90 |          106,259.75 |    0.3% |     11.01 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          834,855.94 |            1,197.81 |    0.0% |    6,518,548.86 |    2,656,039.78 |  2.454 |     919,160.84 |    1.5% |     10.78 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,261,492.75 |              234.66 |    0.0% |   17,379,823.40 |   13,559,793.33 |  1.282 |   4,265,714.28 |    3.4% |     11.00 | `DuplicateInputs`
    |           55,819.53 |           17,914.88 |    0.1% |      227,828.15 |      177,520.09 |  1.283 |      15,184.36 |    0.4% |     10.91 | `ProcessTransactionBench`
    9f601c0cab
  27. coins: bump default LevelDB write batch size to 64 MiB
    The UTXO set has grown significantly, and flushing it from memory to LevelDB often takes over 20 minutes after a successful IBD with large dbcache values.
    The final UTXO set is written to disk in batches, which LevelDB sorts into SST files.
    By increasing the default batch size, we can reduce overhead from repeated compaction cycles, minimize constant overhead per batch, and achieve more sequential writes.
    
    Experiments with different batch sizes (loaded via assumeutxo at block 840k, then measuring final flush time) show that 64 MiB batches significantly reduce flush time without notably increasing memory usage:
    
    | dbbatchsize | flush_sum (ms) |
    |-------------|----------------|
    | 8 MiB       | ~240,000       |
    | 16 MiB      | ~220,000       |
    | 32 MiB      | ~200,000       |
    | *64 MiB*    | *~150,000*     |
    | 128 MiB     | ~156,000       |
    | 256 MiB     | ~166,000       |
    | 512 MiB     | ~186,000       |
    | 1 GiB       | ~186,000       |
    
    Checking the impact of a `-reindex-chainstate` with `-stopatheight=878000` and `-dbcache=30000` gives:
    16 << 20
    ```
    2025-01-12T07:31:05Z Flushed fee estimates to fee_estimates.dat.
    2025-01-12T07:31:05Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
    2025-01-12T07:53:51Z Shutdown: done
    ```
    Flush time: 22 minutes and 46 seconds
    
64 << 20
    ```
    2025-01-12T18:30:00Z Flushed fee estimates to fee_estimates.dat.
    2025-01-12T18:30:00Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
    2025-01-12T18:44:43Z Shutdown: done
    ```
    Flush time: ~14 minutes 43 seconds.
    626d55b9a8
  28. l0rinc force-pushed on Mar 12, 2025
  29. DrahtBot added the label CI failed on Mar 12, 2025
  30. DrahtBot commented at 6:16 pm on March 12, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38650495272

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  31. bench: Add COutPoint and SaltedOutpointHasher benchmarks
    This commit introduces new benchmarks to measure the performance of various operations using
    SaltedOutpointHasher, including hash computation, set operations, and set creation.
    
    These benchmarks are intended to provide insights about coin caching performance (e.g. during IBD).
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               58.60 |       17,065,922.04 |    0.3% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.97 |       83,576,684.83 |    0.1% |     11.01 | `SaltedOutpointHasherBench_hash`
    |               14.50 |       68,985,850.12 |    0.3% |     10.96 | `SaltedOutpointHasherBench_match`
    |               13.90 |       71,942,033.47 |    0.4% |     11.03 | `SaltedOutpointHasherBench_mismatch`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              136.76 |        7,312,133.16 |    0.0% |        1,086.67 |          491.12 |  2.213 |         119.54 |    1.1% |     11.01 | `SaltedOutpointHasherBench_create_set`
    |               23.82 |       41,978,882.62 |    0.0% |          252.01 |           85.57 |  2.945 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               60.42 |       16,549,695.42 |    0.1% |          460.51 |          217.04 |  2.122 |          21.00 |    1.4% |     10.99 | `SaltedOutpointHasherBench_match`
    |               78.66 |       12,713,595.35 |    0.1% |          555.59 |          282.52 |  1.967 |          20.19 |    2.2% |     10.74 | `SaltedOutpointHasherBench_mismatch`
    24d35ec4ec
  32. test: Rename k1/k2 to k0/k1 for consistency 09131cc9d1
  33. refactor: Extract C0-C3 Siphash constants 39e33d4928
  34. optimization: refactor: Introduce Uint256ExtraSipHasher to cache SipHash constant state
Previously, only k0 and k1 were stored, causing the constant XOR operations to be recomputed on every call to `SipHashUint256Extra`.
This commit adds a dedicated `Uint256ExtraSipHasher` class that caches the initial state (v0-v3) and performs the `SipHash` computation on a `uint256` (with an extra parameter), hiding the constant computation details from higher-level code and improving efficiency.
This essentially brings the precalculations done in the `CSipHasher` constructor to the `uint256`-specialized SipHash implementation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               57.27 |       17,462,299.19 |    0.1% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.24 |       88,997,888.48 |    0.3% |     11.04 | `SaltedOutpointHasherBench_hash`
    |               13.91 |       71,902,014.20 |    0.2% |     11.01 | `SaltedOutpointHasherBench_match`
    |               13.29 |       75,230,390.31 |    0.1% |     11.00 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set - 17,462,299.19/17,065,922.04 - 2.3% faster
    hash       - 88,997,888.48/83,576,684.83 - 6.4% faster
    match      - 71,902,014.20/68,985,850.12 - 4.2% faster
    mismatch   - 75,230,390.31/71,942,033.47 - 4.5% faster
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              135.38 |        7,386,349.49 |    0.0% |        1,078.19 |          486.16 |  2.218 |         119.56 |    1.1% |     11.00 | `SaltedOutpointHasherBench_create_set`
    |               23.67 |       42,254,558.08 |    0.0% |          247.01 |           85.01 |  2.906 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               58.95 |       16,962,220.14 |    0.1% |          446.55 |          211.74 |  2.109 |          20.86 |    1.4% |     11.01 | `SaltedOutpointHasherBench_match`
    |               76.98 |       12,991,047.69 |    0.1% |          548.93 |          276.50 |  1.985 |          20.25 |    2.3% |     10.72 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set -  7,386,349.49/7,312,133.16  - 1% faster
    hash       - 42,254,558.08/41,978,882.62 - 0.6% faster
    match      - 16,962,220.14/16,549,695.42 - 2.4% faster
    mismatch   - 12,991,047.69/12,713,595.35 - 2% faster
    a06a674b43
  35. l0rinc force-pushed on Mar 12, 2025
  36. DrahtBot removed the label CI failed on Mar 12, 2025
  37. ajtowns commented at 6:18 am on March 13, 2025: contributor

    Plotting the performance of the blocks from the produced debug.log files shows (from the last run, can differ slightly from the normalized average shown below) the effect of each commit:

    Wouldn’t these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple “lower is better” comparison? (rather than “further left is better”)

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful? (Perhaps limited to height >= 400000)

  38. l0rinc commented at 9:16 am on March 13, 2025: contributor

    with a simple “lower is better” comparison

I like the idea - I've updated the description and the code accordingly.

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful?

Something like this? Conceptually it seems useful, but here I don't know how to interpret it - it seems too far zoomed in. It looks even funnier without the >400k cap:

  39. ajtowns commented at 12:45 pm on March 13, 2025: contributor

    Something like this? Conceptually seems useful, but here I don’t know how to interpret it, seems too far zoomed in

Fair; that might work better with time on the x-axis rather than height, something like:

```
for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
    for height in range(1, 850000):
        y = time_baseline[height] / time_commit[height]  # keep this one the same
        x = time_baseline[height]  # changed from x = height
        add_datapoint(x, y)
```
    
  40. l0rinc commented at 1:08 pm on March 13, 2025: contributor

    time on the x-axis rather than height

If we do that, we don't even need to filter out the first 400k blocks since they're so insignificant. Edit: you're right, it looks better to filter those out - I've updated the description with the images and code.

  41. jonatack commented at 3:52 pm on March 13, 2025: member

    Concept ACK, thank you for opening this.

I am currently in an environment with slow internet, where despite having a relatively fast laptop, IBD is more than two orders of magnitude slower than the times in the OP.

    Opened #32051 today to address an issue I’m also seeing of very frequent disconnections+reconnections of trusted addnode peers during IBD.

  42. DrahtBot added the label Needs rebase on Mar 20, 2025
  43. DrahtBot commented at 10:09 am on March 20, 2025: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

