During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.
Summary: >20% full IBD speedup
We don't have much low-hanging fruit anymore, but big speed improvements can still be achieved through many small, focused changes. Many optimization opportunities hide in consensus-critical code - this tracking PR provides justification for why those should also be considered. The unmerged changes here collectively achieve a >20% speedup for full IBD (measured over multiple real runs up to block 886'000 with a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours with this PR.
Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5
Related issues:
PRs included here (in review priority order):
Changing trends
The UTXO count and average block size have drastically increased in the past few years, providing a better overall picture of how Bitcoin behaves under real load. <img width="500" alt="image" src="https://github.com/user-attachments/assets/d257b864-d9de-4f9d-b61b-acc6835f384f" /> Profiling IBD, given these circumstances, revealed many new optimization opportunities.
Similar efforts in the past years
There were many efforts to make sure Bitcoin Core remains performant in light of these new trends, a few recent notable examples include:
- #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
- #28358 - allow the full UTXO set to fit into memory
- #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
- #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
- #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
- #30326 - favor the happy path for cache misses (~2% IBD speedup)
- #30884 - Windows regression fix
Reliable macro benchmarks
The measurements were done on a dedicated Hetzner auction box running the latest Ubuntu, with a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration of multiple NVMe drives (~1.4TB of fast storage in total). A lower-end i7 with an HDD was sometimes used for comparison.
To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result statistically), producing reliable numbers even when individual measurements varied. When hyperfine indicated that the measurements were too noisy, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first ran bitcoind until block 1, and only then started the actual benchmark until block 886'000.
The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):
- @andrewtoth in #31144 (comment) - 9% speedup with GCC
- @hodlinator in #31144 (comment) - 11.9% speedup with GCC
- @mlori in #31144 (comment) - 12% faster with GCC
- @Sjors in #31144 (comment) - 3% speedup with Clang
Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP, not yet used here.
Current changes (in order of importance, reviews and reproducers are welcome):
To visualize the effect of each commit, we plotted the block height progression from the produced debug.log files (taken from the last run of each commit, so the totals can differ slightly from the normalized averages shown below):
<details> <summary>debug.log visualizer</summary>
```python
import os
import re
import sys
from datetime import datetime

import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}
    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False
        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    # Skip everything before the warm-up run's "Shutdown: done" marker
                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue
                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue
                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp
                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)
        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)
    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))
    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))
    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))
    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)
    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()
    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))
        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}
        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}
            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()
            ratios = []
            base_times = []
            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]
                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)
            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)
        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)
        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()
        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")
    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
```
</details>
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/a43b43da-209c-4736-b1ef-d4d57f838d74" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/2d43d867-0c9e-4daf-9d47-bd1148c48b55" />
Baseline
Base commit was 88debb3e42.
<details> <summary>8.59 hour IBD time</summary>
```
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs
```
</details>
<details> <summary>7.91 hour IBD time</summary>
```
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs
```
We can serialize the blocks and undo data to any Stream that implements the appropriate read/write methods.
AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching the serialization in memory first and reading/writing the whole buffer to disk at once is measurably faster (likely because of fewer native fread calls and less locking, as observed by @martinus in a similar change).
Differential flame graphs indicate that the before/after speed change comes from fewer AutoFile reads and writes:
</details>
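The batching idea can be sketched in Python (illustrative only - the actual change is C++ code around AutoFile; the function names here are made up): accumulate the serialized parts in one in-memory buffer and hand the OS a single large write instead of many small ones.

```python
import io
import tempfile

def write_parts_unbatched(f, parts):
    # one write call per part - analogous to many small AutoFile writes
    for p in parts:
        f.write(p)

def write_parts_batched(f, parts):
    # append into an in-memory buffer first, then issue a single large write
    buf = io.BytesIO()
    for p in parts:
        buf.write(p)
    f.write(buf.getvalue())

# same bytes end up on disk either way; only the number of I/O calls differs
parts = [bytes([i % 256]) * 100 for i in range(1000)]
with tempfile.TemporaryFile() as f:
    write_parts_batched(f, parts)
    f.seek(0)
    data = f.read()
assert data == b"".join(parts)
```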
<details> <summary>7.60 hour IBD time</summary>
```
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs
```
</details>
Block obfuscation is currently done byte by byte; this PR batches the XOR into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable batches.
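A minimal Python sketch of the word-batched XOR (the 8-byte key width matches the obfuscation key's size; everything else is illustrative - the real change operates on C++ memory spans):

```python
KEY = bytes(range(8))  # illustrative 8-byte key

def xor_bytewise(data: bytes, key: bytes = KEY) -> bytes:
    # one XOR per byte - mirrors the current byte-by-byte approach
    return bytes(b ^ key[i % 8] for i, b in enumerate(data))

def xor_word_batched(data: bytes, key: bytes = KEY) -> bytes:
    # XOR eight bytes at a time as a 64-bit word, then handle the tail
    k64 = int.from_bytes(key, "little")
    out = bytearray()
    cut = len(data) - len(data) % 8
    for off in range(0, cut, 8):
        w = int.from_bytes(data[off:off + 8], "little") ^ k64
        out += w.to_bytes(8, "little")
    # cut is a multiple of 8, so the key stays aligned for the tail bytes
    out += bytes(b ^ key[(cut + i) % 8] for i, b in enumerate(data[cut:]))
    return bytes(out)

msg = bytes(range(256)) * 3 + b"tail"
assert xor_word_batched(msg) == xor_bytewise(msg)
```

In C++ the same trick applies the key as a `uint64_t` per iteration instead of eight separate byte XORs, letting the compiler use full-width registers.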
<details> <summary>7.50 hour IBD time</summary>
```
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs
```
</details>
When the in-memory UTXO set is flushed to LevelDB (after IBD or AssumeUTXO load), it does so in batches to manage memory usage during the flush. While a hidden -dbbatchsize config option exists to modify this value, this PR introduces dynamic calculation of the batch size based on the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before write).
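The kind of heuristic involved can be sketched as follows (the constants, fraction, and function name here are made up for illustration and are not taken from the PR):

```python
# Illustrative only: scale the LevelDB write-batch size with the available
# cache, clamped to sane bounds. The PR's actual formula may differ.
MiB = 1024 * 1024

def dynamic_batch_size(dbcache_bytes: int,
                       floor: int = 16 * MiB,
                       ceil: int = 256 * MiB,
                       fraction: float = 0.10) -> int:
    return max(floor, min(ceil, int(dbcache_bytes * fraction)))

assert dynamic_batch_size(5000 * MiB) == 256 * MiB  # big cache hits the ceiling
assert dynamic_batch_size(100 * MiB) == 16 * MiB    # small cache hits the floor
```

Larger batches amortize the constant per-batch overhead; the clamp keeps peak flush memory bounded regardless of `-dbcache`.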
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/0a99e32e-6a9b-481e-a08a-7216c82fe722" />
Note that this PR mainly optimizes a critical section of IBD (memory to disk dump) - even if the effect on overall speed is modest: <img width="1000" alt="image" src="https://github.com/user-attachments/assets/8b56674b-b3e3-43cf-a19b-574e66948e72" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/ce56cfba-a59f-4360-a6d7-2cc3e74959a3" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/e346414c-b009-47c5-92ff-a264b1e2c6c4" />
<img width="1000" alt="image" src="https://github.com/user-attachments/assets/4db30a70-1ca9-401c-8c9c-2ddecd0d7516" />
<details> <summary>7.41 hour IBD time</summary>
```
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs
```
</details>
The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enabled further SizeComputer optimizations as well.
Besides that, since single-byte writes are extremely common (every (u)int8_t, std::byte, and bool, plus the first byte of every VarInt, which in turn prefixes every (pre)vector), it makes sense to bypass the generalized serialization infrastructure for them.
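The SizeComputer idea - running the same serialization code against a counter instead of a real buffer - can be sketched in Python (the CompactSize encoding below is Bitcoin's real scheme; the class names are illustrative stand-ins for the C++ types):

```python
class SizeComputer:
    """Counts bytes without storing them - the size-only 'stream'."""
    def __init__(self):
        self.size = 0
    def write(self, data: bytes):
        self.size += len(data)

class BufferWriter:
    """A real in-memory stream with the same write interface."""
    def __init__(self):
        self.buf = bytearray()
    def write(self, data: bytes):
        self.buf += data

def ser_compact_size(stream, n: int):
    # Bitcoin's CompactSize encoding: a single byte covers n < 253,
    # which is why the single-byte fast path matters so much
    if n < 253:
        stream.write(bytes([n]))
    elif n <= 0xFFFF:
        stream.write(b"\xfd" + n.to_bytes(2, "little"))
    elif n <= 0xFFFFFFFF:
        stream.write(b"\xfe" + n.to_bytes(4, "little"))
    else:
        stream.write(b"\xff" + n.to_bytes(8, "little"))

# the same serialization routine drives both targets
sc, bw = SizeComputer(), BufferWriter()
ser_compact_size(sc, 252)
ser_compact_size(bw, 252)
assert sc.size == len(bw.buf) == 1
```

In the C++ code the dispatch between the two targets happens at compile time via if constexpr, so the size-only path carries no runtime branching cost.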
<details> <summary>7.31 hour IBD time</summary>
```
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs
```
</details>
CheckBlock's latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.
This PR improves performance and maintainability by introducing the following changes:
- Simplified checks for the most common cases (1 or 2 inputs; 70-90% of transactions have a single input).
- Optimized the general case by replacing `std::set` with a sorted `std::vector` for improved locality.
- Simplified null prevout checks from linear to constant time.
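The sorted-vector idea for duplicate-input detection can be sketched in Python (illustrative; the real code operates on C++ COutPoint values):

```python
def has_duplicate_inputs(prevouts):
    # sort once into contiguous memory and compare neighbors - better cache
    # locality than inserting every prevout into a tree-based std::set
    s = sorted(prevouts)
    return any(a == b for a, b in zip(s, s[1:]))

# prevouts modeled as (txid, output index) pairs
assert not has_duplicate_inputs([("tx1", 0), ("tx1", 1), ("tx2", 0)])
assert has_duplicate_inputs([("tx1", 0), ("tx2", 0), ("tx1", 0)])
```

Sorting is O(n log n) either way, but the vector variant avoids per-element allocations and pointer chasing, which dominates for the small input counts typical of real transactions.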
<details> <summary>7.25 hour IBD time</summary>
```
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0

  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs
```
</details>
The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.
Hashing a uint256 key happens so often that a specialized SipHashUint256Extra was extracted for it. The constant salting operations were already hoisted out of the hot path in the general case; this PR adjusts the main specialization similarly.
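The hoisting pattern is analogous to caching a hash midstate. A toy Python analogue using SHA-256 (not SipHash - purely to show precomputing the constant salted part once and reusing it per key):

```python
import hashlib

SALT = b"\x01" * 8                # stand-in for the per-node random salt
_midstate = hashlib.sha256(SALT)  # constant part, hashed exactly once

def hash_key(key: bytes) -> bytes:
    h = _midstate.copy()          # resume from the cached midstate
    h.update(key)                 # only the per-key work happens here
    return h.digest()

# identical to hashing salt + key from scratch, minus the repeated salt work
assert hash_key(b"k") == hashlib.sha256(SALT + b"k").digest()
```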
The current prevector size of 28 bytes (chosen to fill the sizeof(CScript) aligned size) was introduced in 2015 (https://github.com/bitcoin/bitcoin/pull/6914) before SegWit and TapRoot.
However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.
The core trade-off of this change is eliminating heap allocations for common 29-36 byte scripts at the cost of an 8-byte-larger base memory footprint for every CScript object (while still respecting the peak memory usage defined by -dbcache).
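The arithmetic behind the trade-off, as a quick check (the output-script sizes are the standard ones; the new inline capacity of 36 = 28 + 8 follows from the quoted 8-byte increase):

```python
INLINE_OLD = 28      # current prevector direct storage
INLINE_NEW = 28 + 8  # proposed: 8 more bytes of direct storage

# standard scriptPubKey sizes in bytes per output type
scripts = {"P2PKH": 25, "P2SH": 23, "P2WPKH": 22, "P2WSH": 34, "P2TR": 34}

heap_old = [n for n, size in scripts.items() if size > INLINE_OLD]
heap_new = [n for n, size in scripts.items() if size > INLINE_NEW]

assert heap_old == ["P2WSH", "P2TR"]  # these spill to the heap today
assert heap_new == []                 # all common scripts fit inline after
```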
Other similar efforts waiting for review or revival (not included in this tracking PR):
- #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
- #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
- #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
- #32128 - draft PR showcasing a few other possible caching speedups
This PR is meant to stay in draft (not to be merged directly) and will continually change based on comments received here and in the individual PRs. Comments, reproducers, and high-level discussions are welcome here; code reviews should be done in the individual PRs.