During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.
Summary: 18% full IBD speedup
There isn’t much low-hanging fruit left, but large speedups can still be achieved through many small, focused changes. Many optimization opportunities hide in consensus-critical code - this tracking PR provides justification for why those should be considered as well. The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured by multiple real runs until block 886'000 using a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours with this PR.
Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5
PRs included here (in review priority order):
Changing trends
The UTXO count and the average block size have increased drastically in the past few years, giving a better overall picture of how Bitcoin behaves under real load. Profiling IBD under these circumstances revealed many new optimization opportunities.
Similar efforts in the past years
There have been many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:
- #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
- #28358 - allow the full UTXO set to fit into memory
- #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
- #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
- #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
- #30326 - favor the happy path for cache misses (~2% IBD speedup)
- #30884 - Windows regression fix
Reliable macro benchmarks
The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration with multiple NVMe drives (~1.4TB of fast storage in total) - a dedicated Hetzner Auction box running the latest Ubuntu. For comparison, a lower-end i7 with an HDD was sometimes used as well.
To make sure the setup reflected a real user’s experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result via statistical methods), producing reliable numbers even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark.
To reduce the instability of headers synchronization and peer acquisition, we first started bitcoind until block 1, followed by the actual benchmarks until block 886'000.
The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):
- @andrewtoth in #31144 (comment) - 9% speedup with GCC
- @hodlinator in #31144 (comment) - 11.9% speedup with GCC
- @mlori in #31144 (comment) - 12% faster with GCC
- @Sjors in #31144 (comment) - 3% speedup with Clang
Also note that there is a separate effort to add a reliable macro-benchmarking suite that tracks the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - it is still WIP and was not used here.
Current changes (in order of importance, reviews and reproducers are welcome):
Plotting the per-block performance from the produced debug.log files (taken from the last run of each commit - these can differ slightly from the normalized averages shown below) visualizes the effect of each commit:
```python
import os
import sys
import re
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}

    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False

        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    # Skip everything up to (and including) the warm-up run's shutdown marker
                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue

                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue

                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp

                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))

    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))

    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))

    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()

    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))

        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}

        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}

            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()

            ratios = []
            base_times = []

            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]

                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)

            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)

        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
```
Baseline
Base commit was 88debb3e42.
```
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     30932.610 s ±  156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs
```
```
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     28501.588 s ±  119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs
```
We can serialize the blocks and undos to any `Stream` which implements the appropriate read/write methods. `AutoFile` is one of these, writing the results “directly” to disk (through the OS file cache). Batching these in memory first and reading/writing them to disk in one go is measurably faster (likely because of fewer native fread calls or less locking, as observed by @martinus in a similar change).
Differential flame graphs indicate that the before/after speed change is because of fewer `AutoFile` reads and writes:
```
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27394.210 s ±  565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs
```
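The effect of buffering can be sketched in a few lines of Python (a toy model, not Core's code - `CountingFile` is a hypothetical stand-in for the native file so we can count write calls): serializing into one in-memory buffer and issuing a single write produces identical bytes while replacing thousands of small writes.

```python
class CountingFile:
    """Hypothetical in-memory 'file' that counts write calls (a proxy for fwrite/locking cost)."""
    def __init__(self):
        self.buffer = bytearray()
        self.write_calls = 0

    def write(self, data):
        self.write_calls += 1
        self.buffer += data


def write_direct(f, records):
    # One native write per field, like streaming each serialized member straight to the file.
    for rec in records:
        for field in rec:
            f.write(field)


def write_batched(f, records):
    # Serialize everything into one memory buffer first, then write once.
    buf = bytearray()
    for rec in records:
        for field in rec:
            buf += field
    f.write(bytes(buf))


records = [(b"\x01", b"header", b"payload")] * 1000
direct, batched = CountingFile(), CountingFile()
write_direct(direct, records)
write_batched(batched, records)
assert direct.buffer == batched.buffer  # identical bytes end up on "disk"
assert batched.write_calls == 1         # versus 3000 write calls on the direct path
```

The same reasoning applies in reverse for reads: one large read into a memory buffer, then deserialization from memory.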
Block obfuscation is currently done byte by byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable batches.
```
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27019.086 s ±  112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs
```
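The idea can be illustrated in Python (the function names are made up; the PR does this with 64-bit words in C++): XORing aligned 8-byte chunks against the key at once yields exactly the same bytes as the byte-by-byte loop.

```python
def obfuscate_bytewise(data: bytes, key: bytes) -> bytes:
    # Current approach: XOR one byte at a time against the rotating 8-byte key.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def obfuscate_u64(data: bytes, key: bytes) -> bytes:
    # Batched approach: XOR whole 8-byte words at once, then handle the tail bytes.
    assert len(key) == 8
    key_word = int.from_bytes(key, "little")
    out = bytearray()
    n = len(data) // 8 * 8
    for i in range(0, n, 8):
        word = int.from_bytes(data[i:i + 8], "little") ^ key_word
        out += word.to_bytes(8, "little")
    for i in range(n, len(data)):  # unaligned tail, byte by byte
        out.append(data[i] ^ key[i % 8])
    return bytes(out)


data = bytes(range(256)) * 3 + b"tail"  # length deliberately not a multiple of 8
key = bytes([1, 2, 3, 4, 5, 6, 7, 8])
assert obfuscate_u64(data, key) == obfuscate_bytewise(data, key)
assert obfuscate_u64(obfuscate_u64(data, key), key) == data  # XOR is its own inverse
```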
The final UTXO set is written to disk in batches to avoid a gigantic spike at flush time. There is already a -dbbatchsize config option to change this value; this PR only adjusts the default. By increasing the default batch size, we can reduce the overhead of repeated compaction cycles, minimize the constant per-batch overhead, and achieve more sequential writes.
Note that this PR mainly optimizes a critical section of IBD (memory to disk dump) - even if the effect on overall speed is modest:
```
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26711.460 s ±  244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs
```
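The amortization argument can be made concrete with a toy cost model (the constants below are made up, purely for illustration): each batch pays a fixed setup/compaction overhead, each entry a fixed write cost, so fewer and larger batches reduce the total.

```python
import math


def total_flush_cost(entries: int, batch_size: int,
                     per_entry: float = 1.0, per_batch: float = 5000.0) -> float:
    # Toy model: total cost = (number of batches) * constant per-batch overhead
    #                       + (number of entries) * per-entry write cost.
    return math.ceil(entries / batch_size) * per_batch + entries * per_entry


# Fewer, larger batches amortize the constant overhead over more entries:
small_batches = total_flush_cost(10_000_000, batch_size=1_000)    # 10'000 batches
large_batches = total_flush_cost(10_000_000, batch_size=100_000)  # 100 batches
assert large_batches < small_batches
```

The per-entry term is unchanged either way; only the constant term shrinks, which is consistent with the modest overall effect but a noticeably faster flush phase.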
The commits merge similar (de)serialization methods and separate them internally with if constexpr - similar to what was done in #28203. This enabled further SizeComputer optimizations as well.
Other than that, since single-byte writes are used very often (for every (u)int8_t, std::byte, or bool, and for every VarInt’s first byte - which is also needed for every (pre)vector), it makes sense to avoid the generalized serialization infrastructure for them.
```
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26326.867 s ±   45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs
```
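The single-byte fast path matters because Bitcoin's CompactSize encoding (used for nearly every vector length prefix) fits all values below 253 into exactly one byte. A minimal Python sketch of that encoding:

```python
def write_compact_size(n: int) -> bytes:
    # Bitcoin's CompactSize encoding: values below 253 take a single byte,
    # which is why a dedicated single-byte write path pays off so often.
    if n < 253:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    return b"\xff" + n.to_bytes(8, "little")


assert write_compact_size(252) == b"\xfc"             # single byte
assert write_compact_size(253) == b"\xfd\xfd\x00"     # marker + 2-byte little-endian
```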
CheckBlock’s latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.
This PR improves performance and maintainability by introducing the following changes:
- Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
- Optimized the general case by replacing `std::set` with a sorted `std::vector` for improved locality.
- Simplified null-prevout checks from linear to constant time.
```
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26084.429 s ±  473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs
```
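The shape of the specialized duplicate-prevout check can be sketched in Python (illustrative only, not Core's code): trivial answers for the dominant 1- and 2-input cases, and sorting plus an adjacent-pair scan (the locality-friendly analogue of a sorted `std::vector`) for the general case.

```python
def has_duplicate_inputs(prevouts: list) -> bool:
    # Fast paths for the most common transaction shapes:
    if len(prevouts) <= 1:
        return False                        # 0 or 1 input can never collide
    if len(prevouts) == 2:
        return prevouts[0] == prevouts[1]   # a single comparison
    # General case: sort once, then any duplicate must be adjacent.
    ordered = sorted(prevouts)
    return any(a == b for a, b in zip(ordered, ordered[1:]))


# Prevouts modeled as (txid, output-index) tuples:
assert not has_duplicate_inputs([("tx1", 0)])
assert has_duplicate_inputs([("tx1", 0), ("tx1", 0)])
assert has_duplicate_inputs([("tx1", 0), ("tx2", 1), ("tx1", 0)])
```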
The in-memory representation of the UTXO set uses (salted) SipHash to avoid key collision attacks.
Hashing a `uint256` key is done so often that a specialized optimization was extracted into SipHashUint256Extra. The constant salting operations were already precomputed in the general case; this PR adjusts the main specialization similarly.
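Bitcoin Core's implementation is specialized C++, but the hoisting idea can be shown with a plain-Python SipHash-2-4 sketch (function names here are illustrative): the salt-dependent initial state is computed once per hasher and then reused for every key, instead of being recomputed on every hash call.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x, b):
    return ((x << b) | (x >> (64 - b))) & MASK64


def _sipround(v0, v1, v2, v3):
    v0 = (v0 + v1) & MASK64; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK64; v3 = _rotl(v3, 16); v3 ^= v2
    v0 = (v0 + v3) & MASK64; v3 = _rotl(v3, 21); v3 ^= v0
    v2 = (v2 + v1) & MASK64; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def init_state(k0, k1):
    # The salt-dependent initial state: computing this once per hasher (rather
    # than per hashed key) is the kind of constant-hoisting described above.
    return (k0 ^ 0x736F6D6570736575, k1 ^ 0x646F72616E646F6D,
            k0 ^ 0x6C7967656E657261, k1 ^ 0x7465646279746573)


def siphash24(state, data: bytes) -> int:
    v0, v1, v2, v3 = state
    n = len(data)
    for i in range(0, n - n % 8, 8):  # full 8-byte blocks
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0 ^= m
    # Final block: remaining bytes plus the message length in the top byte.
    m = ((n & 0xFF) << 56) | int.from_bytes(data[n - n % 8:], "little")
    v3 ^= m
    v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0 ^= m
    v2 ^= 0xFF
    for _ in range(4):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3


# The salted state is computed once and reused for many 32-byte "uint256" keys:
STATE = init_state(0x0706050403020100, 0x0F0E0D0C0B0A0908)
assert siphash24(STATE, b"") == 0x726FDB47DD0E0E31  # SipHash-2-4 reference vector
```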
Other similar efforts awaiting review or revival (not included in this tracking PR):
- #31132 - pre-warms the in-memory cache on multiple threads (10% IBD speedup for small in-memory caches)
- #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
- #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
- #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
- #32128 - draft PR showcasing a few other possible caching speedups
This PR is meant to stay in draft (it is not meant to be merged directly) and to change continually based on comments received here and in the linked PRs. Comments, reproducers, and high-level discussions are welcome here; code reviews should happen in the individual PRs.