[IBD] Tracking PR for speeding up Initial Block Download #32043

pull l0rinc wants to merge 24 commits into bitcoin:master from l0rinc:l0rinc/IBD-optimizations changing 29 files +985 −326
  1. l0rinc commented at 4:20 pm on March 12, 2025: contributor

    During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.

    Summary: 18% full IBD speedup

There isn't much low-hanging fruit left, but big speed improvements can still be achieved through many small, focused changes. Many optimization opportunities are hiding in consensus-critical code - this tracking PR provides justification for why those should also be considered. The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured over multiple real runs until block 886'000 using a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours with this PR.

    Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

    PRs included here (in review priority order):

The UTXO count and average block size have drastically increased in the past few years, providing a better overall picture of how Bitcoin behaves under real load. Profiling IBD under these circumstances revealed many new optimization opportunities.

    Similar efforts in the past years

There were many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

    • #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
    • #28358 - allow the full UTXO set to fit into memory
    • #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
    • #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
    • #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
    • #30326 - favor the happy path for cache misses (~2% IBD speedup)
    • #30884 - Windows regression fix

    Reliable macro benchmarks

The measurements here were done on a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and a RAID configuration with multiple NVMe drives (total ~1.4TB fast storage) - a dedicated Hetzner Auction box running the latest Ubuntu. Sometimes a lower-end i7 with an HDD was used for comparison.

To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result statistically), producing reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark. To reduce the instability of headers synchronization and peer acquisition, we first started bitcoind until block 1, followed by the actual benchmarks until block 886'000.

    The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):

Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, and UTXO iteration) - still WIP and not yet used here.

    Current changes (in order of importance, reviews and reproducers are welcome):

Plotting the per-block performance from the produced debug.log files (taken from the last run for each commit, so the curves can differ slightly from the normalized averages quoted below) visualizes the effect of each commit:

```
import os
import sys
import re
from datetime import datetime
import matplotlib.pyplot as plt
import numpy as np


def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
    if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
        print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
        return

    debug_files = [f for f in os.listdir(log_dir) if
                   f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
    if not debug_files:
        print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
        return

    height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
    results = {}

    for filename in debug_files:
        filepath = os.path.join(log_dir, filename)
        print(f"Processing {filename}...", file=sys.stderr)

        update_tips = []
        first_timestamp = None
        line_count = tip_count = 0
        found_shutdown_done = False

        try:
            with open(filepath, 'r', errors='ignore') as file:
                for line_number, line in enumerate(file, 1):
                    line_count += 1
                    if line_count % 100000 == 0:
                        print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                    if not found_shutdown_done:
                        if "Shutdown: done" in line:
                            found_shutdown_done = True
                            print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                  file=sys.stderr)
                        continue

                    if len(line) < 20 or "UpdateTip:" not in line:
                        continue

                    try:
                        timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                        height_match = height_pattern.search(line)
                        if not height_match:
                            continue

                        height = int(height_match.group(1))
                        if first_timestamp is None:
                            first_timestamp = timestamp

                        update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                        tip_count += 1
                    except ValueError:
                        continue
        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)
            continue

        print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

        if update_tips:
            time_dict = {}
            for time, height in update_tips:
                time_dict[time] = height
            results[filename[6:14]] = sorted(time_dict.items())

    if not results:
        print("No valid data found in any files.", file=sys.stderr)
        return

    print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

    sorted_results = []
    for name, pairs in results.items():
        if pairs:
            sorted_results.append((name, pairs[-1][0] / 3600, pairs))

    sorted_results.sort(key=lambda x: x[1], reverse=True)
    colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

    # Plot 1: Height vs Time
    plt.figure(figsize=(12, 8))

    final_points = []
    for idx, (name, last_time, pairs) in enumerate(sorted_results):
        times = [t / 3600 for t, _ in pairs]
        heights = [h for _, h in pairs]
        plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
        if pairs:
            final_points.append((last_time, pairs[-1][1], colors[idx]))

    for time, height, color in final_points:
        plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
        plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

    plt.title('Sync Time by Block Height')
    plt.xlabel('Block Height')
    plt.ylabel('Elapsed Time (hours)')
    plt.grid(True, linestyle='--', alpha=0.7)
    plt.legend(loc='center left')
    plt.tight_layout()

    plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

    # Plot 2: Performance Ratio by Time
    if len(sorted_results) > 1:
        plt.figure(figsize=(12, 8))

        baseline = sorted_results[0]
        baseline_time_by_height = {h: t for t, h in baseline[2]}

        for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
            time_by_height = {h: t for t, h in pairs}

            common_heights = [h for h in baseline_time_by_height.keys()
                              if h >= 400000 and h in time_by_height]
            common_heights.sort()

            ratios = []
            base_times = []

            for h in common_heights:
                base_t = baseline_time_by_height[h]
                result_t = time_by_height[h]

                if result_t > 0:
                    ratios.append(base_t / result_t)
                    base_times.append(base_t / 3600)

            plt.plot(base_times, ratios,
                     label=f"{name} vs {baseline[0]}",
                     color=colors[idx], linewidth=1)

        plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

        plt.title('Performance Improvement Over Time (Higher is Better)')
        plt.xlabel('Baseline Elapsed Time (hours)')
        plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='best')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

    with open(output_file.replace('.png', '.csv'), 'w') as f:
        for name, _, pairs in sorted_results:
            f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

    plt.show()


if __name__ == "__main__":
    log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
    process_log_files_and_plot(log_dir, output_file)
```
    

    Baseline

    Base commit was 88debb3e42.

```
COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
  Range (min … max):   30821.671 s … 31043.549 s    2 runs
```
    

```
COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
  Range (min … max):   28416.815 s … 28586.361 s    2 runs
```
    

We can serialize the blocks and undo data to any Stream which implements the appropriate read/write methods. AutoFile is one of these, writing the results "directly" to disk (through the OS file cache). Batching the data in memory first and reading/writing it to disk in larger chunks is measurably faster (likely because of fewer native fread calls and less locking, as observed by @martinus in a similar change).

Differential flame graphs indicate that the before/after speed change comes from fewer AutoFile reads and writes.
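As a rough sketch of the batching idea on the write side (a hypothetical helper, not the PR's exact code - `WriteBlockBatched` is a made-up name):

```
// Serialize into a presized in-memory buffer first, then hit the disk once,
// instead of letting each field trickle through AutoFile's small writes.
void WriteBlockBatched(AutoFile& fileout, const CBlock& block)
{
    DataStream buf;
    buf.reserve(GetSerializeSize(TX_WITH_WITNESS(block))); // SizeComputer pass
    buf << TX_WITH_WITNESS(block);    // many small writes, all in memory
    fileout.write(MakeByteSpan(buf)); // one large write through the OS cache
}
```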


```
COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
  Range (min … max):   26994.075 s … 27794.346 s    2 runs
```
    

Block obfuscation is currently done byte-by-byte; this PR batches the XOR into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.

    obfuscation calls during IBD without batching
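A minimal sketch of the batching (assuming a little-endian host and a buffer starting at key offset 0; the real `util::Xor` also rotates the key for unaligned offsets and specializes the tail):

```
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <span>

void XorBatched(std::span<std::byte> data, uint64_t key)
{
    size_t i{0};
    for (; i + 8 <= data.size(); i += 8) { // bulk of the work: 64 bits at a time
        uint64_t chunk;
        std::memcpy(&chunk, data.data() + i, 8); // compilers turn this into a plain load
        chunk ^= key;
        std::memcpy(data.data() + i, &chunk, 8);
    }
    for (; i < data.size(); ++i) { // at most 7 trailing bytes
        data[i] ^= static_cast<std::byte>(key >> (8 * (i % 8)));
    }
}
```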


```
COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
  Range (min … max):   26939.649 s … 27098.522 s    2 runs
```
    

The final UTXO set is written to disk in batches to avoid a gigantic memory spike at flush time. There is already a -dbbatchsize config option to change this value; this PR only adjusts the default. By increasing the default batch size, we can reduce overhead from repeated compaction cycles, minimize the constant overhead per batch, and achieve more sequential writes.
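The change itself is essentially a one-line default bump (a sketch; the constant lives in src/txdb.h, verify the exact name against master):

```
//! -dbbatchsize default (bytes): raised from 16 MiB to 64 MiB
static const int64_t nDefaultDbBatchSize = 64 << 20;
```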

Note that this PR mainly optimizes a critical section of IBD (the memory-to-disk dump) - even if the effect on overall speed is modest:


```
COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
  Range (min … max):   26538.843 s … 26884.077 s    2 runs
```
    

The commits merge similar (de)serialization methods and separate them internally with if constexpr - similarly to what was done in #28203. This enables further SizeComputer optimizations as well.

Beyond that, since single-byte writes are used very often (for every (u)int8_t, std::byte, or bool, and for every VarInt's first byte - which is also needed for every (pre)Vector), it makes sense to bypass the generalized serialization infrastructure for them.
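A minimal sketch of the fast path (the `write_byte` name and the writer class are illustrative, not the actual API):

```
#include <cstddef>
#include <cstdint>
#include <span>
#include <vector>

class VectorWriterLike
{
    std::vector<std::byte> m_data;

public:
    void write(std::span<const std::byte> src) { m_data.insert(m_data.end(), src.begin(), src.end()); }
    void write_byte(std::byte b) { m_data.push_back(b); } // no span, no range insert
};

template <typename Stream>
void ser_writedata8(Stream& s, uint8_t obj)
{
    s.write_byte(std::byte{obj}); // instead of s.write(<span of one byte>)
}
```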


```
COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
  Range (min … max):   26294.420 s … 26359.314 s    2 runs
```
    

    CheckBlock’s latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

    This PR improves performance and maintainability by introducing the following changes:

    • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
    • Optimized the general case by replacing std::set with a sorted std::vector for improved locality (see the sketch after this list).
    • Simplified null prevout checks from linear to constant time.
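A sketch of the sorted-vector duplicate check (the helper name is hypothetical; `CTxIn`/`COutPoint` are the real types from primitives/transaction.h):

```
#include <algorithm>
#include <vector>

#include <primitives/transaction.h> // CTxIn, COutPoint

// Contiguous storage sorts and compares faster than a node-based std::set.
bool HasDuplicateInputs(const std::vector<CTxIn>& vin)
{
    std::vector<COutPoint> prevouts;
    prevouts.reserve(vin.size()); // pre-sized: one allocation, no rebalancing
    for (const auto& in : vin) prevouts.push_back(in.prevout);
    std::sort(prevouts.begin(), prevouts.end());
    return std::adjacent_find(prevouts.begin(), prevouts.end()) != prevouts.end();
}
```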

```
COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
  Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
  Range (min … max):   25749.536 s … 26419.323 s    2 runs
```
    

The in-memory representation of the UTXO set uses (salted) SipHash to avoid key-collision attacks.

Hashing a uint256 key happens so often that a specialized implementation, SipHashUint256Extra, was extracted for it. The constant salting operations were already precomputed in the general case; this PR adjusts the main specialization similarly.
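A sketch of the cached-state idea (the class shape is taken from the commit below; SipHash round details omitted):

```
#include <cstdint>

#include <uint256.h>

// Cache the salted initial SipHash state once per salt, instead of
// re-deriving it from k0/k1 on every call.
class Uint256ExtraSipHasher
{
    const uint64_t m_v0, m_v1, m_v2, m_v3; // precomputed constants ^ salt

public:
    Uint256ExtraSipHasher(uint64_t k0, uint64_t k1)
        : m_v0{0x736f6d6570736575ULL ^ k0},
          m_v1{0x646f72616e646f6dULL ^ k1},
          m_v2{0x6c7967656e657261ULL ^ k0},
          m_v3{0x7465646279746573ULL ^ k1} {}

    // Runs the usual SipRounds over the uint256 words plus the extra value,
    // starting from the cached v0..v3 instead of recomputing the XORs.
    uint64_t operator()(const uint256& val, uint32_t extra) const;
};
```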


    Other similar efforts waiting for reviews or revives (not included in this tracking PR):

    • #31132 - pre-warms the in-memory cache on multiple threads (10% IBD speedup for small in-memory caches)
    • #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
    • #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
    • #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
    • #32128 - draft PR showcasing a few other possible caching speedups

This PR is meant to stay in draft (not to be merged directly) and to change continually based on comments received here and in the sub-PRs. Comments, reproducers, and high-level discussions are welcome here - code reviews should rather be done in the individual PRs.

  2. DrahtBot commented at 4:20 pm on March 12, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK jonatack

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #31868 ([IBD] specialize block serialization by l0rinc)
    • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
    • #31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)
    • #31551 ([IBD] batch block reads/writes during AutoFile serialization by l0rinc)
    • #31519 (refactor: Use std::span over Span by maflcko)
    • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
    • #30442 ([IBD] precalculate SipHash constant salt calculations by l0rinc)
    • #30214 (refactor: Improve assumeutxo state representation by ryanofsky)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] - Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  4. l0rinc renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  5. ryanofsky commented at 5:14 pm on March 12, 2025: contributor

    Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without needing to necessarily repeat it in the individual PRs.

Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell, the individual optimizations do not seem very complicated and seem like they should be justified.


    One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:

  6. laanwj added the label Block storage on Mar 12, 2025
  7. laanwj added the label P2P on Mar 12, 2025
  8. optimization: Bulk serialization reads in `UndoRead` and `ReadBlock`
The obfuscation (XOR) operations are currently done byte-by-byte during serialization; buffering the reads will enable batching the obfuscation operations later (not yet done here).
    
Also, different operating systems seem to handle file caching differently, so reading bigger batches (and processing them from memory) is also a bit faster (likely because of fewer native fread calls or less locking).
    
Since `ReadBlock[Undo]` is called with the file position set after the [undo]block size, we have to start by backtracking 4 bytes to be able to read the expected size first.
    As a consequence, the `FlatFilePos pos` parameter in `ReadBlock` is copied now.
    
`HashVerifier` was included in the try/catch so that the `undo_size` serialization is covered as well, since the try is about `Deserialize` errors. This is why the final checksum verification was also included in the try.
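A rough sketch of the resulting read path (names approximate, not the PR's literal code; assumes the 4-byte size prefix written right before the block data):

```
// Back up over the size field, read the expected size, then do one bulk
// read and deserialize from memory instead of many small fread calls.
pos.nPos -= 4; // `pos` points past the size prefix, hence the copied parameter
auto filein{OpenBlockFile(pos, /*fReadOnly=*/true)};
uint32_t blk_size;
filein >> blk_size;
std::vector<uint8_t> mem(blk_size);
filein.read(MakeWritableByteSpan(mem)); // single buffered read
DataStream ds{mem};
ds >> TX_WITH_WITNESS(block); // deserialize from the in-memory buffer
```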
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ReadBlockBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,289,743.62 |              436.73 |    0.3% |     11.03 | `ReadBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,724,703.14 |              579.81 |    0.4% |     11.06 | `ReadBlockBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    Before:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        7,786,309.20 |              128.43 |    0.0% |   70,832,812.80 |   23,803,523.16 |  2.976 |   5,073,002.56 |    0.4% |     10.72 | `ReadBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        6,272,557.28 |              159.42 |    0.0% |   63,251,231.42 |   19,739,780.92 |  3.204 |   3,589,886.66 |    0.3% |     10.57 | `ReadBlockBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    edb2575fb6
  9. Add `AutoFile::write_large` for batching obfuscation operations
Instead of copying the data and doing the XOR in a 4096-byte array, we do it directly on the input.
    
A `DataStream` constructor was also added to enable presized serialization and writing in a single call.
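A hedged sketch of what such a method can look like (member names assumed from AutoFile's current shape, not copied from the PR):

```
// Obfuscate the caller's buffer in place and write it with one fwrite,
// instead of staging 4096-byte obfuscated copies first.
void AutoFile::write_large(Span<std::byte> src)
{
    util::Xor(src, m_xor, m_position ? *m_position : 0); // in place: src is consumed
    if (std::fwrite(src.data(), 1, src.size(), m_file) != src.size()) {
        throw std::ios_base::failure("AutoFile::write_large: write failed");
    }
    if (m_position) *m_position += src.size();
}
```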
    e18c96a3fe
  10. optimization: Bulk serialization writes in `SaveBlockUndo` and `SaveBlock`
Similarly to the serialization reads, buffered writes enable batched XOR calculations - especially since we currently need to copy the input Span to do the obfuscation on it; batching lets us do the XOR on the internal buffer instead.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaveBlockBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    Before:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        5,267,613.94 |              189.84 |    1.0% |     11.05 | `SaveBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,767,367.40 |              565.81 |    1.6% |     10.86 | `SaveBlockBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    Before:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,128,530.90 |              242.22 |    3.8% |   19,358,001.33 |    8,601,983.31 |  2.250 |   3,079,334.76 |    0.4% |     10.64 | `SaveBlockBench`
    
    After:
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        3,130,556.05 |              319.43 |    4.7% |   17,305,378.56 |    6,457,946.37 |  2.680 |   2,579,854.87 |    0.3% |     10.83 | `SaveBlockBench`
    
    Co-authored-by: Cory Fields <cory-nospam-@coryfields.com>
    23ed684c6a
  11. log: unify error messages for (read/write)[undo]block
    Co-authored-by: maflcko <6399679+maflcko@users.noreply.github.com>
    d0a86b343d
  12. test: Compare util::Xor with randomized inputs against simple impl
    Since production code only uses keys of length 8, we're not testing with other values anymore
    34afcc90c0
  13. bench: Make Xor benchmark more representative
To make the benchmarks representative, I've collected the write-vector sizes during IBD for every invocation of `util::Xor` until 860k blocks and used them as the basis for the micro-benchmarks, reproducing a similar distribution with random data (taking the 1000 most frequent sizes and making sure the very big ones are also covered).
    
And even though we already have serialization tests, `AutoFileXor` was added to serialize 1 MB via the provided key_bytes.
This was used to test the effect of disabling obfuscation.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.07 |      937,527,289.88 |    0.4% |     10.24 | `AutoFileXor`
    |                0.87 |    1,149,859,017.49 |    0.3% |     10.80 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.87 |      535,253,389.72 |    0.0% |            9.20 |            3.45 |  2.669 |           1.03 |    0.1% |     11.02 | `AutoFileXor`
    |                1.70 |      587,844,715.57 |    0.0% |            9.35 |            5.41 |  1.729 |           1.05 |    1.7% |     10.95 | `XorHistogram`
    8ce0670506
  14. optimization: Xor 64 bits together instead of byte-by-byte
    `util::Xor` method was split out into more focused parts:
* one which assumes that the `uint64_t` key is properly aligned, doing the first few xors as 64 bits (the memcpy is eliminated in most compilers), and the last iteration is optimized for 8/16/32 bytes.
    * an unaligned `uint64_t` key with a `key_offset` parameter which is rotated to accommodate the data (adjusting for endianness).
    * a legacy `std::vector<std::byte>` key with an asserted 8 byte size, converted to `uint64_t`.
    
Note that the default statement alone would pass the tests, but it would be very slow, since the 1-, 2- and 4-byte versions wouldn't be specialized by the compiler - hence the switch.
    
    Asserts were added throughout the code to make sure every such vector has length 8, since in the next commit we're converting all of them to `uint64_t`.
    
    refactor: Migrate fixed-size obfuscation end-to-end from `std::vector<std::byte>` to `uint64_t`
    
    Since `util::Xor` accepts `uint64_t` values, we're eliminating any repeated vector-to-uint64_t conversions going back to the loading/saving of these values (we're still serializing them as vectors, but converting as soon as possible to `uint64_t`). This is the reason the tests still generate vector values and convert to `uint64_t` later instead of generating it directly.
    
We also short-circuit `Xor` calls with 0 key values early to avoid unnecessary calculations (e.g. `MakeWritableByteSpan`) - even if we assume that Xor is never called with a 0 key.
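In code, the up-front conversion boils down to a fragment like this (a sketch; `key_bytes` stands for the legacy `std::vector<std::byte>` key):

```
// Convert the legacy 8-byte vector key to uint64_t once, up front:
assert(key_bytes.size() == 8);
uint64_t key;
std::memcpy(&key, key_bytes.data(), 8); // native byte order, as the 64-bit Xor expects
```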
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/src/bench/bench_bitcoin -filter='XorHistogram|AutoFileXor' -min-time=10000
    
    C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                0.09 |   10,799,585,470.46 |    1.3% |     11.00 | `AutoFileXor`
    |                0.14 |    7,144,743,097.97 |    0.2% |     11.01 | `XorHistogram`
    
    C++ compiler .......................... GNU 13.2.0
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                0.59 |    1,706,433,032.76 |    0.1% |            0.00 |            0.00 |  0.620 |           0.00 |    1.8% |     11.01 | `AutoFileXor`
    |                0.47 |    2,145,375,849.71 |    0.0% |            0.95 |            1.48 |  0.642 |           0.20 |    9.6% |     10.93 | `XorHistogram`
    
    ----
    
    A few other benchmarks that seem to have improved as well (tested with Clang only):
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,237,168.64 |              446.99 |    0.3% |     10.91 | `ReadBlockFromDiskTest`
    |          748,837.59 |            1,335.40 |    0.2% |     10.68 | `ReadRawBlockFromDiskTest`
    
    After:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,827,436.12 |              547.21 |    0.7% |     10.95 | `ReadBlockFromDiskTest`
    |           49,276.48 |           20,293.66 |    0.2% |     10.99 | `ReadRawBlockFromDiskTest`
    bb9cb81607
  15. bench: measure block (size)serialization speed
The SizeComputer is a special serializer which returns the exact final size of the serialized content.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          936,285.45 |            1,068.05 |    0.1% |     11.01 | `DeserializeBlock`
    |          194,330.04 |            5,145.88 |    0.2% |     10.97 | `SerializeBlock`
    |           12,215.05 |           81,866.19 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,447,243.87 |              224.86 |    0.0% |   53,689,737.58 |   15,966,336.86 |  3.363 |   2,409,315.46 |    0.5% |     11.01 | `DeserializeBlock`
    |          869,833.14 |            1,149.65 |    0.0% |    8,015,883.90 |    3,123,013.80 |  2.567 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           26,535.51 |           37,685.36 |    0.0% |      225,261.03 |       95,278.40 |  2.364 |      53,037.03 |    0.6% |     11.00 | `SizeComputerBlock`
    99b2c2a862
  16. refactor: reduce template bloat in primitive serialization
Merged multiple template methods into a single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can still be optimized thanks to C++20 constexpr conditions).
This unifies related methods that were previously bound only by similar signatures - and enables `SizeComputer` optimizations later.
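The rough shape of such a merged method (an illustrative sketch; `SerWriteDataLE` is a made-up name, and the `htole*` helpers stand for the usual endian conversions from compat/endian.h):

```
#include <cstdint>
#include <span>
#include <type_traits>

template <typename Stream, typename T>
void SerWriteDataLE(Stream& s, T obj)
{
    static_assert(std::is_unsigned_v<T>);
    // One method instead of separate ser_writedata8/16/32/64 overloads;
    // the compiler prunes the dead branches, so there is no runtime cost.
    if constexpr (sizeof(T) == 2)      obj = htole16(obj);
    else if constexpr (sizeof(T) == 4) obj = htole32(obj);
    else if constexpr (sizeof(T) == 8) obj = htole64(obj);
    s.write(std::as_bytes(std::span{&obj, 1}));
}
```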
    794180e8f8
  17. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` 028c006541
  18. optimization: Add single byte write
Single-byte writes are used very often (for every (u)int8_t, std::byte or bool, and for every VarInt's first byte - which is also needed for every (pre)Vector).
It makes sense to avoid the generalized serialization infrastructure where it isn't needed:
* AutoFile write no longer needs to allocate a 4k buffer for a single byte;
* `VectorWriter` and `DataStream` avoid memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          934,120.45 |            1,070.53 |    0.2% |     11.01 | `DeserializeBlock`
    |          170,719.27 |            5,857.57 |    0.1% |     10.99 | `SerializeBlock`
    |           12,048.40 |           82,998.58 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,433,835.04 |              225.54 |    0.0% |   53,688,481.60 |   15,918,730.23 |  3.373 |   2,409,056.47 |    0.5% |     11.01 | `DeserializeBlock`
    |          563,663.10 |            1,774.11 |    0.0% |    7,386,775.59 |    2,023,525.77 |  3.650 |   1,385,368.57 |    0.5% |     11.00 | `SerializeBlock`
    |           27,351.60 |           36,560.93 |    0.1% |      225,261.03 |       98,209.77 |  2.294 |      53,037.03 |    0.9% |     11.00 | `SizeComputerBlock`
    c02600b8e1
  19. optimization: merge SizeComputer specializations + add new ones
Endianness doesn't affect the final size, so we can skip the byte swapping for `SizeComputer`.
We can `if constexpr` previous calls into the existing method, short-circuiting the existing logic when we only need the serialized sizes.
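Sketched on top of the merged method from the earlier commit (again illustrative; `SizeComputer::seek` is the size-advancing call in serialize.h):

```
template <typename Stream, typename T>
void SerWriteDataLE(Stream& s, T obj)
{
    if constexpr (std::is_same_v<Stream, SizeComputer>) {
        s.seek(sizeof(T)); // byte order doesn't change the size - skip the swap
    } else {
        if constexpr (sizeof(T) == 2)      obj = htole16(obj);
        else if constexpr (sizeof(T) == 4) obj = htole32(obj);
        else if constexpr (sizeof(T) == 8) obj = htole64(obj);
        s.write(std::as_bytes(std::span{&obj, 1}));
    }
}
```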
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock|DeserializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          888,859.82 |            1,125.04 |    0.4% |     10.87 | `DeserializeBlock`
    |          168,502.88 |            5,934.62 |    0.1% |     10.99 | `SerializeBlock`
    |           10,200.88 |           98,030.75 |    0.1% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,460,428.52 |              224.19 |    0.0% |   53,692,507.13 |   16,015,347.97 |  3.353 |   2,410,105.48 |    0.5% |     11.01 | `DeserializeBlock`
    |          567,042.65 |            1,763.54 |    0.0% |    7,386,775.59 |    2,035,613.84 |  3.629 |   1,385,368.57 |    0.5% |     11.01 | `SerializeBlock`
    |           25,728.56 |           38,867.32 |    0.0% |      172,750.03 |       92,366.64 |  1.870 |      42,131.03 |    1.7% |     11.00 | `SizeComputerBlock`
    c072498305
  20. test: validate duplicate detection in `CheckTransaction`
    The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs.
    
    This update to the tests ensures that:
    * Accurate detection of duplicates: Beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
    * Consistency across methods: Both sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
    * Robust ordering and equality relations: The function maintains expected behavior for ordering and equality checks.
    
Using randomized testing with shuffled inputs (to avoid any residual ordering bias), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms behavior identical to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.
    
To make sure the new branches in the follow-up commits will be covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and also against an alternative duplicate-detection approach). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
    b07cdbe542
  21. bench: measure `CheckBlock` speed separately from serialization
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          372,743.63 |            2,682.81 |    1.1% |     10.99 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,304,694.54 |              302.60 |    0.5% |     11.05 | `DuplicateInputs`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        1,096,261.84 |              912.19 |    0.1% |    7,963,390.88 |    3,487,375.26 |  2.283 |   1,266,941.00 |    1.8% |     11.03 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        8,366,309.48 |              119.53 |    0.0% |   23,865,177.67 |   26,620,160.23 |  0.897 |   5,972,887.41 |    4.0% |     10.78 | `DuplicateInputs`
    99fe67e132
  22. bench: add `ProcessTransactionBench` to measure `CheckBlock` in context
    The newly introduced `ProcessTransactionBench` incorporates multiple steps in the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction validation context.
    
    Previous microbenchmarks, such as DeserializeAndCheckBlockTest and DuplicateInputs, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |            9,585.10 |          104,328.55 |    0.1% |     11.03 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           56,199.57 |           17,793.73 |    0.1% |      229,263.01 |      178,766.31 |  1.282 |      15,509.97 |    0.5% |     10.91 | `ProcessTransactionBench`
    452baf49e1
  23. optimization: move duplicate checks outside of coinbase branch
`IsCoinBase` means a single input with a NULL prevout, so it makes sense to restrict the duplicate check to non-coinbase transactions only.
The behavior is the same as before, except that single-input transactions aren't checked for duplicates anymore (~70-90% of cases, see https://transactionfee.info/charts/transactions-1in).
I've added braces to the conditions and loops to simplify review of the follow-up commits.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          335,917.12 |            2,976.92 |    1.3% |     11.01 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,286,337.42 |              304.29 |    1.1% |     10.90 | `DuplicateInputs`
    |            9,561.02 |          104,591.35 |    0.2% |     11.02 | `ProcessTransactionBench`
    45f8cda9bb
  24. optimization: simplify duplicate checks for trivial inputs
No need to create a set to check for duplicates in two-input transactions.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          314,137.30 |            3,183.32 |    1.2% |     11.04 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,220,592.73 |              310.50 |    1.3% |     10.92 | `DuplicateInputs`
    |            9,425.98 |          106,089.77 |    0.3% |     11.00 | `ProcessTransactionBench`
    2a5df20ca9
  25. optimization: replace tree with sorted vector
A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks.
It's also simpler (and therefore more reliable) than a sorted set, and it causes less memory fragmentation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          181,922.54 |            5,496.85 |    0.2% |     10.98 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          997,739.30 |            1,002.27 |    1.0% |     10.94 | `DuplicateInputs`
    |            9,449.28 |          105,828.15 |    0.3% |     10.99 | `ProcessTransactionBench`
    
    Co-authored-by: Pieter Wuille <pieter@wuille.net>
    ce6840f701
  26. optimization: look for NULL prevouts in the sorted values
For the two-input case we simply check both, as we did for equality.

For the general case, we take advantage of the sorting, making invalid-value detection constant time instead of linear in the worst case.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          179,971.00 |            5,556.45 |    0.3% |     11.02 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          963,177.98 |            1,038.23 |    1.7% |     10.92 | `DuplicateInputs`
    |            9,410.90 |          106,259.75 |    0.3% |     11.01 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          834,855.94 |            1,197.81 |    0.0% |    6,518,548.86 |    2,656,039.78 |  2.454 |     919,160.84 |    1.5% |     10.78 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,261,492.75 |              234.66 |    0.0% |   17,379,823.40 |   13,559,793.33 |  1.282 |   4,265,714.28 |    3.4% |     11.00 | `DuplicateInputs`
    |           55,819.53 |           17,914.88 |    0.1% |      227,828.15 |      177,520.09 |  1.283 |      15,184.36 |    0.4% |     10.91 | `ProcessTransactionBench`
    9f601c0cab
  27. coins: bump default LevelDB write batch size to 64 MiB
    The UTXO set has grown significantly, and flushing it from memory to LevelDB often takes over 20 minutes after a successful IBD with large dbcache values.
    The final UTXO set is written to disk in batches, which LevelDB sorts into SST files.
    By increasing the default batch size, we can reduce overhead from repeated compaction cycles, minimize constant overhead per batch, and achieve more sequential writes.
    
    Experiments with different batch sizes (loaded via assumeutxo at block 840k, then measuring final flush time) show that 64 MiB batches significantly reduce flush time without notably increasing memory usage:
    
    | dbbatchsize | flush_sum (ms) |
    |-------------|----------------|
    | 8 MiB       | ~240,000       |
    | 16 MiB      | ~220,000       |
    | 32 MiB      | ~200,000       |
    | *64 MiB*    | *~150,000*     |
    | 128 MiB     | ~156,000       |
    | 256 MiB     | ~166,000       |
    | 512 MiB     | ~186,000       |
    | 1 GiB       | ~186,000       |
    
    Checking the impact of a `-reindex-chainstate` with `-stopatheight=878000` and `-dbcache=30000` gives:
    16 << 20
    ```
    2025-01-12T07:31:05Z Flushed fee estimates to fee_estimates.dat.
    2025-01-12T07:31:05Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
    2025-01-12T07:53:51Z Shutdown: done
    ```
    Flush time: 22 minutes and 46 seconds
    
64 << 20
    ```
    2025-01-12T18:30:00Z Flushed fee estimates to fee_estimates.dat.
    2025-01-12T18:30:00Z [warning] Flushing large (26 GiB) UTXO set to disk, it may take several minutes
    2025-01-12T18:44:43Z Shutdown: done
    ```
    Flush time: ~14 minutes 43 seconds.
    626d55b9a8
  28. l0rinc force-pushed on Mar 12, 2025
  29. DrahtBot added the label CI failed on Mar 12, 2025
  30. DrahtBot commented at 6:16 pm on March 12, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38650495272

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  31. bench: Add COutPoint and SaltedOutpointHasher benchmarks
    This commit introduces new benchmarks to measure the performance of various operations using
    SaltedOutpointHasher, including hash computation, set operations, and set creation.
    
    These benchmarks are intended to provide insights about coin caching performance (e.g. during IBD).
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               58.60 |       17,065,922.04 |    0.3% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.97 |       83,576,684.83 |    0.1% |     11.01 | `SaltedOutpointHasherBench_hash`
    |               14.50 |       68,985,850.12 |    0.3% |     10.96 | `SaltedOutpointHasherBench_match`
    |               13.90 |       71,942,033.47 |    0.4% |     11.03 | `SaltedOutpointHasherBench_mismatch`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              136.76 |        7,312,133.16 |    0.0% |        1,086.67 |          491.12 |  2.213 |         119.54 |    1.1% |     11.01 | `SaltedOutpointHasherBench_create_set`
    |               23.82 |       41,978,882.62 |    0.0% |          252.01 |           85.57 |  2.945 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               60.42 |       16,549,695.42 |    0.1% |          460.51 |          217.04 |  2.122 |          21.00 |    1.4% |     10.99 | `SaltedOutpointHasherBench_match`
    |               78.66 |       12,713,595.35 |    0.1% |          555.59 |          282.52 |  1.967 |          20.19 |    2.2% |     10.74 | `SaltedOutpointHasherBench_mismatch`
    24d35ec4ec
  32. test: Rename k1/k2 to k0/k1 for consistency 09131cc9d1
  33. refactor: Extract C0-C3 Siphash constants 39e33d4928
  34. optimization: refactor: Introduce Uint256ExtraSipHasher to cache SipHash constant state
Previously, only k0 and k1 were stored, causing the constant XOR operations to be recomputed on every call to `SipHashUint256Extra`.
This commit adds a dedicated `Uint256ExtraSipHasher` class that caches the initial state (v0-v3) and performs the `SipHash` computation on a `uint256` (with an extra parameter), hiding the constant computation details from higher-level code and improving efficiency.
This essentially brings the precalculations done in the `CSipHasher` constructor to the `uint256`-specialized SipHash implementation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               57.27 |       17,462,299.19 |    0.1% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.24 |       88,997,888.48 |    0.3% |     11.04 | `SaltedOutpointHasherBench_hash`
    |               13.91 |       71,902,014.20 |    0.2% |     11.01 | `SaltedOutpointHasherBench_match`
    |               13.29 |       75,230,390.31 |    0.1% |     11.00 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set - 17,462,299.19/17,065,922.04 - 2.3% faster
    hash       - 88,997,888.48/83,576,684.83 - 6.4% faster
    match      - 71,902,014.20/68,985,850.12 - 4.2% faster
    mismatch   - 75,230,390.31/71,942,033.47 - 4.5% faster
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              135.38 |        7,386,349.49 |    0.0% |        1,078.19 |          486.16 |  2.218 |         119.56 |    1.1% |     11.00 | `SaltedOutpointHasherBench_create_set`
    |               23.67 |       42,254,558.08 |    0.0% |          247.01 |           85.01 |  2.906 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               58.95 |       16,962,220.14 |    0.1% |          446.55 |          211.74 |  2.109 |          20.86 |    1.4% |     11.01 | `SaltedOutpointHasherBench_match`
    |               76.98 |       12,991,047.69 |    0.1% |          548.93 |          276.50 |  1.985 |          20.25 |    2.3% |     10.72 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set -  7,386,349.49/7,312,133.16  - 1% faster
    hash       - 42,254,558.08/41,978,882.62 - 0.6% faster
    match      - 16,962,220.14/16,549,695.42 - 2.4% faster
    mismatch   - 12,991,047.69/12,713,595.35 - 2% faster
    a06a674b43
  35. l0rinc force-pushed on Mar 12, 2025
  36. DrahtBot removed the label CI failed on Mar 12, 2025
  37. ajtowns commented at 6:18 am on March 13, 2025: contributor

    Plotting the performance of the blocks from the produced debug.log files shows (from the last run, can differ slightly from the normalized average shown below) the effect of each commit:

    Wouldn’t these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple “lower is better” comparison? (rather than “further left is better”)

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful? (Perhaps limited to height >= 400000)

  38. l0rinc commented at 9:16 am on March 13, 2025: contributor

    with a simple “lower is better” comparison

I like the idea - I've updated the description and the code accordingly.

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful?

Something like this? Conceptually it seems useful, but here I don't know how to interpret it - it seems too far zoomed in. It looks even funnier without the >400k cap:

  39. ajtowns commented at 12:45 pm on March 13, 2025: contributor

    Something like this? Conceptually seems useful, but here I don’t know how to interpret it, seems too far zoomed in

Fair; that might work better with time on the x-axis rather than height, something like:

```
for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
    for height in range(1, 850000):
        y = time_baseline[height] / time_commit[height]  # keep this one the same
        x = time_baseline[height]  # changed from x = height
        add_datapoint(x, y)
```
    
  40. l0rinc commented at 1:08 pm on March 13, 2025: contributor

    time on the x-axis rather than height

If we do that, we don't even need to filter out the first 400k blocks since they're so insignificant. Edit: you're right, it looks better to filter those out - I've updated the description with the images and code.

  41. jonatack commented at 3:52 pm on March 13, 2025: member

    Concept ACK, thank you for opening this.

I am currently in an environment with slow internet, where despite having a relatively fast laptop, IBD is more than two orders of magnitude slower than the times in the OP.

    Opened #32051 today to address an issue I’m also seeing of very frequent disconnections+reconnections of trusted addnode peers during IBD.

  42. DrahtBot added the label Needs rebase on Mar 20, 2025
  43. DrahtBot commented at 10:09 am on March 20, 2025: contributor

    🐙 This pull request conflicts with the target branch and needs rebase.

