[IBD] Tracking PR for speeding up Initial Block Download #32043

l0rinc wants to merge 24 commits into bitcoin:master from l0rinc:l0rinc/IBD-optimizations, changing 37 files (+1081 −325)
  1. l0rinc commented at 4:20 pm on March 12, 2025: contributor

    During the last Core Dev meeting, it was proposed to create a tracking PR aggregating the individual IBD optimizations - to illustrate how these changes contribute to the broader performance improvement efforts.

    Summary: 18% full IBD speedup

    There isn't much low-hanging fruit left, but big speed improvements can still be achieved through many small, focused changes. Many optimization opportunities are hiding in consensus-critical code - this tracking PR provides justification for why those should also be considered. The unmerged changes here collectively achieve a ~18% speedup for full IBD (measured by multiple real runs until block 886'000 using a 5GiB in-memory cache): from 8.59 hours on master to 7.25 hours for the PR.

    Anyone can (and is encouraged to) reproduce the results by following this guide: https://gist.github.com/l0rinc/83d2bdfce378ad7396610095ceb7bed5

    PRs included here (in review priority order):

    The UTXO count and average block size have increased drastically in the past few years, giving a more realistic picture of how Bitcoin behaves under real load. Profiling IBD under these circumstances revealed many new optimization opportunities.

    Similar efforts in the past years

    There have been many efforts to make sure Bitcoin Core remains performant in light of these new trends; a few recent notable examples include:

    • #25325 - use specialized pool allocator for in-memory cache (~21% faster IBD)
    • #28358 - allow the full UTXO set to fit into memory
    • #28280 (comment) - fine-grained in-memory cache eviction for pruned nodes (~30% IBD speedup on pruned nodes)
    • #30039 (comment) - reduce LevelDB writes, compactions and open files (~30% faster IBD for small in-memory cache)
    • #31490, #30849, #30906 - refactors derisking/enabling follow-up optimizations
    • #30326 - favor the happy path for cache misses (~2% IBD speedup)
    • #30884 - Windows regression fix

    Reliable macro benchmarks

    The measurements were done on a dedicated Hetzner Auction box running the latest Ubuntu, with a high-end Intel i9-9900K CPU (8 cores/16 threads, 3.6GHz base, 5.0GHz boost), 64GB RAM, and multiple NVMe drives in RAID (~1.4TB of fast storage in total). For comparison, a lower-end i7 with an HDD was sometimes used.

    To make sure the setup reflected a real user's experience, we ran multiple full IBDs per commit (connecting to real nodes) until block 886'000 with a 5GiB in-memory cache. hyperfine was used to measure the final time (assuming a normal distribution and stabilizing the result via statistical methods), producing reliable results even when individual measurements varied; when hyperfine indicated that the measurements were all over the place, we reran the whole benchmark. To reduce the instability of headers synchronization and peer acquisition, we first started bitcoind until block 1, followed by the actual benchmark until block 886'000.

    The top 2 PRs (https://github.com/bitcoin/bitcoin/pull/31551 and #31144) were measured together by multiple people with different settings (and varying results):

    Also note that there is a separate effort to add a reliable macro-benchmarking suite to track the performance of the most critical use cases end-to-end (including IBD, compact blocks, UTXO iteration) - still WIP, not yet used here.

    Current changes (in order of importance, reviews and reproducers are welcome):

    Plotting the block progress from the produced debug.log files (taken from the last run of each commit, so the totals can differ slightly from the normalized averages shown below) visualizes the effect of each commit:

    import os
    import sys
    import re
    from datetime import datetime
    import matplotlib.pyplot as plt
    import numpy as np


    def process_log_files_and_plot(log_dir, output_file="block_height_progress.png"):
        if not os.path.exists(log_dir) or not os.path.isdir(log_dir):
            print(f"Error: '{log_dir}' is not a valid directory", file=sys.stderr)
            return

        debug_files = [f for f in os.listdir(log_dir) if
                       f.startswith('debug-') and os.path.isfile(os.path.join(log_dir, f))]
        if not debug_files:
            print(f"Warning: No debug files found in '{log_dir}'", file=sys.stderr)
            return

        height_pattern = re.compile(r'UpdateTip:.*height=(\d+)')
        results = {}

        for filename in debug_files:
            filepath = os.path.join(log_dir, filename)
            print(f"Processing {filename}...", file=sys.stderr)

            update_tips = []
            first_timestamp = None
            line_count = tip_count = 0
            found_shutdown_done = False

            try:
                with open(filepath, 'r', errors='ignore') as file:
                    for line_number, line in enumerate(file, 1):
                        line_count += 1
                        if line_count % 100000 == 0:
                            print(f"  Processed {line_count} lines, found {tip_count} UpdateTips...", file=sys.stderr)

                        if not found_shutdown_done:
                            if "Shutdown: done" in line:
                                found_shutdown_done = True
                                print(f"  Found 'Shutdown: done' at line {line_number}, starting to record",
                                      file=sys.stderr)
                            continue

                        if len(line) < 20 or "UpdateTip:" not in line:
                            continue

                        try:
                            timestamp = datetime.strptime(line[:20], "%Y-%m-%dT%H:%M:%SZ")
                            height_match = height_pattern.search(line)
                            if not height_match:
                                continue

                            height = int(height_match.group(1))
                            if first_timestamp is None:
                                first_timestamp = timestamp

                            update_tips.append((int((timestamp - first_timestamp).total_seconds()), height))
                            tip_count += 1
                        except ValueError:
                            continue
            except Exception as e:
                print(f"Error processing {filename}: {e}", file=sys.stderr)
                continue

            print(f"Finished processing {filename}: {line_count} lines, {tip_count} UpdateTips", file=sys.stderr)

            if update_tips:
                time_dict = {}
                for time, height in update_tips:
                    time_dict[time] = height
                results[filename[6:14]] = sorted(time_dict.items())

        if not results:
            print("No valid data found in any files.", file=sys.stderr)
            return

        print(f"Creating plots with data from {len(results)} files", file=sys.stderr)

        sorted_results = []
        for name, pairs in results.items():
            if pairs:
                sorted_results.append((name, pairs[-1][0] / 3600, pairs))

        sorted_results.sort(key=lambda x: x[1], reverse=True)
        colors = plt.cm.tab10(np.linspace(0, 1, len(sorted_results)))

        # Plot 1: Height vs Time
        plt.figure(figsize=(12, 8))

        final_points = []
        for idx, (name, last_time, pairs) in enumerate(sorted_results):
            times = [t / 3600 for t, _ in pairs]
            heights = [h for _, h in pairs]
            plt.plot(heights, times, label=f"{name} ({last_time:.2f}h)", color=colors[idx], linewidth=1)
            if pairs:
                final_points.append((last_time, pairs[-1][1], colors[idx]))

        for time, height, color in final_points:
            plt.axhline(y=time, color=color, linestyle='--', alpha=0.3)
            plt.axvline(x=height, color=color, linestyle='--', alpha=0.3)

        plt.title('Sync Time by Block Height')
        plt.xlabel('Block Height')
        plt.ylabel('Elapsed Time (hours)')
        plt.grid(True, linestyle='--', alpha=0.7)
        plt.legend(loc='center left')
        plt.tight_layout()

        plt.savefig(output_file.replace('.png', '_reversed.png'), dpi=300)

        # Plot 2: Performance Ratio by Time
        if len(sorted_results) > 1:
            plt.figure(figsize=(12, 8))

            baseline = sorted_results[0]
            baseline_time_by_height = {h: t for t, h in baseline[2]}

            for idx, (name, _, pairs) in enumerate(sorted_results[1:], 1):
                time_by_height = {h: t for t, h in pairs}

                common_heights = [h for h in baseline_time_by_height.keys()
                                  if h >= 400000 and h in time_by_height]
                common_heights.sort()

                ratios = []
                base_times = []

                for h in common_heights:
                    base_t = baseline_time_by_height[h]
                    result_t = time_by_height[h]

                    if result_t > 0:
                        ratios.append(base_t / result_t)
                        base_times.append(base_t / 3600)

                plt.plot(base_times, ratios,
                         label=f"{name} vs {baseline[0]}",
                         color=colors[idx], linewidth=1)

            plt.axhline(y=1, color='gray', linestyle='--', alpha=0.7)

            plt.title('Performance Improvement Over Time (Higher is Better)')
            plt.xlabel('Baseline Elapsed Time (hours)')
            plt.ylabel('Speedup Ratio (baseline_time / commit_time)')
            plt.grid(True, linestyle='--', alpha=0.7)
            plt.legend(loc='best')
            plt.tight_layout()

            plt.savefig(output_file.replace('.png', '_time_ratio.png'), dpi=300)

        with open(output_file.replace('.png', '.csv'), 'w') as f:
            for name, _, pairs in sorted_results:
                f.write(f"{name},{','.join(f'{t}:{h}' for t, h in pairs)}\n")

        plt.show()


    if __name__ == "__main__":
        log_dir = sys.argv[1] if len(sys.argv) > 1 else "."
        output_file = sys.argv[2] if len(sys.argv) > 2 else "block_height_progress.png"
        process_log_files_and_plot(log_dir, output_file)
    

    Baseline

    Base commit was 88debb3e42.

    COMPILER=gcc COMMIT=88debb3e4297ef4ebc8966ffe599359bc7b231d0 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     30932.610 s ± 156.891 s    [User: 58248.505 s, System: 2142.974 s]
      Range (min … max):   30821.671 s … 31043.549 s    2 runs
    

    COMPILER=gcc COMMIT=6a8ce46e32dae2ffef2a73d2314ca33a2039186e ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     28501.588 s ± 119.886 s    [User: 56419.060 s, System: 1833.126 s]
      Range (min … max):   28416.815 s … 28586.361 s    2 runs
    

    We can serialize the blocks and undo data to any Stream that implements the appropriate read/write methods. AutoFile is one of these, writing the results “directly” to disk (through the OS file cache). Batching the data in memory first and reading/writing it to disk in one go is measurably faster (likely because of fewer native fread calls or less locking, as observed by @martinus in a similar change).

    Differential flame graphs indicate that the before/after speed change comes from fewer AutoFile reads and writes [flame graphs: writes, reads].
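
    The batching idea, reduced to a minimal sketch (hypothetical helper type, not the PR's actual code): serialize into an in-memory buffer first, then flush it with a single native write instead of issuing one small fwrite per field.

    // Sketch only - Bitcoin Core's real AutoFile/DataStream API differs.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct BufferedWriter {
        std::FILE* file;
        std::vector<unsigned char> buf;

        void Write(const void* data, std::size_t len)
        {
            const auto* p = static_cast<const unsigned char*>(data);
            buf.insert(buf.end(), p, p + len);  // cheap append into memory
        }
        void Flush()
        {
            if (!buf.empty()) {
                std::fwrite(buf.data(), 1, buf.size(), file);  // one native call for the whole batch
                buf.clear();
            }
        }
    };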


    COMPILER=gcc COMMIT=c5cc54d10187c9cb3a6cba8cc10f652b4f882e2a ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     27394.210 s ± 565.877 s    [User: 54902.315 s, System: 1891.951 s]
      Range (min … max):   26994.075 s … 27794.346 s    2 runs
    

    Current block obfuscation is done byte-by-byte; this PR batches it into 64-bit primitives to speed up obfuscating bigger memory chunks. This is especially relevant after #31551, where we end up with bigger obfuscatable chunks.
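
    A minimal sketch of the batched XOR (hypothetical helper; assumes an 8-byte key and a little-endian host, as on x86): the bulk of the buffer is processed in uint64_t chunks, with a byte-wise fallback only for the tail.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <span>

    void XorObfuscate(std::span<std::byte> data, uint64_t key)
    {
        std::size_t i{0};
        for (; i + 8 <= data.size(); i += 8) {        // 8 bytes per iteration instead of 1
            uint64_t chunk;
            std::memcpy(&chunk, data.data() + i, 8);  // memcpy avoids unaligned-access UB
            chunk ^= key;
            std::memcpy(data.data() + i, &chunk, 8);
        }
        for (; i < data.size(); ++i) {                // remaining 0-7 bytes
            data[i] ^= std::byte(key >> (8 * (i % 8)));  // key bytes in little-endian order
        }
    }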

    [flame graph: obfuscation calls during IBD without batching]


    COMPILER=gcc COMMIT=9b4be912d20222b3b275ef056c1494a15ccde3f5 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     27019.086 s ± 112.340 s    [User: 54927.344 s, System: 1652.376 s]
      Range (min … max):   26939.649 s … 27098.522 s    2 runs
    

    When the in-memory UTXO set is flushed to LevelDB (after IBD or an AssumeUTXO load), the write happens in batches to bound memory usage during the flush. While a hidden -dbbatchsize config option exists to modify this value, this PR derives the batch size dynamically from the -dbcache setting. By using larger batches when more memory is available (i.e., higher -dbcache), we can reduce the overhead from numerous small writes, minimize the constant overhead per batch, improve I/O efficiency (especially on HDDs), and potentially allow LevelDB to optimize writes more effectively (e.g. by sorting the keys before the write).
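
    A rough illustration of the dynamic sizing (the fraction, floor, and cap below are assumptions for the sketch, not the PR's actual values):

    #include <algorithm>
    #include <cstdint>

    // Derive the LevelDB write-batch size from the configured cache size:
    // bigger caches get bigger batches, bounded on both ends.
    uint64_t CalcBatchSize(uint64_t dbcache_bytes)
    {
        const uint64_t min_batch{uint64_t{1} << 20};   // 1 MiB floor
        const uint64_t max_batch{uint64_t{64} << 20};  // 64 MiB cap
        return std::clamp(dbcache_bytes / 100, min_batch, max_batch);
    }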

    Note that this PR mainly optimizes a critical section of IBD (memory to disk dump) - even if the effect on overall speed is modest:


    COMPILER=gcc COMMIT=817d7ac0767a3984295aa3cf6c961dcc5f29d571 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     26711.460 s ± 244.118 s    [User: 54654.348 s, System: 1652.087 s]
      Range (min … max):   26538.843 s … 26884.077 s    2 runs
    

    The commits merge similar (de)serialization methods and separate them internally with if constexpr - similar to what was done in #28203. This enabled further SizeComputer optimizations as well.

    Beyond that, since single-byte writes happen very often (for every (u)int8_t, std::byte, or bool, and for every VarInt's first byte, which in turn is needed for every (pre)vector), it makes sense to bypass the generalized serialization infrastructure where it isn't needed.
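
    A simplified sketch of the merging idea (stand-in types, not Bitcoin Core's real serializers): one template handles both real serialization and size computation, and if constexpr lets the size-only path skip the byte work entirely at compile time.

    #include <cstddef>
    #include <cstdint>
    #include <type_traits>
    #include <vector>

    struct SizeComputer { std::size_t size{0}; };

    struct VectorWriter {
        std::vector<uint8_t> out;
        void write(const uint8_t* p, std::size_t n) { out.insert(out.end(), p, p + n); }
    };

    template <typename Stream>
    void SerializeU32(Stream& s, uint32_t v)
    {
        if constexpr (std::is_same_v<Stream, SizeComputer>) {
            s.size += 4;  // size-only path: no bytes are produced at all
        } else {
            const uint8_t bytes[4]{uint8_t(v), uint8_t(v >> 8), uint8_t(v >> 16), uint8_t(v >> 24)};
            s.write(bytes, 4);  // little-endian write on a real stream
        }
    }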


    COMPILER=gcc COMMIT=182745cec4c0baf2f3c8cff2f74f847eac3c4330 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     26326.867 s ± 45.887 s    [User: 54367.156 s, System: 1619.348 s]
      Range (min … max):   26294.420 s … 26359.314 s    2 runs
    

    CheckBlock’s latency is critical for efficiently validating correct inputs during transaction validation, including mempool acceptance and new block creation.

    This PR improves performance and maintainability by introducing the following changes:

    • Simplified checks for the most common cases (1 or 2 inputs - 70-90% of transactions have a single input).
    • Optimized the general case by replacing std::set with a sorted std::vector for improved locality (see the sketch after this list).
    • Simplified Null prevout checks from linear to constant time.
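
    A hedged sketch of the sorted-vector approach from the list above (simplified stand-in types - the real code operates on COutPoint prevouts):

    #include <algorithm>
    #include <compare>
    #include <cstdint>
    #include <vector>

    struct OutPoint {
        uint64_t hash;  // stand-in for the 256-bit txid
        uint32_t n;
        auto operator<=>(const OutPoint&) const = default;
    };

    bool HasDuplicateOrNullPrevout(const std::vector<OutPoint>& prevouts)
    {
        std::vector<OutPoint> sorted{prevouts};  // contiguous memory, cache-friendly
        std::sort(sorted.begin(), sorted.end());
        if (std::adjacent_find(sorted.begin(), sorted.end()) != sorted.end()) {
            return true;  // duplicates end up adjacent after sorting
        }
        // A real txid is never all-zero, so a null prevout (hash == 0) sorts to the
        // front - the null check becomes constant time instead of a linear scan.
        return !sorted.empty() && sorted.front().hash == 0;
    }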

    COMPILER=gcc COMMIT=47d377bd0bb88dae6b34553a7789400170e0ccf6 ./build/bin/bitcoind -datadir=/mnt/my_storage/BitcoinData -stopatheight=886000 -dbcache=5000 -blocksonly -printtoconsole=0
      Time (mean ± σ):     26084.429 s ± 473.611 s    [User: 54310.780 s, System: 1815.967 s]
      Range (min … max):   25749.536 s … 26419.323 s    2 runs
    

    The in-memory representation of the UTXO set uses (salted) SipHash for avoiding key collision attacks.

    Hashing a uint256 key happens so often that a specialized overload, SipHashUint256Extra, was extracted for it. The constant salting operations were already hoisted out in the general case; this PR adjusts the main specialization similarly.
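
    A sketch of the caching idea (simplified - the actual class in the commits below is Uint256ExtraSipHasher): precompute the SipHash initial state v0-v3 from the salt once in the constructor, so each call only performs the message mixing. The magic numbers are SipHash's standard initialization constants.

    #include <cstdint>

    class SaltedHasher
    {
        uint64_t m_v0, m_v1, m_v2, m_v3;  // initial state, cached at construction

    public:
        SaltedHasher(uint64_t k0, uint64_t k1)
            : m_v0{0x736f6d6570736575ULL ^ k0},
              m_v1{0x646f72616e646f6dULL ^ k1},
              m_v2{0x6c7967656e657261ULL ^ k0},
              m_v3{0x7465646279746573ULL ^ k1}
        {}

        uint64_t Hash(uint64_t word) const
        {
            // Start each call from the cached words instead of redoing the salt XORs.
            uint64_t v0{m_v0}, v1{m_v1}, v2{m_v2}, v3{m_v3};
            v3 ^= word;  // first message injection
            // ... SipRound compression and finalization elided for brevity ...
            return v0 ^ v1 ^ v2 ^ v3;
        }
    };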


    The current prevector size of 28 bytes (chosen to fill the sizeof(CScript) aligned size) was introduced in 2015 (https://github.com/bitcoin/bitcoin/pull/6914) before SegWit and TapRoot. However, the increasingly common P2WSH and P2TR scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.

    The core trade-off of this change is to eliminate heap allocations for common 29-36 byte scripts at the cost of increasing the base memory footprint of all CScript objects by 8 bytes (while still respecting peak memory usage defined by -dbcache).
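
    A rough model of the alignment effect (prevector's real layout differs, but it uses the same direct/indirect union idea): because the heap-pointer alternative forces 8-byte alignment, the overall object size moves in 8-byte steps, which is why an inline capacity of 36 costs no more than 34.

    #include <cstdint>
    #include <cstdio>

    template <unsigned N>
    union ScriptStorage {
        uint8_t direct[N];   // inline storage for scripts of up to N bytes
        struct {
            uint8_t* data;   // heap fallback for larger scripts
            uint32_t capacity;
        } indirect;
    };

    int main()
    {
        std::printf("capacity 28: %zu bytes\n", sizeof(ScriptStorage<28>));  // 32 in this model
        std::printf("capacity 34: %zu bytes\n", sizeof(ScriptStorage<34>));  // 40
        std::printf("capacity 36: %zu bytes\n", sizeof(ScriptStorage<36>));  // 40 - same as 34
    }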



    Other similar efforts waiting for review or revival (not included in this tracking PR):

    • #31132 - pre-warms the in-memory cache on multiple threads (10% IBD speedup for small in-memory caches)
    • #30611 - for very big in-memory caches make sure we still flush to disk regularly (no significant IBD speed change)
    • #28945 - was meant to preallocate the memory of recreated caches (~6% IBD speedup for small caches)
    • #31102 - was meant to try to evict entries selectively instead of dropping the whole cache when full
    • #32128 - draft PR showcasing a few other possible caching speedups

    This PR is meant to stay in draft (not meant to be merged directly), to continually change based on comments received here and in the PRs. Comments, reproducers and high-level discussions are welcome here - code reviews should rather be done in the individual PRs.

  2. DrahtBot commented at 4:20 pm on March 12, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32043.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK jonatack

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #32457 (bench: replace benchmark block with more representative one (413567 → 784588) by l0rinc)
    • #32296 (refactor: reenable implicit-integer-sign-change check for serialize.h by l0rinc)
    • #32279 ([IBD] prevector: store P2WSH/P2TR/P2PK scripts inline by l0rinc)
    • #32128 (Draft: CCoinMap Experiments by martinus)
    • #31868 ([IBD] specialize block serialization by l0rinc)
    • #31860 (init: Take lock on blocks directory in BlockManager ctor by TheCharlatan)
    • #31682 ([IBD] specialize CheckBlock’s input & coinbase checks by l0rinc)
    • #31144 ([IBD] multi-byte block obfuscation by l0rinc)
    • #29641 (scripted-diff: Use LogInfo over LogPrintf [WIP, NOMERGE, DRAFT] by maflcko)
    • #29307 (util: explicitly close all AutoFiles that have been written by vasild)
    • #28531 (improve MallocUsage() accuracy by LarryRuane)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] - Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  4. l0rinc renamed this:
    [IBD] - Tracking PR for speeding up Initial Block Download
    [IBD] Tracking PR for speeding up Initial Block Download
    on Mar 12, 2025
  5. ryanofsky commented at 5:14 pm on March 12, 2025: contributor

    Thanks for creating this. This should make it easier to navigate the other PRs and discuss the overall topic of IBD performance and benchmarking without needing to necessarily repeat it in the individual PRs.

    Would be useful to have concept ACKs/NACKs here from others who know more about performance and benchmarking. But from what I can tell, the individual optimizations do not seem very complicated and seem like they should be justified.


    One suggestion for the PR description above would be to directly link to the PRs comprising this change in the summary, maybe pointing out any where review should be focused. Current list seems to be:

  6. laanwj added the label Block storage on Mar 12, 2025
  7. laanwj added the label P2P on Mar 12, 2025
  8. l0rinc force-pushed on Mar 12, 2025
  9. DrahtBot added the label CI failed on Mar 12, 2025
  10. DrahtBot commented at 6:16 pm on March 12, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/38650495272

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  11. l0rinc force-pushed on Mar 12, 2025
  12. DrahtBot removed the label CI failed on Mar 12, 2025
  13. ajtowns commented at 6:18 am on March 13, 2025: contributor

    Plotting the performance of the blocks from the produced debug.log files shows (from the last run, can differ slightly from the normalized average shown below) the effect of each commit:

    Wouldn’t these plots be easier to read with block height on the x-axis and time on the y-axis, giving a consistent domain (each case goes from height 0 to 880k or so) with a simple “lower is better” comparison? (rather than “further left is better”)

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful? (Perhaps limited to height >= 400000)

  14. l0rinc commented at 9:16 am on March 13, 2025: contributor

    with a simple “lower is better” comparison

    I like the idea, updated the description and the code: [image]

    Plotting the difference between the various proposed commits and the baseline (time_baseline[height] / time_commit_X[height], higher is better) might also be helpful?

    Something like this? Conceptually it seems useful, but here I don't know how to interpret it - it seems too far zoomed in. It looks even funnier without the >400k cap:

  15. ajtowns commented at 12:45 pm on March 13, 2025: contributor

    Something like this? Conceptually seems useful, but here I don’t know how to interpret it, seems too far zoomed in

    Fair; that might work better with a time on the x-axis rather than height, something like:

    for time_commit in [time_commit_X, time_commit_Y, time_commit_Z]:
        for height in range(1,850000):
            y = time_baseline[height] / time_commit[height] # keep this one the same
            x = time_baseline[height]  # changed from x = height
            add_datapoint(x,y)
    
  16. l0rinc commented at 1:08 pm on March 13, 2025: contributor

    time on the x-axis rather than height

    If we do that we don’t even need to filter out the first 400k blocks since they’re so insignificant. Edit: you’re right, it looks better to filter those out - I’ve updated the description with the images and code.

  17. jonatack commented at 3:52 pm on March 13, 2025: member

    Concept ACK, thank you for opening this.

    I am currently in an environment with slow internet where, despite having a relatively fast laptop, IBD is more than 2 orders of magnitude slower than the times in the OP.

    Opened #32051 today to address an issue I’m also seeing of very frequent disconnections+reconnections of trusted addnode peers during IBD.

  18. DrahtBot added the label Needs rebase on Mar 20, 2025
  19. l0rinc force-pushed on Apr 8, 2025
  20. l0rinc commented at 11:15 pm on April 8, 2025: contributor

    Updated the tracking PR (+ general rebase) with the latest changes from:

  21. DrahtBot removed the label Needs rebase on Apr 9, 2025
  22. DrahtBot commented at 0:38 am on April 9, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40213922055

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  23. DrahtBot added the label CI failed on Apr 9, 2025
  24. l0rinc force-pushed on Apr 9, 2025
  25. DrahtBot removed the label CI failed on Apr 9, 2025
  26. l0rinc force-pushed on Apr 13, 2025
  27. l0rinc commented at 9:01 pm on April 15, 2025: contributor
    Added #32279 to the collection
  28. achow101 referenced this in commit 33df4aebae on Apr 16, 2025
  29. DrahtBot added the label Needs rebase on Apr 16, 2025
  30. test: Compare util::Xor with randomized inputs against simple impl
    Since production code only uses keys of length 8, we're not testing with other values anymore
    3d203c2acf
  31. bench: Make XorObfuscationBench more representative
    Since another PR solves the tiny byte-sized xors during serialization, we're only concentrating on big contiguous chunks now.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/bin/bench_bitcoin -filter='XorObfuscationBench' -min-time=10000
    
    C++ compiler .......................... AppleClang 17.0.0.17000013
    
    |              ns/MiB |               MiB/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          731,927.62 |            1,366.26 |    0.2% |     10.67 | `XorObfuscationBench`
    
    C++ compiler .......................... GNU 13.3.0
    
    |              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          941,015.26 |            1,062.68 |    0.0% |    9,437,186.97 |    3,378,911.52 |  2.793 |   1,048,577.15 |    0.0% |     10.99 | `XorObfuscationBench`
    5d5f3d06dd
  32. refactor: prepare dbwrapper for obfuscation key change
    Since `CDBWrapper::Read` will still work with vectors, we won't be able to use the obfuscation key field to read into it directly.
    This commit cleans up this part of the code: it makes explicit that writing `obfuscate_key` is needed (the following methods use it implicitly), simplifies the `if (!key_exists)` condition by extracting the negation into the name of the boolean, and inlines the single-use `CreateObfuscateKey`, which would otherwise just complicate the transition.
    8e6e0acd36
  33. refactor: prepare mempool_persist for obfuscation key change e50732d25f
  34. optimization: Migrate fixed-size obfuscation end-to-end from `std::vector<std::byte>` to `uint64_t`
    Since `util::Xor` accepts `uint64_t` values, we're eliminating any repeated vector-to-uint64_t conversions going back to the loading/saving of these values (we're still serializing them as vectors, but converting as soon as possible to `uint64_t`). This is the reason the tests still generate vector values and convert to `uint64_t` later instead of generating it directly.
    
    We also short-circuit `Xor` calls early when the key is 0 to avoid unnecessary calculations (e.g. `MakeWritableByteSpan`) - even under the assumption that XOR is never actually called with a 0 key.
    
    >  cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build -j$(nproc) \
    && build/bin/bench_bitcoin -filter='XorObfuscationBench' -min-time=10000
    
    C++ compiler .......................... AppleClang 17.0.0.17000013
    
    |              ns/MiB |               MiB/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |           14,730.40 |           67,886.80 |    0.1% |     11.01 | `XorObfuscationBench`
    
    C++ compiler .......................... GNU 13.3.0
    
    |              ns/MiB |               MiB/s |    err% |         ins/MiB |         cyc/MiB |    IPC |        bra/MiB |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           51,187.17 |           19,536.15 |    0.0% |      327,683.95 |      183,747.58 |  1.783 |      65,536.55 |    0.0% |     11.00 | `XorObfuscationBench`
    
    ----
    
    A few other benchmarks that seem to have improved as well (tested with Clang only):
    Before:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        2,202,618.49 |              454.01 |    0.2% |     11.01 | `ReadBlockBench`
    |          734,444.92 |            1,361.57 |    0.3% |     10.66 | `ReadRawBlockBench`
    
    After:
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        1,912,308.06 |              522.93 |    0.4% |     10.98 | `ReadBlockBench`
    |           49,092.93 |           20,369.53 |    0.2% |     10.99 | `ReadRawBlockBench`
    c5e866b190
  35. bench: Add COutPoint and SaltedOutpointHasher benchmarks
    This commit introduces new benchmarks to measure the performance of various operations using
    SaltedOutpointHasher, including hash computation, set operations, and set creation.
    
    These benchmarks are intended to provide insights about coin caching performance (e.g. during IBD).
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench.*' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               58.60 |       17,065,922.04 |    0.3% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.97 |       83,576,684.83 |    0.1% |     11.01 | `SaltedOutpointHasherBench_hash`
    |               14.50 |       68,985,850.12 |    0.3% |     10.96 | `SaltedOutpointHasherBench_match`
    |               13.90 |       71,942,033.47 |    0.4% |     11.03 | `SaltedOutpointHasherBench_mismatch`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              136.76 |        7,312,133.16 |    0.0% |        1,086.67 |          491.12 |  2.213 |         119.54 |    1.1% |     11.01 | `SaltedOutpointHasherBench_create_set`
    |               23.82 |       41,978,882.62 |    0.0% |          252.01 |           85.57 |  2.945 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               60.42 |       16,549,695.42 |    0.1% |          460.51 |          217.04 |  2.122 |          21.00 |    1.4% |     10.99 | `SaltedOutpointHasherBench_match`
    |               78.66 |       12,713,595.35 |    0.1% |          555.59 |          282.52 |  1.967 |          20.19 |    2.2% |     10.74 | `SaltedOutpointHasherBench_mismatch`
    c497ca6e91
  36. test: Rename k1/k2 to k0/k1 for consistency ae87260d29
  37. refactor: Extract C0-C3 Siphash constants 155ba7c349
  38. optimization: refactor: Introduce Uint256ExtraSipHasher to cache SipHash constant state
    Previously, only k0 and k1 were stored, causing the constant xor operations to be recomputed in every call to `SipHashUint256Extra`.
    This commit adds a dedicated `Uint256ExtraSipHasher` class that caches the initial state (v0-v3) and performs the `SipHash` computation on a `uint256` (with an extra parameter), hiding the constant computation details from higher-level code and improving efficiency.
    This basically brings the precalculations in the `CSipHasher` constructor to the `uint256` specialized SipHash implementation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SaltedOutpointHasherBench.*' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |               57.27 |       17,462,299.19 |    0.1% |     11.02 | `SaltedOutpointHasherBench_create_set`
    |               11.24 |       88,997,888.48 |    0.3% |     11.04 | `SaltedOutpointHasherBench_hash`
    |               13.91 |       71,902,014.20 |    0.2% |     11.01 | `SaltedOutpointHasherBench_match`
    |               13.29 |       75,230,390.31 |    0.1% |     11.00 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set - 17,462,299.19/17,065,922.04 - 2.3% faster
    hash       - 88,997,888.48/83,576,684.83 - 6.4% faster
    match      - 71,902,014.20/68,985,850.12 - 4.2% faster
    mismatch   - 75,230,390.31/71,942,033.47 - 4.5% faster
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |              135.38 |        7,386,349.49 |    0.0% |        1,078.19 |          486.16 |  2.218 |         119.56 |    1.1% |     11.00 | `SaltedOutpointHasherBench_create_set`
    |               23.67 |       42,254,558.08 |    0.0% |          247.01 |           85.01 |  2.906 |           4.00 |    0.0% |     11.00 | `SaltedOutpointHasherBench_hash`
    |               58.95 |       16,962,220.14 |    0.1% |          446.55 |          211.74 |  2.109 |          20.86 |    1.4% |     11.01 | `SaltedOutpointHasherBench_match`
    |               76.98 |       12,991,047.69 |    0.1% |          548.93 |          276.50 |  1.985 |          20.25 |    2.3% |     10.72 | `SaltedOutpointHasherBench_mismatch`
    
    compared to master:
    create_set -  7,386,349.49/7,312,133.16  - 1% faster
    hash       - 42,254,558.08/41,978,882.62 - 0.6% faster
    match      - 16,962,220.14/16,549,695.42 - 2.4% faster
    mismatch   - 12,991,047.69/12,713,595.35 - 2% faster
    
    Co-authored-by: sipa <pieter@wuille.net>
    73cfebb08b
  39. bench: measure block (size)serialization speed
    The SizeComputer is a special serializer which returns the exact final size of the serialized content.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          195,610.62 |            5,112.20 |    0.3% |     11.00 | `SerializeBlock`
    |           12,061.83 |           82,906.19 |    0.1% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          867,857.55 |            1,152.26 |    0.0% |    8,015,883.90 |    3,116,099.08 |  2.572 |   1,517,035.87 |    0.5% |     10.81 | `SerializeBlock`
    |           30,928.27 |           32,332.88 |    0.0% |      221,683.03 |      111,055.84 |  1.996 |      53,037.03 |    0.8% |     11.03 | `SizeComputerBlock`
    3d7c8ae9fb
  40. cleanup: remove unused `ser_writedata16be` and `ser_readdata16be` 9f15d4da35
  41. refactor: reduce template bloat in primitive serialization
    Merged multiple template methods into a single constexpr-delimited implementation to reduce template bloat (i.e. related functionality is grouped into a single method, but can still be optimized thanks to C++20 constexpr conditions).
    This unifies related methods that were previously bound only by similar signatures - and enables the `SizeComputer` optimizations later.
    5559eb68a9
  42. refactor: add explicit static extent to spans 8f71de5f8f
  43. optimization: merge SizeComputer specializations + add new ones
    Endianness doesn't affect the final size, so we can skip the byte swapping for `SizeComputer`.
    With `if constexpr` we can route previous calls into the existing method, short-circuiting the logic when we only need the serialized sizes.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          191,652.29 |            5,217.78 |    0.4% |     10.96 | `SerializeBlock`
    |           10,323.55 |           96,865.92 |    0.2% |     11.01 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          614,847.32 |            1,626.42 |    0.0% |    8,015,883.64 |    2,207,628.07 |  3.631 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           26,020.31 |           38,431.52 |    0.0% |      159,390.03 |       93,438.33 |  1.706 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    2bf6c56cab
  44. optimization: add single byte writes
    Single byte writes are used very often (used for every (u)int8_t or std::byte or bool and for every VarInt's first byte which is also needed for every (pre)Vector).
    It makes sense to avoid the generalized serialization infrastructure that isn't needed:
    * AutoFile write doesn't need to allocate 4k buffer for a single byte now;
    * `VectorWriter` and `DataStream` avoids memcpy/insert calls.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/bin/bench_bitcoin -filter='SizeComputerBlock|SerializeBlock' --min-time=10000
    
    > C compiler ............................ AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          174,569.19 |            5,728.39 |    0.6% |     10.89 | `SerializeBlock`
    |           10,241.16 |           97,645.21 |    0.0% |     11.00 | `SizeComputerBlock`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          615,000.56 |            1,626.01 |    0.0% |    8,015,883.64 |    2,208,340.88 |  3.630 |   1,517,035.62 |    0.5% |     10.56 | `SerializeBlock`
    |           25,676.76 |           38,945.72 |    0.0% |      159,390.03 |       92,202.10 |  1.729 |      42,131.03 |    0.9% |     11.00 | `SizeComputerBlock`
    6269067bc3
  45. test: validate duplicate detection in `CheckTransaction`
    The `CheckTransaction` validation function in https://github.com/bitcoin/bitcoin/blob/master/src/consensus/tx_check.cpp#L41-L45 relies on a correct ordering relation for detecting duplicate transaction inputs.
    
    This update to the tests ensures that:
    * Accurate detection of duplicates: Beyond trivial cases (e.g., two identical inputs), duplicates are detected correctly in more complex scenarios.
    * Consistency across methods: Both sorted sets and hash-based sets behave identically when detecting duplicates for `COutPoint` and related values.
    * Robust ordering and equality relations: The function maintains expected behavior for ordering and equality checks.
    
    Using randomized testing with shuffled inputs (to avoid any bias introduced by input ordering), the enhanced test validates that `CheckTransaction` remains robust and reliable across various input configurations. It confirms identical behavior to a hashing-based duplicate detection mechanism, ensuring consistency and correctness.
    
    To make sure the new branches in the follow-up commits are covered, `basic_transaction_tests` was extended with a randomized test comparing against the old implementation (and also against an alternative duplicate-detection approach). The iterations and ranges were chosen such that every new branch is expected to be hit at least once.
    c15d130752
  46. bench: measure `CheckBlock` speed separately from serialization
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          372,743.63 |            2,682.81 |    1.1% |     10.99 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,304,694.54 |              302.60 |    0.5% |     11.05 | `DuplicateInputs`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        1,096,261.84 |              912.19 |    0.1% |    7,963,390.88 |    3,487,375.26 |  2.283 |   1,266,941.00 |    1.8% |     11.03 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        8,366,309.48 |              119.53 |    0.0% |   23,865,177.67 |   26,620,160.23 |  0.897 |   5,972,887.41 |    4.0% |     10.78 | `DuplicateInputs`
    bf580d7b4e
  47. bench: add `ProcessTransactionBench` to measure `CheckBlock` in context
    The newly introduced `ProcessTransactionBench` incorporates multiple steps in the validation pipeline, offering a more comprehensive view of `CheckBlock` performance within a realistic transaction validation context.
    
    Previous microbenchmarks, such as DeserializeAndCheckBlockTest and DuplicateInputs, focused on isolated aspects of transaction and block validation. While these tests provided valuable insights for targeted profiling, they lacked context regarding the broader validation process, where interactions between components play a critical role.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |            9,585.10 |          104,328.55 |    0.1% |     11.03 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |           56,199.57 |           17,793.73 |    0.1% |      229,263.01 |      178,766.31 |  1.282 |      15,509.97 |    0.5% |     10.91 | `ProcessTransactionBench`
    a300325c5b
  48. optimization: move duplicate checks outside of coinbase branch
    `IsCoinBase` means single input with NULL prevout, so it makes sense to restrict duplicate check to non-coinbase transactions only.
    The behavior is the same as before, except that single-input-transactions aren't checked for duplicates anymore (~70-90% of the cases, see https://transactionfee.info/charts/transactions-1in).
    I've added braces to the conditions and loops to simplify review of followup commits.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          335,917.12 |            2,976.92 |    1.3% |     11.01 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,286,337.42 |              304.29 |    1.1% |     10.90 | `DuplicateInputs`
    |            9,561.02 |          104,591.35 |    0.2% |     11.02 | `ProcessTransactionBench`
    7b576a440d
  49. optimization: simplify duplicate checks for trivial inputs
    No need to create a set for checking duplicates for two-input-transactions.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          314,137.30 |            3,183.32 |    1.2% |     11.04 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |        3,220,592.73 |              310.50 |    1.3% |     10.92 | `DuplicateInputs`
    |            9,425.98 |          106,089.77 |    0.3% |     11.00 | `ProcessTransactionBench`
    765b71b90b
  50. optimization: replace tree with sorted vector
    A pre-sized vector retains locality (enabling SIMD operations), speeding up sorting and equality checks.
    It's also simpler (therefore more reliable) than a sorted set. It also causes less memory fragmentation.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          181,922.54 |            5,496.85 |    0.2% |     10.98 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          997,739.30 |            1,002.27 |    1.0% |     10.94 | `DuplicateInputs`
    |            9,449.28 |          105,828.15 |    0.3% |     10.99 | `ProcessTransactionBench`
    
    Co-authored-by: Pieter Wuille <pieter@wuille.net>
    cb8c012b87
  51. optimization: look for NULL prevouts in the sorted values
    For the 2 input case we simply check them both, like we did with equality.
    
    For the general case, we take advantage of sorting, making invalid value detection constant time instead of linear in the worst case.
    
    > cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release && cmake --build build -j$(nproc) && build/src/bench/bench_bitcoin -filter='CheckBlockBench|DuplicateInputs|ProcessTransactionBench' -min-time=10000
    
    > C++ compiler .......................... AppleClang 16.0.0.16000026
    
    |            ns/block |             block/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          179,971.00 |            5,556.45 |    0.3% |     11.02 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |          963,177.98 |            1,038.23 |    1.7% |     10.92 | `DuplicateInputs`
    |            9,410.90 |          106,259.75 |    0.3% |     11.01 | `ProcessTransactionBench`
    
    > C++ compiler .......................... GNU 13.3.0
    
    |            ns/block |             block/s |    err% |       ins/block |       cyc/block |    IPC |      bra/block |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |          834,855.94 |            1,197.81 |    0.0% |    6,518,548.86 |    2,656,039.78 |  2.454 |     919,160.84 |    1.5% |     10.78 | `CheckBlockBench`
    
    |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |        4,261,492.75 |              234.66 |    0.0% |   17,379,823.40 |   13,559,793.33 |  1.282 |   4,265,714.28 |    3.4% |     11.00 | `DuplicateInputs`
    |           55,819.53 |           17,914.88 |    0.1% |      227,828.15 |      177,520.09 |  1.283 |      15,184.36 |    0.4% |     10.91 | `ProcessTransactionBench`
    f07c79f034
  52. test: assert CScript allocation characteristics
    Verifies that script types are correctly allocated using prevector's direct (stack) or indirect (heap) storage based on their size:
    
    Direct (stack) allocated script types (size ≤ 28 bytes):
    * OP_RETURN (small)
    * P2WPKH
    * P2SH
    * P2PKH
    
    Indirect (heap) allocated script types (size > 28 bytes):
    * P2WSH
    * P2TR
    * P2PK
    * MULTISIG (small)
    
    This test provides a baseline for verifying changes to prevector's inline capacity.
    b5dc42874d
  53. Allocate `P2WSH`/`P2TR`/`P2PK` scripts on stack
    The current `prevector` size of 28 bytes (chosen to fill the `sizeof(CScript)` aligned size) was introduced in 2015 (https://github.com/bitcoin/bitcoin/pull/6914) before SegWit and TapRoot.
    However, the increasingly common `P2WSH` and `P2TR` scripts are both 34 bytes, and are forced to use heap (re)allocation rather than efficient inline storage.
    
    The core trade-off of this change is to eliminate heap allocations for common 34-36 byte scripts at the cost of increasing the base memory footprint of all `CScript` objects by 8 bytes (while still respecting peak memory usage defined by `-dbcache`).
    
    Increasing the `prevector` size allows these scripts to be stored on the stack, avoiding heap allocations, reducing potential memory fragmentation, and improving performance during cache flushes. Massif analysis confirms a lower stable memory usage after flushing, suggesting the elimination of heap allocations outweighs the larger base size for common workloads.
    
    Due to memory alignment, increasing the `prevector` size to 36 bytes doesn't change the overall `sizeof(CScript)` compared to an increase to 34 bytes, allowing us to include `P2PK` scripts as well at no additional memory cost.
    
    Performance benchmarks for AssumeUTXO load and flush show:
    - Small dbcache (450MB): ~1% performance penalty due to more frequent flushes
    - Large dbcache (4500MB+): ~6-7% performance improvement due to fewer heap allocations
    
    Full IBD and reindex-chainstate with larger `dbcache` values also show an overall ~3% speedup.
    
    Co-authored-by: Ava Chow <github@achow101.com>
    Co-authored-by: Andrew Toth <andrewstoth@gmail.com>
    b6b4235c14
  54. l0rinc force-pushed on Apr 17, 2025
  55. DrahtBot removed the label Needs rebase on Apr 17, 2025
