[RFC] cluster_linearize: add O(N) fast path for chain-shaped clusters #34643

pull HowHsu wants to merge 5 commits into bitcoin:master from HowHsu:chain_linearize changing 6 files +401 −13
  1. HowHsu commented at 5:11 am on February 21, 2026: none

    This PR relies on PR #34625

    Summary

    This PR adds TryLinearizeChain(), an O(N) specialised linearizer for chain-shaped clusters, and a benchmark targeting the feerate distribution that triggers the worst-case behaviour of the general SFL algorithm on chains.

    Commits:

    • ad15466be1e4 cluster_linearize: add O(N) fast path for chain-shaped clusters
    • 72bf0faf820c cluster_linearize: benchmark Linearize on worst-case chain feerate distribution

    Motivation

    Instrumentation added to a running node revealed that the vast majority of clusters seen in the mempool are chain-shaped — each transaction spends an output of the previous one. This is consistent with widespread use of CPFP (Child-Pays-For-Parent) fee bumping, where a low-feerate parent transaction is paired with a high-feerate child that spends one of its outputs, forming a two-transaction chain; longer chains arise from repeated CPFP stacking or batch-payment patterns.

    Because chain clusters dominate in practice, a topology-specific O(N) fast path has disproportionately large impact compared with optimising for rarer topologies.

    The general Linearize() function uses the Spanning Forest Linearization (SFL) algorithm, whose MakeTopological() phase merges chunks iteratively. For a chain cluster with N transactions and strictly increasing individual feerates (fee_i = i+1), MakeTopological triggers a cascade that merges the entire chain into one chunk step-by-step. At merge step k, two operations each scan the growing top-part chunk of size k:

    • UpdateChunk iterates over all k transactions to write their chunk_rep (unconditionally, regardless of any inner-loop conditions).
    • MergeChunks iterates over all k transactions to locate the dependency connecting top and bottom chunks.

    This accumulates O(1+2+…+(N-1)) = O(N²) loop iterations across the full cascade. An O(N) specialised path avoids the SFL overhead entirely for chain-shaped clusters.


    Algorithm

    Detection (TryLinearizeChain)

    A cluster with N transactions is a chain if and only if its ancestor-set sizes form exactly {1, 2, …, N}. This is checked in O(N) using a boolean presence array;
    Ancestors(i).Count() is O(1) for the fixed-size BitSet<64> used here.
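
    As a concrete illustration, here is a standalone sketch of the presence-array check (not the PR's code: `IsChainShaped` and its plain-array input are hypothetical stand-ins for the DepGraph API):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the detection idea: a cluster of N transactions is a chain iff its
// ancestor-set sizes (each counting the transaction itself) are exactly
// {1, 2, ..., N}. anc_count[i] stands in for depgraph.Ancestors(i).Count().
bool IsChainShaped(const std::vector<std::size_t>& anc_count)
{
    const std::size_t n = anc_count.size();
    std::vector<bool> seen(n + 1, false);
    for (std::size_t c : anc_count) {
        if (c < 1 || c > n || seen[c]) return false; // out of range or duplicate
        seen[c] = true;
    }
    return true; // n distinct values in [1, n] must be exactly {1, ..., n}
}
```

    For example, a 3-transaction chain has ancestor counts {1, 2, 3} in some order, while a diamond (one parent, two middle children, one grandchild) has counts {1, 2, 2, 4} and is rejected at the duplicated 2.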

    Topological order

    The node with k ancestors occupies position k−1 in the unique topological order, which is recovered in O(N) by a single mapping pass.
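
    Under the same standalone conventions (hypothetical names, plain arrays in place of the DepGraph API), the mapping pass can be sketched as:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the recovery pass. Precondition: anc_count has already passed the
// chain check, i.e. it is a permutation of {1, ..., N}. The node with k
// ancestors (including itself) must occupy position k-1 of the unique
// topological order, so a single pass writes each index straight into its slot.
std::vector<std::size_t> ChainTopoOrder(const std::vector<std::size_t>& anc_count)
{
    std::vector<std::size_t> order(anc_count.size());
    for (std::size_t i = 0; i < anc_count.size(); ++i) {
        order[anc_count[i] - 1] = i; // node i has anc_count[i] ancestors
    }
    return order;
}
```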

    Why PostLinearize is not needed

    The topological order of a chain cluster is unique, and every adjacent pair has a dependency, so ChunkLinearizationInfo’s unconditional merge and PostLinearize’s merge/swap behave identically (the swap branch never triggers). The topological order itself is the optimal linearization, and PostLinearize can be safely skipped.

    The result is returned early from Linearize() before the SFL machinery is initialised:

    // In Linearize():
    if (auto chain_lin = TryLinearizeChain(depgraph); !chain_lin.empty()) {
        return {std::move(chain_lin), /*optimal=*/true, /*cost=*/0, /*is_chain=*/true};
    }
    

    The fourth return value (is_chain) allows the caller (Relinearize()) to skip the PostLinearize() pass, which is redundant for chains: TryLinearizeChain already produces an optimal, connected linearization.


    Benchmark results

    LinearizeOptimallyMonotoneChainTotal — increasing feerates (SFL worst case)

    MakeMonotoneChainGraph builds a chain with fee_i = i+1, size_i = 1. For the SFL algorithm, every prefix of the chain has feerate (k+1)/2 (strictly increasing with length), so MakeTopological must merge the entire chain into a single chunk through N−1 steps. Each step scans the growing top-part chunk (via UpdateChunk’s unconditional chunk_rep write loop and MergeChunks’s dep-search loop), accumulating O(N²) iterations in total. TryLinearizeChain eliminates this entirely.
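
    The prefix-feerate claim can be checked in a few lines (an illustrative helper, not part of the PR): with fee_i = i+1 and size_i = 1, the first k transactions have total fee 1+2+…+k = k(k+1)/2 over total size k, i.e. feerate (k+1)/2, strictly increasing in k.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the prefix feerates of the monotone chain (fee_i = i+1, size_i = 1);
// each prefix of length k should come out to (k+1)/2.
std::vector<double> MonotoneChainPrefixFeerates(int n)
{
    std::vector<double> rates;
    std::int64_t fee_sum = 0;
    for (int i = 0; i < n; ++i) {
        fee_sum += i + 1;                           // fee of tx i
        rates.push_back(double(fee_sum) / (i + 1)); // total fee / total size
    }
    return rates;
}
```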

    Size Before (ns/op) After (ns/op) Speedup
    9 tx 973 48 20×
    16 tx 1,772 72 25×
    32 tx 3,880 129 30×
    48 tx 6,205 185 34×
    63 tx 8,551 238 36×

    Instructions per call after optimisation:

    N Total instructions Instructions/tx
    9 954 106
    16 1,360 85
    32 2,295 72
    48 3,230 67
    63 4,100 65

    The “after” timings confirm that TryLinearizeChain runs in O(N) regardless of feerate distribution (~65–106 instructions per transaction at N=9..63, converging toward a constant as N grows). The speedup grows with N because the baseline trends towards O(N²) while the optimised path remains O(N).

  2. cluster_linearize: add chain cluster benchmarks for PostLinearize and Linearize
    Add MakeChainGraph() to build deterministic random chain dependency graphs
    (tx0->tx1->...->tx_{n-1}) with randomized feerates, and two sets of
    chain-specific benchmarks:
    
    - PostLinearizeChain{16,32,48,64,75,99}Tx: measure PostLinearize throughput
      on chain-shaped clusters across a range of sizes.
    - LinearizeOptimallyChainTotal_{9,16,32,48,63}tx: measure the total time for
      Linearize to reach an optimal result on chain-shaped clusters.
    
    Chain clusters are a common real-world mempool pattern (each transaction
    spending an output of the previous one). Dedicated benchmarks make it easy
    to observe the effect of topology-specific linearization optimizations.
    386f6084cc
  3. cluster_linearize: test PostLinearize optimality on random tree-shaped clusters
    Add MakeRandomTreeGraph() which generates random tree-shaped dependency graphs
    in two orientations:
    - Upright tree: each transaction i picks a random parent from [0, i), giving
      a DAG where every node has at most one direct parent.
    - Reverse tree: each transaction i picks a random child from [0, i), giving
      a DAG where every node has at most one direct child.
    
    Add postlinearize_tree_optimal_tests which verifies, for random trees of
    various sizes (5..30 transactions), that PostLinearize matches the quality of
    the exhaustive Linearize() result, as guaranteed by PostLinearize's interface.
    Also verifies idempotency: a second PostLinearize pass produces no change.
    e25f63ba6e
  4. cluster_linearize: test PostLinearize optimality on random chain clusters
    Add MakeRandomChainGraph() which generates random linear chain dependency
    graphs (tx0->tx1->...->tx_{n-1}) with randomized feerates.
    
    Add postlinearize_chain_optimal_tests which verifies, for chains of various
    sizes (5..50 transactions), that PostLinearize matches the quality of the
    exhaustive Linearize() result. A chain satisfies both tree-shape conditions
    simultaneously (each transaction has at most one parent and at most one
    child), so PostLinearize is guaranteed to produce an optimal linearization.
    Also verifies idempotency: a second PostLinearize pass produces no change.
    0a3aacde22
  5. cluster_linearize: add O(N) fast path for chain-shaped clusters
    Add TryLinearizeChain() which detects chain-shaped clusters and
    produces an optimal linearization in O(N) time, bypassing the O(N^2)
    SPF algorithm in Linearize().
    
    Detection relies on the property that a cluster with N transactions is
    a chain iff its ancestor-set sizes form exactly {1,...,N}. This is
    checked in O(N) using a presence array, since Ancestors().Count() is
    O(1) for the fixed-size BitSets used here.
    
    The topological order is recovered in O(N) by mapping each node's
    ancestor count to its position. For a chain, this topological order is
    unique and already optimal — no further chunking or merge pass is
    needed. PostLinearize can also be skipped: every adjacent pair has a
    dependency, so ChunkLinearizationInfo's unconditional merge and
    PostLinearize's merge/swap behave identically (the swap branch never
    triggers).
    ad15466be1
  6. cluster_linearize: benchmark Linearize on worst-case chain feerate distribution
    Add MakeMonotoneChainGraph() which builds a chain of N transactions with
    strictly increasing individual feerates (fee_i = i+1, size_i = 1). This is
    a pessimal feerate distribution for the SPF algorithm: every transaction
    looks attractive in isolation, but including tx_i requires all i ancestors,
    so SPF must discover through many improvement steps that the entire chain
    is the single optimal chunk. Without TryLinearizeChain this exhibits O(N²)
    behaviour; with TryLinearizeChain the result is produced in O(N).
    
    Add LinearizeOptimallyMonotoneChainTotal benchmarks (sizes 9..63 tx) to
    measure this worst-case scenario and confirm the O(N) speedup.
    72bf0faf82
  7. DrahtBot renamed this:
    cluster_linearize: add O(N) fast path for chain-shaped clusters
    cluster_linearize: add O(N) fast path for chain-shaped clusters
    on Feb 21, 2026
  8. DrahtBot commented at 5:12 am on February 21, 2026: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #34616 (Cluster mempool: SFL cost model (take 2) by sipa)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  9. ajtowns commented at 5:50 am on February 21, 2026: contributor
    Why would this be worth the code complexity and review cost?
  10. DrahtBot added the label CI failed on Feb 21, 2026
  11. DrahtBot commented at 6:30 am on February 21, 2026: contributor

    🚧 At least one of the CI tasks failed. Task test ancestor commits: https://github.com/bitcoin/bitcoin/actions/runs/22250911275/job/64374269520 LLM reason (✨ experimental): CI build failed during cmake build with exit code 2.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  12. gmaxwell commented at 6:54 am on February 21, 2026: contributor

    I think a useful question to answer is– is this faster primarily on cases that are already quite fast? It won’t improve the worst case performance (because the worst cases aren’t these chains) – but it could perhaps be justified on the basis of improving average case performance (as time saved could be spent elsewhere, including optimizing harder clusters). But the savings may not be significant if the cases this is speeding up are already fast.

    A similar issue came up vs CSS in that CSS is generally faster than SFL for small clusters or for some range of dependency densities. And small clusters are even quite common… but these are cases that are quite fast for SFL in any case and so keeping CSS around for those cases hardly makes a difference in total runtime.

    I think it would be reasonable to benchmark this with real data and exclude clusters of size 1 (most of them…), and 2/3 (which probably should get their own specialized solvers because one can essentially hardcode the possibilities for those and solve them very quickly e.g. just by checking the area under the curve for the n! orderings).

    If it doesn’t make a real speed up in that kind of benchmark I think it’s hard to justify. One could try to make a case that clusters of larger than 64 could be accepted for such special shapes, but the issue there is that one additional transaction (potentially by a third party) can easily knock it out of the special shape and then the resulting cluster is no longer acceptable and that seems brittle and hard to reason about compared to a consistent cluster size limit.

  13. HowHsu commented at 10:03 am on February 21, 2026: none

    I think a useful question to answer is– is this faster primarily on cases that are already quite fast? It won’t improve the worst case performance (because the worst cases aren’t these chains) – but it could perhaps be justified on the basis of improving average case performance (as time saved could be spent elsewhere, including optimizing harder clusters). But the savings may not be significant if the cases this is speeding up are already fast.

    A similar issue came up vs CSS in that CSS is generally faster than SFL for small clusters or for some range of dependency densities. And small clusters are even quite common… but these are cases that are quite fast for SFL in any case and so keeping CSS around for those cases hardly makes a difference in total runtime.

    I think it would be reasonable to benchmark this with real data and exclude clusters of size 1 (most of them…), and 2/3 (which probably should get their own specialized solvers because one can essentially hardcode the possibilities for those and solve them very quickly e.g. just by checking the area under the curve for the n! orderings).


    Hi gmaxwell, size-1 clusters aren’t handled by this code path; they have their own cluster type and a specialised Relinearize implementation. I see what you mean. The motivation for this PR is that I observed the mempool and recorded its state every time a block was connected. From block 936651 to block 937094 (444 blocks in total), 331 of those blocks had a mempool full of chain-style clusters, and at the generation times of the other 113 blocks, chain-style clusters were still the biggest pattern in the mempool. A typical snapshot looks like this:

     2026-02-17T14:56:54Z Cluster stats at height 937094:
     total_tx=14155, mempool_mem=28 MB, txgraph_mem=1966 KB, cluster_mem=705 KB total_chain_num: 14134 |
     size=1: 213 clusters 14 KB, chain: 213,
     size=2: 7 clusters 1 KB, chain: 14,
     size=3: 3 clusters 0 KB, chain: 9,
     size=4: 2 clusters 0 KB, chain: 8,
     size=6: 133 clusters 57 KB, chain: 798,
     size=7: 16 clusters 7 KB, chain: 112,
     size=8: 4 clusters 1 KB, chain: 32,
     size=9: 2 clusters 1 KB, chain: 18,
     size=10: 1 clusters 0 KB, chain: 10,
     size=11: 5 clusters 3 KB, chain: 55,
     size=12: 8 clusters 5 KB, chain: 96,
     size=13: 3 clusters 2 KB, chain: 39,
     size=14: 1 clusters 0 KB, chain: 14,
     size=15: 3 clusters 2 KB, chain: 45,
     size=16: 2 clusters 1 KB, chain: 32,
     size=17: 24 clusters 20 KB, chain: 408,
     size=18: 3 clusters 2 KB, chain: 54,
     size=19: 2 clusters 1 KB, chain: 38,
     size=20: 261 clusters 250 KB, chain: 5220,
     size=21: 14 clusters 14 KB, chain: 273,
     size=22: 27 clusters 28 KB, chain: 594,
     size=23: 57 clusters 61 KB, chain: 1311,
     size=24: 59 clusters 65 KB, chain: 1416,
     size=25: 133 clusters 156 KB, chain: 3325
    

    “size=25: 133 clusters 156 KB, chain: 3325” means there are 133 clusters of size 25 taking 156 KB in total, with 3325 transactions in chain clusters of that size (25×133 = 3325, so all size-25 clusters are chain clusters). The data I gathered is limited, so the results might not fully reflect real-world behaviour; I’ll keep observing. As you suggested, I will test whether this PR improves average-case performance. If the total time spent handling these chains is not negligible, there may be room for improvement.

  14. HowHsu commented at 10:10 am on February 21, 2026: none

    Why would this be worth the code complexity and review cost?

    #34643 (comment) Based on gmaxwell’s comment, I’m no longer sure whether this optimization is meaningful. I’ll run some additional tests to better understand its impact.

  15. HowHsu renamed this:
    cluster_linearize: add O(N) fast path for chain-shaped clusters
    [RFC] cluster_linearize: add O(N) fast path for chain-shaped clusters
    on Feb 21, 2026
  16. HowHsu commented at 12:51 pm on February 22, 2026: none
    https://howhsu.github.io/thinking-in-bitcoin/articles/chain-fast-path-replay-bench.en.html Here are some test results from the real mainnet mempool over about 12 hours. @gmaxwell @ajtowns
  17. sipa commented at 2:41 pm on February 22, 2026: member

    @HowHsu If you want, I have dumps with mempool cluster data for all of 2023 (built by @sdaftuar by replaying historical p2p data) and recent data since october 2025: https://bitcoin.sipa.be/mempool_dumps/. The file format is:

    • A sequence of dumps (until EOF), consisting of:
      • A uint64_t timestamp (microseconds since epoch)
      • A sequence of clusters in DepGraphFormatter format
      • An empty cluster (just a 0x00 byte).

    The recent data has a dump every 60 seconds, so it’s quite large when decompressed, due to duplication.
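
    For reference, the leading timestamp field of each dump can be decoded as below. This is a hedged sketch: little-endian integer encoding is assumed (matching Bitcoin Core's serializer), and the cluster payload that follows needs DepGraphFormatter, which is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reads the uint64_t microseconds-since-epoch timestamp at the start of a
// dump, assuming little-endian byte order. The clusters that follow (and the
// 0x00 empty-cluster terminator) must be decoded with DepGraphFormatter.
std::uint64_t ReadTimestampLE(const std::vector<unsigned char>& bytes)
{
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v |= std::uint64_t{bytes[i]} << (8 * i);
    return v;
}
```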

    Regarding this PR:

    • The code changes look pretty simple, but as others pointed out, it may have negligible impact.
    • If you want to look into more optimizations, I did write (and at some point, implement) a more generalized version of this: https://delvingbitcoin.org/t/how-to-linearize-your-cluster/303#p-788-h-12-bottleneck-splitting-4. The idea is that you compute the intersection of the (anc(x) union desc(x)) over all transactions x. The result is a strictly linear subset of “bottleneck” transactions in the cluster, and you can split the cluster into sets interspersed by these bottlenecks (which can then be linearized by the normal SFL algorithm). In a fully linear chain, every transaction is a bottleneck. I doubt this is useful, because the startup cost for still running SFL probably dwarfs the gains you can make - but I thought I’d mention it still.
    • Be aware that there are O(n^2) operations inside txgraph, which may end up dominating real-world performance even when actual linearization is O(n). For example, removing transactions from a cluster (and re-compacting the DepGraph following that) is O(n^2), as is constructing a DepGraph from scratch. For this reason, if we really care about performance of linear chains, it may be better to have an entirely separate Cluster implementation for such linear chains that avoids all of this (by avoiding computing/maintaining/storing a DepGraph entirely, for example).
    • It’s SFL (Spanning Forest Linearization), not SPF :)
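
    The bottleneck computation described above is compact with bitset representations. A standalone sketch (64-bit masks as stand-ins for the cluster's BitSets; anc and desc both include the transaction itself, and the encoding here is purely illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bottlenecks are the transactions comparable with every other transaction:
// the intersection over all x of anc(x) | desc(x). In a fully linear chain
// every bit survives; incomparable (parallel) transactions drop out.
std::uint64_t Bottlenecks(const std::vector<std::uint64_t>& anc,
                          const std::vector<std::uint64_t>& desc)
{
    std::uint64_t inter = ~std::uint64_t{0};
    for (std::size_t x = 0; x < anc.size(); ++x) {
        inter &= anc[x] | desc[x]; // everything ordered relative to x
    }
    // Keep only bits corresponding to actual transactions.
    if (anc.size() < 64) inter &= (std::uint64_t{1} << anc.size()) - 1;
    return inter;
}
```

    For a 3-transaction chain 0→1→2 every transaction is a bottleneck; for a diamond 0→{1,2}→3, only transactions 0 and 3 survive the intersection, splitting the cluster around the parallel middle.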
  18. HowHsu commented at 3:46 pm on February 22, 2026: none

    Hi @sipa ,

    @HowHsu If you want, I have dumps with mempool cluster data for all of 2023 (built by @sdaftuar by replaying historical p2p data) and recent data since october 2025: https://bitcoin.sipa.be/mempool_dumps/. The file format is:

    • A sequence of dumps (until EOF), consisting of:

      • A uint64_t timestamp (microseconds since epoch)
      • A sequence of clusters in DepGraphFormatter format
      • An empty cluster (just a 0x00 byte).

    The recent data has a dump every 60 seconds, so it’s quite large when decompressed, due to duplication.

    Thanks for providing this.

    Regarding this PR:

    • The code changes look pretty simple, but as others pointed out, it may have negligible impact.
    • If you want to look into more optimizations, I did write (and at some point, implement) a more generalized version of this: https://delvingbitcoin.org/t/how-to-linearize-your-cluster/303#p-788-h-12-bottleneck-splitting-4. The idea is that you compute the intersection of the (anc(x) union desc(x)) over all transactions x. The result is a strictly linear subset of “bottleneck” transactions in the cluster, and you can split the cluster into sets interspersed by these bottlenecks (which can then be linearized by the normal SFL algorithm). In a fully linear chain, every transaction is a bottleneck. I doubt this is useful, because the startup cost for still running SFL probably dwarfs the gains you can make - but I thought I’d mention it still.

    I’ll look into this.

    • Be aware that there are O(n^2) operations inside txgraph, which may end up dominating real-world performance even when actual linearization is O(n). For example, removing transactions from a cluster (and re-compacting the DepGraph following that) is O(n^2), as is constructing a DepGraph from scratch. For this reason, if we really care about performance of linear chains, it may be better to have an entirely separate Cluster implementation for such linear chains that avoids all of this (by avoiding computing/maintaining/storing a DepGraph entirely, for example).

    I see. I thought about a separate ChainCluster implementation, but the fallback from it to a generic Cluster is costly. I’ll do more research on this idea.

    • It’s SFL (Spanning Forest Linearization), not SPF :)

    Sorry, I’ll fix it if this PR later proves to be meaningful.

  19. HowHsu commented at 8:58 am on February 24, 2026: none

    @HowHsu If you want, I have dumps with mempool cluster data for all of 2023 (built by @sdaftuar by replaying historical p2p data) and recent data since october 2025: https://bitcoin.sipa.be/mempool_dumps/. The file format is:

    • A sequence of dumps (until EOF), consisting of:

      • A uint64_t timestamp (microseconds since epoch)
      • A sequence of clusters in DepGraphFormatter format
      • An empty cluster (just a 0x00 byte).

    The recent data has a dump every 60 seconds, so it’s quite large when decompressed, due to duplication.

    Hi @sipa,

    Does the 2023 dataset also include a dump every 60 seconds (i.e., replaying P2P data to reconstruct the mempool state and dumping it every 60 seconds)? I’m inferring this from the timestamps.

  20. HowHsu commented at 10:43 am on February 24, 2026: none

    @HowHsu If you want, I have dumps with mempool cluster data for all of 2023 (built by @sdaftuar by replaying historical p2p data) and recent data since october 2025: https://bitcoin.sipa.be/mempool_dumps/. The file format is:

    • A sequence of dumps (until EOF), consisting of:

      • A uint64_t timestamp (microseconds since epoch)
      • A sequence of clusters in DepGraphFormatter format
      • An empty cluster (just a 0x00 byte).

    The recent data has a dump every 60 seconds, so it’s quite large when decompressed, due to duplication.

    https://github.com/HowHsu/thinking-in-bitcoin/blob/main/articles/mempool-cluster-distribution-2025.en.md I analyzed the 2023 and recent data; it seems they tell two different stories. The 2025 data shows that chain clusters play a big role in the mempool, while the 2023 data does not. The 2025 data makes me believe a ChainCluster may be valuable. I’m going to make a PoC of it; the open question is how to measure its performance.

  21. sipa commented at 6:12 pm on February 24, 2026: member

    Does the 2023 dataset also include a dump every 60 seconds (i.e., replaying P2P data to reconstruct the mempool state and dumping it every 60 seconds)? I’m inferring this from the timestamps.

    I don’t remember what frequency of dumps was used there, but yes, that’s certainly possible.

  22. DrahtBot added the label Needs rebase on Feb 25, 2026
  23. DrahtBot commented at 1:36 pm on February 25, 2026: contributor
    🐙 This pull request conflicts with the target branch and needs rebase.
  24. instagibbs commented at 3:04 pm on February 25, 2026: member

    re: chain usage patterns, I think most of this is explained by runestones, see long chains like https://mempool.space/tx/d4dbdcfe1b96a441a618c4c11e9a8a249ff94aa2999956c44716a1f61c143813

    iirc these didn’t exist in 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-03-09 21:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me