Cluster mempool: more accurate cost model for SFL #34138

Pull: sipa wants to merge 19 commits into bitcoin:master from sipa:202512_sfl_costs, changing 7 files (+729 −445)
  1. sipa commented at 2:57 pm on December 22, 2025: member

    Builds on top of #34023, part of #30289.

    This introduces a more accurate cost model for SFL, to control how much CPU time is spent inside the algorithm for clusters that cannot be linearized perfectly within a reasonable amount of time.

    The goal is to have a metric for the amount of work performed, so that txmempool can impose limits on that work: a lower bound of work that is always performed (unless optimality is reached before that point, of course), and an upper bound to limit the latency and total CPU time spent on this. There are conflicting design goals here:

    • On the one hand, it seems ideal if this metric is closely correlated to actual CPU time, because otherwise the limits become inaccurate.
    • On the other hand, it seems a nightmare to have the metric be platform/system dependent, as it makes network-wide reasoning nearly impossible. It’s expected that slower systems take longer to do the same thing; this holds for everything, and we don’t need to compensate for this.

    There are multiple solutions to this:

    • One extreme is just measuring the time. This is very accurate, but extremely platform dependent, and also non-deterministic due to random scheduling/cache effects.
    • The other extreme is using a very abstract metric, like counting how many times certain loops/functions inside the algorithm run. That is what is implemented as of #32545 and #34023: just counting the sum of the numbers of transactions updated across all UpdateChunks() calls. However, it necessarily fails to account for significant portions of the runtime spent elsewhere, resulting in a rather wide range of “ns per cost” values.
    • This PR takes a middle ground, counting many function calls / branches / loops, with weights that were determined through benchmarking on a reference system (Zen 4 based); a sketch of such a weighted counter follows below.

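    To illustrate the intended middle ground, here is a minimal sketch of a weighted operation counter (all names and weights below are illustrative, not this PR’s actual API):

      // Hypothetical sketch: accumulate benchmark-derived per-operation costs,
      // so the caller can stop optimizing once a budget is exhausted.
      #include <cstdint>

      struct CostModel {
          // Example weights in abstract cost units (~ns on the reference system).
          static constexpr uint64_t ACTIVATE_BASE = 15;       // fixed cost per Activate() call
          static constexpr uint64_t ACTIVATE_PER_TX = 4;      // per transaction updated
          static constexpr uint64_t PICK_MERGE_PER_CHUNK = 6; // per chunk visited

          uint64_t m_cost{0};

          void CountActivate(unsigned num_tx) { m_cost += ACTIVATE_BASE + ACTIVATE_PER_TX * num_tx; }
          void CountPickMerge(unsigned num_chunks) { m_cost += PICK_MERGE_PER_CHUNK * num_chunks; }
          bool OverBudget(uint64_t budget) const { return m_cost >= budget; }
      };
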
    This is draft because:

    • The weights used are outdated, as several non-trivial improvements were made to #32545 and #34023 since.
    • I don’t know if it is the right middle ground, and it seems good to have some conceptual discussion first:
      • Using benchmarks on a reference system inherently captures some platform specifics, making it possible that the metric is not very accurate on other systems.
      • A benchmark-based metric is nontrivial to maintain.
  2. DrahtBot commented at 2:57 pm on December 22, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34138.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #34085 (cluster mempool: exploit SFL properties in txgraph by sipa)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

    LLM Linter (✨ experimental)

    Possible places where named args for integral literals may be used (e.g. func(x, /*named_arg=*/0) in C++, and func(x, named_arg=0) in Python):

    • [m_nonminimal_chunks.emplace_back(chunk_idx, pivot_idx, flags ^ 3)] in src/cluster_linearize.h


  3. sipa renamed this:
    202512 sfl costs
    Cluster mempool: more accurate cost model for SFL
    on Dec 22, 2025
  4. sipa force-pushed on Dec 22, 2025
  5. DrahtBot added the label CI failed on Dec 22, 2025
  6. sipa force-pushed on Dec 22, 2025
  7. DrahtBot removed the label CI failed on Dec 22, 2025
  8. sipa force-pushed on Dec 22, 2025
  9. DrahtBot added the label CI failed on Dec 22, 2025
  10. DrahtBot commented at 10:24 pm on December 22, 2025: contributor

    🚧 At least one of the CI tasks failed.
    Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20443977146/job/58743290866
    LLM reason (✨ experimental): Lint check failed due to scripted_diff differences in the patch.

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  11. sipa force-pushed on Dec 22, 2025
  12. DrahtBot removed the label CI failed on Dec 23, 2025
  13. gmaxwell commented at 7:47 pm on December 24, 2025: contributor

    My thoughts:

    It should use a metric like this, not time, because it would be weird and undesirable for the behavior to change a lot based just on what system you’re on or what you’re doing with it. People expect slower computers to have higher CPU utilization, and the intended limits are low enough that I think they won’t be problematic even on quite slow computers.

    An operation-based metric tuned to the performance of any specific computer will likely be much more reflective of the real resources used on another kind of computer than some simple iteration count – so even if something profiled on a single computer doesn’t reflect yours, it beats keeping the iteration-based approach.

    One possibility would be to benchmark on several modern architectures like zen4, macpro, and graviton, normalize their coefficients for equal time on some corpus, and then average them. That is, the weights for the different measures are just the average across the systems, up to some per-system scaling factor reflecting the expected average performance difference. (Or, in the context of the regression used to fit this, fit a model on the coefficients with Archs-1 additional free scaling factors and then discard the scaling factors) – e.g. minimize ((A*x+B*y+C*z+..)+Arch1*(A*x+B*y+C*z+..)+Arch2*(A*x+B*y+C*z+..)-(Arch0time+Arch1time+Arch2time+…))^2. This wouldn’t reflect the cost on any specific system perfectly, but it might give a better estimate overall, across multiple systems. And my argument of “a wrong cost is still better than no cost model” still applies. The downside is that you’d need to measure multiple systems to calibrate it.
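    One way to read the proposed fit, with per-architecture scale factors s_a (s_0 fixed to 1) that are discarded after fitting (this is an interpretation, not necessarily the exact formulation intended):

      \min_{A,B,C,\dots,\,s_1,\dots,s_{k-1}} \; \sum_{a=0}^{k-1} \sum_{j} \Bigl( s_a \, (A x_j + B y_j + C z_j + \dots) - t_{a,j} \Bigr)^2, \qquad s_0 = 1

    where x_j, y_j, z_j, … are the operation counts for cluster j (identical across systems, the algorithm being deterministic) and t_{a,j} is the measured time for cluster j on architecture a.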

    An alternative might be to just use the coefficients for whatever popular modern hardware you think is most resource constrained – better to have the unexpectedly slow runs impact the fast computers more. And at least this approach only requires the extensive profiling on one system.

    As far as which transactions calibration should be done against, a corpus of historical clusters is probably okay. It would probably make sense to exclude easy clusters from the corpus, e.g. ones with fewer than 10 transactions: you only care whether the metric is accurate for times near the limit. I don’t think using artificial cases is that helpful, as the metric will only be accurate ‘on average’ anyways.

    I don’t think the coefficients need to be that well maintained over time, except after future algorithmic optimizations, simply because doing anything here is better than just iteration counting. And since it’s desirable to behave mostly consistently (so when someone asks why some suboptimal cluster was picked, you can just look and see that finding it was over the limit, etc.), you wouldn’t want to update them often – probably only when advances in typical computer speeds also justify meaningfully increasing the limit overall, such that almost all transactions would get as much or more optimization permitted under the new limits.

  14. clusterlin: split chunks based on maximum gain (optimization)
    This introduces the notion of gain to the SFL algorithm. Given a chunk c,
    an active dependency d in it, and the chunks (t, b) that c would split into
    if d were deactivated, the gain is defined as either (they are equivalent):
    
      (feerate(t) - feerate(b)) * size(t) * size(b)
      fee(t) * size(b) - fee(b) * size(t)
    
    It happens to also be equal to these:
    
      (feerate(t) - feerate(c)) * size(t) * size(c)
      fee(t) * size(c) - fee(c) * size(t)
    
    Its relevance is that this metric is proportional to a lower bound on the area
    under the fee-size diagram which would be gained IF a deactivation of d does not
    result in a self-merge of t and b again.
    
    This commit adds logic to find, within each chunk, the dependency with the highest
    gain. In benchmarks, this appears to be a very good heuristic for deciding which
    splits are worth making.
    bc27102438
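    A minimal sketch of the gain computation above, using the integer form fee(t)*size(b) - fee(b)*size(t) (the types and names here are illustrative, not the PR’s actual code):

      #include <cstdint>

      struct FeeSize { int64_t fee; int32_t size; };

      // Gain of deactivating the dependency that splits a chunk into top t and
      // bottom b; positive iff feerate(t) > feerate(b), i.e. the split helps.
      int64_t SplitGain(const FeeSize& t, const FeeSize& b) {
          return t.fee * int64_t{b.size} - b.fee * int64_t{t.size};
      }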
  15. clusterlin: split tx/chunk dep counting (preparation)
    This splits the chunk_deps variable in LoadLinearization in two, one for
    tracking tx dependencies and one for chunk dependencies. This is a
    preparation for a later commit, where chunks won't be identified anymore
    by a representative transaction in them, but by a separate index. With
    that, it seems weird to keep them both in the same structure if they
    will be indexed in an incompatible way.
    c5fee18a3f
  16. clusterlin: count chunk deps without loop (optimization)
    This small optimization avoids the need to loop over the parents of each
    transaction when initializing the dependency-counting structures inside
    GetLinearization().
    99736b7800
  17. scripted-diff: rename _rep -> _idx in SFL
    This is a preparation for the next commit, where chunks will no longer
    be identified using a representative transaction, but using a set index.
    Reduce the amount of line changes there by doing this rename ahead of time.
    
    -BEGIN VERIFY SCRIPT-
    sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
    -END VERIFY SCRIPT-
    0876d384e6
  18. clusterlin: get rid of DepData, reuse sets (optimization)
    This significantly changes the data structures used in SFL, based on the
    observation that the DepData::top_setinfo fields are quite wasteful:
    there is one per dependency (up to n^2/4), but we only ever need one per
    active dependency (of which there are at most n-1). In total, the number of
    chunks plus the number of active dependencies is always exactly equal to
    the number of transactions (the active dependencies form a spanning forest:
    a forest with n nodes and k trees has exactly n-k edges), so it makes sense
    to have a shared pool of SetInfos, which are used for both chunks and top
    sets.
    
    To that effect, introduce a separate m_set_info variable, which stores a
    SetInfo per transaction. Some of these are used for chunk sets, and some
    for active dependencies' top sets. Every activation transforms the
    parent's chunk into the top set for the new dependency. Every
    deactivation transforms the top set into the new parent chunk.
    
    With that, there is little need for DepData anymore. Use parent/child
    TxIdxs to refer to dependencies, and find their top set by having a
    child TxIdx-indexed vector in each TxData, rather than a list of
    dependencies. This makes code for iterating over dependencies more
    natural and simpler.
    
    With indexes into m_set_data (SetIdx) becoming bounded by the number of
    transactions, we can use a SetType to represent sets of SetIdxs.
    Specifically, an m_chunk_idxs set is added which contains all SetIdxs
    referring to chunks. This leads to a much more natural way of iterating
    over chunks.
    
    Also use this opportunity to normalize many variable names.
    a9d05cc3f7
  19. clusterlin: improve TxData::dep_top_idx type (optimization)
    The combined size of TxData::dep_top_idx can be 16 KiB with 64
    transactions and SetIdx = uint32_t. Use a smaller type where possible to
    reduce memory footprint and improve cache locality of m_tx_data.
    
    Also switch from an std::vector to an std::array, reducing allocation
    overhead and indirections.
    7726f67db2
  20. clusterlin: abstract out functions from MergeStep (refactor)
    This is a simple refactor to make the code more readable.
    f452e1b7e5
  21. clusterlin: simplify PickMergeCandidate (optimization)
    The current process consists of iterating over the transactions of the
    chunk one by one, and then for each figuring out which of its
    parents/children are in unprocessed chunks.
    
    Simplify this (and speed it up slightly) by splitting this process into
    two phases: first determine the union of all parents/children, and then
    find which chunks those belong to.
    e4d5ef982c
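    A rough sketch of the two-phase approach above, assuming a fixed-size bitset as the transaction-set type (all types and names are illustrative):

      #include <bitset>
      #include <cstddef>
      #include <set>
      #include <vector>

      using TxSet = std::bitset<64>; // illustrative transaction-set type

      // Phase 1: take the union of all parents of the chunk's transactions.
      // Phase 2: map the resulting transactions to the chunks they belong to.
      std::set<unsigned> ParentChunks(const TxSet& chunk,
                                      const std::vector<TxSet>& parents,
                                      const std::vector<unsigned>& chunk_of_tx)
      {
          TxSet all_parents;
          for (std::size_t i = 0; i < parents.size(); ++i) {
              if (chunk[i]) all_parents |= parents[i];
          }
          all_parents &= ~chunk; // only transactions outside the chunk matter
          std::set<unsigned> result;
          for (std::size_t i = 0; i < chunk_of_tx.size(); ++i) {
              if (all_parents[i]) result.insert(chunk_of_tx[i]);
          }
          return result;
      }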
  22. clusterlin: precompute reachable sets (optimization)
    Instead of computing the set of reachable transactions inside
    PickMergeCandidate, precompute the information, and update it in
    Activate (by merging the two chunks' reachable sets) and Deactivate (by
    recomputing).
    
    This is a small performance gain in itself, but also a preparation for
    future optimizations that rely on quickly testing whether dependencies
    between chunks exist.
    aaebb6c904
  23. clusterlin: make MergeSequence take SetIdx (simplification)
    Future changes will rely on knowing the chunk indexes of the two created
    chunks after a split. It is natural to return this information from
    Deactivate, which also simplifies MergeSequence.
    13942f6930
  24. clusterlin: special-case self-merges (optimization)
    After a split, if the top part has a dependency on the bottom part, the
    first MergeSequence will always perform this merge and then stop. This
    is referred to as a self-merge.
    
    We can special-case these by detecting self-merges early, avoiding
    the overhead of a full MergeSequence, which involves two
    PickMergeCandidate calls (a successful and an unsuccessful one).
    c7c076d6df
  25. clusterlin: keep track of active children (optimization)
    This means we can iterate over all active dependencies in a
    cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of
    active dependencies in a set of transactions of size ntx is at most ntx-1.
    
    (*) Asymptotically, this is not actually true, as for large transaction
    counts, iterating over a BitSet still scales with ntx. In practice
    however, where BitSets are represented by a constant number of integers,
    it holds.
    ab0d2285ab
  26. clusterlin: track suboptimal chunks (optimization)
    This avoids adding chunks a second time to m_suboptimal_chunks when they
    happen to already be there.
    383cf262c8
  27. clusterlin: use random split in SFL on every 3rd attempt (feature)
    Out of an abundance of caution that adversarially-constructed clusters might
    reliably result in bad chunk split decisions with the maximum-gain strategy,
    make every third consecutive attempt to split the same chunk use a random
    split strategy instead.
    8a896005c1
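    A sketch of the attempt-scheduling logic described above (the counter name is hypothetical):

      // Attempts 0 and 1 on the same chunk use the maximum-gain split;
      // attempt 2 uses a random split; the pattern then repeats.
      bool UseRandomSplit(unsigned consecutive_attempts) {
          return consecutive_attempts % 3 == 2;
      }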
  28. clusterlin: unidirectional MakeTopological initially (optimization)
    It suffices to initially only attempt one direction of merges in
    MakeTopological(), and only try both directions on chunks that are the
    result of other merges.
    9342728274
  29. clusterlin: minimize chunks (feature)
    After the normal optimization process finishes, and finds an optimal
    spanning forest, run a second process (while computation budget remains)
    to split chunks into minimal equal-feerate chunks.
    c990f53dc4
  30. clusterlin: inline UpdateChunk into (De)Activate (optimization)
    The two calls to UpdateChunk, in each of Activate and Deactivate, are subtly
    different: the top one needs to update the chunk_idx of iterated
    transactions, while the bottom one leaves it unchanged. To exploit this
    difference, inline the four function calls, getting rid of UpdateChunk.
    
    This is also a preparation for a future improvement that inlines the
    recomputation of reachable sets in the same loop in Deactivate.
    cd20513ef6
  31. clusterlin: inline GetReachable into Deactivate (optimization)
    Avoid two full iterations over all of a chunk's transactions to
    recompute the reachable sets, by inlining them into the
    dependency-updating loops.
    cf0d8da590
  32. clusterlin: add more accurate cost modeling (feature)
    This adds a rough estimate of algorithm runtime, so it can be interrupted if no
    solution is found in time. Due to inherent differences between platforms, this
    will not be extremely accurate, but for consistency it is preferable over
    directly measuring time.
    
    The runtime corresponds to up to 2 ns/cost now on modern desktop systems, so
    increase ACCEPTABLE_ITERS correspondingly.
    04b9afa063
  33. sipa force-pushed on Dec 26, 2025
  34. sipa commented at 6:45 pm on December 27, 2025: member

    An alternative might be to just use the coefficients for whatever popular modern hardware you think is most resource constrained – better to have the unexpectedly slow runs impact the fast computers more. And at least this approach only requires the extensive profiling on one system.

    Fair enough, but an argument can also be made in favor of using metrics tuned for a relatively high-end system, in that those are likely a better match for the machines on the network that, in the short to medium term future, will be responsible for actual transaction and block propagation. The numbers are such that any node, even one running very low-end hardware, will not have trouble keeping up with mempool/relay, but from a network perspective, it’s a well-connected subset of relatively fast systems that is responsible for relay.

    I don’t think using artificial cases is that helpful, as the metric will only be accurate ‘on average’ anyways.

    Hmm, that does risk situations where adversarially-constructed transactions are significantly (but still probably not more than a small factor, say 2x-3x) more expensive per cost. I think it makes sense to use as wide a variety of clusters as possible, real and synthetic, to calibrate.

  35. gmaxwell commented at 7:51 pm on December 27, 2025: contributor

    Yeah, I was trying to capture ‘reflective of the future’ in the word “modern”. Fair point on the relatively fast systems being the ones that count – while slow systems are just ‘also-rans’, and it’s sufficient that they just aren’t broken. Fair enough.

    Hmm, that does risk situations where adversarially-constructed transactions are significantly (but still probably not more than a small factor, say 2x-3x) more expensive per cost.

    A concern I had with synthetics is that with artificially hard clusters the costs might end up distorted by ‘hard’ transactions tending to trigger more of an operation, and having that cost inflated in the regression even though that operation itself is actually fast. It might make sense to have a starting offset based on the number of dependencies, or to give some costs a multiplier based on the number of dependencies involved.

  36. sipa commented at 0:13 am on December 30, 2025: member

    A concern I had with synthetics is that with artificially hard clusters the costs might end up distorted by ‘hard’ transactions tending to trigger more of an operation, and having that cost inflated in the regression even though that operation itself is actually fast.

    FWIW, the weights in the current (though now outdated) PR were determined by benchmarking small pieces of code (functions, sometimes individual loop executions) and running linear regression on those. I also tried a regression relating overall linearization time to various linear factors (the number of times each function/loop is executed), but that turned out to be far too noisy; the result ended up containing negative factors, etc.

    So I think that when #34023 is in, I will redo the numbers here with the same approach. For every function or non-trivial branch in the algorithm, benchmark it, and gather a histogram of its runtimes, broken down by loop iterations (e.g. in Activate broken down by #tx affected, in PickMergeChunk broken down by number of chunks visited, …). Then look at the 90th or 99th percentile of those runtimes, and perform linear regression with just one variable, that iteration count.

    The exact composition of the training data set shouldn’t really matter, except to the extent that it has some variety, and doesn’t consist of, say, all clusters with deps=tx-1 or so.
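
    As a sketch of the described calibration, one could fit time ~= base + slope * iterations by least squares over (iteration count, percentile runtime) pairs per function/branch (all names here are illustrative):

      #include <utility>
      #include <vector>

      struct Fit { double base; double slope; };

      // Ordinary least squares over (iterations, runtime-percentile) samples.
      Fit FitCost(const std::vector<std::pair<double, double>>& samples) {
          double n = samples.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
          for (const auto& [x, y] : samples) {
              sx += x; sy += y; sxx += x * x; sxy += x * y;
          }
          const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
          return {(sy - slope * sx) / n, slope};
      }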

  37. DrahtBot added the label Needs rebase on Jan 6, 2026
  38. DrahtBot commented at 5:16 pm on January 6, 2026: contributor
    🐙 This pull request conflicts with the target branch and needs rebase.
