This introduces a more accurate cost model for SFL, to control how much CPU time is spent inside the algorithm for clusters that cannot be linearized perfectly within a reasonable amount of time.
The goal is to have a metric for the amount of work performed, so that txmempool can impose limits on that work: a lower bound of work that is always performed (unless optimality is reached before that point, of course), and an upper bound to limit the latency and total CPU time spent on this. There are conflicting design goals here:
On the one hand, it seems ideal if this metric correlates closely with actual CPU time, because otherwise the limits become inaccurate.
On the other hand, it seems a nightmare to have the metric be platform/system dependent, as it makes network-wide reasoning nearly impossible. It’s expected that slower systems take longer to do the same thing; this holds for everything, and we don’t need to compensate for this.
There are multiple solutions to this:
One extreme is just measuring the time. This is very accurate, but extremely platform dependent, and also non-deterministic due to random scheduling/cache effects.
The other extreme is using a very abstract metric, like counting how many times certain loops/functions inside the algorithm run. That is what is implemented as of #32545 and #34023: just counting the sum of the numbers of transactions updated across all UpdateChunks() calls. However, it necessarily fails to account for significant portions of runtime spent elsewhere, resulting in a rather wide range of “ns per cost” values.
This PR takes a middle ground, counting many function calls / branches / loops, with weights determined through benchmarking, averaged over a number of systems.
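As a rough illustration of the idea (this is a hypothetical sketch, not the interface or the weights used in this PR), such a cost model boils down to accumulating per-operation counts scaled by calibrated weights, and comparing the total against a budget:

```cpp
#include <cstdint>

// Hypothetical weighted operation-count cost model; names and weights are
// made up for illustration, not taken from this PR.
class WeightedCostModel
{
    uint64_t m_cost{0};

public:
    // Calibrated weights in abstract "cost units" per event, determined by
    // benchmarking the corresponding code paths.
    static constexpr uint64_t WEIGHT_MERGE_STEP = 7;
    static constexpr uint64_t WEIGHT_TX_UPDATED = 2;
    static constexpr uint64_t WEIGHT_DEP_TOGGLED = 3;

    //! Record `count` occurrences of an event with the given weight.
    void AddCost(uint64_t weight, uint64_t count = 1) noexcept { m_cost += weight * count; }
    //! Total abstract work performed so far.
    uint64_t TotalCost() const noexcept { return m_cost; }
    //! Whether an externally imposed work budget has been exhausted.
    bool OverBudget(uint64_t budget) const noexcept { return m_cost >= budget; }
};
```

The linearization code would call AddCost() at the instrumented spots, and txmempool could compare TotalCost() against its lower/upper work bounds; the result is deterministic and platform-independent, while (by construction of the weights) still tracking real CPU time reasonably well.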
DrahtBot
commented at 2:57 pm on December 22, 2025:
contributor
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
See the guideline for information on the review process.
A summary of reviews will appear here.
Conflicts
Reviewers, this pull request conflicts with the following ones:
#34023 (Optimized SFL cluster linearization by sipa)
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
sipa renamed this:
202512 sfl costs
Cluster mempool: more accurate cost model for SFL
on Dec 22, 2025
sipa force-pushed
on Dec 22, 2025
DrahtBot added the label
CI failed
on Dec 22, 2025
sipa force-pushed
on Dec 22, 2025
DrahtBot removed the label
CI failed
on Dec 22, 2025
sipa force-pushed
on Dec 22, 2025
DrahtBot added the label
CI failed
on Dec 22, 2025
DrahtBot
commented at 10:24 pm on December 22, 2025:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
sipa force-pushed
on Dec 22, 2025
DrahtBot removed the label
CI failed
on Dec 23, 2025
gmaxwell
commented at 7:47 pm on December 24, 2025:
contributor
My thoughts:
It should use a metric like this, not time, because it would be weird and undesirable for the behavior to change a lot just based on what system you’re on or what you’re doing with it. People expect slower computers to have higher cpu utilization and the intended limits are low enough that I think they won’t be problematic on even quite slow computers.
An operation-based metric based on the performance of any specific computer will likely be much more reflective of the real resources used on another kind of computer than some simple iteration count, so even if something profiled on a single computer doesn’t reflect yours, it beats keeping the iteration-based approach.
One possibility would be to benchmark on several modern architectures like Zen 4, Mac Pro, and Graviton, normalize their coefficients for equal time on some corpus, and then average them. That is, the weights for the different measures are just the average across the systems, apart from a per-system scaling factor for the expected average performance difference. (Or, in the context of the regression used to fit this, fit a model on the coefficients with Archs-1 additional free scaling factors and then discard the scaling factors), e.g. minimize ((A*x+B*y+C*z+..)+Arch1*(A*x+B*y+C*z+..)+Arch2*(A*x+B*y+C*z+..)-(Arch0time+Arch1time+Arch2time+…))^2. This wouldn’t reflect the cost on any specific system perfectly, but it might give a better estimate overall, across multiple systems. And my argument that “a wrong cost is still better than no cost model” still applies. The downside is that you’d need to measure multiple systems to calibrate it.
An alternative might be to just use the coefficients for whatever popular modern hardware you think is most resource constrained: better to have the unexpectedly slow runs impact the fast computers more. And at least this approach only requires the extensive profiling on one system.
As far as what transactions the calibration should be against, a corpus of historical clusters is probably okay. It would probably make sense to exclude easy clusters from the corpus, e.g. ones with fewer than 10 transactions. You only care whether the metric is accurate for runtimes near the limit. I don’t think using artificial cases is that helpful, as the metric will only be accurate ‘on average’ anyway.
I don’t think the coefficients need to be that well maintained over time, except due to future algorithmic optimizations, simply because doing anything here is better than just iteration counting. And since it’s desirable to behave mostly consistently (so when someone asks why some suboptimal cluster was picked, you can just look and see that finding it is usually over the limit, etc.), you wouldn’t want to update them often; probably only when advances in typical computer speeds also justify meaningfully increasing the limit overall, such that almost all transactions would get as much or more optimization permitted under the new limits.
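For concreteness, the fit sketched above could be written as follows (my notation, not anything from the PR): with per-operation weights $w_j$ shared across architectures, measured operation counts $x_{i,j}$ for cluster $i$, measured linearization times $t_{i,a}$ on architecture $a$, and free per-architecture scaling factors $s_a$ (fixed to $s_0 = 1$ for one reference system),

$$\min_{w,\,s}\ \sum_{a}\sum_{i}\Bigl(s_a \sum_{j} w_j\,x_{i,j} \;-\; t_{i,a}\Bigr)^{2},$$

after which the $s_a$ are discarded and only the shared weights $w_j$ are kept as the cost model.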
sipa force-pushed
on Dec 26, 2025
sipa
commented at 6:45 pm on December 27, 2025:
member
An alternative might be to just use the coefficients for whatever popular modern hardware you think is most resource constrained: better to have the unexpectedly slow runs impact the fast computers more. And at least this approach only requires the extensive profiling on one system.
Fair enough, but an argument can also be made in favor of using metrics tuned for a relatively high-end system: those are likely a better predictor of which machines on the network will, in the short to medium term, be responsible for actual transaction and block propagation. The numbers are such that any node, even one running very low-end hardware, will not have trouble keeping up with mempool/relay, but from a network perspective it’s a well-connected subset of relatively fast systems that is responsible for relay.
I don’t think using artificial cases is that helpful, as the metric will only be accurate ‘on average’ anyways.
Hmm, that does risk situations where adversarially-constructed transactions are significantly (but still probably not more than a small factor, say 2x-3x) more expensive per cost. I think it makes sense to use as wide a variety of clusters as possible, real and synthetic, to calibrate.
gmaxwell
commented at 7:51 pm on December 27, 2025:
contributor
Yeah, I was trying to capture ‘reflective of the future’ in the word “modern”. Fair point on the relatively fast systems being the ones that count, while slow systems are just “also-rans” and it’s sufficient that they just aren’t broken. Fair enough.
Hmm, that does risk situations where adversarially-constructed transactions are significantly (but still probably not more than a small factor, say 2x-3x) more expensive per cost.
A concern I had with synthetics is that with artificially hard clusters the costs might end up distorted by ‘hard’ transactions tending to trigger more of an operation and having that cost inflated in the regression though that operation itself is actually fast. It might make sense to have a starting offset based on the number of dependencies or some costs having a multiplier of the number of dependencies in them.
sipa
commented at 0:13 am on December 30, 2025:
member
A concern I had with synthetics is that with artificially hard clusters the costs might end up distorted by ‘hard’ transactions tending to trigger more of an operation and having that cost inflated in the regression though that operation itself is actually fast.
FWIW, the weights in the current (though now outdated) version of this PR were determined by benchmarking small pieces of code (functions, sometimes individual loop executions) and running linear regression on those. I also tried a regression relating overall linearization time to various linear factors (the number of times each function/loop is executed), but that turned out to be far too noisy; the result ended up containing negative factors, etc.
So I think that when #34023 is in, I will redo the numbers here with the same approach. For every function or non-trivial branch in the algorithm, benchmark it, and gather a histogram of its runtimes, broken down by loop iterations (e.g. in Activate broken down by #tx affected, in PickMergeChunk broken down by number of chunks visited, …). Then look at the 90th or 99th percentile of those runtimes, and perform linear regression with just one variable, that iteration count.
The exact composition of the training data set shouldn’t really matter, except to the extent that it has some variety (and is not, say, all clusters with deps=tx-1 or so).
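Spelled out (again in my notation, under the assumption that this is how the per-operation fits get combined), the approach amounts to fitting, for each instrumented operation, a chosen percentile of its runtime as a linear function of its iteration count $k$, and summing those fits over all executions:

$$t^{(p)}_{\mathrm{op}}(k) \approx \alpha_{\mathrm{op}} + \beta_{\mathrm{op}}\,k,
\qquad
\mathrm{cost} = \sum_{\mathrm{op}}\Bigl(\alpha_{\mathrm{op}}\,n_{\mathrm{op}} + \beta_{\mathrm{op}}\textstyle\sum_{c} k_{\mathrm{op},c}\Bigr),$$

where $p$ is the 90th or 99th percentile, $n_{\mathrm{op}}$ the number of times the operation runs, and $k_{\mathrm{op},c}$ the iteration count (e.g. transactions affected, chunks visited) of its $c$-th execution.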
DrahtBot added the label
Needs rebase
on Jan 6, 2026
sipa force-pushed
on Jan 7, 2026
sipa force-pushed
on Jan 7, 2026
DrahtBot added the label
CI failed
on Jan 7, 2026
DrahtBot removed the label
Needs rebase
on Jan 7, 2026
sipa force-pushed
on Jan 12, 2026
sipa force-pushed
on Jan 13, 2026
sipa force-pushed
on Jan 13, 2026
DrahtBot removed the label
CI failed
on Jan 13, 2026
DrahtBot added the label
Needs rebase
on Jan 15, 2026
sipa force-pushed
on Jan 15, 2026
DrahtBot removed the label
Needs rebase
on Jan 15, 2026
sipa
commented at 2:36 am on February 2, 2026:
member
I can use some help gathering benchmarks to construct a fair cost model here.
Run zstd -d <cluster-benchmarks.zst | ./build/bin/bitcoin-costmodel -iters=3000 run - >output.log. Feel free to modify the -iters= value; more is better. Expect roughly 10-30 seconds per unit of -iters=, depending on hardware; it took ~14 hours on a Ryzen 5950X CPU with -iters=3000.
DrahtBot removed the label
CI failed
on Feb 6, 2026
DrahtBot added the label
Needs rebase
on Feb 11, 2026
sipa force-pushed
on Feb 12, 2026
clusterlin: split tx/chunk dep counting (preparation)
This splits the chunk_deps variable in LoadLinearization in two, one for
tracking tx dependencies and one for chunk dependencies. This is a
preparation for a later commit, where chunks won't be identified anymore
by a representative transaction in them, but by a separate index. With
that, it seems weird to keep them both in the same structure if they
will be indexed in an incompatible way.
Note that the changes in src/test/util/cluster_linearize.h to the table
of worst observed iteration counts are due to switching to a different
data set, and are unrelated to the changes in this commit.
89c36576e2
clusterlin: count chunk deps without loop (optimization)
This small optimization avoids the need to loop over the parents of each
transaction when initializing the dependency-counting structures inside
GetLinearization().
f34ec142af
scripted-diff: rename _rep -> _idx in SFL
This is a preparation for the next commit, where chunks will no longer
be identified using a representative transaction, but using a set index.
Reduce the load of line changes by doing this rename ahead of time.
-BEGIN VERIFY SCRIPT-
sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
-END VERIFY SCRIPT-
2c6fde72a4
clusterlin: get rid of DepData, reuse sets (optimization)
This significantly changes the data structures used in SFL, based on the
observation that the DepData::top_setinfo fields are quite wasteful:
there is one per dependency (up to n^2/4), but we only ever need one per
active dependency (of which there are at most n-1). In total, the number of
chunks plus the number of active dependencies is always exactly equal to
the number of transactions, so it makes sense to have a shared pool of
SetInfos, which are used for both chunks and top sets.
To that effect, introduce a separate m_set_info variable, which stores a
SetInfo per transaction. Some of these are used for chunk sets, and some
for active dependencies' top sets. Every activation transforms the
parent's chunk into the top set for the new dependency. Every
deactivation transforms the top set into the new parent chunk.
With that, there is little need for DepData anymore. Use parent/child
TxIdxs to refer to dependencies, and find their top set by having a
child TxIdx-indexed vector in each TxData, rather than a list of
dependencies. This makes code for iterating over dependencies more
natural and simpler.
With indexes into m_set_data (SetIdx) becoming bounded by the number of
transactions, we can use a SetType to represent sets of SetIdxs.
Specifically, an m_chunk_idxs is added which contains all SetIdx
referring to chunks. This leads to a much more natural way of iterating
over chunks.
Also use this opportunity to normalize many variable names.
9f7bcecbfd
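A heavily simplified sketch of the shared-pool layout this commit message describes (the field and member names below are illustrative stand-ins, not the actual definitions in src/cluster_linearize.h):

```cpp
#include <cstdint>
#include <vector>

using SetIdx = uint32_t;  // Index into the shared pool; always < number of transactions.

// Stand-in for the real SetInfo: some set of transactions plus aggregate data.
struct SetInfoSketch
{
    uint64_t txs{0};   // Bitmask of transactions in the set (stand-in for SetType).
    int64_t fee{0};    // Aggregate fee of the set (assumed field).
    int32_t size{0};   // Aggregate size of the set (assumed field).
};

struct SflStateSketch
{
    // One slot per transaction. At any time each slot holds either a chunk's
    // set or the top set of one active dependency; since chunks plus active
    // dependencies always sum to the number of transactions, the pool is
    // exactly the right size.
    std::vector<SetInfoSketch> set_info;
    // Which SetIdx values currently refer to chunks (stand-in for m_chunk_idxs).
    uint64_t chunk_idxs{0};
};
```

Every activation then turns the parent's chunk slot into the new dependency's top-set slot, and every deactivation turns a top-set slot back into a chunk slot, so no extra per-dependency storage is needed.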
clusterlin: improve TxData::dep_top_idx type (optimization)
The combined size of TxData::dep_top_idx can be 16 KiB with 64
transactions and SetIdx = uint32_t. Use a smaller type where possible to
reduce memory footprint and improve cache locality of m_tx_data.
Also switch from an std::vector to an std::array, reducing allocation
overhead and indirections.
afcd239666
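As an illustration of the kind of change described here (the exact types and the 64-transaction limit are assumed for the example, not taken from the commit):

```cpp
#include <array>
#include <cstdint>
#include <vector>

static constexpr unsigned MAX_CLUSTER = 64;  // Assumed cluster size limit.

// Before: up to 64 heap-allocated entries of 4 bytes each per TxData
// (64 tx * 64 entries * 4 bytes = 16 KiB in total, plus indirection).
struct TxDataBefore
{
    std::vector<uint32_t> dep_top_idx;
};

// After: inline storage with a 1-byte index type, which is enough to address
// at most 64 SetInfo slots (64 tx * 64 entries * 1 byte = 4 KiB in total,
// no allocations and better cache locality).
struct TxDataAfter
{
    std::array<uint8_t, MAX_CLUSTER> dep_top_idx;
};
```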
clusterlin: abstract out functions from MergeStep (refactor)
This is a simple refactor to make the code more readable.
e3c6ae3236
clusterlin: split up OptimizeStep (refactor)
03fa996e39
The current process consists of iterating over the transactions of the
chunk one by one, and then for each figuring out which of its
parents/children are in unprocessed chunks.
Simplify this (and speed it up slightly) by splitting this process into
two phases: first determine the union of all parents/children, and then
find which chunks those belong to.
Instead of computing the set of reachable transactions inside
PickMergeCandidate, make the information precomputed, and updated in
Activate (by merging the two chunks' reachable sets) and Deactivate (by
recomputing).
This is a small performance gain in itself, but also a preparation for
future optimizations that rely on quickly testing whether dependencies
between chunks exist.
a9de59bade
clusterlin: make MergeSequence take SetIdx (simplification)
Future changes will rely on knowing the chunk indexes of the two created
chunks after a split. It is natural to return this information from
Deactivate, which also simplifies MergeSequence.
After a split, if the top part has a dependency on the bottom part, the
first MergeSequence will always perform this merge and then stop. This
is referred to as a self-merge.
We can special-case these by detecting self-merges early, avoiding the overhead of a full MergeSequence, which involves two PickMergeCandidate calls (a successful and an unsuccessful one).
900f21c782
clusterlin: keep track of active children (optimization)
This means we can iterate over all active dependencies in a
cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of
active dependencies in a set of transactions of size ntx is at most ntx-1.
(*) Asymptotically, this is not actually true, as for large transaction
counts, iterating over a BitSet still scales with ntx. In practice
however, where BitSets are represented by a constant number of integers,
it holds.
It suffices to initially only attempt one direction of merges in
MakeTopological(), and only try both directions on chunks that are the
result of other merges.
2e709a07f8
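The bound relied on here follows from the invariant mentioned in the DepData commit above: chunks plus active dependencies always sum to the transaction count, so with at least one chunk,

$$\#\text{chunks} + \#\text{active deps} = n_{\mathrm{tx}} \;\Rightarrow\; \#\text{active deps} \le n_{\mathrm{tx}} - 1,$$

whereas the total number of dependencies can be as large as roughly $n_{\mathrm{tx}}^2/4$.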
clusterlin: inline UpdateChunk into (De)Activate (optimization)
The two calls to UpdateChunk, in Activate and Deactivate each, are subtly
different: the top one needs to update the chunk_idx of iterated
transactions, while the bottom one leaves it unchanged. To exploit this
difference, inline the four function calls, getting rid of UpdateChunks.
This is also a preparation for a future improvement that inlines the
recomputation of reachable sets in the same loop in Deactivate.
a79cb88f11
clusterlin: inline GetReachable into Deactivate (optimization)
Avoid two full iterations over all of a chunk's transactions to
recompute the reachable sets, by inlining them into the
dependency-updating loops.
decb6d7e4a
clusterlin: introduce CostModel class (preparation)
This parametrizes the cost model for the SFL algorithm with another
class. Right now, the behavior of that class matches the naive cost
model so far, but it will be replaced with a more advanced one in a
future commit.
The reason for abstracting this out is that it makes benchmarking for
creating such cost models easy, by instantiating the cost model class
with one that tracks time.
69114d3103
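A minimal sketch of what such a parametrization could look like (the interface and names are assumptions for illustration, not the PR's actual CostModel API):

```cpp
#include <chrono>
#include <cstdint>

// Counting model: accumulates calibrated weights, as the production path would.
class CountingCostModel
{
    uint64_t m_cost{0};
public:
    void Account(uint64_t weight, uint64_t count = 1) noexcept { m_cost += weight * count; }
    uint64_t Total() const noexcept { return m_cost; }
};

// Timing model: ignores the weights and reports elapsed wall-clock time,
// which is what calibration/benchmark runs need.
class TimingCostModel
{
    std::chrono::steady_clock::time_point m_start{std::chrono::steady_clock::now()};
public:
    void Account(uint64_t /*weight*/, uint64_t /*count*/ = 1) noexcept {}
    uint64_t Total() const
    {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
            std::chrono::steady_clock::now() - m_start).count();
    }
};

// The algorithm is templated over the cost model, so the same instrumentation
// points serve both normal operation and benchmarking.
template<typename CostModel>
uint64_t LinearizeWithCostModel(/* cluster, rng seed, ... */)
{
    CostModel cost;
    // ... run the SFL steps, calling cost.Account(weight, count) at each
    //     instrumented function/branch/loop ...
    return cost.Total();
}
```

Instantiating with the timing variant is what makes gathering training data for the weights straightforward.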
clusterlin: adopt trained cost model (feature)
f3861d8383
sipa force-pushed
on Feb 12, 2026
DrahtBot added the label
CI failed
on Feb 12, 2026
DrahtBot
commented at 11:24 pm on February 12, 2026:
contributor
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
DrahtBot removed the label
CI failed
on Feb 13, 2026
DrahtBot removed the label
Needs rebase
on Feb 13, 2026
l0rinc
commented at 6:51 pm on February 17, 2026:
contributor
Reran them with the latest version on a few different hardware platforms, with both gcc and (apple)clang where possible: