Follow-up to #32545, part of #30289.
This contains many of the optimizations that were originally part of #32545 itself.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34023.
See the guideline for information on the review process. A summary of reviews will appear here.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
🚧 At least one of the CI tasks failed.
Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/19993781856/job/57338323718
LLM reason (✨ experimental): Cluster linearize tests fail due to a size mismatch (linearization size != depgraph TxCount) triggering assertion failures and aborted tests.
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
🚧 At least one of the CI tasks failed.
Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/20182154259/job/57944831923
LLM reason (✨ experimental): Compiler error: -Werror treats a range-for over a tuple as an error in cluster_linearize.h, causing the build to fail.
Ready for review.
I have made a substantial change here, mostly in the new “get rid of DepData, reuse sets (optimization)” commit. It’s cleaner code, but does a few things quite differently from the merged code. As a result, it touches a good portion of the SFL code due to data structure changes. I’ve used the opportunity to make variable names more consistent throughout.
Sorry for this - if I had considered this approach before #32545, I’d have used a data structure layout closer to this one from the start, avoiding such a big change. As is however, I don’t see a good way of splitting it up into smaller meaningful changes.
🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20444032805/job/58743457093
LLM reason (✨ experimental): Lint check failed due to scripted-diffs detecting mismatches in code changes.
Made a number of improvements:
- … (PickRandomTx, GetReachable, PickMergeCandidate), hopefully making things more readable.
- … Activate() and Deactivate().
- … Improve(), avoiding the need for MergeSequence() in that case.

Ugh, this cluster takes 1.25 ms on my Ryzen 5950X machine to linearize optimally on average (most of the time is in optimizing; the time spent on minimizing chunks is negligible). For comparison, the worst existing benchmark on the same system takes 0.051 ms.
001020001020101020201020301020401020501020601020701020801020901020a01020b01020c01020d01020e01020f0102100102110102127b02137b02147b021501020001000000000100000000010001000000000e0102010000000000000000000001000000000000020f0102020001000101000000000000000000000000000e01020300000001010100000000000000000000000004010204000000010001000000000000000000000000000e010205000000000000000000010000000001000000000b0102060000000000000100000000000000000000010009010207000100000100000000000000000000000000000501020800000000000000000000000000000000010000001a01020a00000000000000000000000000000000000000001701020a00000000000000000000000000000000000100001401020b00000000010000000000000000000000000000001401020c00000000000000000000000000000001000000000801020d00000000000000000100000000000000000000000701020e00000000000000000100000000000000000000000601020f000100000000000000000000000000000000000006010210000000000000000000000001000000000000000003010211000000000001000000000000000000000000000001010212000000000000000000000000000000000000000000220102130000000000000000000000000000000000000000001a0102140000000000000000000000000000000000000000000c010215000000000000000000000000000000000000000000070102000000010000010000000000000000000000000029010201000000000001000000000000010000000000000c01020200000000000000000100000100000000000000090102030000000100010000000000000000000000000001010204000101000000000000000000000000000000000101020500000000000000000000000000000000010000002401020600000001000000000000000000000000000000001d01020700000000000000000000000000000001000000001c01020800000000000000000000000000000000010000001c01020900000000000000000000000000000000010000001501020a00000000000000000000000000000000000001000901020b00000000000000000000000000000000000100000101020c0000000000000000000000000000000000000000002e01020d0000000000000000000000000000000000000000002c01020e0000000000000000000000000000000000000000002b01020f0000000000000000000000000000000000000000001e0102100000000000000000000000000000000000000000000a0102110000000
0000000000000000000000000000000000006010212000000000000000000000000000000000000000000040102130000000000000000000000000000000000000000000000
Interestingly, it’s only 0.15 ms when using a splitting strategy consisting of first 2 random split attempts on a given chunk, and then all max-gain splits.
huh, I was playing around with your new worst case SFL example. I'm getting ~2 ms per optimal solve, which roughly matches your number.
But, if I modify the harness to do 1'700*5 iterations per Linearize, and loop until optimal while passing in the prior linearizations emitted (I was curious how bad it would be with DoWork), I'm getting 245mics to optimal? (edit: looks like it's taking just 2 iterations at this level)
As I raise the maximum iteration count, performance progressively gets worse, until it solves it in one call.
If I drop it down to 1,700 iterations only, it becomes 150mics average. Any lower and it starts to fail to ever solve it, presumably since we're not making enough progress to overcome the fact that we're losing SFL state in between.
I guess this is randomness bailing us out once every N invocations, rather than getting stuck deep in a loop somewhere?
@instagibbs Interesting, that’s arguably a more relevant metric in the grand scheme of things, though it doesn’t say much here, as the example isn’t specifically designed to need a high “minimum iterations to make progress” - it’s very well possible that examples exist that are far worse than this one in that regard.
These outlier long optimal linearization times seem very dependent on the particular splitting strategy used. The current PR here uses max-gain always, except every third split is done randomly (so random is used iff m_self_merges for a chunk is 2, 5, 8, 11, …). It seems like there may be much better (and much worse) choices here. I’m investigating this more.
This is pretty unexpected. I’ve analyzed a number of split strategies, and gathered the maximum iteration count needed to reach optimal, under a large-ish (~380000 clusters) rather arbitrarily constructed set of input clusters (while providing them with bad or missing input linearization).
MMMMMMMMMMMM: ((m_self_merges[chunk_idx] + 0) % 1) == 0: 387722 # always max_gain
MrMrMrMrMrMr: ((m_self_merges[chunk_idx] + 0) % 2) == 0: 5614508
MrrMrrMrrMrr: ((m_self_merges[chunk_idx] + 0) % 3) == 0: 3354629
MrrrMrrrMrrr: ((m_self_merges[chunk_idx] + 0) % 4) == 0: 1648066
MrrrrMrrrrMr: ((m_self_merges[chunk_idx] + 0) % 5) == 0: 2139700
MrrrrrMrrrrr: ((m_self_merges[chunk_idx] + 0) % 6) == 0: 1021390

rrrrrrrrrrrr: ((m_self_merges[chunk_idx] + 0) % 1) != 0: 100000000 # always random
rMrMrMrMrMrM: ((m_self_merges[chunk_idx] + 0) % 2) != 0: 100000000
rMMrMMrMMrMM: ((m_self_merges[chunk_idx] + 0) % 3) != 0: 100000000
rMMMrMMMrMMM: ((m_self_merges[chunk_idx] + 0) % 4) != 0: 4479215
rMMMMrMMMMrM: ((m_self_merges[chunk_idx] + 0) % 5) != 0: 861766
rMMMMMrMMMMM: ((m_self_merges[chunk_idx] + 0) % 6) != 0: 459358

MMMMMMMMMMMM: ((m_self_merges[chunk_idx] + 1) % 1) == 0: 387722 # always max_gain, same as ((m_self_merges[chunk_idx] + 0) % 1) == 0
rMrMrMrMrMrM: ((m_self_merges[chunk_idx] + 1) % 2) == 0: 100000000 # same as ((m_self_merges[chunk_idx] + 0) % 2) != 0
rrMrrMrrMrrM: ((m_self_merges[chunk_idx] + 1) % 3) == 0: 100000000
rrrMrrrMrrrM: ((m_self_merges[chunk_idx] + 1) % 4) == 0: 100000000
rrrrMrrrrMrr: ((m_self_merges[chunk_idx] + 1) % 5) == 0: 100000000
rrrrrMrrrrrM: ((m_self_merges[chunk_idx] + 1) % 6) == 0: 100000000

rrrrrrrrrrrr: ((m_self_merges[chunk_idx] + 1) % 1) != 0: 100000000 # always random, same as ((m_self_merges[chunk_idx] + 0) % 1) != 0
MrMrMrMrMrMr: ((m_self_merges[chunk_idx] + 1) % 2) != 0: 5614508 # same as ((m_self_merges[chunk_idx] + 0) % 2) == 0
MMrMMrMMrMMr: ((m_self_merges[chunk_idx] + 1) % 3) != 0: 3659198 # the current PR
MMMrMMMrMMMr: ((m_self_merges[chunk_idx] + 1) % 4) != 0: 1083539
MMMMrMMMMrMM: ((m_self_merges[chunk_idx] + 1) % 5) != 0: 650703
MMMMMrMMMMMr: ((m_self_merges[chunk_idx] + 1) % 6) != 0: 668819

MMMMMMMMMMMM: m_self_merges[chunk_idx] >= 0: 387722 # always max_gain, same as ((m_self_merges[chunk_idx] + 0) % 1) == 0
rMMMMMMMMMMM: m_self_merges[chunk_idx] >= 1: 388318
rrMMMMMMMMMM: m_self_merges[chunk_idx] >= 2: 594013
rrrMMMMMMMMM: m_self_merges[chunk_idx] >= 3: 8543778
rrrrMMMMMMMM: m_self_merges[chunk_idx] >= 4: 27802956
rrrrrMMMMMMM: m_self_merges[chunk_idx] >= 5: 100000000

rrrrrrrrrrrr: m_self_merges[chunk_idx] < 0: 100000000 # always random, same as ((m_self_merges[chunk_idx] + 0) % 1) != 0
Mrrrrrrrrrrr: m_self_merges[chunk_idx] < 1: 2191100
MMrrrrrrrrrr: m_self_merges[chunk_idx] < 2: 759915
MMMrrrrrrrrr: m_self_merges[chunk_idx] < 3: 464325
MMMMrrrrrrrr: m_self_merges[chunk_idx] < 4: 1042613
MMMMMrrrrrrr: m_self_merges[chunk_idx] < 5: 523463
The maximum iteration count passed to Linearize was 100M, so when 100000000 appears it may well mean “indefinite”.
More max-gain seems to be pretty much always better. However, I’m very surprised that always-random apparently doesn’t reach optimal for some input clusters. Will investigate further, but with this, I’m inclined to take something like ((m_self_merges[chunk_idx] + 0) % 6) != 0 rather than the PR’s ((m_self_merges[chunk_idx] + 1) % 3) != 0.
EDIT: redid the numbers after fixing a bug that sometimes caused splits with zero gain.
MMMMMMMMMMMM: ((m_self_merges[chunk_idx] + 0) % 1) == 0: 387722 # always max_gain
MrMrMrMrMrMr: ((m_self_merges[chunk_idx] + 0) % 2) == 0: 6012644
MrrMrrMrrMrr: ((m_self_merges[chunk_idx] + 0) % 3) == 0: 3354629
MrrrMrrrMrrr: ((m_self_merges[chunk_idx] + 0) % 4) == 0: 1648066
MrrrrMrrrrMr: ((m_self_merges[chunk_idx] + 0) % 5) == 0: 2139700
MrrrrrMrrrrr: ((m_self_merges[chunk_idx] + 0) % 6) == 0: 1003802

rrrrrrrrrrrr: ((m_self_merges[chunk_idx] + 0) % 1) != 0: 416225 # always random
rMrMrMrMrMrM: ((m_self_merges[chunk_idx] + 0) % 2) != 0: 4796320
rMMrMMrMMrMM: ((m_self_merges[chunk_idx] + 0) % 3) != 0: 1427591
rMMMrMMMrMMM: ((m_self_merges[chunk_idx] + 0) % 4) != 0: 1095346
rMMMMrMMMMrM: ((m_self_merges[chunk_idx] + 0) % 5) != 0: 570429
rMMMMMrMMMMM: ((m_self_merges[chunk_idx] + 0) % 6) != 0: 426605

MMMMMMMMMMMM: ((m_self_merges[chunk_idx] + 1) % 1) == 0: 387722 # always max_gain, same as ((m_self_merges[chunk_idx] + 0) % 1) == 0
rMrMrMrMrMrM: ((m_self_merges[chunk_idx] + 1) % 2) == 0: 4796320 # same as ((m_self_merges[chunk_idx] + 0) % 2) != 0
rrMrrMrrMrrM: ((m_self_merges[chunk_idx] + 1) % 3) == 0: 2473649
rrrMrrrMrrrM: ((m_self_merges[chunk_idx] + 1) % 4) == 0: 1520079
rrrrMrrrrMrr: ((m_self_merges[chunk_idx] + 1) % 5) == 0: 2107109
rrrrrMrrrrrM: ((m_self_merges[chunk_idx] + 1) % 6) == 0: 882761

rrrrrrrrrrrr: ((m_self_merges[chunk_idx] + 1) % 1) != 0: 416225 # always random, same as ((m_self_merges[chunk_idx] + 0) % 1) != 0
MrMrMrMrMrMr: ((m_self_merges[chunk_idx] + 1) % 2) != 0: 6012644 # same as ((m_self_merges[chunk_idx] + 0) % 2) == 0
MMrMMrMMrMMr: ((m_self_merges[chunk_idx] + 1) % 3) != 0: 2223704 # the current PR
MMMrMMMrMMMr: ((m_self_merges[chunk_idx] + 1) % 4) != 0: 872725
MMMMrMMMMrMM: ((m_self_merges[chunk_idx] + 1) % 5) != 0: 916415
MMMMMrMMMMMr: ((m_self_merges[chunk_idx] + 1) % 6) != 0: 668819

MMMMMMMMMMMM: m_self_merges[chunk_idx] >= 0: 387722 # always max_gain, same as ((m_self_merges[chunk_idx] + 0) % 1) == 0
rMMMMMMMMMMM: m_self_merges[chunk_idx] >= 1: 471715
rrMMMMMMMMMM: m_self_merges[chunk_idx] >= 2: 598471
rrrMMMMMMMMM: m_self_merges[chunk_idx] >= 3: 270490
rrrrMMMMMMMM: m_self_merges[chunk_idx] >= 4: 248124
rrrrrMMMMMMM: m_self_merges[chunk_idx] >= 5: 291305

rrrrrrrrrrrr: m_self_merges[chunk_idx] < 0: 416225 # always random, same as ((m_self_merges[chunk_idx] + 0) % 1) != 0
Mrrrrrrrrrrr: m_self_merges[chunk_idx] < 1: 700377
MMrrrrrrrrrr: m_self_merges[chunk_idx] < 2: 546130
MMMrrrrrrrrr: m_self_merges[chunk_idx] < 3: 464325
MMMMrrrrrrrr: m_self_merges[chunk_idx] < 4: 467645
MMMMMrrrrrrr: m_self_merges[chunk_idx] < 5: 523463
As expected, generally more max-gain is better, but it does seem that alternating often between the two strategies is worse than sticking with one. Also - and this may be an artifact of a biased test set - it seems that doing a few random iterations first helps.
Certainly there seems to be something wrong. These clusters that need millions/billions of iterations only need a few thousand iterations in my Python version of the algorithm, which ought to be roughly identical to the always-random strategy.
EDIT: found the bug, indeed inside the random strategy (it would under certain conditions split equal-feerate tops off).
… ((m_self_merges[chunk_idx] + 0) % 6) != 0), and fixed the bug in the random strategy.
🚧 At least one of the CI tasks failed.
Task Windows native, fuzz, VS 2022: https://github.com/bitcoin/bitcoin/actions/runs/20908711728/job/60067268129
LLM reason (✨ experimental): Fuzz test txgraph failed due to an assertion violation (iters <= iters_for_optimal) causing the fuzz target to exit with code 1.
🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20942872497/job/60179846256
LLM reason (✨ experimental): Lint check failed (scripted-diffs) causing the CI failure.
This introduces the notion of gain to the SFL algorithm. Given a chunk c,
an active dependency d in it, and the chunks (t, b) that c would split into
if d were deactivated, the gain is defined as either (they are equivalent):
(feerate(t) - feerate(b)) * size(t) * size(b)
fee(t) * size(b) - fee(b) * size(t)
It happens to also be equal to these:
(feerate(t) - feerate(c)) * size(t) * size(c)
fee(t) * size(c) - fee(c) * size(t)
Its relevance is that this metric is proportional to a lower bound on the area
under the fee-size diagram which would be gained IF a deactivation of d does not
result in a self-merge of t and b again.
This commit adds logic to find, within each chunk, the dependency with the highest
gain. In benchmarks, this appears to be a very good heuristic for deciding which
splits are worth making.
This splits the chunk_deps variable in LoadLinearization in two, one for
tracking tx dependencies and one for chunk dependencies. This is a
preparation for a later commit, where chunks won't be identified anymore
by a representative transaction in them, but by a separate index. With
that, it seems weird to keep them both in the same structure if they
will be indexed in an incompatible way.
This small optimization avoids the need to loop over the parents of each
transaction when initializing the dependency-counting structures inside
GetLinearization().
This is a preparation for the next commit, where chunks will no longer
be identified using a representative transaction, but using a set index.
Reduce the load of line changes by doing this rename ahead of time.
-BEGIN VERIFY SCRIPT-
sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
-END VERIFY SCRIPT-
This significantly changes the data structures used in SFL, based on the
observation that the DepData::top_setinfo fields are quite wasteful:
there is one per dependency (up to n^2/4), but we only ever need one per
active dependency (of which there at most n-1). In total, the number of
chunks plus the number of active dependencies is always exactly equal to
the number of transactions, so it makes sense to have a shared pool of
SetInfos, which are used for both chunks and top sets.
To that effect, introduce a separate m_set_info variable, which stores a
SetInfo per transaction. Some of these are used for chunk sets, and some
for active dependencies' top sets. Every activation transforms the
parent's chunk into the top set for the new dependency. Every
deactivation transforms the top set into the new parent chunk.
With that, there is little need for DepData anymore. Use parent/child
TxIdxs to refer to dependencies, and find their top set by having a
child TxIdx-indexed vector in each TxData, rather than a list of
dependencies. This makes code for iterating over dependencies more
natural and simpler.
With indexes into m_set_data (SetIdx) becoming bounded by the number of
transactions, we can use a SetType to represent sets of SetIdxs.
Specifically, an m_chunk_idxs is added which contains all SetIdx
referring to chunks. This leads to a much more natural way of iterating
over chunks.
Also use this opportunity to normalize many variable names.
The combined size of TxData::dep_top_idx can be 16 KiB with 64
transactions and SetIdx = uint32_t. Use a smaller type where possible to
reduce memory footprint and improve cache locality of m_tx_data.
Also switch from an std::vector to an std::array, reducing allocation
overhead and indirections.
This is a simple refactor to make the code more readable.
The current process consists of iterating over the transactions of the
chunk one by one, and then for each figuring out which of its
parents/children are in unprocessed chunks.
Simplify this (and speed it up slightly) by splitting this process into
two phases: first determine the union of all parents/children, and then
find which chunks those belong to.
Instead of computing the set of reachable transactions inside
PickMergeCandidate, precompute the information, and keep it updated in
Activate (by merging the two chunks' reachable sets) and Deactivate (by
recomputing).
This is a small performance gain in itself, but also a preparation for
future optimizations that rely on quickly testing whether dependencies
between chunks exist.
Future changes will rely on knowing the chunk indexes of the two created
chunks after a split. It is natural to return this information from
Deactivate, which also simplifies MergeSequence.
After a split, if the top part has a dependency on the bottom part, the
first MergeSequence will always perform this merge and then stop. This
is referred to as a self-merge.
We can special case these by detecting self-merges early, and avoiding
the overhead of a full MergeSequence which involves two
PickMergeCandidate calls (a successful and an unsuccessful one).
This means we can iterate over all active dependencies in a
cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of
active dependencies in a set of ntx transactions is at most ntx-1.
(*) Asymptotically, this is not actually true, as for large transaction
counts, iterating over a BitSet still scales with ntx. In practice
however, where BitSets are represented by a constant number of integers,
it holds.
This avoids adding them a second time to m_suboptimal_chunks when they
happen to already be there.
Out of an abundance of caution that adversarially-constructed clusters
might reliably result in bad chunk split decisions with the maximum-gain
strategy, make the first and every sixth consecutive attempt to split
the same chunk use a random split strategy instead.
It suffices to initially only attempt one direction of merges in
MakeTopological(), and only try both directions on chunks that are the
result of other merges.
The two calls to UpdateChunk, in Activate and Deactivate each, are subtly
different: the top one needs to update the chunk_idx of iterated
transactions, while the bottom one leaves it unchanged. To exploit this
difference, inline the four function calls, getting rid of UpdateChunks.
This is also a preparation for a future improvement that inlines the
recomputation of reachable sets in the same loop in Deactivate.
Avoid two full iterations over all of a chunk's transactions to
recompute the reachable sets, by inlining them into the
dependency-updating loops.