Follow-up to #32545, part of #30289.
This contains many of the optimizations that were originally part of #32545 itself.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/34023.
See the guideline for information on the review process. A summary of reviews will appear here.
No conflicts as of last run.
🚧 At least one of the CI tasks failed.
Task 32 bit ARM: https://github.com/bitcoin/bitcoin/actions/runs/19993781856/job/57338323718
LLM reason (✨ experimental): Cluster linearize tests fail due to a size mismatch (linearization size != depgraph TxCount) triggering assertion failures and aborted tests.
Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:
Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.
Leave a comment here, if you need help tracking down a confusing failure.
🚧 At least one of the CI tasks failed.
Task Windows-cross to x86_64, ucrt: https://github.com/bitcoin/bitcoin/actions/runs/20182154259/job/57944831923
LLM reason (✨ experimental): Compiler error: -Werror treats a range-for over a tuple as an error in cluster_linearize.h, causing the build to fail.
Ready for review.
I have made a substantial change here, mostly in the new “get rid of DepData, reuse sets (optimization)” commit. It’s cleaner code, but does a few things quite differently from the merged code. As a result, it touches a good portion of the SFL code due to data structure changes. I’ve used the opportunity to make variable names more consistent throughout.
Sorry for this - if I had considered this approach before #32545, I'd have used a data structure layout closer to this one from the start, avoiding such a big change. As it is, however, I don't see a good way of splitting it up into smaller meaningful changes.
🚧 At least one of the CI tasks failed.
Task lint: https://github.com/bitcoin/bitcoin/actions/runs/20444032805/job/58743457093
LLM reason (✨ experimental): Lint check failed due to scripted-diffs detecting mismatches in code changes.
This introduces the notion of gain to the SFL algorithm. Given a chunk c,
an active dependency d in it, and the chunks (t, b) that c would split into
if d were deactivated, the gain is defined as either of the following (they are equivalent):
(feerate(t) - feerate(b)) * size(t) * size(b)
fee(t) * size(b) - fee(b) * size(t)
It happens to also be equal to these:
(feerate(t) - feerate(c)) * size(t) * size(c)
fee(t) * size(c) - fee(c) * size(t)
Its relevance is that this metric is proportional to a lower bound on the area
under the fee-size diagram which would be gained IF deactivating d does not
result in a self-merge of t and b again.
This commit adds logic to find, within each chunk, the dependency with the highest
gain. In benchmarks, this appears to be a very good heuristic for deciding which
splits are worth making.
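For concreteness, a minimal standalone sketch of the gain computation (FeeAndSize and SplitGain are illustrative stand-ins, not the SetInfo/FeeFrac types used in cluster_linearize.h):

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a chunk's fee/size aggregate; the real code works on
// SetInfo/FeeFrac objects instead.
struct FeeAndSize
{
    int64_t fee;   //!< total fee
    int32_t size;  //!< total size
};

// Gain of deactivating a dependency that would split chunk c into top part t
// and bottom part b: fee(t)*size(b) - fee(b)*size(t). This is positive exactly
// when t has a higher feerate than b (equivalently, than c as a whole).
int64_t SplitGain(const FeeAndSize& t, const FeeAndSize& b)
{
    return t.fee * int64_t{b.size} - b.fee * int64_t{t.size};
}

int main()
{
    // Example: t = (fee 3000, size 100), b = (fee 1000, size 100):
    // gain = 3000*100 - 1000*100 = 200000 > 0, so splitting at this dependency
    // improves the diagram (unless the two parts immediately merge again).
    assert(SplitGain({3000, 100}, {1000, 100}) == 200000);
}
```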
This splits the chunk_deps variable in LoadLinearization in two, one for
tracking tx dependencies and one for chunk dependencies. This is a
preparation for a later commit, where chunks won't be identified anymore
by a representative transaction in them, but by a separate index. With
that, it seems weird to keep them both in the same structure if they
will be indexed in an incompatible way.
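As a rough illustration of the intended separation (names and member types here are made up, not taken from the actual code):

```cpp
#include <bitset>
#include <vector>

static constexpr unsigned MAX_CLUSTER = 64;
using SetType = std::bitset<MAX_CLUSTER>;  // stand-in for the real BitSet type

// Illustrative only: keep transaction-level and chunk-level dependency
// tracking in separate containers, so that chunks can later be keyed by their
// own index type instead of by a representative transaction.
struct LoadLinearizationState
{
    std::vector<SetType> tx_deps;     //!< per transaction: dependency tracking by TxIdx
    std::vector<SetType> chunk_deps;  //!< per chunk: dependency tracking by chunk index
};
```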
This small optimization avoids the need to loop over the parents of each
transaction when initializing the dependency-counting structures inside
GetLinearization().
This is a preparation for the next commit, where chunks will no longer
be identified using a representative transaction, but using a set index.
Reduce the number of line changes by doing this rename ahead of time.
-BEGIN VERIFY SCRIPT-
sed --in-place 's/_rep/_idx/g' src/cluster_linearize.h
-END VERIFY SCRIPT-
This significantly changes the data structures used in SFL, based on the
observation that the DepData::top_setinfo fields are quite wasteful:
there is one per dependency (up to n^2/4), but we only ever need one per
active dependency (of which there are at most n-1). In total, the number of
chunks plus the number of active dependencies is always exactly equal to
the number of transactions, so it makes sense to have a shared pool of
SetInfos, which are used for both chunks and top sets.
To that effect, introduce a separate m_set_info variable, which stores a
SetInfo per transaction. Some of these are used for chunk sets, and some
for active dependencies' top sets. Every activation transforms the
parent's chunk into the top set for the new dependency. Every
deactivation transforms the top set into the new parent chunk.
With that, there is little need for DepData anymore. Use parent/child
TxIdxs to refer to dependencies, and find their top set by having a
child TxIdx-indexed vector in each TxData, rather than a list of
dependencies. This makes code for iterating over dependencies more
natural and simpler.
With indexes into m_set_data (SetIdx) becoming bounded by the number of
transactions, we can use a SetType to represent sets of SetIdxs.
Specifically, an m_chunk_idxs set is added which contains all SetIdxs
referring to chunks. This leads to a much more natural way of iterating
over chunks.
Also use this opportunity to normalize many variable names.
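A rough sketch of the resulting layout, assuming clusters of at most 64 transactions and using std::bitset and plain integers as stand-ins for the real BitSet/FeeFrac types (names loosely follow the description above, not the actual code):

```cpp
#include <array>
#include <bitset>
#include <cstdint>

static constexpr unsigned MAX_CLUSTER = 64;
using TxIdx = uint32_t;
using SetIdx = uint32_t;                   //!< index into m_set_info
using SetType = std::bitset<MAX_CLUSTER>;  //!< set of TxIdx or SetIdx values

struct SetInfo
{
    SetType txs;      //!< transactions in this set (a chunk, or an active dependency's top set)
    int64_t fee{0};   //!< combined fee of those transactions
    int32_t size{0};  //!< combined size of those transactions
};

struct TxData
{
    SetIdx chunk_idx{0};  //!< which m_set_info entry holds this transaction's chunk
    //! Per child TxIdx: the SetIdx of the dependency's top set if the
    //! dependency (this tx -> child) exists and is active, a sentinel otherwise.
    std::array<SetIdx, MAX_CLUSTER> dep_top_idx{};
};

struct ForestState
{
    //! One SetInfo per transaction. At all times, the number of chunks plus the
    //! number of active dependencies equals the number of transactions, so
    //! chunks and top sets can share this single pool.
    std::array<SetInfo, MAX_CLUSTER> m_set_info;
    std::array<TxData, MAX_CLUSTER> m_tx_data;
    SetType m_chunk_idxs;  //!< which SetIdx values currently refer to chunks
};
```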
The combined size of TxData::dep_top_idx can be 16 KiB with 64
transactions and SetIdx = uint32_t. Use a smaller type where possible to
reduce memory footprint and improve cache locality of m_tx_data.
Also switch from an std::vector to an std::array, reducing allocation
overhead and indirections.
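The arithmetic: 64 TxData entries, each with a 64-entry dep_top_idx array of 4-byte indexes, is 16 KiB in total; an 8-bit index brings that down to 4 KiB. Illustrative sketch only (types are assumptions, not the actual code):

```cpp
#include <array>
#include <cstdint>

static constexpr unsigned MAX_CLUSTER = 64;
using SmallSetIdx = uint8_t;  // enough to index at most MAX_CLUSTER SetInfo entries

// A fixed-size std::array additionally avoids a heap allocation and pointer
// chase per TxData, compared to std::vector.
struct TxDataCompact
{
    std::array<SmallSetIdx, MAX_CLUSTER> dep_top_idx{};
};

static_assert(sizeof(TxDataCompact) == MAX_CLUSTER * sizeof(SmallSetIdx));
```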
This is a simple refactor to make the code more readable.
The current process consists of iterating over the transactions of the
chunk one by one, and then for each figuring out which of its
parents/children are in unprocessed chunks.
Simplify this (and speed it up slightly) by splitting this process into
two phases: first determine the union of all parents/children, and then
find which chunks those belong to.
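A sketch of the two-phase approach, with std::bitset standing in for the real transaction-set type and hypothetical parameter names (both vectors are assumed to have at least MAX_CLUSTER entries):

```cpp
#include <bitset>
#include <vector>

static constexpr unsigned MAX_CLUSTER = 64;
using SetType = std::bitset<MAX_CLUSTER>;

// Phase 1: take the union of the parent sets of all transactions in the chunk.
// Phase 2: map that union onto chunks, marking each parent chunk once.
SetType ParentChunks(const SetType& chunk_txs,
                     const std::vector<SetType>& tx_parents,    // per-tx parent sets
                     const std::vector<unsigned>& tx_chunk_idx) // per-tx chunk index
{
    SetType all_parents;
    for (unsigned i = 0; i < MAX_CLUSTER; ++i) {
        if (chunk_txs[i]) all_parents |= tx_parents[i];
    }
    all_parents &= ~chunk_txs;  // only parents outside the chunk itself
    SetType parent_chunk_idxs;
    for (unsigned i = 0; i < MAX_CLUSTER; ++i) {
        if (all_parents[i]) parent_chunk_idxs.set(tx_chunk_idx[i]);
    }
    return parent_chunk_idxs;
}
```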
Instead of computing the set of reachable transactions inside
PickMergeCandidate, precompute that information, and update it in
Activate (by merging the two chunks' reachable sets) and Deactivate (by
recomputing).
This is a small performance gain by itself, but also a preparation for
future optimizations that rely on quickly testing whether dependencies
between chunks exist.
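In sketch form (heavily simplified; the real Activate/Deactivate do far more bookkeeping than this):

```cpp
#include <bitset>

static constexpr unsigned MAX_CLUSTER = 64;
using SetType = std::bitset<MAX_CLUSTER>;

struct ChunkReach
{
    SetType reachable;  //!< precomputed set of transactions reachable from this chunk
};

// On Activate (merging child_chunk into parent_chunk): the merged chunk's
// reachable set is simply the union of the two precomputed sets.
void OnActivate(ChunkReach& parent_chunk, const ChunkReach& child_chunk)
{
    parent_chunk.reachable |= child_chunk.reachable;
}

// On Deactivate (splitting a chunk), no such shortcut exists, and the
// reachable sets of the two resulting chunks are recomputed from scratch.
```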
Future changes will rely on knowing the chunk indexes of the two created
chunks after a split. It is natural to return this information from
Deactivate, which also simplifies MergeSequence.
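A hypothetical sketch of just the interface change (the stub body is a placeholder, not the real split logic):

```cpp
#include <cstdint>
#include <utility>

using TxIdx = uint32_t;
using SetIdx = uint32_t;

// The real Deactivate performs the actual split; this stub only illustrates
// returning the indexes of the two chunks created by it.
std::pair<SetIdx, SetIdx> Deactivate(TxIdx parent, TxIdx child)
{
    SetIdx top_idx{child};     // placeholder values, not the real logic
    SetIdx bottom_idx{parent};
    return {top_idx, bottom_idx};
}

void Example(TxIdx parent, TxIdx child)
{
    // Callers such as MergeSequence can use the returned indexes directly,
    // without having to look the new chunks up afterwards.
    auto [top_idx, bottom_idx] = Deactivate(parent, child);
    (void)top_idx; (void)bottom_idx;
}
```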
After a split, if the top part has a dependency on the bottom part, the
first MergeSequence will always perform this merge and then stop. This
is referred to as a self-merge.
We can special-case these by detecting self-merges early, avoiding the
overhead of a full MergeSequence, which involves two
PickMergeCandidate calls (a successful and an unsuccessful one).
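A hedged sketch of the early detection, assuming per-chunk reachable sets as introduced earlier (names are illustrative):

```cpp
#include <bitset>

static constexpr unsigned MAX_CLUSTER = 64;
using SetType = std::bitset<MAX_CLUSTER>;

// Right after splitting a chunk into top part t and bottom part b, check
// whether t still depends (directly or transitively) on b. If so, the very
// first merge would just recombine them, so that merge can be performed
// immediately instead of running a full MergeSequence.
bool IsSelfMerge(const SetType& top_reachable, const SetType& bottom_txs)
{
    return (top_reachable & bottom_txs).any();
}
```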
This means we can iterate over all active dependencies in a
cluster/chunk in O(ntx) time rather than O(ndeps) (*), as the number of
active dependencies in a set of ntx transactions is at most ntx-1.
(*) Asymptotically, this is not actually true, as for large transaction
counts, iterating over a BitSet still scales with ntx. In practice
however, where BitSets are represented by a constant number of integers,
it holds.
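One possible way this iteration could look, under the assumption (for illustration only, not necessarily how the actual code arranges its slots) that the SetInfo slots covered by a chunk's transaction set are exactly its own chunk entry plus one entry per active dependency inside it:

```cpp
#include <bitset>

static constexpr unsigned MAX_CLUSTER = 64;
using SetType = std::bitset<MAX_CLUSTER>;

// Visit every active dependency's top-set SetIdx inside a chunk by walking the
// chunk's transaction bitset: O(ntx) iterations rather than one per possible
// dependency.
template<typename Fn>
void ForEachActiveDepIdx(const SetType& chunk_txs, const SetType& chunk_idxs, Fn&& fn)
{
    for (unsigned idx = 0; idx < MAX_CLUSTER; ++idx) {
        // Skip the entry that represents the chunk itself.
        if (chunk_txs[idx] && !chunk_idxs[idx]) fn(idx);
    }
}
```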
Made a number of improvements:
- ... (PickRandomTx, GetReachable, PickMergeCandidate), hopefully making things more readable.
- ... Activate() and Deactivate().
- ... Improve(), avoiding the need for MergeSequence() in that case.
This avoids adding them a second time to m_suboptimal_chunks when they
happen to already be there.
Out of an abundance of caution that adversarially-constructed clusters might
reliably result in bad chunk split decisions with the maximum-gain strategy,
make every third consecutive attempt to split the same chunk use a random
split strategy instead.
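A minimal sketch of such a policy (the counter handling is hypothetical, not the actual implementation):

```cpp
#include <cstdint>

// Given how many times in a row the same chunk has been picked for splitting,
// decide whether to fall back to a randomly chosen dependency instead of the
// maximum-gain one: every third consecutive attempt uses the random strategy.
bool UseRandomSplit(uint32_t consecutive_attempts)
{
    return consecutive_attempts % 3 == 2;  // attempts 2, 5, 8, ... are random
}
```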
It suffices to initially only attempt one direction of merges in
MakeTopological(), and only try both directions on chunks that are the
result of other merges.
After the normal optimization process finishes, and finds an optimal
spanning forest, run a second process (while computation budget remains)
to split chunks into minimal equal-feerate chunks.
The two calls to UpdateChunk, in each of Activate and Deactivate, are subtly
different: the top one needs to update the chunk_idx of iterated
transactions, while the bottom one leaves it unchanged. To exploit this
difference, inline the four function calls, getting rid of UpdateChunk.
This is also a preparation for a future improvement that inlines the
recomputation of reachable sets in the same loop in Deactivate.
Avoid two full iterations over all of a chunk's transactions to
recompute the reachable sets, by inlining them into the
dependency-updating loops.