cluster mempool: add TxGraph work controls

sipa commented at 9:29 pm on April 13, 2025: member

Part of #30289. Builds on top of #31553.

So far, the TxGraph::DoWork() function took no parameters, and just made all clusters reach the “acceptable” internal quality level by performing a minimum number of improvement iterations on it, but:

Did not attempt to go beyond that.
Was broken, as the QualityLevel of optimal clusters that merge together was not being reset.

Fix this by adding an argument to DoWork() to control how much work it is allowed to do right now, which will first be used to get all clusters to the acceptable level, and if more budget remains, use it to try to get some or all clusters optimal. The function will now return true if all clusters are known to be optimal (and thus no further work remains). This is verified in the tests, by remembering whether the graph is optimal, and if it is at the end of the simulation run, verify that the overall linearization cannot be improved further.

DrahtBot commented at 9:29 pm on April 13, 2025: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32263.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	instagibbs, ismaelsadeeq, glozow

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#32545 (Replace cluster linearization algorithm with SFL by sipa)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

LLM Linter (✨ experimental)

Possible typos and grammar issues:

“nothing to do left” → “nothing left to do” [correct word order for clarity]

drahtbot_id_4_m

sipa marked this as a draft on Apr 13, 2025

DrahtBot added the label CI failed on Apr 13, 2025

DrahtBot commented at 10:42 pm on April 13, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40471194330

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

sipa force-pushed on Apr 14, 2025

DrahtBot removed the label CI failed on Apr 14, 2025

in src/test/fuzz/txgraph.cpp:321 in 81a88311b5 outdated

268@@ -269,9 +269,11 @@ FUZZ_TARGET(txgraph)
269     auto max_count = provider.ConsumeIntegralInRange<DepGraphIndex>(1, MAX_CLUSTER_COUNT_LIMIT);
270     // And the maximum combined size of transactions per cluster.
271     auto max_size = provider.ConsumeIntegralInRange<uint64_t>(1, 0x3fffff * MAX_CLUSTER_COUNT_LIMIT);
272+    // And the number of iterations to consider a cluster acceptably linearized.
273+    auto acceptable_iters = provider.ConsumeIntegralInRange<uint64_t>(0, 10000);

instagibbs commented at 5:27 pm on April 14, 2025:

nice, this should allow some non-optimals to finally be created :+1:

DrahtBot added the label Needs rebase on Apr 17, 2025

sipa force-pushed on Apr 17, 2025

DrahtBot removed the label Needs rebase on Apr 17, 2025

sipa force-pushed on Apr 22, 2025

DrahtBot added the label CI failed on Apr 28, 2025

sipa force-pushed on Apr 30, 2025

DrahtBot removed the label CI failed on Apr 30, 2025

sipa force-pushed on May 12, 2025

sipa force-pushed on May 13, 2025

sipa added the label Mempool on May 20, 2025

sipa force-pushed on May 21, 2025

DrahtBot added the label CI failed on May 22, 2025

sipa force-pushed on May 22, 2025

DrahtBot removed the label CI failed on May 22, 2025

sipa force-pushed on May 22, 2025

sipa force-pushed on May 28, 2025

DrahtBot added the label CI failed on May 28, 2025

DrahtBot commented at 3:32 am on May 28, 2025: contributor

🚧 At least one of the CI tasks failed. Task macOS-cross, gui, no tests: https://github.com/bitcoin/bitcoin/runs/43009701810 LLM reason (✨ experimental): The CI failure is due to a compilation error caused by an incompatible call to std::max with mixed types in txgraph.cpp.

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot removed the label CI failed on May 28, 2025

sipa force-pushed on May 28, 2025

sipa force-pushed on Jun 17, 2025

DrahtBot added the label CI failed on Jun 17, 2025

DrahtBot commented at 10:53 pm on June 17, 2025: contributor

🚧 At least one of the CI tasks failed. Task multiprocess, i686, DEBUG: https://github.com/bitcoin/bitcoin/runs/44292522153 LLM reason (✨ experimental): Build failure due to compilation errors in txgraph_tests.cpp caused by incorrect function usage.

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

sipa force-pushed on Jun 18, 2025

DrahtBot removed the label CI failed on Jun 18, 2025

sipa force-pushed on Jun 18, 2025

sipa force-pushed on Jul 2, 2025

DrahtBot added the label CI failed on Jul 2, 2025

DrahtBot commented at 8:12 pm on July 2, 2025: contributor

🚧 At least one of the CI tasks failed. Task tidy: https://github.com/bitcoin/bitcoin/runs/45240690589 LLM reason (✨ experimental): The CI failure is caused by errors from clang-tidy due to inclusion of deprecated headers in txgraph.cpp.

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot removed the label CI failed on Jul 2, 2025

sipa force-pushed on Jul 7, 2025

sipa marked this as ready for review on Jul 7, 2025

sipa commented at 9:25 pm on July 7, 2025: member

Rebased after the merge of #31553, and undrafted.

in src/txgraph.cpp:981 in afd15f2a03 outdated

977@@ -978,10 +978,11 @@ bool Cluster::Split(TxGraphImpl& graph) noexcept
978     // Iterate over the connected components of this Cluster's m_depgraph.
979     while (todo.Any()) {
980         auto component = m_depgraph.FindConnectedComponent(todo);
981+        auto split_quality = component.Count() == 1 ? QualityLevel::OPTIMAL : new_quality;

instagibbs commented at 7:48 pm on July 10, 2025:

would this also be trivially true for <=2?

sipa commented at 3:39 am on July 11, 2025:

Nice, indeed. With PostLinearize I think even 3 may work, but changed to 2 now.

in src/bench/txgraph.cpp:51 in 46c49fbdd0 outdated

45@@ -46,6 +46,9 @@ void BenchTxGraphTrim(benchmark::Bench& bench)
46     static constexpr int NUM_DEPS_PER_BOTTOM_TX = 100;
47     /** Set a very large cluster size limit so that only the count limit is triggered. */
48     static constexpr int32_t MAX_CLUSTER_SIZE = 100'000 * 100;
49+    /** Set a very high number for acceptable iterations, so that we certainly benchmark optimal
50+     *  linearization. */
51+    static constexpr uint64_t NUM_ACCEPTABLE_ITERS = 100'000'000;

instagibbs commented at 8:00 pm on July 10, 2025:

comment below says

“// Make the graph linearize all clusters acceptably.”

I’m guessing these chains are trivial to linearize, so this was likely already achieved with 10k iterations?

sipa commented at 3:28 am on July 11, 2025:

Right, they only have one valid linearization anyway, so any number of iterations suffices to find optimal (though I guess it could require some iterations to figure out that it is actually optimal).

in src/txgraph.cpp:258 in 46c49fbdd0 outdated

254@@ -255,6 +255,8 @@ class TxGraphImpl final : public TxGraph
255     const DepGraphIndex m_max_cluster_count;
256     /** This TxGraphImpl's maximum cluster size limit. */
257     const uint64_t m_max_cluster_size;
258+    /** The number of linearization improvements steps needed for it to be considered acceptable. */

instagibbs commented at 8:02 pm on July 10, 2025:

0    /** The number of linearization improvements steps needed per cluster to be considered acceptable. */

sipa commented at 3:39 am on July 11, 2025:

Done.

in src/txgraph.h:251 in 46c49fbdd0 outdated

246@@ -247,7 +247,8 @@ class TxGraph
247 
248 /** Construct a new TxGraph with the specified limit on the number of transactions within a cluster,
249  *  and on the sum of transaction sizes within a cluster. max_cluster_count cannot exceed
250- *  MAX_CLUSTER_COUNT_LIMIT. */
251-std::unique_ptr<TxGraph> MakeTxGraph(unsigned max_cluster_count, uint64_t max_cluster_size) noexcept;
252+ *  MAX_CLUSTER_COUNT_LIMIT. acceptable_iters controls how many linearization optimization
253+ *  steps will be performed before it is considered to be of acceptable quality. */

instagibbs commented at 8:04 pm on July 10, 2025:

0* steps will be performed per cluster before it is considered to be of acceptable quality. */

sipa commented at 3:39 am on July 11, 2025:

Done.

in src/txgraph.cpp:2511 in 18e40b6cd5 outdated

2508+        auto& queue = clusterset.m_clusters[int(QualityLevel::ACCEPTABLE)];
2509+        while (!queue.empty()) {
2510+            // Randomize the order in which we process, so that if the first cluster somehow needs
2511+            // more work than what iters allows, we don't keep spending it on the same one.
2512+            auto pos = m_rng.randrange<size_t>(queue.size());
2513+            auto [cost, optimal] = queue[pos].get()->Relinearize(*this, iters - iters_done);

instagibbs commented at 8:10 pm on July 10, 2025:

0            Assume(iters >= iters_done);
1            auto [cost, optimal] = queue[pos].get()->Relinearize(*this, iters - iters_done);

sipa commented at 3:38 am on July 11, 2025:

Done.

in src/test/fuzz/txgraph.cpp:733 in 18e40b6cd5 outdated

728@@ -721,7 +729,18 @@ FUZZ_TARGET(txgraph)
729                 break;
730             } else if (command-- == 0) {
731                 // DoWork.
732-                real->DoWork();
733+                uint64_t iters = provider.ConsumeIntegralInRange<uint64_t>(0, alt ? 10000 : 255);
734+                if (real->DoWork(iters)) {

instagibbs commented at 8:26 pm on July 10, 2025:

would it be possible to assert that this must return true if the value is high enough since we have a limited txgraph size?

sipa commented at 3:38 am on July 11, 2025:

Will think about. I haven’t done this yet, so leaving open.

sipa commented at 7:37 pm on July 13, 2025:

Done. I’ve factored out the logic to count the number of iterations required for optimal linearization, and then used that in the txgraph simulation when DoWork() returns false.

It turned out that the implemented logic so far made this actually very hard to test for. If the remaining iterations budget in DoWork dropped below m_acceptable_iterations, but there were NEEDS_RELINEARIZE clusters left, the function would return false immediately. This would happen even when the remaining budget was actually sufficient to make everything optimal.

Thus, the DoWork overhaul commit has been changed fairly substantially (and, I think, simplified a bit).

instagibbs commented at 1:10 pm on July 14, 2025:

The new logic means that you may not have enough to make it considered ACCEPTABLE BUT you may end up with OPTIMAL anyways and are able to continue until you hit a difficult cluster. In practice this may end up with a lot more clusters known as optimal when their topology is simple.

Might be worth a comment on the “if (!improved) return false;” line

sipa commented at 2:45 pm on July 14, 2025:

I have added some additional comments.

instagibbs commented at 8:28 pm on July 10, 2025: member

review through 18e40b6cd5a11e23b7a0654bdd12f08e610ce980

no big concerns on first pass

in src/txgraph.cpp:1119 in 3e49d0a908 outdated

1107@@ -1108,6 +1108,9 @@ void Cluster::ApplyDependencies(TxGraphImpl& graph, std::span<std::pair<GraphInd
1108     // linearization, and post-linearize it to fix up the worst problems with it.
1109     FixLinearization(m_depgraph, m_linearization);
1110     PostLinearize(m_depgraph, m_linearization);
1111+    if (IsAcceptable()) {

instagibbs commented at 8:33 pm on July 10, 2025:

double-checking: NEEDS_RELINEARIZE, ACCEPTABLE and OPTIMAL are the only possible cases at this point, right?

sipa commented at 7:37 pm on July 13, 2025:

Indeed. Added some Assumes to that effect.

sipa force-pushed on Jul 11, 2025

sipa force-pushed on Jul 13, 2025

DrahtBot added the label CI failed on Jul 13, 2025

DrahtBot commented at 6:43 pm on July 13, 2025: contributor

🚧 At least one of the CI tasks failed. Task previous releases, depends DEBUG: https://github.com/bitcoin/bitcoin/runs/45882550988 LLM reason (✨ experimental): The CI failure is caused by a compilation error due to warnings being treated as errors, specifically an unused function that is being flagged and preventing compilation from succeeding.

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

sipa force-pushed on Jul 13, 2025

DrahtBot removed the label CI failed on Jul 13, 2025

in src/txgraph.cpp:193 in 04778c5b25 outdated

188@@ -189,8 +189,9 @@ class Cluster
189     void Merge(TxGraphImpl& graph, Cluster& cluster) noexcept;
190     /** Given a span of (parent, child) pairs that all belong to this Cluster, apply them. */
191     void ApplyDependencies(TxGraphImpl& graph, std::span<std::pair<GraphIndex, GraphIndex>> to_apply) noexcept;
192-    /** Improve the linearization of this Cluster. Returns how much work was performed. */
193-    uint64_t Relinearize(TxGraphImpl& graph, uint64_t max_iters) noexcept;
194+    /** Improve the linearization of this Cluster. Returns how much work was performed and whether
195+     *  the Cluster has changed quality now. */

instagibbs commented at 12:54 pm on July 14, 2025:

s/quality/QualityLevel/

sipa commented at 2:45 pm on July 14, 2025:

Done.

in src/txgraph.h:98 in 04778c5b25 outdated

 93@@ -94,9 +94,10 @@ class TxGraph
 94     virtual void SetTransactionFee(const Ref& arg, int64_t fee) noexcept = 0;
 95 
 96     /** TxGraph is internally lazy, and will not compute many things until they are needed.
 97-     *  Calling DoWork will compute everything now, so that future operations are fast. This can be
 98-     *  invoked while oversized. */
 99-    virtual void DoWork() noexcept = 0;
100+     *  Calling DoWork will perform some work now (controlled by iters) so that future operations
101+     *  are fast, if there is any. Returns whether all work is done. This can be invoked while

instagibbs commented at 1:21 pm on July 14, 2025:

It will not “expend” iterations on oversized levels, so it can return “true” when doing no work. I think that may be surprising.

sipa commented at 2:45 pm on July 14, 2025:

Added some comments to clarify.

instagibbs commented at 1:24 pm on July 14, 2025: member

reviewed via git range-diff master 18e40b6cd5a11e23b7a0654bdd12f08e610ce980 4788c87f7c119f35c551da05e7c43bd5be51420c

suggested changes were included, including making sure DoWork is reaching optimality when expected

txgraph: reset quality when merging clusters (bugfix) fad0eb091e

txgraph: 1-or-2-tx split-off clusters are optimal (optimization) 6ba316eaa0

txgraph: track amount of work done in linearization (preparation) cfe9958852

txgraph: make number of acceptable iterations configurable (feature) e96b00d99e

txgraph: add work limit to DoWork(), try optimal (feature)

This adds an `iters` parameter to DoWork(), which controls how much work it is
allowed to do right now.

Additionally, DoWork() won't stop at just getting everything ACCEPTABLE, but if
there is work budget left, will also attempt to get every cluster linearized
optimally.

f3c2fc867f

txgraph: check that DoWork finds optimal if given high budget (tests) 62ed1f92ef

sipa force-pushed on Jul 14, 2025

sipa commented at 2:46 pm on July 14, 2025: member

Addressed comments, and also made some further simplifications (dropped the returning of amount of work performed by MakeAcceptable and MakeAllAcceptable in the preparation commit, as the new DoWork() doesn’t use them anymore but invokes Cluster::Relinearize directly).

instagibbs approved

instagibbs commented at 4:35 pm on July 14, 2025: member

ACK 62ed1f92efff42bc79c50935e6dbd9da4e072020

ismaelsadeeq commented at 10:11 pm on July 14, 2025: member

Concept ACK. The test added in f3c2fc867fc4332dfed0a3766997433e1676dbe3 fails without the bugfix in the first commit.

I notice DoWork is unused in #28676, so this PR is not a blocker for the project?

in src/txgraph.h:100 in f3c2fc867f outdated

 98-     *  invoked while oversized. */
 99-    virtual void DoWork() noexcept = 0;
100+     *  Calling DoWork will perform some work now (controlled by iters) so that future operations
101+     *  are fast, if there is any. Returns whether all currently-available work is done. This can
102+     *  be invoked while oversized, but oversized graphs will be skipped by this call. */
103+    virtual bool DoWork(uint64_t iters) noexcept = 0;

ismaelsadeeq commented at 10:17 pm on July 14, 2025:

nit: maximum_iterations would be clearer than iters?

This is nice the mempool can decide what to pass to DoWork by converting the time it can spend into iterations.

instagibbs commented at 1:45 pm on July 15, 2025:

max_iters suffices, if you end up touching the PR

ismaelsadeeq commented at 4:08 pm on July 15, 2025:

yeah will make internal comments of DoWork easier to follow as well.

in src/txgraph.cpp:981 in 6ba316eaa0 outdated

977@@ -978,10 +978,11 @@ bool Cluster::Split(TxGraphImpl& graph) noexcept
978     // Iterate over the connected components of this Cluster's m_depgraph.
979     while (todo.Any()) {
980         auto component = m_depgraph.FindConnectedComponent(todo);
981+        auto split_quality = component.Count() <= 2 ? QualityLevel::OPTIMAL : new_quality;

ismaelsadeeq commented at 3:00 pm on July 15, 2025:

Why size two? Isn’t this supposed to be one? That is, individual splits are already optimal, but any component with size greater than one needs to be linearized, no?

instagibbs commented at 4:17 pm on July 15, 2025:

It’s already will be in a topologically valid ordering, and there’s only a single ordering possible in a connected component of size 2, thus optimal.

ismaelsadeeq commented at 4:22 pm on July 15, 2025:

Makes sense, I overlooked that even when parent pays for it self the ordering is the same.

sipa commented at 5:07 pm on July 15, 2025:

Indeed. With a PostLinearize() added, I think we can even go to 3, but that seems like only a very marginal benefit anyway, and be much less obvious.

in src/txgraph.cpp:1673 in f3c2fc867f outdated

1670@@ -1670,11 +1671,17 @@ uint64_t Cluster::Relinearize(TxGraphImpl& graph, uint64_t max_iters) noexcept
1671     // Update the linearization.
1672     m_linearization = std::move(linearization);
1673     // Update the Cluster's quality.

ismaelsadeeq commented at 3:32 pm on July 15, 2025:

0    // Update the Cluster's quality if it has improved.

in src/test/fuzz/txgraph.cpp:477 in f3c2fc867f outdated

473@@ -467,6 +474,7 @@ FUZZ_TARGET(txgraph)
474                     if (top_sim.graph.Ancestors(pos_par)[pos_chl]) break;
475                 }
476                 top_sim.AddDependency(par, chl);
477+                top_sim.real_is_optimal = false;

ismaelsadeeq commented at 3:42 pm on July 15, 2025:

this is a duplication because AddDependency already made real_is_optimal to be false

ismaelsadeeq commented at 4:07 pm on July 15, 2025: member

Code review ACK 62ed1f92efff42bc79c50935e6dbd9da4e072020

Just a question and some nits.

sipa commented at 5:08 pm on July 15, 2025: member

Will address the outstanding comments if I touch up.

in src/txgraph.cpp:1678 in f3c2fc867f outdated

1675-    graph.SetClusterQuality(m_level, m_quality, m_setindex, new_quality);
1676+    bool improved = false;
1677+    if (optimal) {
1678+        graph.SetClusterQuality(m_level, m_quality, m_setindex, QualityLevel::OPTIMAL);
1679+        improved = true;
1680+    } else if (max_iters >= graph.m_acceptable_iters && !IsAcceptable()) {

glozow commented at 2:36 pm on July 18, 2025:

I was confused by why we use max_iters instead of cost to gauge whether we did enough work to make the cluster acceptable (intuitively, we care about actual work done rather than the upper bound given). But when optimal is not true that means we ~used up the iters available (or determined it wasn’t enough).

Alternatively I thought maybe max_iters == graph.m_acceptable_iters is used as a way to tell that we’re still in the “make things acceptable” stage and haven’t yet run out of budget.

in src/txgraph.cpp:2523 in f3c2fc867f outdated

2523+                iters_done += cost;
2524+                // If no improvement was made to the Cluster, it means we've essentially run out of
2525+                // budget. Even though it may be the case that iters_done < iters still, the
2526+                // linearizer decided there wasn't enough budget left to attempt anything with.
2527+                // To avoid an infinite loop that keeps trying clusters with minuscule budgets,
2528+                // stop here too.

glozow commented at 3:47 pm on July 18, 2025:

To check my understanding of the improved logic:

We have improved = true when either (1) we took the cluster from any non-optimal state -> optimal regardless of iters usage or (2) we are working on a non-acceptable cluster and had m_acceptable_iters to spend.
That means improved = false if we didn’t have enough iters to do m_acceptable_iters on a non-acceptable cluster (ran out of iters), or we have moved on to acceptable clusters but didn’t have enough to make it optimal.
It’s possible that we have some iters left and it would be enough to linearize e.g. a smaller cluster that remains, but we exit as soon as we hit a cluster for which our budget is insufficient.

glozow commented at 3:49 pm on July 18, 2025: member

ACK 62ed1f92efff42bc79c50935e6dbd9da4e072020

glozow merged this on Jul 29, 2025

glozow closed this on Jul 29, 2025

cluster mempool: add TxGraph work controls #32263

Code Coverage & Benchmarks

Reviews

Conflicts

LLM Linter (✨ experimental)