cluster mempool: optimized candidate search

sipa commented at 2:16 am on June 14, 2024: member

Part of cluster mempool: #30289

Depends on #30126, and was split off from it.

This improves the candidate search algorithm introduced in the previous PR with a variety of optimizations.

The resulting search algorithm largely follows Section 2 of How to linearize your cluster, though with a few changes:

Connected component analysis is performed inside the search algorithm (creating initial work items per component for each candidate), rather than once at a higher level. This duplicates some work but is significantly simpler in implementation.
No ancestor-set based presplitting inside the search is performed; instead, the best value is initialized with the best topologically valid set known to the LIMO algorithm before search starts: the better one out of the highest-feerate remaining ancestor set, and the highest-feerate prefix of remaining transactions in old_linearization.
Work items are represented using an included set inc and an undefined set und, rather than included and excluded.
Potential sets pot are not computed for work items with empty inc.

At a high level, the only missing optimization from that post is bottleneck analysis; my thinking is that it only really helps with clusters that are already relatively cheap to linearize (doing so would need to be done at a higher level, not inside the search algorithm).

Overview of the impact of each commit here on linearize performance:

clusterlin bench: have low/high iter benchmarks instead of per-iter: no impact
separate initial search entries per component (optimization): reduce iterations, increase start-up cost
add reordering support for DepGraph: no impact
use feerate-sorted depgraph in SearchCandidateFinder: typically reduce iterations, increase start-up cost
track upper bound potential set for work items: reduce iterations, increase cost per iteration
reduce computation of unnecessary pot sets: reduce cost per iteration
include topological pot subsets automatically: reduce iterations, increase cost per iteration
improve heuristic to decide split transaction: typically reduce iterations, increase cost per iteration
only start/use search when enough iterations left: just account for start-up cost as equivalent iterations

DrahtBot commented at 2:16 am on June 14, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	instagibbs, glozow, sdaftuar
Stale ACK	ismaelsadeeq

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30857 (cluster mempool: extend DepGraph functionality by sipa)
#30605 (Cluster linearization: separate tests from tests-of-tests by sipa)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

sipa renamed this:
~~Low-level cluster linearization code: optimizations~~
cluster mempool: optimized candidate search
on Jun 14, 2024

sipa added the label Mempool on Jun 14, 2024

sipa force-pushed on Jul 2, 2024

sipa force-pushed on Jul 8, 2024

sipa force-pushed on Jul 10, 2024

DrahtBot added the label CI failed on Jul 10, 2024

DrahtBot commented at 10:46 pm on July 10, 2024: contributor

🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

Leave a comment here, if you need help tracking down a confusing failure.

Debug: https://github.com/bitcoin/bitcoin/runs/27292784397

sipa force-pushed on Jul 11, 2024

DrahtBot removed the label CI failed on Jul 11, 2024

sipa force-pushed on Jul 17, 2024

sipa force-pushed on Jul 19, 2024

sipa force-pushed on Jul 24, 2024

glozow referenced this in commit 37bd70a225 on Jul 26, 2024

DrahtBot added the label Needs rebase on Jul 26, 2024

sipa force-pushed on Jul 26, 2024

DrahtBot removed the label Needs rebase on Jul 26, 2024

in src/cluster_linearize.h:859 in a3ba9c1714 outdated

827+            // - A work item is (potentially) added with that transaction plus its remaining
828+            //   ancestors included (added to the inc set).
829+            //
830+            // To decide what to split, pick among the undecided ancestors of the highest
831+            // individual feerate transaction among the undecided ones the one which reduces the
832+            // search space most:

glozow commented at 10:30 am on July 30, 2024:

nit in a3ba9c1714b7d74fcb08609500170f8794378fe5, I found this hard to parse

0            // To decide what to split, consider the undecided ancestors of the highest
1            // individual feerate undecided transaction. Pick the one which reduces the
2            // search space most:

glozow commented at 10:31 am on July 30, 2024:

nit in commit message: s/emperically/empirically

sipa commented at 11:55 am on August 5, 2024:

Fixed.

in src/cluster_linearize.h:185 in 9e154b0eea outdated

180+        if (todo.None()) return todo;
181+        auto first = todo.First();
182+        SetType ret = Descendants(first) | Ancestors(first);
183+        ret &= todo;
184+        SetType to_add = ret;
185+        to_add.Reset(first);

glozow commented at 11:09 am on July 30, 2024:

9e154b0eea2c1409a38cc02e742e1e2320947495

Would it be cleaner to not do the first iteration outside the loop?

0        SetType ret;
1        SetType to_add = Singleton(todo.First());

sipa commented at 10:14 pm on July 30, 2024:

It’s certainly cleaner. It’s perhaps also a bit slower, but probably not measurably so. Done.

in src/cluster_linearize.h:198 in 9e154b0eea outdated

170@@ -171,6 +171,45 @@ class DepGraph
171         return ret;
172     }
173 
174+    /** Find some connected component within the subset "todo" of this graph.

glozow commented at 2:30 pm on July 30, 2024:

nit: maybe add “Specifically, all of the transactions connected to the first element of todo.”

sipa commented at 11:55 am on August 5, 2024:

Done (in #30285 already).

in src/cluster_linearize.h:726 in fbb988a612 outdated

720@@ -711,8 +721,11 @@ class SearchCandidateFinder
721              *  pot_feerate. It starts off equal to inc. */
722             auto pot = inc;
723             if (!inc.feerate.IsEmpty()) {
724-                // Add entries to pot.
725-                for (auto pos : und) {
726+                // Add entries to pot. We iterate over all undecided transactions whose feerate is
727+                // higher than best. While undecided transactions of lower feerate may improve pot
728+                // still, if they do, the resulting pot feerate cannot possibly exceed best's (and

glozow commented at 3:26 pm on July 30, 2024:

fbb988a6128b859b204299357cb6bb4133f83e25, are there some extra words?

0                // higher than best. While undecided transactions of lower feerate may improve pot,
1                // the resulting pot feerate cannot possibly exceed best's (and

sipa commented at 11:55 am on August 5, 2024:

Done.

in src/cluster_linearize.h:977 in 3eb993466b outdated

977     linearization.reserve(depgraph.TxCount());
978     bool optimal = true;
979 
980+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations.
981+    // If we don't have that many, don't start it.
982+    uint64_t start_iterations = (uint64_t{depgraph.TxCount()} * depgraph.TxCount() + 63) / 64;

glozow commented at 3:49 pm on July 30, 2024:

Is the idea that, if we have fewer than n^2/64, we can pretty much expect SearchCandidateFinder to not make any meaningful progress?

sipa commented at 10:17 pm on July 30, 2024:

No, it’s trying to use “iterations” as a generic metric for computation time, and benchmarks show that n^2/64 times the time per iteration is roughly what just constructing the search object can take.

The real purpose here is that if you pass in iterations=0, the code won’t bother performing a relatively costly initialization of a search object if it’s not going to be used. But that’s then generalized a bit to also skip it if doing so would cost more than the time spent proportionally on the number of iterations you asked for.

glozow commented at 10:56 am on July 31, 2024:

Ahh I see, thanks

glozow commented at 3:50 pm on July 30, 2024: member

Did an initial read-through

sipa force-pushed on Jul 30, 2024

sipa force-pushed on Aug 4, 2024

sipa commented at 6:00 pm on August 4, 2024: member

Rebased on top of #30285 as it looks like that is closer.

sipa force-pushed on Aug 5, 2024

in src/test/fuzz/cluster_linearize.cpp:551 in 5fddd456b1 outdated

547@@ -513,9 +548,11 @@ FUZZ_TARGET(clusterlin_search_finder)
548             assert(found.transactions.IsSupersetOf(depgraph.Ancestors(i) & todo));
549         }
550 
551-        // At most 2^N-1 iterations can be required: the number of non-empty subsets a graph with N
552-        // transactions has.
553-        assert(iterations_done <= ((uint64_t{1} << todo.Count()) - 1));
554+        // At most 2^(N-1) iterations can be required: the maximum number of topological subsets a

instagibbs commented at 7:08 pm on August 5, 2024:

non-empty topological subsets?

sipa commented at 11:34 pm on August 7, 2024:

Indeed, done.

in src/cluster_linearize.h:619 in 4e7ff15ce4 outdated

621+            m_original_to_sorted[m_sorted_to_original[i]] = i;
622+        }
623+        // Compute reordered dependency graph.
624+        m_depgraph = DepGraph(depgraph, m_original_to_sorted);
625+        // Set todo to the entire graph.
626+        m_todo = SetType::Fill(depgraph.TxCount());

instagibbs commented at 2:41 pm on August 6, 2024:

move of m_todo instantiation seems like a no-op?

sipa commented at 11:34 pm on August 7, 2024:

Indeed, fixed.

in src/cluster_linearize.h:1008 in 400a72b4ab outdated

1016+                // Invoke bounded search to update best, with up to half of our remaining
1017+                // iterations as limit.
1018+                iterations_left -= base_iterations;
1019+                max_iterations_now = (iterations_left + 1) / 2;
1020+                std::tie(best, iterations_done_now) = src_finder->FindCandidateSet(max_iterations_now, best);
1021+                iterations_left -= iterations_done_now;

glozow commented at 2:59 pm on August 6, 2024:

Since we’re subtracting both base_iterations (the N/4 estimate) and iterations_done_now from iterations_left, is the idea that base_iterations incorporates the computational overhead beyond the splitting operations?

sipa commented at 5:30 pm on August 6, 2024:

Indeed; largely due to the connected-component splitting. I’ve expanded the comments a bit.

in src/cluster_linearize.h:849 in a88207c4ee outdated

836+            //
837+            // To decide what to split on, consider the undecided ancestors of the highest
838+            // individual feerate undecided transaction. Pick the one which reduces the search space
839+            // most. Let I(t) be the size of the undecided set after including t, and E(t) the size
840+            // of the undecided set after excluding t. Then choose the split transaction t such
841+            // that 2^I(t) + 2^E(t) is minimal, tie-breaking by highest individual feerate for t.

glozow commented at 3:19 pm on August 6, 2024:

Wow, cool way to convey pair comparison

instagibbs commented at 5:51 pm on August 6, 2024:

I would have appreciated slightly more pedantic comments demonstrating how the computation matches the comments :sweat:

sipa commented at 11:37 pm on August 7, 2024:

@instagibbs I’ve expanded the comments to explain why the (max,min) sort order is the same as (2^max + 2^min).

in src/cluster_linearize.h:125 in 4e7ff15ce4 outdated

117@@ -118,6 +118,28 @@ class DepGraph
118         }
119     }
120 
121+    /** Construct a DepGraph object given another DepGraph and a mapping from old to new.
122+     *
123+     * Complexity: O(N^2) where N=depgraph.TxCount().
124+     */
125+    DepGraph(const DepGraph<SetType>& depgraph, Span<const ClusterIndex> mapping) noexcept : entries(depgraph.TxCount())

instagibbs commented at 3:33 pm on August 6, 2024:

I think this can be used with DepGraphFormatter during Unser with reversed reordering

sipa commented at 11:35 pm on August 7, 2024:

Nice, indeed! I needed to make some change to the unserializer to track in reordering the reverse mapping, which I’ve for now done in the same commit. Happy to split that out if useful.

instagibbs commented at 2:02 pm on August 8, 2024:

LGTM, I tested by manually making the map and reversing it right at the end, which also seemed to work?

sipa commented at 1:47 am on August 9, 2024:

I’ve simplified the change a bit (now much closer to older deserialization code) and moved it to a separate commit (together with the introduction of the DepGraph reordering constructor).

in src/cluster_linearize.h:710 in 5fddd456b1 outdated

627+            auto component = m_depgraph.FindConnectedComponent(to_cover);
628+            to_cover -= component;
629+            // If best is not provided, set it to the first component, so that during the work
630+            // processing loop below, and during the add_fn/split_fn calls, we do not need to deal
631+            // with the best=empty case.
632+            if (best.feerate.IsEmpty()) best = SetInfo(m_depgraph, component);

glozow commented at 3:44 pm on August 6, 2024:

Would it make sense to use if (best.feerate.IsEmpty() || best.feerate << m_depgraph.feerate(component)? We set imp based on this afterward, and it seems like it could help it start smaller.

sipa commented at 5:31 pm on August 6, 2024:

Perhaps, but due the BFS search order, the normal iteration code should very quickly try splitting the whole-component work items anyway. The goal here is really just making sure best is not empty, so branches aren’t needed to deal with that case.

in src/cluster_linearize.h:704 in 4e63e0a951 outdated

700@@ -688,7 +701,7 @@ class SearchCandidateFinder
701             // processing loop below, and during the add_fn/split_fn calls, we do not need to deal
702             // with the best=empty case.
703             if (best.feerate.IsEmpty()) best = SetInfo(m_depgraph, component);
704-            queue.emplace_back(SetInfo<SetType>{}, std::move(component));
705+            queue.emplace_back(SetInfo<SetType>{}, std::move(component), FeeFrac{});

instagibbs commented at 4:02 pm on August 6, 2024:

I do realize the argument changes later, so maybe this just on the later commit

0            queue.emplace_back(/*inc=*/SetInfo<SetType>{}, /*und=*/std::move(component), /*pot_feerate=*/FeeFrac{});

sipa commented at 2:59 am on August 8, 2024:

Done.

glozow commented at 4:26 pm on August 6, 2024: member

code review ACK. I haven’t reviewed the fuzzers yet, doing that next. Tested by writing some benches that run a lot faster with the optimizations.

sipa force-pushed on Aug 6, 2024

in src/cluster_linearize.h:740 in 29cac3e85b outdated

728@@ -723,19 +729,25 @@ class SearchCandidateFinder
729          *
730          * - inc: the "inc" value for the new work item (must be topological).
731          * - und: the "und" value for the new work item ((inc | und) must be topological).
732+         * - pot: a subset of the "pot" value for the new work item (but a superset of inc).
733+         *        It does not need to be the full pot value; missing pot transactions will be added
734+         *        to it by add_fn.
735+         * - grow_inc: whether to attempt moving transactions from und to inc, if it can be proven

instagibbs commented at 7:45 pm on August 6, 2024:

in 29cac3e85b23f1264121cbd50e4a09755b0c8352

I think this commit may benefit from being two smaller commits, one that demonstrates the caching, and a second that demonstrates using it to skip the splitting computation?

I do not understand what it’s doing, as-is.

sipa commented at 2:58 am on August 8, 2024:

I have split the commit in two, with additional comments, and I renamed grow_inc to reconsider_pot.

Roughly what is going on is that:

In “include topological pot subsets automatically”, all of every new work item’s pot set is searched for topologically-valid subsets (which can be moved to inc instead)
In “avoid recomputing potential set on every split” (before my last push), only the part of the new item’s pot set that wasn’t in the parent work item’s pot set is searched for topologically-valid subsets unless grow_inc was set, in which case all of pot is searched.

I have now split this second commit into:

In “avoid recomputing potential set on every split”, still all of the new item’s pot set is searched.
In “avoid jump ahead in unmodified pot sets”, only the part of the new item’s pot set that wasn’t in the parent work item’s set is searched unless reconsider_pot is set.

instagibbs commented at 2:19 pm on August 8, 2024:

I think this has helped a lot, “clusterlin: avoid recomputing potential set on every split (optimization)” now makes sense to me. Lots of interlocking parts, so having small bite-sizes helps.

Can you motivate why reconsider_pot is set as true on inclusion of anc and not on exclude?

When anc is included, it’s included in both inc and pot, then:

if (reconsider_pot) consider_inc = pot.transactions - inc.transactions;

so no part of anc will be in consider_inc.

sipa commented at 3:26 pm on August 8, 2024:

@instagibbs @glozow Let me try giving an explanation here. If this helps you, I’ll rework it into a comment in the code.

Think of the set of transactions in every work items as falling within one of four classes (roughly from “more likely to be included” to “less likely to be included”):

I: definitely included transactions, (= item.inc).
U: undecided transactions that may or may not end up being included (= item.und), subdided in:
- UH: possibly included, and including any of them will definitely increase the included set’s feerate (even when other lower-fee UH transactions are included too) (=item.pot - item.inc, =item.und & item.pot).
- UL: possibly included, but including it is not guaranteed to increase the included set’s feerate (=item.und - item.pot)
E: definitely excluded transactions, (= todo - item.inc - item.und).

In other words, item.pot is I+UH and item.und is U=UH+UL. Further, I is always topological, as is I+U. I+UH however may be non-topological.

Now let’s consider what happens in an inclusion branch (i.e., the call to add_fn for adding the split transaction and its ancestors):

We start with a parent object which has a given I, UH, UL, E (assume I is non-empty; I=empty is a special case that has separate logic in several places).
A split transaction t is chosen from U=UH+UL (=item.und).
anc(t) are moved from UH+UL to I (this is done inside split_fn, in the invocation of add_fn).
anc(t) will always include at least one UL transaction (*), and because of that the feerate of I may have gone down. If so, it is possible that more transactions can be moved from UL to UH. This is done inside add_fn, in the for (auto pos : (imp & und) - pot.transactions) { loop.
Because of the changes to I and UH above, it is possible that there is now a new topologically-valid subset of I+UH that isn’t already entirely in I, so it can be moved to I (jump ahead). This can be due to UH entries added in (4) or due to I entries added in (3).. This is done in add_fn, in the for (auto pos : consider_inc) { loop.
If I now has the best feerate ever, remember it.
Construct a new work item with the resulting I,UH,UL,E, unless UH+UL is empty (which is always the case if UL is empty, **) because then no more splits can take place.

Compare this to what happens in an exclusion branch:

Same
Same
desc(t) are moved from UH+UL to E and I does not change.
desc(t) may include a UH transaction, and because of that it is possible that more transactions can be moved from UL to UH.
Because of the above, it is possible that there are new topologically-valid subsets of I+UH now that can be moved to I. This can only be due to UH entries added in (4).
Same
Same

The bold parts correspond to reconsider_pot. In the inclusion branch, UL transactions moved to UH may result in existing UH transactions becoming part of a topologically-valid subset, so UH needs to be reconsidered. In an exclusion branch I does not change in (3), and so only new UH transactions added in (4) need to be considered for moving to I. Now, if that happens (something moves from UH to I due to jump ahead), it may be that what was in UH before becomes eligible to move as well, but doing so does not seem to affect the performance or number of iterations the algorithm needs.

(*) The jump ahead is exactly what guarantees that anc(t) always includes a UL transaction. If anc(t) was entirely within UH, it would be topologically valid, and thus the parent work item’s jump ahead step would have moved it to I already.

(**) Same reason: if UL is empty, then the entirety if I+UH is topologically valid, and thus would have been moved to I entirely in the jump ahead. Thus, UL empty implies UH is also empty.

glozow commented at 11:40 am on August 12, 2024:

@sipa thanks this is really helpful, particularly this part:

anc(t) will always include at least one UL transaction () () The jump ahead is exactly what guarantees that anc(t) always includes a UL transaction. If anc(t) was entirely within UH, it would be topologically valid, and thus the parent work item’s jump ahead step would have moved it to I already.

However I’m still having trouble with this part:

Compare this to what happens in an exclusion branch … 4. desc(t) may include a UH transaction, and because of that it is possible that more transactions can be moved from UL to UH. 5. Because of the above, it is possible that there are new topologically-valid subsets of I+UH now that can be moved to I. This can only be due to UH entries added in (4).

It seems true that only the entries added in (4) can cause jump-aheads. However, I’m not convinced that existing pot transactions cannot be jump-aheadable. For instance, it’s possible that we added a new tx that is the parent of a tx that was already in pot.

I added an assert(anc.Overlaps(elem.und - elem.pot.transactions)) in split_fn, right before we call add_fn for the inclusion branch in order to check that “jump ahead guarantees that anc(t) always includes a UL transaction”. The assertion hit, and essentially what I saw was:

Going into a split, we had a really high pot feerate, which included tx C.
For the exclusion branch, desc(t) included lots of high feerate transactions which lowered the pot feerate. C is still in pot. At the start of this add_fn, we add tx P to pot, since its feerate is now high enough.
P is part of consider_inc, and is added to inc in the jump-ahead.
C, which a child of P, would be good because anc(C) is a topological subset of pot, but is skipped because reconsider_pot=false
Later, we split on C. C is part of UH, not UL, which hits the assert. anc(C) = C which is in UH, but it wasn’t added to inc during a jump-ahead. This is because C was added to UH when its ancestor P was considered UL, and reconsider_pot=false when P was added to UH.

Does this make sense? Or maybe my assert isn’t the right interpretation of this? I don’t think this changes what FindCandidateSet outputs, but it means I still don’t quite understand the “avoid jump ahead in unmodified pot sets” commit :(

(fwiw the fuzz I crashed on is echo "AQBJCAgAIysjIyMiIyMjIyMjIyMjIyM6IyMIACMrIyMjIiMjIyMjIyMjIyMjIyMjCBlE+xEvRG9v AggAbwAR" | base 64 -d)

EDIT: ah sorry, I just reread this part

it may be that what was in UH before becomes eligible to move as well, but doing so does not seem to affect the performance or number of iterations the algorithm needs.

so I see that this is expected. Perhaps including this in the commit message would help, as I was confused by “In exclusion branches (where inc does not change, and transactions are only removed from und), the jump ahead optimization cannot find anything to add to inc that wasn’t possible already.”

sipa commented at 1:04 pm on August 12, 2024:

@glozow Yeah I only re-discovered (and re-tested…) this fact (that after moving from new-UH to I in an exclusion branch, that former-UH also become eligible to move to I) while writing this. I don’t have a good explanation for why it doesn’t affect the further run of the algorithm though.

glozow commented at 9:03 am on August 13, 2024:

How significant do we expect the reconsider_pot optimization to be? Locally my LinearizePerIter*TxWorstCase bench doesn’t seem to do better after c68d26a0e97.

sipa commented at 2:36 pm on August 15, 2024:

@glozow Good question, I’ve pushed a change that allows for perhaps a bit more accurate measurements (run the 5000 and 15000 iteration benchmark, and subtract). No other changes.

glozow commented at 9:50 am on August 16, 2024:

Here’s what I get on 255571339e7:

 0|               ns/op |                op/s |    err% |     total | benchmark
 1|--------------------:|--------------------:|--------:|----------:|:----------
 2|          724,714.00 |            1,379.85 |    2.4% |      0.01 | `Linearize32TxWorstCase15000Iters`
 3|          245,712.75 |            4,069.79 |    0.3% |      0.01 | `Linearize32TxWorstCase5000Iters`
 4|          688,589.50 |            1,452.24 |    5.2% |      0.01 | :wavy_dash: `Linearize48TxWorstCase15000Iters` (Unstable with ~1.2 iters. Increase `minEpochIterations` to e.g. 12)
 5|          233,424.25 |            4,284.05 |    4.8% |      0.01 | `Linearize48TxWorstCase5000Iters`
 6|          662,959.00 |            1,508.39 |    1.7% |      0.01 | `Linearize64TxWorstCase15000Iters`
 7|          242,351.75 |            4,126.23 |    1.9% |      0.01 | `Linearize64TxWorstCase5000Iters`
 8|        1,051,925.00 |              950.64 |    4.3% |      0.01 | `Linearize75TxWorstCase15000Iters`
 9|          344,038.00 |            2,906.66 |    1.8% |      0.01 | `Linearize75TxWorstCase5000Iters`
10|        1,023,533.00 |              977.01 |    2.1% |      0.01 | `Linearize99TxWorstCase15000Iters`
11|          356,557.67 |            2,804.60 |    1.3% |      0.01 | `Linearize99TxWorstCase5000Iters`

on d04997871f5:

 0|               ns/op |                op/s |    err% |     total | benchmark
 1|--------------------:|--------------------:|--------:|----------:|:----------
 2|          705,210.00 |            1,418.02 |    4.4% |      0.01 | `Linearize32TxWorstCase15000Iters`
 3|          231,874.00 |            4,312.69 |    0.2% |      0.01 | `Linearize32TxWorstCase5000Iters`
 4|          664,206.00 |            1,505.56 |    4.7% |      0.01 | `Linearize48TxWorstCase15000Iters`
 5|          222,808.25 |            4,488.16 |    1.9% |      0.01 | `Linearize48TxWorstCase5000Iters`
 6|          627,877.00 |            1,592.67 |    0.2% |      0.01 | `Linearize64TxWorstCase15000Iters`
 7|          223,255.50 |            4,479.17 |    2.6% |      0.01 | `Linearize64TxWorstCase5000Iters`
 8|          993,733.00 |            1,006.31 |    3.8% |      0.01 | `Linearize75TxWorstCase15000Iters`
 9|          327,429.67 |            3,054.09 |    2.3% |      0.01 | `Linearize75TxWorstCase5000Iters`
10|          957,057.00 |            1,044.87 |    3.9% |      0.01 | `Linearize99TxWorstCase15000Iters`
11|          339,669.33 |            2,944.04 |    1.8% |      0.01 | `Linearize99TxWorstCase5000Iters`

Looks somewhat better but not much / close to the err%:

Benchmark	Table1 ns/op	Table2 ns/op	Percentage Speedup
`Linearize32TxWorstCase15000Iters`	724,714.00	705,210.00	2.70%
`Linearize32TxWorstCase5000Iters`	245,712.75	231,874.00	5.64%
`Linearize48TxWorstCase15000Iters`	688,589.50	664,206.00	3.54%
`Linearize48TxWorstCase5000Iters`	233,424.25	222,808.25	4.55%
`Linearize64TxWorstCase15000Iters`	662,959.00	627,877.00	5.30%
`Linearize64TxWorstCase5000Iters`	242,351.75	223,255.50	7.88%
`Linearize75TxWorstCase15000Iters`	1,051,925.00	993,733.00	5.54%
`Linearize75TxWorstCase5000Iters`	344,038.00	327,429.67	4.84%
`Linearize99TxWorstCase15000Iters`	1,023,533.00	957,057.00	6.49%
`Linearize99TxWorstCase5000Iters`	356,557.67	339,669.33	4.74%

in src/cluster_linearize.h:973 in 06ea1016ff outdated

976@@ -971,10 +977,19 @@ std::pair<std::vector<ClusterIndex>, bool> Linearize(const DepGraph<SetType>& de
977     std::vector<ClusterIndex> linearization;
978 
979     AncestorCandidateFinder anc_finder(depgraph);
980-    SearchCandidateFinder src_finder(depgraph, rng_seed);
981+    std::optional<SearchCandidateFinder<SetType>> src_finder;
982     linearization.reserve(depgraph.TxCount());
983     bool optimal = true;
984 
985+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations

instagibbs commented at 3:34 pm on August 7, 2024:

why 64?

sipa commented at 3:00 am on August 8, 2024:

Just a rough estimate based on benchmarks. Added a comment.

in src/cluster_linearize.h:998 in 06ea1016ff outdated

1008         uint64_t iterations_done_now = 0;
1009-        std::tie(best, iterations_done_now) = src_finder.FindCandidateSet(max_iterations_now, best);
1010-        iterations_left -= iterations_done_now;
1011+        uint64_t max_iterations_now = 0;
1012+        if (src_finder) {
1013+            // Treat the invocation of SearchCandidateFinder::FindCandidateSet() as costing N/4

instagibbs commented at 3:40 pm on August 7, 2024:

why 4?

sipa commented at 3:00 am on August 8, 2024:

Just a rough estimate based on benchmarks. Added a comment.

instagibbs commented at 3:44 pm on August 7, 2024: member

reviewed through https://github.com/bitcoin/bitcoin/pull/30286/commits/06ea1016ff40ebcf3e668bdb993a0747904cb40f

everything made sense to me except https://github.com/bitcoin/bitcoin/commit/29cac3e85b23f1264121cbd50e4a09755b0c8352

sipa force-pushed on Aug 7, 2024

sipa force-pushed on Aug 8, 2024

in src/cluster_linearize.h:977 in f46c0cc3d2 outdated

994 
995+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations
996+    // (largely due to the cost of constructing the internal sorted-by-feerate DepGraph inside
997+    // SearchCandidateFinder), a rough approximation based on benchmark. If we don't have that
998+    // many, don't start it.
999+    uint64_t start_iterations = (uint64_t{depgraph.TxCount()} * depgraph.TxCount() + 63) / 64;

gmaxwell commented at 9:47 am on August 8, 2024:

Do these adhoc initialization costs interact well with the with the upper bound 2^(k-1) and the empirical sqrt(2^k)+1 limits in the tests or should the bounds in the test be increased by these adhoc initialization costs?

sipa commented at 10:23 pm on August 8, 2024:

@gmaxwell I think you’re right that these ought to be added to the limits in the tests, though I don’t actually see any failures with a quick test. The sqrt(2^k)+1 limit is only hit for very small clusters (say, 3 transactions) though, for larger ones the worst I’ve seen is around 70%-80% of this presumed limit, which may explain not seeing failures.

It does make me wonder if perhaps Linearize shouldn’t bother with trying to do CPU time management, and should just expose a bool do_search option.

gmaxwell commented at 0:32 am on August 9, 2024:

Yeah, I assumed that the specific limit used in the tests right how happened to evade failing. I don’t have a strong opinion. The adjustments seemed fine to me but maybe just making some function for the adjustments and having the test use that to derive its test limits might make sense– so that further updates to those adhoc adjustments don’t accidentally break test limits that might be hit infrequently (or after changing the amount of work the test does).

sipa commented at 7:04 pm on August 16, 2024:

Ok, mystery solved. These adhoc initialization costs do not interact with the $\sqrt{2^k}+1$ rule, because that rule is at the per-candidate-set level, while the adhoc costs are applied at the linearization level. It does in theory interact with the derived $\sqrt{12 * 2^n}+n$ rule, but for $n \leq 3$ the linearization code is optimal without any search, and for $n \geq 4$, the formula is sufficiently overshooting to accomodate the ad-hoc costs. Still, I’ve changed the test to use precomputed limits instead.

in src/cluster_linearize.h:752 in f46c0cc3d2 outdated

771+                    // it. If not, we're done updating pot. This relies on the fact that
772+                    // m_depgraph, and thus the transactions iterated over, are in decreasing
773+                    // individual feerate order.
774+                    if (!(m_depgraph.FeeRate(pos) >> pot.feerate)) break;
775+                    pot.Set(m_depgraph, pos);
776+                    consider_inc.Set(pos);

glozow commented at 10:52 am on August 8, 2024:

in c68d26a0e97d7544076f00a467a8eca16658077a

If we’re meant to avoid jump ahead, should this be gated on reconsider_pot as well?

sipa commented at 2:33 pm on August 8, 2024:

No, we always want to consider transactions that were missing from pot, because those couldn’t have been considered for the parent work item. reconsider_pot only controls whether we should reconsider the ones already in pot.

glozow commented at 3:56 pm on August 9, 2024:

Oh of course, I was confusing myself :+1:

sipa force-pushed on Aug 9, 2024

sipa commented at 2:13 am on August 9, 2024: member

Unsure where this should live, but I have a Python reimplementation of the depgraph serialization code + mermaid formatter: https://gist.github.com/sipa/822f60db6608a26bb4aae75fd31bcb12

sipa force-pushed on Aug 15, 2024

in src/bench/cluster_linearize.cpp:206 in ce177d2600 outdated

207-static void LinearizePerIter48TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<48>>(48, bench); }
208-static void LinearizePerIter64TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<64>>(64, bench); }
209-static void LinearizePerIter75TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<75>>(75, bench); }
210-static void LinearizePerIter99TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<99>>(99, bench); }
211+static void Linearize16TxWorstCase50Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, 50); }
212+static void Linearize16TxWorstCase150Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, 150); }

glozow commented at 10:10 am on August 16, 2024:

CI is failing because it halts at 128 iters at 5709d828f36, it seems. Original iters given was 2^(16/2-1) = 128. However, sqrt(2^(n-1)) = sqrt(2^15) > 150

sipa commented at 3:28 pm on August 16, 2024:

Fixed.

glozow commented at 1:19 pm on August 16, 2024: member

fwiw, my bench results (I commented out assert(iters_performed == max_iters) since it fails)

git rebase -i --exec="make -j12 > /dev/null && src/bench/bench_bitcoin -filter=Linearize.*" HEAD~11 Output: bench_30286_all.txt

Reformatted as a csv: output_benchmarks_ns_op.csv

Displayed as a line graph (x-axis is commit, y-axis is ns/op):

sipa force-pushed on Aug 16, 2024

sipa commented at 6:56 pm on August 16, 2024: member

(Reposting at PR level instead of review level) @glozow Cool, building on that:

 0$ git rebase --exec='F=$(mktemp) && make -j -C src bench/bench_bitcoin >/dev/null && src/bench/bench_bitcoin -filter=Linearize64TxWorstCase.* -min-time=60000 -output-json=$F >/dev/null 2>/dev/null && NS=$(cat $F | jq "(.results[0][\"median(elapsed)\"] - .results[1][\"median(elapsed)\"]) / 10000 * 1000000000") && TITLE=$(git show -s --format="%s") && echo "$NS $TITLE"' HEAD~11 2>/dev/null
 111.60165442504903  clusterlin bench: have low/high iter benchmarks instead of per-iter
 211.583260381566241 clusterlin: separate initial search entries per component (optimization)
 311.53167576849342  clusterlin: add reordering support for DepGraph
 412.451696899512429 clusterlin: use feerate-sorted depgraph in SearchCandidateFinder
 528.3225308342194   clusterlin: track upper bound potential set for work items (optimization)
 628.0684870149178   clusterlin: reduce computation of unnecessary pot sets (optimization)
 733.4621797288421   clusterlin: include topological pot subsets automatically (optimization)
 842.1621546397433   clusterlin: improve heuristic to decide split transaction (optimization)
 946.8534839828272   clusterlin: avoid recomputing potential set on every split (optimization)
1046.167085876022306 clusterlin: avoid jump ahead in unmodified pot sets (optimization)
1145.5678434124763   clusterlin: only start/use search when enough iterations left

A general trend of increasing time per iteration is expected, as many changes are intended to reduce the number of iterations needed at the cost of increasing the per-iteration cost, but not all. I’m surprised at the lack of (significant) per-iteration time reduction for:

clusterlin: reduce computation of unnecessary pot sets (optimization)
clusterlin: avoid recomputing potential set on every split (optimization)
clusterlin: avoid jump ahead in unmodified pot sets (optimization)

It may be that these are optimizations that only help in the average case and not the worst case that’s benchmarked here (which is probably worth some performance reduction in the worst case, but not much) or don’t help at all. I’ll investigate a bit more, and if not significant, drop them from this PR.

gmaxwell commented at 11:07 pm on August 17, 2024: contributor

@sipa Though there may be other patterns which are a worst case even more worse than the particular lattice used for testing without those optimizations? If you’ve been searching for worst case examples for a long time with those optimizations in place you may have missed cases that would be worse without them. No?

An important thing to keep in mind that the ‘worst case’ used in the tests is speculative. You don’t know for sure that it’s actually a worse case under any version of the algorithm. It might be the worst case under some but not under others or it may never be a worst case (except for very small N which you’ve presumably tested exhaustively).

sipa force-pushed on Aug 19, 2024

sipa commented at 9:15 pm on August 19, 2024: member

I’m surprised at the lack of (significant) per-iteration time reduction for:

clusterlin: reduce computation of unnecessary pot sets (optimization)

clusterlin: avoid recomputing potential set on every split (optimization)

clusterlin: avoid jump ahead in unmodified pot sets (optimization)

I have dropped the 2nd and 3rd, due to inconclusive evidence that they help. The 1st one I have kept; my belief is that it in fact does not improve the worst case, but it is also has very low cost even when it doesn’t do anything. The 2nd and 3rd do have a cost if ineffective (as shown by @glozow’s and my benchmarks). Reasoning:

reduce computation of unnecessary pot sets: I don’t believe this affects the worst case. My conjecture is that worst cases are ones where knowing the best candidate set up front does not change the number of iterations requires (i.e., all search tree subtrees either fail immediately, or need to be fully explored because they appear to be potential best ones until the very end). In these subtrees, this optimization doesn’t do anything, because its only effect is reducing computation in work items that are doomed to fail soon after anyway, and if my conjecture is true, such branches don’t exist in the first place in worst case situations. Still, due to the low cost, I’ve left it in.
avoid recomputing potential set on every split: this optimization was based on a belief that a significant amount of time would be spent recomputing the same pot subset after every split. With some more investigation, (hard) clusters exist where this change is a net improvement and ones where it is a net degradation (found through a custom fuzz test that maximizes or minimizes the number of pot addition steps minus the cost of updating pot in split_fn, not an actual benchmark). The per-iteration potential gains seem higher than the per-iteration potential degrations, but the largest absolute losses are with close-to-worst clusters, while the largest absolute gains are with significantly simpler clusters, which is weak evidence that it may hurt the worst case more than it helps. I think this will need analysis with real-world clusters to see actual impact, which may be easier once the cluster mempool project is further along. I’ve dropped it for this reason.
avoid jump ahead in unmodified pot sets: does not make sense without the previous change, so I also dropped it. @gmaxwell

Though there may be other patterns which are a worst case even more worse than the particular lattice used for testing without those optimizations? If you’ve been searching for worst case examples for a long time with those optimizations in place you may have missed cases that would be worse without them. No?

Absolutely; even the apparent worst-case complexity $\mathcal{O}(\sqrt{2^n})$ is just a conjecture - it’s possible that clusters exist which have up to $\mathcal{O}(2^{n-1})$ complexity, and if so, a lot of empirical evidence that’s been used to guide this development would go out the window. But given that I came up with it in an attempt to reduce a pattern which I incorrectly though would be very common, there isn’t much justification to have it.

sipa force-pushed on Aug 19, 2024

in src/cluster_linearize.h:572 in a12ca7b8d9 outdated

570+    std::vector<ClusterIndex> m_sorted_to_original;
571+    /** m_original_to_sorted[i] is the sorted position original transaction position i has. */
572+    std::vector<ClusterIndex> m_original_to_sorted;
573+    /** Internal dependency graph for the cluster (with transactions in decreasing individual
574+     *  feerate order). */
575+    DepGraph<SetType> m_depgraph;

instagibbs commented at 3:02 pm on August 21, 2024:

a12ca7b8d99d868ff49e90d33b5e90c169849ccc

may want to consider something like m_sorted_depgraph

sipa commented at 6:48 pm on August 21, 2024:

Done.

in src/cluster_linearize.h:676 in 805f9122e5 outdated

673 
674             /** Construct a new work item. */
675-            WorkItem(SetInfo<SetType>&& i, SetType&& u) noexcept :
676-                inc(std::move(i)), und(std::move(u)) {}
677+            WorkItem(SetInfo<SetType>&& i, SetType&& u, FeeFrac&& p_f) noexcept :
678+                inc(std::move(i)), und(std::move(u)), pot_feerate(std::move(p_f)) {}

instagibbs commented at 3:09 pm on August 21, 2024:

same check already being done in split_fn, but I also like it here on creation

0                inc(std::move(i)), und(std::move(u)), pot_feerate(std::move(p_f)) {
1                Assume(pot_feerate.IsEmpty() == inc.feerate.IsEmpty());
2                }

sipa commented at 6:48 pm on August 21, 2024:

Done.

instagibbs approved

instagibbs commented at 6:20 pm on August 21, 2024: member

code review ACK 1b8f143fb791fe5ba40ce6610cd4e8e04b951495

Thanks for dropping those couple commits.

I think this will need analysis with real-world clusters to see actual impact, which may be easier once the cluster mempool project is further along. I’ve dropped it for this reason.

On that note I think we’ll need some diagram-quality style benchmarks to make informed decisions in the future, otherwise it makes ongoing changes difficult to justify.

DrahtBot requested review from glozow on Aug 21, 2024

sipa force-pushed on Aug 21, 2024

sipa commented at 6:50 pm on August 21, 2024: member

On that note I think we’ll need some diagram-quality style benchmarks to make informed decisions in the future, otherwise it makes ongoing changes difficult to justify.

FWIW, the two dropped commits are pure optimizations (they should have zero effect on the outcome, or the number of iterations performed; just on the amount of time spent per iteration).

I’ll add an overview of the expected impact of each commit in this PR description.

instagibbs commented at 6:51 pm on August 21, 2024: member

@sipa sure, but if we’re weighing a change to this, it still matters as it may modify how many iterations we’re willing to do

instagibbs commented at 7:09 pm on August 21, 2024: member

reACK 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

via git range-diff master 1b8f143fb791fe5ba40ce6610cd4e8e04b951495 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

in src/cluster_linearize.h:287 in 781e0fb3aa outdated

278@@ -279,6 +279,14 @@ struct SetInfo
279     explicit SetInfo(const DepGraph<SetType>& depgraph, const SetType& txn) noexcept :
280         transactions(txn), feerate(depgraph.FeeRate(txn)) {}
281 
282+    /** Add a transaction to this SetInfo. */
283+    void Set(const DepGraph<SetType>& depgraph, ClusterIndex pos) noexcept
284+    {
285+        Assume(!transactions[pos]);
286+        transactions.Set(pos);
287+        feerate += depgraph.FeeRate(pos);

glozow commented at 10:33 am on August 23, 2024:

781e0fb3aa5: maybe paranoid, but

0        transactions.Set(pos);
1        if (Assume(!transactions[pos])) feerate += depgraph.FeeRate(pos);

sipa commented at 4:47 pm on September 1, 2024:

I’d rather not, the function is intended for uses where the caller knows that the transaction isn’t yet included, and I hope the tests are sufficient to cover all code paths. Making it dependent on an Assume still involves adding a check of that bit plus a conditional branch (in the current code, the Assume is compiled out in production builds).

in src/test/fuzz/cluster_linearize.cpp:557 in c84c5c86ba outdated

552-        // transactions has.
553-        assert(iterations_done <= ((uint64_t{1} << todo.Count()) - 1));
554+        // At most 2^(N-1) iterations can be required: the maximum number of non-empty topological
555+        // subsets a (connected) cluster with N transactions can have. Even when the cluster is no
556+        // longer connected after removing certain transactions, this holds, because the connected
557+        // components are searched separately.

glozow commented at 12:43 pm on August 23, 2024:

Was trying to come up with some intuition for this bound. With no topological restrictions, we have 2^n subsets (yes/no n times). Minus the empty one. But every tx is in a dependency relationship if the graph is connected, therefore at least half of the subsets are not topological?

sipa commented at 1:54 pm on August 23, 2024:

I believe the worst case for number of connected topologically valid sets in a connected cluster is a cluster with one parent, and then N-1 independent children.

In this setting, every non-enpty topologically valid subset consists of the parent plus any of the 2^(N-1) subsets of children.

This isn’t a proof that nothing worse exists, but I haven’t been able to come with anything else.

instagibbs commented at 1:57 pm on August 23, 2024:

(in case my reasoning was faulty) My example was N-1 parents and 1 child.

2^(N-1) - 1 possible choices of parents without child Adding child is one more possibility with all parents selected, resulting in 2^(N-1)?

sipa commented at 2:07 pm on August 23, 2024:

@instagibbs Right, that works too. I was focused on just counting connected topologically valid subsets, but that isn’t actually required in this context. Good to see it’s still the same 2^(N-1) number.

instagibbs commented at 2:16 pm on August 23, 2024:

~~@sipa I psyop’d myself into thinking they were connected… somehow.. Well, I believe your example!~~

right, in this context it only matters what the remaining cluster was, so my example makes sense

glozow commented at 1:45 pm on August 23, 2024: member

ACK 12760a57b3a

Agree with keeping “reduce computation of unnecessary pot sets”, intuitively it seems useful in the average situation and quite cheap.

Thanks for dropping “avoid jump ahead in unmodified pot sets,” I wasn’t convinced that processing the existing pot txns in the exclusion branch is always redundant or faster to skip. It also made the code harder for future people to understand, which I think is important.

I redid code review, all pretty straightforward now.

Prettified bench results:

bench.csv

in src/bench/cluster_linearize.cpp:119 in 21a184db63 outdated

114+ * Its goal is measuring how much time every additional search iteration in linearization costs,
115+ * by running with a low and a high count, subtracting the results, and divided by the number
116+ * iterations difference.
117  */
118 template<typename SetType>
119-void BenchLinearizePerIterWorstCase(ClusterIndex ntx, benchmark::Bench& bench)

ismaelsadeeq commented at 12:52 pm on August 26, 2024:

While reading through the bench I noticed that docstring explanation of BenchLinearizeNoItersWorstCaseAnc still refer to BenchLinearizePerIterWorstCase

sipa commented at 4:47 pm on September 1, 2024:

I’ve rewritten the comments.

ismaelsadeeq commented at 4:58 pm on August 26, 2024: member

Concept ACK

in src/cluster_linearize.h:673 in 9fe834fa97 outdated

667@@ -668,7 +668,9 @@ class SearchCandidateFinder
668             SetType und;
669             /** (Only when inc is not empty) The best feerate of any superset of inc that is also a
670              *  subset of (inc | und), without requiring it to be topologically valid. It forms a
671-             *  conservative upper bound on how good a set this work item can give rise to. */
672+             *  conservative upper bound on how good a set this work item can give rise to. If the
673+             *  real best such feerate does not exceed best's, then this value is not guaranteed to
674+             *  be accurate. */

ismaelsadeeq commented at 12:30 pm on August 30, 2024:

Can you clarify this comment for me, please? It doesn’t need to be retouched; I just want to understand it better.

what does “If the real best such feerate does not exceed best’s” imply?

sipa commented at 5:01 pm on September 1, 2024:

I have expanded the comment a bit, but let me elaborate more here. If this helps, I’m happy to rework it into a code comment.

Imagine 3 transactions, T1, T2, and T3. All the same size, and each has its number as fee (T1 = 1 fee, T2 = 2 fee, T3 = 3 fee). Imagine the following series of events (it needs more complexity to actually happen, but perhaps this suffices as an explanation):

inc={T3} und={T1,T2} gets add_fnd. Its pot_feerate is 3 (just T3). This sets best={T3} (with feerate 3 as well), and T1 and T2 get removed from imp.
inc={T1} und={T2} gets add_fnd. Its real pot_feerate would be 1.5 ({T1,T2}), but non-imp transactions are not considered for computing pot_feerate, so pot_feerate is just set to 1 (just {T1}). This is fine, because despite adding T2 to pot_feerate would increase its feerate to 1.5, it will stay below the best feerate of 3 anyway. When split_fn pops this work item from the queue, it’ll discard it anyway (because pot_feerate is worse than best, and T2 being included in it or not does not change that fact).

ismaelsadeeq commented at 10:33 am on September 2, 2024:

The new comment is much better, thanks for the example here.

ismaelsadeeq commented at 4:28 pm on August 30, 2024: member

I was hoping to see the benchmark improve but it seems it doesn’t but it decreases a bit after the optimizations of this PR.

Benched on M2 Pro

On 21a184db63bd13cf0c2dff9e6e61ca85f2ff4454

ns/op	op/s	err%	total	benchmark
34,200.59	29,239.26	0.4%	0.01	`Linearize16TxWorstCase120Iters`
6,384.70	156,624.46	2.1%	0.01	`Linearize16TxWorstCase20Iters`
3,416,000.00	292.74	1.0%	0.04	`Linearize32TxWorstCase15000Iters`
1,159,875.00	862.16	1.7%	0.01	`Linearize32TxWorstCase5000Iters`
3,372,792.00	296.49	2.1%	0.04	`Linearize48TxWorstCase15000Iters`
1,144,166.00	874.00	2.2%	0.01	`Linearize48TxWorstCase5000Iters`
3,365,333.00	297.15	1.8%	0.04	`Linearize64TxWorstCase15000Iters`
1,122,792.00	890.64	0.7%	0.01	`Linearize64TxWorstCase5000Iters`
5,721,541.00	174.78	1.0%	0.06	`Linearize75TxWorstCase15000Iters`
1,904,166.00	525.16	1.1%	0.02	`Linearize75TxWorstCase5000Iters`
5,672,875.00	176.28	1.3%	0.06	`Linearize99TxWorstCase15000Iters`
1,889,542.00	529.23	0.8%	0.02	`Linearize99TxWorstCase5000Iters`

On 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

ns/op	op/s	err%	total	benchmark
152,479.17	6,558.27	1.3%	0.01	`Linearize16TxWorstCase120Iters`
40,915.00	24,440.91	2.3%	0.01	`Linearize16TxWorstCase20Iters`
12,573,458.00	79.53	1.3%	0.14	`Linearize32TxWorstCase15000Iters`
4,422,792.00	226.10	1.0%	0.05	`Linearize32TxWorstCase5000Iters`
12,813,167.00	78.04	0.5%	0.14	`Linearize48TxWorstCase15000Iters`
4,635,208.00	215.74	1.1%	0.05	`Linearize48TxWorstCase5000Iters`
13,188,833.00	75.82	1.2%	0.14	`Linearize64TxWorstCase15000Iters`
4,867,542.00	205.44	1.0%	0.05	`Linearize64TxWorstCase5000Iters`
20,246,083.00	49.39	1.3%	0.22	`Linearize75TxWorstCase15000Iters`
7,692,334.00	130.00	1.1%	0.08	`Linearize75TxWorstCase5000Iters`
20,629,500.00	48.47	2.2%	0.22	`Linearize99TxWorstCase15000Iters`
7,990,417.00	125.15	1.7%	0.09	`Linearize99TxWorstCase5000Iters`

I believe this is due to the increased per-iteration cost introduced by the optimizations. In favor of reduction of the maximum number of iterations from $2^{N-1}$ to $\sqrt{2^N}$ (which is really nice).

One notable thing I observe from this result is as the total number of iterations increase the total time taken also increases a bit.

For smaller transaction clusters which are more common in the mempool is the tradeoff is justified given that the total time to complete in master and this PR is fairly even from the bench?

I attempted to benchmark the maximum `iter_limit` values before and after this PR on few cluster sizes

I did the benchmark by modifying the iter_limit parameter of the BenchLinearizeWorstCase function for various cluster sizes (16, 32, 48, 64, 75, and 99).

On 21a184db63bd13cf0c2dff9e6e61ca85f2ff4454

I set iter_limit to $2^{N-1}$

 0diff --git a/src/bench/cluster_linearize.cpp b/src/bench/cluster_linearize.cpp
 1index e63e5d76c8b..d18df32b202 100644
 2--- a/src/bench/cluster_linearize.cpp
 3+++ b/src/bench/cluster_linearize.cpp
 4@@ -114,14 +114,17 @@ DepGraph<SetType> MakeHardGraph(ClusterIndex ntx)
 5  * iterations difference.
 6  */
 7 template<typename SetType>
 8-void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, uint64_t iter_limit)
 9+void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, std::optional<uint64_t> iter_limit)
10 {
11+    if (!iter_limit.has_value()) {
12+        iter_limit = 1 << (ntx - 1);
13+    }
14     const auto depgraph = MakeHardGraph<SetType>(ntx);
15     uint64_t rng_seed = 0;
16     bench.run([&] {
17         SearchCandidateFinder finder(depgraph, rng_seed++);
18-        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit, {});
19-        assert(iters_performed == iter_limit);
20+        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit.value(), {});
21+        assert(iters_performed <= iter_limit.value());
22     });
23 }
24 
25@@ -217,6 +220,13 @@ static void Linearize75TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLin
26 static void Linearize99TxWorstCase5000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 5000); }
27 static void Linearize99TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 15000); }
28
29+static void Linearize16TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, std::nullopt); }
30+static void Linearize32TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<32>>(32, bench, std::nullopt); }
31+static void Linearize48TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<48>>(48, bench, std::nullopt); }
32+static void Linearize64TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<64>>(64, bench, std::nullopt); }
33+static void Linearize75TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<75>>(75, bench, std::nullopt); }
34+static void Linearize99TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, std::nullopt); }
35+
36 static void LinearizeNoIters16TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<16>>(16, bench); }
37 static void LinearizeNoIters32TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<32>>(32, bench); }
38 static void LinearizeNoIters48TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<48>>(48, bench); }
39@@ -245,43 +255,50 @@ static void MergeLinearizations64TxWorstCase(benchmark::Bench& bench) { BenchMer
40 static void MergeLinearizations75TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<75>>(75, bench); }
41 static void MergeLinearizations99TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<99>>(99, bench); }
42
43+BENCHMARK(Linearize16TxWorstCase, benchmark::PriorityLevel::HIGH);
44+BENCHMARK(Linearize32TxWorstCase, benchmark::PriorityLevel::HIGH);
45+BENCHMARK(Linearize48TxWorstCase, benchmark::PriorityLevel::HIGH);
46+BENCHMARK(Linearize64TxWorstCase, benchmark::PriorityLevel::HIGH);
47+BENCHMARK(Linearize75TxWorstCase, benchmark::PriorityLevel::HIGH);
48+BENCHMARK(Linearize99TxWorstCase, benchmark::PriorityLevel::HIGH);

ns/op	op/s	err%	total	Benchmark
1,168,959.00	855.46	1.8%	0.01	`Linearize16TxWorstCase`
5,546,978,667.00	0.18	0.8%	60.83	`Linearize32TxWorstCase`
7,401,916.00	135.10	0.4%	0.08	`Linearize48TxWorstCase`
421,208.50	2,374.12	0.8%	0.01	`Linearize75TxWorstCase`
4,004.02	249,749.05	0.6%	0.01	`Linearize99TxWorstCase`

The benchmark for Linearize64TxWorstCase hung during execution, while the others completed successfully. Linearize32TxWorstCase completed after a minute.

On 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

I set `iter_limit`` to $√{2^{N}}$

 0diff --git a/src/bench/cluster_linearize.cpp b/src/bench/cluster_linearize.cpp
 1index 48ede53bd7d..e50bc55f4d9 100644
 2--- a/src/bench/cluster_linearize.cpp
 3+++ b/src/bench/cluster_linearize.cpp
 4@@ -6,6 +6,7 @@
 5 
 6 #include <util/bitset.h>
 7 #include <cluster_linearize.h>
 8+#include <cmath>
 9 
10 using namespace cluster_linearize;
11 
12@@ -114,14 +115,17 @@ DepGraph<SetType> MakeHardGraph(ClusterIndex ntx)
13  * iterations difference.
14  */
15 template<typename SetType>
16-void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, uint64_t iter_limit)
17+void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, std::optional<uint64_t> iter_limit)
18 {
19+    if (!iter_limit.has_value()) {
20+        iter_limit = std::sqrt(1 << ntx);
21+    }
22     const auto depgraph = MakeHardGraph<SetType>(ntx);
23     uint64_t rng_seed = 0;
24     bench.run([&] {
25         SearchCandidateFinder finder(depgraph, rng_seed++);
26-        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit, {});
27-        assert(iters_performed == iter_limit);
28+        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit.value(), {});
29+        assert(iters_performed <= iter_limit.value());
30     });
31 }
32 
33@@ -217,6 +221,13 @@ static void Linearize75TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLin
34 static void Linearize99TxWorstCase5000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 5000); }
35 static void Linearize99TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 15000); }
36
37+static void Linearize16TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, std::nullopt); }
38+static void Linearize32TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<32>>(32, bench, std::nullopt); }
39+static void Linearize48TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<48>>(48, bench, std::nullopt); }
40+static void Linearize64TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<64>>(64, bench, std::nullopt); }
41+static void Linearize75TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<75>>(75, bench, std::nullopt); }
42+static void Linearize99TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, std::nullopt); }
43+
44 static void LinearizeNoIters16TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<16>>(16, bench); }
45 static void LinearizeNoIters32TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<32>>(32, bench); }
46 static void LinearizeNoIters48TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<48>>(48, bench); }
47@@ -245,43 +256,50 @@ static void MergeLinearizations64TxWorstCase(benchmark::Bench& bench) { BenchMer
48 static void MergeLinearizations75TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<75>>(75, bench); }
49 static void MergeLinearizations99TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<99>>(99, bench); }
50
51+BENCHMARK(Linearize16TxWorstCase, benchmark::PriorityLevel::HIGH);
52+BENCHMARK(Linearize32TxWorstCase, benchmark::PriorityLevel::HIGH);
53+BENCHMARK(Linearize48TxWorstCase, benchmark::PriorityLevel::HIGH);
54+BENCHMARK(Linearize64TxWorstCase, benchmark::PriorityLevel::HIGH);
55+BENCHMARK(Linearize75TxWorstCase, benchmark::PriorityLevel::HIGH);
56+BENCHMARK(Linearize99TxWorstCase, benchmark::PriorityLevel::HIGH);

ns/op	op/s	err%	total	Benchmark
168,569.50	5,932.27	0.6%	0.01	`Linearize16TxWorstCase`
18,648.34	53,624.08	0.3%	0.01	`Linearize32TxWorstCase`
770,417.00	1,298.00	1.0%	0.01	`Linearize48TxWorstCase`
34,281.25	29,170.46	0.3%	0.01	`Linearize64TxWorstCase`
379,319.67	2,636.30	2.2%	0.01	`Linearize75TxWorstCase`
78,330.08	12,766.49	2.2%	0.01	`Linearize99TxWorstCase`

All benchmarks completed successfully, and the results indicate a performance improvement compared to the previous test.

Setting iter_limit to $2^{N-1}$ before this PR resulted in longer execution times and, in one case (Linearize64TxWorstCase), caused Linearize64TxWorstCase benchmark to hang which I did not investigate why.
Setting the iter_limit to $\sqrt{2^{N}}$ improved performance across all cluster sizes, with all benchmarks completing successfully.

sipa force-pushed on Sep 1, 2024

sipa commented at 5:19 pm on September 1, 2024: member

@ismaelsadeeq I believe you’re triggering integer overflows in your code (1 << ntx is 0 in case ntx is 32 or more, on typical platforms; with uint64_t{1} << ntx you can support ntx up to 63).

Also, yes, it’s definitely possible that for sufficiently small clusters simpler algorithms are better, but it’s hard to say how much those matters, as they’ll be super fast already anyway. I believe that for ntx up to 3, just PostLinearize will linearize optimally directly (as would AncestorCandidateFinder). Up to ntx=6 or so, the naive search algorithm that’s used in the fuzz test SimpleCandidateFinder is probably faster than SearchCandidateFinder… but that likely still finishes within microseconds.

To actually demonstrate the effectiveness of the optimizations here, perhaps it’s worth gathering some real-world hard clusters. @sdaftuar do you think you could instrument your code to gather/dump clusters that linearize optimally with say 1,000,000 iterations but not using 100,000 iterations, using the code up to commit 21b826af07fdfa26aaf981cc0810edd6b55b45e0 (before that commit the search algorithm is so naive that high iteration count isn’t really an indication of being hard). Perhaps a corpus of such clusters could be added as a benchmark - hopefully that shows a real benefit of each commit here.

Despite being the current next PR to review in the cluster mempool project, this isn’t actually blocking anything, so no rush here.

ismaelsadeeq commented at 11:17 am on September 2, 2024: member

light code review and Tested ACK 100e1b9ecb29c76187112fda9fed1c01da192c99

glozow commented at 5:07 pm on September 2, 2024: member

reACK 100e1b9ecb2 via range-diff

instagibbs commented at 6:20 pm on September 3, 2024: member

reACK 100e1b9ecb29c76187112fda9fed1c01da192c99

via git range-diff master 12760a57b3a6cd1aeb3b7532311f648de2b45aa4 100e1b9ecb29c76187112fda9fed1c01da192c99

just some comment updates

sdaftuar commented at 1:19 pm on September 4, 2024: member

@sdaftuar do you think you could instrument your code to gather/dump clusters that linearize optimally with say 1,000,000 iterations but not using 100,000 iterations, using the code up to commit 21b826a (before that commit the search algorithm is so naive that high iteration count isn’t really an indication of being hard).

I used my simulation environment to run the commit you listed on last year’s data, searching for clusters that were optimal within 1M iterations but required more than 100,000 iterations. I then sorted those by iteration count and took the “worst” 20 clusters, which I added to our benchmark suite in this branch: https://github.com/sdaftuar/bitcoin/tree/test-30286

I then ran that benchmark on each intermediate commit of this PR. At the beginning of this PR, on my machine, the fastest cluster takes ~5ms to linearize, with nearly all the rest being at around 10ms. By the end of this PR, the fastest cluster linearizes (optimally) in 1.4us, and the slowest cluster takes 115us (and other than the slowest cluster I found, all the rest are under 50us). So this is a huge improvement.

Note that I excluded any LIMO calculations in this benchmark (by not passing in a pre-existing linearization to Linearize).

I’m attaching my output here for others to review/double check my summary (but you can also try to recreate the benchmarks locally using the branch I shared above): output.bench.txt

instagibbs commented at 1:59 pm on September 4, 2024: member

@sdaftuar awesome data! Encouraging to see the timings move along with what @sipa was asserting :+1:

sipa force-pushed on Sep 5, 2024

sipa commented at 5:04 pm on September 5, 2024: member

@sdaftuar Added your benchmarks as a commit to this PR (I’ve reworked it to auto-detect the transaction count).

sipa force-pushed on Sep 5, 2024

DrahtBot added the label CI failed on Sep 5, 2024

DrahtBot commented at 5:09 pm on September 5, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/29741549265

Make sure to run all tests locally, according to the documentation.

The failure may happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

sipa force-pushed on Sep 5, 2024

in src/bench/cluster_linearize.cpp:232 in cf94bf7886 outdated

227+    auto runner_fn = [&]<typename SetType>() noexcept {
228+        DepGraph<SetType> depgraph;
229+        reader >> Using<DepGraphFormatter>(depgraph);
230+        uint64_t rng_seed = 0;
231+        bench.run([&] {
232+            auto res = Linearize(depgraph, /*max_iterations=*/1000000, rng_seed++);

instagibbs commented at 5:47 pm on September 5, 2024:

is f94c546ac02ad11ba73174cb8d25220515cee893 the first commit that you can assert optimality? I’m not sure having the benchmark in prior to that is so meaningful

instagibbs commented at 5:50 pm on September 5, 2024:

also seems like the assertion fails for me at that commit, jump ahead opt is the first commit that seems to hit optimal for me

sipa commented at 6:05 pm on September 5, 2024:

Nice catch. Apparently the example graphs were obtained by finding 100k-1M iterations-optimal clusters with LIMO (so while passing in an already-good linearization as input), while the benchmarks didn’t pass in an existing linearization. @sdaftuar is going to construct new ones.

sipa commented at 0:15 am on September 6, 2024:

Done, pushed new ones, and moved the commit that introduces them forward to the point where they make sense.

DrahtBot removed the label CI failed on Sep 5, 2024

sipa force-pushed on Sep 6, 2024

gmaxwell commented at 0:12 am on September 6, 2024: contributor

Nice benchmarks, any value in running them on the removed optimizations?

clusterlin: avoid recomputing potential set on every split (optimization) clusterlin: avoid jump ahead in unmodified pot sets (optimization)

Any idea why LinearizeSerializedCluster5 is an outlier now? It separates from the rest at “improve heuristic to decide split transaction”.

sipa commented at 1:14 am on September 6, 2024: member

@gmaxwell I’ve just pushed a new set of example benchmark clusters, with the property that they need between 100k and 1M iterations to linearize optimally sometimes and never more than 10M iterations. I’m not well set up right now to benchmark, but at the end of this PR, I see:

ns/op	op/s	err%	total	benchmark
13,034,966.68	76.72	1.2%	10.96	`LinearizeOptimallyExample00`
72,373.73	13,817.17	0.2%	10.65	`LinearizeOptimallyExample01`
65,761.78	15,206.40	0.1%	10.59	`LinearizeOptimallyExample02`
65,761.24	15,206.53	0.4%	10.63	`LinearizeOptimallyExample03`
6,732.35	148,536.44	0.3%	10.61	`LinearizeOptimallyExample04`
69,041.02	14,484.14	1.3%	10.64	`LinearizeOptimallyExample05`
69,270.49	14,436.16	2.0%	10.41	`LinearizeOptimallyExample06`
69,446.96	14,399.48	2.3%	10.67	`LinearizeOptimallyExample07`
48,274.16	20,715.02	0.2%	11.03	`LinearizeOptimallyExample08`
77,312.66	12,934.49	0.1%	10.73	`LinearizeOptimallyExample09`
77,423.28	12,916.01	0.2%	10.74	`LinearizeOptimallyExample10`
56,558.64	17,680.76	3.7%	11.22	`LinearizeOptimallyExample11`
45,744.29	21,860.65	0.1%	10.99	`LinearizeOptimallyExample12`
32,857.15	30,434.77	0.3%	11.04	`LinearizeOptimallyExample13`
57,328.16	17,443.43	0.3%	10.97	`LinearizeOptimallyExample14`
16,797.84	59,531.47	0.2%	10.98	`LinearizeOptimallyExample15`
205,594,200.40	4.86	0.1%	11.31	`LinearizeOptimallyExample16`
208,598,643.80	4.79	0.5%	10.82	`LinearizeOptimallyExample17`
193,159,355.60	5.18	0.9%	10.84	`LinearizeOptimallyExample18`
208,075,819.25	4.81	0.8%	11.22	`LinearizeOptimallyExample19`
@sdaftuar Is it possible that some of these clusters are “related” (as in are really modifications of one another with a few transactions added/removed)? That could explain the strong correlations between consecutive ones.

I have pushed a branch with the two dropped optimization commits re-added to https://github.com/sipa/bitcoin/commits/202409_clusterlin_more_opt, if someone wants to benchmark.

glozow commented at 1:15 pm on September 6, 2024: member

Examples 16-19 seem to get slower? My results on the last commit look similar to sipa’s:

ns/op	op/s	err%	total	benchmark
17,187,614.00	58.18	4.0%	0.19	`LinearizeOptimallyExample00`
96,365.70	10,377.14	0.2%	0.01	`LinearizeOptimallyExample01`
87,832.50	11,385.31	3.1%	0.01	`LinearizeOptimallyExample02`
86,292.45	11,588.50	1.9%	0.01	`LinearizeOptimallyExample03`
8,905.69	112,287.78	1.3%	0.01	`LinearizeOptimallyExample04`
90,947.18	10,995.39	3.2%	0.01	`LinearizeOptimallyExample05`
90,872.91	11,004.38	4.2%	0.01	`LinearizeOptimallyExample06`
93,178.40	10,732.10	1.3%	0.01	`LinearizeOptimallyExample07`
72,569.92	13,779.81	3.9%	0.01	`LinearizeOptimallyExample08`
106,545.57	9,385.66	3.8%	0.01	`LinearizeOptimallyExample09`
103,444.11	9,667.06	0.9%	0.01	`LinearizeOptimallyExample10`
74,771.30	13,374.12	2.9%	0.01	`LinearizeOptimallyExample11`
57,063.53	17,524.33	0.3%	0.01	`LinearizeOptimallyExample12`
43,317.04	23,085.60	2.1%	0.01	`LinearizeOptimallyExample13`
82,872.00	12,066.80	2.6%	0.01	`LinearizeOptimallyExample14`
21,907.07	45,647.37	0.8%	0.01	`LinearizeOptimallyExample15`
276,053,164.00	3.62	0.3%	3.03	`LinearizeOptimallyExample16`
275,137,154.00	3.63	0.2%	3.04	`LinearizeOptimallyExample17`
258,526,924.00	3.87	0.5%	2.85	`LinearizeOptimallyExample18`
273,558,093.00	3.66	0.3%	3.02	`LinearizeOptimallyExample19`

But here’s commit-by-commit 2024-09-06-bench-examples.txt

sipa commented at 1:20 pm on September 6, 2024: member

Thanks for the graphs, @glozow. That confirms the impression we got: 16-19 seem to just be really hard, and the later optimizations just don’t help with it (and increase per-iteration cost, for no meaningful iteration reduction).

It doesn’t surprise me that this is possible (they are after all very far below the conjectured upper bound of $\mathcal{O}(\sqrt{2^N})$, but it does surprise me that such clusters are seen in historical real-world mempool data.

Perhaps it’s worth investigating (in a follow up) if there are more specific optimizations targetting the patterns in those clusters.

sdaftuar commented at 1:49 pm on September 6, 2024: member

@sdaftuar Is it possible that some of these clusters are “related” (as in are really modifications of one another with a few transactions added/removed)? That could explain the strong correlations between consecutive ones.

I do believe those last 4 examples are likely to be related, as all were found within minutes of each other on the same day.

instagibbs commented at 2:40 pm on September 6, 2024: member

Confirmed that at least for example 19, it simply can’t find the optimal solution until the commit it’s included, even if total iterations are cranked up a couple orders of magnitude.

sipa commented at 2:42 pm on September 6, 2024: member

Example cluster 18:

example18

From a fee-optimization perspective it’s a remarkably uninteresting graph: individual transaction feerates range from ~78 sat/vB to ~84 sat/vB.

(created with newly-added DOT support in https://gist.github.com/sipa/822f60db6608a26bb4aae75fd31bcb12)

instagibbs commented at 4:09 pm on September 9, 2024: member

reACK 59028de422bbf8ccf53da27f1153e343952e585f

via git range-diff master 100e1b9ecb29c76187112fda9fed1c01da192c99 59028de422bbf8ccf53da27f1153e343952e585f

fwiw I ran the branch with the two dropped optimizations, and it lead to a slight improvement to most of the optimal benchmarks on my machine

DrahtBot requested review from glozow on Sep 9, 2024

DrahtBot requested review from ismaelsadeeq on Sep 9, 2024

glozow commented at 9:05 pm on September 10, 2024: member

ACK 59028de422b

ismaelsadeeq commented at 3:12 pm on September 11, 2024: member

Re-ACK 59028de422bbf8ccf53da27f1153e343952e585f

I am having consistently similar result like @glozow although some bench were unstable.

ns/op	op/s	err%	total	benchmark
13,409.81	74,572.28	8.4%	0.01	:wavy_dash: `Linearize16TxWorstCase120Iters` (Unstable with ~71.1 iters. Increase `minEpochIterations` to e.g. 711)
2,857.30	349,980.23	1.1%	0.01	`Linearize16TxWorstCase20Iters`
824,833.00	1,212.37	2.1%	0.01	`Linearize32TxWorstCase15000Iters`
263,687.50	3,792.37	3.1%	0.01	`Linearize32TxWorstCase5000Iters`
685,583.00	1,458.61	4.4%	0.01	`Linearize48TxWorstCase15000Iters`
217,283.20	4,602.29	1.3%	0.01	`Linearize48TxWorstCase5000Iters`
566,271.00	1,765.94	5.2%	0.01	:wavy_dash: `Linearize64TxWorstCase15000Iters` (Unstable with ~1.7 iters. Increase `minEpochIterations` to e.g. 17)
207,125.00	4,828.00	5.2%	0.01	:wavy_dash: `Linearize64TxWorstCase5000Iters` (Unstable with ~4.9 iters. Increase `minEpochIterations` to e.g. 49)
794,375.00	1,258.85	2.2%	0.01	`Linearize75TxWorstCase15000Iters`
287,364.50	3,479.90	2.8%	0.01	`Linearize75TxWorstCase5000Iters`
774,791.00	1,290.67	3.1%	0.01	`Linearize99TxWorstCase15000Iters`
286,264.00	3,493.28	1.6%	0.01	`Linearize99TxWorstCase5000Iters`
711.44	1,405,591.29	2.0%	0.01	`LinearizeNoIters16TxWorstCaseAnc`
811.92	1,231,654.51	1.1%	0.01	`LinearizeNoIters16TxWorstCaseLIMO`
2,209.56	452,579.03	2.0%	0.01	`LinearizeNoIters32TxWorstCaseAnc`
2,791.17	358,272.43	2.1%	0.01	`LinearizeNoIters32TxWorstCaseLIMO`
4,802.87	208,208.65	2.0%	0.01	`LinearizeNoIters48TxWorstCaseAnc`
6,225.00	160,642.57	1.7%	0.01	`LinearizeNoIters48TxWorstCaseLIMO`
8,375.67	119,393.48	2.2%	0.01	`LinearizeNoIters64TxWorstCaseAnc`
11,179.56	89,448.96	1.9%	0.01	`LinearizeNoIters64TxWorstCaseLIMO`
13,085.68	76,419.44	2.9%	0.01	`LinearizeNoIters75TxWorstCaseAnc`
20,939.10	47,757.55	1.8%	0.01	`LinearizeNoIters75TxWorstCaseLIMO`
22,215.91	45,012.79	0.4%	0.01	`LinearizeNoIters99TxWorstCaseAnc`
37,518.69	26,653.38	0.8%	0.01	`LinearizeNoIters99TxWorstCaseLIMO`
12,650,708.00	79.05	5.8%	0.14	:wavy_dash: `LinearizeOptimallyExample00` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
70,053.57	14,274.79	1.9%	0.01	`LinearizeOptimallyExample01`
63,852.80	15,661.02	1.6%	0.01	`LinearizeOptimallyExample02`
63,375.00	15,779.09	1.8%	0.01	`LinearizeOptimallyExample03`
7,607.93	131,441.79	1.0%	0.01	`LinearizeOptimallyExample04`
78,541.60	12,732.11	6.1%	0.01	:wavy_dash: `LinearizeOptimallyExample05` (Unstable with ~7.0 iters. Increase `minEpochIterations` to e.g. 70)
68,958.33	14,501.51	2.5%	0.01	`LinearizeOptimallyExample06`
75,253.73	13,288.38	7.2%	0.01	:wavy_dash: `LinearizeOptimallyExample07` (Unstable with ~12.4 iters. Increase `minEpochIterations` to e.g. 124)
55,020.90	18,174.91	9.7%	0.01	:wavy_dash: `LinearizeOptimallyExample08` (Unstable with ~12.5 iters. Increase `minEpochIterations` to e.g. 125)
77,760.42	12,860.01	3.2%	0.01	`LinearizeOptimallyExample09`
76,808.30	13,019.43	1.6%	0.01	`LinearizeOptimallyExample10`
53,569.44	18,667.36	2.1%	0.01	`LinearizeOptimallyExample11`
43,802.08	22,829.96	0.8%	0.01	`LinearizeOptimallyExample12`
31,438.15	31,808.49	2.4%	0.01	`LinearizeOptimallyExample13`
57,560.17	17,373.13	2.1%	0.01	`LinearizeOptimallyExample14`
17,059.81	58,617.31	2.3%	0.01	`LinearizeOptimallyExample15`
193,585,541.00	5.17	1.3%	2.15	`LinearizeOptimallyExample16`
196,138,167.00	5.10	3.0%	2.24	`LinearizeOptimallyExample17`
182,834,750.00	5.47	1.8%	2.07	`LinearizeOptimallyExample18`
200,634,666.00	4.98	2.2%	2.20	`LinearizeOptimallyExample19`

That confirms the impression we got: 16-19 seem to just be really hard, and the later optimizations just don’t help with it (and increase per-iteration cost, for no meaningful iteration reduction).

16-19 has similar speed with the other clusters in until bb198964e10e84eaa93a31a5b9860bad026e1dc5

Perhaps it’s worth investigating (in a follow up) if there are more specific optimizations targetting the patterns in those clusters.

+1

instagibbs commented at 3:16 pm on September 11, 2024: member

@ismaelsadeeq I typically overcome that with:

0  -min-time=<milliseconds>
1       Minimum runtime per benchmark, in milliseconds (default: 10)

to make it long enough to be stable

sipa force-pushed on Sep 11, 2024

sipa commented at 7:08 pm on September 11, 2024: member

I’ve pushed a new update, just replacing the example benchmarks. They’re now gathered based on real-world clusters (seen in simulations of historical mempool data) at 3 distinct commits (“track upper bound potential set for work items”, “include topological pot subsets automatically”, and “improve heuristic to decide split transaction”), picking among example with high maximum from-scratch linearization time both ones that have a short and long linearization time in case the optimal is already known (thanks, @sdaftuar!).

sipa force-pushed on Sep 11, 2024

DrahtBot added the label CI failed on Sep 11, 2024

sipa commented at 8:23 pm on September 11, 2024: member

instagibbs commented at 4:16 pm on September 12, 2024: member

reACK a5ea6467b6a694facc1f5efe460bf30539167597

via git range-diff master 59028de422bbf8ccf53da27f1153e343952e585f a5ea6467b6a694facc1f5efe460bf30539167597

Just benchmark changes, with examples 11-14 added in 2fcfb4c0392a2f3114195e6e9979c1431798632c and 15-19 in 8fbc6f3c33c6a5b23314279f0856de0625cc6d3f

New benchmarks all run for each commit properly post-introduction. my benchmark runs for comparison starting from introduction commit to tip: https://gist.github.com/instagibbs/058f51587074187cb78f5f9b9327442e

DrahtBot requested review from ismaelsadeeq on Sep 12, 2024

clusterlin bench: have low/high iter benchmarks instead of per-iter e4faea9ca7

clusterlin: separate initial search entries per component (optimization)

Before this commit, the worst case for linearization involves clusters which
break apart in several smaller components after the first candidate is
included in the output linearization.

Address this by never considering work items that span multiple components
of what remains of the cluster.

85a285a306

clusterlin: add reordering support for DepGraph

Add a DepGraph(depgraph, reordering) function that constructs a new DepGraph
corresponding to an old one, but with its transactions is a modified order
(given as a vector from old to new positions).

Also use this reordering feature inside DepGraphFormatter::Unser, which needs
a small modification so that its reordering mapping is old-to-new (rather than
the new-to-old it used before).

b80e6dfe78

clusterlin: use feerate-sorted depgraph in SearchCandidateFinder

This is a requirement for a future commit, which will rely on quickly iterating
over transaction sets in decreasing individual feerate order.

9e43e4ce10

clusterlin: track upper bound potential set for work items (optimization)

In each work item, keep track of a conservative overestimate of the best
possible feerate that can be reached from it, and then use these to avoid
exploring hopeless work items.

2965fbf203

clusterlin bench: add example hard cluster benchmarks

Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>

6060a948ca

clusterlin: reduce computation of unnecessary pot sets (optimization)

Keep track of which transactions in the graph have an individual
feerate that is better than the best included set so far. Others do not
need to be added to the pot set, as they cannot possibly help beating
best.

e20fda77a2

clusterlin: include topological pot subsets automatically (optimization)

Automatically add topologically-valid subsets of the potential set pot
to inc. It can be proven that these must be part of the best reachable
topologically-valid set from that work item.

This is a crucial optimization that (apparently) reduces the maximum
number of iterations from ~2^(N-1) to ~sqrt(2^N).

Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>

71f2629398

clusterlin: improve heuristic to decide split transaction (optimization)

Empirically, this approach seems to be more efficient in common real-life
clusters, and does not change the worst case.

Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>

bd044356ed

clusterlin: only start/use search when enough iterations left 9ad2fe7e69

sipa force-pushed on Sep 12, 2024

DrahtBot removed the label CI failed on Sep 13, 2024

sdaftuar commented at 3:37 pm on September 13, 2024: member

I have been gathering more data to try to measure the effect of the improvements in this PR. I decided to take my cluster mempool branch (#28676), combined with my simulation environment for playing back historical data, and measure the effect of a series of 4 commits from this PR on the number of clusters encountered in 2023 that would NOT have been optimally linearized, for each of 5 different values of the maximum iterations we’d permit when linearizing.

I plotted these values on a log-scale, so we can more easily see the effect of the optimizations here. Each data series is a different number of maximum iterations allotted to the linearization algorithm, and I’ve plotted how the number of clusters that are non-optimally sorted largely decreases with successive commits in this PR:

Optimal linearizations 2023 by commit and iteration count

instagibbs commented at 4:21 pm on September 13, 2024: member

@sdaftuar is that number of iterations per search invocation, or total over the entire linearization?

Could a version of this graph show the total fraction of transactions that are optimally linearized? Or maybe just do a chart where a cluster min size of is used to ignore the many small clusters we know we can optimally linearize? (but I guess if they were easy to linearize, that would be shown in the upper right with 1M?)

instagibbs approved

instagibbs commented at 5:12 pm on September 13, 2024: member

reACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09

just a rebase on master to take up Revert "build: Minimize I/O operations in GenerateHeaderFrom{Json,Raw}.cmake" I presume

via git range-diff master a5ea6467b6a694facc1f5efe460bf30539167597 9ad2fe7e69e9e69949ebbb280a15756dc3301f09

sdaftuar commented at 7:49 pm on September 13, 2024: member

@sdaftuar is that number of iterations per search invocation, or total over the entire linearization?

This is the total over the entire linearization, not each search invocation.

Could a version of this graph show the total fraction of transactions that are optimally linearized?

This seems a bit difficult to state, since a transaction may be part of many clusters over its lifetime, so I’m not sure precisely what we’d want to measure.

Or maybe just do a chart where a cluster min size of is used to ignore the many small clusters we know we can optimally linearize? (but I guess if they were easy to linearize, that would be shown in the upper right with 1M?)

Yeah, the idea with the first commit shown is to give a baseline understanding of what fraction of clusters are non-trivial to linearize in the first place.

gmaxwell commented at 8:46 pm on September 13, 2024: contributor

@sdaftuar

Iteration counts as the metric overstates the improvements from things that reduce iteration counts but increase work per iteration, and understates the improvements from things that only make iterations faster. But I totally get why cutting off based on time would be unworkable from a practical benchmarking perspective and the differences are large enough that this is still very interesting.

I thought it was interesting that the 100k and 1m lines get slightly worse in the last commit. I wonder if there are other clusters that share something in common with example 11. – I don’t think the behavior is a concern, but perhaps it could identify some opportunity for further optimization.

glozow commented at 2:26 pm on September 15, 2024: member

My graph on a log scale, looks similar to sipa’s:

Results txt: 2024-09-14-bench.txt

in src/bench/cluster_linearize.cpp:297 in 9ad2fe7e69

298-BENCHMARK(LinearizePerIter99TxWorstCase, benchmark::PriorityLevel::HIGH);
299+// The following example clusters were constructed by replaying historical mempool activity, and
300+// selecting for ones that take many iterations (after the introduction of some but not all
301+// linearization algorithm optimizations).
302+
303+/* 2023-05-05T23:12:21Z 71, 521780, 543141,*/

glozow commented at 9:17 pm on September 15, 2024:

Are these comments in the format timestamp, tx count, fee, size (or similar)? Or are they actually something like iteration count? I added a print statement to get feerates of clusters but got something slightly different.

 0total feerate: 4818977 / 54585, 71 txns
 1total feerate: 15029636 / 99008, 81 txns
 2total feerate: 1288344 / 98145, 90 txns
 3total feerate: 15755611 / 77992, 87 txns
 4total feerate: 429736 / 8245, 35 txns
 5total feerate: 695118 / 60056, 60 txns
 6total feerate: 941605 / 80052, 99 txns
 7total feerate: 1948779 / 51103, 52 txns
 8total feerate: 11583092 / 76049, 69 txns
 9total feerate: 871078 / 55437, 77 txns
10total feerate: 6127495 / 94464, 48 txns
11total feerate: 1176444 / 14638, 77 txns
12total feerate: 76548 / 7724, 40 txns
13total feerate: 10487482 / 100707, 96 txns
14total feerate: 14575007 / 90755, 93 txns
15total feerate: 867760 / 10558, 55 txns
16total feerate: 1205280 / 14663, 76 txns
17total feerate: 1493360 / 18235, 98 txns
18total feerate: 1446604 / 17766, 99 txns
19total feerate: 457656 / 23949, 78 txns

sdaftuar commented at 9:39 pm on September 15, 2024:

The format was timestamp (from my simulation), tx count, min iterations when running Linearize 10 times with different rng seeds and then once more when running it woth LIMO using the optimal linearization as input, and then max iterations over the same set of 11 runs.

glozow commented at 11:47 pm on September 15, 2024:

aha, that makes much more sense!

instagibbs commented at 1:55 pm on September 16, 2024:

would be worth adding that description if touched again or in followup

sipa commented at 3:12 pm on September 16, 2024:

Will do if I retouch.

ismaelsadeeq commented at 4:12 pm on October 4, 2024:

+1 on adding this in a follow-up. I can create a follow for it see https://github.com/ismaelsadeeq/bitcoin/commit/9350732b17915d315cf3ce9c744d1491c1b5d386 with the description @sdaftuar provided

glozow commented at 9:19 pm on September 15, 2024: member

reACK 9ad2fe7e69e, just have a question about the docs

sdaftuar commented at 4:42 pm on September 16, 2024: member

ACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09

glozow merged this on Sep 16, 2024

glozow closed this on Sep 16, 2024

ismaelsadeeq commented at 4:13 pm on October 4, 2024: member

post merge ACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09 My bench result locally was consistent with what @glozow and sipa has provided.

TheCharlatan referenced this in commit 8bb47d4c2c on Nov 2, 2024

heyolaniran commented at 10:20 am on May 25, 2025: none

These changes could potentially lead to better performance under high-load scenarios.I would love to know what are impacts of this changes in terms of optimization times and how many transactions it can handle synchroniously.

cluster mempool: optimized candidate search #30286

Code Coverage

Reviews

Conflicts

I attempted to benchmark the maximum iter_limit values before and after this PR on few cluster sizes

I attempted to benchmark the maximum `iter_limit` values before and after this PR on few cluster sizes