cluster mempool: optimized candidate search #30286

pull sipa wants to merge 10 commits into bitcoin:master from sipa:202406_clusterlin_opt changing 4 files +525 −98
  1. sipa commented at 2:16 am on June 14, 2024: member

    Part of cluster mempool: #30289

    Depends on #30126, and was split off from it.

    This improves the candidate search algorithm introduced in the previous PR with a variety of optimizations.

    The resulting search algorithm largely follows Section 2 of How to linearize your cluster, though with a few changes:

    • Connected component analysis is performed inside the search algorithm (creating initial work items per component for each candidate), rather than once at a higher level. This duplicates some work but is significantly simpler in implementation.
    • No ancestor-set based presplitting inside the search is performed; instead, the best value is initialized with the best topologically valid set known to the LIMO algorithm before search starts: the better one out of the highest-feerate remaining ancestor set, and the highest-feerate prefix of remaining transactions in old_linearization.
    • Work items are represented using an included set inc and an undefined set und, rather than included and excluded.
    • Potential sets pot are not computed for work items with empty inc.

    At a high level, the only missing optimization from that post is bottleneck analysis; my thinking is that it only really helps with clusters that are already relatively cheap to linearize (doing so would need to be done at a higher level, not inside the search algorithm).


    Overview of the impact of each commit here on linearize performance:

  2. DrahtBot commented at 2:16 am on June 14, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK instagibbs, glozow, sdaftuar
    Stale ACK ismaelsadeeq

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30857 (cluster mempool: extend DepGraph functionality by sipa)
    • #30605 (Cluster linearization: separate tests from tests-of-tests by sipa)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. sipa renamed this:
    Low-level cluster linearization code: optimizations
    cluster mempool: optimized candidate search
    on Jun 14, 2024
  4. sipa added the label Mempool on Jun 14, 2024
  5. sipa force-pushed on Jul 2, 2024
  6. sipa force-pushed on Jul 8, 2024
  7. sipa force-pushed on Jul 10, 2024
  8. DrahtBot added the label CI failed on Jul 10, 2024
  9. DrahtBot commented at 10:46 pm on July 10, 2024: contributor

    🚧 At least one of the CI tasks failed. Make sure to run all tests locally, according to the documentation.

    Possibly this is due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    Leave a comment here, if you need help tracking down a confusing failure.

    Debug: https://github.com/bitcoin/bitcoin/runs/27292784397

  10. sipa force-pushed on Jul 11, 2024
  11. sipa force-pushed on Jul 11, 2024
  12. DrahtBot removed the label CI failed on Jul 11, 2024
  13. sipa force-pushed on Jul 17, 2024
  14. sipa force-pushed on Jul 19, 2024
  15. sipa force-pushed on Jul 24, 2024
  16. glozow referenced this in commit 37bd70a225 on Jul 26, 2024
  17. DrahtBot added the label Needs rebase on Jul 26, 2024
  18. sipa force-pushed on Jul 26, 2024
  19. DrahtBot removed the label Needs rebase on Jul 26, 2024
  20. in src/cluster_linearize.h:859 in a3ba9c1714 outdated
    827+            // - A work item is (potentially) added with that transaction plus its remaining
    828+            //   ancestors included (added to the inc set).
    829+            //
    830+            // To decide what to split, pick among the undecided ancestors of the highest
    831+            // individual feerate transaction among the undecided ones the one which reduces the
    832+            // search space most:
    


    glozow commented at 10:30 am on July 30, 2024:

    nit in a3ba9c1714b7d74fcb08609500170f8794378fe5, I found this hard to parse

    0            // To decide what to split, consider the undecided ancestors of the highest
    1            // individual feerate undecided transaction. Pick the one which reduces the
    2            // search space most:
    

    glozow commented at 10:31 am on July 30, 2024:
    nit in commit message: s/emperically/empirically

    sipa commented at 11:55 am on August 5, 2024:
    Fixed.
  21. in src/cluster_linearize.h:185 in 9e154b0eea outdated
    180+        if (todo.None()) return todo;
    181+        auto first = todo.First();
    182+        SetType ret = Descendants(first) | Ancestors(first);
    183+        ret &= todo;
    184+        SetType to_add = ret;
    185+        to_add.Reset(first);
    


    glozow commented at 11:09 am on July 30, 2024:

    9e154b0eea2c1409a38cc02e742e1e2320947495

    Would it be cleaner to not do the first iteration outside the loop?

    0        SetType ret;
    1        SetType to_add = Singleton(todo.First());
    

    sipa commented at 10:14 pm on July 30, 2024:
    It’s certainly cleaner. It’s perhaps also a bit slower, but probably not measurably so. Done.
  22. in src/cluster_linearize.h:198 in 9e154b0eea outdated
    170@@ -171,6 +171,45 @@ class DepGraph
    171         return ret;
    172     }
    173 
    174+    /** Find some connected component within the subset "todo" of this graph.
    


    glozow commented at 2:30 pm on July 30, 2024:
    nit: maybe add “Specifically, all of the transactions connected to the first element of todo.”

    sipa commented at 11:55 am on August 5, 2024:
    Done (in #30285 already).
  23. in src/cluster_linearize.h:726 in fbb988a612 outdated
    720@@ -711,8 +721,11 @@ class SearchCandidateFinder
    721              *  pot_feerate. It starts off equal to inc. */
    722             auto pot = inc;
    723             if (!inc.feerate.IsEmpty()) {
    724-                // Add entries to pot.
    725-                for (auto pos : und) {
    726+                // Add entries to pot. We iterate over all undecided transactions whose feerate is
    727+                // higher than best. While undecided transactions of lower feerate may improve pot
    728+                // still, if they do, the resulting pot feerate cannot possibly exceed best's (and
    


    glozow commented at 3:26 pm on July 30, 2024:

    fbb988a6128b859b204299357cb6bb4133f83e25, are there some extra words?

    0                // higher than best. While undecided transactions of lower feerate may improve pot,
    1                // the resulting pot feerate cannot possibly exceed best's (and
    

    sipa commented at 11:55 am on August 5, 2024:
    Done.
  24. in src/cluster_linearize.h:977 in 3eb993466b outdated
    977     linearization.reserve(depgraph.TxCount());
    978     bool optimal = true;
    979 
    980+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations.
    981+    // If we don't have that many, don't start it.
    982+    uint64_t start_iterations = (uint64_t{depgraph.TxCount()} * depgraph.TxCount() + 63) / 64;
    


    glozow commented at 3:49 pm on July 30, 2024:
    Is the idea that, if we have fewer than n^2/64, we can pretty much expect SearchCandidateFinder to not make any meaningful progress?

    sipa commented at 10:17 pm on July 30, 2024:

    No, it’s trying to use “iterations” as a generic metric for computation time, and benchmarks show that n^2/64 times the time per iteration is roughly what just constructing the search object can take.

    The real purpose here is that if you pass in iterations=0, the code won’t bother performing a relatively costly initialization of a search object if it’s not going to be used. But that’s then generalized a bit to also skip it if doing so would cost more than the time spent proportionally on the number of iterations you asked for.


    glozow commented at 10:56 am on July 31, 2024:
    Ahh I see, thanks
  25. glozow commented at 3:50 pm on July 30, 2024: member
    Did an initial read-through
  26. sipa force-pushed on Jul 30, 2024
  27. sipa force-pushed on Aug 4, 2024
  28. sipa commented at 6:00 pm on August 4, 2024: member
    Rebased on top of #30285 as it looks like that is closer.
  29. sipa force-pushed on Aug 5, 2024
  30. in src/test/fuzz/cluster_linearize.cpp:551 in 5fddd456b1 outdated
    547@@ -513,9 +548,11 @@ FUZZ_TARGET(clusterlin_search_finder)
    548             assert(found.transactions.IsSupersetOf(depgraph.Ancestors(i) & todo));
    549         }
    550 
    551-        // At most 2^N-1 iterations can be required: the number of non-empty subsets a graph with N
    552-        // transactions has.
    553-        assert(iterations_done <= ((uint64_t{1} << todo.Count()) - 1));
    554+        // At most 2^(N-1) iterations can be required: the maximum number of topological subsets a
    


    instagibbs commented at 7:08 pm on August 5, 2024:
    non-empty topological subsets?

    sipa commented at 11:34 pm on August 7, 2024:
    Indeed, done.
  31. in src/cluster_linearize.h:619 in 4e7ff15ce4 outdated
    621+            m_original_to_sorted[m_sorted_to_original[i]] = i;
    622+        }
    623+        // Compute reordered dependency graph.
    624+        m_depgraph = DepGraph(depgraph, m_original_to_sorted);
    625+        // Set todo to the entire graph.
    626+        m_todo = SetType::Fill(depgraph.TxCount());
    


    instagibbs commented at 2:41 pm on August 6, 2024:
    move of m_todo instantiation seems like a no-op?

    sipa commented at 11:34 pm on August 7, 2024:
    Indeed, fixed.
  32. in src/cluster_linearize.h:1008 in 400a72b4ab outdated
    1016+                // Invoke bounded search to update best, with up to half of our remaining
    1017+                // iterations as limit.
    1018+                iterations_left -= base_iterations;
    1019+                max_iterations_now = (iterations_left + 1) / 2;
    1020+                std::tie(best, iterations_done_now) = src_finder->FindCandidateSet(max_iterations_now, best);
    1021+                iterations_left -= iterations_done_now;
    


    glozow commented at 2:59 pm on August 6, 2024:
    Since we’re subtracting both base_iterations (the N/4 estimate) and iterations_done_now from iterations_left, is the idea that base_iterations incorporates the computational overhead beyond the splitting operations?

    sipa commented at 5:30 pm on August 6, 2024:
    Indeed; largely due to the connected-component splitting. I’ve expanded the comments a bit.
  33. in src/cluster_linearize.h:849 in a88207c4ee outdated
    836+            //
    837+            // To decide what to split on, consider the undecided ancestors of the highest
    838+            // individual feerate undecided transaction. Pick the one which reduces the search space
    839+            // most. Let I(t) be the size of the undecided set after including t, and E(t) the size
    840+            // of the undecided set after excluding t. Then choose the split transaction t such
    841+            // that 2^I(t) + 2^E(t) is minimal, tie-breaking by highest individual feerate for t.
    


    glozow commented at 3:19 pm on August 6, 2024:
    Wow, cool way to convey pair comparison

    instagibbs commented at 5:51 pm on August 6, 2024:
    I would have appreciated slightly more pedantic comments demonstrating how the computation matches the comments :sweat:

    sipa commented at 11:37 pm on August 7, 2024:
    @instagibbs I’ve expanded the comments to explain why the (max,min) sort order is the same as (2^max + 2^min).
  34. in src/cluster_linearize.h:125 in 4e7ff15ce4 outdated
    117@@ -118,6 +118,28 @@ class DepGraph
    118         }
    119     }
    120 
    121+    /** Construct a DepGraph object given another DepGraph and a mapping from old to new.
    122+     *
    123+     * Complexity: O(N^2) where N=depgraph.TxCount().
    124+     */
    125+    DepGraph(const DepGraph<SetType>& depgraph, Span<const ClusterIndex> mapping) noexcept : entries(depgraph.TxCount())
    


    instagibbs commented at 3:33 pm on August 6, 2024:
    I think this can be used with DepGraphFormatter during Unser with reversed reordering

    sipa commented at 11:35 pm on August 7, 2024:
    Nice, indeed! I needed to make some change to the unserializer to track in reordering the reverse mapping, which I’ve for now done in the same commit. Happy to split that out if useful.

    instagibbs commented at 2:02 pm on August 8, 2024:
    LGTM, I tested by manually making the map and reversing it right at the end, which also seemed to work?

    sipa commented at 1:47 am on August 9, 2024:
    I’ve simplified the change a bit (now much closer to older deserialization code) and moved it to a separate commit (together with the introduction of the DepGraph reordering constructor).
  35. in src/cluster_linearize.h:710 in 5fddd456b1 outdated
    627+            auto component = m_depgraph.FindConnectedComponent(to_cover);
    628+            to_cover -= component;
    629+            // If best is not provided, set it to the first component, so that during the work
    630+            // processing loop below, and during the add_fn/split_fn calls, we do not need to deal
    631+            // with the best=empty case.
    632+            if (best.feerate.IsEmpty()) best = SetInfo(m_depgraph, component);
    


    glozow commented at 3:44 pm on August 6, 2024:
    Would it make sense to use if (best.feerate.IsEmpty() || best.feerate << m_depgraph.feerate(component)? We set imp based on this afterward, and it seems like it could help it start smaller.

    sipa commented at 5:31 pm on August 6, 2024:
    Perhaps, but due the BFS search order, the normal iteration code should very quickly try splitting the whole-component work items anyway. The goal here is really just making sure best is not empty, so branches aren’t needed to deal with that case.
  36. in src/cluster_linearize.h:704 in 4e63e0a951 outdated
    700@@ -688,7 +701,7 @@ class SearchCandidateFinder
    701             // processing loop below, and during the add_fn/split_fn calls, we do not need to deal
    702             // with the best=empty case.
    703             if (best.feerate.IsEmpty()) best = SetInfo(m_depgraph, component);
    704-            queue.emplace_back(SetInfo<SetType>{}, std::move(component));
    705+            queue.emplace_back(SetInfo<SetType>{}, std::move(component), FeeFrac{});
    


    instagibbs commented at 4:02 pm on August 6, 2024:

    I do realize the argument changes later, so maybe this just on the later commit

    0            queue.emplace_back(/*inc=*/SetInfo<SetType>{}, /*und=*/std::move(component), /*pot_feerate=*/FeeFrac{});
    

    sipa commented at 2:59 am on August 8, 2024:
    Done.
  37. glozow commented at 4:26 pm on August 6, 2024: member
    code review ACK. I haven’t reviewed the fuzzers yet, doing that next. Tested by writing some benches that run a lot faster with the optimizations.
  38. sipa force-pushed on Aug 6, 2024
  39. in src/cluster_linearize.h:740 in 29cac3e85b outdated
    728@@ -723,19 +729,25 @@ class SearchCandidateFinder
    729          *
    730          * - inc: the "inc" value for the new work item (must be topological).
    731          * - und: the "und" value for the new work item ((inc | und) must be topological).
    732+         * - pot: a subset of the "pot" value for the new work item (but a superset of inc).
    733+         *        It does not need to be the full pot value; missing pot transactions will be added
    734+         *        to it by add_fn.
    735+         * - grow_inc: whether to attempt moving transactions from und to inc, if it can be proven
    


    instagibbs commented at 7:45 pm on August 6, 2024:

    in 29cac3e85b23f1264121cbd50e4a09755b0c8352

    I think this commit may benefit from being two smaller commits, one that demonstrates the caching, and a second that demonstrates using it to skip the splitting computation?

    I do not understand what it’s doing, as-is.


    sipa commented at 2:58 am on August 8, 2024:

    I have split the commit in two, with additional comments, and I renamed grow_inc to reconsider_pot.

    Roughly what is going on is that:

    • In “include topological pot subsets automatically”, all of every new work item’s pot set is searched for topologically-valid subsets (which can be moved to inc instead)
    • In “avoid recomputing potential set on every split” (before my last push), only the part of the new item’s pot set that wasn’t in the parent work item’s pot set is searched for topologically-valid subsets unless grow_inc was set, in which case all of pot is searched.

    I have now split this second commit into:

    • In “avoid recomputing potential set on every split”, still all of the new item’s pot set is searched.
    • In “avoid jump ahead in unmodified pot sets”, only the part of the new item’s pot set that wasn’t in the parent work item’s set is searched unless reconsider_pot is set.

    instagibbs commented at 2:19 pm on August 8, 2024:

    I think this has helped a lot, “clusterlin: avoid recomputing potential set on every split (optimization)” now makes sense to me. Lots of interlocking parts, so having small bite-sizes helps.

    Can you motivate why reconsider_pot is set as true on inclusion of anc and not on exclude?

    When anc is included, it’s included in both inc and pot, then:

    if (reconsider_pot) consider_inc = pot.transactions - inc.transactions;

    so no part of anc will be in consider_inc.


    sipa commented at 3:26 pm on August 8, 2024:

    @instagibbs @glozow Let me try giving an explanation here. If this helps you, I’ll rework it into a comment in the code.

    Think of the set of transactions in every work items as falling within one of four classes (roughly from “more likely to be included” to “less likely to be included”):

    • I: definitely included transactions, (= item.inc).
    • U: undecided transactions that may or may not end up being included (= item.und), subdided in:
      • UH: possibly included, and including any of them will definitely increase the included set’s feerate (even when other lower-fee UH transactions are included too) (=item.pot - item.inc, =item.und & item.pot).
      • UL: possibly included, but including it is not guaranteed to increase the included set’s feerate (=item.und - item.pot)
    • E: definitely excluded transactions, (= todo - item.inc - item.und).

    In other words, item.pot is I+UH and item.und is U=UH+UL. Further, I is always topological, as is I+U. I+UH however may be non-topological.

    Now let’s consider what happens in an inclusion branch (i.e., the call to add_fn for adding the split transaction and its ancestors):

    1. We start with a parent object which has a given I, UH, UL, E (assume I is non-empty; I=empty is a special case that has separate logic in several places).
    2. A split transaction t is chosen from U=UH+UL (=item.und).
    3. anc(t) are moved from UH+UL to I (this is done inside split_fn, in the invocation of add_fn).
    4. anc(t) will always include at least one UL transaction (*), and because of that the feerate of I may have gone down. If so, it is possible that more transactions can be moved from UL to UH. This is done inside add_fn, in the for (auto pos : (imp & und) - pot.transactions) { loop.
    5. Because of the changes to I and UH above, it is possible that there is now a new topologically-valid subset of I+UH that isn’t already entirely in I, so it can be moved to I (jump ahead). This can be due to UH entries added in (4) or due to I entries added in (3).. This is done in add_fn, in the for (auto pos : consider_inc) { loop.
    6. If I now has the best feerate ever, remember it.
    7. Construct a new work item with the resulting I,UH,UL,E, unless UH+UL is empty (which is always the case if UL is empty, **) because then no more splits can take place.

    Compare this to what happens in an exclusion branch:

    1. Same
    2. Same
    3. desc(t) are moved from UH+UL to E and I does not change.
    4. desc(t) may include a UH transaction, and because of that it is possible that more transactions can be moved from UL to UH.
    5. Because of the above, it is possible that there are new topologically-valid subsets of I+UH now that can be moved to I. This can only be due to UH entries added in (4).
    6. Same
    7. Same

    The bold parts correspond to reconsider_pot. In the inclusion branch, UL transactions moved to UH may result in existing UH transactions becoming part of a topologically-valid subset, so UH needs to be reconsidered. In an exclusion branch I does not change in (3), and so only new UH transactions added in (4) need to be considered for moving to I. Now, if that happens (something moves from UH to I due to jump ahead), it may be that what was in UH before becomes eligible to move as well, but doing so does not seem to affect the performance or number of iterations the algorithm needs.

    (*) The jump ahead is exactly what guarantees that anc(t) always includes a UL transaction. If anc(t) was entirely within UH, it would be topologically valid, and thus the parent work item’s jump ahead step would have moved it to I already.

    (**) Same reason: if UL is empty, then the entirety if I+UH is topologically valid, and thus would have been moved to I entirely in the jump ahead. Thus, UL empty implies UH is also empty.


    glozow commented at 11:40 am on August 12, 2024:

    @sipa thanks this is really helpful, particularly this part:

    1. anc(t) will always include at least one UL transaction () () The jump ahead is exactly what guarantees that anc(t) always includes a UL transaction. If anc(t) was entirely within UH, it would be topologically valid, and thus the parent work item’s jump ahead step would have moved it to I already.

    However I’m still having trouble with this part:

    Compare this to what happens in an exclusion branch … 4. desc(t) may include a UH transaction, and because of that it is possible that more transactions can be moved from UL to UH. 5. Because of the above, it is possible that there are new topologically-valid subsets of I+UH now that can be moved to I. This can only be due to UH entries added in (4).

    It seems true that only the entries added in (4) can cause jump-aheads. However, I’m not convinced that existing pot transactions cannot be jump-aheadable. For instance, it’s possible that we added a new tx that is the parent of a tx that was already in pot.

    I added an assert(anc.Overlaps(elem.und - elem.pot.transactions)) in split_fn, right before we call add_fn for the inclusion branch in order to check that “jump ahead guarantees that anc(t) always includes a UL transaction”. The assertion hit, and essentially what I saw was:

    1. Going into a split, we had a really high pot feerate, which included tx C.
    2. For the exclusion branch, desc(t) included lots of high feerate transactions which lowered the pot feerate. C is still in pot. At the start of this add_fn, we add tx P to pot, since its feerate is now high enough.
    3. P is part of consider_inc, and is added to inc in the jump-ahead.
    4. C, which a child of P, would be good because anc(C) is a topological subset of pot, but is skipped because reconsider_pot=false
    5. Later, we split on C. C is part of UH, not UL, which hits the assert. anc(C) = C which is in UH, but it wasn’t added to inc during a jump-ahead. This is because C was added to UH when its ancestor P was considered UL, and reconsider_pot=false when P was added to UH.

    Does this make sense? Or maybe my assert isn’t the right interpretation of this? I don’t think this changes what FindCandidateSet outputs, but it means I still don’t quite understand the “avoid jump ahead in unmodified pot sets” commit :(

    (fwiw the fuzz I crashed on is echo "AQBJCAgAIysjIyMiIyMjIyMjIyMjIyM6IyMIACMrIyMjIiMjIyMjIyMjIyMjIyMjCBlE+xEvRG9v AggAbwAR" | base 64 -d)

    EDIT: ah sorry, I just reread this part

    it may be that what was in UH before becomes eligible to move as well, but doing so does not seem to affect the performance or number of iterations the algorithm needs.

    so I see that this is expected. Perhaps including this in the commit message would help, as I was confused by “In exclusion branches (where inc does not change, and transactions are only removed from und), the jump ahead optimization cannot find anything to add to inc that wasn’t possible already.”


    sipa commented at 1:04 pm on August 12, 2024:
    @glozow Yeah I only re-discovered (and re-tested…) this fact (that after moving from new-UH to I in an exclusion branch, that former-UH also become eligible to move to I) while writing this. I don’t have a good explanation for why it doesn’t affect the further run of the algorithm though.

    glozow commented at 9:03 am on August 13, 2024:
    How significant do we expect the reconsider_pot optimization to be? Locally my LinearizePerIter*TxWorstCase bench doesn’t seem to do better after c68d26a0e97.

    sipa commented at 2:36 pm on August 15, 2024:
    @glozow Good question, I’ve pushed a change that allows for perhaps a bit more accurate measurements (run the 5000 and 15000 iteration benchmark, and subtract). No other changes.

    glozow commented at 9:50 am on August 16, 2024:

    Here’s what I get on 255571339e7:

     0|               ns/op |                op/s |    err% |     total | benchmark
     1|--------------------:|--------------------:|--------:|----------:|:----------
     2|          724,714.00 |            1,379.85 |    2.4% |      0.01 | `Linearize32TxWorstCase15000Iters`
     3|          245,712.75 |            4,069.79 |    0.3% |      0.01 | `Linearize32TxWorstCase5000Iters`
     4|          688,589.50 |            1,452.24 |    5.2% |      0.01 | :wavy_dash: `Linearize48TxWorstCase15000Iters` (Unstable with ~1.2 iters. Increase `minEpochIterations` to e.g. 12)
     5|          233,424.25 |            4,284.05 |    4.8% |      0.01 | `Linearize48TxWorstCase5000Iters`
     6|          662,959.00 |            1,508.39 |    1.7% |      0.01 | `Linearize64TxWorstCase15000Iters`
     7|          242,351.75 |            4,126.23 |    1.9% |      0.01 | `Linearize64TxWorstCase5000Iters`
     8|        1,051,925.00 |              950.64 |    4.3% |      0.01 | `Linearize75TxWorstCase15000Iters`
     9|          344,038.00 |            2,906.66 |    1.8% |      0.01 | `Linearize75TxWorstCase5000Iters`
    10|        1,023,533.00 |              977.01 |    2.1% |      0.01 | `Linearize99TxWorstCase15000Iters`
    11|          356,557.67 |            2,804.60 |    1.3% |      0.01 | `Linearize99TxWorstCase5000Iters`
    

    on d04997871f5:

     0|               ns/op |                op/s |    err% |     total | benchmark
     1|--------------------:|--------------------:|--------:|----------:|:----------
     2|          705,210.00 |            1,418.02 |    4.4% |      0.01 | `Linearize32TxWorstCase15000Iters`
     3|          231,874.00 |            4,312.69 |    0.2% |      0.01 | `Linearize32TxWorstCase5000Iters`
     4|          664,206.00 |            1,505.56 |    4.7% |      0.01 | `Linearize48TxWorstCase15000Iters`
     5|          222,808.25 |            4,488.16 |    1.9% |      0.01 | `Linearize48TxWorstCase5000Iters`
     6|          627,877.00 |            1,592.67 |    0.2% |      0.01 | `Linearize64TxWorstCase15000Iters`
     7|          223,255.50 |            4,479.17 |    2.6% |      0.01 | `Linearize64TxWorstCase5000Iters`
     8|          993,733.00 |            1,006.31 |    3.8% |      0.01 | `Linearize75TxWorstCase15000Iters`
     9|          327,429.67 |            3,054.09 |    2.3% |      0.01 | `Linearize75TxWorstCase5000Iters`
    10|          957,057.00 |            1,044.87 |    3.9% |      0.01 | `Linearize99TxWorstCase15000Iters`
    11|          339,669.33 |            2,944.04 |    1.8% |      0.01 | `Linearize99TxWorstCase5000Iters`
    

    Looks somewhat better but not much / close to the err%:

    Benchmark Table1 ns/op Table2 ns/op Percentage Speedup
    Linearize32TxWorstCase15000Iters 724,714.00 705,210.00 2.70%
    Linearize32TxWorstCase5000Iters 245,712.75 231,874.00 5.64%
    Linearize48TxWorstCase15000Iters 688,589.50 664,206.00 3.54%
    Linearize48TxWorstCase5000Iters 233,424.25 222,808.25 4.55%
    Linearize64TxWorstCase15000Iters 662,959.00 627,877.00 5.30%
    Linearize64TxWorstCase5000Iters 242,351.75 223,255.50 7.88%
    Linearize75TxWorstCase15000Iters 1,051,925.00 993,733.00 5.54%
    Linearize75TxWorstCase5000Iters 344,038.00 327,429.67 4.84%
    Linearize99TxWorstCase15000Iters 1,023,533.00 957,057.00 6.49%
    Linearize99TxWorstCase5000Iters 356,557.67 339,669.33 4.74%
  40. in src/cluster_linearize.h:973 in 06ea1016ff outdated
    976@@ -971,10 +977,19 @@ std::pair<std::vector<ClusterIndex>, bool> Linearize(const DepGraph<SetType>& de
    977     std::vector<ClusterIndex> linearization;
    978 
    979     AncestorCandidateFinder anc_finder(depgraph);
    980-    SearchCandidateFinder src_finder(depgraph, rng_seed);
    981+    std::optional<SearchCandidateFinder<SetType>> src_finder;
    982     linearization.reserve(depgraph.TxCount());
    983     bool optimal = true;
    984 
    985+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations
    


    instagibbs commented at 3:34 pm on August 7, 2024:
    why 64?

    sipa commented at 3:00 am on August 8, 2024:
    Just a rough estimate based on benchmarks. Added a comment.
  41. in src/cluster_linearize.h:998 in 06ea1016ff outdated
    1008         uint64_t iterations_done_now = 0;
    1009-        std::tie(best, iterations_done_now) = src_finder.FindCandidateSet(max_iterations_now, best);
    1010-        iterations_left -= iterations_done_now;
    1011+        uint64_t max_iterations_now = 0;
    1012+        if (src_finder) {
    1013+            // Treat the invocation of SearchCandidateFinder::FindCandidateSet() as costing N/4
    


    instagibbs commented at 3:40 pm on August 7, 2024:
    why 4?

    sipa commented at 3:00 am on August 8, 2024:
    Just a rough estimate based on benchmarks. Added a comment.
  42. sipa force-pushed on Aug 7, 2024
  43. sipa force-pushed on Aug 7, 2024
  44. sipa force-pushed on Aug 8, 2024
  45. in src/cluster_linearize.h:977 in f46c0cc3d2 outdated
    994 
    995+    // Treat the initialization of SearchCandidateFinder as taking N^2/64 (rounded up) iterations
    996+    // (largely due to the cost of constructing the internal sorted-by-feerate DepGraph inside
    997+    // SearchCandidateFinder), a rough approximation based on benchmark. If we don't have that
    998+    // many, don't start it.
    999+    uint64_t start_iterations = (uint64_t{depgraph.TxCount()} * depgraph.TxCount() + 63) / 64;
    


    gmaxwell commented at 9:47 am on August 8, 2024:
    Do these adhoc initialization costs interact well with the with the upper bound 2^(k-1) and the empirical sqrt(2^k)+1 limits in the tests or should the bounds in the test be increased by these adhoc initialization costs?

    sipa commented at 10:23 pm on August 8, 2024:

    @gmaxwell I think you’re right that these ought to be added to the limits in the tests, though I don’t actually see any failures with a quick test. The sqrt(2^k)+1 limit is only hit for very small clusters (say, 3 transactions) though, for larger ones the worst I’ve seen is around 70%-80% of this presumed limit, which may explain not seeing failures.

    It does make me wonder if perhaps Linearize shouldn’t bother with trying to do CPU time management, and should just expose a bool do_search option.


    gmaxwell commented at 0:32 am on August 9, 2024:
    Yeah, I assumed that the specific limit used in the tests right how happened to evade failing. I don’t have a strong opinion. The adjustments seemed fine to me but maybe just making some function for the adjustments and having the test use that to derive its test limits might make sense– so that further updates to those adhoc adjustments don’t accidentally break test limits that might be hit infrequently (or after changing the amount of work the test does).

    sipa commented at 7:04 pm on August 16, 2024:
    Ok, mystery solved. These adhoc initialization costs do not interact with the $\sqrt{2^k}+1$ rule, because that rule is at the per-candidate-set level, while the adhoc costs are applied at the linearization level. It does in theory interact with the derived $\sqrt{12 * 2^n}+n$ rule, but for $n \leq 3$ the linearization code is optimal without any search, and for $n \geq 4$, the formula is sufficiently overshooting to accomodate the ad-hoc costs. Still, I’ve changed the test to use precomputed limits instead.
  46. in src/cluster_linearize.h:752 in f46c0cc3d2 outdated
    771+                    // it. If not, we're done updating pot. This relies on the fact that
    772+                    // m_depgraph, and thus the transactions iterated over, are in decreasing
    773+                    // individual feerate order.
    774+                    if (!(m_depgraph.FeeRate(pos) >> pot.feerate)) break;
    775+                    pot.Set(m_depgraph, pos);
    776+                    consider_inc.Set(pos);
    


    glozow commented at 10:52 am on August 8, 2024:

    in c68d26a0e97d7544076f00a467a8eca16658077a

    If we’re meant to avoid jump ahead, should this be gated on reconsider_pot as well?


    sipa commented at 2:33 pm on August 8, 2024:
    No, we always want to consider transactions that were missing from pot, because those couldn’t have been considered for the parent work item. reconsider_pot only controls whether we should reconsider the ones already in pot.

    glozow commented at 3:56 pm on August 9, 2024:
    Oh of course, I was confusing myself :+1:
  47. sipa force-pushed on Aug 9, 2024
  48. sipa commented at 2:13 am on August 9, 2024: member
    Unsure where this should live, but I have a Python reimplementation of the depgraph serialization code + mermaid formatter: https://gist.github.com/sipa/822f60db6608a26bb4aae75fd31bcb12
  49. sipa force-pushed on Aug 15, 2024
  50. in src/bench/cluster_linearize.cpp:206 in ce177d2600 outdated
    207-static void LinearizePerIter48TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<48>>(48, bench); }
    208-static void LinearizePerIter64TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<64>>(64, bench); }
    209-static void LinearizePerIter75TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<75>>(75, bench); }
    210-static void LinearizePerIter99TxWorstCase(benchmark::Bench& bench) { BenchLinearizePerIterWorstCase<BitSet<99>>(99, bench); }
    211+static void Linearize16TxWorstCase50Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, 50); }
    212+static void Linearize16TxWorstCase150Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, 150); }
    


    glozow commented at 10:10 am on August 16, 2024:
    CI is failing because it halts at 128 iters at 5709d828f36, it seems. Original iters given was 2^(16/2-1) = 128. However, sqrt(2^(n-1)) = sqrt(2^15) > 150

    sipa commented at 3:28 pm on August 16, 2024:
    Fixed.
  51. glozow commented at 1:19 pm on August 16, 2024: member

    fwiw, my bench results (I commented out assert(iters_performed == max_iters) since it fails)

    git rebase -i --exec="make -j12 > /dev/null && src/bench/bench_bitcoin -filter=Linearize.*" HEAD~11 Output: bench_30286_all.txt

    Reformatted as a csv: output_benchmarks_ns_op.csv

    Displayed as a line graph (x-axis is commit, y-axis is ns/op): image

  52. sipa force-pushed on Aug 16, 2024
  53. sipa force-pushed on Aug 16, 2024
  54. sipa commented at 6:56 pm on August 16, 2024: member

    (Reposting at PR level instead of review level) @glozow Cool, building on that:

     0$ git rebase --exec='F=$(mktemp) && make -j -C src bench/bench_bitcoin >/dev/null && src/bench/bench_bitcoin -filter=Linearize64TxWorstCase.* -min-time=60000 -output-json=$F >/dev/null 2>/dev/null && NS=$(cat $F | jq "(.results[0][\"median(elapsed)\"] - .results[1][\"median(elapsed)\"]) / 10000 * 1000000000") && TITLE=$(git show -s --format="%s") && echo "$NS $TITLE"' HEAD~11 2>/dev/null
     111.60165442504903  clusterlin bench: have low/high iter benchmarks instead of per-iter
     211.583260381566241 clusterlin: separate initial search entries per component (optimization)
     311.53167576849342  clusterlin: add reordering support for DepGraph
     412.451696899512429 clusterlin: use feerate-sorted depgraph in SearchCandidateFinder
     528.3225308342194   clusterlin: track upper bound potential set for work items (optimization)
     628.0684870149178   clusterlin: reduce computation of unnecessary pot sets (optimization)
     733.4621797288421   clusterlin: include topological pot subsets automatically (optimization)
     842.1621546397433   clusterlin: improve heuristic to decide split transaction (optimization)
     946.8534839828272   clusterlin: avoid recomputing potential set on every split (optimization)
    1046.167085876022306 clusterlin: avoid jump ahead in unmodified pot sets (optimization)
    1145.5678434124763   clusterlin: only start/use search when enough iterations left
    

    A general trend of increasing time per iteration is expected, as many changes are intended to reduce the number of iterations needed at the cost of increasing the per-iteration cost, but not all. I’m surprised at the lack of (significant) per-iteration time reduction for:

    • clusterlin: reduce computation of unnecessary pot sets (optimization)
    • clusterlin: avoid recomputing potential set on every split (optimization)
    • clusterlin: avoid jump ahead in unmodified pot sets (optimization)

    It may be that these are optimizations that only help in the average case and not the worst case that’s benchmarked here (which is probably worth some performance reduction in the worst case, but not much) or don’t help at all. I’ll investigate a bit more, and if not significant, drop them from this PR.

  55. gmaxwell commented at 11:07 pm on August 17, 2024: contributor

    @sipa Though there may be other patterns which are a worst case even more worse than the particular lattice used for testing without those optimizations? If you’ve been searching for worst case examples for a long time with those optimizations in place you may have missed cases that would be worse without them. No?

    An important thing to keep in mind that the ‘worst case’ used in the tests is speculative. You don’t know for sure that it’s actually a worse case under any version of the algorithm. It might be the worst case under some but not under others or it may never be a worst case (except for very small N which you’ve presumably tested exhaustively).

  56. sipa force-pushed on Aug 19, 2024
  57. sipa commented at 9:15 pm on August 19, 2024: member

    I’m surprised at the lack of (significant) per-iteration time reduction for:

    • clusterlin: reduce computation of unnecessary pot sets (optimization)
    • clusterlin: avoid recomputing potential set on every split (optimization)
    • clusterlin: avoid jump ahead in unmodified pot sets (optimization)

    I have dropped the 2nd and 3rd, due to inconclusive evidence that they help. The 1st one I have kept; my belief is that it in fact does not improve the worst case, but it is also has very low cost even when it doesn’t do anything. The 2nd and 3rd do have a cost if ineffective (as shown by @glozow’s and my benchmarks). Reasoning:

    • reduce computation of unnecessary pot sets: I don’t believe this affects the worst case. My conjecture is that worst cases are ones where knowing the best candidate set up front does not change the number of iterations requires (i.e., all search tree subtrees either fail immediately, or need to be fully explored because they appear to be potential best ones until the very end). In these subtrees, this optimization doesn’t do anything, because its only effect is reducing computation in work items that are doomed to fail soon after anyway, and if my conjecture is true, such branches don’t exist in the first place in worst case situations. Still, due to the low cost, I’ve left it in.
    • avoid recomputing potential set on every split: this optimization was based on a belief that a significant amount of time would be spent recomputing the same pot subset after every split. With some more investigation, (hard) clusters exist where this change is a net improvement and ones where it is a net degradation (found through a custom fuzz test that maximizes or minimizes the number of pot addition steps minus the cost of updating pot in split_fn, not an actual benchmark). The per-iteration potential gains seem higher than the per-iteration potential degrations, but the largest absolute losses are with close-to-worst clusters, while the largest absolute gains are with significantly simpler clusters, which is weak evidence that it may hurt the worst case more than it helps. I think this will need analysis with real-world clusters to see actual impact, which may be easier once the cluster mempool project is further along. I’ve dropped it for this reason.
    • avoid jump ahead in unmodified pot sets: does not make sense without the previous change, so I also dropped it. @gmaxwell

    Though there may be other patterns which are a worst case even more worse than the particular lattice used for testing without those optimizations? If you’ve been searching for worst case examples for a long time with those optimizations in place you may have missed cases that would be worse without them. No?

    Absolutely; even the apparent worst-case complexity $\mathcal{O}(\sqrt{2^n})$ is just a conjecture - it’s possible that clusters exist which have up to $\mathcal{O}(2^{n-1})$ complexity, and if so, a lot of empirical evidence that’s been used to guide this development would go out the window. But given that I came up with it in an attempt to reduce a pattern which I incorrectly though would be very common, there isn’t much justification to have it.

  58. sipa force-pushed on Aug 19, 2024
  59. in src/cluster_linearize.h:572 in a12ca7b8d9 outdated
    570+    std::vector<ClusterIndex> m_sorted_to_original;
    571+    /** m_original_to_sorted[i] is the sorted position original transaction position i has. */
    572+    std::vector<ClusterIndex> m_original_to_sorted;
    573+    /** Internal dependency graph for the cluster (with transactions in decreasing individual
    574+     *  feerate order). */
    575+    DepGraph<SetType> m_depgraph;
    


    instagibbs commented at 3:02 pm on August 21, 2024:

    a12ca7b8d99d868ff49e90d33b5e90c169849ccc

    may want to consider something like m_sorted_depgraph


    sipa commented at 6:48 pm on August 21, 2024:
    Done.
  60. in src/cluster_linearize.h:676 in 805f9122e5 outdated
    673 
    674             /** Construct a new work item. */
    675-            WorkItem(SetInfo<SetType>&& i, SetType&& u) noexcept :
    676-                inc(std::move(i)), und(std::move(u)) {}
    677+            WorkItem(SetInfo<SetType>&& i, SetType&& u, FeeFrac&& p_f) noexcept :
    678+                inc(std::move(i)), und(std::move(u)), pot_feerate(std::move(p_f)) {}
    


    instagibbs commented at 3:09 pm on August 21, 2024:

    same check already being done in split_fn, but I also like it here on creation

    0                inc(std::move(i)), und(std::move(u)), pot_feerate(std::move(p_f)) {
    1                Assume(pot_feerate.IsEmpty() == inc.feerate.IsEmpty());
    2                }
    

    sipa commented at 6:48 pm on August 21, 2024:
    Done.
  61. instagibbs approved
  62. instagibbs commented at 6:20 pm on August 21, 2024: member

    code review ACK 1b8f143fb791fe5ba40ce6610cd4e8e04b951495

    Thanks for dropping those couple commits.

    I think this will need analysis with real-world clusters to see actual impact, which may be easier once the cluster mempool project is further along. I’ve dropped it for this reason.

    On that note I think we’ll need some diagram-quality style benchmarks to make informed decisions in the future, otherwise it makes ongoing changes difficult to justify.

  63. DrahtBot requested review from glozow on Aug 21, 2024
  64. sipa force-pushed on Aug 21, 2024
  65. sipa commented at 6:50 pm on August 21, 2024: member

    On that note I think we’ll need some diagram-quality style benchmarks to make informed decisions in the future, otherwise it makes ongoing changes difficult to justify.

    FWIW, the two dropped commits are pure optimizations (they should have zero effect on the outcome, or the number of iterations performed; just on the amount of time spent per iteration).

    I’ll add an overview of the expected impact of each commit in this PR description.

  66. instagibbs commented at 6:51 pm on August 21, 2024: member
    @sipa sure, but if we’re weighing a change to this, it still matters as it may modify how many iterations we’re willing to do
  67. instagibbs commented at 7:09 pm on August 21, 2024: member

    reACK 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

    via git range-diff master 1b8f143fb791fe5ba40ce6610cd4e8e04b951495 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

  68. in src/cluster_linearize.h:287 in 781e0fb3aa outdated
    278@@ -279,6 +279,14 @@ struct SetInfo
    279     explicit SetInfo(const DepGraph<SetType>& depgraph, const SetType& txn) noexcept :
    280         transactions(txn), feerate(depgraph.FeeRate(txn)) {}
    281 
    282+    /** Add a transaction to this SetInfo. */
    283+    void Set(const DepGraph<SetType>& depgraph, ClusterIndex pos) noexcept
    284+    {
    285+        Assume(!transactions[pos]);
    286+        transactions.Set(pos);
    287+        feerate += depgraph.FeeRate(pos);
    


    glozow commented at 10:33 am on August 23, 2024:

    781e0fb3aa5: maybe paranoid, but

    0        transactions.Set(pos);
    1        if (Assume(!transactions[pos])) feerate += depgraph.FeeRate(pos);
    

    sipa commented at 4:47 pm on September 1, 2024:
    I’d rather not, the function is intended for uses where the caller knows that the transaction isn’t yet included, and I hope the tests are sufficient to cover all code paths. Making it dependent on an Assume still involves adding a check of that bit plus a conditional branch (in the current code, the Assume is compiled out in production builds).
  69. in src/test/fuzz/cluster_linearize.cpp:557 in c84c5c86ba outdated
    552-        // transactions has.
    553-        assert(iterations_done <= ((uint64_t{1} << todo.Count()) - 1));
    554+        // At most 2^(N-1) iterations can be required: the maximum number of non-empty topological
    555+        // subsets a (connected) cluster with N transactions can have. Even when the cluster is no
    556+        // longer connected after removing certain transactions, this holds, because the connected
    557+        // components are searched separately.
    


    glozow commented at 12:43 pm on August 23, 2024:
    Was trying to come up with some intuition for this bound. With no topological restrictions, we have 2^n subsets (yes/no n times). Minus the empty one. But every tx is in a dependency relationship if the graph is connected, therefore at least half of the subsets are not topological?

    sipa commented at 1:54 pm on August 23, 2024:

    I believe the worst case for number of connected topologically valid sets in a connected cluster is a cluster with one parent, and then N-1 independent children.

    In this setting, every non-enpty topologically valid subset consists of the parent plus any of the 2^(N-1) subsets of children.

    This isn’t a proof that nothing worse exists, but I haven’t been able to come with anything else.


    instagibbs commented at 1:57 pm on August 23, 2024:

    (in case my reasoning was faulty) My example was N-1 parents and 1 child.

    2^(N-1) - 1 possible choices of parents without child Adding child is one more possibility with all parents selected, resulting in 2^(N-1)?


    sipa commented at 2:07 pm on August 23, 2024:
    @instagibbs Right, that works too. I was focused on just counting connected topologically valid subsets, but that isn’t actually required in this context. Good to see it’s still the same 2^(N-1) number.

    instagibbs commented at 2:16 pm on August 23, 2024:

    @sipa I psyop’d myself into thinking they were connected… somehow.. Well, I believe your example!

    right, in this context it only matters what the remaining cluster was, so my example makes sense

  70. glozow commented at 1:45 pm on August 23, 2024: member

    ACK 12760a57b3a

    Agree with keeping “reduce computation of unnecessary pot sets”, intuitively it seems useful in the average situation and quite cheap.

    Thanks for dropping “avoid jump ahead in unmodified pot sets,” I wasn’t convinced that processing the existing pot txns in the exclusion branch is always redundant or faster to skip. It also made the code harder for future people to understand, which I think is important.

    I redid code review, all pretty straightforward now.

    Prettified bench results: image

    bench.csv

  71. in src/bench/cluster_linearize.cpp:119 in 21a184db63 outdated
    114+ * Its goal is measuring how much time every additional search iteration in linearization costs,
    115+ * by running with a low and a high count, subtracting the results, and divided by the number
    116+ * iterations difference.
    117  */
    118 template<typename SetType>
    119-void BenchLinearizePerIterWorstCase(ClusterIndex ntx, benchmark::Bench& bench)
    


    ismaelsadeeq commented at 12:52 pm on August 26, 2024:
    While reading through the bench I noticed that docstring explanation of BenchLinearizeNoItersWorstCaseAnc still refer to BenchLinearizePerIterWorstCase

    sipa commented at 4:47 pm on September 1, 2024:
    I’ve rewritten the comments.
  72. ismaelsadeeq commented at 4:58 pm on August 26, 2024: member
    Concept ACK
  73. in src/cluster_linearize.h:673 in 9fe834fa97 outdated
    667@@ -668,7 +668,9 @@ class SearchCandidateFinder
    668             SetType und;
    669             /** (Only when inc is not empty) The best feerate of any superset of inc that is also a
    670              *  subset of (inc | und), without requiring it to be topologically valid. It forms a
    671-             *  conservative upper bound on how good a set this work item can give rise to. */
    672+             *  conservative upper bound on how good a set this work item can give rise to. If the
    673+             *  real best such feerate does not exceed best's, then this value is not guaranteed to
    674+             *  be accurate. */
    


    ismaelsadeeq commented at 12:30 pm on August 30, 2024:

    Can you clarify this comment for me, please? It doesn’t need to be retouched; I just want to understand it better.

    what does “If the real best such feerate does not exceed best’s” imply?


    sipa commented at 5:01 pm on September 1, 2024:

    I have expanded the comment a bit, but let me elaborate more here. If this helps, I’m happy to rework it into a code comment.

    Imagine 3 transactions, T1, T2, and T3. All the same size, and each has its number as fee (T1 = 1 fee, T2 = 2 fee, T3 = 3 fee). Imagine the following series of events (it needs more complexity to actually happen, but perhaps this suffices as an explanation):

    • inc={T3} und={T1,T2} gets add_fnd. Its pot_feerate is 3 (just T3). This sets best={T3} (with feerate 3 as well), and T1 and T2 get removed from imp.
    • inc={T1} und={T2} gets add_fnd. Its real pot_feerate would be 1.5 ({T1,T2}), but non-imp transactions are not considered for computing pot_feerate, so pot_feerate is just set to 1 (just {T1}). This is fine, because despite adding T2 to pot_feerate would increase its feerate to 1.5, it will stay below the best feerate of 3 anyway. When split_fn pops this work item from the queue, it’ll discard it anyway (because pot_feerate is worse than best, and T2 being included in it or not does not change that fact).

    ismaelsadeeq commented at 10:33 am on September 2, 2024:
    The new comment is much better, thanks for the example here.
  74. ismaelsadeeq commented at 4:28 pm on August 30, 2024: member

    I was hoping to see the benchmark improve but it seems it doesn’t but it decreases a bit after the optimizations of this PR.

    Benched on M2 Pro

    On 21a184db63bd13cf0c2dff9e6e61ca85f2ff4454

    ns/op op/s err% total benchmark
    34,200.59 29,239.26 0.4% 0.01 Linearize16TxWorstCase120Iters
    6,384.70 156,624.46 2.1% 0.01 Linearize16TxWorstCase20Iters
    3,416,000.00 292.74 1.0% 0.04 Linearize32TxWorstCase15000Iters
    1,159,875.00 862.16 1.7% 0.01 Linearize32TxWorstCase5000Iters
    3,372,792.00 296.49 2.1% 0.04 Linearize48TxWorstCase15000Iters
    1,144,166.00 874.00 2.2% 0.01 Linearize48TxWorstCase5000Iters
    3,365,333.00 297.15 1.8% 0.04 Linearize64TxWorstCase15000Iters
    1,122,792.00 890.64 0.7% 0.01 Linearize64TxWorstCase5000Iters
    5,721,541.00 174.78 1.0% 0.06 Linearize75TxWorstCase15000Iters
    1,904,166.00 525.16 1.1% 0.02 Linearize75TxWorstCase5000Iters
    5,672,875.00 176.28 1.3% 0.06 Linearize99TxWorstCase15000Iters
    1,889,542.00 529.23 0.8% 0.02 Linearize99TxWorstCase5000Iters

    On 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

    ns/op op/s err% total benchmark
    152,479.17 6,558.27 1.3% 0.01 Linearize16TxWorstCase120Iters
    40,915.00 24,440.91 2.3% 0.01 Linearize16TxWorstCase20Iters
    12,573,458.00 79.53 1.3% 0.14 Linearize32TxWorstCase15000Iters
    4,422,792.00 226.10 1.0% 0.05 Linearize32TxWorstCase5000Iters
    12,813,167.00 78.04 0.5% 0.14 Linearize48TxWorstCase15000Iters
    4,635,208.00 215.74 1.1% 0.05 Linearize48TxWorstCase5000Iters
    13,188,833.00 75.82 1.2% 0.14 Linearize64TxWorstCase15000Iters
    4,867,542.00 205.44 1.0% 0.05 Linearize64TxWorstCase5000Iters
    20,246,083.00 49.39 1.3% 0.22 Linearize75TxWorstCase15000Iters
    7,692,334.00 130.00 1.1% 0.08 Linearize75TxWorstCase5000Iters
    20,629,500.00 48.47 2.2% 0.22 Linearize99TxWorstCase15000Iters
    7,990,417.00 125.15 1.7% 0.09 Linearize99TxWorstCase5000Iters

    I believe this is due to the increased per-iteration cost introduced by the optimizations. In favor of reduction of the maximum number of iterations from $2^{N-1}$ to $\sqrt{2^N}$ (which is really nice).

    One notable thing I observe from this result is as the total number of iterations increase the total time taken also increases a bit.

    For smaller transaction clusters which are more common in the mempool is the tradeoff is justified given that the total time to complete in master and this PR is fairly even from the bench?


    I attempted to benchmark the maximum iter_limit values before and after this PR on few cluster sizes

    I did the benchmark by modifying the iter_limit parameter of the BenchLinearizeWorstCase function for various cluster sizes (16, 32, 48, 64, 75, and 99).

    On 21a184db63bd13cf0c2dff9e6e61ca85f2ff4454

    I set iter_limit to $2^{N-1}$

     0diff --git a/src/bench/cluster_linearize.cpp b/src/bench/cluster_linearize.cpp
     1index e63e5d76c8b..d18df32b202 100644
     2--- a/src/bench/cluster_linearize.cpp
     3+++ b/src/bench/cluster_linearize.cpp
     4@@ -114,14 +114,17 @@ DepGraph<SetType> MakeHardGraph(ClusterIndex ntx)
     5  * iterations difference.
     6  */
     7 template<typename SetType>
     8-void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, uint64_t iter_limit)
     9+void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, std::optional<uint64_t> iter_limit)
    10 {
    11+    if (!iter_limit.has_value()) {
    12+        iter_limit = 1 << (ntx - 1);
    13+    }
    14     const auto depgraph = MakeHardGraph<SetType>(ntx);
    15     uint64_t rng_seed = 0;
    16     bench.run([&] {
    17         SearchCandidateFinder finder(depgraph, rng_seed++);
    18-        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit, {});
    19-        assert(iters_performed == iter_limit);
    20+        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit.value(), {});
    21+        assert(iters_performed <= iter_limit.value());
    22     });
    23 }
    24 
    25@@ -217,6 +220,13 @@ static void Linearize75TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLin
    26 static void Linearize99TxWorstCase5000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 5000); }
    27 static void Linearize99TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 15000); }
    28
    29+static void Linearize16TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, std::nullopt); }
    30+static void Linearize32TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<32>>(32, bench, std::nullopt); }
    31+static void Linearize48TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<48>>(48, bench, std::nullopt); }
    32+static void Linearize64TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<64>>(64, bench, std::nullopt); }
    33+static void Linearize75TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<75>>(75, bench, std::nullopt); }
    34+static void Linearize99TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, std::nullopt); }
    35+
    36 static void LinearizeNoIters16TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<16>>(16, bench); }
    37 static void LinearizeNoIters32TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<32>>(32, bench); }
    38 static void LinearizeNoIters48TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<48>>(48, bench); }
    39@@ -245,43 +255,50 @@ static void MergeLinearizations64TxWorstCase(benchmark::Bench& bench) { BenchMer
    40 static void MergeLinearizations75TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<75>>(75, bench); }
    41 static void MergeLinearizations99TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<99>>(99, bench); }
    42
    43+BENCHMARK(Linearize16TxWorstCase, benchmark::PriorityLevel::HIGH);
    44+BENCHMARK(Linearize32TxWorstCase, benchmark::PriorityLevel::HIGH);
    45+BENCHMARK(Linearize48TxWorstCase, benchmark::PriorityLevel::HIGH);
    46+BENCHMARK(Linearize64TxWorstCase, benchmark::PriorityLevel::HIGH);
    47+BENCHMARK(Linearize75TxWorstCase, benchmark::PriorityLevel::HIGH);
    48+BENCHMARK(Linearize99TxWorstCase, benchmark::PriorityLevel::HIGH);
    
    ns/op op/s err% total Benchmark
    1,168,959.00 855.46 1.8% 0.01 Linearize16TxWorstCase
    5,546,978,667.00 0.18 0.8% 60.83 Linearize32TxWorstCase
    7,401,916.00 135.10 0.4% 0.08 Linearize48TxWorstCase
    421,208.50 2,374.12 0.8% 0.01 Linearize75TxWorstCase
    4,004.02 249,749.05 0.6% 0.01 Linearize99TxWorstCase

    The benchmark for Linearize64TxWorstCase hung during execution, while the others completed successfully. Linearize32TxWorstCase completed after a minute.

    On 12760a57b3a6cd1aeb3b7532311f648de2b45aa4

    I set `iter_limit`` to $√{2^{N}}$

     0diff --git a/src/bench/cluster_linearize.cpp b/src/bench/cluster_linearize.cpp
     1index 48ede53bd7d..e50bc55f4d9 100644
     2--- a/src/bench/cluster_linearize.cpp
     3+++ b/src/bench/cluster_linearize.cpp
     4@@ -6,6 +6,7 @@
     5 
     6 #include <util/bitset.h>
     7 #include <cluster_linearize.h>
     8+#include <cmath>
     9 
    10 using namespace cluster_linearize;
    11 
    12@@ -114,14 +115,17 @@ DepGraph<SetType> MakeHardGraph(ClusterIndex ntx)
    13  * iterations difference.
    14  */
    15 template<typename SetType>
    16-void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, uint64_t iter_limit)
    17+void BenchLinearizeWorstCase(ClusterIndex ntx, benchmark::Bench& bench, std::optional<uint64_t> iter_limit)
    18 {
    19+    if (!iter_limit.has_value()) {
    20+        iter_limit = std::sqrt(1 << ntx);
    21+    }
    22     const auto depgraph = MakeHardGraph<SetType>(ntx);
    23     uint64_t rng_seed = 0;
    24     bench.run([&] {
    25         SearchCandidateFinder finder(depgraph, rng_seed++);
    26-        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit, {});
    27-        assert(iters_performed == iter_limit);
    28+        auto [candidate, iters_performed] = finder.FindCandidateSet(iter_limit.value(), {});
    29+        assert(iters_performed <= iter_limit.value());
    30     });
    31 }
    32 
    33@@ -217,6 +221,13 @@ static void Linearize75TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLin
    34 static void Linearize99TxWorstCase5000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 5000); }
    35 static void Linearize99TxWorstCase15000Iters(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, 15000); }
    36
    37+static void Linearize16TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<16>>(16, bench, std::nullopt); }
    38+static void Linearize32TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<32>>(32, bench, std::nullopt); }
    39+static void Linearize48TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<48>>(48, bench, std::nullopt); }
    40+static void Linearize64TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<64>>(64, bench, std::nullopt); }
    41+static void Linearize75TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<75>>(75, bench, std::nullopt); }
    42+static void Linearize99TxWorstCase(benchmark::Bench& bench) { BenchLinearizeWorstCase<BitSet<99>>(99, bench, std::nullopt); }
    43+
    44 static void LinearizeNoIters16TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<16>>(16, bench); }
    45 static void LinearizeNoIters32TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<32>>(32, bench); }
    46 static void LinearizeNoIters48TxWorstCaseAnc(benchmark::Bench& bench) { BenchLinearizeNoItersWorstCaseAnc<BitSet<48>>(48, bench); }
    47@@ -245,43 +256,50 @@ static void MergeLinearizations64TxWorstCase(benchmark::Bench& bench) { BenchMer
    48 static void MergeLinearizations75TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<75>>(75, bench); }
    49 static void MergeLinearizations99TxWorstCase(benchmark::Bench& bench) { BenchMergeLinearizationsWorstCase<BitSet<99>>(99, bench); }
    50
    51+BENCHMARK(Linearize16TxWorstCase, benchmark::PriorityLevel::HIGH);
    52+BENCHMARK(Linearize32TxWorstCase, benchmark::PriorityLevel::HIGH);
    53+BENCHMARK(Linearize48TxWorstCase, benchmark::PriorityLevel::HIGH);
    54+BENCHMARK(Linearize64TxWorstCase, benchmark::PriorityLevel::HIGH);
    55+BENCHMARK(Linearize75TxWorstCase, benchmark::PriorityLevel::HIGH);
    56+BENCHMARK(Linearize99TxWorstCase, benchmark::PriorityLevel::HIGH);
    
    ns/op op/s err% total Benchmark
    168,569.50 5,932.27 0.6% 0.01 Linearize16TxWorstCase
    18,648.34 53,624.08 0.3% 0.01 Linearize32TxWorstCase
    770,417.00 1,298.00 1.0% 0.01 Linearize48TxWorstCase
    34,281.25 29,170.46 0.3% 0.01 Linearize64TxWorstCase
    379,319.67 2,636.30 2.2% 0.01 Linearize75TxWorstCase
    78,330.08 12,766.49 2.2% 0.01 Linearize99TxWorstCase

    All benchmarks completed successfully, and the results indicate a performance improvement compared to the previous test.

    • Setting iter_limit to $2^{N-1}$ before this PR resulted in longer execution times and, in one case (Linearize64TxWorstCase), caused Linearize64TxWorstCase benchmark to hang which I did not investigate why.

    • Setting the iter_limit to $\sqrt{2^{N}}$ improved performance across all cluster sizes, with all benchmarks completing successfully.

  75. sipa force-pushed on Sep 1, 2024
  76. sipa commented at 5:19 pm on September 1, 2024: member

    @ismaelsadeeq I believe you’re triggering integer overflows in your code (1 << ntx is 0 in case ntx is 32 or more, on typical platforms; with uint64_t{1} << ntx you can support ntx up to 63).

    Also, yes, it’s definitely possible that for sufficiently small clusters simpler algorithms are better, but it’s hard to say how much those matters, as they’ll be super fast already anyway. I believe that for ntx up to 3, just PostLinearize will linearize optimally directly (as would AncestorCandidateFinder). Up to ntx=6 or so, the naive search algorithm that’s used in the fuzz test SimpleCandidateFinder is probably faster than SearchCandidateFinder… but that likely still finishes within microseconds.

    To actually demonstrate the effectiveness of the optimizations here, perhaps it’s worth gathering some real-world hard clusters. @sdaftuar do you think you could instrument your code to gather/dump clusters that linearize optimally with say 1,000,000 iterations but not using 100,000 iterations, using the code up to commit 21b826af07fdfa26aaf981cc0810edd6b55b45e0 (before that commit the search algorithm is so naive that high iteration count isn’t really an indication of being hard). Perhaps a corpus of such clusters could be added as a benchmark - hopefully that shows a real benefit of each commit here.

    Despite being the current next PR to review in the cluster mempool project, this isn’t actually blocking anything, so no rush here.

  77. ismaelsadeeq commented at 11:17 am on September 2, 2024: member
    light code review and Tested ACK 100e1b9ecb29c76187112fda9fed1c01da192c99
  78. glozow commented at 5:07 pm on September 2, 2024: member
    reACK 100e1b9ecb2 via range-diff
  79. instagibbs commented at 6:20 pm on September 3, 2024: member

    reACK 100e1b9ecb29c76187112fda9fed1c01da192c99

    via git range-diff master 12760a57b3a6cd1aeb3b7532311f648de2b45aa4 100e1b9ecb29c76187112fda9fed1c01da192c99

    just some comment updates

  80. sdaftuar commented at 1:19 pm on September 4, 2024: member

    @sdaftuar do you think you could instrument your code to gather/dump clusters that linearize optimally with say 1,000,000 iterations but not using 100,000 iterations, using the code up to commit 21b826a (before that commit the search algorithm is so naive that high iteration count isn’t really an indication of being hard).

    I used my simulation environment to run the commit you listed on last year’s data, searching for clusters that were optimal within 1M iterations but required more than 100,000 iterations. I then sorted those by iteration count and took the “worst” 20 clusters, which I added to our benchmark suite in this branch: https://github.com/sdaftuar/bitcoin/tree/test-30286

    I then ran that benchmark on each intermediate commit of this PR. At the beginning of this PR, on my machine, the fastest cluster takes ~5ms to linearize, with nearly all the rest being at around 10ms. By the end of this PR, the fastest cluster linearizes (optimally) in 1.4us, and the slowest cluster takes 115us (and other than the slowest cluster I found, all the rest are under 50us). So this is a huge improvement.

    Note that I excluded any LIMO calculations in this benchmark (by not passing in a pre-existing linearization to Linearize).

    I’m attaching my output here for others to review/double check my summary (but you can also try to recreate the benchmarks locally using the branch I shared above): output.bench.txt

  81. instagibbs commented at 1:59 pm on September 4, 2024: member
    @sdaftuar awesome data! Encouraging to see the timings move along with what @sipa was asserting :+1:
  82. sipa force-pushed on Sep 5, 2024
  83. sipa commented at 5:04 pm on September 5, 2024: member
    @sdaftuar Added your benchmarks as a commit to this PR (I’ve reworked it to auto-detect the transaction count).
  84. sipa force-pushed on Sep 5, 2024
  85. DrahtBot added the label CI failed on Sep 5, 2024
  86. DrahtBot commented at 5:09 pm on September 5, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/29741549265

    Make sure to run all tests locally, according to the documentation.

    The failure may happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  87. sipa force-pushed on Sep 5, 2024
  88. in src/bench/cluster_linearize.cpp:232 in cf94bf7886 outdated
    227+    auto runner_fn = [&]<typename SetType>() noexcept {
    228+        DepGraph<SetType> depgraph;
    229+        reader >> Using<DepGraphFormatter>(depgraph);
    230+        uint64_t rng_seed = 0;
    231+        bench.run([&] {
    232+            auto res = Linearize(depgraph, /*max_iterations=*/1000000, rng_seed++);
    


    instagibbs commented at 5:47 pm on September 5, 2024:
    is f94c546ac02ad11ba73174cb8d25220515cee893 the first commit that you can assert optimality? I’m not sure having the benchmark in prior to that is so meaningful

    instagibbs commented at 5:50 pm on September 5, 2024:
    also seems like the assertion fails for me at that commit, jump ahead opt is the first commit that seems to hit optimal for me

    sipa commented at 6:05 pm on September 5, 2024:
    Nice catch. Apparently the example graphs were obtained by finding 100k-1M iterations-optimal clusters with LIMO (so while passing in an already-good linearization as input), while the benchmarks didn’t pass in an existing linearization. @sdaftuar is going to construct new ones.

    sipa commented at 0:15 am on September 6, 2024:
    Done, pushed new ones, and moved the commit that introduces them forward to the point where they make sense.
  89. DrahtBot removed the label CI failed on Sep 5, 2024
  90. sipa force-pushed on Sep 6, 2024
  91. gmaxwell commented at 0:12 am on September 6, 2024: contributor

    Nice benchmarks, any value in running them on the removed optimizations?

    clusterlin: avoid recomputing potential set on every split (optimization) clusterlin: avoid jump ahead in unmodified pot sets (optimization)

    Any idea why LinearizeSerializedCluster5 is an outlier now? It separates from the rest at “improve heuristic to decide split transaction”.

  92. sipa commented at 1:14 am on September 6, 2024: member

    @gmaxwell I’ve just pushed a new set of example benchmark clusters, with the property that they need between 100k and 1M iterations to linearize optimally sometimes and never more than 10M iterations. I’m not well set up right now to benchmark, but at the end of this PR, I see:

    ns/op op/s err% total benchmark
    13,034,966.68 76.72 1.2% 10.96 LinearizeOptimallyExample00
    72,373.73 13,817.17 0.2% 10.65 LinearizeOptimallyExample01
    65,761.78 15,206.40 0.1% 10.59 LinearizeOptimallyExample02
    65,761.24 15,206.53 0.4% 10.63 LinearizeOptimallyExample03
    6,732.35 148,536.44 0.3% 10.61 LinearizeOptimallyExample04
    69,041.02 14,484.14 1.3% 10.64 LinearizeOptimallyExample05
    69,270.49 14,436.16 2.0% 10.41 LinearizeOptimallyExample06
    69,446.96 14,399.48 2.3% 10.67 LinearizeOptimallyExample07
    48,274.16 20,715.02 0.2% 11.03 LinearizeOptimallyExample08
    77,312.66 12,934.49 0.1% 10.73 LinearizeOptimallyExample09
    77,423.28 12,916.01 0.2% 10.74 LinearizeOptimallyExample10
    56,558.64 17,680.76 3.7% 11.22 LinearizeOptimallyExample11
    45,744.29 21,860.65 0.1% 10.99 LinearizeOptimallyExample12
    32,857.15 30,434.77 0.3% 11.04 LinearizeOptimallyExample13
    57,328.16 17,443.43 0.3% 10.97 LinearizeOptimallyExample14
    16,797.84 59,531.47 0.2% 10.98 LinearizeOptimallyExample15
    205,594,200.40 4.86 0.1% 11.31 LinearizeOptimallyExample16
    208,598,643.80 4.79 0.5% 10.82 LinearizeOptimallyExample17
    193,159,355.60 5.18 0.9% 10.84 LinearizeOptimallyExample18
    208,075,819.25 4.81 0.8% 11.22 LinearizeOptimallyExample19
    @sdaftuar Is it possible that some of these clusters are “related” (as in are really modifications of one another with a few transactions added/removed)? That could explain the strong correlations between consecutive ones.

    I have pushed a branch with the two dropped optimization commits re-added to https://github.com/sipa/bitcoin/commits/202409_clusterlin_more_opt, if someone wants to benchmark.

  93. glozow commented at 1:15 pm on September 6, 2024: member

    Examples 16-19 seem to get slower? My results on the last commit look similar to sipa’s:

    ns/op op/s err% total benchmark
    17,187,614.00 58.18 4.0% 0.19 LinearizeOptimallyExample00
    96,365.70 10,377.14 0.2% 0.01 LinearizeOptimallyExample01
    87,832.50 11,385.31 3.1% 0.01 LinearizeOptimallyExample02
    86,292.45 11,588.50 1.9% 0.01 LinearizeOptimallyExample03
    8,905.69 112,287.78 1.3% 0.01 LinearizeOptimallyExample04
    90,947.18 10,995.39 3.2% 0.01 LinearizeOptimallyExample05
    90,872.91 11,004.38 4.2% 0.01 LinearizeOptimallyExample06
    93,178.40 10,732.10 1.3% 0.01 LinearizeOptimallyExample07
    72,569.92 13,779.81 3.9% 0.01 LinearizeOptimallyExample08
    106,545.57 9,385.66 3.8% 0.01 LinearizeOptimallyExample09
    103,444.11 9,667.06 0.9% 0.01 LinearizeOptimallyExample10
    74,771.30 13,374.12 2.9% 0.01 LinearizeOptimallyExample11
    57,063.53 17,524.33 0.3% 0.01 LinearizeOptimallyExample12
    43,317.04 23,085.60 2.1% 0.01 LinearizeOptimallyExample13
    82,872.00 12,066.80 2.6% 0.01 LinearizeOptimallyExample14
    21,907.07 45,647.37 0.8% 0.01 LinearizeOptimallyExample15
    276,053,164.00 3.62 0.3% 3.03 LinearizeOptimallyExample16
    275,137,154.00 3.63 0.2% 3.04 LinearizeOptimallyExample17
    258,526,924.00 3.87 0.5% 2.85 LinearizeOptimallyExample18
    273,558,093.00 3.66 0.3% 3.02 LinearizeOptimallyExample19

    But here’s commit-by-commit 2024-09-06-bench-examples.txt image

  94. sipa commented at 1:20 pm on September 6, 2024: member

    Thanks for the graphs, @glozow. That confirms the impression we got: 16-19 seem to just be really hard, and the later optimizations just don’t help with it (and increase per-iteration cost, for no meaningful iteration reduction).

    It doesn’t surprise me that this is possible (they are after all very far below the conjectured upper bound of $\mathcal{O}(\sqrt{2^N})$, but it does surprise me that such clusters are seen in historical real-world mempool data.

    Perhaps it’s worth investigating (in a follow up) if there are more specific optimizations targetting the patterns in those clusters.

  95. sdaftuar commented at 1:49 pm on September 6, 2024: member

    @sdaftuar Is it possible that some of these clusters are “related” (as in are really modifications of one another with a few transactions added/removed)? That could explain the strong correlations between consecutive ones.

    I do believe those last 4 examples are likely to be related, as all were found within minutes of each other on the same day.

  96. instagibbs commented at 2:40 pm on September 6, 2024: member
    Confirmed that at least for example 19, it simply can’t find the optimal solution until the commit it’s included, even if total iterations are cranked up a couple orders of magnitude.
  97. sipa commented at 2:42 pm on September 6, 2024: member

    Example cluster 18:

    example18

    From a fee-optimization perspective it’s a remarkably uninteresting graph: individual transaction feerates range from ~78 sat/vB to ~84 sat/vB.

    (created with newly-added DOT support in https://gist.github.com/sipa/822f60db6608a26bb4aae75fd31bcb12)

  98. instagibbs commented at 4:09 pm on September 9, 2024: member

    reACK 59028de422bbf8ccf53da27f1153e343952e585f

    via git range-diff master 100e1b9ecb29c76187112fda9fed1c01da192c99 59028de422bbf8ccf53da27f1153e343952e585f

    fwiw I ran the branch with the two dropped optimizations, and it lead to a slight improvement to most of the optimal benchmarks on my machine

  99. DrahtBot requested review from glozow on Sep 9, 2024
  100. DrahtBot requested review from ismaelsadeeq on Sep 9, 2024
  101. glozow commented at 9:05 pm on September 10, 2024: member
    ACK 59028de422b
  102. ismaelsadeeq commented at 3:12 pm on September 11, 2024: member

    Re-ACK 59028de422bbf8ccf53da27f1153e343952e585f

    I am having consistently similar result like @glozow although some bench were unstable.

    ns/op op/s err% total benchmark
    13,409.81 74,572.28 8.4% 0.01 :wavy_dash: Linearize16TxWorstCase120Iters (Unstable with ~71.1 iters. Increase minEpochIterations to e.g. 711)
    2,857.30 349,980.23 1.1% 0.01 Linearize16TxWorstCase20Iters
    824,833.00 1,212.37 2.1% 0.01 Linearize32TxWorstCase15000Iters
    263,687.50 3,792.37 3.1% 0.01 Linearize32TxWorstCase5000Iters
    685,583.00 1,458.61 4.4% 0.01 Linearize48TxWorstCase15000Iters
    217,283.20 4,602.29 1.3% 0.01 Linearize48TxWorstCase5000Iters
    566,271.00 1,765.94 5.2% 0.01 :wavy_dash: Linearize64TxWorstCase15000Iters (Unstable with ~1.7 iters. Increase minEpochIterations to e.g. 17)
    207,125.00 4,828.00 5.2% 0.01 :wavy_dash: Linearize64TxWorstCase5000Iters (Unstable with ~4.9 iters. Increase minEpochIterations to e.g. 49)
    794,375.00 1,258.85 2.2% 0.01 Linearize75TxWorstCase15000Iters
    287,364.50 3,479.90 2.8% 0.01 Linearize75TxWorstCase5000Iters
    774,791.00 1,290.67 3.1% 0.01 Linearize99TxWorstCase15000Iters
    286,264.00 3,493.28 1.6% 0.01 Linearize99TxWorstCase5000Iters
    711.44 1,405,591.29 2.0% 0.01 LinearizeNoIters16TxWorstCaseAnc
    811.92 1,231,654.51 1.1% 0.01 LinearizeNoIters16TxWorstCaseLIMO
    2,209.56 452,579.03 2.0% 0.01 LinearizeNoIters32TxWorstCaseAnc
    2,791.17 358,272.43 2.1% 0.01 LinearizeNoIters32TxWorstCaseLIMO
    4,802.87 208,208.65 2.0% 0.01 LinearizeNoIters48TxWorstCaseAnc
    6,225.00 160,642.57 1.7% 0.01 LinearizeNoIters48TxWorstCaseLIMO
    8,375.67 119,393.48 2.2% 0.01 LinearizeNoIters64TxWorstCaseAnc
    11,179.56 89,448.96 1.9% 0.01 LinearizeNoIters64TxWorstCaseLIMO
    13,085.68 76,419.44 2.9% 0.01 LinearizeNoIters75TxWorstCaseAnc
    20,939.10 47,757.55 1.8% 0.01 LinearizeNoIters75TxWorstCaseLIMO
    22,215.91 45,012.79 0.4% 0.01 LinearizeNoIters99TxWorstCaseAnc
    37,518.69 26,653.38 0.8% 0.01 LinearizeNoIters99TxWorstCaseLIMO
    12,650,708.00 79.05 5.8% 0.14 :wavy_dash: LinearizeOptimallyExample00 (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10)
    70,053.57 14,274.79 1.9% 0.01 LinearizeOptimallyExample01
    63,852.80 15,661.02 1.6% 0.01 LinearizeOptimallyExample02
    63,375.00 15,779.09 1.8% 0.01 LinearizeOptimallyExample03
    7,607.93 131,441.79 1.0% 0.01 LinearizeOptimallyExample04
    78,541.60 12,732.11 6.1% 0.01 :wavy_dash: LinearizeOptimallyExample05 (Unstable with ~7.0 iters. Increase minEpochIterations to e.g. 70)
    68,958.33 14,501.51 2.5% 0.01 LinearizeOptimallyExample06
    75,253.73 13,288.38 7.2% 0.01 :wavy_dash: LinearizeOptimallyExample07 (Unstable with ~12.4 iters. Increase minEpochIterations to e.g. 124)
    55,020.90 18,174.91 9.7% 0.01 :wavy_dash: LinearizeOptimallyExample08 (Unstable with ~12.5 iters. Increase minEpochIterations to e.g. 125)
    77,760.42 12,860.01 3.2% 0.01 LinearizeOptimallyExample09
    76,808.30 13,019.43 1.6% 0.01 LinearizeOptimallyExample10
    53,569.44 18,667.36 2.1% 0.01 LinearizeOptimallyExample11
    43,802.08 22,829.96 0.8% 0.01 LinearizeOptimallyExample12
    31,438.15 31,808.49 2.4% 0.01 LinearizeOptimallyExample13
    57,560.17 17,373.13 2.1% 0.01 LinearizeOptimallyExample14
    17,059.81 58,617.31 2.3% 0.01 LinearizeOptimallyExample15
    193,585,541.00 5.17 1.3% 2.15 LinearizeOptimallyExample16
    196,138,167.00 5.10 3.0% 2.24 LinearizeOptimallyExample17
    182,834,750.00 5.47 1.8% 2.07 LinearizeOptimallyExample18
    200,634,666.00 4.98 2.2% 2.20 LinearizeOptimallyExample19

    That confirms the impression we got: 16-19 seem to just be really hard, and the later optimizations just don’t help with it (and increase per-iteration cost, for no meaningful iteration reduction).

    16-19 has similar speed with the other clusters in until bb198964e10e84eaa93a31a5b9860bad026e1dc5

    Perhaps it’s worth investigating (in a follow up) if there are more specific optimizations targetting the patterns in those clusters.

    +1

  103. instagibbs commented at 3:16 pm on September 11, 2024: member

    @ismaelsadeeq I typically overcome that with:

    0  -min-time=<milliseconds>
    1       Minimum runtime per benchmark, in milliseconds (default: 10)
    

    to make it long enough to be stable

  104. sipa force-pushed on Sep 11, 2024
  105. sipa commented at 7:08 pm on September 11, 2024: member
    I’ve pushed a new update, just replacing the example benchmarks. They’re now gathered based on real-world clusters (seen in simulations of historical mempool data) at 3 distinct commits (“track upper bound potential set for work items”, “include topological pot subsets automatically”, and “improve heuristic to decide split transaction”), picking among example with high maximum from-scratch linearization time both ones that have a short and long linearization time in case the optimal is already known (thanks, @sdaftuar!).
  106. sipa force-pushed on Sep 11, 2024
  107. DrahtBot added the label CI failed on Sep 11, 2024
  108. sipa commented at 8:23 pm on September 11, 2024: member
    plot
  109. instagibbs commented at 4:16 pm on September 12, 2024: member

    reACK a5ea6467b6a694facc1f5efe460bf30539167597

    via git range-diff master 59028de422bbf8ccf53da27f1153e343952e585f a5ea6467b6a694facc1f5efe460bf30539167597

    Just benchmark changes, with examples 11-14 added in 2fcfb4c0392a2f3114195e6e9979c1431798632c and 15-19 in 8fbc6f3c33c6a5b23314279f0856de0625cc6d3f

    New benchmarks all run for each commit properly post-introduction. my benchmark runs for comparison starting from introduction commit to tip: https://gist.github.com/instagibbs/058f51587074187cb78f5f9b9327442e

  110. DrahtBot requested review from ismaelsadeeq on Sep 12, 2024
  111. clusterlin bench: have low/high iter benchmarks instead of per-iter e4faea9ca7
  112. clusterlin: separate initial search entries per component (optimization)
    Before this commit, the worst case for linearization involves clusters which
    break apart in several smaller components after the first candidate is
    included in the output linearization.
    
    Address this by never considering work items that span multiple components
    of what remains of the cluster.
    85a285a306
  113. clusterlin: add reordering support for DepGraph
    Add a DepGraph(depgraph, reordering) function that constructs a new DepGraph
    corresponding to an old one, but with its transactions is a modified order
    (given as a vector from old to new positions).
    
    Also use this reordering feature inside DepGraphFormatter::Unser, which needs
    a small modification so that its reordering mapping is old-to-new (rather than
    the new-to-old it used before).
    b80e6dfe78
  114. clusterlin: use feerate-sorted depgraph in SearchCandidateFinder
    This is a requirement for a future commit, which will rely on quickly iterating
    over transaction sets in decreasing individual feerate order.
    9e43e4ce10
  115. clusterlin: track upper bound potential set for work items (optimization)
    In each work item, keep track of a conservative overestimate of the best
    possible feerate that can be reached from it, and then use these to avoid
    exploring hopeless work items.
    2965fbf203
  116. clusterlin bench: add example hard cluster benchmarks
    Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>
    6060a948ca
  117. clusterlin: reduce computation of unnecessary pot sets (optimization)
    Keep track of which transactions in the graph have an individual
    feerate that is better than the best included set so far. Others do not
    need to be added to the pot set, as they cannot possibly help beating
    best.
    e20fda77a2
  118. clusterlin: include topological pot subsets automatically (optimization)
    Automatically add topologically-valid subsets of the potential set pot
    to inc. It can be proven that these must be part of the best reachable
    topologically-valid set from that work item.
    
    This is a crucial optimization that (apparently) reduces the maximum
    number of iterations from ~2^(N-1) to ~sqrt(2^N).
    
    Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>
    71f2629398
  119. clusterlin: improve heuristic to decide split transaction (optimization)
    Empirically, this approach seems to be more efficient in common real-life
    clusters, and does not change the worst case.
    
    Co-Authored-By: Suhas Daftuar <sdaftuar@gmail.com>
    bd044356ed
  120. clusterlin: only start/use search when enough iterations left 9ad2fe7e69
  121. sipa force-pushed on Sep 12, 2024
  122. DrahtBot removed the label CI failed on Sep 13, 2024
  123. sdaftuar commented at 3:37 pm on September 13, 2024: member

    I have been gathering more data to try to measure the effect of the improvements in this PR. I decided to take my cluster mempool branch (#28676), combined with my simulation environment for playing back historical data, and measure the effect of a series of 4 commits from this PR on the number of clusters encountered in 2023 that would NOT have been optimally linearized, for each of 5 different values of the maximum iterations we’d permit when linearizing.

    I plotted these values on a log-scale, so we can more easily see the effect of the optimizations here. Each data series is a different number of maximum iterations allotted to the linearization algorithm, and I’ve plotted how the number of clusters that are non-optimally sorted largely decreases with successive commits in this PR:

    Optimal linearizations 2023 by commit and iteration count

  124. instagibbs commented at 4:21 pm on September 13, 2024: member

    @sdaftuar is that number of iterations per search invocation, or total over the entire linearization?

    Could a version of this graph show the total fraction of transactions that are optimally linearized? Or maybe just do a chart where a cluster min size of is used to ignore the many small clusters we know we can optimally linearize? (but I guess if they were easy to linearize, that would be shown in the upper right with 1M?)

  125. instagibbs approved
  126. instagibbs commented at 5:12 pm on September 13, 2024: member

    reACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09

    just a rebase on master to take up Revert "build: Minimize I/O operations in GenerateHeaderFrom{Json,Raw}.cmake" I presume

    via git range-diff master a5ea6467b6a694facc1f5efe460bf30539167597 9ad2fe7e69e9e69949ebbb280a15756dc3301f09

  127. sdaftuar commented at 7:49 pm on September 13, 2024: member

    @sdaftuar is that number of iterations per search invocation, or total over the entire linearization?

    This is the total over the entire linearization, not each search invocation.

    Could a version of this graph show the total fraction of transactions that are optimally linearized?

    This seems a bit difficult to state, since a transaction may be part of many clusters over its lifetime, so I’m not sure precisely what we’d want to measure.

    Or maybe just do a chart where a cluster min size of is used to ignore the many small clusters we know we can optimally linearize? (but I guess if they were easy to linearize, that would be shown in the upper right with 1M?)

    Yeah, the idea with the first commit shown is to give a baseline understanding of what fraction of clusters are non-trivial to linearize in the first place.

  128. gmaxwell commented at 8:46 pm on September 13, 2024: contributor

    @sdaftuar

    Iteration counts as the metric overstates the improvements from things that reduce iteration counts but increase work per iteration, and understates the improvements from things that only make iterations faster. But I totally get why cutting off based on time would be unworkable from a practical benchmarking perspective and the differences are large enough that this is still very interesting.

    I thought it was interesting that the 100k and 1m lines get slightly worse in the last commit. I wonder if there are other clusters that share something in common with example 11. – I don’t think the behavior is a concern, but perhaps it could identify some opportunity for further optimization.

  129. glozow commented at 2:26 pm on September 15, 2024: member

    My graph on a log scale, looks similar to sipa’s:

    image

    Results txt: 2024-09-14-bench.txt

  130. in src/bench/cluster_linearize.cpp:297 in 9ad2fe7e69
    298-BENCHMARK(LinearizePerIter99TxWorstCase, benchmark::PriorityLevel::HIGH);
    299+// The following example clusters were constructed by replaying historical mempool activity, and
    300+// selecting for ones that take many iterations (after the introduction of some but not all
    301+// linearization algorithm optimizations).
    302+
    303+/* 2023-05-05T23:12:21Z 71, 521780, 543141,*/
    


    glozow commented at 9:17 pm on September 15, 2024:

    Are these comments in the format timestamp, tx count, fee, size (or similar)? Or are they actually something like iteration count? I added a print statement to get feerates of clusters but got something slightly different.

     0total feerate: 4818977 / 54585, 71 txns
     1total feerate: 15029636 / 99008, 81 txns
     2total feerate: 1288344 / 98145, 90 txns
     3total feerate: 15755611 / 77992, 87 txns
     4total feerate: 429736 / 8245, 35 txns
     5total feerate: 695118 / 60056, 60 txns
     6total feerate: 941605 / 80052, 99 txns
     7total feerate: 1948779 / 51103, 52 txns
     8total feerate: 11583092 / 76049, 69 txns
     9total feerate: 871078 / 55437, 77 txns
    10total feerate: 6127495 / 94464, 48 txns
    11total feerate: 1176444 / 14638, 77 txns
    12total feerate: 76548 / 7724, 40 txns
    13total feerate: 10487482 / 100707, 96 txns
    14total feerate: 14575007 / 90755, 93 txns
    15total feerate: 867760 / 10558, 55 txns
    16total feerate: 1205280 / 14663, 76 txns
    17total feerate: 1493360 / 18235, 98 txns
    18total feerate: 1446604 / 17766, 99 txns
    19total feerate: 457656 / 23949, 78 txns
    

    sdaftuar commented at 9:39 pm on September 15, 2024:
    The format was timestamp (from my simulation), tx count, min iterations when running Linearize 10 times with different rng seeds and then once more when running it woth LIMO using the optimal linearization as input, and then max iterations over the same set of 11 runs.

    glozow commented at 11:47 pm on September 15, 2024:
    aha, that makes much more sense!

    instagibbs commented at 1:55 pm on September 16, 2024:
    would be worth adding that description if touched again or in followup

    sipa commented at 3:12 pm on September 16, 2024:
    Will do if I retouch.

    ismaelsadeeq commented at 4:12 pm on October 4, 2024:
    +1 on adding this in a follow-up. I can create a follow for it see https://github.com/ismaelsadeeq/bitcoin/commit/9350732b17915d315cf3ce9c744d1491c1b5d386 with the description @sdaftuar provided
  131. glozow commented at 9:19 pm on September 15, 2024: member
    reACK 9ad2fe7e69e, just have a question about the docs
  132. sdaftuar commented at 4:42 pm on September 16, 2024: member
    ACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09
  133. glozow merged this on Sep 16, 2024
  134. glozow closed this on Sep 16, 2024

  135. ismaelsadeeq commented at 4:13 pm on October 4, 2024: member
    post merge ACK 9ad2fe7e69e9e69949ebbb280a15756dc3301f09 My bench result locally was consistent with what @glozow and sipa has provided.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-30 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me