Epoch Mempool #17268

pull JeremyRubin wants to merge 33 commits into bitcoin:master from JeremyRubin:mempool-experiments-2 changing 9 files +359 −241
  1. JeremyRubin commented at 9:22 pm on October 26, 2019: contributor

    Hansel and Gretel left bread crumb trails as they went but birds ate them all.

    This Pull Request improves the asymptotic complexity of many of the mempool’s algorithms as well as makes them more performant in practice.

    In the mempool we have a lot of algorithms which depend on rather computationally expensive state tracking. The state tracking algorithms we use make a lot of algorithms inefficient because we’re inserting into various sets that grow with the number of ancestors and descendants, or we may have to iterate multiple times of data we’ve already seen.

    To address these inefficiencies, we can closely limit the maximum possible data we iterate and reject transactions above the limits.

    However, a rational miner/user will purchase faster hardware and raise these limits to be able to collect more fee revenue or process more transactions. Further, changes coming from around the ecosystem (lightning, OP_SECURETHEBAG) have critical use cases which benefit when the mempool has fewer limitations.

    Rather than use expensive state tracking, we can do something simpler. Like Hansel and Gretel, we can leave breadcrumbs in the mempool as we traverse it, so we can know if we are going somewhere we’ve already been. Luckily, in bitcoind there are no birds!

    These breadcrumbs are a uint64_t epoch per CTxMemPoolEntry. (64 bits is enough that it will never overflow). The mempool also has a counter.

    Every time we begin traversing the mempool, and we need some help tracking state, we increment the mempool’s counter. Any CTxMemPoolEntry with an epoch less than the MemPools has not yet been touched, and should be traversed and have it’s epoch set higher.

    Given the one-to-one mapping between CTxMemPool entries and transactions, this is a safe way to cache data.

    Using these bread-crumbs allows us to get rid of many of the std::set accumulators in favor of a std::vector as we no longer rely on the de-duplicative properties of std::set.

    We can also improve many of the related algorithms to no longer make heavy use of heapstack based processing.

    The overall changes are not that big. It’s 302 additions and 208 deletions. You could review it all at once, especially for a Concept Ack. But I’ve carefully done them in small steps to permit careful analysis of the changes. So I’d recommend giving the final state of the PR a read over first to get familiar with the end-goal (it won’t take long), and then step through commit-by-commit so you can be convinced of the correctness for more thorough code reviewers.

    But is it good in practice? Yes! Basic benchmarks show it being 5% faster on the AssembleBlock benchmark, and 14.5% faster on MempoolEviction, but they aren’t that tough. Tougher benchmarks, as described below, show a >2x speedup.

    PR #17292, adds a motivating benchmark for the Epoch Mempool.

    ./src/bench/bench_bitcoin –filter=ComplexMemPool –scaling=5

    Before:

    # Benchmark, evals, iterations, total, min, max, median
    ComplexMemPool, 5, 5, 6.62715, 0.264409, 0.265441, 0.2654
    

    After:

    # Benchmark, evals, iterations, total, min, max, median
    ComplexMemPool, 5, 5, 3.07116, 0.122246, 0.12351, 0.122783
    

    This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).

    Note: For scientific purity, I created it “single shot blind”. That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).

    What’s the catch? The specific downsides of this approach are twofold:

    1. We need to store an extra 8 bytes per CTxMemPoolEntry
    2. Some of the algorithms are a little bit less obvious & are “less safe” to call concurrently (e.g., from a recursive call, not necessarily in parallel)

    But for these two specific downsides:

    1. The 8 bytes could be reduced to 4 (with code to reset all CTxMemPoolEntries near overflow, but then we would need to more carefully check that we won’t overflow the epoch in a recursive call.
    2. The 8 bytes could be a 4 byte index into a table of 8 (or 4, or even fewer) Byte epochs + back pointer, at expense of indirection like vTxHashes. Then we would only ever need as many epochs as the things we walk per-epoch.
    3. We can replace the epoch with an atomic, which would allow for new parallel graph traversal algorithms to build descendants more quickly
    4. Some algorithms benefit from recursive epoch use

    So overall, I felt the potential benefits outweigh the risks.

    What’s next? I’ve submitted this a bit early for review to gain concept acks on the approach, independent testing, and feedback on methodologies for testing, as well as to collect more worst-case edge conditions to put into benchmarks. It’s not quite a WIP – the code passes tests and is “clean”, but it should receive intense scrutiny before accepting.

  2. DrahtBot added the label Mempool on Oct 26, 2019
  3. DrahtBot added the label Mining on Oct 26, 2019
  4. DrahtBot added the label RPC/REST/ZMQ on Oct 26, 2019
  5. DrahtBot added the label Tests on Oct 26, 2019
  6. DrahtBot added the label TX fees and policy on Oct 26, 2019
  7. DrahtBot added the label Validation on Oct 26, 2019
  8. DrahtBot added the label Needs rebase on Oct 26, 2019
  9. JeremyRubin force-pushed on Oct 26, 2019
  10. JeremyRubin commented at 9:30 pm on October 26, 2019: contributor
    Rebased :+1:
  11. fanquake removed the label Needs rebase on Oct 26, 2019
  12. fanquake removed the label RPC/REST/ZMQ on Oct 26, 2019
  13. fanquake removed the label Tests on Oct 26, 2019
  14. DrahtBot commented at 10:43 pm on October 26, 2019: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #17925 (Improve UpdateTransactionsFromBlock with Epochs by JeremyRubin)
    • #17791 (Remove UBSan suppressions for CTxMemPool* by hebasto)
    • #17786 (refactor: Nuke policy/fees->mempool circular dependencies by hebasto)
    • #16910 (wallet: reduce loading time by using unordered maps by achow101)
    • #16490 (rpc: Report reason for replaceable txpool transactions by MarcoFalke)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  15. JeremyRubin force-pushed on Oct 27, 2019
  16. in src/miner.cpp:187 in 7f7f7a2fbb outdated
    191+        max_element -= !not_in_block;
    192+        testSet[i] = testSet[!not_in_block*max_element + not_in_block*i];
    193+        i += not_in_block;
    194     }
    195+    // drop erased elements
    196+    testSet.resize(max_element);
    


    JeremyRubin commented at 6:17 pm on October 27, 2019:
    This can be replaced by a std::remove_if and a std::erase idiom, which will be of similar behavior (remove_if keeps order stable) and same complexity. Much easier to read and less chance of bugs.
  17. in src/miner.cpp:248 in 7f7f7a2fbb outdated
    247+        // No need to add self (it) because we would filter it from the loop
    248+        mempool.CalculateDescendantsVec(it, descendants);
    249         // Insert all descendants (not yet in block) into the modified set
    250         for (CTxMemPool::txiter desc : descendants) {
    251-            if (alreadyAdded.count(desc))
    252+            if (std::binary_search(alreadyAdded.cbegin(), alreadyAdded.cend(), desc, CTxMemPool::CompareIteratorByHash()))
    


    JeremyRubin commented at 6:19 pm on October 27, 2019:

    std::binary_search on a std::set takes O(N log N) because std::advance on the iterators is O(N).

    To fix this, allow passing in a predicate that knows how to query alreadyAdded efficient. Other options would be to make alreadyAdded a sorted vector or make it a hashtable always. But that’s a bigger change than the predicate, which doesn’t change any data types.

  18. JeremyRubin commented at 6:22 pm on October 27, 2019: contributor
    2 cleanups to be addressed momentarily: One better use of STL and one performance regression fix.
  19. JeremyRubin commented at 7:51 pm on October 28, 2019: contributor

    Added some changes as per suggestion from @TheBlueMatt, which should make reviewing the final state easier.

    The epochs are no longer stored on the stack (instead we reference them via the mempool itself) and we use a scoped stack guard in order to ensure that only one part of the code is using epochs at a time. This prevents a called function from creating an epoch guard (via assertion). While there is occasionally use for a multiple outstanding epochs, none are implemented currently. If it comes up in the future it can be special cased.

  20. JeremyRubin commented at 11:23 pm on October 28, 2019: contributor

    I just opened up #17292, which adds a motivating benchmark for the Epoch Mempool.

    ./src/bench/bench_bitcoin --filter=ComplexMemPool --scaling=5

    Before:

    0# Benchmark, evals, iterations, total, min, max, median
    1ComplexMemPool, 5, 5, 6.62715, 0.264409, 0.265441, 0.2654
    

    After:

    0# Benchmark, evals, iterations, total, min, max, median
    1ComplexMemPool, 5, 5, 3.07116, 0.122246, 0.12351, 0.122783
    

    This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).

    Note: For scientific purity, I created it “single shot blind”. That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).

  21. JeremyRubin commented at 4:29 pm on October 29, 2019: contributor

    I ran the test against master, changing the parameter on number transactions. For this test, the 2x improvement seems to hold. They both seem to be quadratic fundamentally, but that may be a reflection of the test setup and not the actual upper bound.

    image

    image

  22. gmaxwell commented at 9:24 pm on October 29, 2019: contributor

    This seems like a moderately important area for optimization. One thing that might be worth keeping in mind: It would probably be useful to support a “fast but imprecise” accounting method for the first block template after a new block. This is really easy to do with the current code (just skip updating the costs while inserting transactions), not sure if your work would change that.

    [Aside, usually when talking about improving asymptotic complexity people ignore constant factors. You might want to use different language when describing this work to avoid confusing people.]

    In theory, I think the common case where confirmation includes an entire disjoint subgraph at once it should be possible to make that linear time (by basically noticing that the entire subgraph is removed, so any cost updates can be skipped). Interestingly: for attack resistance the worst case (permitted by policy) is what matters… but for revenue reasons the average case is more important.

    I’m not sure if there are greater wins though from speedups or from latency hiding hacks. e.g. if one keeps a small redundant highest fee only mempool perhaps without CPFP dependency tracking at all, and uses it while the real stuff is updating in the background… then it probably doesn’t matter that much if the main mempool is taking tens of seconds.

  23. JeremyRubin commented at 2:19 am on October 30, 2019: contributor

    Thanks for the review @gmaxwell.

    I think the “fast but imprecise” accounting method is not affected at all by this change, but someone who has implemented that should verify.

    I’ve also used the terminology correctly RE: asymptotes. Imagine a set of O(N) transactions which each have 1 output, and a single transaction which spends all outputs. Assembling the ancestors (e.g., in CalculateMemPoolAncestors) will iterate over O(N) inputs and make O(N log N) comparisons in set entries. In the new code I’ve written, you do O(N) work as it’s O(1) (rather than O(log N)) to query if we’ve already included it.

    Now, you’re correct to note that in the benchmark I gave, it’s only a constant factor. This is the purpose of the benchmark; a “believable” transaction workload. As noted above, the O(N^2) work manifests as a result of the structure of that test specifically – you have N transactions you’re inserting into the mempool, and you need to percolate the updated package info up the tree every insert. Thus, you end up making N*N traversals. Thus, short of some sort of magic metadata data structure, any implementation’s improvements will manifest lower-bounded by W(N^2).

    I can generate some micro benchmarks which show that Epoch Mempool beats out the status quo in asymptotes. I figured it’s less interesting than showing it’s better in the domain of believable workloads, because while it’s easy to see that these algorithms win in the asymptotes, it’s less clear that they win on realistic workloads.

    Also, some of the “composed” algorithms have two parts – one half is improved by this PR, and the other half isn’t – yet. For example in addUnchecked, we have two components: 1 builds setParentTransactions and the other calls UpdateParent on everything in setParentTransactions. Normally, building setParentTransactions is O(N log N), but with this update it becomes O(N). However, the internals of UpdateParent are an ordered set, so we gain back the log N factor. However future work can update those data structures to make UpdateParent O(1).

    To your point about making subtree detaching O(N), this work actually already does that moslty. removeRecursive was previously O(N log N), with this PR it is now basically O(N) (we skip the metadata updating in removeRecursive’s call to RemoveStaged with updateDescendants = false, but removeUnchecked removes n items from a std::set so it regains a log(n) factor, but it’s a pretty conceptually easy follow up PR to make txlinksMap a std::unordered_set).

    I haven’t pushed the changes I was working on w.r.t. changing the std::sets to std::unordered_sets because while conceptually simple, they have a lot more edge case in if it will be better or worse (esp when it comes to memory). So the epoch’s seemed like a better one as there’s really no downside.

  24. DrahtBot added the label Needs rebase on Oct 30, 2019
  25. JeremyRubin force-pushed on Oct 31, 2019
  26. DrahtBot removed the label Needs rebase on Oct 31, 2019
  27. gmaxwell commented at 7:34 pm on October 31, 2019: contributor
    Ah! Point on asympotics accepted. You might also want to make a benchmark that shows that case! I agree that a ’toy’ case isn’t a useful benchmark, but if you believe your algo is nlogn in some case it’s a useful testing point to actually demonstrate that (e.g. and confirm that there isn’t a hidden O(N) in an underlying datastructure that breaks your performance).
  28. JeremyRubin force-pushed on Oct 31, 2019
  29. MarcoFalke referenced this in commit a5224be645 on Nov 1, 2019
  30. sidhujag referenced this in commit e298a8ea3e on Nov 2, 2019
  31. DrahtBot added the label Needs rebase on Nov 4, 2019
  32. JeremyRubin force-pushed on Nov 4, 2019
  33. JeremyRubin force-pushed on Nov 4, 2019
  34. DrahtBot removed the label Needs rebase on Nov 5, 2019
  35. JeremyRubin force-pushed on Nov 24, 2019
  36. JeremyRubin commented at 8:21 am on November 24, 2019: contributor
    I tried rebasing – i have no idea how to clear this appveyor cache? @MarcoFalke @sipsorcery may be related to recent changes?
  37. JeremyRubin force-pushed on Nov 24, 2019
  38. sipsorcery commented at 9:27 am on November 24, 2019: member

    @JeremyRubin the appveyor cache doesn’t need to be cleared (only time that’s required are when the vcpkg or Qt dependencies change).

    The appveyor job failure here is due to a compilation error:

    0C:\projects\bitcoin\src\httpserver.cpp(373,5): error C3861: 'evthread_use_windows_threads': identifier not found [C:\projects\bitcoin\build_msvc\libbitcoin_server\libbitcoin_server.vcxproj]
    

    I’ll try a local Windows build of this PR to see if I can spot anything.

  39. sipsorcery commented at 10:05 am on November 24, 2019: member
    @JeremyRubin I pulled this PR on top of master and msbuild was good. My only guess is that something has gone wrong when you rebased.
  40. JeremyRubin commented at 12:07 pm on November 24, 2019: contributor

    Appveyor failure was before rebasing though

    On Sun, Nov 24, 2019, 2:05 AM Aaron Clauson notifications@github.com wrote:

    @JeremyRubin https://github.com/JeremyRubin I pulled this PR on top of master and msbuild was good. My only guess is that something has gone wrong when you rebased.

    — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/17268?email_source=notifications&email_token=AAGYN67EFBUZTHPU2GI55NLQVJGVPA5CNFSM4JFO5JWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFAHXSA#issuecomment-557874120, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGYN64PFRDQSLHJR7MJNYTQVJGVPANCNFSM4JFO5JWA .

  41. sipsorcery commented at 3:15 pm on November 24, 2019: member
    The appveyor build has been broken on master at a few points in the recent past. It’s working on current master at 9cbd87d8ee2910ac55c215451453e5162e1c377a.
  42. JeremyRubin force-pushed on Nov 24, 2019
  43. JeremyRubin commented at 7:46 pm on November 24, 2019: contributor
    It looks like there are a lot of failures across other branches too… will ignore for now till it’s fixed…
  44. JeremyRubin force-pushed on Nov 25, 2019
  45. DrahtBot added the label Needs rebase on Jan 2, 2020
  46. Add epoch fields to Mempool fe00ccd4ec
  47. Use epochs in UpdateTransactionsFromBlock 1fd804376c
  48. Replace addUnchecked's de-duplication set 4aa9e539d6
  49. Make UpdateForDescendents use epochs (part 1) c40013f547
  50. Make UpdateForDescendants more efficient by eliminating in-between containers 7f6335969a
  51. Restructuring the loop 71618c0471
  52. Switch to lambda 1997a3b197
  53. Making the lambda capture (almost) nothing local a62267c877
  54. Fully transform UpdateForDescendants into a mostly recursive form d050078172
  55. Make CalculateDescendents efficiently use epochs ea2f363a66
  56. Limit heap-stack usage in CalculateDescendants by using depth restricted recursion 4b16ac9340
  57. Expose interface for using more caching in CalculateDescendants c75ade25ed
  58. Add CalculateDescendantsVec and use where easy to adapt 7518372306
  59. Refactor RemoveStaged and UpdateRemoveForMempool to accept a vector instead of a setEntries 4601968f9b
  60. use epochs in CalculateMemPoolAncestors 0feb47ddbb
  61. Comput Ancestors more effciently abd701d09d
  62. make CalculateAncestors use a vector<txiter> instead of a set<txiter> for ancestors output
    TODO: Modify Miner
    4c69c81eee
  63. Optimize miner to take advantage of CalculateAncestors 6dd027cc8b
  64. rename setAncestors to ancestors as we no longer use a set type e253db28fe
  65. replace all calls to CalculateDescendants with CalculateDescendantsVec 2bae8ace1b
  66. Remove CalculateDescendants 4dee350d07
  67. Drop 'set' from variable names for non-set data structures, replace vector<txiter> with typedef vecEntries 5abc36813a
  68. skip getting descendants during txToRemove if we've walked it already d007a5407a
  69. make removeRecursive and removeForReorg more efficient in calculating txToRemove 8328c1b1e3
  70. Remove RemoveStagedImpl and UpdateForRemoveFromMemPoolImpl as we have
    updated dependents which required to use it generically.
    18555982db
  71. Inline state updating function in mempool to make macos analyzer happy 09c58a18be
  72. Replace onlyUnconfirmed custom implementation with the STL way: remove_if/erase idiom a8a3548bbc
  73. In miner.cpp, fix minor performance regression caused by using std::binary_search on a std::set instead of std::count. std::binary_search has to use std::advance on the passed iterators which leads to search taking O(N log N) instead of O(log N). 2c3721794b
  74. Remove epochs from stack variables 1da99d18e7
  75. remove extra pass over children in removeUnchecked 75b949252a
  76. Make mempool use a Smart EpochGuard which prevents (via assertion) multiple simultaneous Mempool Epoch users. fa51a652fd
  77. remove unused already_touched variants b25963315c
  78. Make CalculateDescendantsVec fully stackless 8e7e658131
  79. JeremyRubin force-pushed on Jan 7, 2020
  80. JeremyRubin commented at 5:36 pm on January 7, 2020: contributor
    Rebased
  81. DrahtBot removed the label Needs rebase on Jan 7, 2020
  82. in src/txmempool.cpp:383 in 4aa9e539d6 outdated
    380     for (unsigned int i = 0; i < tx.vin.size(); i++) {
    381         mapNextTx.insert(std::make_pair(&tx.vin[i].prevout, &tx));
    382-        setParentTransactions.insert(tx.vin[i].prevout.hash);
    383+        // Update ancestors with information about this tx
    384+        auto maybe_it = GetIter(tx.vin[i].prevout.hash);
    385+        if (maybe_it && !(*maybe_it)->already_touched(epoch)) UpdateParent(newit, *maybe_it, true);
    


    sdaftuar commented at 1:48 pm on January 14, 2020:
    So this replaces doing lookups in setParentTransactions (to deduplicate entries) with more lookups in mapTx (via GetIter) – presumably that is a win because mapTx is using an unordered map, but just want to verify that sounds right? (If mapTx were using a regular map lookup, then I might think this change could slow things down for big transactions that have lots of inputs coming from the same transaction.)

    JeremyRubin commented at 8:25 pm on January 14, 2020:

    Nope; we actually read the map below from GetIterSet so the access to mapTx is the same, we just do it a bit earlier.

    The win is getting rid of setParentTransactions and inlining UpdateParent in this pass.

  83. in src/txmempool.cpp:89 in c40013f547 outdated
    93-            } else if (!setAllDescendants.count(childEntry)) {
    94+            } else {
    95                 // Schedule for later processing
    96-                stageEntries.insert(childEntry);
    97+                stack.push_back(childEntry);
    98+                all_descendants.push_back(childEntry);
    


    sdaftuar commented at 1:55 pm on January 14, 2020:

    I think it’s possible to add the same transaction to all_descendants twice here, and thus miscalculate the statistics in the loop starting at line 98. Imagine the transaction we’re passed in (A) has two direct children, B and C, and C is also a child of B.

    B and C will both be added to all_descendants at line 64. Suppose we process B before C when looping over A’s children at line 68/69. Then we’d calculate B’s children at line 75 ([C]), which will not have already been touched yet at line 77. and hence would get added to all_descendants at line 89 – a duplicate entry.


    JeremyRubin commented at 8:23 pm on January 14, 2020:
    Yes this is a bug; my bad. It’s fixed in some later changes I had staged for the mempool but forgot to propagate to this PR. Will push up a fix.
  84. sdaftuar commented at 2:24 pm on January 14, 2020: member
    Concept ACK. Just started my review and found a bug that I think the author has a fix for, so leaving a comment for other reviewers/posterity.
  85. JeremyRubin commented at 3:44 am on January 15, 2020: contributor

    Just noting here:

    My current plan with this PR is to break it down into some smaller pieces so that the overall review burden is much less.

    As such, I’m closing this PR.

    see #17925 for the first part of this.

  86. JeremyRubin closed this on Jan 15, 2020

  87. JeremyRubin added this to the "Epoch MemPool" column in a project

  88. laanwj referenced this in commit b2df21b32c on Feb 3, 2020
  89. sidhujag referenced this in commit 2e207dabf6 on Feb 3, 2020
  90. PastaPastaPasta referenced this in commit af0cfe5a8c on Apr 12, 2020
  91. sidhujag referenced this in commit 9774b2f8db on Nov 10, 2020
  92. sidhujag referenced this in commit e2835e9b8c on Nov 10, 2020
  93. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-01 10:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me