Epoch Mempool

JeremyRubin commented at 9:22 pm on October 26, 2019: contributor

Hansel and Gretel left bread crumb trails as they went but birds ate them all.

This Pull Request improves the asymptotic complexity of many of the mempool’s algorithms as well as makes them more performant in practice.

In the mempool we have a lot of algorithms which depend on rather computationally expensive state tracking. The state tracking algorithms we use make a lot of algorithms inefficient because we’re inserting into various sets that grow with the number of ancestors and descendants, or we may have to iterate multiple times of data we’ve already seen.

To address these inefficiencies, we can closely limit the maximum possible data we iterate and reject transactions above the limits.

However, a rational miner/user will purchase faster hardware and raise these limits to be able to collect more fee revenue or process more transactions. Further, changes coming from around the ecosystem (lightning, OP_SECURETHEBAG) have critical use cases which benefit when the mempool has fewer limitations.

Rather than use expensive state tracking, we can do something simpler. Like Hansel and Gretel, we can leave breadcrumbs in the mempool as we traverse it, so we can know if we are going somewhere we’ve already been. Luckily, in bitcoind there are no birds!

These breadcrumbs are a uint64_t epoch per CTxMemPoolEntry. (64 bits is enough that it will never overflow). The mempool also has a counter.

Every time we begin traversing the mempool, and we need some help tracking state, we increment the mempool’s counter. Any CTxMemPoolEntry with an epoch less than the MemPools has not yet been touched, and should be traversed and have it’s epoch set higher.

Given the one-to-one mapping between CTxMemPool entries and transactions, this is a safe way to cache data.

Using these bread-crumbs allows us to get rid of many of the std::set accumulators in favor of a std::vector as we no longer rely on the de-duplicative properties of std::set.

We can also improve many of the related algorithms to no longer make heavy use of heapstack based processing.

The overall changes are not that big. It’s 302 additions and 208 deletions. You could review it all at once, especially for a Concept Ack. But I’ve carefully done them in small steps to permit careful analysis of the changes. So I’d recommend giving the final state of the PR a read over first to get familiar with the end-goal (it won’t take long), and then step through commit-by-commit so you can be convinced of the correctness for more thorough code reviewers.

But is it good in practice? Yes! Basic benchmarks show it being 5% faster on the AssembleBlock benchmark, and 14.5% faster on MempoolEviction, but they aren’t that tough. Tougher benchmarks, as described below, show a >2x speedup.

PR #17292, adds a motivating benchmark for the Epoch Mempool.

./src/bench/bench_bitcoin –filter=ComplexMemPool –scaling=5

Before:
# Benchmark, evals, iterations, total, min, max, median
ComplexMemPool, 5, 5, 6.62715, 0.264409, 0.265441, 0.2654
After:
# Benchmark, evals, iterations, total, min, max, median
ComplexMemPool, 5, 5, 3.07116, 0.122246, 0.12351, 0.122783
This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).

Note: For scientific purity, I created it “single shot blind”. That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).

What’s the catch? The specific downsides of this approach are twofold:

We need to store an extra 8 bytes per CTxMemPoolEntry
Some of the algorithms are a little bit less obvious & are “less safe” to call concurrently (e.g., from a recursive call, not necessarily in parallel)

But for these two specific downsides:

The 8 bytes could be reduced to 4 (with code to reset all CTxMemPoolEntries near overflow, but then we would need to more carefully check that we won’t overflow the epoch in a recursive call.
The 8 bytes could be a 4 byte index into a table of 8 (or 4, or even fewer) Byte epochs + back pointer, at expense of indirection like vTxHashes. Then we would only ever need as many epochs as the things we walk per-epoch.
We can replace the epoch with an atomic, which would allow for new parallel graph traversal algorithms to build descendants more quickly
Some algorithms benefit from recursive epoch use

So overall, I felt the potential benefits outweigh the risks.

What’s next? I’ve submitted this a bit early for review to gain concept acks on the approach, independent testing, and feedback on methodologies for testing, as well as to collect more worst-case edge conditions to put into benchmarks. It’s not quite a WIP – the code passes tests and is “clean”, but it should receive intense scrutiny before accepting.

DrahtBot added the label Mempool on Oct 26, 2019

DrahtBot added the label Mining on Oct 26, 2019

DrahtBot added the label RPC/REST/ZMQ on Oct 26, 2019

DrahtBot added the label Tests on Oct 26, 2019

DrahtBot added the label TX fees and policy on Oct 26, 2019

DrahtBot added the label Validation on Oct 26, 2019

DrahtBot added the label Needs rebase on Oct 26, 2019

JeremyRubin force-pushed on Oct 26, 2019

JeremyRubin commented at 9:30 pm on October 26, 2019: contributor

Rebased :+1:

fanquake removed the label Needs rebase on Oct 26, 2019

fanquake removed the label RPC/REST/ZMQ on Oct 26, 2019

fanquake removed the label Tests on Oct 26, 2019

DrahtBot commented at 10:43 pm on October 26, 2019: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#17925 (Improve UpdateTransactionsFromBlock with Epochs by JeremyRubin)
#17791 (Remove UBSan suppressions for CTxMemPool* by hebasto)
#17786 (refactor: Nuke policy/fees->mempool circular dependencies by hebasto)
#16910 (wallet: reduce loading time by using unordered maps by achow101)
#16490 (rpc: Report reason for replaceable txpool transactions by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

JeremyRubin force-pushed on Oct 27, 2019

in src/miner.cpp:187 in 7f7f7a2fbb outdated

191+        max_element -= !not_in_block;
192+        testSet[i] = testSet[!not_in_block*max_element + not_in_block*i];
193+        i += not_in_block;
194     }
195+    // drop erased elements
196+    testSet.resize(max_element);

JeremyRubin commented at 6:17 pm on October 27, 2019:

This can be replaced by a std::remove_if and a std::erase idiom, which will be of similar behavior (remove_if keeps order stable) and same complexity. Much easier to read and less chance of bugs.

in src/miner.cpp:248 in 7f7f7a2fbb outdated

247+        // No need to add self (it) because we would filter it from the loop
248+        mempool.CalculateDescendantsVec(it, descendants);
249         // Insert all descendants (not yet in block) into the modified set
250         for (CTxMemPool::txiter desc : descendants) {
251-            if (alreadyAdded.count(desc))
252+            if (std::binary_search(alreadyAdded.cbegin(), alreadyAdded.cend(), desc, CTxMemPool::CompareIteratorByHash()))

JeremyRubin commented at 6:19 pm on October 27, 2019:

std::binary_search on a std::set takes O(N log N) because std::advance on the iterators is O(N).

To fix this, allow passing in a predicate that knows how to query alreadyAdded efficient. Other options would be to make alreadyAdded a sorted vector or make it a hashtable always. But that’s a bigger change than the predicate, which doesn’t change any data types.

JeremyRubin commented at 6:22 pm on October 27, 2019: contributor

2 cleanups to be addressed momentarily: One better use of STL and one performance regression fix.

JeremyRubin commented at 7:51 pm on October 28, 2019: contributor

Added some changes as per suggestion from @TheBlueMatt, which should make reviewing the final state easier.

The epochs are no longer stored on the stack (instead we reference them via the mempool itself) and we use a scoped stack guard in order to ensure that only one part of the code is using epochs at a time. This prevents a called function from creating an epoch guard (via assertion). While there is occasionally use for a multiple outstanding epochs, none are implemented currently. If it comes up in the future it can be special cased.

JeremyRubin commented at 11:23 pm on October 28, 2019: contributor

I just opened up #17292, which adds a motivating benchmark for the Epoch Mempool.

./src/bench/bench_bitcoin --filter=ComplexMemPool --scaling=5

Before:

0# Benchmark, evals, iterations, total, min, max, median
1ComplexMemPool, 5, 5, 6.62715, 0.264409, 0.265441, 0.2654

After:

0# Benchmark, evals, iterations, total, min, max, median
1ComplexMemPool, 5, 5, 3.07116, 0.122246, 0.12351, 0.122783

This shows a > 2x speedup on this difficult benchmark (~2.15x across total, min, max, median).

Note: For scientific purity, I created it “single shot blind”. That is, I did not modify the test at all after running it on the new branch (I had to run it on master to fix bugs / adjust the expected iters for a second parameter, and then adjust the internal params to get it to be about a second).

JeremyRubin commented at 4:29 pm on October 29, 2019: contributor

I ran the test against master, changing the parameter on number transactions. For this test, the 2x improvement seems to hold. They both seem to be quadratic fundamentally, but that may be a reflection of the test setup and not the actual upper bound.

gmaxwell commented at 9:24 pm on October 29, 2019: contributor

This seems like a moderately important area for optimization. One thing that might be worth keeping in mind: It would probably be useful to support a “fast but imprecise” accounting method for the first block template after a new block. This is really easy to do with the current code (just skip updating the costs while inserting transactions), not sure if your work would change that.

[Aside, usually when talking about improving asymptotic complexity people ignore constant factors. You might want to use different language when describing this work to avoid confusing people.]

In theory, I think the common case where confirmation includes an entire disjoint subgraph at once it should be possible to make that linear time (by basically noticing that the entire subgraph is removed, so any cost updates can be skipped). Interestingly: for attack resistance the worst case (permitted by policy) is what matters… but for revenue reasons the average case is more important.

I’m not sure if there are greater wins though from speedups or from latency hiding hacks. e.g. if one keeps a small redundant highest fee only mempool perhaps without CPFP dependency tracking at all, and uses it while the real stuff is updating in the background… then it probably doesn’t matter that much if the main mempool is taking tens of seconds.

JeremyRubin commented at 2:19 am on October 30, 2019: contributor

Thanks for the review @gmaxwell.

I think the “fast but imprecise” accounting method is not affected at all by this change, but someone who has implemented that should verify.

I’ve also used the terminology correctly RE: asymptotes. Imagine a set of O(N) transactions which each have 1 output, and a single transaction which spends all outputs. Assembling the ancestors (e.g., in CalculateMemPoolAncestors) will iterate over O(N) inputs and make O(N log N) comparisons in set entries. In the new code I’ve written, you do O(N) work as it’s O(1) (rather than O(log N)) to query if we’ve already included it.

Now, you’re correct to note that in the benchmark I gave, it’s only a constant factor. This is the purpose of the benchmark; a “believable” transaction workload. As noted above, the O(N^2) work manifests as a result of the structure of that test specifically – you have N transactions you’re inserting into the mempool, and you need to percolate the updated package info up the tree every insert. Thus, you end up making N*N traversals. Thus, short of some sort of magic metadata data structure, any implementation’s improvements will manifest lower-bounded by W(N^2).

I can generate some micro benchmarks which show that Epoch Mempool beats out the status quo in asymptotes. I figured it’s less interesting than showing it’s better in the domain of believable workloads, because while it’s easy to see that these algorithms win in the asymptotes, it’s less clear that they win on realistic workloads.

Also, some of the “composed” algorithms have two parts – one half is improved by this PR, and the other half isn’t – yet. For example in addUnchecked, we have two components: 1 builds setParentTransactions and the other calls UpdateParent on everything in setParentTransactions. Normally, building setParentTransactions is O(N log N), but with this update it becomes O(N). However, the internals of UpdateParent are an ordered set, so we gain back the log N factor. However future work can update those data structures to make UpdateParent O(1).

To your point about making subtree detaching O(N), this work actually already does that moslty. removeRecursive was previously O(N log N), with this PR it is now basically O(N) (we skip the metadata updating in removeRecursive’s call to RemoveStaged with updateDescendants = false, but removeUnchecked removes n items from a std::set so it regains a log(n) factor, but it’s a pretty conceptually easy follow up PR to make txlinksMap a std::unordered_set).

I haven’t pushed the changes I was working on w.r.t. changing the std::sets to std::unordered_sets because while conceptually simple, they have a lot more edge case in if it will be better or worse (esp when it comes to memory). So the epoch’s seemed like a better one as there’s really no downside.

DrahtBot added the label Needs rebase on Oct 30, 2019

JeremyRubin force-pushed on Oct 31, 2019

DrahtBot removed the label Needs rebase on Oct 31, 2019

gmaxwell commented at 7:34 pm on October 31, 2019: contributor

Ah! Point on asympotics accepted. You might also want to make a benchmark that shows that case! I agree that a ’toy’ case isn’t a useful benchmark, but if you believe your algo is nlogn in some case it’s a useful testing point to actually demonstrate that (e.g. and confirm that there isn’t a hidden O(N) in an underlying datastructure that breaks your performance).

JeremyRubin force-pushed on Oct 31, 2019

MarcoFalke referenced this in commit a5224be645 on Nov 1, 2019

sidhujag referenced this in commit e298a8ea3e on Nov 2, 2019

DrahtBot added the label Needs rebase on Nov 4, 2019

JeremyRubin force-pushed on Nov 4, 2019

DrahtBot removed the label Needs rebase on Nov 5, 2019

JeremyRubin force-pushed on Nov 24, 2019

JeremyRubin commented at 8:21 am on November 24, 2019: contributor

I tried rebasing – i have no idea how to clear this appveyor cache? @MarcoFalke @sipsorcery may be related to recent changes?

JeremyRubin force-pushed on Nov 24, 2019

sipsorcery commented at 9:27 am on November 24, 2019: member

@JeremyRubin the appveyor cache doesn’t need to be cleared (only time that’s required are when the vcpkg or Qt dependencies change).

The appveyor job failure here is due to a compilation error:

0C:\projects\bitcoin\src\httpserver.cpp(373,5): error C3861: 'evthread_use_windows_threads': identifier not found [C:\projects\bitcoin\build_msvc\libbitcoin_server\libbitcoin_server.vcxproj]

I’ll try a local Windows build of this PR to see if I can spot anything.

sipsorcery commented at 10:05 am on November 24, 2019: member

@JeremyRubin I pulled this PR on top of master and msbuild was good. My only guess is that something has gone wrong when you rebased.

JeremyRubin commented at 12:07 pm on November 24, 2019: contributor

Appveyor failure was before rebasing though

On Sun, Nov 24, 2019, 2:05 AM Aaron Clauson notifications@github.com wrote:

@JeremyRubin https://github.com/JeremyRubin I pulled this PR on top of master and msbuild was good. My only guess is that something has gone wrong when you rebased.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/bitcoin/bitcoin/pull/17268?email_source=notifications&email_token=AAGYN67EFBUZTHPU2GI55NLQVJGVPA5CNFSM4JFO5JWKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFAHXSA#issuecomment-557874120, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGYN64PFRDQSLHJR7MJNYTQVJGVPANCNFSM4JFO5JWA .

sipsorcery commented at 3:15 pm on November 24, 2019: member

The appveyor build has been broken on master at a few points in the recent past. It’s working on current master at 9cbd87d8ee2910ac55c215451453e5162e1c377a.

JeremyRubin force-pushed on Nov 24, 2019

JeremyRubin commented at 7:46 pm on November 24, 2019: contributor

It looks like there are a lot of failures across other branches too… will ignore for now till it’s fixed…

JeremyRubin force-pushed on Nov 25, 2019

DrahtBot added the label Needs rebase on Jan 2, 2020

Add epoch fields to Mempool fe00ccd4ec

Use epochs in UpdateTransactionsFromBlock 1fd804376c

Replace addUnchecked's de-duplication set 4aa9e539d6

Make UpdateForDescendents use epochs (part 1) c40013f547

Make UpdateForDescendants more efficient by eliminating in-between containers 7f6335969a

Restructuring the loop 71618c0471

Switch to lambda 1997a3b197

Making the lambda capture (almost) nothing local a62267c877

Fully transform UpdateForDescendants into a mostly recursive form d050078172

Make CalculateDescendents efficiently use epochs ea2f363a66

Limit heap-stack usage in CalculateDescendants by using depth restricted recursion 4b16ac9340

Expose interface for using more caching in CalculateDescendants c75ade25ed

Add CalculateDescendantsVec and use where easy to adapt 7518372306

Refactor RemoveStaged and UpdateRemoveForMempool to accept a vector instead of a setEntries 4601968f9b

use epochs in CalculateMemPoolAncestors 0feb47ddbb

Comput Ancestors more effciently abd701d09d

make CalculateAncestors use a vector<txiter> instead of a set<txiter> for ancestors output

TODO: Modify Miner

4c69c81eee

Optimize miner to take advantage of CalculateAncestors 6dd027cc8b

rename setAncestors to ancestors as we no longer use a set type e253db28fe

replace all calls to CalculateDescendants with CalculateDescendantsVec 2bae8ace1b

Remove CalculateDescendants 4dee350d07

Drop 'set' from variable names for non-set data structures, replace vector<txiter> with typedef vecEntries 5abc36813a

skip getting descendants during txToRemove if we've walked it already d007a5407a

make removeRecursive and removeForReorg more efficient in calculating txToRemove 8328c1b1e3

Remove RemoveStagedImpl and UpdateForRemoveFromMemPoolImpl as we have

updated dependents which required to use it generically.

18555982db

Inline state updating function in mempool to make macos analyzer happy 09c58a18be

Replace onlyUnconfirmed custom implementation with the STL way: remove_if/erase idiom a8a3548bbc

In miner.cpp, fix minor performance regression caused by using std::binary_search on a std::set instead of std::count. std::binary_search has to use std::advance on the passed iterators which leads to search taking O(N log N) instead of O(log N). 2c3721794b

Remove epochs from stack variables 1da99d18e7

remove extra pass over children in removeUnchecked 75b949252a

Make mempool use a Smart EpochGuard which prevents (via assertion) multiple simultaneous Mempool Epoch users. fa51a652fd

remove unused already_touched variants b25963315c

Make CalculateDescendantsVec fully stackless 8e7e658131

JeremyRubin force-pushed on Jan 7, 2020

JeremyRubin commented at 5:36 pm on January 7, 2020: contributor

Rebased

DrahtBot removed the label Needs rebase on Jan 7, 2020

in src/txmempool.cpp:383 in 4aa9e539d6 outdated

380     for (unsigned int i = 0; i < tx.vin.size(); i++) {
381         mapNextTx.insert(std::make_pair(&tx.vin[i].prevout, &tx));
382-        setParentTransactions.insert(tx.vin[i].prevout.hash);
383+        // Update ancestors with information about this tx
384+        auto maybe_it = GetIter(tx.vin[i].prevout.hash);
385+        if (maybe_it && !(*maybe_it)->already_touched(epoch)) UpdateParent(newit, *maybe_it, true);

sdaftuar commented at 1:48 pm on January 14, 2020:

So this replaces doing lookups in setParentTransactions (to deduplicate entries) with more lookups in mapTx (via GetIter) – presumably that is a win because mapTx is using an unordered map, but just want to verify that sounds right? (If mapTx were using a regular map lookup, then I might think this change could slow things down for big transactions that have lots of inputs coming from the same transaction.)

JeremyRubin commented at 8:25 pm on January 14, 2020:

Nope; we actually read the map below from GetIterSet so the access to mapTx is the same, we just do it a bit earlier.

The win is getting rid of setParentTransactions and inlining UpdateParent in this pass.

in src/txmempool.cpp:89 in c40013f547 outdated

93-            } else if (!setAllDescendants.count(childEntry)) {
94+            } else {
95                 // Schedule for later processing
96-                stageEntries.insert(childEntry);
97+                stack.push_back(childEntry);
98+                all_descendants.push_back(childEntry);

sdaftuar commented at 1:55 pm on January 14, 2020:

I think it’s possible to add the same transaction to all_descendants twice here, and thus miscalculate the statistics in the loop starting at line 98. Imagine the transaction we’re passed in (A) has two direct children, B and C, and C is also a child of B.

B and C will both be added to all_descendants at line 64. Suppose we process B before C when looping over A’s children at line 68/69. Then we’d calculate B’s children at line 75 ([C]), which will not have already been touched yet at line 77. and hence would get added to all_descendants at line 89 – a duplicate entry.

JeremyRubin commented at 8:23 pm on January 14, 2020:

Yes this is a bug; my bad. It’s fixed in some later changes I had staged for the mempool but forgot to propagate to this PR. Will push up a fix.

sdaftuar commented at 2:24 pm on January 14, 2020: member

Concept ACK. Just started my review and found a bug that I think the author has a fix for, so leaving a comment for other reviewers/posterity.

JeremyRubin commented at 3:44 am on January 15, 2020: contributor

Just noting here:

My current plan with this PR is to break it down into some smaller pieces so that the overall review burden is much less.

As such, I’m closing this PR.

see #17925 for the first part of this.

JeremyRubin closed this on Jan 15, 2020

JeremyRubin added this to the "Epoch MemPool" column in a project

laanwj referenced this in commit b2df21b32c on Feb 3, 2020

sidhujag referenced this in commit 2e207dabf6 on Feb 3, 2020

PastaPastaPasta referenced this in commit af0cfe5a8c on Apr 12, 2020

sidhujag referenced this in commit 9774b2f8db on Nov 10, 2020

sidhujag referenced this in commit e2835e9b8c on Nov 10, 2020

DrahtBot locked this on Feb 15, 2022

Epoch Mempool #17268

Conflicts