Improve UpdateForDescendants by using Epochs and Removing CacheMap #18063

pull JeremyRubin wants to merge 2 commits into bitcoin:master from JeremyRubin:epoch-mempool-clean-split-2 changing 2 files +46 −46
  1. JeremyRubin commented at 9:27 pm on February 3, 2020: contributor

    This PR contains two refactors to the UpdateForDescendants method that could stand alone, but I think are easiest to review together. This is follow-up work from #17925 and is related to the work laid out in #17268.

    The first change refactors UpdateForDescendants to use Epochs instead of sets to de-duplicate traversal, and replaces the cache entries with a vector instead of a set. This is a straightforward win. The algorithm is a bit clever so as not to require two separate vectors for txiters that need expanding and txiters that have already been processed, but I think it’s relatively easy to reason about.
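
    A minimal sketch of that single-vector approach (my illustration, not the PR’s exact code; visited() is the epoch helper from #17925, which marks an entry and reports whether it was already seen this epoch):

        // Collect each in-mempool descendant of updateIt exactly once.
        // stage[0 .. already_traversed) holds entries whose children have
        // already been pushed; stage[already_traversed ..) still needs expanding.
        std::vector<txiter> stage;
        for (txiter child : GetMemPoolChildren(updateIt)) {
            if (!visited(child)) stage.push_back(child);
        }
        size_t already_traversed = 0;
        while (already_traversed < stage.size()) {
            const txiter it = stage[already_traversed++]; // expand the next entry
            for (txiter child : GetMemPoolChildren(it)) {
                if (!visited(child)) stage.push_back(child);
            }
        }
        // stage now lists every descendant once; apply state updates from it.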

    The second change removes the cache altogether, as it is not needed.

    The cache was an optimization to limit re-traversal in UpdateForDescendants. But because the descendants limit caps the total number of descendants, this actually doesn’t help that much. There are some common cases where having a CacheMap might be faster, but an adversary can trivially construct a worst-case example that causes equal behavior with or without a cachemap. For example, a transaction with N/2 children which each have N/2 outputs that get spent by N/2 grandchildren will cause the parent to iterate over N^2 entries with or without CacheMap: every child must be visited, and each visit enumerates that child’s N/2 grandchildren, whether from its in-mempool child set or from its cached descendant list. Post Epochs, UpdateForDescendants is sufficiently cheap that the cachemap is just extra memory overhead.
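
    For reference, the cache in question, as typedef’d in txmempool.h (shape from memory, so treat as approximate):

        // Per-entry descendant sets keyed by txiter; this is what the second
        // commit removes. The first commit changes the mapped values from
        // setEntries to vectors, which are cheaper to build and iterate.
        typedef std::map<txiter, setEntries, CompareIteratorByHash> cacheMap;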

    This extra memory overhead isn’t free, and my expectation is that getting rid of it should be faster in general. In the average case we don’t expect very long chains, so we’d be allocating memory for little benefit.

    It could be the case that the first commit causes a larger speed-up, because the upgraded cache vectors are much more efficient to build and iterate over. Once we remove them, we go back to traversing std::sets for the mempool children (before the first commit the cache entries were std::sets, which are slow to iterate). However, we also eliminate the cache-entry lookups, which are slow as well, and get rid of a bunch of allocation. My guess is that it “comes out in the wash”, and the new algorithm after both patches is superior to both the previous algorithm and the algorithm after only the first patch. The descendants limit is what makes this comfortable: we know the vecEntries stage can be no larger than the descendants limit (25), so the traversal is bounded at roughly 25*24 child-set scans.

    As future work, the stage can become a cached epoch member (since we know we have the lock), so that this function does not need to allocate on each call; but this seemed like a minor improvement, and it invites more discussion on whether pre-allocating these scratchpads is a good idea.
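
    For illustration only, one hypothetical shape for that follow-up (names are mine; nothing like this is in the PR):

        // Hypothetical: keep the stage buffer as a CTxMemPool member so that
        // UpdateForDescendants reuses its capacity instead of allocating per
        // call; callers already hold cs, so no extra synchronization needed.
        mutable std::vector<txiter> m_stage_scratch; // GUARDED_BY(cs)

        // In UpdateForDescendants, with cs held:
        //     m_stage_scratch.clear(); // keeps capacity from earlier calls
        //     ...run the traversal using m_stage_scratch as `stage`...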

  2. fanquake added the label Mempool on Feb 3, 2020
  3. DrahtBot commented at 11:41 pm on February 3, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #18191 (Epoch mempool cache regression plus remove excluded set. by JeremyRubin)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  4. JeremyRubin added this to the "Epoch MemPool" column in a project

  5. in src/txmempool.cpp:89 in 89a50ce5cb outdated
    100+        // N.B. grand_children may also be children
    101+        const CTxMemPool::setEntries& grand_children = GetMemPoolChildren(child_it);
    102+        for (const txiter grand_child_it : grand_children) {
    103+            if (visited(grand_child_it)) continue;
    104+            cacheMap::iterator cached_great_grand_children = cache.find(grand_child_it);
    105+            if (cached_great_grand_children != cache.end()) {
    


    sdaftuar commented at 7:17 pm on February 10, 2020:

    From reading this code, it looks like if you have a hit in the cache, then you wouldn’t end up adding grand_child_it to update_cache; you’d just add grand_child_it’s descendants? That looks like a bug, though I haven’t written a test to try to demonstrate it.

    I know the cache goes away in the next commit, so assuming that makes sense then this may not be worth fixing (versus just squashing the two commits together), but I think it’d just be a two-line fix, if my understanding is correct?


    JeremyRubin commented at 8:44 pm on February 10, 2020:
    I think you’re correct about that. The fix is to include grand_child_it and then swap it into the already_traversed index if there is a hit.
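
    A rough sketch of that fix (my reconstruction from the description above, reusing the names from the excerpt; not the committed code):

        // Always record the grandchild itself...
        stage.push_back(grand_child_it);
        cacheMap::iterator cached_great_grand_children = cache.find(grand_child_it);
        if (cached_great_grand_children != cache.end()) {
            // ...but on a cache hit, swap it behind the already_traversed
            // cursor so it is never expanded a second time, and pull its
            // descendants from the cache instead of walking them.
            std::swap(stage.back(), stage[already_traversed]);
            ++already_traversed;
            for (txiter great_grand_child : cached_great_grand_children->second) {
                if (!visited(great_grand_child)) stage.push_back(great_grand_child);
            }
        }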

    JeremyRubin commented at 1:23 am on February 11, 2020:
    fixed
  6. sdaftuar commented at 8:12 pm on February 10, 2020: member

    The cache was an optimization to limit re-traversal in UpdateForDescendants. But because the descendants limit caps the total number of descendants, this actually doesn’t help that much. There are some common cases where having a CacheMap might be faster, but an adversary can trivially construct a worst-case example that causes equal behavior with or without a cachemap. For example, a transaction with N/2 children which each have N/2 outputs that get spent by N/2 grandchildren will cause the parent to iterate over N^2 entries with or without CacheMap. Post Epochs, UpdateForDescendants is sufficiently cheap that the cachemap is just extra memory overhead.

    I’ve been thinking about this again, trying to remember the motivation here. I think the concern is that when you’re adding back transactions from a block, you have no guarantees about what kinds of chains could exist in that block. For instance, maybe you have a block with 4000 transactions, each a child of the one preceding it? In that situation, it seems like caching the results as you go could be faster than re-walking each time (see the rough count at the end of this comment).

    I can try to come up with some adversarial examples and benchmark them. It’s not totally obvious what to be optimizing for; I imagine that in the common case having no cache is best, but limiting worst-case behavior might be the better metric here (so that a reorg-capable-attacker isn’t sidelining honest hashpower through tricks like this).
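
    A rough count for that 4000-transaction chain (my arithmetic, not from the thread), assuming each of the K chained transactions triggers a walk over all of its descendants:

        \sum_{i=1}^{K} (K - i) = \frac{K(K-1)}{2} \approx 8 \times 10^{6} \quad\text{walk steps for } K = 4000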

  7. JeremyRubin commented at 10:14 pm on February 10, 2020: contributor

    So I think what’s confusing about it is that the cache does “hinder” one attacker, but opens up a different attack. I think what’s worse is that with this remedy we have gone from a small processing-time issue (which is sad) to a memory attack (which can cause nodes to OOM). Therefore IMO it’s not worthwhile.

    It’s also unclear to me what is getting walked in this case: are we walking things that are still in the mempool, or things that came from the block?

    In either case, the argument is roughly the same:

    Suppose I have N/2 direct children with N/2 outputs and N/2 grandchildren with N/2 inputs.

    [image: per-node traversal comparison for this construction]

    So the traversal is equal or better for every node except node -1. However, if we use a node 17 instead of node -1:

    [image: the same comparison measured at node 17 instead of node -1]

    What this shows us is the cache working to keep the runtimes consistent on this pattern, but that uncached isn’t substantially worse. If you look at these in Big-O, they come out to the same polynomial, so it’s really about the constants.

    Let’s look at another example.

    [image: linear-chain example]

    So by doing a simple linear chain, our cache can now use O(N^2) memory. Here the caching is worthless, except insofar as it is faster to traverse the cache data structure than the mempool children. For context, if the smallest transaction is 61 bytes…

    (1 MB / 61 bytes)^2 * 8 bytes ≈ 2 GB.

    This sort of attack is more concerning than the fan-out one above, because the fan-in/fan-out inputs and outputs are somewhat “self-limiting”, e.g.:

    maximize n*m subject to (61 + n*9) + (61 + n*41 + m*9) + (61 + m*41) < 1e6

    m, n = 10000, 10000

    Then we have n caches of size m, or 800 MB.

    The worst-case complexity for non-cached is probably achieved by combining a one-in-one-out chain with a fanned-out anchor. Suppose we have N/2 entries in a chain, and then N/2 in the fan-out. With or without caching, you end up with something that is N^3 (with caching: N/2 iterations over an N^2 cache; without caching: N/2 iterations over an N^2 subtree).

  8. Change UpdateForDescendants to use Epochs 057c8b4ce8
  9. Remove CacheMap from mempool UpdateForDescendants. CacheMap was an optimization to limit re-traversal in UpdateForDescendants. But because the descendants limit caps the total number of descendants, this actually doesn't help that much. There are some common cases where having a CacheMap might be faster, but an adversary can trivially construct a worst-case example that causes equal behavior with or without a cachemap. For example, a transaction with N/2 children which each have N/2 outputs that get spent by N/2 grandchildren will cause the parent to iterate over N^2 entries with or without CacheMap. Post Epochs, UpdateForDescendants is sufficiently cheap that the cachemap is just extra memory overhead. a2cae42707
  10. JeremyRubin force-pushed on Feb 11, 2020
  11. adamjonas commented at 10:55 pm on February 18, 2020: member

    To catch reviewers up, ~#18120~ #18191 has been peeled off from this because it is a more clear-cut win.

    Based on the conversation between @sdaftuar and @JeremyRubin in IRC: the cache that the second commit eliminates can allocate a lot of memory in worst-case scenarios, enough to OOM the process, while removing it gives up the cache’s protection against worst-case re-traversal. There may be an opportunity, however, to find a sweet spot where the memory tradeoff is worth the CPU benefit.

    Further investigation and possible re-architecting are still under discussion.

  12. JeremyRubin closed this on Mar 6, 2020

  13. DrahtBot locked this on Feb 15, 2022
