Optimize Mempool Reorg logic using Epochs, improving memory usage and runtime. #24158

JeremyRubin commented at 2:20 AM on January 26, 2022: contributor

This is a follow up PR to #21464 which improves the memory usage and runtime of the reorg update logic. The main changes are:

Use std::vector<txiter> in cacheMap instead of std::set<txiter> which makes (iirc) entries go from > ~64 bytes to just 8 bytes
Don't look up in cacheMap the updateIt N times, look it up once and directly write the vector with known length (saves resizing)
Use epochs to de-duplicate (saves N log N lookups)
trimming from setExclude in a hotter loop

Optimize Mempool Reorg logic using Epochs, improving memory usage and runtime 1eac472976

fanquake added the label Mempool on Jan 26, 2022

DrahtBot commented at 11:53 PM on January 26, 2022: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#25290 ([kernel 3a/n] Decouple CTxMemPool from ArgsManager by dongcarl)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

DrahtBot cross-referenced this on Jan 27, 2022 from issue [mempool] allow tx replacement by smaller witness by LarryRuane

DrahtBot cross-referenced this on Jan 27, 2022 from issue Use `int32_t` type for most transaction size/weight values by hebasto

mzumsande commented at 7:49 PM on February 16, 2022: contributor

What is the relationship of this PR to your #18191? Just judging from the title, they seem to do very similar things, although the code changes are not the same.

JeremyRubin commented at 7:59 PM on February 16, 2022: contributor

They were split out from an initial patch set to make it easier to review.

#18191 is the component which has some behavioral change, #24158 is a pure optimization.

JeremyRubin commented at 8:01 PM on February 16, 2022: contributor

see the pr description of #18191

There's potential for a better -- but more sophisticated -- algorithm that can be used taking advantage of epochs, but I figured it is better to do something that is simple and works first and upgrade it later as the other epoch mempool work proceeds as it makes the patches for the epoch algorithm simpler to understand, so you can consider this as preparatory work. It could either go in now if it is not controversial, or we could wait until the other patch is ready to go.

mzumsande commented at 8:14 PM on February 16, 2022: contributor

see the pr description of #18191

That's from #21464 which was merged, my question is about the older #18191.

JeremyRubin commented at 8:28 PM on February 16, 2022: contributor

oh oops.

uh yeah TBH I forgot I had that other PR open.

They should be mostly the same.

The main difference is the earlier one also applies an optimization getting rid of setExclude and just using the cache line presence instead, which ends up being redundant with the setExclude.

We can add that optimization as a separate PR since it's a little bit less obvious why it works, I left it out when I rewrote this one.

in src/txmempool.cpp:144 in 1eac472976

 154 | +        }
 155 | +        for (size_t i = 0, n_to_process = descendants.size(); i < n_to_process; ++i) {
 156 | +            const CTxMemPoolEntry& descendant = *descendants[i];
 157 | +            const CTxMemPoolEntry::Children& children = descendant.GetMemPoolChildrenConst();
 158 | +            for (const CTxMemPoolEntry& childEntry : children) {
 159 | +                cacheMap::iterator cacheIt = cachedDescendants.find(mapTx.iterator_to(childEntry));

glozow commented at 11:15 AM on February 17, 2022:

Shouldn't you first look for descendant in cachedDescendants before fetching its children? If its descendant set is available there, you wasted a cycle looking at the first generation.

JeremyRubin commented at 6:26 PM on February 18, 2022:

i think you are right on this. The patch here is a 100% behavioral match, but it does seem like one could evade caching a little because we don't check if children are cached. Can you think of any functional differences? (this code is incredibly subtle, so i'm reticent to change it).

Crypt-iQ commented at 2:22 AM on April 15, 2022:

I think it should only be done if children is non-empty - this change should also allow getting rid of the current find call in the children loop and just add straight to descendants. The current logic in master of checking cachedDescendants is a bit odd as if there are three txn's (t1 -> t2 -> t3) in a chain that depend on each other sequentially, then the find call will only get a "hit" when t3 is updateIt as it "skips" t2

in src/txmempool.cpp:156 in 1eac472976

 166 | +                            descendants.emplace_back(cacheEntry);
 167 | +                            // skip self-swap because of buggy std::swap implementations
 168 | +                            // on some platforms
 169 | +                            if (!(descendants.size() == i+2)) {
 170 | +                                std::swap(descendants[i+1], descendants.back());
 171 | +                            }

glozow commented at 11:30 AM on February 17, 2022:

IIUC, you're swapping here in order to insert all of the descendants between this child and the next entry in the descendants vector, so you can just increment i to skip to the next child in the same generation. Note that you're still moving everything downwards n times where n is the number of cached descendants. Approach-wise, perhaps a std::deque (linked list) is more appropriate if you really want the constant time insert at arbitrary position. Also, I'm not too familiar with the implementation of std::vector, but I feel like it should be optimized enough for you to feel okay using insert(i+1) without using swaps. It might even do that in the background.

Also, please add a comment because it was not immediately obvious that you're doing swaps to maintain ordering.

JeremyRubin commented at 6:14 PM on February 18, 2022:

I think you have it backwards kinda. we do not care about ordering whatsoever, we are keeping a "partition" of processed and unprocessed elements.

essentially we have a queue with processed and unprocessed elements and a pointer to where we should process next.

// start

[P1 P2 P3 P4 *U5 U6 U7 U8]

// process element

[P1 P2 P3 P4 P5 *U6 U7 U8]

// insert processed element using push back and swap
[P1 P2 P3 P4 P5 *U6 U7 U8 P9]
[P1 P2 P3 P4 P5 *P9 U7 U8 U6]
[P1 P2 P3 P4 P5 P9 *U7 U8 U6]
// insert unprocessed element
[P1 P2 P3 P4 P5 P9 *U7 U8 U6 U10]

If we were to try to do the same with inserts, it would cause N^2 behavior. As you can see, we also do not preserve order during the swap back approach.

If we were to do insert it would look like:

[P1 P2 P3 P4 P5 *U6 U7 U8]
// insert processed element using insert
[P1 P2 P3 P4 P5 *P9 U6 U7 U8 P9]
[P1 P2 P3 P4 P5 P9 *U6 U7 U8 P9]

And the shifting would cost O(N).

Deque is a good data structure, but has an awful lot of extra overhead making it a poor fit for a performance critical code section. One of the key benefits of this approach is that we keep our data structure quick to iterate/add to.

The swap approach is O(1) per element added, which is fine. If we used insert / shifting it would be O(N) per element, which would cause a quadratic blowup.

glozow commented at 11:31 AM on February 21, 2022:

By preserve ordering, I merely meant that you're keeping the unprocessed elements behind the processed ones. We're saying the same thing, sorry for being unclear.

If we were to try to do the same with inserts, it would cause N^2 behavior.

Sure, this is the case if you're shifting everything per-insert inside the vector. This is why I suggested using a deque, where it's O(1) per element. It'd be the same performance as what you're doing here, except you just call insert instead of swapping. But my concerns with the complexity/readability of the code will be gone if you just comment what you're doing here.

JeremyRubin commented at 8:38 PM on February 21, 2022:

yeah deque is same complexity, but it's not the same performance, so it makes sense to continue using a vec even if we have to do a little bit of index management. can add a comment.

in src/txmempool.cpp:132 in 1eac472976

 142 | +    std::vector<txiter> descendants(children.size());
 143 | +    size_t i = 0;
 144 | +    for (const auto& child: children) {
 145 | +        descendants[i] = mapTx.iterator_to(child);
 146 | +        ++i;
 147 | +    }

glozow commented at 11:31 AM on February 17, 2022:

What is the purpose of i here? You're re-assigning it later anyway.

Also, why can't you just do a std::transform from children to populate descendants?

JeremyRubin commented at 6:29 PM on February 18, 2022:

i is needed to index descendants which is allocated up front.

it could be a std::transform I suppose, but I don't think std::transform provides a readability improvement here.

glozow commented at 10:56 AM on February 21, 2022:

i is needed to index descendants which is allocated up front

Just push_back, and you don't need to allocate anything

JeremyRubin commented at 8:41 PM on February 21, 2022:

if i do a call to reserve instead of the sized constructor https://en.cppreference.com/w/cpp/container/vector/vector, that would work w/o extra allocations.

the sized constructor + default insert + indexed insert is a bit better IMO because there are some overheads to push_back/emplace_back in terms of updating the vec bookkeeping that the simple indexed insert above does not have, but it's all something that would need measuring as the compiler probably does an OK job at it.

do you have a strong preference around this?

in src/txmempool.cpp:194 in 1eac472976

 222 | +    // emplace into a new vector to guarantee we trim memory
 223 | +    const auto& it = cachedDescendants.emplace(std::piecewise_construct,
 224 | +            std::forward_as_tuple(updateIt),
 225 | +            std::forward_as_tuple(descendants.begin(), included_upto));
 226 | +    // swap with descendants to release it early!
 227 | +    std::vector<txiter>().swap(descendants);

glozow commented at 11:52 AM on February 17, 2022:

Would using std::shrink_to_fit() https://www.cplusplus.com/reference/vector/vector/shrink_to_fit/ "guarantee we trim" or no?

glozow commented at 11:54 AM on February 18, 2022:

Also I'm curious as to why it's necessary to free this memory here. We don't need to allocate anything in the rest of the function, and descendants goes out of scope when we return.

JeremyRubin commented at 6:20 PM on February 18, 2022:

from what you cited:

The request is non-binding, and the container implementation is free to optimize otherwise and leave the vector with a capacity greater than its size.

Ideally we would be happy trusting our shrink_to_fit, but I'd rather get exactly a trimmed vector here.

we do allocate (descendants_to_remove.insert), so it's a cautious approach to free it explicitly when it will never be used again than later.

glozow commented at 10:51 AM on February 21, 2022:

Sounds good.

And my other question - what's the point of releasing descendants early?

JeremyRubin commented at 8:42 PM on February 21, 2022:

e do allocate (descendants_to_remove.insert), so it's a cautious approach to free it explicitly when it will never be used again than later.

in src/txmempool.cpp:127 in 1eac472976

 137 | -                    descendants.insert(*cacheEntry);
 138 | +    const CTxMemPoolEntry::Children& children = updateIt->GetMemPoolChildrenConst();
 139 | +
 140 | +    // initialize to hold all direct children
 141 | +    // children guaranteed to be unique at this point
 142 | +    std::vector<txiter> descendants(children.size());

glozow commented at 12:01 PM on February 17, 2022:

actually it might make more sense to resize to the children's aggregated descendant counts. You could overestimate, but there's most likely even fewer resizes.

JeremyRubin commented at 8:43 PM on February 21, 2022:

this is a good point. But how do we know in advance of iterating what that would be? I figured we do not?

glozow commented at 12:04 PM on February 17, 2022: member

Seems like this could be an improvement, but not convinced unless there's a bench or a breakdown of the memory used.

bitcoin deleted a comment on Feb 21, 2022

DrahtBot cross-referenced this on Apr 18, 2022 from issue mempool: reduce lookups, insertions to cache in UpdateForDescendants by Crypt-iQ

Crypt-iQ commented at 1:11 AM on April 19, 2022: contributor

I think UpdateForDescendants should have the LOCKS_EXCLUDED annotation for m_epoch in txmempool.h

glozow cross-referenced this on Apr 22, 2022 from issue Change UpdateForDescendants to use Epochs by JeremyRubin

DrahtBot cross-referenced this on Jun 7, 2022 from issue [kernel 3a/n] Decouple `CTxMemPool` from `ArgsManager` by dongcarl

DrahtBot cross-referenced this on Jun 21, 2022 from issue Add a `-mempoolfullrbf` node setting by ariard

DrahtBot commented at 8:03 AM on June 29, 2022: contributor

🐙 This pull request conflicts with the target branch and needs rebase.

<sub>Want to unsubscribe from rebase notifications on this pull request? Just convert this pull request to a "draft".</sub>

DrahtBot added the label Needs rebase on Jun 29, 2022

glozow commented at 7:02 PM on October 12, 2022: member

@JeremyRubin as discussed offline, closing this for now and assigning to @stickies-v.

glozow closed this on Oct 12, 2022

bitcoin locked this on Oct 12, 2023