Deduplicate parent txid loop of requested transactions and missing parents of orphan transactions #19596

pull sdaftuar wants to merge 2 commits into bitcoin:master from sdaftuar:2020-07-dedup-inputs changing 1 file +33 −11
  1. sdaftuar commented at 4:01 am on July 27, 2020: member

    I noticed a couple of places recently where we loop over all inputs of a transaction in order to do some processing on the txids we find in those inputs. There may be thousands of inputs in a transaction, and the same txid may appear many times. In a couple of places in particular, we loop over those txids and add them to a rolling bloom filter; doing that multiple times for the same txid wastes entries in that filter.

    This PR fixes that in two places relating to transaction relay: one on the server side, where we look for parent transactions of a tx that we are delivering to a peer to ensure that getdata requests for those parents will succeed; and the other on the client side, where when we process an orphan tx we want to loop over the parent txids and ensure that all are eventually requested from the peer who provided the orphan.

    This addresses a couple of related comments left in #19109.

  2. fanquake added the label P2P on Jul 27, 2020
  3. fanquake requested review from ajtowns on Jul 27, 2020
  4. fanquake requested review from sipa on Jul 27, 2020
  5. JeremyRubin commented at 6:43 am on July 27, 2020: contributor

    It should be substantially faster and more memory-friendly for these big txns to fill the parents set as a vector of hashes, sort it, use std::unique to shift out the duplicates, and then erase the garbage.

    std::set will do O(n) allocations and use about 3x the memory compared to std::vector, and set traversal and construction are slower than for a vector.

    The only case where a set may win is if the duplicate rate is large, but the adversarial worst case is better with a vector than with a set, and I suspect the average case (not a big duplicate rate) is better too.
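
    For readers following along, a minimal standalone sketch of that vector/sort/unique approach, using a placeholder Txid type as a stand-in for Core's uint256:

        #include <algorithm>
        #include <array>
        #include <vector>

        using Txid = std::array<unsigned char, 32>; // stand-in for Core's uint256

        // Fill a vector with the (possibly duplicated) parent txids, then sort and
        // deduplicate in place: one up-front allocation and O(n log n) work,
        // versus O(n) allocations for a std::set.
        std::vector<Txid> UniqueParents(const std::vector<Txid>& prevout_hashes)
        {
            std::vector<Txid> unique_parents(prevout_hashes.begin(), prevout_hashes.end());
            std::sort(unique_parents.begin(), unique_parents.end());
            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()),
                                 unique_parents.end());
            return unique_parents;
        }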

  6. jonatack commented at 7:31 am on July 27, 2020: member

    Good ideas. There is the question of using sets for the data structure. The first commit 2898d6505a3 “Rewrite parent txid loop of requested transactions” seems to break a number of functional tests for me locally. All the tests pass again after rebasing to keep only the second commit.

     interface_rest.py                       | ✖ Failed  | 35 s
     mempool_packages.py                     | ✖ Failed  | 3 s
     p2p_segwit.py                           | ✖ Failed  | 23 s
     rpc_rawtransaction.py                   | ✖ Failed  | 9 s
     wallet_avoidreuse.py                    | ✖ Failed  | 6 s
     wallet_avoidreuse.py --descriptors      | ✖ Failed  | 7 s
     wallet_backup.py                        | ✖ Failed  | 9 s
     wallet_balance.py                       | ✖ Failed  | 15 s
     wallet_bumpfee.py                       | ✖ Failed  | 5 s
     wallet_groups.py                        | ✖ Failed  | 12 s
     wallet_listtransactions.py              | ✖ Failed  | 36 s
    
  7. DrahtBot commented at 10:34 am on July 27, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19610 (p2p: refactor AlreadyHave(), CInv::type, INV/TX processing by jonatack)
    • #19306 (refactor: Replace RecursiveMutex with Mutex in CTxMemPool by hebasto)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. ajtowns commented at 10:38 am on July 27, 2020: member

    It should be substantially faster and more memory-friendly for these big txns to fill the parents set as a vector of hashes, sort it, use std::unique to shift out the duplicates, and then erase the garbage.

    I think this is just:

     --- a/src/net_processing.cpp
     +++ b/src/net_processing.cpp
     @@ -2990,10 +2990,12 @@ void ProcessMessage(
      
                  // Deduplicate parent txids, so that we don't have to loop over
                  // the same parent txid more than once down below.
     -            std::set<uint256> unique_parents;
     +            std::vector<uint256> unique_parents;
     +            unique_parents.reserve(tx.vin.size());
                  for (const CTxIn& txin : tx.vin) {
     -                unique_parents.insert(txin.prevout.hash);
     +                unique_parents.push_back(txin.prevout.hash);
                  }
     +            std::sort(unique_parents.begin(), unique_parents.end());
     +            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()), unique_parents.end());
                  for (const uint256& parent_txid : unique_parents) {
                      if (recentRejects->contains(parent_txid)) {
                          fRejectedParents = true;
    

    I think I agree the vector/sort/unique approach is better – temporarily allocating up to twice the space it takes to store the transaction seems fine, and allocating memory all at once rather than once per input seems better; but I'm not sure it should matter all that much.

    Alternatively, could maybe do it as:

         for (const CTxIn& txin : tx.vin) {
             if (recentRejects->contains(txin.prevout.hash)) {
                 fRejectedParents = true;
             }
         }
         if (!fRejectedParents && pfrom.m_tx_relay != nullptr) {
             LOCK(pfrom.m_tx_relay->cs_tx_inventory);
             for (const CTxIn& txin : tx.vin) {
                 if (pfrom.m_tx_relay->filterInventoryKnown.contains(txin.prevout.hash)) continue;
                 if (AlreadyHave(CInv(MSG_TX | nFetchFlags, txin.prevout.hash))) continue;
                 pfrom.m_tx_relay->filterInventoryKnown.insert(txin.prevout.hash);
                 RequestTx(txin.prevout.hash);
             }
         }
    

    and rely on both recentRejects and filterInventoryKnown lookups being quick enough that duplicates don’t matter, and use filterInventoryKnown to ensure the slow ops only get queried once per tx, rather than once per prevout.

    (Travis reports a lock order violation, btw)

  9. sdaftuar force-pushed on Jul 27, 2020
  10. sdaftuar commented at 11:23 pm on July 27, 2020: member
    Thanks all for the quick review! @ajtowns I grabbed your code for using a vector instead of a set, and hopefully fixed the lock inversion problem. @jonatack I’m surprised by the test failures; is it possible the lock inversion caused that?
  11. sipa commented at 6:34 pm on July 28, 2020: member

    Code review ACK 842f908cefc33baea116bb134810fc9ff8a0ea17

    I don’t understand where the locking failure is coming from. Is it possibly fixed now?

  12. sdaftuar commented at 6:48 pm on July 28, 2020: member

    I don’t understand where the locking failure is coming from. Is it possibly fixed now?

    @sipa Yes, it should be fixed now; when I opened the PR initially, I wasn’t letting go of the mempool lock before acquiring the others in the first commit.

  13. in src/net_processing.cpp:3003 in 842f908cef outdated
    2998             for (const CTxIn& txin : tx.vin) {
    2999-                if (recentRejects->contains(txin.prevout.hash)) {
    3000+                unique_parents.push_back(txin.prevout.hash);
    3001+            }
    3002+            std::sort(unique_parents.begin(), unique_parents.end());
    3003+            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()), unique_parents.end());
    


    jonatack commented at 4:48 am on July 30, 2020:

    nit: only if you need to retouch: separating the unique operation from the erase, similar to the code example at https://en.cppreference.com/w/cpp/algorithm/unique, seems slightly clearer to me

    -            unique_parents.erase(std::unique(unique_parents.begin(), unique_parents.end()), unique_parents.end());
    +            auto last_unique_parent = std::unique(unique_parents.begin(), unique_parents.end());
    +            unique_parents.erase(last_unique_parent, unique_parents.end());
    

    jnewbery commented at 7:14 am on July 30, 2020:
    I disagree. The remove-erase (or in this case unique-erase) idiom is common in C++. Once you’ve seen it a couple of times it’s clear what it’s doing. Plus, unique leaves the last elements of the vector in an indeterminate state, so we don’t want anyone to be tempted to do anything with them between the unique and the erase.
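
    A toy example of the idiom, showing why the erase belongs immediately after the unique (the tail past the returned iterator holds unspecified values):

        #include <algorithm>
        #include <cassert>
        #include <vector>

        int main()
        {
            std::vector<int> v{1, 1, 2, 3, 3, 3};
            // std::unique moves the unique elements to the front and returns an
            // iterator past them; the tail it leaves behind holds unspecified
            // values, so erase it right away rather than inspecting it.
            v.erase(std::unique(v.begin(), v.end()), v.end());
            assert((v == std::vector<int>{1, 2, 3}));
            return 0;
        }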

    jonatack commented at 8:58 am on July 30, 2020:
    Fair.

    JeremyRubin commented at 5:34 pm on July 30, 2020:
    It’s definitely clear. The only thing that would be a bit better is if, instead of erasing, we just ranged over a Span(begin, unique_it), because then the erasing happens when the unique_parents vector is dropped. But I think the hash type is trivially destructible, so it’s not clear the erase carries any overhead, as there are no destructors to call. In any case, we only actually erase anything if there are duplicates, which I think is a sufficiently rare case (though we still need to handle it correctly) that I’m not worried there would be much to erase.
  14. jonatack commented at 4:51 am on July 30, 2020: member

    @jonatack I’m surprised by the test failures; is it possible the lock inversion caused that?

    Yes, I believe so; the local test failures (debug build, gcc Debian 10.1.0-6) stemming from the first commit were similar to the Travis one.

    ACK 842f908cefc33baea116bb134810fc9ff8a0ea17 code review (the changes are straightforward, verified among other things mempool.info vs mempool.GetMemPoolParents()), debug build/tests/ran bitcoind for a day; operation nominal.

  15. in src/net_processing.cpp:1737 in 842f908cef outdated
    1744-                if (txinfo.tx && txinfo.m_time > now - UNCONDITIONAL_RELAY_DELAY) {
    1745-                    // Relaying a transaction with a recent but unconfirmed parent.
    1746-                    if (WITH_LOCK(pfrom.m_tx_relay->cs_tx_inventory, return !pfrom.m_tx_relay->filterInventoryKnown.contains(txin.prevout.hash))) {
    1747-                        LOCK(cs_main);
    1748-                        State(pfrom.GetId())->m_recently_announced_invs.insert(txin.prevout.hash);
    1749+            std::vector<uint256> parent_ids_to_add;
    


    jnewbery commented at 7:00 am on July 30, 2020:
    Would it be better to reserve() the capacity of this vector to tx->vin.size() here (or even better to parents.size() later) to avoid repeated reallocations in the worst case?

    sdaftuar commented at 6:00 pm on August 4, 2020:
    Practically speaking the worst case here is pretty small, but sure I did the reserve() thing with parents.size().
  16. in src/net_processing.cpp:2997 in 842f908cef outdated
    2989@@ -2980,8 +2990,18 @@ void ProcessMessage(
    2990         else if (state.GetResult() == TxValidationResult::TX_MISSING_INPUTS)
    2991         {
    2992             bool fRejectedParents = false; // It may be the case that the orphans parents have all been rejected
    2993+
    2994+            // Deduplicate parent txids, so that we don't have to loop over
    2995+            // the same parent txid more than once down below.
    2996+            std::vector<uint256> unique_parents;
    2997+            unique_parents.reserve(tx.vin.size());
    


    jnewbery commented at 7:11 am on July 30, 2020:
    I think parents would be a better name here. They only become unique later once you deduplicate them.

    sdaftuar commented at 6:00 pm on August 4, 2020:
    Added a comment instead, since the first thing we do is deduplicate, and then we use this as a list of unique_parents later on.
  17. in src/net_processing.cpp:3000 in 842f908cef outdated
    2995+            // the same parent txid more than once down below.
    2996+            std::vector<uint256> unique_parents;
    2997+            unique_parents.reserve(tx.vin.size());
    2998             for (const CTxIn& txin : tx.vin) {
    2999-                if (recentRejects->contains(txin.prevout.hash)) {
    3000+                unique_parents.push_back(txin.prevout.hash);
    


    jnewbery commented at 7:20 am on July 30, 2020:
    This might be an over-optimization, but I expect that transactions which contain multiple inputs from the same parent may group those inputs together. If that’s true, then checking the prevout hash against the last element of unique_parents would prevent us from adding duplicates in many cases.

    JeremyRubin commented at 5:26 pm on July 30, 2020:

    Concept NACK on this one: I agree that this might be a good heuristic, but how commonly does a single wallet have multiple payments from the same parent? Doing an extra O(N) comparisons would probably be slower than just sorting and removing the extra elements anyway. We’ve already allocated the memory, so we may as well use it!

    If this does come up as a big common case, the approach I’d take is to call std::unique/erase before sorting and after sorting. I suspect this would be faster because all the data is inline at this point and we don’t have to contend with branch mispredictions while filling the vector.

    But I think it’s overkill :)


    jnewbery commented at 10:34 am on July 31, 2020:
    yep, agree that this is overoptimization.
  18. in src/net_processing.cpp:3020 in 842f908cef outdated
    3018                         // non-wtxid-relay peers.
    3019                         // Eventually we should replace this with an improved
    3020                         // protocol for getting all unconfirmed parents.
    3021-                        CInv _inv(MSG_TX | nFetchFlags, txin.prevout.hash);
    3022-                        pfrom.AddKnownTx(txin.prevout.hash);
    3023+                        CInv _inv(MSG_TX | nFetchFlags, parent_txid);
    


    jnewbery commented at 7:26 am on July 30, 2020:

    Since you’re already touching this line, maybe we could remove nFetchFlags here.

    AlreadyHave() and RequestTx() don’t do anything differently depending on whether the CInv is an MSG_TX or MSG_WITNESS_TX, so adding the MSG_WITNESS_FLAG is useless here. (The same is true for the nFetchFlags in the INV processing in ProcessMessages.)

    Perhaps best left for a different PR.


    jonatack commented at 2:39 pm on August 3, 2020:

    Since you’re already touching this line, maybe we could remove nFetchFlags here.

    AlreadyHave() and RequestTx() don’t do anything differently depending on whether the CInv is an MSG_TX or MSG_WITNESS_TX, so adding the MSG_WITNESS_FLAG is useless here. (The same is true for the nFetchFlags in the INV processing in ProcessMessages.)

    Perhaps best left for a different PR.

    Thanks @jnewbery, am looking at removing these in #19611.


    jonatack commented at 3:36 pm on August 4, 2020:
    Removed unused nFetchFlags from ProcessMessages TX and INV processing in #19611.

    sdaftuar commented at 5:44 pm on August 4, 2020:
    Leaving this alone, thanks.
  19. jnewbery commented at 7:30 am on July 30, 2020: member

    Looks good. A few comments inline.

    std::set will do O(n) allocations and use about 3x the memory compared to std::vector

    @JeremyRubin how did you get 3x? My understanding is that the overhead for a std::set node is ~32 bytes, which would imply double the memory usage of a vector in this case.

  20. sipa commented at 4:06 pm on July 30, 2020: member

    @jnewbery Overhead of a set over a vector is the size of 3 pointers and an int (for red black tree structure), plus 2 pointers (for malloc overhead), and that rounded up to the next multiple of 2 pointers’ worth.

    For 64-bit systems that means between 48 and 63 bytes overhead.

  21. jnewbery commented at 4:21 pm on July 30, 2020: member
    Thanks @sipa!
  22. JeremyRubin commented at 5:47 pm on July 30, 2020: contributor

    If that’s not clear for anyone reading along, because I frequently have to re-check this logic…

    the implementation we assume for std::set is a red black tree, with a layout of:

    template <typename T>
    struct node {
        int color;
        void* parent;
        void* left;
        void* right;
        T element;
    };
    

    This sums to 8 bytes * 4 (the int gets 4 bytes of padding) = 32 bytes of overhead, with no extra alignment for T required (unless there’s some external alignment requirement > 32). But in std::set, every node is individually alloc’d with malloc (as opposed to using an arena, which would reduce overhead). Malloc typically requires an extra 16 bytes of metadata (this can be higher or lower depending on the platform, not counting whatever other internal bookkeeping overhead the heap may require), and then pads the allocation out to a 16-byte boundary. If you look at MallocUsage, it’s ((s + 31) >> 4) << 4, where the 31 is really +16 (for malloc metadata) and then +15, >>4, <<4 for the padding.

    This means that for a 32 byte hash, you need an estimated 80 bytes per element in a std::set (so 2.5x, a bit better than 3x, but still excessive). The main issue IMO is less so the memory and more so the calls to malloc itself. E.g., if vector used 2x memory and std::set used 1x memory, it still might be preferable to use the vector to avoid N calls to malloc.
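
    A toy program reproducing that estimate with the same rounding as Core’s MallocUsage (the 32-byte node overhead is the assumed red-black-tree layout above, not a measured libstdc++ internal):

        #include <cstddef>
        #include <cstdio>

        // +16 bytes of assumed malloc metadata, then pad the total up to a
        // 16-byte boundary; this is the ((s + 31) >> 4) << 4 rounding above.
        std::size_t EstimatedAllocSize(std::size_t s) { return ((s + 31) >> 4) << 4; }

        int main()
        {
            const std::size_t node_overhead = 32; // color + 3 pointers, with padding
            const std::size_t element = 32;       // one 32-byte txid
            // Prints 80: roughly 2.5x the 32 bytes the element itself needs.
            std::printf("%zu\n", EstimatedAllocSize(node_overhead + element));
            return 0;
        }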

  23. jnewbery commented at 10:15 pm on July 30, 2020: member
    Thanks @JeremyRubin! Very interesting.
  24. luke-jr commented at 0:51 am on July 31, 2020: member
    Has unordered_set been evaluated?
  25. JeremyRubin commented at 4:50 am on July 31, 2020: contributor

    unordered_set… ah, the promise of O(1) accesses. I don’t think it makes sense to use one here since N is small. Here’s my overall thinking:

    This piece of code isn’t exactly adversarial. Whatever algorithm we pick, sorting or hash tables, is going to be good enough. This is mostly about picking something with a good average runtime and an acceptable worst-case runtime, and preferably low memory use on average.

    For unordered_set we use about 72 bytes per entry (32 bytes, 1 pointer, padded out to 64 bytes, and then 8 more for the buckets), and we still need to do O(n) allocations (unless we write a custom arena allocator, in which case we can save them). In exchange we get an implementation that is O(N) instead of O(N log N). However, I think the largest N is about 100kb/40b (largest standard txn divided by the size of an input), so we’re looking at about 2,500 inputs, and log2(2500) is about 11. And if you think about sorting a vector of byte arrays, that should be pretty performant compared to all the pointer chasing, hashing, and allocating that a hash map makes you do. In the common case, which is a smaller number of inputs, the log N matters even less.

    So I think a plain hash map, despite its better big-O, is going to be worse on average and not much better in the worst case, if any better at all.

    If you did want to make a simple competitive algorithm for this based on hash sets, here’s what I would do:

    1. Allocate a bit vector with 8x the number of inputs in bits, plus a vector with capacity for N elements.
    2. Get 5 x 12 bits of entropy for each entry, and check-and-set the corresponding bit in the bit vector for each, using the projection trick:
       a. if all 5 bits are already set, insert the item into a std::set or std::unordered_set named likely_dups;
       b. otherwise, push the element onto the vector.
    3. Iterate over the vector, erasing each element from the set (quit early if the set is empty).
    4. Extend the end of the vector with whatever is left in the set.

    This algorithm is correct because every item that doesn’t find all of its bits already set in the filter is definitely unique. So by step 3, the vector is unique, and the std::unordered_set has all the likely non-unique entries (and is also, conveniently, duplicate-free). We then attempt to remove from the set every element in the vector, guaranteeing the set only holds what’s not in the vector. Then we add those to the vector, and the vector contains all the unique elements of the original list.

    This algorithm works really nicely in the expected case (no collisions) because we only have to make 2 malloc calls and emplace everything into our vector after checking the filter (which should have a low probability of collision), so it should be pretty speedy! And in the case where there are a lot of duplicates, we don’t end up allocating that much memory in the likely_dups set, precisely because they are duplicates! The worst case is where every element has exactly one duplicate; then we end up doing extra work.

    However, this is a lot of work to do and get right, so I think the sort/remove technique is fine.
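
    To make that concrete, here is a hypothetical sketch of steps 1–4, with uint64_t stand-ins for txids and std::hash in place of real entropy; a toy illustration, not a proposed patch:

        #include <algorithm>
        #include <cstddef>
        #include <cstdint>
        #include <functional>
        #include <unordered_set>
        #include <vector>

        std::vector<uint64_t> FilterDedup(const std::vector<uint64_t>& elems)
        {
            // Step 1: a bit filter of 8 bits per input, plus room for N elements.
            const std::size_t nbits = std::max<std::size_t>(64, elems.size() * 8);
            std::vector<bool> filter(nbits);
            std::vector<uint64_t> unique;
            unique.reserve(elems.size());
            std::unordered_set<uint64_t> likely_dups;

            // Step 2: probe and set 5 bit positions per element.
            for (const uint64_t e : elems) {
                const uint64_t h = std::hash<uint64_t>{}(e);
                bool all_set = true;
                for (int i = 0; i < 5; ++i) {
                    const std::size_t bit = ((h >> (i * 12)) & 0xFFF) % nbits;
                    if (!filter[bit]) { all_set = false; filter[bit] = true; }
                }
                if (all_set) likely_dups.insert(e); // possible duplicate (2a)
                else unique.push_back(e);           // definitely first seen (2b)
            }
            // Step 3: drop from the set anything the vector already holds.
            for (const uint64_t e : unique) {
                if (likely_dups.empty()) break;
                likely_dups.erase(e);
            }
            // Step 4: the survivors were only filter false positives; keep them.
            unique.insert(unique.end(), likely_dups.begin(), likely_dups.end());
            return unique;
        }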

  26. DrahtBot added the label Needs rebase on Jul 31, 2020
  27. Rewrite parent txid loop of requested transactions
    Previously, we would potentially add the same txid many times to the rolling
    bloom filter of recently announced transactions to a peer, if many outputs of
    the same txid appeared as inputs in a transaction. Eliminate this problem and
    avoid redundant lookups by asking the mempool for the unique parents of a
    requested transaction.
    8196176243
  28. Deduplicate missing parents of orphan transactions
    In the logic for requesting missing parents of orphan transactions, parent
    transactions with multiple outputs being spent by the given orphan were being
    processed multiple times. Fix this by deduplicating the set of missing parent
    txids first.
    
    Co-authored-by: Anthony Towns <aj@erisian.com.au>
    4c0731f9c5
  29. sdaftuar force-pushed on Aug 4, 2020
  30. DrahtBot removed the label Needs rebase on Aug 4, 2020
  31. sdaftuar commented at 7:52 pm on August 4, 2020: member
    Rebased and addressed @jnewbery’s feedback.
  32. jonatack commented at 7:23 am on August 5, 2020: member
    ACK 4c0731f9c50b0556f8a57b912c8f295c7a9ea89c
  33. in src/net_processing.cpp:1754 in 4c0731f9c5
    1756                 }
    1757             }
    1758+            for (const uint256& parent_txid : parent_ids_to_add) {
    1759+                // Relaying a transaction with a recent but unconfirmed parent.
    1760+                if (WITH_LOCK(pfrom.m_tx_relay->cs_tx_inventory, return !pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid))) {
    1761+                    LOCK(cs_main);
    


    ajtowns commented at 4:52 am on August 6, 2020:

    It’s probably fine in practice, and it’s not a regression in this PR, but locking and unlocking cs_tx_inventory and cs_main (potentially) for each parent kind of bugs me each time I look at it, so I had a go at trying to avoid it. Came up with:

     {
         LOCK(pfrom.m_tx_relay->cs_tx_inventory);
         auto is_known = [&](uint256& parent_txid) EXCLUSIVE_LOCKS_REQUIRED(pfrom.m_tx_relay->cs_tx_inventory) { return pfrom.m_tx_relay->filterInventoryKnown.contains(parent_txid); };
         parent_ids_to_add.erase(std::remove_if(parent_ids_to_add.begin(), parent_ids_to_add.end(), is_known), parent_ids_to_add.end());
     }
     if (!parent_ids_to_add.empty()) {
         LOCK(cs_main);
         for (const uint256& parent_txid : parent_ids_to_add) {
             // Relaying a transaction with a recent but unconfirmed parent.
             State(pfrom.GetId())->m_recently_announced_invs.insert(parent_txid);
         }
     }
    

    jnewbery commented at 2:24 pm on August 6, 2020:

    Calling State() to iterate over every peer for every input txid also seems wasteful, especially since we’re already in a loop over all getdata items. I think better would be:

                if (!parent_ids_to_add.empty()) {
                    LOCK(cs_main);
                    CNodeState* node = State(pfrom.GetId());
                    for (const uint256& parent_txid : parent_ids_to_add) {
                        // Relaying a transaction with a recent but unconfirmed parent.
                        node->m_recently_announced_invs.insert(parent_txid);
                    }
                }
    

    I also don’t understand the benefit of adding the clang lock annotation to the lambda. It’s only in scope here, so it’s never going to be called from somewhere else without pfrom.m_tx_relay->cs_tx_inventory being held.

    All of this locking/unlocking of multiple mutexes goes away once peer data is consolidated into a single Peer object.


    sdaftuar commented at 4:43 pm on August 6, 2020:

    I started to do this, but this doesn’t exactly add to the readability of the code; can we punt on this to some future refactor, especially if we think existing refactors under way would change all this anyway?

    In practice there shouldn’t be more than 25 unconfirmed parents anyway, and I wouldn’t expect much lock contention either.


    ajtowns commented at 6:34 pm on August 6, 2020:
    Definitely seems fine to punt it to me

    ajtowns commented at 6:36 pm on August 6, 2020:
    @jnewbery Accessing guarded data structures in a lambda means the lambda needs to be annotated with the required lock (or an AssertLockHeld or LockAssertion added, cf. #19668), or there will be compile-time thread-safety errors. It doesn’t matter how limited the lambda’s scope is.
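
    A minimal self-contained illustration of that point, with raw clang attributes standing in for Core’s sync.h macros (compile with clang++ -fsyntax-only -Wthread-safety):

        #define CAPABILITY(x) __attribute__((capability(x)))
        #define GUARDED_BY(x) __attribute__((guarded_by(x)))
        #define ACQUIRE() __attribute__((acquire_capability()))
        #define RELEASE() __attribute__((release_capability()))
        #define EXCLUSIVE_LOCKS_REQUIRED(...) __attribute__((exclusive_locks_required(__VA_ARGS__)))

        struct CAPABILITY("mutex") Mutex {
            void lock() ACQUIRE() {}
            void unlock() RELEASE() {}
        };

        Mutex g_mutex;
        int g_counter GUARDED_BY(g_mutex) = 0;

        void Bump()
        {
            g_mutex.lock();
            // The analysis checks the lambda body as its own function, so without
            // the EXCLUSIVE_LOCKS_REQUIRED annotation it warns that g_mutex is not
            // held, even though the only call site is right here under the lock.
            auto add = [](int n) EXCLUSIVE_LOCKS_REQUIRED(g_mutex) { g_counter += n; };
            add(1);
            g_mutex.unlock();
        }
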
  34. ajtowns approved
  35. ajtowns commented at 4:57 am on August 6, 2020: member
    ACK 4c0731f9c50b0556f8a57b912c8f295c7a9ea89c
  36. laanwj commented at 6:37 pm on August 10, 2020: member
    Code review ACK 4c0731f9c50b0556f8a57b912c8f295c7a9ea89c
  37. laanwj merged this on Aug 10, 2020
  38. laanwj closed this on Aug 10, 2020

  39. sidhujag referenced this in commit b14a07762a on Aug 11, 2020
  40. deadalnix referenced this in commit 009a334cb2 on Sep 7, 2021
  41. Fabcien referenced this in commit bec895ada9 on Sep 7, 2021
  42. DrahtBot locked this on Feb 15, 2022
