Faster Input Deduplication Algorithm #14387

pull JeremyRubin wants to merge 8 commits into bitcoin:master from JeremyRubin:faster-dedup changing 7 files +320 −4
  1. JeremyRubin commented at 6:42 am on October 4, 2018: contributor

    This PR introduces a faster input deduplication algorithm.

    In the first commit, I introduce a worst case duplicate input benchmark.

    The existing code achieves the following performance.

    DuplicateInputs, 5, 10, 0.710982, 0.0140756, 0.0143986, 0.0141852

    In the next commit, I introduce a probabilistic checker which is much faster. Its analysis and use are documented extensively in the code comments. It achieves the following performance on the same benchmark.

    DuplicateInputs, 5, 10, 0.166576, 0.00329314, 0.0034048, 0.0033221

    This is about 4.3X faster.

    Future work can further improve this by moving the checks and hashing to a background thread.

    This does introduce one behavior change in which DoS error gets reported for transactions that have both duplicate and null inputs: the first error found will be the one reported.
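
    For orientation, here is a rough, self-contained sketch of the general bloom-filter-style idea (not the PR’s actual code: names are illustrative, std::hash stands in for the salted SipHash the PR uses, and the 2^21-bit / 8-hash parameters are taken from the discussion below):

    #include <cstdint>
    #include <functional>
    #include <set>
    #include <string>
    #include <tuple>
    #include <vector>

    // Illustrative stand-in for COutPoint (a txid plus an output index).
    struct OutPoint {
        std::string txid;
        uint32_t n;
        bool operator<(const OutPoint& o) const { return std::tie(txid, n) < std::tie(o.txid, o.n); }
    };

    // Returns true if any prevout appears twice. Bloom-filter fast path with an
    // exact std::set fallback; the filter never produces a false negative.
    bool HasDuplicateInputs(const std::vector<OutPoint>& prevouts)
    {
        constexpr size_t TABLE_BITS = size_t{1} << 21;  // 2^21-bit table
        constexpr int NUM_HASHES = 8;                   // bits set per prevout
        // (The PR avoids re-zeroing by reusing a dirty table; omitted here.)
        std::vector<uint64_t> table(TABLE_BITS / 64, 0);

        for (const OutPoint& out : prevouts) {
            bool all_bits_already_set = true;
            for (int k = 0; k < NUM_HASHES; ++k) {
                // Salted hash of the prevout; std::hash is only a placeholder.
                const uint64_t h = std::hash<std::string>{}(out.txid + ':' + std::to_string(out.n) + '#' + std::to_string(k));
                const uint64_t bit = h % TABLE_BITS;
                const uint64_t mask = uint64_t{1} << (bit & 63);
                if (!(table[bit >> 6] & mask)) all_bits_already_set = false;
                table[bit >> 6] |= mask;
            }
            if (all_bits_already_set) {
                // Possible duplicate (or rare false positive): do the exact check.
                std::set<OutPoint> seen;
                for (const OutPoint& o : prevouts) {
                    if (!seen.insert(o).second) return true;
                }
                return false;
            }
        }
        return false;
    }

    A real duplicate always finds all of its bits already set, so the exact set-based check is only reached for actual duplicates or rare false positives.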

  2. fanquake added the label Consensus on Oct 4, 2018
  3. JeremyRubin closed this on Oct 4, 2018

  4. JeremyRubin reopened this on Oct 4, 2018

  5. in src/bench/duplicate_inputs.cpp:91 in 5d768bf136 outdated
    86+
    87+    block.vtx.push_back(MakeTransactionRef(std::move(coinbaseTx)));
    88+    block.vtx.push_back(MakeTransactionRef(std::move(naughtyTx)));
    89+
    90+    block.hashMerkleRoot = BlockMerkleRoot(block);
    91+    
    


    MarcoFalke commented at 7:24 am on October 4, 2018:

    nit: some trailing white space here?

    Also, I removed all comments about the linter failure, because they are just distracting from the actual pull request.

  6. MarcoFalke deleted a comment on Oct 4, 2018
  7. MarcoFalke deleted a comment on Oct 4, 2018
  8. MarcoFalke deleted a comment on Oct 4, 2018
  9. instagibbs commented at 7:29 am on October 4, 2018: member
    Can you post “typical” case benchmark comparisons?
  10. DrahtBot commented at 8:38 am on October 4, 2018: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #14400 (Add Benchmark to test input de-duplication worst case by JeremyRubin)
    • #14397 (Faster duplicate input check in CheckTransaction (alternative to #14387) by sipa)
    • #14074 (Use std::unordered_set instead of set in blockfilter interface by jimpo)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  11. practicalswift commented at 8:58 am on October 4, 2018: contributor

    @JeremyRubin Impressive speedup! What is the risk-reward trade-off we’re facing with this change? More specifically: what risks do you see associated with this change to consensus critical code?

    Does the change in which DoS error gets reported for transactions which have both duplicates and null inputs have any consequences or impose any risks?

  12. in src/consensus/tx_verify.cpp:211 in 5d768bf136 outdated
    222+        //
    223+        // We also allow reusing a 'dirty' table because zeroing 300 KB
    224+        // can be expensive, and the table will operate acceptably for all of the
    225+        // transactions in a given block.
    226+        //
    227+        // Then, we iterate through the elements one by one, generated 8 salted
    


    leishman commented at 1:08 pm on October 4, 2018:
    Why 8?

    JeremyRubin commented at 11:59 pm on October 4, 2018:

    It’s my favorite number!

    It’s to get the expected work below one for the worst case as shown in the analysis.

    We could add a 9th hash (we have more bits in the hash computed) but every hash adds more memory accesses in our table.

    We could also remove a hash or two and the EV would be less than 10, which is probably acceptable too. At about 4 or 5 hashes it blows up a bit more, to an “unacceptable point” (an EV of 50 to 500 comparisons).

    Solve for x such that:

    Sum i = 0 to 24390 [ i*( i*x / 2**21)**x ] < 1

    It’s also possible to modify the algorithm such that if a false positive is hit, you do the current set based algorithm up to the conflict. I’m not sure how to analyze that though, and the current code is sufficiently simple with low enough probability of expensive scan that we don’t care that much.


    JeremyRubin commented at 3:54 am on October 5, 2018:

    More precisely, it’s one of the two choices (9 hashes or 8) that, with a 21-bit table, give expected work less than 1.

    Extracting 9 hashes from three 64-bit integers is a bit more complex code-wise, but doable.

    sorted((sum(i*(float(i)*x/ 2.0**y)**x for i in xrange(24930)) if (x*y <= 192) else ('Inf', 0, 0),y,x) for y in xrange(1,22) for x in xrange(1,20))[:10]

    [(0.10374566662377155, 21, 9), (0.4157347268068221, 21, 8), (1.9074647424172138, 21, 7), (10.226961125517702, 21, 6), (53.11778131137103, 20, 9), (65.8563129753341, 21, 5), (106.42809006254646, 20, 8), (244.15548702940336, 20, 7), (529.481130078109, 21, 4), (654.5255120331329, 20, 6)]

    Another option would be to increase the number of hashes to 16 and then use a 20-bit table, requiring a 320-bit hash. This makes the expected work about 7 comparisons in the worst case, but makes the table half as large, which reduces the constant bloat.
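
    The same search, written out less tersely for readers who find the one-liner above hard to parse (a sketch mirroring the Python; the bound 24930 and the parameters are the ones used there):

    #include <cmath>
    #include <cstdio>

    // Expected number of fall-back comparisons for x hashes and a 2^y-bit table,
    // i.e. the sum over i of i * (i*x / 2^y)^x, with the same bound as above.
    double ExpectedWork(int n, int x, int y)
    {
        const double table_bits = std::pow(2.0, y);
        double total = 0.0;
        for (int i = 0; i < n; ++i) {
            total += i * std::pow(i * static_cast<double>(x) / table_bits, x);
        }
        return total;
    }

    int main()
    {
        // With a 2^21-bit table, 8 or 9 hashes keep the expected work below 1.
        for (int x = 9; x >= 5; --x) {
            std::printf("x=%d, y=21: %g\n", x, ExpectedWork(24930, x, 21));
        }
        return 0;
    }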


    leishman commented at 6:30 am on October 5, 2018:
    Ok cool. Thanks for the analysis. 8 seems like a pretty reasonable choice. I left a comment above, but was wondering how this bloom filter compares to native unordered set implementations in the stdlib.
  13. promag commented at 11:33 pm on October 4, 2018: member
    Anyone measured -reindex with and without this change?
  14. in src/consensus/tx_verify.cpp:202 in 5d768bf136 outdated
    213+    } else{
    214+        // This duplication checking algorithm uses a probabilistic filter
    215+        // to check for collisions efficiently.
    216+        //
    217+        // This is faster than the naive construction, using a set, which
    218+        // requires more allocation and comparison of uint256s.
    


    leishman commented at 6:27 am on October 5, 2018:
    How does your implementation compare to using std::unordered_set?

    JeremyRubin commented at 6:56 am on October 5, 2018:

    See #14397 for a version that just does the obvious thing.

    For std::unordered_set, I’m clocking much worse performance for DeserializeAndCheckBlockTest and 2x worse performance for DuplicateInputs.


    sipa commented at 6:58 am on October 5, 2018:

    @leishman I tried a few alternatives:

    • Master: 13.6 ms
    • #14397: 6.3 ms
    • Using a sorted vector with SipHash’ed prevouts: 3.7 ms
    • This PR: 2.7 ms

    JeremyRubin commented at 7:05 am on October 5, 2018:
    I like sorting the siphash’d prevouts – I’m guessing you then do the expensive check if that collides? Or are you tracking the pointers when you sort too?

    sipa commented at 7:09 am on October 5, 2018:
    @JeremyRubin Yeah, just delay the expensive check until after the cheap check fails. I haven’t PR’ed that because I only have PoC, and I don’t want to overload reviewers with a series of PRs without even knowing if we want to increase complexity here at all.

    JeremyRubin commented at 7:37 am on October 5, 2018:
    For the expensive check, you can still do it in O(n) per colliding entry FYI, which is less expensive than doing the full O(n log n) expensive check given that we don’t expect colliding entries without a duplicate.
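
    As an aside, the sorted-hash alternative being discussed can be sketched as follows (illustrative, self-contained code, not sipa’s PoC; std::hash stands in for a SipHash keyed with per-run random salts):

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <set>
    #include <string>
    #include <tuple>
    #include <vector>

    struct Prevout {
        std::string txid;  // stand-in for uint256
        uint32_t n;
        bool operator<(const Prevout& o) const { return std::tie(txid, n) < std::tie(o.txid, o.n); }
    };

    // Placeholder for a salted SipHash of the prevout.
    uint64_t HashPrevout(const Prevout& p)
    {
        return std::hash<std::string>{}(p.txid) ^ (uint64_t{p.n} * 0x9e3779b97f4a7c15ULL);
    }

    bool HasDuplicateInputsSorted(const std::vector<Prevout>& prevouts)
    {
        std::vector<uint64_t> hashes;
        hashes.reserve(prevouts.size());
        for (const Prevout& p : prevouts) hashes.push_back(HashPrevout(p));

        std::sort(hashes.begin(), hashes.end());
        if (std::adjacent_find(hashes.begin(), hashes.end()) == hashes.end()) {
            return false;  // no 64-bit collision, hence no duplicate prevout
        }
        // Cheap check failed: only now run the expensive exact comparison.
        std::set<Prevout> seen;
        for (const Prevout& p : prevouts) {
            if (!seen.insert(p).second) return true;
        }
        return false;
    }

    The expensive set-based comparison only runs when two 64-bit hashes collide, which is not expected to happen unless the input list really does contain a duplicate.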
  15. jnewbery commented at 7:00 am on October 5, 2018: member

    My immediate reaction is that this seems very complex compared to a naive std::set comparison! This also pulls our SipHash implementation into consensus-critical code, which seems like a big price to pay for a performance win. I lean pretty strongly towards concept NACK.

    @JeremyRubin - your PR description talks about what this PR does, but not why. This makes block propagation faster, but do we have an understanding of how much these milliseconds matter? Is there a way we can determine whether the increased complexity introduced is a reasonable price to pay for the performance win?

    Also, +1 to @MarcoFalke’s comment here: #14387 (review) . Can reviewers please not start nitting code before there’s been a concept discussion.

  16. JeremyRubin commented at 7:10 am on October 5, 2018: contributor

    Goal was to minimize the performance regression caused by the CVE fix. I understand this is sensitive code for that reason. This code is also, in theory, generally useful in several other contexts because it is O(N). An adapted version (with different parameters) could be used to check for duplicate inputs across a large number of txns (e.g., in a mempool syncing context).

    It’s actually not thaaat complicated; it’s basically just a bloom filter. The complexity is also mostly about the performance; the correctness is somewhat easy to check.

    I don’t know if the performance win is worth it. I’ll leave that for others to determine. Just putting it out there.

  17. practicalswift commented at 7:40 am on October 5, 2018: contributor

    @JeremyRubin It is sufficiently complicated to introduce undefined behaviour in consensus critical code without any of the reviewers noticing :-)

    I’m afraid the code as it is currently formulated will trigger undefined behaviour due to shift exponents being too large.

  18. JeremyRubin commented at 7:57 am on October 5, 2018: contributor
    @practicalswift I think I fixed that – can you confirm? (and also a copy-paste error on which bit was being set :( )
  19. Add Benchmark to test input de-duplication worst case e1ed29fd90
  20. practicalswift commented at 9:24 am on October 5, 2018: contributor
  21. gmaxwell commented at 8:32 pm on October 5, 2018: contributor
    I feel like this is too much review work vs the gain.
  22. Add Probabalistic Checker for Duplicate Inputs 2871bc9a71
  23. Fix hash computations to get unique bits d2b195b201
  24. Use less error-prone code for bitset 5ff5f74f30
  25. revert tx_verify.* to master 752682d554
  26. Add probabalistic duplicate detector to CheckBlock 4ca0ed7ae8
  27. Clean up the implementation of fast deduplication checking c176a9c2c5
  28. Small clean up of comment & code layout for Duplicate Input checking 1f2477364d
  29. JeremyRubin force-pushed on Oct 6, 2018
  30. JeremyRubin commented at 2:13 am on October 6, 2018: contributor

    @gmaxwell @sipa please re-review the updated version (should you have time). @practicalswift I think I have eliminated the UB; if not, please let me know.

    In this version I have kept complexity limited in scope to validation.cpp.

    Performance-wise, this version is actually a bit better in the worst case (DuplicateInputs) compared to using the filter per-transaction, and better in an average case (DeserializeAndCheckBlockTest) compared to master.

    The simpler put-everything-in-a-vector, sort, then find-duplicates algorithm could be used here too.

    The major benefit of this approach (as amended) is that we not only detect duplicate inputs per transaction, but across the entire block at the same time. This guarantees we won’t see an in-block double spend in ConnectBlock and CheckTxInputs. This might enable us to parallelize checking that inputs exist (against a cache that tracks the index at which each output created in that block was created).

  31. gmaxwell commented at 4:39 pm on October 7, 2018: contributor

    The major benefit of this approach (as amended) is that we not only detect duplicate inputs per transaction, but across the entire block at the same time.

    One can’t do that without losing the ability to cache the check as part of the wtxid validity caching, as the simpler check could be.

    I am failing to see the argument for a gain here that even justifies reviewing the change at all.

  32. JeremyRubin commented at 2:18 am on October 9, 2018: contributor

    @gmaxwell That’s not accurate – by fixing the salt for the hash (which should be safe – I will consider this more closely), you could store three uint64_t’s per input in the wtxid validity cache and then re-insert those on the forward pass of the algorithm. To be clear, this just saves sip-hashing at the expense of memory, but you can just keep a table of precomputed sip-hashes for the inputs in the top of the mempool if it’s actually an issue. Once you have the hashes, the check itself is very inexpensive… At the very least, such a cache could be configurable.

    By checking (without inserting), as we scan, that none of the outputs (not txids) created in a transaction are already in the table, we ensure that none of the inputs spent are created later in the block. This can also be done with an additional backwards pass using a new table that only tracks TXIDs, requiring no hashing (for tx in txs.reverse(): insert_table(txid); for input in tx.inputs(): check_table(input.coutpoint.hash)). The overall expected extra work for either of these approaches is around 2x, which is reasonable with the current parameters. With this check completed, it would be possible to apply all the transactions in a block out of order. Without removing any checks or adding parallelization, this should make much of the code after CheckBlock (e.g., ConnectBlock) less fragile, because we never reach it for a block which has out-of-longchain-order transactions (which would otherwise force us to abort partially applied transactions).
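
    A rough illustration of the backwards pass in the pseudocode above, using a plain std::unordered_set instead of the probabilistic table purely for clarity (the types are illustrative stand-ins, not Core’s):

    #include <cstdint>
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct TxIn { std::string prevout_txid; uint32_t prevout_n; };
    struct Tx { std::string txid; std::vector<TxIn> vin; };

    // Walking the block's transactions in reverse while recording txids, an
    // input whose referenced txid is already in the table spends an output
    // created later in the block (or in the same transaction).
    bool SpendsLaterCreatedOutput(const std::vector<Tx>& block_txs)
    {
        std::unordered_set<std::string> later_txids;
        for (auto it = block_txs.rbegin(); it != block_txs.rend(); ++it) {
            later_txids.insert(it->txid);
            for (const TxIn& in : it->vin) {
                if (later_txids.count(in.prevout_txid) > 0) return true;
            }
        }
        return false;
    }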

    I wanted to get just the duplicates checking reviewed and accepted first, and then work on these other projects in the future. @instagibbs with this current version, it seems to be minorly worse (1.5% median-to-median) on DeserializeAndCheckBlockTest. I’m unaware if this is a representative sample of blocks or if it tells you anything about the larger performance of validating a block with respect to things like number of transactions, caches, etc., so I’m hesitant to give too much weight to this one in any case. If you have ideas for how we can write a better benchmark to test this for future work, let’s chat about it.

    DeserializeAndCheckBlockTest, 5, 160, 12.2812, 0.0153002, 0.0154073, 0.0153456
    DeserializeAndCheckBlockTest, 5, 160, 12.5319, 0.0153902, 0.0159464, 0.0155924

    Much thanks to all the reviewers who have spent their time reviewing this PR so-far, I appreciate that you took the time to review this contribution to Bitcoin.

  33. DrahtBot commented at 3:31 pm on November 6, 2018: member
  34. DrahtBot added the label Needs rebase on Nov 6, 2018
  35. MarcoFalke referenced this in commit 327129f7a6 on Nov 25, 2018
  36. gmaxwell commented at 8:11 am on November 29, 2018: contributor

    @jnewbery nothing in this PR is in the block propagation typical case anymore (not since we finally implemented the full BIP152 HB mode forwarding), so the belief that this speeds up block propagation is largely mistaken.

    “Goal was to minimize the performance regression caused by the CVE fix.” – there wasn’t one, or at least not an interesting one. The speedup to duplicate checking was later superseded by changes to allow blocks to be propagated without validating anything more than hash consistency.

  37. JeremyRubin commented at 8:34 am on November 29, 2018: contributor

    @gmaxwell see #14837 which supersedes this PR.

    In my opinion, you are incorrect that CheckBlock is not latency sensitive. Certainly there is a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template).

    This also has a non-negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.

  38. fanquake commented at 8:34 am on November 29, 2018: member
    Closing in favour of #14837.
  39. fanquake closed this on Nov 29, 2018

  40. gmaxwell commented at 8:50 am on November 29, 2018: contributor

    In my opinion, you are incorrect that CheckBlock is not latency sensitive. Certainly there is a large class of users for whom CheckBlock performance is critical (e.g., miners performing full validation before mining a new block, and miners calling testblockvalidity to get a new template).

    When it was on the critical path of propagation, the small delays involved were cumulative across the whole network. As a result, even though any one of them was not particularly interesting, the sum total could be. Without that effect, you only get the single one-shot delay. The effect of a one-shot half-millisecond delay on mining is negligible, and effort spent considering that optimization could be better spent on things like getting testblockvalidity out of the critical path, which would be an order of magnitude larger speedup, likely simpler to review and verify, and would also further moot the newly proposed benefit.

    This also has a non negligible impact on benchmarks like DeserializeAndCheckBlockTest, which suggests to me that speeding up these checks is important for reindexing, bootstrap, and other activities.

    That logic is spurious. Microbenchmarks are microbenchmarks. If it had an impact on reindexing/bootstrap it could be measured. If it did make a big impact there that would be an argument in favor of it, but unless prior profiling was erroneous, that isn’t possible.

  41. practicalswift commented at 8:58 am on November 29, 2018: contributor

    @JeremyRubin Regarding my comment about UB above. The problem was the following:

    uint64_t bit1 = 1<<((std::get<0>(h)) & 63);
    

    … is problematic due to …

    $ cling
    [cling]$ #include <cstdint>
    [cling]$ 1 << 63
    warning: shift count >= width of type [-Wshift-count-overflow]
     1 << 63
       ^  ~~
    (int) 0
    

    … which can be contrasted to …

    $ cling
    [cling]$ #include <cstdint>
    [cling]$ static_cast<uint64_t>(1) << 63
    (unsigned long) 9223372036854775808
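
    For completeness, the same point as a tiny compilable example (the Bit helper is hypothetical, not code from the PR):

    #include <cassert>
    #include <cstdint>

    // Widen to 64 bits *before* shifting; a plain `1 << 63` shifts an int,
    // which is undefined behaviour.
    inline uint64_t Bit(unsigned pos)
    {
        assert(pos < 64);
        return uint64_t{1} << pos;  // the left operand is already uint64_t
    }

    int main()
    {
        uint64_t word = 0;
        word |= Bit(63);  // well-defined: sets the most significant bit
        assert(word == 0x8000000000000000ULL);
        return 0;
    }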
    
  42. laanwj removed the label Needs rebase on Oct 24, 2019
  43. Munkybooty referenced this in commit df8134b1a7 on Jul 30, 2021
  44. MarcoFalke locked this on Dec 16, 2021
