Optimize block validation code and block propagation speed #5835

pull gavinandresen wants to merge 3 commits into bitcoin:master from gavinandresen:tx_validation_cache2 changing 10 files +284 −26
  1. gavinandresen commented at 1:59 am on February 27, 2015: contributor

    I was running some benchmarks on 20MB blocks, and was surprised at how long block validation took even when all transactions had already been seen and were in the signature cache.

    So I moved caching up from being a low-level signature cache to a high-level “have we already validated this txid.”

    Benchmark results on a recent 365K block with all transactions already in the memory pool / signature cache:

    Old code (results from running with -debug=bench and hacked code to make sure all txns were pre-validated):

        - Verify 1839 txins: 191.66ms (0.104ms/txin)

    New code:

        - Verify 1839 txins: 2.06ms (0.001ms/txin)

    Extrapolating to full (1MB) blocks, the old code would spend about 600ms of CPU time per hop propagating blocks of already-seen transactions, while this new code will spend under 10ms.
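    As a quick sanity check of that extrapolation (the full-block txin count below is my own rough scaling of the benchmarked 365K block, not a number from the PR):

    ```cpp
    #include <cassert>

    // Scale the benchmarked 365K block (1839 txins) up to a full 1MB block,
    // then apply the measured per-txin costs quoted above.
    constexpr double kTxinsPerFullBlock = 1839.0 * (1000.0 / 365.0); // ~5038 txins
    constexpr double kOldMsPerTxin = 0.104;
    constexpr double kNewMsPerTxin = 0.001;

    constexpr double OldBlockVerifyMs() { return kTxinsPerFullBlock * kOldMsPerTxin; }
    constexpr double NewBlockVerifyMs() { return kTxinsPerFullBlock * kNewMsPerTxin; }
    ```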

    Memory usage is also about 60 times smaller:

    • Old code: (sizeof(scriptSig) + std::set overhead) × the number of txins (about 200K for the 365K block)
    • New code: an extra 4 bytes per transaction in the mempool (about 3K for the 365K block)

    I first wrote this as a standalone CTxValidationCache class, but it is simpler and more efficient to extend the mempool. The second commit in this pull request fixes two bugs in limitedmap I stumbled across when testing that code.

    REBASED to keep the old signature cache (see discussion below)
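    The “have we already validated this txid” idea can be sketched like this (hypothetical names and structure, not the PR’s actual code): the mempool records the SCRIPT_VERIFY_* flags each transaction was validated with, and block validation skips full script checks when the cached flags cover every flag required now.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <string>

    using TxId = std::string;

    class ValidationFlagCache {
        std::map<TxId, uint32_t> validatedFlags; // txid -> flags used at mempool entry
    public:
        void MarkValidated(const TxId& txid, uint32_t flags) {
            validatedFlags[txid] = flags;
        }
        // True if a prior validation already covered every flag required now;
        // only then may block validation skip the expensive script checks.
        bool AlreadyValidated(const TxId& txid, uint32_t requiredFlags) const {
            auto it = validatedFlags.find(txid);
            return it != validatedFlags.end() &&
                   (it->second & requiredFlags) == requiredFlags;
        }
    };
    ```

    Note that input availability (HaveInputs) must still be checked per block; only the script/signature work is skippable, which is exactly the double-spend subtlety discussed below.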

  2. gmaxwell commented at 2:14 am on February 27, 2015: contributor

    My recollection was that the prior code cached at the level it did to avoid a DoS attack where you have a transaction with 1000 valid signatures, and one invalid one that you repeat over and over again?

    Is the thinking that the DoS stuff is now adequate to protect against that?

    Any idea from profiling where all that time is going, absent the cache?

  3. gavinandresen commented at 2:17 am on February 27, 2015: contributor

    @gmaxwell : From an email I sent to Sergio earlier today to see if he can think of an attack from getting rid of the low-level signature cache:

    It is easy for an attacker to force a CHECKSIG cache miss; they can just use CODESEPARATOR (at the cost of an extra byte in whatever Script they’re using to try to CPU DoS).

    Free-floating p2sh transactions are limited to 15 sigops by AreInputsStandard(), so an attacker can’t CPU exhaust the transaction validation code.

    There is an economic disincentive for attackers to create valid-POW but very-expensive-to-validate blocks (those blocks will propagate very slowly), and, again, the sigops-per-block limit will also mitigate possible attacks.

  4. gmaxwell commented at 2:25 am on February 27, 2015: contributor

    WRT POW, I was worrying about the loose-transaction case, not attacks in blocks. The attack is: the attacker has 1000 valid signatures and 1 invalid signature; the time is wasted on the valid signatures, and the invalid one is just there to avoid paying fees. I don’t see a way to cheaply invalidate the cache on the valid signatures.

    The P2SH limit is per input; I’m pretty sure you can get 1000 checksigs into an IsStandard transaction, you’d just use 66 inputs with 15 checksigs each. With unusual scriptPubKeys (OP_DUP-ing the signatures) such a transaction would only be under 15kb, I’d guess.

    [I think at worst this might just mean having a reason to keep a plain signature cache around. (Also) caching at a higher level makes sense if it saves that much time; though I’m surprised at that!]
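    The arithmetic behind that point, spelled out (constant names below are illustrative, mirroring the per-input standardness rule in AreInputsStandard()):

    ```cpp
    #include <cassert>

    // The 15-sigop standardness limit applies per P2SH input, not per
    // transaction, so a transaction with many inputs can still be
    // "standard" while carrying on the order of 1000 CHECKSIGs.
    constexpr int kMaxP2SHSigopsPerInput = 15;

    constexpr int MaxStandardSigops(int nInputs) {
        return nInputs * kMaxP2SHSigopsPerInput;
    }
    ```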

  5. gavinandresen commented at 2:43 am on February 27, 2015: contributor

    The 1,000-valid-plus-one-invalid attack will get the attacker DoS-banned for sending an invalid transaction.

    They would have to Sybil and reconnect to get around that with either the old or new code; with the old code, they would have to pre-generate more valid signatures than the size of the signature cache AND Sybil to succeed in wasting CPU time. You’re right, the new code makes the attack slightly easier.

    But I think getting rid of the low-level cache is the right thing to do just because of the memory savings.

  6. gmaxwell commented at 3:48 am on February 27, 2015: contributor

    Hm. So the caching isn’t parametrized by the state of the CCoinsView cache, and validity can be conditional on that state; but it’s not clear to me that it matters.

    What happens with this sequence of events?

    The mempool contains transaction B, which spends input X. A block arrives containing transactions A and B, and both A and B spend X. A wasn’t in the mempool and will be a cache miss. B is in the mempool, cached with inputs.HaveInputs(tx) passing; but it should not pass in the context of a block containing transaction A. This would result in accepting blocks containing double-spend pairs where half is in the mempool; I believe it’s currently prevented only by a redundant test in ConnectBlock.

    It would be good to have a test for this; I could see this invariant being broken very easily.
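    The invariant being asked for can be captured in miniature (a hypothetical helper, not the actual Bitcoin Core test): even when a transaction’s scripts were pre-validated in the mempool, block connection must still reject any input spent twice within the same block.

    ```cpp
    #include <cassert>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    using OutPoint = std::pair<std::string, int>; // (txid, vout)

    struct SimpleTx {
        std::string txid;
        std::vector<OutPoint> inputs;
    };

    // Returns false if any input is spent twice within the block, regardless
    // of whether the transactions' scripts were cached as already-valid.
    bool NoInBlockDoubleSpend(const std::vector<SimpleTx>& block) {
        std::set<OutPoint> spent;
        for (const auto& tx : block)
            for (const auto& in : tx.inputs)
                if (!spent.insert(in).second) return false; // already spent
        return true;
    }
    ```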

  7. gmaxwell commented at 9:08 am on February 27, 2015: contributor

    It appears that most of the performance comes from avoiding the /seemingly/ redundant interrogation of the CCoinsViewCache in CheckInputs:

     diff --git a/src/main.cpp b/src/main.cpp
     index b6a61f7..6590b98 100644
     --- a/src/main.cpp
     +++ b/src/main.cpp
     @@ -1436,11 +1436,6 @@ bool CheckInputs(const CTransaction& tx, CValidationState &state, const CCoinsVi
              if (pvChecks)
                  pvChecks->reserve(tx.vin.size());
     
     -        // This doesn't trigger the DoS code on purpose; if it did, it would make it easier
     -        // for an attacker to attempt to split the network.
     -        if (!inputs.HaveInputs(tx))
     -            return state.Invalid(error("CheckInputs(): %s inputs unavailable", tx.GetHash().ToString()));
     -
              // While checking, GetBestBlock() refers to the parent block.
              // This is also true for mempool checks.
              CBlockIndex *pindexPrev = mapBlockIndex.find(inputs.GetBestBlock())->second;
    

    Results in going from “Verify 1730 txins: 187.01ms (0.108ms/txin)” to “Verify 1730 txins: 13.29ms (0.008ms/txin)”

    I’ve not yet convinced myself that it’s actually redundant, but if it is, it’s good to take it out, since it wastes time on mempool acceptance too (where it also appears to be behind an existing HaveInputs test).

  8. gavinandresen commented at 1:10 pm on February 27, 2015: contributor

    @gmaxwell: I agree, a unit and/or regression test for one-double-spend-in-mempool, two-in-a-block is a very good idea.

    RE: removing the possibly redundant HaveInputs(): that should be a separate pull request. Thanks for reviewing and looking deeper!

  9. gavinandresen commented at 1:14 pm on February 27, 2015: contributor
    I realized this morning that the memory savings are even better than I thought: it is an extra 4 bytes per transaction in the mempool (not 32 bytes; I was either thinking of sizeof(txid) or confusing a 32-bit integer with 32 bytes).
  10. gmaxwell commented at 1:24 pm on February 27, 2015: contributor
    Hah. I read it as bits to begin with (and was very confused by your most recent comment).
  11. SergioDemianLerner commented at 2:32 pm on February 27, 2015: contributor

    I responded to Gavin, but now that I see this thread, I copy parts here:

    The sigcache was added for security reasons and, as a by-product, it increases performance.

    With a TxId cache you can achieve the performance part, but not the security protection.

    The problem is described here https://bitcointalk.org/index.php?topic=136422.0 and only affects transactions during free forwarding (not in blocks).

    To summarize the attack: the attacker can send transactions that cost very little to create (they can even reuse signatures with SIGHASH_ANYONECANPAY) and that fail to verify only after a lot of CPU has been wasted, yet the victim peer does not increase the attacker’s DoS score. For example, re-using 999 good inputs but failing on the 1000th one.

    The attacker can keep the DoS score from being increased because there are two cases where it is not raised:

    • If an input refers to an immature coinbase (even if it’s not owned by the attacker)
    • If an input has a non-standard signature encoding.

    So removing the sigcache has consequences.

    Regarding performance, take into account that each signature is verified twice in AcceptToMemoryPool(): once with standard flags and once with mandatory flags (for extra security). So if you take out the sigcache, you double signature verification time.

    Several solutions are possible:

    1. Add a second level of cache for TxIds and get an additional speedup. This is probably the simplest and best option.
    2. Change VerifyInputs() so that it always increases the DoS score on a bad input. But the code explicitly states in one place: “Check whether the failure was caused by a non-mandatory script verification check …; if so, don’t trigger DoS protection to avoid splitting the network between upgraded and non-upgraded nodes.” So you would be breaking this protection.
    3. Add a DoS check that accumulates the amount of CPU wasted by a peer (e.g. processing good inputs before a bad input) and increases the DoS score once the waste is over a threshold. Node resource accounting like this is something Mike Hearn and I have pushed in the past, and I think M.H. is crowdfunding money for that purpose in his Lighthouse project.

    Obviously the best option is just adding a second level of TxId cache, without changing anything security-related. The cache should also store the block height at which the txId becomes includable when the transaction uses locktime (if it’s not re-verified, of course).

  12. sdaftuar commented at 2:47 pm on February 27, 2015: member
    Gavin, were you able to test the effect of this change on overall runtime? Sergio’s observation that signatures are checked twice on entry to the mempool suggests that taking out the signature cache could plausibly make things slower. In my initial testing, when I play back historical data through a node running this pull’s code, I see a substantial (nearly 2x) slowdown overall, although ConnectBlock is about 20% faster.
  13. gavinandresen commented at 3:29 pm on February 27, 2015: contributor

    @sdaftuar: thanks for testing! I’ll revert the low-level signature cache removal; it will, indeed, speed up initial download and accept-to-mempool.

    If this pull is accepted there should be a separate follow-up pull request to make the default signature cache much smaller to save memory; it should only need to be as large as one transaction’s worth of signatures, if we’re not relying on it to optimize block validation time.

  14. gavinandresen commented at 4:19 pm on February 27, 2015: contributor

    I ran this overnight with -debug=bench to get a sense of how good the assumption of “blocks contain mostly transactions that are already in our memory pool” is, and it turns out that is a very good assumption.

    The last 11 blocks had ALL transactions except the coinbase already in the mempool. The twelfth had 161 of 170 in the memory pool, and I have to go back another 11 blocks to find one that wasn’t all cache hits (it had 3 transactions not in my mempool).

  15. gavinandresen force-pushed on Feb 27, 2015
  16. gavinandresen commented at 4:37 pm on February 27, 2015: contributor
    Rebased to keep the old signature cache code. @SergioDemianLerner : RE locktime transactions: The memory pool does do the right thing with lockTime transactions (a bugfix went into 0.10.0 for that, if I remember correctly). But even if it didn’t, AcceptBlock calls ContextualCheckBlock which makes sure all transactions in the block are final, and those checks are always done for new blocks.
  17. ghost commented at 0:03 am on February 28, 2015: none
    Related #5163: Upon cursory review, still an outstanding optimization.
  18. sdaftuar commented at 11:33 am on February 28, 2015: member

    @gavinandresen I’ve been running a version of this code overnight, and I’m curious if your results are similar to mine – I’m typically seeing Verify times above 15 microseconds per txin, eg:

     2015-02-28 02:41:57     - Verify 2671 txins: 39.67ms (0.015ms/txin) [443.41s]
    

    which is an improvement, but still an order of magnitude slower than what you originally posted (1 microsecond/txin), so I just want to confirm whether my observations are consistent with yours.

    For reference, my 10 fastest verify times out of the last 50 blocks (reported in milliseconds per txin) are:

     0.009
     0.011
     0.015
     0.016
     0.016
     0.017
     0.017
     0.017
     0.018
     0.018
    

    (I manually checked, and all but one of those times correspond to blocks with no cache misses; one of the 0.017ms/txin blocks had 1 cache miss.)

    @gmaxwell I’m puzzled by your earlier comment about the substantial gains from commenting out the duplicate HaveInputs call in CheckInputs. I don’t see how that could produce the speedup you mentioned, because the Verify time still includes one call to HaveInputs for every transaction in the block. So it seems it shouldn’t be possible to get more than a factor-of-2 speedup if the only change is removing the second call, unless I’m missing something?

  19. gavinandresen commented at 12:56 pm on March 1, 2015: contributor

    @sdaftuar : you are seeing verification times of 0.015 milliseconds per txin (0.015ms really does mean 15 thousandths of a millisecond, i.e. 15 microseconds). Note: I found the -bench numbers in brackets confusing; they are cumulative times (total time spent verifying or connecting or whatever since startup).

    I should have time to add a double-spend-in-block, one-tx-in-mempool unit test for this tomorrow.

  20. sdaftuar commented at 5:49 pm on March 2, 2015: member

    @gavinandresen: Agreed, 0.015 milliseconds/txin. To clarify, I was trying to make the point that the 0.001ms/txin that you reported at the beginning of this conversation is not consistent with what I was seeing, even when I restrict my samples to the very fastest blocks.

    I’ve done some more testing and I think the 60x figure greatly overstates the performance benefit of this change, even in the case where there are no cache misses. I’ve timed the calls in ConnectBlock that occur in the normal course of a node doing block validation, and this code looks to be roughly 20% faster at verifying blocks (even with no cache misses).

    Of course I’m not trying to suggest a 20% improvement isn’t a nice speedup, I just want to make sure we’re all on the same page about what kind of speedup we’re expecting!

  21. gavinandresen commented at 4:26 pm on March 3, 2015: contributor
    Updated with a new unit test as suggested by @gmaxwell – with the new unit test, Travis passing, and the old sigcache kept, I think I’ve addressed all concerns. @sdaftuar : can you email me a longer snippet of your debug.log, showing all the -debug=bench times? And the head of config.log? (You’re not benchmarking with -g or -DDEBUG_LOCKORDER, are you?)
  22. gavinandresen commented at 9:48 pm on March 3, 2015: contributor
    @sdaftuar : never mind, I’m seeing similar results outside my benchmarking code. From a profile, it looks like adding a transaction to the memory pool isn’t making UTXOs “stick” in the CCoinsViewCache for some reason, so the on-disk database is being hit during ProcessNewBlock.
  23. Two bug fixes for limitedmap size-limiting code
    The code for limitedmap::insert would leave an entry in
    rmap with no corresponding entry in the forward map
    if the map was full and the entry being inserted had the
    lowest value (so it was immediately erased).
    
    It also had an off-by-one error: setting max_size to 11 results
    in a map with room for only 10 entries.
    
    Neither of these bugs matter for current use of limitedmap;
    I found them writing a unit test for new transaction validation
    caching code.
    d37a83590a
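    The off-by-one described above can be pictured with a minimal bounded-map sketch (illustrative, not the actual limitedmap code): trimming with `>=` instead of `>` leaves room for one entry fewer than max_size.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <map>

    // Minimal bounded map: insert, then evict the smallest keys while over
    // the limit. The bug described above is equivalent to trimming with
    // `size() >= nMaxSize`, which leaves only nMaxSize - 1 entries.
    template <typename K, typename V>
    class BoundedMap {
        std::map<K, V> m;
        std::size_t nMaxSize;
    public:
        explicit BoundedMap(std::size_t n) : nMaxSize(n) {}
        void insert(const K& k, const V& v) {
            m[k] = v;
            while (m.size() > nMaxSize) // correct: strictly greater
                m.erase(m.begin());
        }
        std::size_t size() const { return m.size(); }
    };
    ```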
  24. Optimize block validation by caching validation results in mempool
    Store validation flags along with transactions in the memory pool, and
    when validating a block, skip full validation if the transaction is
    in the pool and was validated with appropriate SCRIPT_VERIFY_ flags.
    
    This significantly speeds up block verification, and, therefore,
    block propagation.
    5da41c1a41
  25. gavinandresen force-pushed on Mar 4, 2015
  26. gavinandresen force-pushed on Mar 4, 2015
  27. jonasschnelli commented at 9:21 am on March 5, 2015: contributor

    FYI: I’m getting some new warnings while compiling this: https://gist.github.com/jonasschnelli/3047fb27ebce6e15ab8a

    Will test and try to time-profile/compare with current master.

  28. Unit test doublespends in new blocks
    As suggested by Greg Maxwell -- a unit test to make sure a block
    with a double-spend in it doesn't pass validation when half of
    the double-spend is already in the memory pool (so full-blown
    transaction validation is skipped) when the block is received.
    2fba2c5d4a
  29. gavinandresen force-pushed on Mar 5, 2015
  30. gavinandresen commented at 3:46 pm on March 5, 2015: contributor
    @jonasschnelli : thanks for testing. I fixed the new warning (and switched back to compiling my tree with clang++, so I’ll catch them myself next time).
  31. gavinandresen force-pushed the base branch on Mar 5, 2015
  32. jonasschnelli commented at 7:52 am on March 6, 2015: contributor

    Did run a mainnet node over night with this PR on top.

    Last 10 blocks:

         - Verify 771 txins: 19.68ms (0.026ms/txin) [11767.47s]
     UpdateTip: new best=000000000000000002e13ee715262f37180c4633d378bab17567da89990e7196  height=346377  log2_work=82.371513  tx=61554777 
     
         - Verify 452 txins: 5.09ms (0.011ms/txin) [11767.47s]
     UpdateTip: new best=000000000000000013ff0184ca697936fd67b5d704b5370c68351a3a3efec08f  height=346378  log2_work=82.371559  tx=61554825 
     
         - Verify 1702 txins: 33.67ms (0.020ms/txin) [11767.51s]
     UpdateTip: new best=00000000000000001452f4f078a23f823004baac88767b12ce6f5411ab568eaa  height=346379  log2_work=82.371606  tx=61555399 
     
         - Verify 3944 txins: 85.32ms (0.022ms/txin) [11767.59s]
     UpdateTip: new best=00000000000000000ede63ad0203aa389d140a6b6a9acf7f4aa8903220254550  height=346380  log2_work=82.371652  tx=61556917 
     
         - Verify 3107 txins: 197.60ms (0.064ms/txin) [11767.79s]
     UpdateTip: new best=000000000000000009bdaccdf6d5765e874bc3873bec06d8bf017ec4796faa05  height=346381  log2_work=82.371698  tx=61557229 
     
         - Verify 2725 txins: 89.91ms (0.033ms/txin) [11767.88s]
     UpdateTip: new best=00000000000000000cf5268b4418cf20f83909a18184277d69e3b516f5460c7b  height=346382  log2_work=82.371744  tx=61557437 
     
         - Verify 4048 txins: 73.94ms (0.018ms/txin) [11767.95s]
     UpdateTip: new best=00000000000000000c57c303ffba3c647dc406336ed78019b5bc1e472bd3a586  height=346383  log2_work=82.371791  tx=61558563 
     
         - Verify 4568 txins: 61.33ms (0.013ms/txin) [11768.01s]
     UpdateTip: new best=00000000000000000bf5ffc2d19bae985cfd050da2c6876732318111c7dcbe43  height=346384  log2_work=82.371837  tx=61559083 
     
         - Verify 6404 txins: 63.72ms (0.010ms/txin) [11768.08s]
     UpdateTip: new best=00000000000000000228198eedafed2af4f8ef87ddeec8879821653faca0fa2c  height=346385  log2_work=82.371883  tx=61559578
     
         - Verify 4874 txins: 61.42ms (0.013ms/txin) [11768.14s]
     UpdateTip: new best=000000000000000008d9b0b9e2d2758dc715fe9bf75d19ce626c937dbe147cf3  height=346386  log2_work=82.371929  
     
         - Verify 5236 txins: 52.78ms (0.010ms/txin) [11768.19s]
     UpdateTip: new best=000000000000000016889cd874dafc64a774a331044e054addbdb79542a6f70f  height=346387
    

    Processor info:

     processor   : 7
     vendor_id   : GenuineIntel
     cpu family  : 6
     model       : 42
     model name  : Intel(R) Xeon(R) CPU E31245 @ 3.30GHz
     stepping    : 7
     microcode   : 0x17
     cpu MHz     : 1600.000
     cache size  : 8192 KB
     physical id : 0
     siblings    : 8
     core id     : 3
     cpu cores   : 4
    

    Now I’ve switched back to master and will report the bench for the next 10 blocks.

  33. jonasschnelli commented at 12:02 pm on March 6, 2015: contributor

    current master bench (8 recent blocks):

         - Verify 5443 txins: 421.00ms (0.077ms/txin) [0.85s]
     UpdateTip: new best=0000000000000000097f38b3a09f92b0f4d2a647548774f46baa4b822cb3e71b  height=346390  log2_work=82.372114  tx=61562832
     
         - Verify 4619 txins: 273.11ms (0.059ms/txin) [1.12s]
     UpdateTip: new best=000000000000000012a453ebe873ba9731b9d7b8bbbddbb2275406d1de4460cf  height=346391  log2_work=82.37216  tx=61563318  
     
         - Verify 4169 txins: 829.47ms (0.199ms/txin) [1.95s]
     UpdateTip: new best=00000000000000000471a61459599c2bde70b982feaf792b3f122e8a18a7a50e  height=346392  log2_work=82.372207  tx=61563871
     
         - Verify 3305 txins: 96.93ms (0.029ms/txin) [2.05s]
     UpdateTip: new best=000000000000000008c37a9dab271dbc2b7108636e18210eb0514924f16890ab  height=346393  log2_work=82.372253  tx=61564914 
     
         - Verify 1384 txins: 62.65ms (0.045ms/txin) [2.11s]
     UpdateTip: new best=00000000000000001704f6b8cb65238df1fcda4a6ec34c3967879b013b5c5478  height=346394  log2_work=82.372299  tx=61565427
     
         - Verify 2143 txins: 62.94ms (0.029ms/txin) [2.17s]
     UpdateTip: new best=0000000000000000058939f70f808050c802f2d7b50a81bb2ab0c280ff526aa6  height=346395  log2_work=82.372345  tx=61566298 
     
         - Verify 361 txins: 120.67ms (0.334ms/txin) [2.29s]
     UpdateTip: new best=0000000000000000125a6766bdc03bc2b633516714ff59099c91c2b2c9712381  height=346396  log2_work=82.372391  tx=61566345 
     
         - Verify 1431 txins: 28.41ms (0.020ms/txin) [2.32s]
     UpdateTip: new best=0000000000000000148af21b20abc03d84d9a9e038dcd4d007d1cd680c0d9e77  height=346397  log2_work=82.372438  tx=61566665 
    
  34. laanwj added the label Improvement on Mar 9, 2015
  35. sipa commented at 2:39 pm on April 27, 2015: member
    Concept ACK, though I would prefer keeping the cache (which becomes consensus-critical) outside of the mempool code. @SergioDemianLerner would the security concern not also be addressed by a simple per-txin-verification cache (which wouldn’t need synchronization across validation threads, etc.)?
  36. SergioDemianLerner commented at 5:06 pm on April 28, 2015: contributor

    @sipa No, but I may not have understood you correctly. The attack uses multiple txs, so a cache at the txin level wouldn’t help.

    Let’s try to tackle the contention problem…

    One solution could be a cache data structure where reads use only a lightweight mutex (such as a per-thread incremental variable), and writes must be delayed until all reads have been performed (using a spin-lock over all the per-thread mutexes).

    If the bottleneck is block processing, then why not acquire a lock on the cache before processing the block, and delay all writes into thread-specific caches until the full block has been processed? Then all per-thread caches are merged back into the main cache. Contention is eliminated, at the expense of some weakness to a DoS where the same signature is used again and again across transactions (e.g. using the SIGHASH_SINGLE bug). Maybe we could add a special mutex-protected cache only for SIGHASH_SINGLE signatures whose input index is bigger than the maximum output index.
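    One way to read the per-thread buffering suggestion (a sketch under assumed names, not code from this PR): each validation thread collects its cache writes locally while the block is processed, and the buffers are merged into the shared cache under a single lock afterwards, so reads during validation avoid contention.

    ```cpp
    #include <cassert>
    #include <mutex>
    #include <set>
    #include <string>
    #include <vector>

    // Reads are assumed to happen only between merges (i.e. while a block is
    // being validated and no merge is running), so Contains() takes no lock
    // in this sketch.
    class BufferedSigCache {
        std::set<std::string> shared; // signature hashes known to be valid
        std::mutex mergeLock;
    public:
        bool Contains(const std::string& sig) const {
            return shared.count(sig) > 0;
        }
        // Called once per validation thread after the block is fully processed.
        void MergeThreadBuffer(const std::vector<std::string>& buffer) {
            std::lock_guard<std::mutex> g(mergeLock);
            shared.insert(buffer.begin(), buffer.end());
        }
    };
    ```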

  37. sipa commented at 5:12 pm on April 28, 2015: member
    @SergioDemianLerner Or a per-verification-thread cache?
  38. gavinandresen commented at 6:18 pm on July 27, 2015: contributor
    Closing, @sipa ’s version is better (and just got merged).
  39. gavinandresen closed this on Jul 27, 2015

  40. MarcoFalke locked this on Sep 8, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-17 09:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me