Faster sigcache nonce

JeremyRubin commented at 6:26 pm on May 9, 2018: contributor

This PR replaces nonces in two places with pre-salted hashers.

The nonce is chosen to be 64 bytes long so that it forces the SHA256 hasher to process the chunk. This leaves the next 64 (or 56 depending if final chunk) open for data. In the case of the script execution cache, this does not make a big performance improvement because the nonce was already properly padded to fit into one buffer, but does make the code a little simpler. In the case of the sig cache, this should reduce the hashing overhead slightly because we are less likely to need an additional processing step.

I haven’t benchmarked this, but back of the envelope it should reduce the hashing by one buffer for all combinations except compressed public keys with compact signatures.

laanwj added the label Validation on May 9, 2018

in src/script/sigcache.cpp:35 in 9c8cba86fd outdated

32 
33 public:
34     CSignatureCache()
35     {
36-        GetRandBytes(nonce.begin(), 32);
37+        base_blob<64*8> nonce;

jimpo commented at 10:29 pm on May 14, 2018:

nit: unsigned char nonce[64] ought to work fine.

jimpo approved

jimpo commented at 10:33 pm on May 14, 2018: contributor

utACK 9c8cba8

laanwj referenced this in commit 3c2a41a9fc on May 23, 2018

DrahtBot closed this on Jul 22, 2018

DrahtBot reopened this on Jul 22, 2018

DrahtBot commented at 6:55 pm on September 7, 2018: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#18815 (bench: Add logging benchmark by MarcoFalke)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

sipa commented at 3:47 am on November 10, 2018: member

Benchmarks would be welcome.

DrahtBot closed this on Apr 28, 2019

DrahtBot commented at 7:11 pm on April 28, 2019: member

DrahtBot reopened this on Apr 28, 2019

MarcoFalke commented at 0:27 am on April 29, 2019: member

This will be closed due to inactivity in two weeks

MarcoFalke commented at 2:15 pm on May 7, 2019: member

Closing for now. Let me know when you want this reopened to work on it again.

MarcoFalke closed this on May 7, 2019

gmaxwell commented at 7:23 pm on May 7, 2019: contributor

This seems like an obvious (minor) win. It does need to be benchmarked, but it could just be an informal test, not some benchmark tool checked in.

MarcoFalke commented at 8:21 pm on May 7, 2019: member

The tests were failing, so at the very least this needs to be fixed:

0test_bitcoin-qt: random.cpp:577: void ProcRand(unsigned char*, int, RNGLevel): Assertion `num <= 32' failed.

MarcoFalke reopened this on May 7, 2019

jamesob commented at 8:41 pm on May 7, 2019: member

Oddly, pruned IBD (500_000 - 505_000) comparison indicates this PR slightly slower than master. Maybe I’ll try again after a rebase?

nonce ibd local 500000 505000 dbcache=2048

faster-sigcache-nonce vs. master (absolute)

name	iterations	faster-sigcache-nonce	master
build.make.23.clang	1	125.9031 (± 0.0000)	135.5557 (± 0.0000)
build.make.23.clang.mem-usage	1	700992.0000 (± 0.0000)	647144.0000 (± 0.0000)
ibd.local.500000.505000.dbcache=2048	3	472.4958 (± 2.4356)	439.0951 (± 2.0998)
ibd.local.500000.505000.dbcache=2048.mem-usage	3	2434952.0000 (± 2700.1412)	2417438.6667 (± 1709.0577)

faster-sigcache-nonce vs. master (relative)

name	iterations	faster-sigcache-nonce	master
build.make.23.clang	1	1.000	1.08
build.make.23.clang.mem-usage	1	1.083	1.00
ibd.local.500000.505000.dbcache=2048	3	1.076	1.00
ibd.local.500000.505000.dbcache=2048.mem-usage	3	1.007	1.00

JeremyRubin force-pushed on May 7, 2019

JeremyRubin commented at 8:50 pm on May 7, 2019: contributor

Rebased – go ahead and retry!

in src/validation.cpp:1348 in f6b9da76b6 outdated

1344@@ -1345,9 +1345,14 @@ int GetSpendHeight(const CCoinsViewCache& inputs)
1345 
1346 
1347 static CuckooCache::cache<uint256, SignatureCacheHasher> scriptExecutionCache;
1348-static uint256 scriptExecutionCacheNonce(GetRandHash());
1349+static CSHA256 scriptExecutionCacheHasher;

MarcoFalke commented at 10:41 pm on May 7, 2019:

style-nit:

0static CSHA256 g_scriptExecutionCacheHasher;

Also, should probably squash the third commit, so that all commits compile and pass the tests?

JeremyRubin commented at 10:49 pm on May 7, 2019:

I’d prefer to leave it intact as we would want to undo this commit in the future if we get rid of the 32 byte limit on GetRand which did not exist at the time this PR was made, but I’m happy to squash if it’s your preference

gmaxwell commented at 11:14 pm on May 7, 2019: contributor

Irrelevant nit: I’d recommend just zerofilling the input rather than calling getrand again. 256-bits is more than then enough (and less wouldn’t make getrand any faster).

jamesob commented at 3:00 am on May 8, 2019: member

Current revision fails to build. Benched the rebase and didn’t see much difference between master and this branch, though I don’t like how much variance there was so I’m going to run again once this compiles.

ibd local 500000 505000 dbcache=2048

JeremyRubin commented at 10:09 pm on May 9, 2019: contributor

fixed build issue; travis needs to be kicked @MarcoFalke

MarcoFalke commented at 10:40 pm on May 9, 2019: member

kicked travis, @jamesob needs to be kicked

MarcoFalke commented at 10:41 pm on May 9, 2019: member

I’d prefer if the commits were squashed, so that git bisect would’t break on half of them

JeremyRubin commented at 10:48 pm on May 9, 2019: contributor

I can squash them down to one commit, unless you prefer the original two?

MarcoFalke commented at 12:26 pm on May 10, 2019: member

One commit is fine. Makes it trivial to see that bisect won’t break.

jamesob commented at 1:38 pm on May 10, 2019: member

Yeah I need a good kick once in a while.

Benching shows this as being negligibly slower than master.

ibd local 500005 505000 dbcache=2048

faster-sigcache-nonce vs. master (absolute)

bench name	x	faster-sigcache-nonce	master
ibd.local.500000.505000.dbcache=2048	1	310.3099 (± 0.0000)
ibd.local.500000.505000.dbcache=2048.mem-usage	1	2459212.0000 (± 0.0000)
ibd.local.500005.505000.dbcache=2048	2	323.1098 (± 4.7144)	319.6353 (± 1.7855)
ibd.local.500005.505000.dbcache=2048.mem-usage	2	2457362.0000 (± 1446.0000)	2464301.3333 (± 5965.4940)

faster-sigcache-nonce vs. master (relative)

bench name	x	faster-sigcache-nonce	master
ibd.local.500000.505000.dbcache=2048	1	1.00
ibd.local.500000.505000.dbcache=2048.mem-usage	1	1.00
ibd.local.500005.505000.dbcache=2048	2	1.01	1.000
ibd.local.500005.505000.dbcache=2048.mem-usage	2	1.00	1.003

MarcoFalke commented at 5:48 pm on May 10, 2019: member

The code is only run when fScriptCheck (see https://github.com/bitcoin/bitcoin/pull/13868/files#diff-24efdb00bfbe56b140fb006b562cc70bR1847)

You might want to check height=[565k-575k]

JeremyRubin commented at 10:19 pm on May 10, 2019: contributor

Are you able to just look at the time spent in connect block as opposed to the aggregate time.

Thats the time we reasonably care about, the total time is not granular enough

gmaxwell commented at 11:06 pm on May 11, 2019: contributor

If it’s making total time worse though, that suggests either we’re suffering measurement error or something interesting is going on. Indeed, this only applies where scriptchecks are taking effect. Use assumevalid=0 ?

jamesob commented at 3:25 pm on May 13, 2019: member

Latest bench with -assumevalid=0 is consistent with previous results: basically no difference between this and master. I think the only way to sensibly proceed on this is if someone comes up with an actually-relevant microbench that I can measure. I ran the microbenches on this last run, but it wasn’t clear to me which bench was actually worth looking at, and the total runtime was swamped by (spurious?) prevector differences.

NACK from me until we come up with a clear way of showing perf improvements.

ibd local 500000 505000 dbcache=2048

MarcoFalke commented at 4:00 pm on May 13, 2019: member

Could we remove the caching if it has no impact with noassumevalid (other than code complexity)?

MarcoFalke commented at 4:31 pm on May 13, 2019: member

I guess the only way to see a difference is to populate a mempool and then create a block with and without the cache enabled?

JeremyRubin commented at 5:30 pm on May 13, 2019: contributor

@jamesob are the times you are showing total time or just time inside or block validation/check inputs?

MarcoFalke commented at 5:34 pm on May 13, 2019: member

Total time

JeremyRubin commented at 6:21 pm on May 13, 2019: contributor

It should be fairly easy to pull out the validation times from the debug log and compare those.

MarcoFalke commented at 6:25 pm on May 13, 2019: member

See https://github.com/chaincodelabs/bitcoinperf/issues/37

JeremyRubin force-pushed on May 13, 2019

JeremyRubin commented at 6:53 pm on May 13, 2019: contributor

rebased, squashed, applied @gmaxwell’s nit (was easiest to just write the nonce twice rather than write zeros, which would have added more lines).

in src/script/sigcache.cpp:40 in 51b6e73377 outdated

34     CSignatureCache()
35     {
36-        GetRandBytes(nonce.begin(), 32);
37+        uint256 nonce = GetRandHash();
38+        salted_hasher.Write(nonce.begin(), 32);
39+        salted_hasher.Write(nonce.begin(), 32);

ryanofsky commented at 6:12 pm on July 23, 2019:

It’d be good to have a code comment here saying why this writes the same nonce twice, since it does look strange. The PR description has a good explanation, but there should be some hint in the code.

The nonce is chosen to be 64 bytes long so that it forces the SHA256 hasher to process the chunk. This leaves the next 64 (or 56 depending if final chunk) open for data. In the case of the script execution cache, this does not make a big performance improvement because the nonce was already properly padded to fit into one buffer, but does make the code a little simpler. In the case of the sig cache, this should reduce the hashing overhead slightly because we are less likely to need an additional processing step.

JeremyRubin commented at 8:14 pm on July 24, 2019:

Comment added.

in src/validation.cpp:1488 in 51b6e73377 outdated

1348-static uint256 scriptExecutionCacheNonce(GetRandHash());
1349+static CuckooCache::cache<uint256, SignatureCacheHasher> g_scriptExecutionCache;
1350+static CSHA256 g_scriptExecutionCacheHasher;
1351 
1352 void InitScriptExecutionCache() {
1353+    // Setup the salted hasher

ryanofsky commented at 6:14 pm on July 23, 2019:

Again would be good to say something about why double nonce setup is intentional.

ryanofsky commented at 6:33 pm on July 23, 2019: member

Code review ACK 51b6e73377cad1a1b0d180be5ff72e1f9c4e7a14. This is a simple code cleanup. Nice especially to get rid or the 19 + 32 + 4 = 55 bytes (+ 8 + 1 = 64) checking.

Since there isn’t a microbenchmark for this or other benchmark showing a performance difference, it’s possible the PR title is overpromising by saying this is faster. But “back of the envelope” reasoning to me seems adequate for a minor change like this.

ryanofsky approved

JeremyRubin force-pushed on Jul 24, 2019

JeremyRubin commented at 8:37 pm on July 24, 2019: contributor

I added a contrived microbenchmark which shows:

0# Benchmark, evals, iterations, total, min, max, median
1PrePadded, 5, 10000, 0.00320418, 6.18847e-08, 7.06114e-08, 6.28207e-08
2RegularPadded, 5, 10000, 0.00560628, 1.09935e-07, 1.18426e-07, 1.10481e-07

Pre padding is indeed better than regular padding.

ryanofsky approved

ryanofsky commented at 8:50 pm on July 24, 2019: member

utACK 5b877679d9f20d7ce18097b22b0843a652eb105b. Only changes since last review are new comments and benchmark. The new comments look good! And the new microbenchmark should be a sufficient sanity check for performance of this change, along with the previous neutral IBD results. Like I wrote in #13204#pullrequestreview-265587021, I think this change seems nice even as a simple code cleanup.

in src/bench/hashpadding.cpp:41 in 5b877679d9 outdated

36+    uint256 nonce = GetRandHash();
37+    uint256 data = GetRandHash();
38+    while (state.KeepRunning()) {
39+        unsigned char out[32];
40+        CSHA256 h = hasher;
41+        hasher.Write(nonce.begin(), 32);

MarcoFalke commented at 9:42 pm on July 24, 2019:

h or hasher? If hasher is intended, it looks like that this benchmark will perform as good as the previous one on every other run. Might want to add a comment to explain why. If h is intended, could make hasher const?

JeremyRubin commented at 9:54 pm on July 24, 2019:

Ah you’re right. good catch. Let me fix that…

JeremyRubin commented at 9:55 pm on July 24, 2019:

It can’t be made const because it needs to be updated a little bit.

JeremyRubin commented at 9:56 pm on July 24, 2019:

well it can be for one of the tests, and not the other. But I want to keep the code as similar as possible between pre/regular.

in src/script/sigcache.cpp:27 in 5b877679d9 outdated

23@@ -24,21 +24,27 @@ class CSignatureCache
24 {
25 private:
26      //! Entries are SHA256(nonce || signature hash || public key || signature):
27-    uint256 nonce;
28+    CSHA256 salted_hasher;

MarcoFalke commented at 9:43 pm on July 24, 2019:

0    CSHA256 m_salted_hasher;

MarcoFalke commented at 9:43 pm on July 24, 2019: member

Thanks for adding a benchmark. Just one question about it.

Would be nice to add the benchmark in the first commit, so that performance can be compared.

JeremyRubin commented at 9:53 pm on July 24, 2019: contributor

@MarcoFalke because of modularization, there’s no real interfaces exposed for me to run in a bench, so the benchmark just had both techniques demonstrated (doesn’t really matter if the commit comes before or after, the code changes don’t affect the benchmark).

JeremyRubin force-pushed on Jul 24, 2019

JeremyRubin commented at 10:01 pm on July 24, 2019: contributor

@MarcoFalke nits addressed

ryanofsky approved

ryanofsky commented at 5:06 pm on July 31, 2019: member

utACK dec5f968788bd96f1eec4dfea3f4ab584a2d7672. Only changes since last review were adding suggested m_ prefixes and making the RegularPadded case closer to what it’s supposed to measure.

ariard approved

ariard commented at 5:47 pm on August 2, 2019: member

ACK dec5f96, I’ve tested change locally and run benchmark on a i7-7700HQ CPU @ 2.80GHz (Debian), got the following results (yes PrePadded is faster):

0# Benchmark, evals, iterations, total, min, max, median
1RegularPadded, 5, 10000, 0.0195961, 3.75465e-07, 4.17618e-07, 3.83324e-07
2# Benchmark, evals, iterations, total, min, max, median
3PrePadded, 5, 10000, 0.0114673, 1.9201e-07, 2.7354e-07, 2.1187e-07

DrahtBot added the label Needs rebase on Sep 2, 2019

JeremyRubin force-pushed on Apr 29, 2020

Add Hash Padding Microbenchmarks 5495fa5850

DrahtBot removed the label Needs rebase on Apr 29, 2020

JeremyRubin force-pushed on Apr 29, 2020

Use salted hasher instead of nonce in sigcache

Use salted hasher instead of nonce in Script Execution Cache

Don't read more than 32 bytes from GetRand

Apply g_* naming convention to scriptExecutionCache in validation.cpp

Fully apply g_* naming convention to scriptCacheHasher

Write same uint256 nonce twice for cache hash rather than calling getrand twice

Use salted hasher instead of nonce in sigcache

Use salted hasher instead of nonce in Script Execution Cache

Don't read more than 32 bytes from GetRand

Apply g_* naming convention to scriptExecutionCache in validation.cpp

Fully apply g_* naming convention to scriptCacheHasher

Write same uint256 nonce twice for cache hash rather than calling getrand twice

152e8baf08

ryanofsky approved

ryanofsky commented at 8:54 pm on May 13, 2020: member

Code review ACK 152e8baf08c7379e5cc09f90863e6309bdd4866c. No code changes, just rebase since last review and expanded commit message

JeremyRubin commented at 9:59 pm on May 13, 2020: contributor

I think this is ready to merge @MarcoFalke.

All outstanding feedback has been addressed & has two “fresh” ACKs and a handful of stale utACKs/makes senses.

ryanofsky commented at 7:06 pm on May 19, 2020: member

I think this is ready to merge @MarcoFalke.

All outstanding feedback has been addressed & has two “fresh” ACKs and a handful of stale utACKs/makes senses.

Any agreement / disagreement?

MarcoFalke merged this on Jun 2, 2020

MarcoFalke closed this on Jun 2, 2020

MarcoFalke commented at 1:29 pm on June 2, 2020: member

On arm:

0$ ./src/bench/bench_bitcoin --filter=.*Padded
1# Benchmark, evals, iterations, total, min, max, median
2PrePadded, 5, 10000, 0.0305467, 6.09309e-07, 6.13757e-07, 6.10061e-07
3RegularPadded, 5, 10000, 0.057972, 1.15744e-06, 1.16534e-06, 1.15813e-06

sidhujag referenced this in commit 5d2d0f4905 on Jun 2, 2020

Fabcien referenced this in commit 95d0162dae on Feb 12, 2021

UdjinM6 referenced this in commit c85dc3df80 on Jun 18, 2021

UdjinM6 referenced this in commit 26087f4dd1 on Jun 24, 2021

UdjinM6 referenced this in commit f33ed5594a on Jun 26, 2021

UdjinM6 referenced this in commit 408d0a2afc on Jun 26, 2021

UdjinM6 referenced this in commit c8ee5e1665 on Jun 28, 2021

DrahtBot locked this on Feb 15, 2022

Faster sigcache nonce #13204

Conflicts

faster-sigcache-nonce vs. master (absolute)

faster-sigcache-nonce vs. master (relative)

faster-sigcache-nonce vs. master (absolute)

faster-sigcache-nonce vs. master (relative)