While exploring some C++20 changes and checking against our code I found this potential improvement:
- We can replace our custom implementation of
rotl32
in crypto/chacha20 withstd::rotl
from the newbit
header.
While exploring some C++20 changes and checking against our code I found this potential improvement:
rotl32
in crypto/chacha20 with std::rotl
from the new bit
header.The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For detailed information about the code coverage, see the test coverage report.
See the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
No conflicts as of last run.
Not sure. This needs a benchmark to check against slowdown, see #29057 (comment)
Also, C++20 is enabled for years, so I don’t understand the other commit. I think you either forgot to compile locally, or your compiler doesn’t emit any warnings that the changes are wrong?
0txmempool.cpp:88:61: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
1 88 | mapTx.modify(mapTx.iterator_to(descendant), [=, this](CTxMemPoolEntry& e) {
2 | ~~^~~~
3txmempool.cpp:99:32: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
4 99 | mapTx.modify(updateIt, [=, this](CTxMemPoolEntry& e) { e.UpdateDescendantState(modifySize, modifyFee, modifyCount); });
5 | ~~^~~~
6txmempool.cpp:300:38: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
7 300 | mapTx.modify(ancestorIt, [=, this](CTxMemPoolEntry& e) { e.UpdateDescendantState(updateSize, updateFee, updateCount); });
8 | ~~^~~~
9txmempool.cpp:315:26: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
10 315 | mapTx.modify(it, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(updateSize, updateFee, updateCount, updateSigOpsCost); });
11 | ~~^~~~
12txmempool.cpp:345:39: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
13 345 | mapTx.modify(dit, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(modifySize, modifyFee, -1, modifySigOps); });
14 | ~~^~~~
15txmempool.cpp:903:46: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
16 903 | mapTx.modify(ancestorIt, [=, this](CTxMemPoolEntry& e){ e.UpdateDescendantState(0, nFeeDelta, 0);});
17 | ~~^~~~
18txmempool.cpp:910:48: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
19 910 | mapTx.modify(descendantIt, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(0, nFeeDelta, 0, 0); });
20 | ~~^~~~
Also, C++20 is enabled for years, so I don’t understand the other commit.
I don’t understand this comment. What’s the ‘other’ commit?
I think you either forgot to compile locally, or your compiler doesn’t emit any warnings that the changes are wrong?
I compiled locally just fine and ran all test using Apple clang version 15.0.0
as the compiler. I will look into the warnings.
this
(e.g. by accessing a class or struct member in the body of the lambda) and therefore don’t need to be changed.
Only the example in the scheduler does that (which is a doc). Not the use of [=]
in general is deprecated, just using [=]
in combination with accessing this
via implicit capture.
See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0806r2.html for a longer explanation.
Seems right that the capture doesn’t always need to be explicit, I dropped the second commit.
I don’t see any performance impact on chacha20 from the first commit:
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.92 | 522,177,619.15 | 0.1% | 0.02 | CHACHA20_1MB |
1.94 | 514,584,551.79 | 0.3% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,631,174.53 | 0.1% | 0.01 | CHACHA20_64BYTES |
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.91 | 522,318,337.55 | 0.1% | 0.02 | CHACHA20_1MB |
1.94 | 514,750,575.59 | 0.2% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,682,263.12 | 0.0% | 0.01 | CHACHA20_64BYTES |
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.92 | 521,106,902.91 | 0.3% | 0.02 | CHACHA20_1MB |
1.94 | 515,778,495.78 | 0.0% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,731,465.14 | 0.0% | 0.01 | CHACHA20_64BYTES |
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.92 | 520,578,875.51 | 0.3% | 0.02 | CHACHA20_1MB |
1.94 | 515,833,155.08 | 0.1% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,727,504.36 | 0.0% | 0.01 | CHACHA20_64BYTES |
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.91 | 522,209,865.84 | 0.1% | 0.02 | CHACHA20_1MB |
1.94 | 515,822,396.86 | 0.0% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,715,040.19 | 0.0% | 0.01 | CHACHA20_64BYTES |
$ src/bench/bench_bitcoin -filter=“CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES”
ns/byte | byte/s | err% | total | benchmark |
---|---|---|---|---|
1.92 | 522,069,205.88 | 0.1% | 0.02 | CHACHA20_1MB |
1.94 | 514,707,650.50 | 0.2% | 0.01 | CHACHA20_256BYTES |
2.02 | 495,706,912.69 | 0.0% | 0.01 | CHACHA20_64BYTES |
If we are going to change it here, should also switch in
hash.cpp
Done, thanks! I grepped for it didn’t find this one, probably because of the all-caps.
I guess this is fine, because the shift is constant and the compiler can skip the expensive modulo?
@maflcko Constant propagation + strength reduction should make that code not emit an actual modulo operation in any compiler from the last 20 years (it’ll use x & (d-1)
instead of x % d
, if d
is a known power of two).
But hopefully that pattern in libc++ also matches whatever pattern matching clang has to make sure it instead emits an actual rot instruction on platforms where it exists.
Rotl
in sha3 and ROTL
in siphash as well?
Any reason not to do
Rotl
in sha3 andROTL
in siphash as well?
I don’t see any, apparently I just need to learn to grep patterns better…
Done!
ACK 60446285436da62adef1c0a9b11c3336d82b4d89
Had a look at the difference in master vs this PR (rebased) on aarch64. Seems like for the most part, this gives the same output +- a handful of instructions. (Identical is identical minus difference in addresses / argument ordering etc).
Function | GCC 13.2 | Clang 17.0.6 |
---|---|---|
MurmurHash3 | Identical | PR gives 5 less instructions: https://www.diffchecker.com/Jmlpl4rG/ |
ChaCha20Aligned::Keystream | PR gives 1 less instruction: https://www.diffchecker.com/d6dp5tsk/ | Identical |
KeccakF | Identical | Identical |
CSipHasher:write (unsigned long) | Identical | Identical |
So this seems ok, unless anyone can find any major regressions. The output on corecheck also seems ok: https://corecheck.dev/bitcoin/bitcoin/pulls/29085.
Which side of the diff is which version?
Left is master @ f1fcc9638cde7664b9642018fe6872148bbb0172 , right was the PR based on f1fcc9638cde7664b9642018fe6872148bbb0172.
MurmurHash3
, master (left), and this PR (right), we get less code and more rol
instructions: https://www.diffchecker.com/I7s5JeTr/. For GCC 13.2, this PR makes no difference to the generated assembly: https://www.diffchecker.com/nM5DgKfN/.