While exploring some C++20 changes and checking against our code I found this potential improvement:
- We can replace our custom implementation of
rotl32in crypto/chacha20 withstd::rotlfrom the newbitheader.
While exploring some C++20 changes and checking against our code I found this potential improvement:
rotl32 in crypto/chacha20 with std::rotl from the new bit header.<!--e57a25ab6845829454e8d69fc972939a-->
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
<!--006a51241073e994b41acfe9ec718e94-->
For detailed information about the code coverage, see the test coverage report.
<!--021abf342d371248e50ceaed478a90ca-->
See the guideline for information on the review process.
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
<!--174a7506f384e20aa4161008e828411d-->
No conflicts as of last run.
Not sure. This needs a benchmark to check against slowdown, see #29057 (comment)
Also, C++20 is enabled for years, so I don't understand the other commit. I think you either forgot to compile locally, or your compiler doesn't emit any warnings that the changes are wrong?
txmempool.cpp:88:61: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
88 | mapTx.modify(mapTx.iterator_to(descendant), [=, this](CTxMemPoolEntry& e) {
| ~~^~~~
txmempool.cpp:99:32: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
99 | mapTx.modify(updateIt, [=, this](CTxMemPoolEntry& e) { e.UpdateDescendantState(modifySize, modifyFee, modifyCount); });
| ~~^~~~
txmempool.cpp:300:38: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
300 | mapTx.modify(ancestorIt, [=, this](CTxMemPoolEntry& e) { e.UpdateDescendantState(updateSize, updateFee, updateCount); });
| ~~^~~~
txmempool.cpp:315:26: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
315 | mapTx.modify(it, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(updateSize, updateFee, updateCount, updateSigOpsCost); });
| ~~^~~~
txmempool.cpp:345:39: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
345 | mapTx.modify(dit, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(modifySize, modifyFee, -1, modifySigOps); });
| ~~^~~~
txmempool.cpp:903:46: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
903 | mapTx.modify(ancestorIt, [=, this](CTxMemPoolEntry& e){ e.UpdateDescendantState(0, nFeeDelta, 0);});
| ~~^~~~
txmempool.cpp:910:48: error: lambda capture 'this' is not used [-Werror,-Wunused-lambda-capture]
910 | mapTx.modify(descendantIt, [=, this](CTxMemPoolEntry& e){ e.UpdateAncestorState(0, nFeeDelta, 0, 0); });
| ~~^~~~
Also, C++20 is enabled for years, so I don't understand the other commit.
I don't understand this comment. What's the 'other' commit?
I think you either forgot to compile locally, or your compiler doesn't emit any warnings that the changes are wrong?
I compiled locally just fine and ran all test using Apple clang version 15.0.0 as the compiler. I will look into the warnings.
I think the issue is that most of the spots that were changed don't actually make use of the implicit capture of this (e.g. by accessing a class or struct member in the body of the lambda) and therefore don't need to be changed.
Only the example in the scheduler does that (which is a doc). Not the use of [=] in general is deprecated, just using [=] in combination with accessing this via implicit capture.
See https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0806r2.html for a longer explanation.
Seems right that the capture doesn't always need to be explicit, I dropped the second commit.
I don't see any performance impact on chacha20 from the first commit:
<details> <summary>MASTER</summary>
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.92 | 522,177,619.15 | 0.1% | 0.02 | CHACHA20_1MB
| 1.94 | 514,584,551.79 | 0.3% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,631,174.53 | 0.1% | 0.01 | CHACHA20_64BYTES
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.91 | 522,318,337.55 | 0.1% | 0.02 | CHACHA20_1MB
| 1.94 | 514,750,575.59 | 0.2% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,682,263.12 | 0.0% | 0.01 | CHACHA20_64BYTES
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.92 | 521,106,902.91 | 0.3% | 0.02 | CHACHA20_1MB
| 1.94 | 515,778,495.78 | 0.0% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,731,465.14 | 0.0% | 0.01 | CHACHA20_64BYTES
</details>
<details> <summary>THIS BRANCH</summary>
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.92 | 520,578,875.51 | 0.3% | 0.02 | CHACHA20_1MB
| 1.94 | 515,833,155.08 | 0.1% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,727,504.36 | 0.0% | 0.01 | CHACHA20_64BYTES
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.91 | 522,209,865.84 | 0.1% | 0.02 | CHACHA20_1MB
| 1.94 | 515,822,396.86 | 0.0% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,715,040.19 | 0.0% | 0.01 | CHACHA20_64BYTES
$ src/bench/bench_bitcoin -filter="CHACHA20_1MB|CHACHA20_256BYTES|CHACHA20_64BYTES"
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.92 | 522,069,205.88 | 0.1% | 0.02 | CHACHA20_1MB
| 1.94 | 514,707,650.50 | 0.2% | 0.01 | CHACHA20_256BYTES
| 2.02 | 495,706,912.69 | 0.0% | 0.01 | CHACHA20_64BYTES
</details>
If we are going to change it here, should also switch in
hash.cpp
Done, thanks! I grepped for it didn't find this one, probably because of the all-caps.
I guess this is fine, because the shift is constant and the compiler can skip the expensive modulo?
@maflcko Constant propagation + strength reduction should make that code not emit an actual modulo operation in any compiler from the last 20 years (it'll use x & (d-1) instead of x % d, if d is a known power of two).
But hopefully that pattern in libc++ also matches whatever pattern matching clang has to make sure it instead emits an actual rot instruction on platforms where it exists.
<!--9cd9c72976c961c55c7acef8f6ba82cd-->
lgtm ACK 842c288852c4d66de359e6907d9c82b7e618ab65
Any reason not to do Rotl in sha3 and ROTL in siphash as well?
Any reason not to do
Rotlin sha3 andROTLin siphash as well?
I don't see any, apparently I just need to learn to grep patterns better...
Done!
Concept ACK.
ACK 60446285436da62adef1c0a9b11c3336d82b4d89
Had a look at the difference in master vs this PR (rebased) on aarch64. Seems like for the most part, this gives the same output +- a handful of instructions. (Identical is identical minus difference in addresses / argument ordering etc).
| Function | GCC 13.2 | Clang 17.0.6 |
|---|---|---|
| MurmurHash3 | Identical | PR gives 5 less instructions: https://www.diffchecker.com/Jmlpl4rG/ |
| ChaCha20Aligned::Keystream | PR gives 1 less instruction: https://www.diffchecker.com/d6dp5tsk/ | Identical |
| KeccakF | Identical | Identical |
| CSipHasher:write (unsigned long) | Identical | Identical |
So this seems ok, unless anyone can find any major regressions. The output on corecheck also seems ok: https://corecheck.dev/bitcoin/bitcoin/pulls/29085.
Which side of the diff is which version?
Left is master @ f1fcc9638cde7664b9642018fe6872148bbb0172 , right was the PR based on f1fcc9638cde7664b9642018fe6872148bbb0172.
Looks like the improvement is a bit more pronouced for x86_64, at least when using Clang (17.0.6). For example, comparing MurmurHash3, master (left), and this PR (right), we get less code and more rol instructions: https://www.diffchecker.com/I7s5JeTr/. For GCC 13.2, this PR makes no difference to the generated assembly: https://www.diffchecker.com/nM5DgKfN/.