Unroll the ChaCha20 inner loop for performance #24946
pull: sipa wants to merge 1 commit into bitcoin:master from sipa:202204_unrollchacha, changing 1 file (+28 −20)
sipa commented at 4:32 pm on April 22, 2022: member
Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It’s a simple change; this performance helps with RNG generation and will matter more for BIP324.
DrahtBot added the label Utils/log/libs on Apr 22, 2022
in src/crypto/chacha20.cpp:124 in 7f3f84c833 outdated
- QUARTERROUND( x0, x5,x10,x15)
- QUARTERROUND( x1, x6,x11,x12)
- QUARTERROUND( x2, x7, x8,x13)
- QUARTERROUND( x3, x4, x9,x14)
- }
+
kristapsk commented at 5:28 pm on April 22, 2022:
Maybe add a comment that loop unrolling was done for performance reasons?
sipa commented at 6:01 pm on April 22, 2022:
Done.
kristapsk commented at 5:30 pm on April 22, 2022: contributor
Concept ACK, I see `./src/bench/bench_bitcoin` improvements with this change.
sipa force-pushed on Apr 22, 2022
MarcoFalke added the label DrahtBot Guix build requested on Apr 22, 2022
instagibbs commented at 5:58 pm on April 22, 2022: member
Can you give commands to run just those benches for those willing to replicate?
sipa commented at 5:59 pm on April 22, 2022: member
`./src/bench/bench_bitcoin -filter=.*CHACHA20_[1-9].*`
laanwj commented at 6:40 pm on April 22, 2022: member
I’m somewhat surprised unrolling a loop that is 10 times the same thing gives that much performance win on modern CPUs. But it’s just a few ROL instructions I guess, so the loop overhead easily dominates? Anyhow, concept ACK.
instagibbs commented at 6:42 pm on April 22, 2022: member
Getting a rough average of 15% speedup as well
sipa commented at 6:53 pm on April 22, 2022: member
@laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn’t need to stay in the same register every iteration), though I haven’t investigated what the difference in emitted asm is.
This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems.
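For readers following along, here is a minimal sketch of what "unrolling the inner loop" means for the ten ChaCha20 double rounds. It is not the code from this PR: the names `quarter_round`, `double_round`, `rounds_looped` and `rounds_unrolled` are hypothetical, and the state is held in an array rather than in the sixteen local variables `x0`..`x15` that `src/crypto/chacha20.cpp` uses. Both variants compute the same result; only the shape of the generated code can differ.

```cpp
#include <array>
#include <cstdint>

constexpr uint32_t rotl32(uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }

// One ChaCha20 quarter round on four 32-bit words of the state.
inline void quarter_round(uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& d)
{
    a += b; d = rotl32(d ^ a, 16);
    c += d; b = rotl32(b ^ c, 12);
    a += b; d = rotl32(d ^ a, 8);
    c += d; b = rotl32(b ^ c, 7);
}

using State = std::array<uint32_t, 16>;

// One double round: four column rounds followed by four diagonal rounds.
inline void double_round(State& x)
{
    quarter_round(x[0], x[4], x[8],  x[12]);
    quarter_round(x[1], x[5], x[9],  x[13]);
    quarter_round(x[2], x[6], x[10], x[14]);
    quarter_round(x[3], x[7], x[11], x[15]);
    quarter_round(x[0], x[5], x[10], x[15]);
    quarter_round(x[1], x[6], x[11], x[12]);
    quarter_round(x[2], x[7], x[8],  x[13]);
    quarter_round(x[3], x[4], x[9],  x[14]);
}

// Loop form: the compiler decides whether to unroll.
inline State rounds_looped(State x)
{
    for (int i = 0; i < 10; ++i) double_round(x);
    return x;
}

// Hand-unrolled form: ten explicit copies, no loop counter or branch.
inline State rounds_unrolled(State x)
{
    double_round(x); double_round(x); double_round(x); double_round(x); double_round(x);
    double_round(x); double_round(x); double_round(x); double_round(x); double_round(x);
    return x;
}
```

Whether the unrolled form is actually faster depends on the compiler and CPU, which is what the benchmarks in this thread try to establish.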
jonatack commented at 7:29 pm on April 22, 2022: member
Debian testing clang 15, normal (non-debug) build, fixed CPU speed, I’m not sure I’m seeing a difference. Trying again after optimizing and tuning further.
laanwj commented at 8:24 pm on April 22, 2022: member
Gcc 11.2.0, x86_64:
- The function `ChaCha20::Keystream` grows in size from 992 bytes to 3840 (doesn’t seem too bad, still fits in a page).
- One iteration of the loop looks like:
370: 41 01 ed             add %ebp,%r13d
373: 41 01 db             add %ebx,%r11d
376: 41 01 f2             add %esi,%r10d
379: 44 31 e9             xor %r13d,%ecx
37c: 44 31 da             xor %r11d,%edx
37f: 44 31 d0             xor %r10d,%eax
382: c1 c1 10             rol $0x10,%ecx
385: c1 c2 10             rol $0x10,%edx
388: 41 01 c9             add %ecx,%r9d
38b: 01 d7                add %edx,%edi
38d: c1 c0 10             rol $0x10,%eax
390: 44 31 cd             xor %r9d,%ebp
393: 31 fb                xor %edi,%ebx
395: 41 01 c4             add %eax,%r12d
398: c1 c5 0c             rol $0xc,%ebp
39b: c1 c3 0c             rol $0xc,%ebx
39e: 44 31 e6             xor %r12d,%esi
3a1: 41 01 ed             add %ebp,%r13d
3a4: 41 01 db             add %ebx,%r11d
3a7: c1 c6 0c             rol $0xc,%esi
3aa: 44 31 e9             xor %r13d,%ecx
3ad: 44 31 da             xor %r11d,%edx
3b0: 41 01 f2             add %esi,%r10d
3b3: c1 c1 08             rol $0x8,%ecx
3b6: c1 c2 08             rol $0x8,%edx
3b9: 44 31 d0             xor %r10d,%eax
3bc: 41 01 c9             add %ecx,%r9d
3bf: 01 d7                add %edx,%edi
3c1: 44 31 cd             xor %r9d,%ebp
3c4: 31 fb                xor %edi,%ebx
3c6: 89 7c 24 08          mov %edi,0x8(%rsp)
3ca: c1 c5 07             rol $0x7,%ebp
3cd: c1 c3 07             rol $0x7,%ebx
3d0: 44 89 4c 24 04       mov %r9d,0x4(%rsp)
3d5: c1 c0 08             rol $0x8,%eax
3d8: 45 01 f8             add %r15d,%r8d
3db: 41 01 dd             add %ebx,%r13d
3de: 45 31 c6             xor %r8d,%r14d
3e1: 41 01 c4             add %eax,%r12d
3e4: 44 89 f7             mov %r14d,%edi
3e7: 44 8b 74 24 0c       mov 0xc(%rsp),%r14d
3ec: 44 31 e6             xor %r12d,%esi
3ef: c1 c7 10             rol $0x10,%edi
3f2: c1 c6 07             rol $0x7,%esi
3f5: 41 01 fe             add %edi,%r14d
3f8: 41 01 f3             add %esi,%r11d
3fb: 45 31 f7             xor %r14d,%r15d
3fe: 45 89 f1             mov %r14d,%r9d
401: 44 31 d9             xor %r11d,%ecx
404: 41 c1 c7 0c          rol $0xc,%r15d
408: c1 c1 10             rol $0x10,%ecx
40b: 45 01 f8             add %r15d,%r8d
40e: 44 31 c7             xor %r8d,%edi
411: c1 c7 08             rol $0x8,%edi
414: 41 01 f9             add %edi,%r9d
417: 44 31 ef             xor %r13d,%edi
41a: c1 c7 10             rol $0x10,%edi
41d: 45 31 cf             xor %r9d,%r15d
420: 41 01 c9             add %ecx,%r9d
423: 41 01 fc             add %edi,%r12d
426: 41 c1 c7 07          rol $0x7,%r15d
42a: 44 31 e3             xor %r12d,%ebx
42d: c1 c3 0c             rol $0xc,%ebx
430: 41 01 dd             add %ebx,%r13d
433: 44 31 ef             xor %r13d,%edi
436: 41 89 fe             mov %edi,%r14d
439: 41 c1 c6 08          rol $0x8,%r14d
43d: 45 01 f4             add %r14d,%r12d
440: 44 31 e3             xor %r12d,%ebx
443: c1 c3 07             rol $0x7,%ebx
446: 44 31 ce             xor %r9d,%esi
449: 45 01 fa             add %r15d,%r10d
44c: 41 01 e8             add %ebp,%r8d
44f: c1 c6 0c             rol $0xc,%esi
452: 44 31 d2             xor %r10d,%edx
455: 44 31 c0             xor %r8d,%eax
458: c1 c2 10             rol $0x10,%edx
45b: 41 01 f3             add %esi,%r11d
45e: c1 c0 10             rol $0x10,%eax
461: 44 31 d9             xor %r11d,%ecx
464: c1 c1 08             rol $0x8,%ecx
467: 41 8d 3c 09          lea (%r9,%rcx,1),%edi
46b: 44 8b 4c 24 04       mov 0x4(%rsp),%r9d
470: 31 fe                xor %edi,%esi
472: 89 7c 24 0c          mov %edi,0xc(%rsp)
476: 8b 7c 24 08          mov 0x8(%rsp),%edi
47a: 41 01 d1             add %edx,%r9d
47d: c1 c6 07             rol $0x7,%esi
480: 01 c7                add %eax,%edi
482: 45 31 cf             xor %r9d,%r15d
485: 31 fd                xor %edi,%ebp
487: 41 c1 c7 0c          rol $0xc,%r15d
48b: c1 c5 0c             rol $0xc,%ebp
48e: 45 01 fa             add %r15d,%r10d
491: 41 01 e8             add %ebp,%r8d
494: 44 31 d2             xor %r10d,%edx
497: 44 31 c0             xor %r8d,%eax
49a: c1 c2 08             rol $0x8,%edx
49d: c1 c0 08             rol $0x8,%eax
4a0: 41 01 d1             add %edx,%r9d
4a3: 01 c7                add %eax,%edi
4a5: 45 31 cf             xor %r9d,%r15d
4a8: 31 fd                xor %edi,%ebp
4aa: 41 c1 c7 07          rol $0x7,%r15d
4ae: c1 c5 07             rol $0x7,%ebp
4b1: 83 6c 24 10 01       subl $0x1,0x10(%rsp)
4b6: 0f 85 b4 fe ff ff    jne 370 <ChaCha20::Keystream(unsigned char*, unsigned long)+0x140>
- The unrolling indeed causes different register allocation, as well as instructions from multiple iterations to be interspersed (maybe better for scheduling, maybe it’s possible to combine?).
- Benchmarks before on old AMD Phenom(tm) II X6 1075T:
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 2.18 | 459,187,395.75 | 0.3% | 0.03 | `CHACHA20_1MB`
| 2.21 | 452,155,530.63 | 0.2% | 0.01 | `CHACHA20_256BYTES`
| 2.34 | 427,257,435.31 | 0.0% | 0.01 | `CHACHA20_64BYTES`
- Benchmarks after on same (~12% speedup):
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 1.91 | 523,324,820.67 | 0.4% | 0.02 | `CHACHA20_1MB`
| 1.94 | 516,638,576.63 | 0.0% | 0.01 | `CHACHA20_256BYTES`
| 2.22 | 451,258,216.13 | 4.6% | 0.01 | `CHACHA20_64BYTES`
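A note on reading these numbers: the thread does not say which convention each commenter uses for "speedup". The ~12% quoted above matches the reduction in ns/byte for `CHACHA20_1MB` (2.18 to 1.91), while the same data expressed as a gain in byte/s is about 14%. The sketch below shows both calculations; the helper names are hypothetical and not part of the benchmark tooling.

```cpp
// Percentage of time saved per byte: (before - after) / before.
constexpr double time_reduction_percent(double ns_before, double ns_after)
{
    return (ns_before - ns_after) / ns_before * 100.0;
}

// Percentage of extra bytes processed per second: before / after - 1.
constexpr double throughput_gain_percent(double ns_before, double ns_after)
{
    return (ns_before / ns_after - 1.0) * 100.0;
}
```

For small improvements the two figures are close, but they drift apart as the change grows, so it helps to state which one is meant when comparing results across machines.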
jonatack commented at 9:30 pm on April 22, 2022: member
Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with `pyperf system tune`, non-debug build, seeing roughly a 3 to 4% improvement.
Linux 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux.

Debian clang version 15.0.0-++20220422111431+ba46ae7bd853-1~exp1~20220422111525.449
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
master

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 2.43 | 410,814,309.33 | 0.3% | 18.61 | 6.29 | 2.957 | 0.20 | 0.0% | 0.03 | `CHACHA20_1MB`
| 2.46 | 406,907,108.96 | 0.0% | 18.89 | 6.37 | 2.965 | 0.22 | 0.0% | 0.01 | `CHACHA20_256BYTES`
| 2.59 | 385,499,110.76 | 1.0% | 19.72 | 6.68 | 2.952 | 0.28 | 0.0% | 0.01 | `CHACHA20_64BYTES`

branch

| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 2.35 | 425,969,024.53 | 0.7% | 16.70 | 6.07 | 2.752 | 0.05 | 0.0% | 0.03 | `CHACHA20_1MB`
| 2.37 | 422,279,272.14 | 0.0% | 17.14 | 6.14 | 2.792 | 0.07 | 0.0% | 0.01 | `CHACHA20_256BYTES`
| 2.52 | 396,803,365.77 | 0.1% | 18.45 | 6.53 | 2.825 | 0.13 | 0.0% | 0.01 | `CHACHA20_64BYTES`
Edit: re-ran the bench a dozen times each to verify that these results are representative.
ajtowns commented at 8:50 am on April 23, 2022: member
I’m seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled.)
Did you consider just changing the `for() { ... }` loop to `REPEAT10( ... )` with `#define REPEAT10(a) a a a a a a a a a a`?
laanwj commented at 9:50 am on April 23, 2022: member
- gcc 11.2.0, RISC-V 64-bit (SiFive Unmatched, 1.2 GHz): speedup is there, but much less pronounced (~5%):
| ns/byte | byte/s | err% | ins/byte | cyc/byte | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
Before:
| 22.29 | 44,862,631.89 | 0.8% | 0.00 | 0.00 | 0.00 | 0.0% | 0.26 | `CHACHA20_1MB`
After:
| 21.23 | 47,101,646.21 | 0.9% | 0.00 | 0.00 | 0.00 | 0.0% | 0.25 | `CHACHA20_1MB`
- gcc 10.2.1, aarch64 (custom i.MX8MQ board, 1 GHz), ~8% speedup:
| ns/byte | byte/s | err% | ins/byte | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
Before:
| 6.04 | 165,526,246.91 | 0.1% | 16.84 | 0.16 | 11.8% | 0.07 | `CHACHA20_1MB`
After:
| 5.58 | 179,185,196.22 | 0.1% | 15.86 | 0.02 | 0.0% | 0.06 | `CHACHA20_1MB`
It’s a nice speedup, and a simple change. Tested ACK 4f3a18906880b065b6119ccf32b2875748b297b2
> Did you consider just changing the `for() { … }` loop to `REPEAT10( … )` with `#define REPEAT10(a) a a a a a a a a a a`?
I like this idea: it is more elegant than copy/pasting and makes it immediately clear that every repetition is the same. I would guess the generated code is exactly the same.
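As a rough illustration of the `REPEAT10` suggestion, the sketch below passes a multi-statement body to the macro and compares it with an ordinary loop. The functions `step_loop` and `step_macro` and the arithmetic inside them are made up for the example; only the macro definition is taken from the comment above.

```cpp
#include <cstdint>

#define REPEAT10(a) a a a a a a a a a a

// Reference: ten iterations written as a normal loop.
inline uint32_t step_loop(uint32_t v)
{
    uint32_t steps = 0;
    for (int i = 0; i < 10; ++i) {
        v = v * 3u + 1u;
        ++steps;
    }
    return v + steps;
}

// Same work, but the preprocessor pastes the body ten times in a row,
// so the compiler never sees a loop at all.
inline uint32_t step_macro(uint32_t v)
{
    uint32_t steps = 0;
    REPEAT10(
        v = v * 3u + 1u;
        ++steps;
    )
    return v + steps;
}
```

One caveat when using a macro this way: a comma at the top level of the argument would split it into several macro arguments, whereas commas inside parentheses (as in a call like `QUARTERROUND(x0, x4, x8, x12)`) are fine.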
MarcoFalke commented at 5:10 pm on April 23, 2022: member
Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?)
gcc-12 Before:
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 2.23 | 447,617,214.06 | 0.2% | 0.03 | `CHACHA20_1MB`
| 2.26 | 441,653,947.12 | 0.1% | 0.01 | `CHACHA20_256BYTES`
| 2.50 | 399,993,391.82 | 6.1% | 0.01 | :wavy_dash: `CHACHA20_64BYTES` (Unstable with ~6,241.4 iters. Increase `minEpochIterations` to e.g. 62414)
| 7.03 | 142,173,319.29 | 10.1% | 0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
| 3.26 | 307,218,931.17 | 1.7% | 0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
| 8.83 | 113,259,198.67 | 1.3% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
| 4.28 | 233,685,573.34 | 0.4% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
| 15.78 | 63,391,055.77 | 0.6% | 0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
| 7.71 | 129,684,901.52 | 0.4% | 0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
gcc-12 After:
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 2.20 | 454,707,913.08 | 0.8% | 0.03 | `CHACHA20_1MB`
| 2.36 | 424,359,263.25 | 4.9% | 0.01 | `CHACHA20_256BYTES`
| 2.41 | 414,622,602.59 | 0.4% | 0.01 | `CHACHA20_64BYTES`
| 6.99 | 143,089,808.99 | 7.2% | 0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
| 3.26 | 306,926,493.73 | 4.2% | 0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
| 9.59 | 104,251,645.58 | 8.6% | 0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~402.1 iters. Increase `minEpochIterations` to e.g. 4021)
| 4.33 | 230,986,007.33 | 0.6% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
| 16.23 | 61,602,235.65 | 1.7% | 0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
| 9.63 | 103,830,365.13 | 9.9% | 0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,639.9 iters. Increase `minEpochIterations` to e.g. 16399)
gcc-10 Before:
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 2.26 | 442,527,877.02 | 0.5% | 0.03 | `CHACHA20_1MB`
| 2.30 | 435,535,172.72 | 1.9% | 0.01 | `CHACHA20_256BYTES`
| 2.39 | 418,262,709.74 | 0.4% | 0.01 | `CHACHA20_64BYTES`
| 6.93 | 144,210,951.65 | 5.9% | 0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
| 3.16 | 316,109,217.24 | 4.8% | 0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
| 8.43 | 118,625,079.49 | 0.3% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
| 4.18 | 239,143,934.28 | 0.2% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
| 16.05 | 62,308,156.96 | 5.2% | 0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT` (Unstable with ~961.0 iters. Increase `minEpochIterations` to e.g. 9610)
| 7.63 | 131,070,821.81 | 0.1% | 0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
gcc-10 After:
| ns/byte | byte/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 2.20 | 454,351,689.08 | 0.2% | 0.03 | `CHACHA20_1MB`
| 2.40 | 416,825,911.73 | 4.4% | 0.01 | `CHACHA20_256BYTES`
| 2.40 | 416,369,054.39 | 0.2% | 0.01 | `CHACHA20_64BYTES`
| 6.58 | 151,882,394.04 | 10.5% | 0.08 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
| 3.03 | 329,600,644.76 | 0.9% | 0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
| 9.40 | 106,431,172.41 | 10.1% | 0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~434.9 iters. Increase `minEpochIterations` to e.g. 4349)
| 4.30 | 232,776,146.25 | 0.2% | 0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
| 16.17 | 61,831,918.45 | 1.3% | 0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
| 8.83 | 113,301,205.50 | 5.8% | 0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,771.5 iters. Increase `minEpochIterations` to e.g. 17715)
martinus commented at 5:45 pm on April 23, 2022: contributor
I get the same 1-3% speedup on my i7. In my test, adding `#pragma GCC unroll 10` in front of the loop seems to produce exactly the same unrolled loop as the hand-coded one; this works for GCC and clang.
Side note 1: use e.g. `./src/bench/bench_bitcoin -filter="CHACHA20.*" -min_time=2000` to run each test for 2 seconds to get more stable results.
Side note 2: No need to quote the result, it’s markdown :slightly_smiling_face:
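My results on i7-8700, with clang 13.0.1:
master
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 1.91 | 523,770,793.06 | 0.1% | 18.52 | 6.08 | 3.043 | 0.20 | 0.0% | 1.09 | `CHACHA20_1MB`
| 1.94 | 515,227,758.97 | 0.3% | 18.79 | 6.16 | 3.048 | 0.22 | 0.0% | 1.10 | `CHACHA20_256BYTES`
| 2.02 | 494,527,885.82 | 0.2% | 19.61 | 6.44 | 3.046 | 0.28 | 0.0% | 1.10 | `CHACHA20_64BYTES`
branch
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 1.83 | 547,223,233.51 | 0.0% | 17.08 | 5.83 | 2.931 | 0.05 | 0.0% | 1.07 | `CHACHA20_1MB`
| 1.87 | 535,851,391.81 | 0.1% | 17.51 | 5.95 | 2.942 | 0.07 | 0.0% | 1.10 | `CHACHA20_256BYTES`
| 1.98 | 504,774,917.46 | 0.0% | 18.81 | 6.32 | 2.977 | 0.13 | 0.0% | 1.10 | `CHACHA20_64BYTES`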
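The `#pragma GCC unroll 10` alternative mentioned above could look roughly like the following. This is a sketch with a toy round function, not the ChaCha20 code from the PR; `toy_round`, `ten_rounds_plain` and `ten_rounds_pragma` are hypothetical names. The pragma is only a request to the compiler, so the two functions always return the same value and differ, at most, in the generated machine code.

```cpp
#include <cstdint>

constexpr uint32_t rotl32(uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }

// Stand-in for one round of work; NOT the real ChaCha20 double round.
inline uint32_t toy_round(uint32_t v) { return rotl32(v, 7) + 0x9e3779b9u; }

// Plain loop: unrolling is left entirely to the optimizer's heuristics.
inline uint32_t ten_rounds_plain(uint32_t v)
{
    for (int i = 0; i < 10; ++i) v = toy_round(v);
    return v;
}

// Same loop with an unroll request placed directly in front of it.
// A compiler that does not know the pragma is expected to ignore it,
// possibly with a warning.
inline uint32_t ten_rounds_pragma(uint32_t v)
{
#pragma GCC unroll 10
    for (int i = 0; i < 10; ++i) v = toy_round(v);
    return v;
}
```

Compared with a macro or copy/paste, the pragma keeps the source as a loop, at the cost of depending on compiler support for the hint.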
Empact commented at 11:15 pm on April 23, 2022: member
+1 for `#pragma unroll` or similar
laanwj commented at 9:03 am on April 27, 2022: member
So I think the conclusion here is that on i7 there’s no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that’s OK with me.
MarcoFalke commented at 6:28 pm on May 4, 2022: member
TIL that it is possible to pass multiple lines as an argument to a macro
sipa commented at 6:29 pm on May 4, 2022: member
> TIL that it is possible to pass multiple lines as an argument to a macro
You clearly never saw the original serialization code this codebase had ;)
sipa force-pushed on May 4, 2022
in src/crypto/chacha20.cpp:21 in 266bf15ddc outdated
@@ -18,6 +18,8 @@ constexpr static inline uint32_t rotl32(uint32_t v, int c) { return (v << c) | (
 a += b; d = rotl32(d ^ a, 8); \
 c += d; b = rotl32(b ^ c, 7);

+#define REPEAT10(a) a a a a a a a a a a
MarcoFalke commented at 6:51 pm on May 4, 2022:
`#define REPEAT10(a) do { a a a a a a a a a a } while (0)`
nit: Shouldn’t this use do-while?
Otherwise writing
`if (blub) REPEAT10(bla());`
will do the wrong thing?
Also, leaving out the semicolon after the do-while in the definition makes the compiler enforce that one is placed after the call.
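To make the concern concrete, here is a small sketch of how the bare macro misbehaves under an unbraced `if`, and how the do-while form avoids it. `REPEAT10_BARE`, `guarded_bare` and `guarded_wrapped` are hypothetical names for this example; the wrapped definition is the one proposed in the comment above.

```cpp
// Bare form: expands to ten statements in a row.
#define REPEAT10_BARE(a) a a a a a a a a a a
// Wrapped form: expands to a single statement.
#define REPEAT10(a) do { a a a a a a a a a a } while (0)

// With the bare macro only the FIRST copy is governed by the `if`;
// the other nine run unconditionally.
inline int guarded_bare(bool cond)
{
    int n = 0;
    if (cond) REPEAT10_BARE(++n;)
    return n;
}

// With the do-while wrapper all ten copies are inside the `if`, and the
// call site must supply the trailing semicolon itself.
inline int guarded_wrapped(bool cond)
{
    int n = 0;
    if (cond) REPEAT10(++n;);
    return n;
}
```

With `cond` false, `guarded_bare` still increments nine times while `guarded_wrapped` increments zero times; with `cond` true both increment ten times.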
sipa commented at 6:53 pm on May 4, 2022:
Done.
Unroll the ChaCha20 inner loop for performance 81c09ee45c
sipa force-pushed on May 4, 2022
martinus commented at 5:47 am on May 5, 2022: contributor
tested ACK 81c09ee with clang++ 13.0.1, test `CHACHA20_1MB`:
- 4.3% faster on i9-9960X
- 4.5% faster on i9-9980HK
- 4.4% faster on i7-8700
DrahtBot commented at 1:53 am on May 8, 2022: member
Guix builds
DrahtBot removed the label DrahtBot Guix build requested on May 8, 2022
MarcoFalke commented at 11:54 am on May 9, 2022: member
- A few percent faster on AMD EPYC as well with gcc-9/gcc-11.2/gcc-12.1/clang-14
- Same on AMD EPYC with guix built bench
- Same on Cortex-A72 with guix built bench
MarcoFalke commented at 11:56 am on May 9, 2022: member
ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
Signature:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUgxzQv9FMC3MiK58jmwXRv26Mf41HrwpXJawhRSU/j+VM0Vq9JI6RlIkZ3E5Biy
EKOxtL9cMKv6cMOyE5bihZF3uIqnwJCMAx+8cb+/6RYm33UseEMHxX/S8T+Q8/vy
4r5BU/kisbX77yAjooN7Lr0/nKSv2E8APFjvcp7NIkWkx89W2zrk9z4eoFS5Dri/
yAbMpc95eTtu4gmsbjNNE73/Q1MsdfXiBgzwP8ToV/grzoZPpBTt7dsb1QRRjn1N
NAY/xG1p1kFo7ORbJ0ZHiKE4waat0Erqi8MX35f5mkMVa47X5VdDuP1FGn191f9K
oS6cfgSZr4d+SE3SFer56/3QOVToa06VmxjmKoRv0j12S7NVOxnjRNjwN6XkhgoK
wlpkNa3HxNxdMNmaUDqxXk5Z1zH5RCjZwiPQuMG5sExjemAAJXOFQ8WYnJFGp04R
dFlXeMTy2ZQWMWoEMhdJ2jCDjvggjMW8t51VA3+GQvr8ZZmN10dzXPA+Qi1c25es
QNkpUvPg
=2W4Z
-----END PGP SIGNATURE-----
MarcoFalke merged this on May 9, 2022
MarcoFalke closed this on May 9, 2022
sidhujag referenced this in commit 346bcd37d7 on May 9, 2022
DrahtBot locked this on May 9, 2023
This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-21 15:12 UTC
More mirrored repositories can be found on mirror.b10c.me