Unroll the ChaCha20 inner loop for performance

sipa commented at 4:32 pm on April 22, 2022: member

Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It’s a simple change, and the performance gain helps with RNG generation and will matter more for BIP324.
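
For readers unfamiliar with the structure being unrolled, here is a minimal sketch of the idea. It assumes the standard ChaCha20 layout of 10 double rounds over a 16-word state; the names `State`, `Rotl32`, `QuarterRound`, `DoubleRound`, `RoundsLooped` and `RoundsUnrolled` are hypothetical and are not taken from `src/crypto/chacha20.cpp`, which (judging by the diff below) keeps the state in local variables `x0`..`x15` and uses a `QUARTERROUND` macro.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names for illustration only; not the code from this pull request.
using State = std::array<uint32_t, 16>;

// Rotate a 32-bit word left by c bits (0 < c < 32).
constexpr uint32_t Rotl32(uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }

// One ChaCha quarter round on four words of the state.
inline void QuarterRound(State& x, int a, int b, int c, int d)
{
    x[a] += x[b]; x[d] = Rotl32(x[d] ^ x[a], 16);
    x[c] += x[d]; x[b] = Rotl32(x[b] ^ x[c], 12);
    x[a] += x[b]; x[d] = Rotl32(x[d] ^ x[a], 8);
    x[c] += x[d]; x[b] = Rotl32(x[b] ^ x[c], 7);
}

// One double round: four column rounds, then four diagonal rounds.
inline void DoubleRound(State& x)
{
    QuarterRound(x, 0, 4, 8, 12);
    QuarterRound(x, 1, 5, 9, 13);
    QuarterRound(x, 2, 6, 10, 14);
    QuarterRound(x, 3, 7, 11, 15);
    QuarterRound(x, 0, 5, 10, 15);
    QuarterRound(x, 1, 6, 11, 12);
    QuarterRound(x, 2, 7, 8, 13);
    QuarterRound(x, 3, 4, 9, 14);
}

// Looped form: 10 double rounds, with a counter and a branch per iteration.
inline State RoundsLooped(State x)
{
    for (int i = 0; i < 10; ++i) DoubleRound(x);
    return x;
}

// Hand-unrolled form: the same 10 double rounds written out explicitly.
inline State RoundsUnrolled(State x)
{
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
    return x;
}
```

Both functions compute the same result for any input state; only the shape of the code handed to the compiler differs. Whether the unrolled form is actually faster depends on the compiler and CPU, as the measurements in this thread show.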

DrahtBot added the label Utils/log/libs on Apr 22, 2022

in src/crypto/chacha20.cpp:124 in 7f3f84c833 outdated

127-            QUARTERROUND( x0, x5,x10,x15)
128-            QUARTERROUND( x1, x6,x11,x12)
129-            QUARTERROUND( x2, x7, x8,x13)
130-            QUARTERROUND( x3, x4, x9,x14)
131-        }
132+

kristapsk commented at 5:28 pm on April 22, 2022:

Maybe add a comment that loop unrolling was done for performance reasons?

sipa commented at 6:01 pm on April 22, 2022:

Done.

kristapsk commented at 5:30 pm on April 22, 2022: contributor

Concept ACK, I see ./src/bench/bench_bitcoin improvements with this change.

sipa force-pushed on Apr 22, 2022

MarcoFalke added the label DrahtBot Guix build requested on Apr 22, 2022

instagibbs commented at 5:58 pm on April 22, 2022: member

Can you give commands to run just those benches for those willing to replicate?

sipa commented at 5:59 pm on April 22, 2022: member

@instagibbs

./src/bench/bench_bitcoin -filter=.*CHACHA20_[1-9].*

laanwj commented at 6:40 pm on April 22, 2022: member

I’m somewhat surprised that unrolling a loop that does the same thing 10 times gives that much of a performance win on modern CPUs. But it’s just a few ROL instructions, I guess, so the loop overhead easily dominates? Anyhow, concept ACK.

instagibbs commented at 6:42 pm on April 22, 2022: member

Getting a rough average of 15% speedup as well

sipa commented at 6:53 pm on April 22, 2022: member

@laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn’t need to stay in the same register every iteration), though I haven’t investigated what the difference in emitted asm is.

This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems.

jonatack commented at 7:29 pm on April 22, 2022: member

Debian testing clang 15, normal (non-debug) build, fixed CPU speed, I’m not sure I’m seeing a difference. Trying again after optimizing and tuning further.

laanwj commented at 8:24 pm on April 22, 2022: member

Gcc 11.2.0, x86_64:

The function ChaCha20::Keystream grows in size from 992 bytes to 3840 (doesn’t seem too bad, still fits in a page).
One iteration of the loop looks like:

370:	41 01 ed             	add    %ebp,%r13d
373:	41 01 db             	add    %ebx,%r11d
376:	41 01 f2             	add    %esi,%r10d
379:	44 31 e9             	xor    %r13d,%ecx
37c:	44 31 da             	xor    %r11d,%edx
37f:	44 31 d0             	xor    %r10d,%eax
382:	c1 c1 10             	rol    $0x10,%ecx
385:	c1 c2 10             	rol    $0x10,%edx
388:	41 01 c9             	add    %ecx,%r9d
38b:	01 d7                	add    %edx,%edi
38d:	c1 c0 10             	rol    $0x10,%eax
390:	44 31 cd             	xor    %r9d,%ebp
393:	31 fb                	xor    %edi,%ebx
395:	41 01 c4             	add    %eax,%r12d
398:	c1 c5 0c             	rol    $0xc,%ebp
39b:	c1 c3 0c             	rol    $0xc,%ebx
39e:	44 31 e6             	xor    %r12d,%esi
3a1:	41 01 ed             	add    %ebp,%r13d
3a4:	41 01 db             	add    %ebx,%r11d
3a7:	c1 c6 0c             	rol    $0xc,%esi
3aa:	44 31 e9             	xor    %r13d,%ecx
3ad:	44 31 da             	xor    %r11d,%edx
3b0:	41 01 f2             	add    %esi,%r10d
3b3:	c1 c1 08             	rol    $0x8,%ecx
3b6:	c1 c2 08             	rol    $0x8,%edx
3b9:	44 31 d0             	xor    %r10d,%eax
3bc:	41 01 c9             	add    %ecx,%r9d
3bf:	01 d7                	add    %edx,%edi
3c1:	44 31 cd             	xor    %r9d,%ebp
3c4:	31 fb                	xor    %edi,%ebx
3c6:	89 7c 24 08          	mov    %edi,0x8(%rsp)
3ca:	c1 c5 07             	rol    $0x7,%ebp
3cd:	c1 c3 07             	rol    $0x7,%ebx
3d0:	44 89 4c 24 04       	mov    %r9d,0x4(%rsp)
3d5:	c1 c0 08             	rol    $0x8,%eax
3d8:	45 01 f8             	add    %r15d,%r8d
3db:	41 01 dd             	add    %ebx,%r13d
3de:	45 31 c6             	xor    %r8d,%r14d
3e1:	41 01 c4             	add    %eax,%r12d
3e4:	44 89 f7             	mov    %r14d,%edi
3e7:	44 8b 74 24 0c       	mov    0xc(%rsp),%r14d
3ec:	44 31 e6             	xor    %r12d,%esi
3ef:	c1 c7 10             	rol    $0x10,%edi
3f2:	c1 c6 07             	rol    $0x7,%esi
3f5:	41 01 fe             	add    %edi,%r14d
3f8:	41 01 f3             	add    %esi,%r11d
3fb:	45 31 f7             	xor    %r14d,%r15d
3fe:	45 89 f1             	mov    %r14d,%r9d
401:	44 31 d9             	xor    %r11d,%ecx
404:	41 c1 c7 0c          	rol    $0xc,%r15d
408:	c1 c1 10             	rol    $0x10,%ecx
40b:	45 01 f8             	add    %r15d,%r8d
40e:	44 31 c7             	xor    %r8d,%edi
411:	c1 c7 08             	rol    $0x8,%edi
414:	41 01 f9             	add    %edi,%r9d
417:	44 31 ef             	xor    %r13d,%edi
41a:	c1 c7 10             	rol    $0x10,%edi
41d:	45 31 cf             	xor    %r9d,%r15d
420:	41 01 c9             	add    %ecx,%r9d
423:	41 01 fc             	add    %edi,%r12d
426:	41 c1 c7 07          	rol    $0x7,%r15d
42a:	44 31 e3             	xor    %r12d,%ebx
42d:	c1 c3 0c             	rol    $0xc,%ebx
430:	41 01 dd             	add    %ebx,%r13d
433:	44 31 ef             	xor    %r13d,%edi
436:	41 89 fe             	mov    %edi,%r14d
439:	41 c1 c6 08          	rol    $0x8,%r14d
43d:	45 01 f4             	add    %r14d,%r12d
440:	44 31 e3             	xor    %r12d,%ebx
443:	c1 c3 07             	rol    $0x7,%ebx
446:	44 31 ce             	xor    %r9d,%esi
449:	45 01 fa             	add    %r15d,%r10d
44c:	41 01 e8             	add    %ebp,%r8d
44f:	c1 c6 0c             	rol    $0xc,%esi
452:	44 31 d2             	xor    %r10d,%edx
455:	44 31 c0             	xor    %r8d,%eax
458:	c1 c2 10             	rol    $0x10,%edx
45b:	41 01 f3             	add    %esi,%r11d
45e:	c1 c0 10             	rol    $0x10,%eax
461:	44 31 d9             	xor    %r11d,%ecx
464:	c1 c1 08             	rol    $0x8,%ecx
467:	41 8d 3c 09          	lea    (%r9,%rcx,1),%edi
46b:	44 8b 4c 24 04       	mov    0x4(%rsp),%r9d
470:	31 fe                	xor    %edi,%esi
472:	89 7c 24 0c          	mov    %edi,0xc(%rsp)
476:	8b 7c 24 08          	mov    0x8(%rsp),%edi
47a:	41 01 d1             	add    %edx,%r9d
47d:	c1 c6 07             	rol    $0x7,%esi
480:	01 c7                	add    %eax,%edi
482:	45 31 cf             	xor    %r9d,%r15d
485:	31 fd                	xor    %edi,%ebp
487:	41 c1 c7 0c          	rol    $0xc,%r15d
48b:	c1 c5 0c             	rol    $0xc,%ebp
48e:	45 01 fa             	add    %r15d,%r10d
491:	41 01 e8             	add    %ebp,%r8d
494:	44 31 d2             	xor    %r10d,%edx
497:	44 31 c0             	xor    %r8d,%eax
49a:	c1 c2 08             	rol    $0x8,%edx
49d:	c1 c0 08             	rol    $0x8,%eax
4a0:	41 01 d1             	add    %edx,%r9d
4a3:	01 c7                	add    %eax,%edi
4a5:	45 31 cf             	xor    %r9d,%r15d
4a8:	31 fd                	xor    %edi,%ebp
4aa:	41 c1 c7 07          	rol    $0x7,%r15d
4ae:	c1 c5 07             	rol    $0x7,%ebp
4b1:	83 6c 24 10 01       	subl   $0x1,0x10(%rsp)
4b6:	0f 85 b4 fe ff ff    	jne    370 <ChaCha20::Keystream(unsigned char*, unsigned long)+0x140>

The unrolling indeed causes different register allocation, as well as instructions from multiple iterations to be interspersed (maybe better for scheduling, maybe it’s possible to combine?).
Benchmarks before on old AMD Phenom(tm) II X6 1075T:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.18 |      459,187,395.75 |    0.3% |      0.03 | `CHACHA20_1MB`
|                2.21 |      452,155,530.63 |    0.2% |      0.01 | `CHACHA20_256BYTES`
|                2.34 |      427,257,435.31 |    0.0% |      0.01 | `CHACHA20_64BYTES`

Benchmarks after on same (~12% speedup):

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                1.91 |      523,324,820.67 |    0.4% |      0.02 | `CHACHA20_1MB`
|                1.94 |      516,638,576.63 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.22 |      451,258,216.13 |    4.6% |      0.01 | `CHACHA20_64BYTES`

jonatack commented at 9:30 pm on April 22, 2022: member

Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with pyperf system tune, non-debug build, seeing roughly a 3 to 4% improvement.

Linux 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux.

Debian clang version 15.0.0-++20220422111431+ba46ae7bd853-1~exp1~20220422111525.449
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

master

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                2.43 |      410,814,309.33 |    0.3% |           18.61 |            6.29 |  2.957 |           0.20 |    0.0% |      0.03 | `CHACHA20_1MB`
|                2.46 |      406,907,108.96 |    0.0% |           18.89 |            6.37 |  2.965 |           0.22 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.59 |      385,499,110.76 |    1.0% |           19.72 |            6.68 |  2.952 |           0.28 |    0.0% |      0.01 | `CHACHA20_64BYTES`

branch

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                2.35 |      425,969,024.53 |    0.7% |           16.70 |            6.07 |  2.752 |           0.05 |    0.0% |      0.03 | `CHACHA20_1MB`
|                2.37 |      422,279,272.14 |    0.0% |           17.14 |            6.14 |  2.792 |           0.07 |    0.0% |      0.01 | `CHACHA20_256BYTES`
|                2.52 |      396,803,365.77 |    0.1% |           18.45 |            6.53 |  2.825 |           0.13 |    0.0% |      0.01 | `CHACHA20_64BYTES`

Edit: re-ran the bench a dozen times each to verify that these results are representative.

ajtowns commented at 8:50 am on April 23, 2022: member

I’m seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled)

Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?
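
A minimal sketch of how such a macro behaves, assuming the `REPEAT10` definition quoted in the comment above. The function `RotateLeftTen` and its `steps` counter are hypothetical and exist only to make the ten expansions observable.

```cpp
#include <cstdint>

// Expands its argument ten times in a row (definition as suggested above).
#define REPEAT10(a) a a a a a a a a a a

// Hypothetical demo: rotates v left by one bit per expansion and counts
// how many times the macro body was actually executed.
inline uint32_t RotateLeftTen(uint32_t v, int& steps)
{
    REPEAT10(
        v = (v << 1) | (v >> 31);
        ++steps;
    )
    return v;
}
```

The whole multi-statement block is passed as a single macro argument. One caveat: a comma that is not enclosed in parentheses would split the argument in two, so bodies containing such commas need extra care.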

laanwj commented at 9:50 am on April 23, 2022: member

gcc 11.2.0, RISC-V 64-bit (SiFive Unmatched, 1.2 GHz): speedup is there, but much less pronounced (~5%):

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
Before:
|               22.29 |       44,862,631.89 |    0.8% |            0.00 |            0.00 |           0.00 |    0.0% |      0.26 | `CHACHA20_1MB`
After:
|               21.23 |       47,101,646.21 |    0.9% |            0.00 |            0.00 |           0.00 |    0.0% |      0.25 | `CHACHA20_1MB`

gcc 10.2.1, aarch64 (custom i.MX8MQ board, 1 GHz), ~8% speedup:

|             ns/byte |              byte/s |    err% |        ins/byte |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
Before:
|                6.04 |      165,526,246.91 |    0.1% |           16.84 |           0.16 |   11.8% |      0.07 | `CHACHA20_1MB`
After:
|                5.58 |      179,185,196.22 |    0.1% |           15.86 |           0.02 |    0.0% |      0.06 | `CHACHA20_1MB`

It’s a nice speedup, and a simple change, tested ACK 4f3a18906880b065b6119ccf32b2875748b297b2

> Did you consider just changing the for() { … } loop to REPEAT10( … ) with #define REPEAT10(a) a a a a a a a a a a ?

I like this idea: it is more elegant than copy/pasting, and it makes it immediately clear that every repetition is the same. I would guess the generated code is exactly the same.

MarcoFalke commented at 5:10 pm on April 23, 2022: member

Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?)

gcc-12 Before:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.23 |      447,617,214.06 |    0.2% |      0.03 | `CHACHA20_1MB`
|                2.26 |      441,653,947.12 |    0.1% |      0.01 | `CHACHA20_256BYTES`
|                2.50 |      399,993,391.82 |    6.1% |      0.01 | :wavy_dash: `CHACHA20_64BYTES` (Unstable with ~6,241.4 iters. Increase `minEpochIterations` to e.g. 62414)
|                7.03 |      142,173,319.29 |   10.1% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.26 |      307,218,931.17 |    1.7% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                8.83 |      113,259,198.67 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
|                4.28 |      233,685,573.34 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               15.78 |       63,391,055.77 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                7.71 |      129,684,901.52 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`

gcc-12 After:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.20 |      454,707,913.08 |    0.8% |      0.03 | `CHACHA20_1MB`
|                2.36 |      424,359,263.25 |    4.9% |      0.01 | `CHACHA20_256BYTES`
|                2.41 |      414,622,602.59 |    0.4% |      0.01 | `CHACHA20_64BYTES`
|                6.99 |      143,089,808.99 |    7.2% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.26 |      306,926,493.73 |    4.2% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                9.59 |      104,251,645.58 |    8.6% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~402.1 iters. Increase `minEpochIterations` to e.g. 4021)
|                4.33 |      230,986,007.33 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.23 |       61,602,235.65 |    1.7% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                9.63 |      103,830,365.13 |    9.9% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,639.9 iters. Increase `minEpochIterations` to e.g. 16399)

gcc-10 Before:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.26 |      442,527,877.02 |    0.5% |      0.03 | `CHACHA20_1MB`
|                2.30 |      435,535,172.72 |    1.9% |      0.01 | `CHACHA20_256BYTES`
|                2.39 |      418,262,709.74 |    0.4% |      0.01 | `CHACHA20_64BYTES`
|                6.93 |      144,210,951.65 |    5.9% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.16 |      316,109,217.24 |    4.8% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                8.43 |      118,625,079.49 |    0.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
|                4.18 |      239,143,934.28 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.05 |       62,308,156.96 |    5.2% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT` (Unstable with ~961.0 iters. Increase `minEpochIterations` to e.g. 9610)
|                7.63 |      131,070,821.81 |    0.1% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`

gcc-10 After:

|             ns/byte |              byte/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|                2.20 |      454,351,689.08 |    0.2% |      0.03 | `CHACHA20_1MB`
|                2.40 |      416,825,911.73 |    4.4% |      0.01 | `CHACHA20_256BYTES`
|                2.40 |      416,369,054.39 |    0.2% |      0.01 | `CHACHA20_64BYTES`
|                6.58 |      151,882,394.04 |   10.5% |      0.08 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
|                3.03 |      329,600,644.76 |    0.9% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
|                9.40 |      106,431,172.41 |   10.1% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~434.9 iters. Increase `minEpochIterations` to e.g. 4349)
|                4.30 |      232,776,146.25 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
|               16.17 |       61,831,918.45 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
|                8.83 |      113,301,205.50 |    5.8% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,771.5 iters. Increase `minEpochIterations` to e.g. 17715)

martinus commented at 5:45 pm on April 23, 2022: contributor

I get the same 1-3% speedup on my i7. In my test, adding #pragma GCC unroll 10 in front of the loop seems to produce exactly the same unrolled loop as the hand-coded version. This works for GCC and clang.
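
A small sketch of the pragma placement described above. The mixing function is a made-up stand-in for the real round body, and the names `MixPlain` and `MixUnrolled` are hypothetical. The pragma has to sit directly in front of the loop it applies to; it is a compiler-specific request, and compilers that do not recognize it will typically ignore it (possibly with a warning), so the result is the same either way.

```cpp
#include <cstdint>

// Reference version: an ordinary 10-iteration loop.
inline uint32_t MixPlain(uint32_t v)
{
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}

// Same loop, but asking the compiler to unroll it 10 times.
inline uint32_t MixUnrolled(uint32_t v)
{
#pragma GCC unroll 10
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}
```

The pragma only changes code generation, never the computed value, so the two functions are interchangeable from the caller's point of view.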

Side note 1: use e.g. ./src/bench/bench_bitcoin -filter="CHACHA20.*" -min_time=2000 to run each test for 2 seconds to get more stable results

Side note 2: No need to quote the result, it’s markdown :slightly_smiling_face:

My results on i7-8700, with clang 13.0.1:

master

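|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                1.91 |      523,770,793.06 |    0.1% |           18.52 |            6.08 |  3.043 |           0.20 |    0.0% |      1.09 | `CHACHA20_1MB`
|                1.94 |      515,227,758.97 |    0.3% |           18.79 |            6.16 |  3.048 |           0.22 |    0.0% |      1.10 | `CHACHA20_256BYTES`
|                2.02 |      494,527,885.82 |    0.2% |           19.61 |            6.44 |  3.046 |           0.28 |    0.0% |      1.10 | `CHACHA20_64BYTES`

branch

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                1.83 |      547,223,233.51 |    0.0% |           17.08 |            5.83 |  2.931 |           0.05 |    0.0% |      1.07 | `CHACHA20_1MB`
|                1.87 |      535,851,391.81 |    0.1% |           17.51 |            5.95 |  2.942 |           0.07 |    0.0% |      1.10 | `CHACHA20_256BYTES`
|                1.98 |      504,774,917.46 |    0.0% |           18.81 |            6.32 |  2.977 |           0.13 |    0.0% |      1.10 | `CHACHA20_64BYTES`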

Empact commented at 11:15 pm on April 23, 2022: member

+1 for #pragma unroll or similar

laanwj commented at 9:03 am on April 27, 2022: member

So I think the conclusion here is that on i7 there’s no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that’s OK with me.

laanwj commented at 6:18 pm on May 4, 2022: member

@sipa What are your thoughts on using #pragma unroll or a macro? Or do you prefer keeping it this way?

sipa commented at 6:20 pm on May 4, 2022: member

@laanwj That won’t work on every compiler.

I’d be ok with switching to a macro to do the 10x expansion.

MarcoFalke commented at 6:28 pm on May 4, 2022: member

TIL that it is possible to pass multiple lines as an argument to a macro

sipa commented at 6:29 pm on May 4, 2022: member

> TIL that it is possible to pass multiple lines as an argument to a macro

You clearly never saw the original serialization code this codebase had ;)

sipa force-pushed on May 4, 2022

sipa commented at 6:45 pm on May 4, 2022: member

> I’d be ok with switching to a macro to do the 10x expansion.

Done, used @ajtowns’s approach suggested above.

in src/crypto/chacha20.cpp:21 in 266bf15ddc outdated

17@@ -18,6 +18,8 @@ constexpr static inline uint32_t rotl32(uint32_t v, int c) { return (v << c) | (
18   a += b; d = rotl32(d ^ a, 8); \
19   c += d; b = rotl32(b ^ c, 7);
20 
21+#define REPEAT10(a) a a a a a a a a a a

MarcoFalke commented at 6:51 pm on May 4, 2022:

#define REPEAT10(a) do { a a a a a a a a a a } while (0)

nit: Shouldn’t this use do-while?

Otherwise writing

if (blub) REPEAT10(bla());

will do the wrong thing?

Also, leaving out the semicolon after the do-while in the definition makes the compiler enforce that one is placed after the call.
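
To illustrate the concern, here is a hedged sketch that compares the two definitions side by side. `REPEAT10_NAIVE`, `RunNaive` and `RunGuarded` are hypothetical names introduced only for this comparison, and the macro argument here includes its own semicolon so that both variants compile.

```cpp
// Plain expansion, as in the first version of the macro.
#define REPEAT10_NAIVE(a) a a a a a a a a a a
// do-while wrapper, as proposed in this review comment.
#define REPEAT10(a) do { a a a a a a a a a a } while (0)

// With the naive macro only the first expansion is governed by the `if`;
// the remaining nine statements run unconditionally.
inline int RunNaive(bool enabled)
{
    int n = 0;
    if (enabled) REPEAT10_NAIVE(++n;)
    return n;
}

// With the do-while wrapper all ten expansions form one statement,
// so the `if` guards the whole block. The trailing semicolon is required.
inline int RunGuarded(bool enabled)
{
    int n = 0;
    if (enabled) REPEAT10(++n;);
    return n;
}
```

With `enabled == false`, `RunNaive` still increments nine times while `RunGuarded` does nothing, which is the kind of surprise the do-while idiom is meant to rule out.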

sipa commented at 6:53 pm on May 4, 2022:

Done.

Unroll the ChaCha20 inner loop for performance 81c09ee45c

sipa force-pushed on May 4, 2022

martinus commented at 5:47 am on May 5, 2022: contributor

tested ACK 81c09ee with clang++ 13.0.1, test CHACHA20_1MB:

4.3% faster on i9-9960X
4.5% faster on i9-9980HK
4.4% faster on i7-8700

DrahtBot commented at 1:53 am on May 8, 2022: member

Guix builds

| File | commit 460450836304f257d3fc20e9fe32cb3a4efaa82b (master) | commit 4fd4a5fe65ca0676ab8fb5c73665c83b5971822c (master and this pull)
|:-----|:-----|:-----
| SHA256SUMS.part | `2cc4936b229484ae...` | `72b43e559d44ffb7...`
| *-aarch64-linux-gnu-debug.tar.gz | `9bddf61e8a0520d9...` | `deeb9f228bac481f...`
| *-aarch64-linux-gnu.tar.gz | `e9d04df826660c90...` | `522251f6bae6c76e...`
| *-arm-linux-gnueabihf-debug.tar.gz | `4beeedbf37980025...` | `1670184e36cf8954...`
| *-arm-linux-gnueabihf.tar.gz | `3d050332fd87e4a2...` | `d96deb05bc2a4c32...`
| *-arm64-apple-darwin-unsigned.dmg | `4da5ad0e9772e4e6...` | `76cc400c040540b0...`
| *-arm64-apple-darwin-unsigned.tar.gz | `dcfdc8f43192660a...` | `b2635ec3d4fae50f...`
| *-arm64-apple-darwin.tar.gz | `4a5d70a227973d83...` | `d1fe669b0e4947c4...`
| *-powerpc64-linux-gnu-debug.tar.gz | `53d2e35633a6a31c...` | `49c537f1d856cbc4...`
| *-powerpc64-linux-gnu.tar.gz | `c375a065b6a1548d...` | `a86b04e408ec9570...`
| *-powerpc64le-linux-gnu-debug.tar.gz | `4969102569ae2d44...` | `8e8ba290e0752fbd...`
| *-powerpc64le-linux-gnu.tar.gz | `e4127eb1ced1baa7...` | `d79b1b65762f872b...`
| *-riscv64-linux-gnu-debug.tar.gz | `2e0700a552bee2aa...` | `a1ecf6009d2ebb4e...`
| *-riscv64-linux-gnu.tar.gz | `2e1a90e7096072c7...` | `df8f4d16400de367...`
| *-win64-debug.zip | `a42b196a2551ff55...` | `4cecf16acf68c1af...`
| *-win64-setup-unsigned.exe | `ecea0e8c84dfa767...` | `cf8ed2f4015e5a1d...`
| *-win64-unsigned.tar.gz | `76ac00a1fded7fab...` | `f00c3589f6bc5cff...`
| *-win64.zip | `c78256b254a7b0da...` | `6818154f30241f4c...`
| *-x86_64-apple-darwin-unsigned.dmg | `086d56f16fe9ec2d...` | `f6dbb479fce016e5...`
| *-x86_64-apple-darwin-unsigned.tar.gz | `a32dbe948312f693...` | `ce21801a74eb1971...`
| *-x86_64-apple-darwin.tar.gz | `31010eb16dfe31bc...` | `c4ed0c95ff3d6e2d...`
| *-x86_64-linux-gnu-debug.tar.gz | `f9d542596820f03f...` | `770b969a8724b3ea...`
| *-x86_64-linux-gnu.tar.gz | `34f5361f2f5dc8ac...` | `cc157942e3794baf...`
| *.tar.gz | `1eaba963a7fb2753...` | `be1c1b02a7e1bbd8...`
| guix_build.log | `3690d59eb499477b...` | `e1f3ab033c1676fe...`
| guix_build.log.diff | | `e933d67b5ec7a7fc...`

DrahtBot removed the label DrahtBot Guix build requested on May 8, 2022

MarcoFalke commented at 11:54 am on May 9, 2022: member

A few percent faster on AMD EPYC as well with gcc-9/gcc-11.2/gcc-12.1/clang-14
Same on AMD EPYC with guix built bench
Same on Cortex-A72 with guix built bench

MarcoFalke commented at 11:56 am on May 9, 2022: member

ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟

Signature:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
-----BEGIN PGP SIGNATURE-----

iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
pUgxzQv9FMC3MiK58jmwXRv26Mf41HrwpXJawhRSU/j+VM0Vq9JI6RlIkZ3E5Biy
EKOxtL9cMKv6cMOyE5bihZF3uIqnwJCMAx+8cb+/6RYm33UseEMHxX/S8T+Q8/vy
4r5BU/kisbX77yAjooN7Lr0/nKSv2E8APFjvcp7NIkWkx89W2zrk9z4eoFS5Dri/
yAbMpc95eTtu4gmsbjNNE73/Q1MsdfXiBgzwP8ToV/grzoZPpBTt7dsb1QRRjn1N
NAY/xG1p1kFo7ORbJ0ZHiKE4waat0Erqi8MX35f5mkMVa47X5VdDuP1FGn191f9K
oS6cfgSZr4d+SE3SFer56/3QOVToa06VmxjmKoRv0j12S7NVOxnjRNjwN6XkhgoK
wlpkNa3HxNxdMNmaUDqxXk5Z1zH5RCjZwiPQuMG5sExjemAAJXOFQ8WYnJFGp04R
dFlXeMTy2ZQWMWoEMhdJ2jCDjvggjMW8t51VA3+GQvr8ZZmN10dzXPA+Qi1c25es
QNkpUvPg
=2W4Z
-----END PGP SIGNATURE-----

MarcoFalke merged this on May 9, 2022

MarcoFalke closed this on May 9, 2022

sidhujag referenced this in commit 346bcd37d7 on May 9, 2022

DrahtBot locked this on May 9, 2023

Unroll the ChaCha20 inner loop for performance #24946
