Unroll the ChaCha20 inner loop for performance #24946

pull sipa wants to merge 1 commits into bitcoin:master from sipa:202204_unrollchacha changing 1 files +28 −20
  1. sipa commented at 4:32 pm on April 22, 2022: member
    Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It’s a simple change; the performance gain helps with random number generation and will matter more for BIP324.
  2. DrahtBot added the label Utils/log/libs on Apr 22, 2022
  3. in src/crypto/chacha20.cpp:124 in 7f3f84c833 outdated
    127-            QUARTERROUND( x0, x5,x10,x15)
    128-            QUARTERROUND( x1, x6,x11,x12)
    129-            QUARTERROUND( x2, x7, x8,x13)
    130-            QUARTERROUND( x3, x4, x9,x14)
    131-        }
    132+
    


    kristapsk commented at 5:28 pm on April 22, 2022:
    Maybe add a comment that loop unrolling was done for performance reasons?

    sipa commented at 6:01 pm on April 22, 2022:
    Done.
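
For readers following the diff fragment above, here is a minimal sketch of the structure being unrolled. It is not the code in src/crypto/chacha20.cpp, which applies a QUARTERROUND macro to local variables x0..x15. The names `QuarterRound`, `DoubleRound`, `RoundsLoop`, `RoundsUnrolled` and `State` are assumptions made for illustration. The quarter round follows RFC 8439, and the self-check uses the test vector from section 2.1.1 of that RFC.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Rotate a 32-bit word left by c bits (0 < c < 32).
constexpr std::uint32_t rotl32(std::uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }

// One ChaCha quarter round on four state words.
inline void QuarterRound(std::uint32_t& a, std::uint32_t& b, std::uint32_t& c, std::uint32_t& d)
{
    a += b; d = rotl32(d ^ a, 16);
    c += d; b = rotl32(b ^ c, 12);
    a += b; d = rotl32(d ^ a, 8);
    c += d; b = rotl32(b ^ c, 7);
}

using State = std::array<std::uint32_t, 16>; // illustrative type, not from the PR

// One double round: four column rounds, then four diagonal rounds.
inline void DoubleRound(State& x)
{
    QuarterRound(x[0], x[4], x[8],  x[12]);
    QuarterRound(x[1], x[5], x[9],  x[13]);
    QuarterRound(x[2], x[6], x[10], x[14]);
    QuarterRound(x[3], x[7], x[11], x[15]);
    QuarterRound(x[0], x[5], x[10], x[15]);
    QuarterRound(x[1], x[6], x[11], x[12]);
    QuarterRound(x[2], x[7], x[8],  x[13]);
    QuarterRound(x[3], x[4], x[9],  x[14]);
}

// Rolled form: ten double rounds (20 rounds) in a loop.
inline void RoundsLoop(State& x)
{
    for (int i = 0; i < 10; ++i) DoubleRound(x);
}

// Hand-unrolled form: the same ten double rounds as straight-line code.
inline void RoundsUnrolled(State& x)
{
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
    DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x); DoubleRound(x);
}

// Self-check against the quarter-round test vector in RFC 8439, section 2.1.1.
inline void CheckQuarterRoundVector()
{
    std::uint32_t a = 0x11111111, b = 0x01020304, c = 0x9b8d6f43, d = 0x01234567;
    QuarterRound(a, b, c, d);
    assert(a == 0xea2a92f4 && b == 0xcb1cf8ce && c == 0x4581472e && d == 0x5881c4bb);
}
```

Both forms compute the same state; whether the unrolled one is faster depends on the compiler and CPU, which is what the benchmarks in this thread try to measure.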
  4. kristapsk commented at 5:30 pm on April 22, 2022: contributor
    Concept ACK, I see ./src/bench/bench_bitcoin improvements with this change.
  5. sipa force-pushed on Apr 22, 2022
  6. MarcoFalke added the label DrahtBot Guix build requested on Apr 22, 2022
  7. instagibbs commented at 5:58 pm on April 22, 2022: member
    Can you give commands to run just those benches for those willing to replicate?
  8. sipa commented at 5:59 pm on April 22, 2022: member

    @instagibbs

    ./src/bench/bench_bitcoin -filter=.*CHACHA20_[1-9].*

  9. laanwj commented at 6:40 pm on April 22, 2022: member
    I’m somewhat surprised that unrolling a loop that does the same thing 10 times gives that much of a performance win on modern CPUs. But it’s just a few ROL instructions, I guess, so the loop overhead easily dominates? Anyhow, concept ACK.
  10. instagibbs commented at 6:42 pm on April 22, 2022: member
    Getting a rough average of 15% speedup as well
  11. sipa commented at 6:53 pm on April 22, 2022: member

    @laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn’t need to stay in the same register every iteration), though I haven’t investigated what the difference in emitted asm is.

    This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems.

  12. jonatack commented at 7:29 pm on April 22, 2022: member
    Debian testing clang 15, normal (non-debug) build, fixed CPU speed, I’m not sure I’m seeing a difference. Trying again after optimizing and tuning further.
  13. laanwj commented at 8:24 pm on April 22, 2022: member

    Gcc 11.2.0, x86_64:

    • The function ChaCha20::Keystream grows in size from 992 bytes to 3840 (doesn’t seem too bad, still fits in a page).
    • One iteration of the loop looks like:
      370:	41 01 ed             	add    %ebp,%r13d
      373:	41 01 db             	add    %ebx,%r11d
      376:	41 01 f2             	add    %esi,%r10d
      379:	44 31 e9             	xor    %r13d,%ecx
      37c:	44 31 da             	xor    %r11d,%edx
      37f:	44 31 d0             	xor    %r10d,%eax
      382:	c1 c1 10             	rol    $0x10,%ecx
      385:	c1 c2 10             	rol    $0x10,%edx
      388:	41 01 c9             	add    %ecx,%r9d
      38b:	01 d7                	add    %edx,%edi
      38d:	c1 c0 10             	rol    $0x10,%eax
      390:	44 31 cd             	xor    %r9d,%ebp
      393:	31 fb                	xor    %edi,%ebx
      395:	41 01 c4             	add    %eax,%r12d
      398:	c1 c5 0c             	rol    $0xc,%ebp
      39b:	c1 c3 0c             	rol    $0xc,%ebx
      39e:	44 31 e6             	xor    %r12d,%esi
      3a1:	41 01 ed             	add    %ebp,%r13d
      3a4:	41 01 db             	add    %ebx,%r11d
      3a7:	c1 c6 0c             	rol    $0xc,%esi
      3aa:	44 31 e9             	xor    %r13d,%ecx
      3ad:	44 31 da             	xor    %r11d,%edx
      3b0:	41 01 f2             	add    %esi,%r10d
      3b3:	c1 c1 08             	rol    $0x8,%ecx
      3b6:	c1 c2 08             	rol    $0x8,%edx
      3b9:	44 31 d0             	xor    %r10d,%eax
      3bc:	41 01 c9             	add    %ecx,%r9d
      3bf:	01 d7                	add    %edx,%edi
      3c1:	44 31 cd             	xor    %r9d,%ebp
      3c4:	31 fb                	xor    %edi,%ebx
      3c6:	89 7c 24 08          	mov    %edi,0x8(%rsp)
      3ca:	c1 c5 07             	rol    $0x7,%ebp
      3cd:	c1 c3 07             	rol    $0x7,%ebx
      3d0:	44 89 4c 24 04       	mov    %r9d,0x4(%rsp)
      3d5:	c1 c0 08             	rol    $0x8,%eax
      3d8:	45 01 f8             	add    %r15d,%r8d
      3db:	41 01 dd             	add    %ebx,%r13d
      3de:	45 31 c6             	xor    %r8d,%r14d
      3e1:	41 01 c4             	add    %eax,%r12d
      3e4:	44 89 f7             	mov    %r14d,%edi
      3e7:	44 8b 74 24 0c       	mov    0xc(%rsp),%r14d
      3ec:	44 31 e6             	xor    %r12d,%esi
      3ef:	c1 c7 10             	rol    $0x10,%edi
      3f2:	c1 c6 07             	rol    $0x7,%esi
      3f5:	41 01 fe             	add    %edi,%r14d
      3f8:	41 01 f3             	add    %esi,%r11d
      3fb:	45 31 f7             	xor    %r14d,%r15d
      3fe:	45 89 f1             	mov    %r14d,%r9d
      401:	44 31 d9             	xor    %r11d,%ecx
      404:	41 c1 c7 0c          	rol    $0xc,%r15d
      408:	c1 c1 10             	rol    $0x10,%ecx
      40b:	45 01 f8             	add    %r15d,%r8d
      40e:	44 31 c7             	xor    %r8d,%edi
      411:	c1 c7 08             	rol    $0x8,%edi
      414:	41 01 f9             	add    %edi,%r9d
      417:	44 31 ef             	xor    %r13d,%edi
      41a:	c1 c7 10             	rol    $0x10,%edi
      41d:	45 31 cf             	xor    %r9d,%r15d
      420:	41 01 c9             	add    %ecx,%r9d
      423:	41 01 fc             	add    %edi,%r12d
      426:	41 c1 c7 07          	rol    $0x7,%r15d
      42a:	44 31 e3             	xor    %r12d,%ebx
      42d:	c1 c3 0c             	rol    $0xc,%ebx
      430:	41 01 dd             	add    %ebx,%r13d
      433:	44 31 ef             	xor    %r13d,%edi
      436:	41 89 fe             	mov    %edi,%r14d
      439:	41 c1 c6 08          	rol    $0x8,%r14d
      43d:	45 01 f4             	add    %r14d,%r12d
      440:	44 31 e3             	xor    %r12d,%ebx
      443:	c1 c3 07             	rol    $0x7,%ebx
      446:	44 31 ce             	xor    %r9d,%esi
      449:	45 01 fa             	add    %r15d,%r10d
      44c:	41 01 e8             	add    %ebp,%r8d
      44f:	c1 c6 0c             	rol    $0xc,%esi
      452:	44 31 d2             	xor    %r10d,%edx
      455:	44 31 c0             	xor    %r8d,%eax
      458:	c1 c2 10             	rol    $0x10,%edx
      45b:	41 01 f3             	add    %esi,%r11d
      45e:	c1 c0 10             	rol    $0x10,%eax
      461:	44 31 d9             	xor    %r11d,%ecx
      464:	c1 c1 08             	rol    $0x8,%ecx
      467:	41 8d 3c 09          	lea    (%r9,%rcx,1),%edi
      46b:	44 8b 4c 24 04       	mov    0x4(%rsp),%r9d
      470:	31 fe                	xor    %edi,%esi
      472:	89 7c 24 0c          	mov    %edi,0xc(%rsp)
      476:	8b 7c 24 08          	mov    0x8(%rsp),%edi
      47a:	41 01 d1             	add    %edx,%r9d
      47d:	c1 c6 07             	rol    $0x7,%esi
      480:	01 c7                	add    %eax,%edi
      482:	45 31 cf             	xor    %r9d,%r15d
      485:	31 fd                	xor    %edi,%ebp
      487:	41 c1 c7 0c          	rol    $0xc,%r15d
      48b:	c1 c5 0c             	rol    $0xc,%ebp
      48e:	45 01 fa             	add    %r15d,%r10d
      491:	41 01 e8             	add    %ebp,%r8d
      494:	44 31 d2             	xor    %r10d,%edx
      497:	44 31 c0             	xor    %r8d,%eax
      49a:	c1 c2 08             	rol    $0x8,%edx
      49d:	c1 c0 08             	rol    $0x8,%eax
      4a0:	41 01 d1             	add    %edx,%r9d
      4a3:	01 c7                	add    %eax,%edi
      4a5:	45 31 cf             	xor    %r9d,%r15d
      4a8:	31 fd                	xor    %edi,%ebp
      4aa:	41 c1 c7 07          	rol    $0x7,%r15d
      4ae:	c1 c5 07             	rol    $0x7,%ebp
      4b1:	83 6c 24 10 01       	subl   $0x1,0x10(%rsp)
      4b6:	0f 85 b4 fe ff ff    	jne    370 <ChaCha20::Keystream(unsigned char*, unsigned long)+0x140>
    
    • The unrolling indeed causes different register allocation, as well as instructions from multiple iterations to be interspersed (maybe better for scheduling, maybe it’s possible to combine?).
    • Benchmarks before on old AMD Phenom(tm) II X6 1075T:
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.18 |      459,187,395.75 |    0.3% |      0.03 | `CHACHA20_1MB`
    |                2.21 |      452,155,530.63 |    0.2% |      0.01 | `CHACHA20_256BYTES`
    |                2.34 |      427,257,435.31 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    
    • Benchmarks after on same (~12% speedup):
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.91 |      523,324,820.67 |    0.4% |      0.02 | `CHACHA20_1MB`
    |                1.94 |      516,638,576.63 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.22 |      451,258,216.13 |    4.6% |      0.01 | `CHACHA20_64BYTES`
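
A "speedup" percentage can be read two ways from an ns/byte column, and the two readings differ noticeably at this size. The sketch below shows both; the function names are assumptions made for illustration. With the CHACHA20_1MB rows above (2.18 and 1.91 ns/byte), the time saved comes out near 12.4% and the throughput gain near 14.1%, so the ~12% quoted here matches the time-saved reading.

```cpp
#include <cassert>
#include <cmath>

// Percentage of running time saved: (before - after) / before.
inline double TimeSavedPercent(double ns_before, double ns_after)
{
    return 100.0 * (ns_before - ns_after) / ns_before;
}

// Percentage increase in throughput: before / after - 1,
// because bytes per second is the inverse of ns/byte.
inline double ThroughputGainPercent(double ns_before, double ns_after)
{
    return 100.0 * (ns_before / ns_after - 1.0);
}

// Self-check using the CHACHA20_1MB figures quoted in this comment.
inline void CheckPhenomNumbers()
{
    assert(std::fabs(TimeSavedPercent(2.18, 1.91) - 12.39) < 0.05);
    assert(std::fabs(ThroughputGainPercent(2.18, 1.91) - 14.14) < 0.05);
}
```

When comparing numbers reported by different people in a thread like this one, it helps to check which of the two definitions each figure uses.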
    
  14. jonatack commented at 9:30 pm on April 22, 2022: member

    Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with pyperf system tune, non-debug build, seeing roughly a 3 to 4% improvement.

    Linux 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux.

    Debian clang version 15.0.0-++20220422111431+ba46ae7bd853-1~exp1~20220422111525.449
    Target: x86_64-pc-linux-gnu
    Thread model: posix
    InstalledDir: /usr/bin
    
    master

    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                2.43 |      410,814,309.33 |    0.3% |           18.61 |            6.29 |  2.957 |           0.20 |    0.0% |      0.03 | `CHACHA20_1MB`
    |                2.46 |      406,907,108.96 |    0.0% |           18.89 |            6.37 |  2.965 |           0.22 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.59 |      385,499,110.76 |    1.0% |           19.72 |            6.68 |  2.952 |           0.28 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    
    branch

    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                2.35 |      425,969,024.53 |    0.7% |           16.70 |            6.07 |  2.752 |           0.05 |    0.0% |      0.03 | `CHACHA20_1MB`
    |                2.37 |      422,279,272.14 |    0.0% |           17.14 |            6.14 |  2.792 |           0.07 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.52 |      396,803,365.77 |    0.1% |           18.45 |            6.53 |  2.825 |           0.13 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    

    Edit: re-ran the bench a dozen times each to verify that these results are representative.

  15. ajtowns commented at 8:50 am on April 23, 2022: member

    I’m seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled)

    Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?
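
The REPEAT10 idea above can be sketched as follows. The mixing step inside `MixLoop` and `MixRepeat` is a hypothetical stand-in for the real round body, chosen only so that the two functions can be compared; it is not ChaCha20. The point is that the macro pastes its argument ten times, so the compiler sees straight-line code with no loop counter.

```cpp
#include <cassert>
#include <cstdint>

// Textually repeats its argument ten times.
// Note: a top-level comma inside the argument would split it into several
// macro arguments; commas inside parentheses (as in a call) are fine.
#define REPEAT10(a) a a a a a a a a a a

// Rolled form: ten iterations of a stand-in mixing step.
inline std::uint32_t MixLoop(std::uint32_t v)
{
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}

// Same ten steps, expanded by the preprocessor instead of a loop.
inline std::uint32_t MixRepeat(std::uint32_t v)
{
    REPEAT10(
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    )
    return v;
}

// Self-check: both forms must agree for a sample input.
inline void CheckRepeatMatchesLoop()
{
    assert(MixLoop(1u) == MixRepeat(1u));
}
```

Compared with copy/pasting the body ten times, the macro keeps a single copy of the code in the source, which makes it obvious that all ten repetitions are identical.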

  16. laanwj commented at 9:50 am on April 23, 2022: member
    • gcc 11.2.0, RISC-V 64-bit (SiFive Unmatched, 1.2 GHz): the speedup is there, but much less pronounced (~5%):
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
    Before:
    |               22.29 |       44,862,631.89 |    0.8% |            0.00 |            0.00 |           0.00 |    0.0% |      0.26 | `CHACHA20_1MB`
    After:
    |               21.23 |       47,101,646.21 |    0.9% |            0.00 |            0.00 |           0.00 |    0.0% |      0.25 | `CHACHA20_1MB`
    
    • gcc 10.2.1, aarch64 (custom i.MX8MQ board, 1 GHz), ~8% speedup:
    |             ns/byte |              byte/s |    err% |        ins/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
    Before:
    |                6.04 |      165,526,246.91 |    0.1% |           16.84 |           0.16 |   11.8% |      0.07 | `CHACHA20_1MB`
    After:
    |                5.58 |      179,185,196.22 |    0.1% |           15.86 |           0.02 |    0.0% |      0.06 | `CHACHA20_1MB`
    

    It’s a nice speedup, and a simple change, tested ACK 4f3a18906880b065b6119ccf32b2875748b297b2

    > Did you consider just changing the for() { … } loop to REPEAT10( … ) with #define REPEAT10(a) a a a a a a a a a a ?

    I like this idea: it is more elegant than copy/pasting, and it makes it immediately clear that every repetition is the same. I would guess the generated code is exactly the same.

  17. MarcoFalke commented at 5:10 pm on April 23, 2022: member

    Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?)

    gcc-12 Before:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.23 |      447,617,214.06 |    0.2% |      0.03 | `CHACHA20_1MB`
    |                2.26 |      441,653,947.12 |    0.1% |      0.01 | `CHACHA20_256BYTES`
    |                2.50 |      399,993,391.82 |    6.1% |      0.01 | :wavy_dash: `CHACHA20_64BYTES` (Unstable with ~6,241.4 iters. Increase `minEpochIterations` to e.g. 62414)
    |                7.03 |      142,173,319.29 |   10.1% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.26 |      307,218,931.17 |    1.7% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                8.83 |      113,259,198.67 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
    |                4.28 |      233,685,573.34 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               15.78 |       63,391,055.77 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                7.71 |      129,684,901.52 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
    

    gcc-12 After:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.20 |      454,707,913.08 |    0.8% |      0.03 | `CHACHA20_1MB`
    |                2.36 |      424,359,263.25 |    4.9% |      0.01 | `CHACHA20_256BYTES`
    |                2.41 |      414,622,602.59 |    0.4% |      0.01 | `CHACHA20_64BYTES`
    |                6.99 |      143,089,808.99 |    7.2% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.26 |      306,926,493.73 |    4.2% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                9.59 |      104,251,645.58 |    8.6% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~402.1 iters. Increase `minEpochIterations` to e.g. 4021)
    |                4.33 |      230,986,007.33 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.23 |       61,602,235.65 |    1.7% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                9.63 |      103,830,365.13 |    9.9% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,639.9 iters. Increase `minEpochIterations` to e.g. 16399)
    

    gcc-10 Before:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.26 |      442,527,877.02 |    0.5% |      0.03 | `CHACHA20_1MB`
    |                2.30 |      435,535,172.72 |    1.9% |      0.01 | `CHACHA20_256BYTES`
    |                2.39 |      418,262,709.74 |    0.4% |      0.01 | `CHACHA20_64BYTES`
    |                6.93 |      144,210,951.65 |    5.9% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.16 |      316,109,217.24 |    4.8% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                8.43 |      118,625,079.49 |    0.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
    |                4.18 |      239,143,934.28 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.05 |       62,308,156.96 |    5.2% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT` (Unstable with ~961.0 iters. Increase `minEpochIterations` to e.g. 9610)
    |                7.63 |      131,070,821.81 |    0.1% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
    

    gcc-10 after:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.20 |      454,351,689.08 |    0.2% |      0.03 | `CHACHA20_1MB`
    |                2.40 |      416,825,911.73 |    4.4% |      0.01 | `CHACHA20_256BYTES`
    |                2.40 |      416,369,054.39 |    0.2% |      0.01 | `CHACHA20_64BYTES`
    |                6.58 |      151,882,394.04 |   10.5% |      0.08 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.03 |      329,600,644.76 |    0.9% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                9.40 |      106,431,172.41 |   10.1% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~434.9 iters. Increase `minEpochIterations` to e.g. 4349)
    |                4.30 |      232,776,146.25 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.17 |       61,831,918.45 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                8.83 |      113,301,205.50 |    5.8% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,771.5 iters. Increase `minEpochIterations` to e.g. 17715)
    
  18. martinus commented at 5:45 pm on April 23, 2022: contributor

    I get the same 1-3% speedup on my i7. In my test, adding #pragma GCC unroll 10 in front of the loop seems to produce exactly the same unrolled loop as the hand-coded version; this works for both GCC and clang.

    Side note 1: use e.g. ./src/bench/bench_bitcoin -filter="CHACHA20.*" -min_time=2000 to run each test for 2 seconds to get more stable results

    Side note 2: No need to quote the result, it’s markdown :slightly_smiling_face:

    My results on i7-8700, with clang 13.0.1:

    master

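    | ns/byte |         byte/s | err% | ins/byte | cyc/byte |   IPC | bra/byte | miss% | total | benchmark
    |--------:|---------------:|-----:|---------:|---------:|------:|---------:|------:|------:|:----------
    |    1.91 | 523,770,793.06 | 0.1% |    18.52 |     6.08 | 3.043 |     0.20 |  0.0% |  1.09 | `CHACHA20_1MB`
    |    1.94 | 515,227,758.97 | 0.3% |    18.79 |     6.16 | 3.048 |     0.22 |  0.0% |  1.10 | `CHACHA20_256BYTES`
    |    2.02 | 494,527,885.82 | 0.2% |    19.61 |     6.44 | 3.046 |     0.28 |  0.0% |  1.10 | `CHACHA20_64BYTES`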

    branch

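    | ns/byte |         byte/s | err% | ins/byte | cyc/byte |   IPC | bra/byte | miss% | total | benchmark
    |--------:|---------------:|-----:|---------:|---------:|------:|---------:|------:|------:|:----------
    |    1.83 | 547,223,233.51 | 0.0% |    17.08 |     5.83 | 2.931 |     0.05 |  0.0% |  1.07 | `CHACHA20_1MB`
    |    1.87 | 535,851,391.81 | 0.1% |    17.51 |     5.95 | 2.942 |     0.07 |  0.0% |  1.10 | `CHACHA20_256BYTES`
    |    1.98 | 504,774,917.46 | 0.0% |    18.81 |     6.32 | 2.977 |     0.13 |  0.0% |  1.10 | `CHACHA20_64BYTES`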
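
The pragma-based alternative mentioned in this comment could look roughly like the sketch below. The loop body is a hypothetical stand-in, not the ChaCha20 round code, and the preprocessor guard is an assumption added here so that compilers which do not recognise the pragma are not handed it. The pragma is only a request to the compiler; as noted later in the thread, it is not available on every compiler.

```cpp
#include <cassert>
#include <cstdint>

// Plain loop with no unrolling hint.
inline std::uint32_t MixPlain(std::uint32_t v)
{
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}

// Same loop, with an unroll request placed directly in front of it.
inline std::uint32_t MixUnrollHint(std::uint32_t v)
{
#if defined(__GNUC__) || defined(__clang__)
#pragma GCC unroll 10
#endif
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}

// Self-check: the hint must never change the result, only the generated code.
inline void CheckHintKeepsResult()
{
    assert(MixPlain(42u) == MixUnrollHint(42u));
}
```

Whether the hint actually produces the same machine code as manual unrolling has to be confirmed per compiler by inspecting the disassembly.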
  19. Empact commented at 11:15 pm on April 23, 2022: member
    +1 for #pragma unroll or similar
  20. laanwj commented at 9:03 am on April 27, 2022: member
    So I think the conclusion here is that on i7 there’s no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that’s OK with me.
  21. laanwj commented at 6:18 pm on May 4, 2022: member
    @sipa What are your thoughts on using #pragma unroll or a macro? Or do you prefer keeping it this way?
  22. sipa commented at 6:20 pm on May 4, 2022: member

    @laanwj That won’t work on every compiler.

    I’d be ok with switching to a macro to do the 10x expansion.

  23. MarcoFalke commented at 6:28 pm on May 4, 2022: member
    TIL that it is possible to pass multiple lines as an argument to a macro
  24. sipa commented at 6:29 pm on May 4, 2022: member

    > TIL that it is possible to pass multiple lines as an argument to a macro

    You clearly never saw the original serialization code this codebase had ;)

  25. sipa force-pushed on May 4, 2022
  26. sipa commented at 6:45 pm on May 4, 2022: member

    > I’d be ok with switching to a macro to do the 10x expansion.

    Done, used @ajtowns’s approach suggested above.

  27. in src/crypto/chacha20.cpp:21 in 266bf15ddc outdated
    17@@ -18,6 +18,8 @@ constexpr static inline uint32_t rotl32(uint32_t v, int c) { return (v << c) | (
    18   a += b; d = rotl32(d ^ a, 8); \
    19   c += d; b = rotl32(b ^ c, 7);
    20 
    21+#define REPEAT10(a) a a a a a a a a a a
    


    MarcoFalke commented at 6:51 pm on May 4, 2022:
    #define REPEAT10(a) do { a a a a a a a a a a } while (0)
    

    nit: Shouldn’t this use do-while?

    Otherwise writing

    if (blub) REPEAT10(bla());
    

    will do the wrong thing?

    Also, leaving out the semicolon after the do-while in the definition makes the compiler enforce that one is placed after the call.


    sipa commented at 6:53 pm on May 4, 2022:
    Done.
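
The do-while point in this review thread can be illustrated with a small sketch. `REPEAT10_NAIVE` is a hypothetical name for the unwrapped form, and the counter functions exist only for the demonstration. With the unwrapped macro, an unbraced `if` guards just the first of the ten pasted statements; the do-while wrapper turns all ten into one statement.

```cpp
#include <cassert>

// Unwrapped form: expands to ten separate statements.
#define REPEAT10_NAIVE(a) a a a a a a a a a a

// Wrapped form: expands to a single statement. No trailing semicolon in the
// definition, so the caller has to write one after the macro call.
#define REPEAT10(a) do { a a a a a a a a a a } while (0)

// Only the first ++n is guarded by the if; the other nine always run.
inline int CountNaive(bool enabled)
{
    int n = 0;
    if (enabled) REPEAT10_NAIVE(++n;)
    return n;
}

// All ten increments are guarded by the if.
inline int CountWrapped(bool enabled)
{
    int n = 0;
    if (enabled) REPEAT10(++n;);
    return n;
}

// Self-check of the behaviour described above.
inline void CheckMacroHygiene()
{
    assert(CountNaive(false) == 9);   // nine increments escape the if
    assert(CountWrapped(false) == 0); // nothing runs when the guard is false
}
```

In the PR itself the macro is used as a standalone statement, so both forms would behave the same there; the wrapper guards against misuse elsewhere.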
  28. Unroll the ChaCha20 inner loop for performance 81c09ee45c
  29. sipa force-pushed on May 4, 2022
  30. martinus commented at 5:47 am on May 5, 2022: contributor

    tested ACK 81c09ee with clang++ 13.0.1, test CHACHA20_1MB:

    • 4.3% faster on i9-9960X
    • 4.5% faster on i9-9980HK
    • 4.4% faster on i7-8700
  31. DrahtBot commented at 1:53 am on May 8, 2022: member

    Guix builds

    | File | commit 460450836304f257d3fc20e9fe32cb3a4efaa82b (master) | commit 4fd4a5fe65ca0676ab8fb5c73665c83b5971822c (master and this pull) |
    |:-----|:-----|:-----|
    | SHA256SUMS.part | 2cc4936b229484ae... | 72b43e559d44ffb7... |
    | *-aarch64-linux-gnu-debug.tar.gz | 9bddf61e8a0520d9... | deeb9f228bac481f... |
    | *-aarch64-linux-gnu.tar.gz | e9d04df826660c90... | 522251f6bae6c76e... |
    | *-arm-linux-gnueabihf-debug.tar.gz | 4beeedbf37980025... | 1670184e36cf8954... |
    | *-arm-linux-gnueabihf.tar.gz | 3d050332fd87e4a2... | d96deb05bc2a4c32... |
    | *-arm64-apple-darwin-unsigned.dmg | 4da5ad0e9772e4e6... | 76cc400c040540b0... |
    | *-arm64-apple-darwin-unsigned.tar.gz | dcfdc8f43192660a... | b2635ec3d4fae50f... |
    | *-arm64-apple-darwin.tar.gz | 4a5d70a227973d83... | d1fe669b0e4947c4... |
    | *-powerpc64-linux-gnu-debug.tar.gz | 53d2e35633a6a31c... | 49c537f1d856cbc4... |
    | *-powerpc64-linux-gnu.tar.gz | c375a065b6a1548d... | a86b04e408ec9570... |
    | *-powerpc64le-linux-gnu-debug.tar.gz | 4969102569ae2d44... | 8e8ba290e0752fbd... |
    | *-powerpc64le-linux-gnu.tar.gz | e4127eb1ced1baa7... | d79b1b65762f872b... |
    | *-riscv64-linux-gnu-debug.tar.gz | 2e0700a552bee2aa... | a1ecf6009d2ebb4e... |
    | *-riscv64-linux-gnu.tar.gz | 2e1a90e7096072c7... | df8f4d16400de367... |
    | *-win64-debug.zip | a42b196a2551ff55... | 4cecf16acf68c1af... |
    | *-win64-setup-unsigned.exe | ecea0e8c84dfa767... | cf8ed2f4015e5a1d... |
    | *-win64-unsigned.tar.gz | 76ac00a1fded7fab... | f00c3589f6bc5cff... |
    | *-win64.zip | c78256b254a7b0da... | 6818154f30241f4c... |
    | *-x86_64-apple-darwin-unsigned.dmg | 086d56f16fe9ec2d... | f6dbb479fce016e5... |
    | *-x86_64-apple-darwin-unsigned.tar.gz | a32dbe948312f693... | ce21801a74eb1971... |
    | *-x86_64-apple-darwin.tar.gz | 31010eb16dfe31bc... | c4ed0c95ff3d6e2d... |
    | *-x86_64-linux-gnu-debug.tar.gz | f9d542596820f03f... | 770b969a8724b3ea... |
    | *-x86_64-linux-gnu.tar.gz | 34f5361f2f5dc8ac... | cc157942e3794baf... |
    | *.tar.gz | 1eaba963a7fb2753... | be1c1b02a7e1bbd8... |
    | guix_build.log | 3690d59eb499477b... | e1f3ab033c1676fe... |
    | guix_build.log.diff | | e933d67b5ec7a7fc... |
  32. DrahtBot removed the label DrahtBot Guix build requested on May 8, 2022
  33. MarcoFalke commented at 11:54 am on May 9, 2022: member
    • A few percent faster on AMD EPYC as well with gcc-9/gcc-11.2/gcc-12.1/clang-14
    • Same on AMD EPYC with guix built bench
    • Same on Cortex-A72 with guix built bench
  34. MarcoFalke commented at 11:56 am on May 9, 2022: member

    ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟

    Signature:

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA512

    ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
    -----BEGIN PGP SIGNATURE-----

    iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
    pUgxzQv9FMC3MiK58jmwXRv26Mf41HrwpXJawhRSU/j+VM0Vq9JI6RlIkZ3E5Biy
    EKOxtL9cMKv6cMOyE5bihZF3uIqnwJCMAx+8cb+/6RYm33UseEMHxX/S8T+Q8/vy
    4r5BU/kisbX77yAjooN7Lr0/nKSv2E8APFjvcp7NIkWkx89W2zrk9z4eoFS5Dri/
    yAbMpc95eTtu4gmsbjNNE73/Q1MsdfXiBgzwP8ToV/grzoZPpBTt7dsb1QRRjn1N
    NAY/xG1p1kFo7ORbJ0ZHiKE4waat0Erqi8MX35f5mkMVa47X5VdDuP1FGn191f9K
    oS6cfgSZr4d+SE3SFer56/3QOVToa06VmxjmKoRv0j12S7NVOxnjRNjwN6XkhgoK
    wlpkNa3HxNxdMNmaUDqxXk5Z1zH5RCjZwiPQuMG5sExjemAAJXOFQ8WYnJFGp04R
    dFlXeMTy2ZQWMWoEMhdJ2jCDjvggjMW8t51VA3+GQvr8ZZmN10dzXPA+Qi1c25es
    QNkpUvPg
    =2W4Z
    -----END PGP SIGNATURE-----
    
  35. MarcoFalke merged this on May 9, 2022
  36. MarcoFalke closed this on May 9, 2022

  37. sidhujag referenced this in commit 346bcd37d7 on May 9, 2022
  38. DrahtBot locked this on May 9, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-06-29 13:13 UTC
