Unroll the ChaCha20 inner loop for performance #24946

pull sipa wants to merge 1 commit into bitcoin:master from sipa:202204_unrollchacha changing 1 file +28 −20
  1. sipa commented at 4:32 PM on April 22, 2022: member

    Unrolling the inner ChaCha20 loop gives a ~15% speedup for me in the CHACHA20_* benchmarks. It's a simple change; the extra performance helps with random number generation and will matter more for BIP324.
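
For readers unfamiliar with the loop being discussed, here is a minimal sketch (not the code from this PR) of the ChaCha20 round structure: 10 iterations, each doing four column and four diagonal quarter rounds. The names `quarter_round`, `double_round`, `State`, `rounds_looped` and `rounds_unrolled` are made up for illustration; only the quarter-round arithmetic and the diagonal index pattern are taken from the diff shown in this thread.

```cpp
#include <array>
#include <cstdint>

// Rotate a 32-bit word left by c bits (0 < c < 32).
constexpr uint32_t rotl32(uint32_t v, int c) { return (v << c) | (v >> (32 - c)); }

// One ChaCha quarter round on four words of the state.
inline void quarter_round(uint32_t& a, uint32_t& b, uint32_t& c, uint32_t& d)
{
    a += b; d = rotl32(d ^ a, 16);
    c += d; b = rotl32(b ^ c, 12);
    a += b; d = rotl32(d ^ a, 8);
    c += d; b = rotl32(b ^ c, 7);
}

using State = std::array<uint32_t, 16>;  // illustrative alias for the 16-word state

// One double round: four column rounds followed by four diagonal rounds.
inline void double_round(State& x)
{
    quarter_round(x[0], x[4], x[8],  x[12]);
    quarter_round(x[1], x[5], x[9],  x[13]);
    quarter_round(x[2], x[6], x[10], x[14]);
    quarter_round(x[3], x[7], x[11], x[15]);
    quarter_round(x[0], x[5], x[10], x[15]);
    quarter_round(x[1], x[6], x[11], x[12]);
    quarter_round(x[2], x[7], x[8],  x[13]);
    quarter_round(x[3], x[4], x[9],  x[14]);
}

// Rolled form: a counted loop of 10 double rounds (20 rounds in total).
inline State rounds_looped(State x)
{
    for (int i = 0; i < 10; ++i) double_round(x);
    return x;
}

// Hand-unrolled form: the same 10 double rounds written out one after another.
inline State rounds_unrolled(State x)
{
    double_round(x); double_round(x); double_round(x); double_round(x); double_round(x);
    double_round(x); double_round(x); double_round(x); double_round(x); double_round(x);
    return x;
}
```

Both forms compute the same result; any difference is in the machine code the compiler emits, and an optimizing compiler may well unroll or inline this sketch on its own.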

  2. DrahtBot added the label Utils/log/libs on Apr 22, 2022
  3. in src/crypto/chacha20.cpp:124 in 7f3f84c833 outdated
     127 | -            QUARTERROUND( x0, x5,x10,x15)
     128 | -            QUARTERROUND( x1, x6,x11,x12)
     129 | -            QUARTERROUND( x2, x7, x8,x13)
     130 | -            QUARTERROUND( x3, x4, x9,x14)
     131 | -        }
     132 | +
    


    kristapsk commented at 5:28 PM on April 22, 2022:

    Maybe add a comment that loop unrolling was done for performance reasons?


    sipa commented at 6:01 PM on April 22, 2022:

    Done.

  4. kristapsk commented at 5:30 PM on April 22, 2022: contributor

    Concept ACK, I see ./src/bench/bench_bitcoin improvements with this change.

  5. sipa force-pushed on Apr 22, 2022
  6. MarcoFalke added the label DrahtBot Guix build requested on Apr 22, 2022
  7. instagibbs commented at 5:58 PM on April 22, 2022: member

    Can you give commands to run just those benches for those willing to replicate?

  8. sipa commented at 5:59 PM on April 22, 2022: member

    @instagibbs

    ./src/bench/bench_bitcoin -filter=.*CHACHA20_[1-9].*

  9. laanwj commented at 6:40 PM on April 22, 2022: member

    I'm somewhat surprised unrolling a loop that is 10 times the same thing gives that much performance win on modern CPUs. But it's just a few ROL instructions I guess so the loop overhead easily dominates? Anyhow, concept ACK.

  10. instagibbs commented at 6:42 PM on April 22, 2022: member

    Getting a rough average of 15% speedup as well

  11. sipa commented at 6:53 PM on April 22, 2022: member

    @laanwj It may also have to do with better register scheduling when unrolling (the same variable doesn't need to stay in the same register every iteration), though I haven't investigated what the difference in emitted asm is.

    This change may be very compiler and platform dependent, so it may be good to know what its impact is with modern clang versions and/or on arm64 systems.

  12. jonatack commented at 7:29 PM on April 22, 2022: member

    With Debian testing clang 15, a normal (non-debug) build, and a fixed CPU speed, I'm not sure I'm seeing a difference. Trying again after optimizing and tuning further.

  13. laanwj commented at 8:24 PM on April 22, 2022: member

    Gcc 11.2.0, x86_64:

    • The function ChaCha20::Keystream grows in size from 992 bytes to 3840 (doesn't seem too bad, still fits in a page).
    • One iteration of the loop looks like:
     370:	41 01 ed             	add    %ebp,%r13d
     373:	41 01 db             	add    %ebx,%r11d
     376:	41 01 f2             	add    %esi,%r10d
     379:	44 31 e9             	xor    %r13d,%ecx
     37c:	44 31 da             	xor    %r11d,%edx
     37f:	44 31 d0             	xor    %r10d,%eax
     382:	c1 c1 10             	rol    $0x10,%ecx
     385:	c1 c2 10             	rol    $0x10,%edx
     388:	41 01 c9             	add    %ecx,%r9d
     38b:	01 d7                	add    %edx,%edi
     38d:	c1 c0 10             	rol    $0x10,%eax
     390:	44 31 cd             	xor    %r9d,%ebp
     393:	31 fb                	xor    %edi,%ebx
     395:	41 01 c4             	add    %eax,%r12d
     398:	c1 c5 0c             	rol    $0xc,%ebp
     39b:	c1 c3 0c             	rol    $0xc,%ebx
     39e:	44 31 e6             	xor    %r12d,%esi
     3a1:	41 01 ed             	add    %ebp,%r13d
     3a4:	41 01 db             	add    %ebx,%r11d
     3a7:	c1 c6 0c             	rol    $0xc,%esi
     3aa:	44 31 e9             	xor    %r13d,%ecx
     3ad:	44 31 da             	xor    %r11d,%edx
     3b0:	41 01 f2             	add    %esi,%r10d
     3b3:	c1 c1 08             	rol    $0x8,%ecx
     3b6:	c1 c2 08             	rol    $0x8,%edx
     3b9:	44 31 d0             	xor    %r10d,%eax
     3bc:	41 01 c9             	add    %ecx,%r9d
     3bf:	01 d7                	add    %edx,%edi
     3c1:	44 31 cd             	xor    %r9d,%ebp
     3c4:	31 fb                	xor    %edi,%ebx
     3c6:	89 7c 24 08          	mov    %edi,0x8(%rsp)
     3ca:	c1 c5 07             	rol    $0x7,%ebp
     3cd:	c1 c3 07             	rol    $0x7,%ebx
     3d0:	44 89 4c 24 04       	mov    %r9d,0x4(%rsp)
     3d5:	c1 c0 08             	rol    $0x8,%eax
     3d8:	45 01 f8             	add    %r15d,%r8d
     3db:	41 01 dd             	add    %ebx,%r13d
     3de:	45 31 c6             	xor    %r8d,%r14d
     3e1:	41 01 c4             	add    %eax,%r12d
     3e4:	44 89 f7             	mov    %r14d,%edi
     3e7:	44 8b 74 24 0c       	mov    0xc(%rsp),%r14d
     3ec:	44 31 e6             	xor    %r12d,%esi
     3ef:	c1 c7 10             	rol    $0x10,%edi
     3f2:	c1 c6 07             	rol    $0x7,%esi
     3f5:	41 01 fe             	add    %edi,%r14d
     3f8:	41 01 f3             	add    %esi,%r11d
     3fb:	45 31 f7             	xor    %r14d,%r15d
     3fe:	45 89 f1             	mov    %r14d,%r9d
     401:	44 31 d9             	xor    %r11d,%ecx
     404:	41 c1 c7 0c          	rol    $0xc,%r15d
     408:	c1 c1 10             	rol    $0x10,%ecx
     40b:	45 01 f8             	add    %r15d,%r8d
     40e:	44 31 c7             	xor    %r8d,%edi
     411:	c1 c7 08             	rol    $0x8,%edi
     414:	41 01 f9             	add    %edi,%r9d
     417:	44 31 ef             	xor    %r13d,%edi
     41a:	c1 c7 10             	rol    $0x10,%edi
     41d:	45 31 cf             	xor    %r9d,%r15d
     420:	41 01 c9             	add    %ecx,%r9d
     423:	41 01 fc             	add    %edi,%r12d
     426:	41 c1 c7 07          	rol    $0x7,%r15d
     42a:	44 31 e3             	xor    %r12d,%ebx
     42d:	c1 c3 0c             	rol    $0xc,%ebx
     430:	41 01 dd             	add    %ebx,%r13d
     433:	44 31 ef             	xor    %r13d,%edi
     436:	41 89 fe             	mov    %edi,%r14d
     439:	41 c1 c6 08          	rol    $0x8,%r14d
     43d:	45 01 f4             	add    %r14d,%r12d
     440:	44 31 e3             	xor    %r12d,%ebx
     443:	c1 c3 07             	rol    $0x7,%ebx
     446:	44 31 ce             	xor    %r9d,%esi
     449:	45 01 fa             	add    %r15d,%r10d
     44c:	41 01 e8             	add    %ebp,%r8d
     44f:	c1 c6 0c             	rol    $0xc,%esi
     452:	44 31 d2             	xor    %r10d,%edx
     455:	44 31 c0             	xor    %r8d,%eax
     458:	c1 c2 10             	rol    $0x10,%edx
     45b:	41 01 f3             	add    %esi,%r11d
     45e:	c1 c0 10             	rol    $0x10,%eax
     461:	44 31 d9             	xor    %r11d,%ecx
     464:	c1 c1 08             	rol    $0x8,%ecx
     467:	41 8d 3c 09          	lea    (%r9,%rcx,1),%edi
     46b:	44 8b 4c 24 04       	mov    0x4(%rsp),%r9d
     470:	31 fe                	xor    %edi,%esi
     472:	89 7c 24 0c          	mov    %edi,0xc(%rsp)
     476:	8b 7c 24 08          	mov    0x8(%rsp),%edi
     47a:	41 01 d1             	add    %edx,%r9d
     47d:	c1 c6 07             	rol    $0x7,%esi
     480:	01 c7                	add    %eax,%edi
     482:	45 31 cf             	xor    %r9d,%r15d
     485:	31 fd                	xor    %edi,%ebp
     487:	41 c1 c7 0c          	rol    $0xc,%r15d
     48b:	c1 c5 0c             	rol    $0xc,%ebp
     48e:	45 01 fa             	add    %r15d,%r10d
     491:	41 01 e8             	add    %ebp,%r8d
     494:	44 31 d2             	xor    %r10d,%edx
     497:	44 31 c0             	xor    %r8d,%eax
     49a:	c1 c2 08             	rol    $0x8,%edx
     49d:	c1 c0 08             	rol    $0x8,%eax
     4a0:	41 01 d1             	add    %edx,%r9d
     4a3:	01 c7                	add    %eax,%edi
     4a5:	45 31 cf             	xor    %r9d,%r15d
     4a8:	31 fd                	xor    %edi,%ebp
     4aa:	41 c1 c7 07          	rol    $0x7,%r15d
     4ae:	c1 c5 07             	rol    $0x7,%ebp
     4b1:	83 6c 24 10 01       	subl   $0x1,0x10(%rsp)
     4b6:	0f 85 b4 fe ff ff    	jne    370 <ChaCha20::Keystream(unsigned char*, unsigned long)+0x140>
    
    • The unrolling indeed causes different register allocation, as well as instructions from multiple iterations to be interspersed (maybe better for scheduling, maybe it's possible to combine?).
    • Benchmarks before on old AMD Phenom(tm) II X6 1075T:
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.18 |      459,187,395.75 |    0.3% |      0.03 | `CHACHA20_1MB`
    |                2.21 |      452,155,530.63 |    0.2% |      0.01 | `CHACHA20_256BYTES`
    |                2.34 |      427,257,435.31 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    
    • Benchmarks after on same (~12% speedup):
    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                1.91 |      523,324,820.67 |    0.4% |      0.02 | `CHACHA20_1MB`
    |                1.94 |      516,638,576.63 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.22 |      451,258,216.13 |    4.6% |      0.01 | `CHACHA20_64BYTES`
    
  14. jonatack commented at 9:30 PM on April 22, 2022: member

    Restarted and tuned (i7 6500U CPU @ 2.5 GHz) with pyperf system tune, non-debug build, seeing roughly a 3 to 4% improvement.

    Linux 5.16.0-6-amd64 #1 SMP PREEMPT Debian 5.16.18-1 (2022-03-29) x86_64 GNU/Linux.
    
    Debian clang version 15.0.0-++20220422111431+ba46ae7bd853-1~exp1~20220422111525.449
    Target: x86_64-pc-linux-gnu                                    
    Thread model: posix                                            
    InstalledDir: /usr/bin      
    
    master
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                2.43 |      410,814,309.33 |    0.3% |           18.61 |            6.29 |  2.957 |           0.20 |    0.0% |      0.03 | `CHACHA20_1MB`
    |                2.46 |      406,907,108.96 |    0.0% |           18.89 |            6.37 |  2.965 |           0.22 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.59 |      385,499,110.76 |    1.0% |           19.72 |            6.68 |  2.952 |           0.28 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    
    branch
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                2.35 |      425,969,024.53 |    0.7% |           16.70 |            6.07 |  2.752 |           0.05 |    0.0% |      0.03 | `CHACHA20_1MB`
    |                2.37 |      422,279,272.14 |    0.0% |           17.14 |            6.14 |  2.792 |           0.07 |    0.0% |      0.01 | `CHACHA20_256BYTES`
    |                2.52 |      396,803,365.77 |    0.1% |           18.45 |            6.53 |  2.825 |           0.13 |    0.0% |      0.01 | `CHACHA20_64BYTES`
    

    Edit: re-ran the bench a dozen times each to verify that these results are representative.
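
The percentages quoted in the benchmark comments can be derived from the `ns/byte` column in two slightly different ways, which may explain part of the spread between reported figures. The sketch below is only an illustration; the function names are hypothetical, and the thread does not say which definition each commenter used.

```cpp
// Percentage reduction in time per byte: (before - after) / before.
inline double time_reduction_pct(double before_ns, double after_ns)
{
    return (before_ns - after_ns) / before_ns * 100.0;
}

// Percentage gain in throughput (bytes per second): before / after - 1.
inline double throughput_gain_pct(double before_ns, double after_ns)
{
    return (before_ns / after_ns - 1.0) * 100.0;
}
```

For the `CHACHA20_1MB` rows above, 2.18 → 1.91 ns/byte is about a 12.4% time reduction or a 14.1% throughput gain, and 2.43 → 2.35 ns/byte is about 3.3% or 3.4% respectively.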

  15. ajtowns commented at 8:50 AM on April 23, 2022: member

    I'm seeing much smaller improvements (0%-2.5% with gcc 11; 1.3%-7% with clang 13) on an old i7. (And very slightly worse performance compared to master with debug enabled)

    Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?
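
A minimal sketch of the `REPEAT10` idea, assuming a toy mixing step rather than the real ChaCha20 rounds. The helpers `mix10` and `mix10_loop` and the constants in them are hypothetical; only the macro definition comes from the suggestion above.

```cpp
#include <cstdint>

// Expands its argument ten times in a row.
#define REPEAT10(a) a a a a a a a a a a

// Toy example: the two statements are pasted ten times by the preprocessor.
inline uint32_t mix10(uint32_t v)
{
    REPEAT10(
        v += 0x9e3779b9u;
        v = (v << 5) | (v >> 27);
    )
    return v;
}

// Reference version using an ordinary counted loop.
inline uint32_t mix10_loop(uint32_t v)
{
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 5) | (v >> 27);
    }
    return v;
}
```

The macro argument can span multiple lines and statements, as long as it contains no top-level comma, which the preprocessor would treat as an argument separator.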

  16. laanwj commented at 9:50 AM on April 23, 2022: member
    • gcc 11.2.0, RISC-V 64-bit (SiFive Unmatched, 1.2 GHz): speedup is there, but much less pronounced (~5%):
    Before:
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
    |               22.29 |       44,862,631.89 |    0.8% |            0.00 |            0.00 |           0.00 |    0.0% |      0.26 | `CHACHA20_1MB`
    After:
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|---------------:|--------:|----------:|:----------
    |               21.23 |       47,101,646.21 |    0.9% |            0.00 |            0.00 |           0.00 |    0.0% |      0.25 | `CHACHA20_1MB`
    
    • gcc 10.2.1, aarch64 (custom i.MX8MQ board, 1 GHz), ~8% speedup:
    Before:
    |             ns/byte |              byte/s |    err% |        ins/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
    |                6.04 |      165,526,246.91 |    0.1% |           16.84 |           0.16 |   11.8% |      0.07 | `CHACHA20_1MB`
    After:
    |             ns/byte |              byte/s |    err% |        ins/byte |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|---------------:|--------:|----------:|:----------
    |                5.58 |      179,185,196.22 |    0.1% |           15.86 |           0.02 |    0.0% |      0.06 | `CHACHA20_1MB`
    

    It's a nice speedup, and a simple change, tested ACK 4f3a18906880b065b6119ccf32b2875748b297b2

    > Did you consider just changing the for() { ... } loop to REPEAT10( ... ) with #define REPEAT10(a) a a a a a a a a a a ?

    I like this idea: more elegantly than copy/pasting, it makes it immediately clear that it's the same code repeated. I would guess the generated code is exactly the same.

  17. MarcoFalke commented at 5:10 PM on April 23, 2022: member

    Not seeing a large difference on an i7. (Maybe a 1%-3% speedup?)

    gcc-12 Before:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.23 |      447,617,214.06 |    0.2% |      0.03 | `CHACHA20_1MB`
    |                2.26 |      441,653,947.12 |    0.1% |      0.01 | `CHACHA20_256BYTES`
    |                2.50 |      399,993,391.82 |    6.1% |      0.01 | :wavy_dash: `CHACHA20_64BYTES` (Unstable with ~6,241.4 iters. Increase `minEpochIterations` to e.g. 62414)
    |                7.03 |      142,173,319.29 |   10.1% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.26 |      307,218,931.17 |    1.7% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                8.83 |      113,259,198.67 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
    |                4.28 |      233,685,573.34 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               15.78 |       63,391,055.77 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                7.71 |      129,684,901.52 |    0.4% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
    

    gcc-12 After:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.20 |      454,707,913.08 |    0.8% |      0.03 | `CHACHA20_1MB`
    |                2.36 |      424,359,263.25 |    4.9% |      0.01 | `CHACHA20_256BYTES`
    |                2.41 |      414,622,602.59 |    0.4% |      0.01 | `CHACHA20_64BYTES`
    |                6.99 |      143,089,808.99 |    7.2% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.26 |      306,926,493.73 |    4.2% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                9.59 |      104,251,645.58 |    8.6% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~402.1 iters. Increase `minEpochIterations` to e.g. 4021)
    |                4.33 |      230,986,007.33 |    0.6% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.23 |       61,602,235.65 |    1.7% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                9.63 |      103,830,365.13 |    9.9% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,639.9 iters. Increase `minEpochIterations` to e.g. 16399)
    
    

    gcc-10 Before:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.26 |      442,527,877.02 |    0.5% |      0.03 | `CHACHA20_1MB`
    |                2.30 |      435,535,172.72 |    1.9% |      0.01 | `CHACHA20_256BYTES`
    |                2.39 |      418,262,709.74 |    0.4% |      0.01 | `CHACHA20_64BYTES`
    |                6.93 |      144,210,951.65 |    5.9% |      0.09 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.16 |      316,109,217.24 |    4.8% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                8.43 |      118,625,079.49 |    0.3% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT`
    |                4.18 |      239,143,934.28 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.05 |       62,308,156.96 |    5.2% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT` (Unstable with ~961.0 iters. Increase `minEpochIterations` to e.g. 9610)
    |                7.63 |      131,070,821.81 |    0.1% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT`
    

    gcc-10 After:

    |             ns/byte |              byte/s |    err% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------:|:----------
    |                2.20 |      454,351,689.08 |    0.2% |      0.03 | `CHACHA20_1MB`
    |                2.40 |      416,825,911.73 |    4.4% |      0.01 | `CHACHA20_256BYTES`
    |                2.40 |      416,369,054.39 |    0.2% |      0.01 | `CHACHA20_64BYTES`
    |                6.58 |      151,882,394.04 |   10.5% |      0.08 | :wavy_dash: `CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    |                3.03 |      329,600,644.76 |    0.9% |      0.04 | `CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT`
    |                9.40 |      106,431,172.41 |   10.1% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT` (Unstable with ~434.9 iters. Increase `minEpochIterations` to e.g. 4349)
    |                4.30 |      232,776,146.25 |    0.2% |      0.01 | `CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT`
    |               16.17 |       61,831,918.45 |    1.3% |      0.01 | `CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT`
    |                8.83 |      113,301,205.50 |    5.8% |      0.01 | :wavy_dash: `CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT` (Unstable with ~1,771.5 iters. Increase `minEpochIterations` to e.g. 17715)
    
  18. martinus commented at 5:45 PM on April 23, 2022: contributor

    I get the same 1-3% speedup on my i7. In my test, adding #pragma GCC unroll 10 in front of the loop seems to produce exactly the same unrolled loop as the hand-coded version; this works for both GCC and clang.

    Side note 1: use e.g. ./src/bench/bench_bitcoin -filter="CHACHA20.*" -min_time=2000 to run each test for 2 seconds to get more stable results

    Side note 2: No need to quote the result, it's markdown :slightly_smiling_face:

    My results on i7-8700, with clang 13.0.1:

    master

    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.91 |      523,770,793.06 |    0.1% |           18.52 |            6.08 |  3.043 |           0.20 |    0.0% |      1.09 | `CHACHA20_1MB`
    |                1.94 |      515,227,758.97 |    0.3% |           18.79 |            6.16 |  3.048 |           0.22 |    0.0% |      1.10 | `CHACHA20_256BYTES`
    |                2.02 |      494,527,885.82 |    0.2% |           19.61 |            6.44 |  3.046 |           0.28 |    0.0% |      1.10 | `CHACHA20_64BYTES`

    branch

    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                1.83 |      547,223,233.51 |    0.0% |           17.08 |            5.83 |  2.931 |           0.05 |    0.0% |      1.07 | `CHACHA20_1MB`
    |                1.87 |      535,851,391.81 |    0.1% |           17.51 |            5.95 |  2.942 |           0.07 |    0.0% |      1.10 | `CHACHA20_256BYTES`
    |                1.98 |      504,774,917.46 |    0.0% |           18.81 |            6.32 |  2.977 |           0.13 |    0.0% |      1.10 | `CHACHA20_64BYTES`

  19. Empact commented at 11:15 PM on April 23, 2022: member

    +1 for #pragma unroll or similar
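
A hedged sketch of the pragma-based alternative discussed above, using a toy loop body instead of the real ChaCha20 rounds. It assumes `#pragma GCC unroll N` (documented for GCC 8 and later) and `#pragma unroll N` (documented for Clang); the function name `mix10_hinted` and its constants are hypothetical. Compilers that recognize neither pragma simply compile the plain loop.

```cpp
#include <cstdint>

// Toy loop with a compiler-specific unroll request placed directly before it.
inline uint32_t mix10_hinted(uint32_t v)
{
#if defined(__clang__)
#pragma unroll 10
#elif defined(__GNUC__)
#pragma GCC unroll 10
#endif
    for (int i = 0; i < 10; ++i) {
        v += 0x9e3779b9u;
        v = (v << 7) | (v >> 25);
    }
    return v;
}
```

The pragma only affects code generation, not the result, so correctness is the same with or without it; whether the emitted code actually matches a hand-unrolled version has to be checked per compiler.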

  20. laanwj commented at 9:03 AM on April 27, 2022: member

    So I think the conclusion here is that on i7 there's no (or not much) difference but on other platforms it varies. But it never becomes worse. I think a performance optimization like this is mostly interesting for slower CPUs with less effective branch prediction so that's OK with me.

  21. laanwj commented at 6:18 PM on May 4, 2022: member

    @sipa What are your thoughts on using #pragma unroll or a macro? Or do you prefer keeping it this way?

  22. sipa commented at 6:20 PM on May 4, 2022: member

    @laanwj That won't work on every compiler.

    I'd be ok with switching to a macro to do the 10x expansion.

  23. MarcoFalke commented at 6:28 PM on May 4, 2022: member

    TIL that it is possible to pass multiple lines as an argument to a macro

  24. sipa commented at 6:29 PM on May 4, 2022: member

    > TIL that it is possible to pass multiple lines as an argument to a macro

    You clearly never saw the original serialization code this codebase had ;)

  25. sipa force-pushed on May 4, 2022
  26. sipa commented at 6:45 PM on May 4, 2022: member

    > I'd be ok with switching to a macro to do the 10x expansion.

    Done, used @ajtowns's approach suggested above.

  27. in src/crypto/chacha20.cpp:21 in 266bf15ddc outdated
      17 | @@ -18,6 +18,8 @@ constexpr static inline uint32_t rotl32(uint32_t v, int c) { return (v << c) | (
      18 |    a += b; d = rotl32(d ^ a, 8); \
      19 |    c += d; b = rotl32(b ^ c, 7);
      20 |  
      21 | +#define REPEAT10(a) a a a a a a a a a a
    


    MarcoFalke commented at 6:51 PM on May 4, 2022:
    #define REPEAT10(a) do { a a a a a a a a a a } while (0)
    

    nit: Shouldn't this use do-while?

    Otherwise writing

    if (blub) REPEAT10(bla());
    

    will do the wrong thing?

    Also, leaving the semicolon off after the do-while in the definition makes the compiler enforce that one is placed after the call.
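
To make the hazard concrete, here is a small sketch comparing the bare expansion with the `do { ... } while (0)` form. The macro names `REPEAT10_BARE` and `REPEAT10_WRAPPED` and the counting helpers are hypothetical, chosen only so both variants can sit side by side.

```cpp
// Bare expansion: only the first copy of the statement is governed by an `if`.
#define REPEAT10_BARE(a) a a a a a a a a a a

// Wrapped expansion: all ten copies form a single statement.
#define REPEAT10_WRAPPED(a) do { a a a a a a a a a a } while (0)

// With the bare macro, nine of the ten increments run even when cond is false.
inline int count_bare(bool cond)
{
    int n = 0;
    if (cond) REPEAT10_BARE(++n;)
    return n;
}

// With the wrapped macro, the `if` guards all ten increments, the call needs
// its own trailing semicolon, and an `else` branch still attaches correctly.
inline int count_wrapped(bool cond)
{
    int n = 0;
    if (cond) REPEAT10_WRAPPED(++n;);
    else n = -1;
    return n;
}
```

Here `count_bare(false)` returns 9 rather than 0, while `count_wrapped(false)` takes the `else` branch and returns -1; both return 10 when the condition is true.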


    sipa commented at 6:53 PM on May 4, 2022:

    Done.

  28. Unroll the ChaCha20 inner loop for performance 81c09ee45c
  29. sipa force-pushed on May 4, 2022
  30. martinus commented at 5:47 AM on May 5, 2022: contributor

    tested ACK 81c09ee with clang++ 13.0.1, test CHACHA20_1MB:

    • 4.3% faster on i9-9960X
    • 4.5% faster on i9-9980HK
    • 4.4% faster on i7-8700
  31. DrahtBot commented at 1:53 AM on May 8, 2022: member


    Guix builds

    | File | commit 460450836304f257d3fc20e9fe32cb3a4efaa82b (master) | commit 4fd4a5fe65ca0676ab8fb5c73665c83b5971822c (master and this pull)
    |:----------|:----------|:----------
    | `SHA256SUMS.part` | 2cc4936b229484ae... | 72b43e559d44ffb7...
    | `*-aarch64-linux-gnu-debug.tar.gz` | 9bddf61e8a0520d9... | deeb9f228bac481f...
    | `*-aarch64-linux-gnu.tar.gz` | e9d04df826660c90... | 522251f6bae6c76e...
    | `*-arm-linux-gnueabihf-debug.tar.gz` | 4beeedbf37980025... | 1670184e36cf8954...
    | `*-arm-linux-gnueabihf.tar.gz` | 3d050332fd87e4a2... | d96deb05bc2a4c32...
    | `*-arm64-apple-darwin-unsigned.dmg` | 4da5ad0e9772e4e6... | 76cc400c040540b0...
    | `*-arm64-apple-darwin-unsigned.tar.gz` | dcfdc8f43192660a... | b2635ec3d4fae50f...
    | `*-arm64-apple-darwin.tar.gz` | 4a5d70a227973d83... | d1fe669b0e4947c4...
    | `*-powerpc64-linux-gnu-debug.tar.gz` | 53d2e35633a6a31c... | 49c537f1d856cbc4...
    | `*-powerpc64-linux-gnu.tar.gz` | c375a065b6a1548d... | a86b04e408ec9570...
    | `*-powerpc64le-linux-gnu-debug.tar.gz` | 4969102569ae2d44... | 8e8ba290e0752fbd...
    | `*-powerpc64le-linux-gnu.tar.gz` | e4127eb1ced1baa7... | d79b1b65762f872b...
    | `*-riscv64-linux-gnu-debug.tar.gz` | 2e0700a552bee2aa... | a1ecf6009d2ebb4e...
    | `*-riscv64-linux-gnu.tar.gz` | 2e1a90e7096072c7... | df8f4d16400de367...
    | `*-win64-debug.zip` | a42b196a2551ff55... | 4cecf16acf68c1af...
    | `*-win64-setup-unsigned.exe` | ecea0e8c84dfa767... | cf8ed2f4015e5a1d...
    | `*-win64-unsigned.tar.gz` | 76ac00a1fded7fab... | f00c3589f6bc5cff...
    | `*-win64.zip` | c78256b254a7b0da... | 6818154f30241f4c...
    | `*-x86_64-apple-darwin-unsigned.dmg` | 086d56f16fe9ec2d... | f6dbb479fce016e5...
    | `*-x86_64-apple-darwin-unsigned.tar.gz` | a32dbe948312f693... | ce21801a74eb1971...
    | `*-x86_64-apple-darwin.tar.gz` | 31010eb16dfe31bc... | c4ed0c95ff3d6e2d...
    | `*-x86_64-linux-gnu-debug.tar.gz` | f9d542596820f03f... | 770b969a8724b3ea...
    | `*-x86_64-linux-gnu.tar.gz` | 34f5361f2f5dc8ac... | cc157942e3794baf...
    | `*.tar.gz` | 1eaba963a7fb2753... | be1c1b02a7e1bbd8...
    | `guix_build.log` | 3690d59eb499477b... | e1f3ab033c1676fe...
    | `guix_build.log.diff` |  | e933d67b5ec7a7fc...
  32. DrahtBot removed the label DrahtBot Guix build requested on May 8, 2022
  33. MarcoFalke commented at 11:54 AM on May 9, 2022: member
    • A few percent faster on AMD EPYC as well with gcc-9/gcc-11.2/gcc-12.1/clang-14
    • Same on AMD EPYC with guix built bench
    • Same on Cortex-A72 with guix built bench
  34. MarcoFalke commented at 11:56 AM on May 9, 2022: member

    ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟

    <details><summary>Show signature</summary>

    Signature:

    -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA512
    
    ACK 81c09ee45caecf8d9daf6766b94cebf54f3f08cd 🍟
    -----BEGIN PGP SIGNATURE-----
    
    iQGzBAEBCgAdFiEE+rVPoUahrI9sLGYTzit1aX5ppUgFAlwqrYAACgkQzit1aX5p
    pUgxzQv9FMC3MiK58jmwXRv26Mf41HrwpXJawhRSU/j+VM0Vq9JI6RlIkZ3E5Biy
    EKOxtL9cMKv6cMOyE5bihZF3uIqnwJCMAx+8cb+/6RYm33UseEMHxX/S8T+Q8/vy
    4r5BU/kisbX77yAjooN7Lr0/nKSv2E8APFjvcp7NIkWkx89W2zrk9z4eoFS5Dri/
    yAbMpc95eTtu4gmsbjNNE73/Q1MsdfXiBgzwP8ToV/grzoZPpBTt7dsb1QRRjn1N
    NAY/xG1p1kFo7ORbJ0ZHiKE4waat0Erqi8MX35f5mkMVa47X5VdDuP1FGn191f9K
    oS6cfgSZr4d+SE3SFer56/3QOVToa06VmxjmKoRv0j12S7NVOxnjRNjwN6XkhgoK
    wlpkNa3HxNxdMNmaUDqxXk5Z1zH5RCjZwiPQuMG5sExjemAAJXOFQ8WYnJFGp04R
    dFlXeMTy2ZQWMWoEMhdJ2jCDjvggjMW8t51VA3+GQvr8ZZmN10dzXPA+Qi1c25es
    QNkpUvPg
    =2W4Z
    -----END PGP SIGNATURE-----
    

    </details>

  35. MarcoFalke merged this on May 9, 2022
  36. MarcoFalke closed this on May 9, 2022

  37. sidhujag referenced this in commit 346bcd37d7 on May 9, 2022
  38. DrahtBot locked this on May 9, 2023
