Consider removing x86_64 field assembly #726

issue jonasnick openend this issue on March 24, 2020
  1. jonasnick commented at 9:54 am on March 24, 2020: contributor

    I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that --without-asm scalar operations are slower, but field operations are faster. In order to investigate this I created a branch that includes the configurable benchmark iterations from #722, a test_matrix.sh script and allows turning on field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).

    Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):

     0SECP256K1_BENCH_ITERS=200000
     1
     2bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
     3scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
     4scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
     5field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
     6field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
     7ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
     8ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us
     9
    10bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
    11scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
    12scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
    13field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
    14field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
    15ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
    16ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us
    17
    18bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
    19scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
    20scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
    21field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
    22field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
    23ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
    24ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us
    

    Note the 6.5% ecdsa_verify speedup. However, I don’t fully understand this:

    1. There’s assembly for field_sqr and field_mul. If we remove it, both functions are faster. But, some other internal functions are slower. For example:
      0SECP256K1_BENCH_ITERS=200000
      1group_add_affine: min 0.257us / avg 0.257us / max 0.259us
      2vs.
      3group_add_affine: min 0.263us / avg 0.263us / max 0.264us
      
      This could just be an artifact of micro-benching and I have not tested this with #667.
    2. Removing field arithmetic also makes ecdsa verification slower if endomorphism is enabled.
      0SECP256K1_BENCH_ITERS=200000
      1ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
      2vs.
      3ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us
      

    It should be noted that without field arithmetic assembly, in order to use 64 bit field arithmetic you need to have __int128 support (or use field=32bit with a 40% verification slowdown). I did not check where this is supported (MSVC?). Also we should try this with older compilers.

  2. elichai commented at 10:08 am on March 24, 2020: contributor
    Yeah I saw weird similiar things when looked into enabling asm in rust-secp256k1. But gave up trying to investigate it at some point
  3. real-or-random commented at 11:45 am on March 24, 2020: contributor
    It would be really interesting to see this with old compilers, and figure out if people need to rely on old compilers. An alternative could be to disable ASM by default but still support it.
  4. peterdettman commented at 12:00 pm on March 24, 2020: contributor
    I’m not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).
  5. sipa commented at 3:59 pm on March 24, 2020: contributor
    MSVC supports neither __int128 or inline GCC-style assembly. I believe all MSVC builds of this code just use 32-bit arithmetic. They have a builtin for 128-bit wide multiplication though, so we could write an analogue of the int128 code using that instead.
  6. real-or-random cross-referenced this on Mar 24, 2020 from issue x86_64 asm and 32bit field/scalar options should be incompatible by real-or-random
  7. real-or-random commented at 6:21 pm on March 26, 2020: contributor
    Another question is (assuming for a moment that performance is identical) whether we prefer the asm code or the C code. There are indeed some reasons to prefer asm, e.g., the compiler can’t miscompile it.
  8. elichai commented at 8:32 am on March 29, 2020: contributor

    I’m not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

    I read through the asm now, it looks like it matches the C code. I think this was done here https://github.com/bitcoin-core/secp256k1/pull/128/commits/f04861597061f22f5ce027edb0a18577441201cc

    FWIW looking in godbolt, the generated asm of the C functions look very much similar to the ones implemented in asm, although looking at llvm-mca, it says that the C code is twice as much cycles, which sounds weird.

    ASM: https://godbolt.org/z/YxGUgE C: https://godbolt.org/z/JPzYbo

  9. sipa commented at 8:15 pm on May 28, 2020: contributor

    I’ve done benchmarks on gcc and clang, with various versions, -O2 and -O3, with and without assembly enabled. All benchmarks have --disable-endomorphism --with-bignum=no. The results are kind of interesting, and show they’re clearly version dependent, unfortunately.

    All tests done on a Intel(R) Core(TM) i7-7820HQ CPU, with clock fixed at 2.3 GHz.

    secp256k1-ecdsa-speed-gcc

    secp256k1-ecdsa-speed-clang

    Raw data:

     0$ for C in gcc-4.7 gcc-4.9 gcc-5 gcc-6 gcc-7 gcc-8 gcc-9 gcc-10 clang-3.7 clang-3.8 clang-4.0 clang-6.0 clang-7 clang-8 clang-9 clang-10; do for OPT in "-O2" "-O3"; do for ASM in no x86_64; do (git clean -dfx && ./autogen.sh 2>/dev/null && ./configure CC=$C CFLAGS="$OPT" --disable-endomorphism --disable-openssl-tests --enable-benchmark --with-bignum=no --with-asm=$ASM && make -j9 bench_verify tests && ./tests 1) >/dev/null && echo -n "$C $OPT asm=$ASM: " && ./bench_verify; done; done; done) | tee -a ~/secp.log
     1gcc-4.7 -O2 asm=no: ecdsa_verify: min 131us / avg 131us / max 132us
     2gcc-4.7 -O2 asm=x86_64: ecdsa_verify: min 107us / avg 107us / max 107us
     3gcc-4.7 -O3 asm=no: ecdsa_verify: min 133us / avg 134us / max 134us
     4gcc-4.7 -O3 asm=x86_64: ecdsa_verify: min 106us / avg 106us / max 106us
     5gcc-4.9 -O2 asm=no: ecdsa_verify: min 125us / avg 125us / max 126us
     6gcc-4.9 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
     7gcc-4.9 -O3 asm=no: ecdsa_verify: min 127us / avg 127us / max 127us
     8gcc-4.9 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
     9gcc-5 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
    10gcc-5 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
    11gcc-5 -O3 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    12gcc-5 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
    13gcc-6 -O2 asm=no: ecdsa_verify: min 113us / avg 113us / max 113us
    14gcc-6 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
    15gcc-6 -O3 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
    16gcc-6 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    17gcc-7 -O2 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
    18gcc-7 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
    19gcc-7 -O3 asm=no: ecdsa_verify: min 123us / avg 123us / max 123us
    20gcc-7 -O3 asm=x86_64: ecdsa_verify: min 115us / avg 115us / max 115us
    21gcc-8 -O2 asm=no: ecdsa_verify: min 108us / avg 108us / max 109us
    22gcc-8 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
    23gcc-8 -O3 asm=no: ecdsa_verify: min 117us / avg 117us / max 118us
    24gcc-8 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
    25gcc-9 -O2 asm=no: ecdsa_verify: min 106us / avg 106us / max 106us
    26gcc-9 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    27gcc-9 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
    28gcc-9 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
    29gcc-10 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
    30gcc-10 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    31gcc-10 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
    32gcc-10 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 111us / max 111us
    33clang-3.7 -O2 asm=no: ecdsa_verify: min 122us / avg 122us / max 122us
    34clang-3.7 -O2 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 113us
    35clang-3.7 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    36clang-3.7 -O3 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 112us
    37clang-3.8 -O2 asm=no: ecdsa_verify: min 124us / avg 124us / max 124us
    38clang-3.8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    39clang-3.8 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    40clang-3.8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    41clang-4.0 -O2 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    42clang-4.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    43clang-4.0 -O3 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
    44clang-4.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    45clang-6.0 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    46clang-6.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    47clang-6.0 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
    48clang-6.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    49clang-7 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    50clang-7 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    51clang-7 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
    52clang-7 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    53clang-8 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    54clang-8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    55clang-8 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 112us
    56clang-8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    57clang-9 -O2 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
    58clang-9 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    59clang-9 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
    60clang-9 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    61clang-10 -O2 asm=no: ecdsa_verify: min 114us / avg 115us / max 115us
    62clang-10 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    63clang-10 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
    64clang-10 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    
  10. sipa commented at 9:02 pm on May 28, 2020: contributor
    So eh… should we look at the GCC 9.3.0+ -O3 output, and try convert that to inline assembly?
  11. real-or-random commented at 8:25 am on May 29, 2020: contributor

    Oof. Okay, so the clang performance is not really dependent on the version, so it’s more interesting to look at GCC.

    I think then that we should keep the asm for now, given that there many people/distros don’t have GCC >= 9.

    It would be interesting to see what causes the difference in -O2 for GCC 9 vs 10. We switched to -O2 because there were no large differences, and we figured out that -O2 is in fact quicker on GCC (see #700 (comment) for GCC 8). For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

    So eh… should we look at the GCC 9.3.0+ -O3 output, and try convert that to inline assembly?

    PR welcome. :stuck_out_tongue:

    By the way, the green line is missing from the clang graph.

  12. jonasnick commented at 1:13 pm on May 29, 2020: contributor

    Nice data @sipa. So our current default options are the best choices for <= gcc-8.

    Furthermore, now that we’ve switched to -O2 by default, with gcc-10 scalar assembly alone is not faster than scalar and field assembly together. As in the OP (gcc-9), with gcc-10’s -O3, scalar assembly alone is the fastest configuration.

    0gcc10 -O3:
    1ecdsa_verify: avg 58.8us # with assembly
    2ecdsa_verify: avg 53.2us # without assembly
    3ecdsa_verify: avg 52.8us # with scalar assembly only
    4
    5gcc10 -02:
    6ecdsa_verify: avg 57.6us # with assembly
    7ecdsa_verify: avg 59.3us # without assembly
    8ecdsa_verify: avg 58.5us # with scalar assembly only
    

    For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

    No idea what changed in gcc-10, but looking at above slopes there is some reason to believe that -O2 will get faster again with gcc-11 :grimacing:

  13. sipa commented at 10:26 pm on May 29, 2020: contributor

    By the way, the green line is missing from the clang graph.

    No, it’s actually just hidden by the orange line. For all versions of clang, -O2/asm seems indistinguishable from -O3/asm.

  14. elichai commented at 12:40 pm on June 4, 2020: contributor

    FWIW, when compiling to asm filed_5x52_int128_impl.h produces the same ASM for both -O2 and -O3:

    0$ sed -i "11s/^/#include \"..\/include\/secp256k1.h\" \n#include \"util.h\"\n /" src/field_5x52_int128_impl.h
    1$ sed -i 's/static//g' src/field_5x52_int128_impl.h
    2$ gcc -xc -std=c89 -S -O2 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o2.s
    3$ gcc -xc -std=c89 -S -O3 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o3.s
    4$ diff o2.s o3.s
    

    If we do actually believe that gcc-10 produce better asm we can easily use the commands above and then just link to these functions no matter the gcc version

  15. real-or-random cross-referenced this on Jul 15, 2020 from issue WIP: "safegcd" field and scalar inversion by peterdettman
  16. elichai cross-referenced this on Sep 12, 2020 from issue Avoid overly-wide multiplications in 5x52 field mul/sqr by peterdettman
  17. real-or-random commented at 2:49 pm on September 24, 2023: contributor

    I think by now, we should just turn off asm by default. Compiler-generated code has been faster for a few years, and so it should be better for most users. (related: https://github.com/bitcoin/bitcoin/pull/27897)

    Turning it off by default is real progress, and then we can keep procrastinating on whether we want to remove the existing ASM, replace it by CryptOpt, or whatever.

    Recent (unscientific) benchmark on my laptop:

     0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
     1
     2clang 16.0.6:
     3schnorrsig_sign               ,    30.0       ,    30.0       ,    30.2    
     4schnorrsig_verify             ,    57.3       ,    57.4       ,    57.4
     5
     6clang, no asm:
     7schnorrsig_sign               ,    28.7       ,    28.7       ,    28.8    
     8schnorrsig_verify             ,    52.4       ,    52.5       ,    52.6
     9
    10gcc version 13.2.1, asm:
    11schnorrsig_sign               ,    30.9       ,    31.2       ,    33.3    
    12schnorrsig_verify             ,    59.0       ,    59.1       ,    59.1    
    13
    14gcc, no asm:
    15schnorrsig_sign               ,    29.1       ,    29.1       ,    29.1    
    16schnorrsig_verify             ,    54.2       ,    54.2       ,    54.3
    
  18. real-or-random commented at 2:49 pm on September 24, 2023: contributor
    @theStack Are you willing to run some detailed benchmarks here with recent compilers?
  19. fanquake commented at 10:44 am on October 2, 2023: member
    cc @jamesob (might also be interested in benching)
  20. real-or-random added the label performance on Nov 13, 2023
  21. real-or-random commented at 3:58 pm on November 20, 2023: contributor

    We’ve just discussed this on IRC (log. A short summary is:

    • A good way to come to a decision is to run a benchmark on the GCC used for the official builds of Bitcoin Core. Currently, this is GCC 10.5.0 So if GCC 10.5.0 is faster than our ASM, we should do something now. Otherwise, we should wait for https://github.com/bitcoin/bitcoin/pull/27897, and then recheck with GCC 12.3.

    • We can’t just change the default of the ASM build setting to “off” because this setting affects also the scalar ASM. The scalar ASM is probably still better because its performance relies on ADC instructions, and compilers are still bad. (But the truth is, clang got better here, and also GCC 14 will become better, and we should do benchmarks.) We could, however, replace the current field ASM by compiler-generated ASM, or just remove it. Or we could even introduce separate settings for field and scalar ASM.

    In any case, the next step here is benchmark GCC 10.5.0 vs. our field ASM, both on the macro (ECDSA/Schnorr verify) and micro-level (field benchmarks).

  22. theStack commented at 3:35 pm on November 23, 2023: contributor

    In any case, the next step here is benchmark GCC 10.5.0 vs. our field ASM, both on the macro (ECDSA/Schnorr verify) and micro-level (field benchmarks).

    I’ve ran some benchmarks with GCC 10.5.0 on master (commit e72103932d5421f3ae501f4eba5452b1b454cb6e) and see the verification routines for both ECDSA and Schnorr signatures are significantly faster without the ASM optimizations (>10%). As assumed in the recent IRC discussion though, the scalar routines still perform better with ASM:

     0$ gcc-10 --version | head -1
     1gcc-10 (Ubuntu 10.5.0-1ubuntu1~22.04) 10.5.0
     2
     3$ ./configure --with-asm=x86_64 CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
     4Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
     5
     6ecdsa_verify                  ,    33.0       ,    33.5       ,    34.2    
     7ecdsa_sign                    ,    25.4       ,    25.9       ,    27.1    
     8schnorrsig_sign               ,    19.1       ,    19.3       ,    19.5    
     9schnorrsig_verify             ,    33.6       ,    34.0       ,    34.5    
    10
    11scalar_mul                    ,     0.0289    ,     0.0305    ,     0.0332 
    12scalar_split                  ,     0.120     ,     0.125     ,     0.133  
    13field_mul                     ,     0.0173    ,     0.0179    ,     0.0189 
    14field_sqrt                    ,     4.27      ,     4.37      ,     4.62   
    15
    16$ ./configure --with-asm=no CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
    17Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    18
    19ecdsa_verify                  ,    29.7       ,    30.5       ,    31.1    
    20ecdsa_sign                    ,    23.8       ,    24.3       ,    24.9    
    21schnorrsig_sign               ,    17.5       ,    17.9       ,    18.4    
    22schnorrsig_verify             ,    30.1       ,    30.4       ,    31.3    
    23
    24scalar_mul                    ,     0.0339    ,     0.0393    ,     0.0616 
    25scalar_split                  ,     0.141     ,     0.155     ,     0.192  
    26field_mul                     ,     0.0133    ,     0.0138    ,     0.0151 
    27field_sqrt                    ,     3.84      ,     3.95      ,     4.22   
    

    So it seems, given that others can confirm these results, that at least the field ASM could/should be removed (or replaced by compiler-generated ASM)? Also happy to run more benchmarks with newer version of gcc and clang.

  23. real-or-random commented at 3:43 pm on November 23, 2023: contributor
    Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?
  24. theStack commented at 3:54 pm on November 23, 2023: contributor

    Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

    Sure, good point. I’ve re-ran my benchmarks (this time with the “sign” parameter added to the bench call) and updated the results above accordingly. Seems that the sign operations for both ECDSA and Schnorr are also consistently faster on my machine without ASM on GCC 10.5.0.

  25. theStack cross-referenced this on Nov 23, 2023 from issue bench: add --help option to bench_internal by theStack
  26. real-or-random referenced this in commit 2f0762fa8f on Nov 24, 2023
  27. real-or-random cross-referenced this on Nov 24, 2023 from issue field: Remove x86_64 asm by real-or-random
  28. real-or-random commented at 7:52 am on November 24, 2023: contributor
    nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification…
  29. real-or-random referenced this in commit c1b4966410 on Nov 24, 2023
  30. theStack commented at 7:59 pm on November 24, 2023: contributor

    nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification…

    Ah, that was indeed unintended. Fixed that for the two other benchmarks I ran at the course of reviewing #1446 (https://github.com/bitcoin-core/secp256k1/pull/1446#pullrequestreview-1748386710).

  31. real-or-random commented at 12:20 pm on December 4, 2023: contributor
    We have removed the field assembly in https://github.com/bitcoin-core/secp256k1/pull/1446
  32. real-or-random closed this on Dec 4, 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-21 20:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me