Consider removing x86_64 field assembly #726

issue jonasnick opened this issue on March 24, 2020
  1. jonasnick commented at 9:54 AM on March 24, 2020: contributor

    I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that --without-asm scalar operations are slower, but field operations are faster. In order to investigate this I created a branch that includes the configurable benchmark iterations from #722, a test_matrix.sh script and allows turning on field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).

    Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):

    SECP256K1_BENCH_ITERS=200000
    
    bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
    scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
    scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
    field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
    field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
    ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
    ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us
    
    bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
    scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
    scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
    field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
    field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
    ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
    ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us
    
    bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
    scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
    scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
    field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
    field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
    ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
    ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us
    

    Note the 6.5% ecdsa_verify speedup. However, I don't fully understand this:

    1. There's assembly for field_sqr and field_mul. If we remove it, both functions are faster, but some other internal functions are slower. For example:

      SECP256K1_BENCH_ITERS=200000
      group_add_affine: min 0.257us / avg 0.257us / max 0.259us
      vs.
      group_add_affine: min 0.263us / avg 0.263us / max 0.264us
      

      This could just be an artifact of micro-benching and I have not tested this with #667.

    2. Removing the field arithmetic assembly also makes ecdsa verification slower if endomorphism is enabled.

      SECP256K1_BENCH_ITERS=200000
      ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
      vs.
      ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us
      

    It should be noted that without field assembly, 64-bit field arithmetic requires __int128 support (otherwise you have to fall back to field=32bit, with a ~40% verification slowdown). I did not check which compilers support this (MSVC?). We should also try this with older compilers.

  2. elichai commented at 10:08 AM on March 24, 2020: contributor

    Yeah, I saw similar weird things when I looked into enabling asm in rust-secp256k1, but gave up trying to investigate at some point.

  3. real-or-random commented at 11:45 AM on March 24, 2020: contributor

    It would be really interesting to see this with old compilers, and figure out if people need to rely on old compilers. An alternative could be to disable ASM by default but still support it.

  4. peterdettman commented at 12:00 PM on March 24, 2020: contributor

    I'm not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

  5. sipa commented at 3:59 PM on March 24, 2020: contributor

    MSVC supports neither __int128 nor inline GCC-style assembly. I believe all MSVC builds of this code just use 32-bit arithmetic. They have a builtin for 128-bit wide multiplication though, so we could write an analogue of the int128 code using that instead.

  6. real-or-random cross-referenced this on Mar 24, 2020 from issue x86_64 asm and 32bit field/scalar options should be incompatible by real-or-random
  7. real-or-random commented at 6:21 PM on March 26, 2020: contributor

    Another question is (assuming for a moment that performance is identical) whether we prefer the asm code or the C code. There are indeed some reasons to prefer asm, e.g., the compiler can't miscompile it.

  8. elichai commented at 8:32 AM on March 29, 2020: contributor

    I'm not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

    I read through the asm now, it looks like it matches the C code. I think this was done here https://github.com/bitcoin-core/secp256k1/pull/128/commits/f04861597061f22f5ce027edb0a18577441201cc

    FWIW, looking in godbolt, the generated asm of the C functions looks very similar to the hand-written asm, although llvm-mca claims the C code takes about twice as many cycles, which sounds weird.

    ASM: https://godbolt.org/z/YxGUgE C: https://godbolt.org/z/JPzYbo

  9. sipa commented at 8:15 PM on May 28, 2020: contributor

    I've done benchmarks on gcc and clang, with various versions, -O2 and -O3, with and without assembly enabled. All benchmarks use --disable-endomorphism --with-bignum=no. The results are kind of interesting and, unfortunately, clearly version-dependent.

    All tests done on an Intel(R) Core(TM) i7-7820HQ CPU, with the clock fixed at 2.3 GHz.

    [graph: secp256k1-ecdsa-speed-gcc (ecdsa_verify time by gcc version)]

    [graph: secp256k1-ecdsa-speed-clang (ecdsa_verify time by clang version)]

    Raw data:

    $ for C in gcc-4.7 gcc-4.9 gcc-5 gcc-6 gcc-7 gcc-8 gcc-9 gcc-10 clang-3.7 clang-3.8 clang-4.0 clang-6.0 clang-7 clang-8 clang-9 clang-10; do for OPT in "-O2" "-O3"; do for ASM in no x86_64; do (git clean -dfx && ./autogen.sh 2>/dev/null && ./configure CC=$C CFLAGS="$OPT" --disable-endomorphism --disable-openssl-tests --enable-benchmark --with-bignum=no --with-asm=$ASM && make -j9 bench_verify tests && ./tests 1) >/dev/null && echo -n "$C $OPT asm=$ASM: " && ./bench_verify; done; done; done) | tee -a ~/secp.log
    gcc-4.7 -O2 asm=no: ecdsa_verify: min 131us / avg 131us / max 132us
    gcc-4.7 -O2 asm=x86_64: ecdsa_verify: min 107us / avg 107us / max 107us
    gcc-4.7 -O3 asm=no: ecdsa_verify: min 133us / avg 134us / max 134us
    gcc-4.7 -O3 asm=x86_64: ecdsa_verify: min 106us / avg 106us / max 106us
    gcc-4.9 -O2 asm=no: ecdsa_verify: min 125us / avg 125us / max 126us
    gcc-4.9 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
    gcc-4.9 -O3 asm=no: ecdsa_verify: min 127us / avg 127us / max 127us
    gcc-4.9 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
    gcc-5 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
    gcc-5 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
    gcc-5 -O3 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    gcc-5 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
    gcc-6 -O2 asm=no: ecdsa_verify: min 113us / avg 113us / max 113us
    gcc-6 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
    gcc-6 -O3 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
    gcc-6 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    gcc-7 -O2 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
    gcc-7 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
    gcc-7 -O3 asm=no: ecdsa_verify: min 123us / avg 123us / max 123us
    gcc-7 -O3 asm=x86_64: ecdsa_verify: min 115us / avg 115us / max 115us
    gcc-8 -O2 asm=no: ecdsa_verify: min 108us / avg 108us / max 109us
    gcc-8 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
    gcc-8 -O3 asm=no: ecdsa_verify: min 117us / avg 117us / max 118us
    gcc-8 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
    gcc-9 -O2 asm=no: ecdsa_verify: min 106us / avg 106us / max 106us
    gcc-9 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    gcc-9 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
    gcc-9 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
    gcc-10 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
    gcc-10 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
    gcc-10 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
    gcc-10 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 111us / max 111us
    clang-3.7 -O2 asm=no: ecdsa_verify: min 122us / avg 122us / max 122us
    clang-3.7 -O2 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 113us
    clang-3.7 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    clang-3.7 -O3 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 112us
    clang-3.8 -O2 asm=no: ecdsa_verify: min 124us / avg 124us / max 124us
    clang-3.8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    clang-3.8 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    clang-3.8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    clang-4.0 -O2 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
    clang-4.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    clang-4.0 -O3 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
    clang-4.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    clang-6.0 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    clang-6.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    clang-6.0 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
    clang-6.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
    clang-7 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    clang-7 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    clang-7 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
    clang-7 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    clang-8 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
    clang-8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    clang-8 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 112us
    clang-8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    clang-9 -O2 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
    clang-9 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    clang-9 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
    clang-9 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
    clang-10 -O2 asm=no: ecdsa_verify: min 114us / avg 115us / max 115us
    clang-10 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    clang-10 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
    clang-10 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
    
  10. sipa commented at 9:02 PM on May 28, 2020: contributor

    So eh... should we look at the GCC 9.3.0+ -O3 output, and try to convert that to inline assembly?

  11. real-or-random commented at 8:25 AM on May 29, 2020: contributor

    Oof. Okay, so the clang performance is not really dependent on the version, so it's more interesting to look at GCC.

    I think then that we should keep the asm for now, given that many people/distros don't have GCC >= 9.

    It would be interesting to see what causes the difference in -O2 for GCC 9 vs 10. We switched to -O2 because there were no large differences, and we figured out that -O2 is in fact quicker on GCC (see #700 (comment) for GCC 8). For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

    So eh... should we look at the GCC 9.3.0+ -O3 output, and try to convert that to inline assembly?

    PR welcome. :stuck_out_tongue:

    By the way, the green line is missing from the clang graph.

  12. jonasnick commented at 1:13 PM on May 29, 2020: contributor

    Nice data @sipa. So our current default options are the best choices for <= gcc-8.

    Furthermore, now that we've switched to -O2 by default, with gcc-10 scalar assembly alone is not faster than scalar and field assembly together. As in the OP (gcc-9), with gcc-10's -O3, scalar assembly alone is the fastest configuration.

    gcc10 -O3:
    ecdsa_verify: avg 58.8us # with assembly
    ecdsa_verify: avg 53.2us # without assembly
    ecdsa_verify: avg 52.8us # with scalar assembly only
    
    gcc10 -O2:
    ecdsa_verify: avg 57.6us # with assembly
    ecdsa_verify: avg 59.3us # without assembly
    ecdsa_verify: avg 58.5us # with scalar assembly only
    

    For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

    No idea what changed in gcc-10, but looking at the above slopes there is some reason to believe that -O2 will get faster again with gcc-11 :grimacing:

  13. sipa commented at 10:26 PM on May 29, 2020: contributor

    By the way, the green line is missing from the clang graph.

    No, it's actually just hidden by the orange line. For all versions of clang, -O2/asm seems indistinguishable from -O3/asm.

  14. elichai commented at 12:40 PM on June 4, 2020: contributor

    FWIW, when compiled to asm, field_5x52_int128_impl.h produces the same ASM for both -O2 and -O3:

    $ sed -i "11s/^/#include \"..\/include\/secp256k1.h\" \n#include \"util.h\"\n /" src/field_5x52_int128_impl.h
    $ sed -i 's/static//g' src/field_5x52_int128_impl.h
    $ gcc -xc -std=c89 -S -O2 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o2.s
    $ gcc -xc -std=c89 -S -O3 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o3.s
    $ diff o2.s o3.s
    

    If we do actually believe that gcc-10 produces better asm, we can easily use the commands above and then just link against these functions regardless of gcc version

  15. real-or-random cross-referenced this on Jul 15, 2020 from issue WIP: "safegcd" field and scalar inversion by peterdettman
  16. elichai cross-referenced this on Sep 12, 2020 from issue Avoid overly-wide multiplications in 5x52 field mul/sqr by peterdettman
  17. real-or-random commented at 2:49 PM on September 24, 2023: contributor

    I think by now, we should just turn off asm by default. Compiler-generated code has been faster for a few years, and so it should be better for most users. (related: https://github.com/bitcoin/bitcoin/pull/27897)

    Turning it off by default is real progress, and then we can keep procrastinating on whether we want to remove the existing ASM, replace it by CryptOpt, or whatever.

    Recent (unscientific) benchmark on my laptop:

    Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
    
    clang 16.0.6:
    schnorrsig_sign               ,    30.0       ,    30.0       ,    30.2    
    schnorrsig_verify             ,    57.3       ,    57.4       ,    57.4
    
    clang, no asm:
    schnorrsig_sign               ,    28.7       ,    28.7       ,    28.8    
    schnorrsig_verify             ,    52.4       ,    52.5       ,    52.6
    
    gcc version 13.2.1, asm:
    schnorrsig_sign               ,    30.9       ,    31.2       ,    33.3    
    schnorrsig_verify             ,    59.0       ,    59.1       ,    59.1    
    
    gcc, no asm:
    schnorrsig_sign               ,    29.1       ,    29.1       ,    29.1    
    schnorrsig_verify             ,    54.2       ,    54.2       ,    54.3
    
  18. real-or-random commented at 2:49 PM on September 24, 2023: contributor

    @theStack Are you willing to run some detailed benchmarks here with recent compilers?

  19. fanquake commented at 10:44 AM on October 2, 2023: member

    cc @jamesob (might also be interested in benching)

  20. real-or-random added the label performance on Nov 13, 2023
  21. real-or-random commented at 3:58 PM on November 20, 2023: contributor

    We've just discussed this on IRC (log). A short summary is:

    • A good way to come to a decision is to run a benchmark on the GCC used for the official builds of Bitcoin Core. Currently, this is GCC 10.5.0. So if GCC 10.5.0 is faster than our ASM, we should do something now. Otherwise, we should wait for https://github.com/bitcoin/bitcoin/pull/27897, and then recheck with GCC 12.3.

    • We can't just change the default of the ASM build setting to "off" because this setting affects also the scalar ASM. The scalar ASM is probably still better because its performance relies on ADC instructions, and compilers are still bad. (But the truth is, clang got better here, and also GCC 14 will become better, and we should do benchmarks.) We could, however, replace the current field ASM by compiler-generated ASM, or just remove it. Or we could even introduce separate settings for field and scalar ASM.

    In any case, the next step here is to benchmark GCC 10.5.0 vs. our field ASM, both at the macro (ECDSA/Schnorr verify) and micro level (field benchmarks).

  22. theStack commented at 3:35 PM on November 23, 2023: contributor

    In any case, the next step here is to benchmark GCC 10.5.0 vs. our field ASM, both at the macro (ECDSA/Schnorr verify) and micro level (field benchmarks).

    I've run some benchmarks with GCC 10.5.0 on master (commit e72103932d5421f3ae501f4eba5452b1b454cb6e), and the verification routines for both ECDSA and Schnorr signatures are significantly faster without the ASM optimizations (>10%). As assumed in the recent IRC discussion, though, the scalar routines still perform better with ASM:

    $ gcc-10 --version | head -1
    gcc-10 (Ubuntu 10.5.0-1ubuntu1~22.04) 10.5.0
    
    $ ./configure --with-asm=x86_64 CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
    Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    
    ecdsa_verify                  ,    33.0       ,    33.5       ,    34.2    
    ecdsa_sign                    ,    25.4       ,    25.9       ,    27.1    
    schnorrsig_sign               ,    19.1       ,    19.3       ,    19.5    
    schnorrsig_verify             ,    33.6       ,    34.0       ,    34.5    
    
    scalar_mul                    ,     0.0289    ,     0.0305    ,     0.0332 
    scalar_split                  ,     0.120     ,     0.125     ,     0.133  
    field_mul                     ,     0.0173    ,     0.0179    ,     0.0189 
    field_sqrt                    ,     4.27      ,     4.37      ,     4.62   
    
    $ ./configure --with-asm=no CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
    Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    
    ecdsa_verify                  ,    29.7       ,    30.5       ,    31.1    
    ecdsa_sign                    ,    23.8       ,    24.3       ,    24.9    
    schnorrsig_sign               ,    17.5       ,    17.9       ,    18.4    
    schnorrsig_verify             ,    30.1       ,    30.4       ,    31.3    
    
    scalar_mul                    ,     0.0339    ,     0.0393    ,     0.0616 
    scalar_split                  ,     0.141     ,     0.155     ,     0.192  
    field_mul                     ,     0.0133    ,     0.0138    ,     0.0151 
    field_sqrt                    ,     3.84      ,     3.95      ,     4.22   
    

    So it seems, assuming others can confirm these results, that at least the field ASM could/should be removed (or replaced by compiler-generated ASM)? Also happy to run more benchmarks with newer versions of gcc and clang.

  23. real-or-random commented at 3:43 PM on November 23, 2023: contributor

    Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

  24. theStack commented at 3:54 PM on November 23, 2023: contributor

    Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

    Sure, good point. I've re-run my benchmarks (this time with the "sign" parameter added to the bench call) and updated the results above accordingly. The sign operations for both ECDSA and Schnorr are also consistently faster without ASM on my machine with GCC 10.5.0.

  25. theStack cross-referenced this on Nov 23, 2023 from issue bench: add --help option to bench_internal by theStack
  26. real-or-random referenced this in commit 2f0762fa8f on Nov 24, 2023
  27. real-or-random cross-referenced this on Nov 24, 2023 from issue field: Remove x86_64 asm by real-or-random
  28. real-or-random commented at 7:52 AM on November 24, 2023: contributor

    nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification...

  29. real-or-random referenced this in commit c1b4966410 on Nov 24, 2023
  30. theStack commented at 7:59 PM on November 24, 2023: contributor

    nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification...

    Ah, that was indeed unintended. Fixed that for the two other benchmarks I ran in the course of reviewing #1446 (https://github.com/bitcoin-core/secp256k1/pull/1446#pullrequestreview-1748386710).

  31. real-or-random commented at 12:20 PM on December 4, 2023: contributor

    We have removed the field assembly in https://github.com/bitcoin-core/secp256k1/pull/1446

  32. real-or-random closed this on Dec 4, 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-22 20:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me