Consider removing x86_64 field assembly

jonasnick commented at 9:54 am on March 24, 2020: contributor

I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that --without-asm scalar operations are slower, but field operations are faster. In order to investigate this I created a branch that includes the configurable benchmark iterations from #722, a test_matrix.sh script and allows turning on field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).

Here are the results with gcc 9.3.0 (got similar results with clang 9.0.1):

 0SECP256K1_BENCH_ITERS=200000
 1
 2bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
 3scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
 4scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
 5field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
 6field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
 7ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
 8ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us
 9
10bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
11scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
12scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
13field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
14field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
15ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
16ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us
17
18bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
19scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
20scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
21field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
22field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
23ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
24ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us

Note the 6.5% ecdsa_verify speedup. However, I don’t fully understand this:

There’s assembly for field_sqr and field_mul. If we remove it, both functions are faster. But, some other internal functions are slower. For example:
```
0SECP256K1_BENCH_ITERS=200000
1group_add_affine: min 0.257us / avg 0.257us / max 0.259us
2vs.
3group_add_affine: min 0.263us / avg 0.263us / max 0.264us
```
This could just be an artifact of micro-benching and I have not tested this with #667.

Removing field arithmetic also makes ecdsa verification slower if endomorphism is enabled.

0SECP256K1_BENCH_ITERS=200000
1ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
2vs.
3ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us

It should be noted that without field arithmetic assembly, in order to use 64 bit field arithmetic you need to have __int128 support (or use field=32bit with a 40% verification slowdown). I did not check where this is supported (MSVC?). Also we should try this with older compilers.

elichai commented at 10:08 am on March 24, 2020: contributor

Yeah I saw weird similiar things when looked into enabling asm in rust-secp256k1. But gave up trying to investigate it at some point

real-or-random commented at 11:45 am on March 24, 2020: contributor

It would be really interesting to see this with old compilers, and figure out if people need to rely on old compilers. An alternative could be to disable ASM by default but still support it.

peterdettman commented at 12:00 pm on March 24, 2020: contributor

I’m not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

sipa commented at 3:59 pm on March 24, 2020: contributor

MSVC supports neither __int128 or inline GCC-style assembly. I believe all MSVC builds of this code just use 32-bit arithmetic. They have a builtin for 128-bit wide multiplication though, so we could write an analogue of the int128 code using that instead.

real-or-random cross-referenced this on Mar 24, 2020 from issue x86_64 asm and 32bit field/scalar options should be incompatible by real-or-random

real-or-random commented at 6:21 pm on March 26, 2020: contributor

Another question is (assuming for a moment that performance is identical) whether we prefer the asm code or the C code. There are indeed some reasons to prefer asm, e.g., the compiler can’t miscompile it.

elichai commented at 8:32 am on March 29, 2020: contributor

I’m not sure the field asm ever got updated to use the improved mul/sqr approach that the C code uses (especially field_5x52_int128_impl.h).

I read through the asm now, it looks like it matches the C code. I think this was done here https://github.com/bitcoin-core/secp256k1/pull/128/commits/f04861597061f22f5ce027edb0a18577441201cc

FWIW looking in godbolt, the generated asm of the C functions look very much similar to the ones implemented in asm, although looking at llvm-mca, it says that the C code is twice as much cycles, which sounds weird.

ASM: https://godbolt.org/z/YxGUgE C: https://godbolt.org/z/JPzYbo

sipa commented at 8:15 pm on May 28, 2020: contributor

I’ve done benchmarks on gcc and clang, with various versions, -O2 and -O3, with and without assembly enabled. All benchmarks have --disable-endomorphism --with-bignum=no. The results are kind of interesting, and show they’re clearly version dependent, unfortunately.

All tests done on a Intel(R) Core(TM) i7-7820HQ CPU, with clock fixed at 2.3 GHz.

secp256k1-ecdsa-speed-gcc

secp256k1-ecdsa-speed-clang

Raw data:

 0$ for C in gcc-4.7 gcc-4.9 gcc-5 gcc-6 gcc-7 gcc-8 gcc-9 gcc-10 clang-3.7 clang-3.8 clang-4.0 clang-6.0 clang-7 clang-8 clang-9 clang-10; do for OPT in "-O2" "-O3"; do for ASM in no x86_64; do (git clean -dfx && ./autogen.sh 2>/dev/null && ./configure CC=$C CFLAGS="$OPT" --disable-endomorphism --disable-openssl-tests --enable-benchmark --with-bignum=no --with-asm=$ASM && make -j9 bench_verify tests && ./tests 1) >/dev/null && echo -n "$C $OPT asm=$ASM: " && ./bench_verify; done; done; done) | tee -a ~/secp.log
 1gcc-4.7 -O2 asm=no: ecdsa_verify: min 131us / avg 131us / max 132us
 2gcc-4.7 -O2 asm=x86_64: ecdsa_verify: min 107us / avg 107us / max 107us
 3gcc-4.7 -O3 asm=no: ecdsa_verify: min 133us / avg 134us / max 134us
 4gcc-4.7 -O3 asm=x86_64: ecdsa_verify: min 106us / avg 106us / max 106us
 5gcc-4.9 -O2 asm=no: ecdsa_verify: min 125us / avg 125us / max 126us
 6gcc-4.9 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
 7gcc-4.9 -O3 asm=no: ecdsa_verify: min 127us / avg 127us / max 127us
 8gcc-4.9 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
 9gcc-5 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
10gcc-5 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
11gcc-5 -O3 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
12gcc-5 -O3 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 108us
13gcc-6 -O2 asm=no: ecdsa_verify: min 113us / avg 113us / max 113us
14gcc-6 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
15gcc-6 -O3 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
16gcc-6 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
17gcc-7 -O2 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
18gcc-7 -O2 asm=x86_64: ecdsa_verify: min 109us / avg 109us / max 109us
19gcc-7 -O3 asm=no: ecdsa_verify: min 123us / avg 123us / max 123us
20gcc-7 -O3 asm=x86_64: ecdsa_verify: min 115us / avg 115us / max 115us
21gcc-8 -O2 asm=no: ecdsa_verify: min 108us / avg 108us / max 109us
22gcc-8 -O2 asm=x86_64: ecdsa_verify: min 108us / avg 108us / max 109us
23gcc-8 -O3 asm=no: ecdsa_verify: min 117us / avg 117us / max 118us
24gcc-8 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
25gcc-9 -O2 asm=no: ecdsa_verify: min 106us / avg 106us / max 106us
26gcc-9 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
27gcc-9 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
28gcc-9 -O3 asm=x86_64: ecdsa_verify: min 111us / avg 111us / max 111us
29gcc-10 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 115us
30gcc-10 -O2 asm=x86_64: ecdsa_verify: min 110us / avg 110us / max 110us
31gcc-10 -O3 asm=no: ecdsa_verify: min 105us / avg 105us / max 106us
32gcc-10 -O3 asm=x86_64: ecdsa_verify: min 110us / avg 111us / max 111us
33clang-3.7 -O2 asm=no: ecdsa_verify: min 122us / avg 122us / max 122us
34clang-3.7 -O2 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 113us
35clang-3.7 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
36clang-3.7 -O3 asm=x86_64: ecdsa_verify: min 112us / avg 112us / max 112us
37clang-3.8 -O2 asm=no: ecdsa_verify: min 124us / avg 124us / max 124us
38clang-3.8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
39clang-3.8 -O3 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
40clang-3.8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
41clang-4.0 -O2 asm=no: ecdsa_verify: min 119us / avg 119us / max 120us
42clang-4.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
43clang-4.0 -O3 asm=no: ecdsa_verify: min 118us / avg 118us / max 118us
44clang-4.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
45clang-6.0 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
46clang-6.0 -O2 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
47clang-6.0 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
48clang-6.0 -O3 asm=x86_64: ecdsa_verify: min 114us / avg 114us / max 114us
49clang-7 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
50clang-7 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
51clang-7 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 113us
52clang-7 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
53clang-8 -O2 asm=no: ecdsa_verify: min 115us / avg 115us / max 116us
54clang-8 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
55clang-8 -O3 asm=no: ecdsa_verify: min 112us / avg 112us / max 112us
56clang-8 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
57clang-9 -O2 asm=no: ecdsa_verify: min 114us / avg 114us / max 115us
58clang-9 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
59clang-9 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
60clang-9 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 113us
61clang-10 -O2 asm=no: ecdsa_verify: min 114us / avg 115us / max 115us
62clang-10 -O2 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us
63clang-10 -O3 asm=no: ecdsa_verify: min 111us / avg 111us / max 112us
64clang-10 -O3 asm=x86_64: ecdsa_verify: min 113us / avg 113us / max 114us

sipa commented at 9:02 pm on May 28, 2020: contributor

So eh… should we look at the GCC 9.3.0+ -O3 output, and try convert that to inline assembly?

real-or-random commented at 8:25 am on May 29, 2020: contributor

Oof. Okay, so the clang performance is not really dependent on the version, so it’s more interesting to look at GCC.

I think then that we should keep the asm for now, given that there many people/distros don’t have GCC >= 9.

It would be interesting to see what causes the difference in -O2 for GCC 9 vs 10. We switched to -O2 because there were no large differences, and we figured out that -O2 is in fact quicker on GCC (see #700 (comment) for GCC 8). For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

So eh… should we look at the GCC 9.3.0+ -O3 output, and try convert that to inline assembly?

PR welcome. :stuck_out_tongue:

By the way, the green line is missing from the clang graph.

jonasnick commented at 1:13 pm on May 29, 2020: contributor

Nice data @sipa. So our current default options are the best choices for <= gcc-8.

Furthermore, now that we’ve switched to -O2 by default, with gcc-10 scalar assembly alone is not faster than scalar and field assembly together. As in the OP (gcc-9), with gcc-10’s -O3, scalar assembly alone is the fastest configuration.

0gcc10 -O3:
1ecdsa_verify: avg 58.8us # with assembly
2ecdsa_verify: avg 53.2us # without assembly
3ecdsa_verify: avg 52.8us # with scalar assembly only
4
5gcc10 -02:
6ecdsa_verify: avg 57.6us # with assembly
7ecdsa_verify: avg 59.3us # without assembly
8ecdsa_verify: avg 58.5us # with scalar assembly only

For example, if they moved some flags from -O2 to -O3 in GCC 10, maybe we want to enable them again?

No idea what changed in gcc-10, but looking at above slopes there is some reason to believe that -O2 will get faster again with gcc-11 :grimacing:

sipa commented at 10:26 pm on May 29, 2020: contributor

By the way, the green line is missing from the clang graph.

No, it’s actually just hidden by the orange line. For all versions of clang, -O2/asm seems indistinguishable from -O3/asm.

elichai commented at 12:40 pm on June 4, 2020: contributor

FWIW, when compiling to asm filed_5x52_int128_impl.h produces the same ASM for both -O2 and -O3:

0$ sed -i "11s/^/#include \"..\/include\/secp256k1.h\" \n#include \"util.h\"\n /" src/field_5x52_int128_impl.h
1$ sed -i 's/static//g' src/field_5x52_int128_impl.h
2$ gcc -xc -std=c89 -S -O2 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o2.s
3$ gcc -xc -std=c89 -S -O3 -DHAVE___INT128 src/field_5x52_int128_impl.h -o o3.s
4$ diff o2.s o3.s

If we do actually believe that gcc-10 produce better asm we can easily use the commands above and then just link to these functions no matter the gcc version

real-or-random cross-referenced this on Jul 15, 2020 from issue WIP: "safegcd" field and scalar inversion by peterdettman

elichai cross-referenced this on Sep 12, 2020 from issue Avoid overly-wide multiplications in 5x52 field mul/sqr by peterdettman

real-or-random commented at 2:49 pm on September 24, 2023: contributor

I think by now, we should just turn off asm by default. Compiler-generated code has been faster for a few years, and so it should be better for most users. (related: https://github.com/bitcoin/bitcoin/pull/27897)

Turning it off by default is real progress, and then we can keep procrastinating on whether we want to remove the existing ASM, replace it by CryptOpt, or whatever.

Recent (unscientific) benchmark on my laptop:

 0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
 1
 2clang 16.0.6:
 3schnorrsig_sign               ,    30.0       ,    30.0       ,    30.2    
 4schnorrsig_verify             ,    57.3       ,    57.4       ,    57.4
 5
 6clang, no asm:
 7schnorrsig_sign               ,    28.7       ,    28.7       ,    28.8    
 8schnorrsig_verify             ,    52.4       ,    52.5       ,    52.6
 9
10gcc version 13.2.1, asm:
11schnorrsig_sign               ,    30.9       ,    31.2       ,    33.3    
12schnorrsig_verify             ,    59.0       ,    59.1       ,    59.1    
13
14gcc, no asm:
15schnorrsig_sign               ,    29.1       ,    29.1       ,    29.1    
16schnorrsig_verify             ,    54.2       ,    54.2       ,    54.3

real-or-random commented at 2:49 pm on September 24, 2023: contributor

@theStack Are you willing to run some detailed benchmarks here with recent compilers?

fanquake commented at 10:44 am on October 2, 2023: member

cc @jamesob (might also be interested in benching)

real-or-random added the label performance on Nov 13, 2023

real-or-random commented at 3:58 pm on November 20, 2023: contributor

We’ve just discussed this on IRC (log. A short summary is:

A good way to come to a decision is to run a benchmark on the GCC used for the official builds of Bitcoin Core. Currently, this is GCC 10.5.0 So if GCC 10.5.0 is faster than our ASM, we should do something now. Otherwise, we should wait for https://github.com/bitcoin/bitcoin/pull/27897, and then recheck with GCC 12.3.
We can’t just change the default of the ASM build setting to “off” because this setting affects also the scalar ASM. The scalar ASM is probably still better because its performance relies on ADC instructions, and compilers are still bad. (But the truth is, clang got better here, and also GCC 14 will become better, and we should do benchmarks.) We could, however, replace the current field ASM by compiler-generated ASM, or just remove it. Or we could even introduce separate settings for field and scalar ASM.

In any case, the next step here is benchmark GCC 10.5.0 vs. our field ASM, both on the macro (ECDSA/Schnorr verify) and micro-level (field benchmarks).

theStack commented at 3:35 pm on November 23, 2023: contributor

In any case, the next step here is benchmark GCC 10.5.0 vs. our field ASM, both on the macro (ECDSA/Schnorr verify) and micro-level (field benchmarks).

I’ve ran some benchmarks with GCC 10.5.0 on master (commit e72103932d5421f3ae501f4eba5452b1b454cb6e) and see the verification routines for both ECDSA and Schnorr signatures are significantly faster without the ASM optimizations (>10%). As assumed in the recent IRC discussion though, the scalar routines still perform better with ASM:

 0$ gcc-10 --version | head -1
 1gcc-10 (Ubuntu 10.5.0-1ubuntu1~22.04) 10.5.0
 2
 3$ ./configure --with-asm=x86_64 CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
 4Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
 5
 6ecdsa_verify                  ,    33.0       ,    33.5       ,    34.2    
 7ecdsa_sign                    ,    25.4       ,    25.9       ,    27.1    
 8schnorrsig_sign               ,    19.1       ,    19.3       ,    19.5    
 9schnorrsig_verify             ,    33.6       ,    34.0       ,    34.5    
10
11scalar_mul                    ,     0.0289    ,     0.0305    ,     0.0332 
12scalar_split                  ,     0.120     ,     0.125     ,     0.133  
13field_mul                     ,     0.0173    ,     0.0179    ,     0.0189 
14field_sqrt                    ,     4.27      ,     4.37      ,     4.62   
15
16$ ./configure --with-asm=no CC=gcc-10 && make clean && make && ./bench verify sign && ./bench_internal mul split sqrt | tail -n +2
17Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
18
19ecdsa_verify                  ,    29.7       ,    30.5       ,    31.1    
20ecdsa_sign                    ,    23.8       ,    24.3       ,    24.9    
21schnorrsig_sign               ,    17.5       ,    17.9       ,    18.4    
22schnorrsig_verify             ,    30.1       ,    30.4       ,    31.3    
23
24scalar_mul                    ,     0.0339    ,     0.0393    ,     0.0616 
25scalar_split                  ,     0.141     ,     0.155     ,     0.192  
26field_mul                     ,     0.0133    ,     0.0138    ,     0.0151 
27field_sqrt                    ,     3.84      ,     3.95      ,     4.22

So it seems, given that others can confirm these results, that at least the field ASM could/should be removed (or replaced by compiler-generated ASM)? Also happy to run more benchmarks with newer version of gcc and clang.

real-or-random commented at 3:43 pm on November 23, 2023: contributor

Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

theStack commented at 3:54 pm on November 23, 2023: contributor

Thanks, just a quick remark: Can you also check the sign functions to see if the difference in scalar performance actually makes a measurable difference for signing?

Sure, good point. I’ve re-ran my benchmarks (this time with the “sign” parameter added to the bench call) and updated the results above accordingly. Seems that the sign operations for both ECDSA and Schnorr are also consistently faster on my machine without ASM on GCC 10.5.0.

theStack cross-referenced this on Nov 23, 2023 from issue bench: add --help option to bench_internal by theStack

real-or-random referenced this in commit 2f0762fa8f on Nov 24, 2023

real-or-random cross-referenced this on Nov 24, 2023 from issue field: Remove x86_64 asm by real-or-random

real-or-random commented at 7:52 am on November 24, 2023: contributor

nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification…

real-or-random referenced this in commit c1b4966410 on Nov 24, 2023

theStack commented at 7:59 pm on November 24, 2023: contributor

nit: I think you wanted to run the benchmark for field_sqr (instead of field_sqrt) because this is what we have assembly for. But no need to re-run from my side. sqr is almost like mul, and anyway, what counts in the end is verification…

Ah, that was indeed unintended. Fixed that for the two other benchmarks I ran at the course of reviewing #1446 (https://github.com/bitcoin-core/secp256k1/pull/1446#pullrequestreview-1748386710).

real-or-random commented at 12:20 pm on December 4, 2023: contributor

We have removed the field assembly in https://github.com/bitcoin-core/secp256k1/pull/1446

real-or-random closed this on Dec 4, 2023

Consider removing x86_64 field assembly #726