I noticed that turning off x86_64 assembly on my laptop actually speeds up ecdsa_verify. The internal benchmarks show that with --without-asm, scalar operations are slower but field operations are faster.
To investigate this, I created a branch that includes the configurable benchmark iterations from #722 and a test_matrix.sh script, and that allows turning on the field assembly individually (https://github.com/jonasnick/secp256k1/tree/eval-asm).
Here are the results with gcc 9.3.0 (I got similar results with clang 9.0.1):
SECP256K1_BENCH_ITERS=200000

bench config CFLAGS=-DUSE_ASM_X86_64_FIELD ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0332us / max 0.0337us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0345us
field_sqr: min 0.0165us / avg 0.0165us / max 0.0167us
field_mul: min 0.0204us / avg 0.0205us / max 0.0209us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 56.9us / avg 56.9us / max 56.9us

bench config CFLAGS= ./configure --disable-openssl-tests --without-asm
scalar_sqr: min 0.0375us / avg 0.0376us / max 0.0383us
scalar_mul: min 0.0362us / avg 0.0366us / max 0.0396us
field_sqr: min 0.0152us / avg 0.0152us / max 0.0152us
field_mul: min 0.0177us / avg 0.0178us / max 0.0178us
ecdsa_sign: min 41.8us / avg 41.8us / max 41.9us
ecdsa_verify: min 54.6us / avg 54.7us / max 54.7us

bench config CFLAGS= ./configure --disable-openssl-tests --with-asm=x86_64
scalar_sqr: min 0.0331us / avg 0.0331us / max 0.0333us
scalar_mul: min 0.0342us / avg 0.0343us / max 0.0347us
field_sqr: min 0.0152us / avg 0.0153us / max 0.0154us
field_mul: min 0.0178us / avg 0.0178us / max 0.0180us
ecdsa_sign: min 40.3us / avg 40.3us / max 40.4us
ecdsa_verify: min 53.2us / avg 53.2us / max 53.2us
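For readers unfamiliar with the output format, here is a minimal sketch of how min/avg/max per-call figures like these are commonly produced: time several batches of N calls, divide each batch time by N, and report the spread. The environment variable name is taken from the commands above; the functions bench_iters and bench are hypothetical and are not the library's benchmark code.

```python
import os
import time


def bench_iters(default=20000):
    """Iteration count from SECP256K1_BENCH_ITERS, else `default` (hypothetical helper)."""
    value = os.environ.get("SECP256K1_BENCH_ITERS", "")
    try:
        n = int(value)
    except ValueError:
        return default
    return n if n > 0 else default


def bench(name, op, iters, runs=10):
    """Time `runs` batches of `iters` calls; return (min, avg, max) in us per call."""
    per_call = []
    for _ in range(runs):
        start = time.perf_counter()
        for _ in range(iters):
            op()
        per_call.append((time.perf_counter() - start) * 1e6 / iters)
    lo, avg, hi = min(per_call), sum(per_call) / runs, max(per_call)
    print(f"{name}: min {lo:.3g}us / avg {avg:.3g}us / max {hi:.3g}us")
    return lo, avg, hi


if __name__ == "__main__":
    # Dummy workload standing in for a real primitive such as field_mul.
    bench("dummy_op", lambda: pow(12345, 65537, 2**61 - 1), bench_iters())
```

Since only per-call averages of each batch are kept, effects that depend on surrounding code (inlining, register pressure, cache state) can differ between a tight micro-benchmark loop and a full operation like ecdsa_verify.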
Note the 6.5% ecdsa_verify speedup of the third configuration over the first. However, I don’t fully understand this:
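The speedup figure is the relative reduction in average ecdsa_verify time. A small sketch of that arithmetic, using the averages from the three runs above; the configuration labels are my reading of the configure lines, not names used by the build system.

```python
# Average ecdsa_verify times (us) copied from the three runs above.
# Labels are descriptive guesses based on the configure flags.
VERIFY_AVG_US = {
    "field+scalar asm": 56.9,
    "no asm": 54.7,
    "scalar asm only": 53.2,
}


def speedup_pct(baseline, candidate):
    """Percent reduction in time of `candidate` relative to `baseline`."""
    return (baseline - candidate) / baseline * 100


base = VERIFY_AVG_US["field+scalar asm"]
for name, t in VERIFY_AVG_US.items():
    print(f"{name}: {t}us ({speedup_pct(base, t):+.1f}% vs. field+scalar asm)")
```

With these numbers, dropping only the field assembly saves about 6.5%, while dropping all assembly saves about 3.9%, because the scalar assembly still helps.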
- There’s assembly for field_sqr and field_mul. If we remove it, both functions are faster, but some other internal functions are slower. For example:
SECP256K1_BENCH_ITERS=200000
group_add_affine: min 0.257us / avg 0.257us / max 0.259us
vs.
group_add_affine: min 0.263us / avg 0.263us / max 0.264us
This could just be an artifact of micro-benchmarking, and I have not tested this with #667.
- Removing the field arithmetic assembly also makes ECDSA verification slower if the endomorphism is enabled:
SECP256K1_BENCH_ITERS=200000
ecdsa_verify: min 41.1us / avg 41.1us / max 41.1us
vs.
ecdsa_verify: min 41.5us / avg 41.6us / max 41.6us
Note that without the field arithmetic assembly, the 64-bit field arithmetic requires __int128 support (the alternative is field=32bit, with a 40% verification slowdown). I did not check which compilers support this (MSVC?). We should also try this with older compilers.
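To illustrate why __int128 matters here: the 64-bit field code multiplies 64-bit limbs into 128-bit intermediates. On a compiler without a 128-bit integer type, each such product would have to be assembled from four 32x32->64 partial products. The sketch below models that schoolbook decomposition in Python and checks it against Python's arbitrary-precision integers; mul_64x64_128 is a hypothetical helper, not code from the library.

```python
MASK32 = (1 << 32) - 1


def mul_64x64_128(a, b):
    """Return (hi, lo) 64-bit words of a*b using only 32x32->64 products."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    # Each partial product fits in 64 bits.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Middle column: carry out of the low word plus the low halves of the
    # cross terms. It is below 3 * 2^32, so it fits in 64 bits.
    mid = (ll >> 32) + (lh & MASK32) + (hl & MASK32)
    lo = ((mid & MASK32) << 32) | (ll & MASK32)
    # a*b < 2^128, so this sum is below 2^64 and cannot overflow a 64-bit word.
    hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
    return hi, lo
```

Doing this in C costs several extra multiplications, shifts, and additions per limb product, which is one reason a 64-bit build without __int128 or assembly is not attractive.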