Avoid overly-wide multiplications in 5x52 field mul/sqr

peterdettman commented at 8:24 am on September 10, 2020: contributor

Speeds up bench_ecdh, bench_sign, bench_verify relative to master by 5+% at -O3, haswell.

Avoid overly-wide multiplications b53e0cd61f

sipa commented at 11:18 pm on September 10, 2020: contributor

As I understand it, this does two things:

Adds some casts to inform the compiler that a 64x64->128 multiplication rather than a 128x64->128 multiplication is being done.
Changes how some 128-bit values are split across variables, so that there are more shifts of exactly 64 bits (which translate to no-ops).

Do you ever run out of ideas for optimization? ;)

sipa commented at 11:32 pm on September 10, 2020: contributor

I measure a 5.9% speedup for ecdsa_verify on AMD Ryzen 2950X with clock fixed at 2.2 GHz (Linux x86_64, GCC 9.3, -O2, gmp enabled, endomorphism enabled, asm disabled). Before this PR the asm code was tied with the no-asm code; after, the no-asm code clearly wins (this isn’t surprising, as the asm code implements the old algorithm, but still).

sipa commented at 0:43 am on September 11, 2020: contributor

ACK b53e0cd61fce0bcef178f317537c91efc9afd04d

Verified the new equation and bounds by hand, ran 10000 iterations of the unit tests, and 1000 iterations of the exhaustive tests (with randomization from #808) on it.

sipa commented at 3:10 am on September 11, 2020: contributor

I made the same changes to asm code in https://github.com/sipa/secp256k1/commits/202009_cleanup_5x52_mul, but it’s only a ~2% speedup for the with-asm code (so the no-asm code is still faster) indicating that this code change enables more shuffling around than I could directly apply.

We could just take the GCC 9.3 generated code, and turn it into asm source code, or we could just drop the asm code. @gmaxwell also suggests to me that we could replace the asm code with a faster BMI2-enabled version, which wouldn’t work on every x86_64 CPU. EDIT: compiling with -mbmi2 slows it down, though perhaps a well-written handrolled BMI2 version can be faster.

elichai commented at 12:20 pm on September 11, 2020: contributor

Cool trick!

On my machine (i9-9980HK @ 2.4GHz) with: ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no Before:

 0$ ./bench_verify 
 1ecdsa_verify: min 95.1us / avg 95.2us / max 95.4us
 2$ ./bench_verify 
 3ecdsa_verify: min 95.2us / avg 96.3us / max 97.8us
 4$ ./bench_sign 
 5ecdsa_sign: min 68.7us / avg 69.1us / max 71.1us
 6$ ./bench_sign 
 7ecdsa_sign: min 68.7us / avg 68.8us / max 68.9us
 8$ ./bench_ecdh 
 9ecdh: min 108us / avg 109us / max 109us
10$ ./bench_ecdh 
11ecdh: min 108us / avg 109us / max 110us

After:

 0$ ./bench_verify 
 1ecdsa_verify: min 93.1us / avg 93.4us / max 94.4us
 2$ ./bench_verify 
 3ecdsa_verify: min 93.1us / avg 93.2us / max 93.3us
 4$ ./bench_sign 
 5ecdsa_sign: min 67.7us / avg 68.1us / max 69.8us
 6$ ./bench_sign 
 7ecdsa_sign: min 68.1us / avg 69.0us / max 72.2us
 8$ ./bench_ecdh 
 9ecdh: min 106us / avg 107us / max 109us
10$ ./bench_ecdh 
11ecdh: min 106us / avg 106us / max 108us

Seems to be ~2% improvement here.

But with CFLAGS=-O3 ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no Before:

 0$ ./bench_verify 
 1ecdsa_verify: min 90.2us / avg 90.6us / max 92.0us
 2$ ./bench_verify 
 3ecdsa_verify: min 90.5us / avg 90.7us / max 91.0us
 4$ ./bench_sign 
 5ecdsa_sign: min 67.5us / avg 68.0us / max 69.3us
 6$ ./bench_sign 
 7ecdsa_sign: min 67.4us / avg 67.6us / max 67.7us
 8$ ./bench_ecdh 
 9ecdh: min 99.7us / avg 102us / max 108us
10$ ./bench_ecdh 
11ecdh: min 99.5us / avg 99.9us / max 101us

After:

 0$ ./bench_verify 
 1ecdsa_verify: min 88.7us / avg 89.9us / max 91.8us
 2$ ./bench_verify 
 3ecdsa_verify: min 88.9us / avg 89.6us / max 90.8us
 4$ ./bench_sign 
 5ecdsa_sign: min 66.8us / avg 66.9us / max 67.1us
 6$ ./bench_sign 
 7ecdsa_sign: min 66.7us / avg 67.0us / max 68.4us
 8$ ./bench_ecdh 
 9ecdh: min 96.1us / avg 97.1us / max 98.9us
10$ ./bench_ecdh 
11ecdh: min 95.9us / avg 96.2us / max 97.7us

Seems to be ~2-4% improvement here on -O3.

sipa commented at 6:34 pm on September 12, 2020: contributor

@elichai Feel like benchmarking the asm code too (with my branch linked above)?

elichai commented at 7:20 pm on September 12, 2020: contributor

@elichai Feel like benchmarking the asm code too (with my branch linked above)?

Here: I verified that I still reproduce my past result(which I do), and then tested current master with ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=yes

 0$ ./bench_verify
 1ecdsa_verify: min 92.0us / avg 92.3us / max 93.6us
 2$ ./bench_verify
 3ecdsa_verify: min 92.1us / avg 92.2us / max 92.5us
 4$ ./bench_sign
 5ecdsa_sign: min 66.0us / avg 66.8us / max 67.9us
 6$ ./bench_sign
 7ecdsa_sign: min 66.0us / avg 66.4us / max 67.3us
 8$ ./bench_ecdh
 9ecdh: min 104us / avg 105us / max 107us
10$ ./bench_ecdh
11ecdh: min 104us / avg 105us / max 106us

on @sipa’s branch:

 0$ ./bench_verify
 1ecdsa_verify: min 90.8us / avg 91.1us / max 91.4us
 2$ ./bench_verify
 3ecdsa_verify: min 90.8us / avg 91.3us / max 92.9us
 4$ ./bench_sign
 5ecdsa_sign: min 65.4us / avg 65.8us / max 66.9us
 6$ ./bench_sign
 7ecdsa_sign: min 65.5us / avg 65.9us / max 66.6us
 8$ ./bench_ecdh
 9ecdh: min 103us / avg 103us / max 105us
10$ ./bench_ecdh
11ecdh: min 103us / avg 103us / max 104us

master with -O3:

 0$ ./bench_verify
 1ecdsa_verify: min 93.1us / avg 93.2us / max 93.2us
 2$ ./bench_verify
 3ecdsa_verify: min 93.0us / avg 93.5us / max 94.9us
 4$ ./bench_sign
 5ecdsa_sign: min 65.3us / avg 65.8us / max 67.2us
 6$ ./bench_sign
 7ecdsa_sign: min 65.3us / avg 65.8us / max 67.0us
 8$ ./bench_ecdh
 9ecdh: min 104us / avg 104us / max 105us
10$ ./bench_ecdh
11ecdh: min 104us / avg 105us / max 106us

@sipa’s branch with -O3:

 0$ ./bench_verify
 1ecdsa_verify: min 92.2us / avg 92.4us / max 94.2us
 2$ ./bench_verify
 3ecdsa_verify: min 92.1us / avg 92.2us / max 92.2us
 4$ ./bench_sign
 5ecdsa_sign: min 65.0us / avg 65.3us / max 65.9us
 6$ ./bench_sign
 7ecdsa_sign: min 65.1us / avg 65.6us / max 66.5us
 8$ ./bench_ecdh
 9ecdh: min 102us / avg 102us / max 103us
10$ ./bench_ecdh
11ecdh: min 102us / avg 103us / max 104us

I don’t get how regular -O3 gives the best result in everything except signing

elichai commented at 7:29 pm on September 12, 2020: contributor

Using this: #726 (comment) it seems like -O3 and -O2 still generates the exact same asm routine for field_5x52_int128_impl.

(this PR and master do generate different asm’s)

elichai commented at 8:14 pm on September 12, 2020: contributor

I went over all the asm differences between -O2 and -O3 when compiling filed_5x52_impl.h with int128 in gcc and they differ only in these functions:

0secp256k1_fe_normalize_var
1secp256k1_fe_clear
2secp256k1_fe_cmp_var
3secp256k1_fe_add
4secp256k1_fe_cmov
5secp256k1_fe_storage_cmov

the main interesting thing is secp256k1_fe_add which -O3 seems to be optimizing with SSE instructions: https://godbolt.org/z/f7xa35

sadly trying to compile and test with -mno-sse breaks gcc

sipa commented at 8:44 pm on September 12, 2020: contributor

FWIW, those are SSE/SSE2 instructions, which are guaranteed to be available on x86_64.

peterdettman commented at 5:03 pm on September 13, 2020: contributor

@elichai FYI my results were with the endomorphism enabled, which may be skewing the relative improvement percentage compared to having it off. From my testing it appears this tweak is improving fe_mul more than fe_sqr, which is somewhat counter-intuitive since the same operations were saved for each and fe_sqr has less overall.

sipa commented at 6:54 pm on September 13, 2020: contributor

FWIW, my benchmarks were also with endomorphism enabled.

gmaxwell commented at 9:34 pm on September 13, 2020: contributor

Pretty much all benchmarking now should be endomorphism enabled unless there is a specific reason not to test with it. Endomorphism should become default on (or even on and not configurable) within a month.

real-or-random requested review from real-or-random on Oct 6, 2021

real-or-random approved

real-or-random commented at 7:00 am on October 16, 2021: contributor

ACK b53e0cd61fce0bcef178f317537c91efc9afd04d I’ve inspected the diff and run the tests without asm for a CPU day

real-or-random merged this on Oct 17, 2021

real-or-random closed this on Oct 17, 2021

sipa cross-referenced this on Oct 17, 2021 from issue WIP WIP WIP 5x64 field representation by sipa

sipa commented at 1:46 am on October 18, 2021: contributor

FWIW, on a 64-bit ARM system (developerbox, Cortex-A53):

before this PR: ecdsa_verify: min 481us / avg 481us / max 481us
after this PR: ecdsa_verify: min 470us / avg 470us / max 470us

fanquake referenced this in commit 8f5cd5e893 on Oct 20, 2021

sipa referenced this in commit f727914d7e on Oct 28, 2021

sipa cross-referenced this on Oct 28, 2021 from issue Update libsecp256k1 subtree to current master by sipa

sipa referenced this in commit 440f7ec80e on Oct 31, 2021

sipa referenced this in commit d057eae556 on Dec 2, 2021

fanquake referenced this in commit c4a1e09a8c on Dec 3, 2021

sipa referenced this in commit 86dbc4d075 on Dec 15, 2021

fanquake referenced this in commit c06cda3e48 on Dec 18, 2021

jonasnick cross-referenced this on Jan 2, 2022 from issue Sync Upstream by jonasnick

real-or-random referenced this in commit 21e2d65b79 on Jan 5, 2022

gwillen referenced this in commit 35d6112a72 on May 25, 2022

janus referenced this in commit 879a9a27b9 on Jul 10, 2022

patricklodder referenced this in commit 21badcf9d2 on Jul 25, 2022

patricklodder referenced this in commit 03002a9013 on Jul 28, 2022

backpacker69 referenced this in commit 77186f4a04 on Jan 18, 2023

sipa cross-referenced this on Apr 1, 2023 from issue Possible optimization for secp256k1 by sipa

real-or-random cross-referenced this on Apr 6, 2023 from issue fiat-crypto + CryptOpt tracking issue by real-or-random

str4d referenced this in commit 6de4698bf9 on Apr 21, 2023

vmta referenced this in commit e1120c94a1 on Jun 4, 2023

vmta referenced this in commit 8f03457eed on Jul 1, 2023

Avoid overly-wide multiplications in 5x52 field mul/sqr #810