Avoid overly-wide multiplications in 5x52 field mul/sqr #810
pull peterdettman wants to merge 1 commits into bitcoin-core:master from peterdettman:cleanup_5x52_mul changing 1 files +22 −23-
peterdettman commented at 8:24 am on September 10, 2020: contributorSpeeds up bench_ecdh, bench_sign, bench_verify relative to master by 5+% at -O3, haswell.
-
Avoid overly-wide multiplications b53e0cd61f
-
sipa commented at 11:18 pm on September 10, 2020: contributor
As I understand it, this does two things:
- Adds some casts to inform the compiler that a 64x64->128 multiplication rather than a 128x64->128 multiplication is being done.
- Changes how some 128-bit values are split across variables, so that there are more shifts of exactly 64 bits (which translate to no-ops).
Do you ever run out of ideas for optimization? ;)
-
sipa commented at 11:32 pm on September 10, 2020: contributorI measure a 5.9% speedup for ecdsa_verify on AMD Ryzen 2950X with clock fixed at 2.2 GHz (Linux x86_64, GCC 9.3, -O2, gmp enabled, endomorphism enabled, asm disabled). Before this PR the asm code was tied with the no-asm code; after, the no-asm code clearly wins (this isn’t surprising, as the asm code implements the old algorithm, but still).
-
sipa commented at 3:10 am on September 11, 2020: contributor
I made the same changes to asm code in https://github.com/sipa/secp256k1/commits/202009_cleanup_5x52_mul, but it’s only a ~2% speedup for the with-asm code (so the no-asm code is still faster) indicating that this code change enables more shuffling around than I could directly apply.
We could just take the GCC 9.3 generated code, and turn it into asm source code, or we could just drop the asm code. @gmaxwell also suggests to me that we could replace the asm code with a faster BMI2-enabled version, which wouldn’t work on every x86_64 CPU. EDIT: compiling with
-mbmi2
slows it down, though perhaps a well-written handrolled BMI2 version can be faster. -
elichai commented at 12:20 pm on September 11, 2020: contributor
Cool trick!
On my machine (i9-9980HK @ 2.4GHz) with:
./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no
Before:0$ ./bench_verify 1ecdsa_verify: min 95.1us / avg 95.2us / max 95.4us 2$ ./bench_verify 3ecdsa_verify: min 95.2us / avg 96.3us / max 97.8us 4$ ./bench_sign 5ecdsa_sign: min 68.7us / avg 69.1us / max 71.1us 6$ ./bench_sign 7ecdsa_sign: min 68.7us / avg 68.8us / max 68.9us 8$ ./bench_ecdh 9ecdh: min 108us / avg 109us / max 109us 10$ ./bench_ecdh 11ecdh: min 108us / avg 109us / max 110us
After:
0$ ./bench_verify 1ecdsa_verify: min 93.1us / avg 93.4us / max 94.4us 2$ ./bench_verify 3ecdsa_verify: min 93.1us / avg 93.2us / max 93.3us 4$ ./bench_sign 5ecdsa_sign: min 67.7us / avg 68.1us / max 69.8us 6$ ./bench_sign 7ecdsa_sign: min 68.1us / avg 69.0us / max 72.2us 8$ ./bench_ecdh 9ecdh: min 106us / avg 107us / max 109us 10$ ./bench_ecdh 11ecdh: min 106us / avg 106us / max 108us
Seems to be ~2% improvement here.
But with
CFLAGS=-O3 ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no
Before:0$ ./bench_verify 1ecdsa_verify: min 90.2us / avg 90.6us / max 92.0us 2$ ./bench_verify 3ecdsa_verify: min 90.5us / avg 90.7us / max 91.0us 4$ ./bench_sign 5ecdsa_sign: min 67.5us / avg 68.0us / max 69.3us 6$ ./bench_sign 7ecdsa_sign: min 67.4us / avg 67.6us / max 67.7us 8$ ./bench_ecdh 9ecdh: min 99.7us / avg 102us / max 108us 10$ ./bench_ecdh 11ecdh: min 99.5us / avg 99.9us / max 101us
After:
0$ ./bench_verify 1ecdsa_verify: min 88.7us / avg 89.9us / max 91.8us 2$ ./bench_verify 3ecdsa_verify: min 88.9us / avg 89.6us / max 90.8us 4$ ./bench_sign 5ecdsa_sign: min 66.8us / avg 66.9us / max 67.1us 6$ ./bench_sign 7ecdsa_sign: min 66.7us / avg 67.0us / max 68.4us 8$ ./bench_ecdh 9ecdh: min 96.1us / avg 97.1us / max 98.9us 10$ ./bench_ecdh 11ecdh: min 95.9us / avg 96.2us / max 97.7us
Seems to be ~2-4% improvement here on
-O3
. -
elichai commented at 7:20 pm on September 12, 2020: contributor
@elichai Feel like benchmarking the asm code too (with my branch linked above)?
Here: I verified that I still reproduce my past result(which I do), and then tested current master with
./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=yes
0$ ./bench_verify 1ecdsa_verify: min 92.0us / avg 92.3us / max 93.6us 2$ ./bench_verify 3ecdsa_verify: min 92.1us / avg 92.2us / max 92.5us 4$ ./bench_sign 5ecdsa_sign: min 66.0us / avg 66.8us / max 67.9us 6$ ./bench_sign 7ecdsa_sign: min 66.0us / avg 66.4us / max 67.3us 8$ ./bench_ecdh 9ecdh: min 104us / avg 105us / max 107us 10$ ./bench_ecdh 11ecdh: min 104us / avg 105us / max 106us
on @sipa’s branch:
0$ ./bench_verify 1ecdsa_verify: min 90.8us / avg 91.1us / max 91.4us 2$ ./bench_verify 3ecdsa_verify: min 90.8us / avg 91.3us / max 92.9us 4$ ./bench_sign 5ecdsa_sign: min 65.4us / avg 65.8us / max 66.9us 6$ ./bench_sign 7ecdsa_sign: min 65.5us / avg 65.9us / max 66.6us 8$ ./bench_ecdh 9ecdh: min 103us / avg 103us / max 105us 10$ ./bench_ecdh 11ecdh: min 103us / avg 103us / max 104us
master with
-O3
:0$ ./bench_verify 1ecdsa_verify: min 93.1us / avg 93.2us / max 93.2us 2$ ./bench_verify 3ecdsa_verify: min 93.0us / avg 93.5us / max 94.9us 4$ ./bench_sign 5ecdsa_sign: min 65.3us / avg 65.8us / max 67.2us 6$ ./bench_sign 7ecdsa_sign: min 65.3us / avg 65.8us / max 67.0us 8$ ./bench_ecdh 9ecdh: min 104us / avg 104us / max 105us 10$ ./bench_ecdh 11ecdh: min 104us / avg 105us / max 106us
@sipa’s branch with
-O3
:0$ ./bench_verify 1ecdsa_verify: min 92.2us / avg 92.4us / max 94.2us 2$ ./bench_verify 3ecdsa_verify: min 92.1us / avg 92.2us / max 92.2us 4$ ./bench_sign 5ecdsa_sign: min 65.0us / avg 65.3us / max 65.9us 6$ ./bench_sign 7ecdsa_sign: min 65.1us / avg 65.6us / max 66.5us 8$ ./bench_ecdh 9ecdh: min 102us / avg 102us / max 103us 10$ ./bench_ecdh 11ecdh: min 102us / avg 103us / max 104us
I don’t get how regular
-O3
gives the best result in everything except signing -
elichai commented at 7:29 pm on September 12, 2020: contributor
Using this: #726 (comment) it seems like
-O3
and-O2
still generates the exact same asm routine forfield_5x52_int128_impl
.(this PR and master do generate different asm’s)
-
elichai commented at 8:14 pm on September 12, 2020: contributor
I went over all the asm differences between
-O2
and-O3
when compilingfiled_5x52_impl.h
withint128
in gcc and they differ only in these functions:0secp256k1_fe_normalize_var 1secp256k1_fe_clear 2secp256k1_fe_cmp_var 3secp256k1_fe_add 4secp256k1_fe_cmov 5secp256k1_fe_storage_cmov
the main interesting thing is
secp256k1_fe_add
which-O3
seems to be optimizing with SSE instructions: https://godbolt.org/z/f7xa35sadly trying to compile and test with
-mno-sse
breaks gcc -
sipa commented at 8:44 pm on September 12, 2020: contributorFWIW, those are SSE/SSE2 instructions, which are guaranteed to be available on x86_64.
-
peterdettman commented at 5:03 pm on September 13, 2020: contributor@elichai FYI my results were with the endomorphism enabled, which may be skewing the relative improvement percentage compared to having it off. From my testing it appears this tweak is improving fe_mul more than fe_sqr, which is somewhat counter-intuitive since the same operations were saved for each and fe_sqr has less overall.
-
sipa commented at 6:54 pm on September 13, 2020: contributorFWIW, my benchmarks were also with endomorphism enabled.
-
gmaxwell commented at 9:34 pm on September 13, 2020: contributorPretty much all benchmarking now should be endomorphism enabled unless there is a specific reason not to test with it. Endomorphism should become default on (or even on and not configurable) within a month.
-
real-or-random requested review from real-or-random on Oct 6, 2021
-
real-or-random approved
-
real-or-random commented at 7:00 am on October 16, 2021: contributorACK b53e0cd61fce0bcef178f317537c91efc9afd04d I’ve inspected the diff and run the tests without asm for a CPU day
-
real-or-random merged this on Oct 17, 2021
-
real-or-random closed this on Oct 17, 2021
-
sipa cross-referenced this on Oct 17, 2021 from issue WIP WIP WIP 5x64 field representation by sipa
-
sipa commented at 1:46 am on October 18, 2021: contributor
FWIW, on a 64-bit ARM system (developerbox, Cortex-A53):
- before this PR:
ecdsa_verify: min 481us / avg 481us / max 481us
- after this PR:
ecdsa_verify: min 470us / avg 470us / max 470us
- before this PR:
-
fanquake referenced this in commit 8f5cd5e893 on Oct 20, 2021
-
sipa referenced this in commit f727914d7e on Oct 28, 2021
-
sipa cross-referenced this on Oct 28, 2021 from issue Update libsecp256k1 subtree to current master by sipa
-
sipa referenced this in commit 440f7ec80e on Oct 31, 2021
-
sipa referenced this in commit d057eae556 on Dec 2, 2021
-
fanquake referenced this in commit c4a1e09a8c on Dec 3, 2021
-
sipa referenced this in commit 86dbc4d075 on Dec 15, 2021
-
fanquake referenced this in commit c06cda3e48 on Dec 18, 2021
-
jonasnick cross-referenced this on Jan 2, 2022 from issue Sync Upstream by jonasnick
-
real-or-random referenced this in commit 21e2d65b79 on Jan 5, 2022
-
gwillen referenced this in commit 35d6112a72 on May 25, 2022
-
janus referenced this in commit 879a9a27b9 on Jul 10, 2022
-
patricklodder referenced this in commit 21badcf9d2 on Jul 25, 2022
-
patricklodder referenced this in commit 03002a9013 on Jul 28, 2022
-
backpacker69 referenced this in commit 77186f4a04 on Jan 18, 2023
-
sipa cross-referenced this on Apr 1, 2023 from issue Possible optimization for secp256k1 by sipa
-
real-or-random cross-referenced this on Apr 6, 2023 from issue fiat-crypto + CryptOpt tracking issue by real-or-random
-
str4d referenced this in commit 6de4698bf9 on Apr 21, 2023
-
vmta referenced this in commit e1120c94a1 on Jun 4, 2023
-
vmta referenced this in commit 8f03457eed on Jul 1, 2023
This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-23 23:15 UTC
More mirrored repositories can be found on mirror.b10c.me