Avoid overly-wide multiplications in 5x52 field mul/sqr #810

pull peterdettman wants to merge 1 commits into bitcoin-core:master from peterdettman:cleanup_5x52_mul changing 1 files +22 −23
  1. peterdettman commented at 8:24 am on September 10, 2020: contributor
    Speeds up bench_ecdh, bench_sign, bench_verify relative to master by 5+% at -O3, haswell.
  2. Avoid overly-wide multiplications b53e0cd61f
  3. sipa commented at 11:18 pm on September 10, 2020: contributor

    As I understand it, this does two things:

    • Adds some casts to inform the compiler that a 64x64->128 multiplication rather than a 128x64->128 multiplication is being done.
    • Changes how some 128-bit values are split across variables, so that there are more shifts of exactly 64 bits (which translate to no-ops).

    Do you ever run out of ideas for optimization? ;)

  4. sipa commented at 11:32 pm on September 10, 2020: contributor
    I measure a 5.9% speedup for ecdsa_verify on AMD Ryzen 2950X with clock fixed at 2.2 GHz (Linux x86_64, GCC 9.3, -O2, gmp enabled, endomorphism enabled, asm disabled). Before this PR the asm code was tied with the no-asm code; after, the no-asm code clearly wins (this isn’t surprising, as the asm code implements the old algorithm, but still).
  5. sipa commented at 0:43 am on September 11, 2020: contributor

    ACK b53e0cd61fce0bcef178f317537c91efc9afd04d

    Verified the new equation and bounds by hand, ran 10000 iterations of the unit tests, and 1000 iterations of the exhaustive tests (with randomization from #808) on it.

  6. sipa commented at 3:10 am on September 11, 2020: contributor

    I made the same changes to asm code in https://github.com/sipa/secp256k1/commits/202009_cleanup_5x52_mul, but it’s only a ~2% speedup for the with-asm code (so the no-asm code is still faster) indicating that this code change enables more shuffling around than I could directly apply.

    We could just take the GCC 9.3 generated code, and turn it into asm source code, or we could just drop the asm code. @gmaxwell also suggests to me that we could replace the asm code with a faster BMI2-enabled version, which wouldn’t work on every x86_64 CPU. EDIT: compiling with -mbmi2 slows it down, though perhaps a well-written handrolled BMI2 version can be faster.

  7. elichai commented at 12:20 pm on September 11, 2020: contributor

    Cool trick!

    On my machine (i9-9980HK @ 2.4GHz) with: ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no Before:

     0$ ./bench_verify 
     1ecdsa_verify: min 95.1us / avg 95.2us / max 95.4us
     2$ ./bench_verify 
     3ecdsa_verify: min 95.2us / avg 96.3us / max 97.8us
     4$ ./bench_sign 
     5ecdsa_sign: min 68.7us / avg 69.1us / max 71.1us
     6$ ./bench_sign 
     7ecdsa_sign: min 68.7us / avg 68.8us / max 68.9us
     8$ ./bench_ecdh 
     9ecdh: min 108us / avg 109us / max 109us
    10$ ./bench_ecdh 
    11ecdh: min 108us / avg 109us / max 110us
    

    After:

     0$ ./bench_verify 
     1ecdsa_verify: min 93.1us / avg 93.4us / max 94.4us
     2$ ./bench_verify 
     3ecdsa_verify: min 93.1us / avg 93.2us / max 93.3us
     4$ ./bench_sign 
     5ecdsa_sign: min 67.7us / avg 68.1us / max 69.8us
     6$ ./bench_sign 
     7ecdsa_sign: min 68.1us / avg 69.0us / max 72.2us
     8$ ./bench_ecdh 
     9ecdh: min 106us / avg 107us / max 109us
    10$ ./bench_ecdh 
    11ecdh: min 106us / avg 106us / max 108us
    

    Seems to be ~2% improvement here.

    But with CFLAGS=-O3 ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=no Before:

     0$ ./bench_verify 
     1ecdsa_verify: min 90.2us / avg 90.6us / max 92.0us
     2$ ./bench_verify 
     3ecdsa_verify: min 90.5us / avg 90.7us / max 91.0us
     4$ ./bench_sign 
     5ecdsa_sign: min 67.5us / avg 68.0us / max 69.3us
     6$ ./bench_sign 
     7ecdsa_sign: min 67.4us / avg 67.6us / max 67.7us
     8$ ./bench_ecdh 
     9ecdh: min 99.7us / avg 102us / max 108us
    10$ ./bench_ecdh 
    11ecdh: min 99.5us / avg 99.9us / max 101us
    

    After:

     0$ ./bench_verify 
     1ecdsa_verify: min 88.7us / avg 89.9us / max 91.8us
     2$ ./bench_verify 
     3ecdsa_verify: min 88.9us / avg 89.6us / max 90.8us
     4$ ./bench_sign 
     5ecdsa_sign: min 66.8us / avg 66.9us / max 67.1us
     6$ ./bench_sign 
     7ecdsa_sign: min 66.7us / avg 67.0us / max 68.4us
     8$ ./bench_ecdh 
     9ecdh: min 96.1us / avg 97.1us / max 98.9us
    10$ ./bench_ecdh 
    11ecdh: min 95.9us / avg 96.2us / max 97.7us
    

    Seems to be ~2-4% improvement here on -O3.

  8. sipa commented at 6:34 pm on September 12, 2020: contributor
    @elichai Feel like benchmarking the asm code too (with my branch linked above)?
  9. elichai commented at 7:20 pm on September 12, 2020: contributor

    @elichai Feel like benchmarking the asm code too (with my branch linked above)?

    Here: I verified that I still reproduce my past result(which I do), and then tested current master with ./configure --enable-experimental --enable-module-ecdh --enable-module-recovery --disable-openssl-tests --with-asm=yes

     0$ ./bench_verify
     1ecdsa_verify: min 92.0us / avg 92.3us / max 93.6us
     2$ ./bench_verify
     3ecdsa_verify: min 92.1us / avg 92.2us / max 92.5us
     4$ ./bench_sign
     5ecdsa_sign: min 66.0us / avg 66.8us / max 67.9us
     6$ ./bench_sign
     7ecdsa_sign: min 66.0us / avg 66.4us / max 67.3us
     8$ ./bench_ecdh
     9ecdh: min 104us / avg 105us / max 107us
    10$ ./bench_ecdh
    11ecdh: min 104us / avg 105us / max 106us
    

    on @sipa’s branch:

     0$ ./bench_verify
     1ecdsa_verify: min 90.8us / avg 91.1us / max 91.4us
     2$ ./bench_verify
     3ecdsa_verify: min 90.8us / avg 91.3us / max 92.9us
     4$ ./bench_sign
     5ecdsa_sign: min 65.4us / avg 65.8us / max 66.9us
     6$ ./bench_sign
     7ecdsa_sign: min 65.5us / avg 65.9us / max 66.6us
     8$ ./bench_ecdh
     9ecdh: min 103us / avg 103us / max 105us
    10$ ./bench_ecdh
    11ecdh: min 103us / avg 103us / max 104us
    

    master with -O3:

     0$ ./bench_verify
     1ecdsa_verify: min 93.1us / avg 93.2us / max 93.2us
     2$ ./bench_verify
     3ecdsa_verify: min 93.0us / avg 93.5us / max 94.9us
     4$ ./bench_sign
     5ecdsa_sign: min 65.3us / avg 65.8us / max 67.2us
     6$ ./bench_sign
     7ecdsa_sign: min 65.3us / avg 65.8us / max 67.0us
     8$ ./bench_ecdh
     9ecdh: min 104us / avg 104us / max 105us
    10$ ./bench_ecdh
    11ecdh: min 104us / avg 105us / max 106us
    

    @sipa’s branch with -O3:

     0$ ./bench_verify
     1ecdsa_verify: min 92.2us / avg 92.4us / max 94.2us
     2$ ./bench_verify
     3ecdsa_verify: min 92.1us / avg 92.2us / max 92.2us
     4$ ./bench_sign
     5ecdsa_sign: min 65.0us / avg 65.3us / max 65.9us
     6$ ./bench_sign
     7ecdsa_sign: min 65.1us / avg 65.6us / max 66.5us
     8$ ./bench_ecdh
     9ecdh: min 102us / avg 102us / max 103us
    10$ ./bench_ecdh
    11ecdh: min 102us / avg 103us / max 104us
    

    I don’t get how regular -O3 gives the best result in everything except signing

  10. elichai commented at 7:29 pm on September 12, 2020: contributor

    Using this: #726 (comment) it seems like -O3 and -O2 still generates the exact same asm routine for field_5x52_int128_impl.

    (this PR and master do generate different asm’s)

  11. elichai commented at 8:14 pm on September 12, 2020: contributor

    I went over all the asm differences between -O2 and -O3 when compiling filed_5x52_impl.h with int128 in gcc and they differ only in these functions:

    0secp256k1_fe_normalize_var
    1secp256k1_fe_clear
    2secp256k1_fe_cmp_var
    3secp256k1_fe_add
    4secp256k1_fe_cmov
    5secp256k1_fe_storage_cmov
    

    the main interesting thing is secp256k1_fe_add which -O3 seems to be optimizing with SSE instructions: https://godbolt.org/z/f7xa35

    sadly trying to compile and test with -mno-sse breaks gcc

  12. sipa commented at 8:44 pm on September 12, 2020: contributor
    FWIW, those are SSE/SSE2 instructions, which are guaranteed to be available on x86_64.
  13. peterdettman commented at 5:03 pm on September 13, 2020: contributor
    @elichai FYI my results were with the endomorphism enabled, which may be skewing the relative improvement percentage compared to having it off. From my testing it appears this tweak is improving fe_mul more than fe_sqr, which is somewhat counter-intuitive since the same operations were saved for each and fe_sqr has less overall.
  14. sipa commented at 6:54 pm on September 13, 2020: contributor
    FWIW, my benchmarks were also with endomorphism enabled.
  15. gmaxwell commented at 9:34 pm on September 13, 2020: contributor
    Pretty much all benchmarking now should be endomorphism enabled unless there is a specific reason not to test with it. Endomorphism should become default on (or even on and not configurable) within a month.
  16. real-or-random requested review from real-or-random on Oct 6, 2021
  17. real-or-random approved
  18. real-or-random commented at 7:00 am on October 16, 2021: contributor
    ACK b53e0cd61fce0bcef178f317537c91efc9afd04d I’ve inspected the diff and run the tests without asm for a CPU day
  19. real-or-random merged this on Oct 17, 2021
  20. real-or-random closed this on Oct 17, 2021

  21. sipa cross-referenced this on Oct 17, 2021 from issue WIP WIP WIP 5x64 field representation by sipa
  22. sipa commented at 1:46 am on October 18, 2021: contributor

    FWIW, on a 64-bit ARM system (developerbox, Cortex-A53):

    • before this PR: ecdsa_verify: min 481us / avg 481us / max 481us
    • after this PR: ecdsa_verify: min 470us / avg 470us / max 470us
  23. fanquake referenced this in commit 8f5cd5e893 on Oct 20, 2021
  24. sipa referenced this in commit f727914d7e on Oct 28, 2021
  25. sipa cross-referenced this on Oct 28, 2021 from issue Update libsecp256k1 subtree to current master by sipa
  26. sipa referenced this in commit 440f7ec80e on Oct 31, 2021
  27. sipa referenced this in commit d057eae556 on Dec 2, 2021
  28. fanquake referenced this in commit c4a1e09a8c on Dec 3, 2021
  29. sipa referenced this in commit 86dbc4d075 on Dec 15, 2021
  30. fanquake referenced this in commit c06cda3e48 on Dec 18, 2021
  31. jonasnick cross-referenced this on Jan 2, 2022 from issue Sync Upstream by jonasnick
  32. real-or-random referenced this in commit 21e2d65b79 on Jan 5, 2022
  33. gwillen referenced this in commit 35d6112a72 on May 25, 2022
  34. janus referenced this in commit 879a9a27b9 on Jul 10, 2022
  35. patricklodder referenced this in commit 21badcf9d2 on Jul 25, 2022
  36. patricklodder referenced this in commit 03002a9013 on Jul 28, 2022
  37. backpacker69 referenced this in commit 77186f4a04 on Jan 18, 2023
  38. sipa cross-referenced this on Apr 1, 2023 from issue Possible optimization for secp256k1 by sipa
  39. real-or-random cross-referenced this on Apr 6, 2023 from issue fiat-crypto + CryptOpt tracking issue by real-or-random
  40. str4d referenced this in commit 6de4698bf9 on Apr 21, 2023
  41. vmta referenced this in commit e1120c94a1 on Jun 4, 2023
  42. vmta referenced this in commit 8f03457eed on Jul 1, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-23 23:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me