field: Remove x86_64 asm #1446

pull real-or-random wants to merge 2 commits into bitcoin-core:master from real-or-random:202311-remove-fe-asm changing 6 files +13 −522
  1. real-or-random commented at 9:59 pm on November 23, 2023: contributor

    Widely available versions of GCC and Clang beat our field asm on -O2. In particular, GCC 10.5.0, which is Bitcoin Core’s current compiler for official x86_64 builds, produces code that is > 20% faster for fe_mul and > 10% faster for signature verification (see #726).

    These are the alternatives to this PR:

    We could replace our current asm with the fastest compiler output that we can find. This is potentially faster, but it has multiple drawbacks:

    • It’s more coding work because it needs detailed benchmarks (e.g., with many compiler/options).
    • It’s more review work because we need to deal with inline asm (including clobbers etc.) and there’s a lack of experts reviewers in this area.
    • It’s not unlikely that we’ll fall behind again in a few compiler versions, and then we have to deal with this again, i.e., redo the benchmarks. Given our history here, I doubt that we’ll revolve this timely.

    We could change the default of the asm build option to off. But this will also disable the scalar asm, which is still faster.

    We could split the build option into two separate options for field and scalar asm and only disable the field asm by default. But this adds complexity to the build and to the test matrix.

    My conclusion is that this PR gets the low-hanging fruit in terms of performance. It simplifies our code significantly. It’s clearly an improvement, and it’s very easy to review. Whether re-introducing better asm (whether from a compiler or from CryptOpt) is worth the hassle can be evaluated separately, and should not hold up this improvement.

    Solves #726.

  2. real-or-random added the label performance on Nov 23, 2023
  3. sipa commented at 10:02 pm on November 23, 2023: contributor
    Concept ACK (if happy CI).
  4. real-or-random force-pushed on Nov 23, 2023
  5. real-or-random force-pushed on Nov 23, 2023
  6. real-or-random renamed this:
    field: Remove x84_64 asm
    field: Remove x86_64 asm
    on Nov 23, 2023
  7. real-or-random force-pushed on Nov 23, 2023
  8. theStack commented at 1:54 am on November 24, 2023: contributor

    Concept ACK

    Two PR description / commit message nits:

    • s/10.0.5/10.5.0/ (GCC version number)
    • s/723/726/ (linked issue number)
  9. field: Remove x86_64 asm
    Widely available versions of GCC and Clang beat our field asm on -O2.
    In particular, GCC 10.5.0, which is Bitcoin Core's current compiler
    for official x86_64 builds, produces code that is > 20% faster for
    fe_mul and > 10% faster for signature verification (see #726).
    
    These are the alternatives to this PR:
    
    We could replace our current asm with the fastest compiler output
    that we can find. This is potentially faster, but it has multiple
    drawbacks:
     - It's more coding work because it needs detailed benchmarks (e.g.,
       with many compiler/options).
     - It's more review work because we need to deal with inline asm
       (including clobbers etc.) and there's a lack of experts reviewers
       in this area.
     - It's not unlikely that we'll fall behind again in a few compiler
       versions, and then we have to deal with this again, i.e., redo the
       benchmarks. Given our history here, I doubt that we'll revolve
       this timely.
    
    We could change the default of the asm build option to off. But this
    will also disable the scalar asm, which is still faster.
    
    We could split the build option into two separate options for field
    and scalar asm and only disable the field asm by default. But this
    adds complexity to the build and to the test matrix.
    
    My conclusion is that this PR gets the low-hanging fruit in terms of
    performance. It simplifies our code significantly. It's clearly an
    improvement, and it's very easy to review. Whether re-introducing
    better asm (whether from a compiler or from CryptOpt) is worth the
    hassle can be evaluated separately, and should not hold up this
    improvement.
    
    Solves #726.
    2f0762fa8f
  10. build: Don't call assembly an optimization
    because we don't know whether it's an optimization.
    f07cead0ca
  11. real-or-random force-pushed on Nov 24, 2023
  12. real-or-random commented at 7:13 am on November 24, 2023: contributor

    Two PR description / commit message nits:

    Force-pushed to fix. (I had already force-pushed multiple times yesterday to fix typos in the commit message… The commit message is by far the most complex part of this PR. :D)

  13. theStack approved
  14. theStack commented at 7:54 pm on November 24, 2023: contributor

    ACK f07cead0ca96e26356466b635ce6e7fe3834c949

    Out of curiosity, I ran the benchmarks again with gcc-11 (which is the default on Ubuntu 22.04 LTS) and clang-14. As expected, the PR branch is faster for sign/verify and the relevant field operations (mul and sqr) on both:

     0======================
     1===== gcc 11.4.0 =====
     2======================
     3
     4master (with field asm):
     5
     6$ ./configure CC=gcc-11 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
     7Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
     8
     9ecdsa_verify                  ,    32.9       ,    33.6       ,    35.7    
    10ecdsa_sign                    ,    24.8       ,    25.4       ,    28.2    
    11schnorrsig_sign               ,    18.9       ,    19.3       ,    20.7    
    12schnorrsig_verify             ,    33.3       ,    34.1       ,    36.1    
    13field_sqr                     ,     0.0165    ,     0.0185    ,     0.0214 
    14field_mul                     ,     0.0182    ,     0.0195    ,     0.0228
    15
    16pr1446 (without field asm):
    17
    18$ ./configure CC=gcc-11 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
    19Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    20
    21ecdsa_verify                  ,    28.9       ,    29.4       ,    30.1    
    22ecdsa_sign                    ,    23.0       ,    23.3       ,    23.9    
    23schnorrsig_sign               ,    17.2       ,    17.5       ,    18.1    
    24schnorrsig_verify             ,    30.0       ,    30.3       ,    31.0    
    25field_sqr                     ,     0.0130    ,     0.0167    ,     0.0475 
    26field_mul                     ,     0.0136    ,     0.0147    ,     0.0161
    27
    28
    29========================
    30===== clang 14.0.0 =====
    31========================
    32
    33master (with field asm):
    34
    35$ ./configure CC=clang-14 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
    36Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    37
    38ecdsa_verify                  ,    32.7       ,    34.7       ,    38.7    
    39ecdsa_sign                    ,    24.6       ,    25.0       ,    25.6    
    40schnorrsig_sign               ,    18.3       ,    18.4       ,    19.4    
    41schnorrsig_verify             ,    32.8       ,    33.1       ,    33.8    
    42field_sqr                     ,     0.0152    ,     0.0158    ,     0.0165 
    43field_mul                     ,     0.0175    ,     0.0181    ,     0.0191
    44
    45
    46pr1446 (without field asm):
    47
    48$ ./configure CC=clang-14 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
    49Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    50
    51ecdsa_verify                  ,    30.6       ,    31.0       ,    32.3    
    52ecdsa_sign                    ,    24.7       ,    24.9       ,    25.7    
    53schnorrsig_sign               ,    18.2       ,    18.4       ,    18.6    
    54schnorrsig_verify             ,    31.2       ,    31.6       ,    32.6    
    55field_sqr                     ,     0.0117    ,     0.0123    ,     0.0133 
    56field_mul                     ,     0.0139    ,     0.0148    ,     0.0168
    

    (Interesting that clang’s generated code for _fe_sqr performs so much better than gcc’s, being ~35% faster).

  15. theStack cross-referenced this on Nov 24, 2023 from issue Consider removing x86_64 field assembly by jonasnick
  16. real-or-random commented at 11:17 pm on November 25, 2023: contributor

    (Interesting that clang’s generated code for _fe_sqr performs so much better than gcc’s, being ~35% faster).

    Perhaps this was some glitch in the benchmark, given the rather high “max” value for gcc, and given that sqr should be faster than mul. (sqr is just a specialization of mul, so its only purpose is to be faster than mul.)

  17. real-or-random added this to the milestone 0.4.1 or 0.5.0 on Nov 29, 2023
  18. real-or-random added the label needs-changelog on Nov 29, 2023
  19. sipa commented at 1:51 pm on November 30, 2023: contributor
    utACK f07cead0ca96e26356466b635ce6e7fe3834c949
  20. jonasnick approved
  21. jonasnick commented at 6:44 pm on December 1, 2023: contributor

    ACK f07cead0ca96e26356466b635ce6e7fe3834c949

    I ran the benchmarks with gcc 10.5.0 (before noticing that this had already been done in 726) and confirmed the >10% speedup. I think speedups of that magnitude (and common/default configure settings) justify a mention in the changelog.

  22. jonasnick merged this on Dec 1, 2023
  23. jonasnick closed this on Dec 1, 2023

  24. deadalnix commented at 8:48 pm on December 1, 2023: none

    It’s not unlikely that we’ll fall behind again in a few compiler versions

    I can confirm, I’ve been working on this on the LLVM front, and there is still a lot more it can do.

  25. real-or-random commented at 9:21 pm on December 1, 2023: contributor

    It’s not unlikely that we’ll fall behind again in a few compiler versions

    I can confirm, I’ve been working on this on the LLVM front, and there is still a lot more it can do.

    Nice, I wasn’t aware! The upcoming GCC 14 will also be able to do more addc/subc with https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43a3252c42af12. (And libsecp256k1 code broke the first attempt: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112430 …)

    But indeed, there’s much more potential. The scalar asm, which relies heavily on addc, is still faster because compilers can’t figure out the carry pattern. One possible middle ground could be the use of addc intrinsics instead of asm.

    (copied this reply from https://twitter.com/real_or_random/status/1730696947197948144)

  26. russeree commented at 10:59 pm on December 1, 2023: none

    Concept Ack;

    Nit: It is critical that the output is 1:1 compatible with existing implementation because inducing a bug though the optimizations would lead to a chainsplit if on one side a sig validates and on another it doesn’t for any reason.

    Edit: Thinking more about this, I am leaning to an configuration flag for this that is off by default, letting the compiler just work magic to optimize isn’t a process that I trust completely especially for consensus critical code.

  27. real-or-random removed the label needs-changelog on Dec 6, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-30 01:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me