field: Remove x86

real-or-random commented at 9:59 pm on November 23, 2023: contributor

Widely available versions of GCC and Clang beat our field asm on -O2. In particular, GCC 10.5.0, which is Bitcoin Core’s current compiler for official x86_64 builds, produces code that is > 20% faster for fe_mul and > 10% faster for signature verification (see #726).

These are the alternatives to this PR:

We could replace our current asm with the fastest compiler output that we can find. This is potentially faster, but it has multiple drawbacks:

It’s more coding work because it needs detailed benchmarks (e.g., with many compiler/options).
It’s more review work because we need to deal with inline asm (including clobbers etc.) and there’s a lack of experts reviewers in this area.
It’s not unlikely that we’ll fall behind again in a few compiler versions, and then we have to deal with this again, i.e., redo the benchmarks. Given our history here, I doubt that we’ll revolve this timely.

We could change the default of the asm build option to off. But this will also disable the scalar asm, which is still faster.

We could split the build option into two separate options for field and scalar asm and only disable the field asm by default. But this adds complexity to the build and to the test matrix.

My conclusion is that this PR gets the low-hanging fruit in terms of performance. It simplifies our code significantly. It’s clearly an improvement, and it’s very easy to review. Whether re-introducing better asm (whether from a compiler or from CryptOpt) is worth the hassle can be evaluated separately, and should not hold up this improvement.

Solves #726.

real-or-random added the label performance on Nov 23, 2023

sipa commented at 10:02 pm on November 23, 2023: contributor

Concept ACK (if happy CI).

real-or-random force-pushed on Nov 23, 2023

real-or-random renamed this:
~~field: Remove x84_64 asm~~
field: Remove x86_64 asm
on Nov 23, 2023

real-or-random force-pushed on Nov 23, 2023

theStack commented at 1:54 am on November 24, 2023: contributor

Concept ACK

Two PR description / commit message nits:

s/10.0.5/10.5.0/ (GCC version number)
s/723/726/ (linked issue number)

field: Remove x86_64 asm

Widely available versions of GCC and Clang beat our field asm on -O2.
In particular, GCC 10.5.0, which is Bitcoin Core's current compiler
for official x86_64 builds, produces code that is > 20% faster for
fe_mul and > 10% faster for signature verification (see #726).

These are the alternatives to this PR:

We could replace our current asm with the fastest compiler output
that we can find. This is potentially faster, but it has multiple
drawbacks:
 - It's more coding work because it needs detailed benchmarks (e.g.,
   with many compiler/options).
 - It's more review work because we need to deal with inline asm
   (including clobbers etc.) and there's a lack of experts reviewers
   in this area.
 - It's not unlikely that we'll fall behind again in a few compiler
   versions, and then we have to deal with this again, i.e., redo the
   benchmarks. Given our history here, I doubt that we'll revolve
   this timely.

We could change the default of the asm build option to off. But this
will also disable the scalar asm, which is still faster.

We could split the build option into two separate options for field
and scalar asm and only disable the field asm by default. But this
adds complexity to the build and to the test matrix.

My conclusion is that this PR gets the low-hanging fruit in terms of
performance. It simplifies our code significantly. It's clearly an
improvement, and it's very easy to review. Whether re-introducing
better asm (whether from a compiler or from CryptOpt) is worth the
hassle can be evaluated separately, and should not hold up this
improvement.

Solves #726.

2f0762fa8f

build: Don't call assembly an optimization

because we don't know whether it's an optimization.

f07cead0ca

real-or-random force-pushed on Nov 24, 2023

real-or-random commented at 7:13 am on November 24, 2023: contributor

Two PR description / commit message nits:

Force-pushed to fix. (I had already force-pushed multiple times yesterday to fix typos in the commit message… The commit message is by far the most complex part of this PR. :D)

theStack approved

theStack commented at 7:54 pm on November 24, 2023: contributor

ACK f07cead0ca96e26356466b635ce6e7fe3834c949

Out of curiosity, I ran the benchmarks again with gcc-11 (which is the default on Ubuntu 22.04 LTS) and clang-14. As expected, the PR branch is faster for sign/verify and the relevant field operations (mul and sqr) on both:

 0======================
 1===== gcc 11.4.0 =====
 2======================
 3
 4master (with field asm):
 5
 6$ ./configure CC=gcc-11 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
 7Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
 8
 9ecdsa_verify                  ,    32.9       ,    33.6       ,    35.7    
10ecdsa_sign                    ,    24.8       ,    25.4       ,    28.2    
11schnorrsig_sign               ,    18.9       ,    19.3       ,    20.7    
12schnorrsig_verify             ,    33.3       ,    34.1       ,    36.1    
13field_sqr                     ,     0.0165    ,     0.0185    ,     0.0214 
14field_mul                     ,     0.0182    ,     0.0195    ,     0.0228
15
16pr1446 (without field asm):
17
18$ ./configure CC=gcc-11 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
19Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
20
21ecdsa_verify                  ,    28.9       ,    29.4       ,    30.1    
22ecdsa_sign                    ,    23.0       ,    23.3       ,    23.9    
23schnorrsig_sign               ,    17.2       ,    17.5       ,    18.1    
24schnorrsig_verify             ,    30.0       ,    30.3       ,    31.0    
25field_sqr                     ,     0.0130    ,     0.0167    ,     0.0475 
26field_mul                     ,     0.0136    ,     0.0147    ,     0.0161
27
28
29========================
30===== clang 14.0.0 =====
31========================
32
33master (with field asm):
34
35$ ./configure CC=clang-14 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
36Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
37
38ecdsa_verify                  ,    32.7       ,    34.7       ,    38.7    
39ecdsa_sign                    ,    24.6       ,    25.0       ,    25.6    
40schnorrsig_sign               ,    18.3       ,    18.4       ,    19.4    
41schnorrsig_verify             ,    32.8       ,    33.1       ,    33.8    
42field_sqr                     ,     0.0152    ,     0.0158    ,     0.0165 
43field_mul                     ,     0.0175    ,     0.0181    ,     0.0191
44
45
46pr1446 (without field asm):
47
48$ ./configure CC=clang-14 && make clean && make && ./bench verify sign && ./bench_internal mul split sqr | tail -n +5
49Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
50
51ecdsa_verify                  ,    30.6       ,    31.0       ,    32.3    
52ecdsa_sign                    ,    24.7       ,    24.9       ,    25.7    
53schnorrsig_sign               ,    18.2       ,    18.4       ,    18.6    
54schnorrsig_verify             ,    31.2       ,    31.6       ,    32.6    
55field_sqr                     ,     0.0117    ,     0.0123    ,     0.0133 
56field_mul                     ,     0.0139    ,     0.0148    ,     0.0168

(Interesting that clang’s generated code for _fe_sqr performs so much better than gcc’s, being ~35% faster).

theStack cross-referenced this on Nov 24, 2023 from issue Consider removing x86_64 field assembly by jonasnick

real-or-random commented at 11:17 pm on November 25, 2023: contributor

(Interesting that clang’s generated code for _fe_sqr performs so much better than gcc’s, being ~35% faster).

Perhaps this was some glitch in the benchmark, given the rather high “max” value for gcc, and given that sqr should be faster than mul. (sqr is just a specialization of mul, so its only purpose is to be faster than mul.)

real-or-random added this to the milestone 0.4.1 or 0.5.0 on Nov 29, 2023

real-or-random added the label needs-changelog on Nov 29, 2023

sipa commented at 1:51 pm on November 30, 2023: contributor

utACK f07cead0ca96e26356466b635ce6e7fe3834c949

jonasnick approved

jonasnick commented at 6:44 pm on December 1, 2023: contributor

ACK f07cead0ca96e26356466b635ce6e7fe3834c949

I ran the benchmarks with gcc 10.5.0 (before noticing that this had already been done in 726) and confirmed the >10% speedup. I think speedups of that magnitude (and common/default configure settings) justify a mention in the changelog.

jonasnick merged this on Dec 1, 2023

jonasnick closed this on Dec 1, 2023

deadalnix commented at 8:48 pm on December 1, 2023: none

It’s not unlikely that we’ll fall behind again in a few compiler versions

I can confirm, I’ve been working on this on the LLVM front, and there is still a lot more it can do.

real-or-random commented at 9:21 pm on December 1, 2023: contributor

It’s not unlikely that we’ll fall behind again in a few compiler versions

I can confirm, I’ve been working on this on the LLVM front, and there is still a lot more it can do.

Nice, I wasn’t aware! The upcoming GCC 14 will also be able to do more addc/subc with https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=43a3252c42af12. (And libsecp256k1 code broke the first attempt: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112430 …)

But indeed, there’s much more potential. The scalar asm, which relies heavily on addc, is still faster because compilers can’t figure out the carry pattern. One possible middle ground could be the use of addc intrinsics instead of asm.

(copied this reply from https://twitter.com/real_or_random/status/1730696947197948144)

russeree commented at 10:59 pm on December 1, 2023: none

Concept Ack;

Nit: It is critical that the output is 1:1 compatible with existing implementation because inducing a bug though the optimizations would lead to a chainsplit if on one side a sig validates and on another it doesn’t for any reason.

Edit: Thinking more about this, I am leaning to an configuration flag for this that is off by default, letting the compiler just work magic to optimize isn’t a process that I trust completely especially for consensus critical code.

real-or-random removed the label needs-changelog on Dec 6, 2023

field: Remove x86_64 asm #1446