fiat-crypto + CryptOpt tracking issue #1261

issue real-or-random opened this issue on April 6, 2023
  1. real-or-random commented at 5:36 am on April 6, 2023: contributor

    fiat-crypto can generate verified field code for multiple targets, e.g., C and x86_64 asm. It has algorithm templates for our mul and sqr algorithm (under the name “Dettman”) for secp256k1’s field. But the current implementation misses at least one optimization that we have in our C code (but not in asm), namely the one from #810.

    CryptOpt can optimize fiat-crypto’s output, producing significantly faster asm while retaining the formal guarantees.

    A plausible route towards using these projects in this library is:

    • Optimize the algorithm template in fiat-crypto code further by implementing the optimization from #810 and perhaps further refinements. This is tracked in https://github.com/mit-plv/fiat-crypto/issues/1582. This would ideally be done by the fiat-crypto folks (in particular @OwenConoly who contributed the initial implementation).
    • If fiat-crypto’s C output is fast enough (ideally on par with our current C code), replace the current C code by fiat-crypto’s C code.
    • If CryptOpt provides a significant enough further speedup, replace the current x86_64 asm by fiat-crypto + CryptOpt x86_64 asm. Since the current output of CryptOpt is not fully generic x86_64 asm, we need one of these two:
      • Add CPUID detection on our side
      • Modify CryptOpt so that it uses only instructions available on all x86_64 CPUs
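    The “CPUID on our side” option could be sketched roughly as follows. This is a hypothetical sketch, not library code: it assumes GCC/Clang on x86_64, and the function name has_bmi2_adx is invented here. CPUID leaf 7, subleaf 0 reports BMI2 in EBX bit 8 and ADX in EBX bit 19, which covers the MULX/ADOX/ADCX instructions CryptOpt emits.

```c
#include <cpuid.h>   /* GCC/Clang helper wrapping the CPUID instruction */

/* Hypothetical sketch: return 1 iff the CPU supports both BMI2 (MULX)
 * and ADX (ADOX/ADCX), so a caller could select the CryptOpt asm over
 * the portable C implementation at runtime. */
static int has_bmi2_adx(void) {
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 7, subleaf 0: extended feature flags live in EBX. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return ((ebx >> 8) & 1)    /* EBX bit 8:  BMI2 */
        && ((ebx >> 19) & 1);  /* EBX bit 19: ADX  */
}
```

    A real integration would presumably run this once (e.g. at context creation) rather than per field operation.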
  2. real-or-random cross-referenced this on Apr 6, 2023 from issue DRAFT: Replace Field Arithmetic by dderjoel
  3. real-or-random cross-referenced this on Apr 6, 2023 from issue PROOF-OF-CONCEPT Replace x86_64 asm by CryptOpt output by real-or-random
  4. dderjoel commented at 0:36 am on April 18, 2023: none

    Hi everyone, It took me a bit but I now have some numbers.

    I’ve used these nine machines:

    CPU 𝜇-architecture
    Intel Core i7-6770HQ Skylake-H
    Intel Core i7-10710U Comet Lake-U
    Intel Core i9-10900K Comet Lake-S
    Intel Core i7-11700KF Rocket Lake-S
    Intel Core i9-13900KF Raptor Lake-S
    AMD Ryzen Threadr. 1900X Zen 1
    AMD Ryzen 7 5800X Zen 3
    AMD Ryzen 9 5950X Zen 3
    AMD Ryzen 9 7950X Zen 4

    I’ve created a little project here, which uses this library as the base:

    • default_asm compiles with ./configure --with-asm=x86_64, using the current asm implementation.
    • default_c compiles with ./configure --with-asm=no, using the current C implementation.
    • default_c52 replaces the C implementation with the version before PR #810, then compiles with ./configure --with-asm=no. I’ve included it for comparison with fiat_c.
    • fiat_c replaces the C implementation with the Fiat Cryptography implementation (which does not feature the #810 optimization).
    • fiat_cryptopt uses Fiat as a starting point, then optimizes mul and square on all those machines; I’ve included the ‘on average best’ one.

    So on average, we see that for ecmult_gen, fiat_c is more performant than default_asm, and fiat_cryptopt is even faster than that and than the current default_c.

    Numbers for the secp256k1 benchmarks ./bench_internal simple and ./bench_ecmult (I’ve commented out the multi_benches so the run finishes in a timely manner).

    Geometric mean over nine machines (per-machine tables at the end of the post)

    implementation default_asm default_c default_c52 fiat_c fiat_cryptopt
    ecmult_gen 24.6435 23.2049 24.7301 24.1802 22.8905
    ecmult_const 46.798 43.8816 47.4836 45.7434 42.6387
    ecmult_1p 37.0311 34.1909 37.06 35.7762 33.1715
    ecmult_0p_g 25.3775 23.7174 26.1944 25.0963 23.3779
    ecmult_1p_g 21.6629 19.9417 21.5685 20.8703 19.412
    field_half 0.00333343 0.00333149 0.00334944 0.00342343 0.0033374
    field_normalize 0.0112975 0.0113496 0.0113459 0.0112799 0.0112925
    field_normalize_weak 0.00447881 0.00451408 0.00449407 0.00450251 0.00453795
    field_sqr 0.0215233 0.0194716 0.0213214 0.0189657 0.0190589
    field_mul 0.0258206 0.0219956 0.0244374 0.0240141 0.0226145
    field_inverse 2.13491 2.13158 2.13192 2.13569 2.14261
    field_inverse_var 1.39323 1.39281 1.3899 1.39515 1.39883
    field_is_square_var 1.82198 1.83353 1.84074 1.83197 1.82702
    field_sqrt 5.94338 5.4591 5.88598 5.29186 5.28916

    In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I’ve included implementations with the same mul/square. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I’ve included one particular implementation, which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on the geometric mean, fiat_cryptopt is again the most performant one.

    Implementation Lang. 1900X 5800X 5950X 7950X i7 6G i7 10G i9 10G i7 11G i9 13G G.M.
    (default_asm) libsecp256k1 [Bitcoin Core 2021] asm 720k (1.07x) 567k (1.05x) 567k (1.05x) 560k (1.06x) 565k (1.06x) 564k (1.06x) 564k (1.06x) 539k (1.08x) 439k (1.06x) 561k (1.06x)
    libsecp256k1+Best (Opt) asm 712k (1.06x) 545k (1.01x) 545k (1.01x) 543k (1.02x) 541k (1.01x) 541k (1.02x) 541k (1.02x) 512k (1.03x) 448k (1.08x) 544k (1.02x)
    (default_c52) libsecp256k1 [Bitcoin Core 2021] (w/o 810) C 699k (1.04x) 561k (1.04x) 559k (1.04x) 546k (1.03x) 575k (1.08x) 574k (1.08x) 572k (1.08x) 538k (1.08x) 454k (1.09x) 561k (1.06x)
    (default_c) libsecp256k1 [Bitcoin Core 2021] (w/ 810) C 671k (1.00x) 538k (1.00x) 538k (1.00x) 530k (1.00x) 561k (1.05x) 563k (1.06x) 562k (1.06x) 508k (1.02x) 435k (1.05x) 542k (1.02x)
    (fiat_c) libsecp256k1 + Fiat-C C 671k (1.00x) 541k (1.00x) 540k (1.00x) 537k (1.01x) 551k (1.03x) 551k (1.04x) 550k (1.04x) 508k (1.02x) 425k (1.02x) 538k (1.01x)
    (fiat_cryptopt) libsecp256k1+Best (Fiat) asm 703k (1.05x) 539k (1.00x) 539k (1.00x) 536k (1.01x) 533k (1.00x) 530k (1.00x) 531k (1.00x) 498k (1.00x) 415k (1.00x) 532k (1.00x)

    Intel Core i7-6770HQ

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 29.3 29.1 30.2 30.1 27.1
    ecmult_const 54.9 55.0 57.6 57.0 49.6
    ecmult_1p 42.8 42.7 44.8 44.3 38.3
    ecmult_0p_g 29.3 29.6 31.7 31.3 27.0
    ecmult_1p_g 25.1 24.9 26.1 25.9 22.5
    field_half 0.00425 0.00425 0.00429 0.00429 0.00424
    field_normalize 0.0143 0.0144 0.0145 0.0143 0.0144
    field_normalize_weak 0.00566 0.00585 0.00566 0.00566 0.00584
    field_sqr 0.0243 0.0237 0.0242 0.0218 0.0214
    field_mul 0.0302 0.0266 0.0285 0.0305 0.0253
    field_inverse 2.70 2.69 2.69 2.70 2.68
    field_inverse_var 1.82 1.80 1.80 1.80 1.80
    field_is_square_var 2.31 2.34 2.36 2.30 2.32
    field_sqrt 6.79 6.50 6.58 6.10 5.87

    Intel Core i7-10710U

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 70.0 69.3 72.1 71.7 64.7
    ecmult_const 131.0 131.0 137.0 136.0 118.0
    ecmult_1p 102.0 102.0 107.0 106.0 91.5
    ecmult_0p_g 70.0 70.9 74.7 73.6 63.4
    ecmult_1p_g 59.9 59.6 62.4 61.8 53.8
    field_half 0.00997 0.00993 0.0101 0.0105 0.00994
    field_normalize 0.0344 0.0344 0.0344 0.0344 0.0345
    field_normalize_weak 0.0132 0.0137 0.0132 0.0132 0.0137
    field_sqr 0.0576 0.0559 0.0572 0.0515 0.0507
    field_mul 0.0714 0.0629 0.0722 0.0676 0.0599
    field_inverse 6.16 6.16 6.11 6.15 6.15
    field_inverse_var 4.33 4.32 4.32 4.36 4.32
    field_is_square_var 5.65 5.59 5.63 5.68 5.59
    field_sqrt 16.2 15.5 15.7 14.5 14.0

    Intel Core i9-10900K

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 20.5 20.4 21.1 21.1 19.0
    ecmult_const 38.4 38.5 40.3 39.8 34.7
    ecmult_1p 30.0 29.9 31.3 31.0 26.8
    ecmult_0p_g 20.6 20.7 21.8 21.5 18.5
    ecmult_1p_g 17.6 17.4 18.3 18.1 15.7
    field_half 0.00296 0.00296 0.00297 0.00293 0.00298
    field_normalize 0.0101 0.0101 0.0101 0.0101 0.0101
    field_normalize_weak 0.00393 0.00393 0.00392 0.00393 0.00392
    field_sqr 0.0171 0.0166 0.0171 0.0153 0.0151
    field_mul 0.0212 0.0187 0.0201 0.0201 0.0178
    field_inverse 1.80 1.80 1.80 1.80 1.80
    field_inverse_var 1.27 1.27 1.27 1.27 1.26
    field_is_square_var 1.62 1.65 1.64 1.62 1.62
    field_sqrt 4.75 4.54 4.62 4.26 4.10

    Intel Core i7-11700KF

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 19.9 17.8 19.3 18.8 17.8
    ecmult_const 38.3 34.2 37.1 35.4 33.4
    ecmult_1p 29.9 26.4 28.9 27.3 26.0
    ecmult_0p_g 20.9 18.4 20.1 18.8 18.1
    ecmult_1p_g 17.5 15.5 16.7 15.9 15.3
    field_half 0.00304 0.00303 0.00303 0.00303 0.00305
    field_normalize 0.0102 0.0105 0.0104 0.0101 0.0101
    field_normalize_weak 0.00394 0.00394 0.00410 0.00415 0.00416
    field_sqr 0.0195 0.0171 0.0181 0.0170 0.0164
    field_mul 0.0225 0.0182 0.0191 0.0196 0.0203
    field_inverse 1.89 1.89 1.89 1.89 1.97
    field_inverse_var 1.22 1.22 1.21 1.22 1.25
    field_is_square_var 1.67 1.67 1.68 1.67 1.68
    field_sqrt 5.35 4.69 5.12 4.51 4.65

    Intel Core i9-13900KF

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 19.5 17.8 18.6 17.9 17.6
    ecmult_const 38.4 34.3 36.2 34.6 33.8
    ecmult_1p 30.7 27.3 28.8 27.5 26.8
    ecmult_0p_g 21.4 18.8 20.0 19.1 18.6
    ecmult_1p_g 18.0 16.0 16.8 16.1 15.7
    field_half 0.00330 0.00332 0.00337 0.00335 0.00333
    field_normalize 0.0109 0.0110 0.0110 0.0109 0.0109
    field_normalize_weak 0.00452 0.00451 0.00449 0.00450 0.00451
    field_sqr 0.0191 0.0169 0.0193 0.0167 0.0177
    field_mul 0.0256 0.0202 0.0210 0.0196 0.0191
    field_inverse 2.13 2.12 2.13 2.13 2.13
    field_inverse_var 1.37 1.37 1.37 1.36 1.38
    field_is_square_var 1.88 1.87 1.87 1.88 1.88
    field_sqrt 5.43 4.97 5.56 4.82 5.04

    AMD Ryzen Threadripper 1900X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 25.7 23.5 26.0 24.4 25.1
    ecmult_const 50.9 45.9 51.4 47.4 48.5
    ecmult_1p 39.6 35.5 40.0 37.1 37.1
    ecmult_0p_g 27.5 24.7 28.5 26.3 26.6
    ecmult_1p_g 23.0 20.6 23.2 21.6 21.5
    field_half 0.00270 0.00270 0.00271 0.00293 0.00269
    field_normalize 0.0104 0.0104 0.0104 0.0104 0.0104
    field_normalize_weak 0.00365 0.00365 0.00365 0.00365 0.00365
    field_sqr 0.0216 0.0202 0.0224 0.0198 0.0202
    field_mul 0.0267 0.0235 0.0263 0.0246 0.0244
    field_inverse 2.04 2.04 2.05 2.05 2.04
    field_inverse_var 1.21 1.22 1.21 1.22 1.21
    field_is_square_var 1.51 1.50 1.53 1.51 1.56
    field_sqrt 5.96 5.60 6.10 5.50 5.54

    AMD Ryzen 5800X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 20.6 19.2 20.7 20.3 19.3
    ecmult_const 38.2 35.2 39.3 37.6 35.5
    ecmult_1p 31.4 27.5 30.6 29.5 27.7
    ecmult_0p_g 20.9 19.1 21.6 20.5 19.4
    ecmult_1p_g 18.4 16.0 17.8 17.2 16.2
    field_half 0.00252 0.00251 0.00251 0.00264 0.00251
    field_normalize 0.00818 0.00818 0.00818 0.00819 0.00818
    field_normalize_weak 0.00343 0.00343 0.00343 0.00343 0.00343
    field_sqr 0.0180 0.0147 0.0177 0.0152 0.0157
    field_mul 0.0199 0.0167 0.0194 0.0195 0.0191
    field_inverse 1.58 1.58 1.58 1.58 1.58
    field_inverse_var 1.03 1.03 1.03 1.03 1.03
    field_is_square_var 1.37 1.38 1.38 1.38 1.36
    field_sqrt 4.72 4.23 4.83 4.30 4.35

    AMD Ryzen 5950X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 23.1 21.4 23.2 22.7 21.6
    ecmult_const 42.6 39.3 43.9 42.0 39.6
    ecmult_1p 34.9 30.7 34.1 32.9 30.9
    ecmult_0p_g 23.1 21.2 24.7 23.4 22.3
    ecmult_1p_g 20.4 17.9 19.9 19.2 18.1
    field_half 0.00280 0.00280 0.00280 0.00294 0.00280
    field_normalize 0.00914 0.00912 0.00913 0.00913 0.00913
    field_normalize_weak 0.00382 0.00383 0.00382 0.00382 0.00382
    field_sqr 0.0200 0.0164 0.0197 0.0169 0.0176
    field_mul 0.0221 0.0188 0.0216 0.0216 0.0213
    field_inverse 1.77 1.76 1.76 1.77 1.77
    field_inverse_var 1.15 1.15 1.15 1.15 1.18
    field_is_square_var 1.52 1.54 1.54 1.54 1.52
    field_sqrt 5.26 4.71 5.39 4.80 4.85

    AMD Ryzen 7950X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 16.8 15.5 16.8 16.4 15.8
    ecmult_const 32.0 29.2 32.4 31.0 29.8
    ecmult_1p 25.1 22.7 25.3 24.4 23.4
    ecmult_0p_g 17.4 15.8 18.4 17.6 17.0
    ecmult_1p_g 14.6 13.2 14.7 14.2 13.6
    field_half 0.00212 0.00212 0.00212 0.00212 0.00213
    field_normalize 0.00698 0.00697 0.00696 0.00695 0.00696
    field_normalize_weak 0.00290 0.00290 0.00290 0.00290 0.00290
    field_sqr 0.0143 0.0130 0.0141 0.0128 0.0125
    field_mul 0.0165 0.0142 0.0170 0.0161 0.0149
    field_inverse 1.34 1.34 1.34 1.34 1.34
    field_inverse_var 0.825 0.827 0.825 0.838 0.825
    field_is_square_var 1.06 1.09 1.09 1.09 1.06
    field_sqrt 4.12 3.62 3.93 3.50 3.51
  5. dderjoel commented at 1:04 am on April 18, 2023: none

    In regards to merging this: I’ve included the implementations as .c files, and the methods are not static. For the libsecp benchmarks, I’ve added the implementation as a C source to the bench{,_internal,_ecmult}_SOURCES targets (see here).

    Side note: I wanted to include the implementations as .asm files first (as those are the formally verified artifacts), but then I think we would be in trouble once we compile for non-System V ABIs. So I’ve used nasm to assemble the asm files and objdump to dump the code in AT&T syntax (afaik, clang sometimes complains otherwise), and included that in the C source. With that, I hope the compiler is able (if needed) to swap the argument registers when one compiles for a non-System V ABI. And when I tried to include the methods as inline static in .h files, any optimization level of the compiler resulted in segfaults, because the registers were not set correctly. (I’m too unfamiliar with this inline-asm notation.)
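    As background for the inline-asm trouble described above: in GCC/Clang extended asm, the usual fix is to declare every register the assembly reads or writes through operand constraints and the clobber list, so the compiler never keeps live values in them. A generic sketch (not the actual CryptOpt output; assumes x86_64):

```c
#include <stdint.h>

/* Minimal extended-asm sketch: a 64x64 -> 128-bit multiply.  MUL reads
 * rax and writes rax:rdx, so both are declared as operands ("a"/"d")
 * rather than being silently overwritten. */
static inline void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t h, l;
    __asm__ ("mulq %3"
             : "=a"(l), "=d"(h)   /* outputs: rax = low word, rdx = high word */
             : "a"(a), "r"(b)     /* inputs: rax = a, any register = b */
             : "cc");             /* the flags register is clobbered */
    *hi = h;
    *lo = l;
}
```

    When every touched register is accounted for this way, the function can safely be static inline in a header without the segfaults described above.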

    In regards to “tell cryptopt to emit general code (i.e. no mulx, adox, adcx)”: that is not too trivial. I’ll see what I can do, but I don’t think we should expect anything here too soon.

  6. real-or-random commented at 2:38 pm on April 19, 2023: contributor

    Thanks, you have a nice benchmark suite. :) Interestingly, fiat_c beats default_c on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?

    Anyway, I don’t think that any of this should change the plan sketched above. Since we’re not in a hurry, it would be awesome to have https://github.com/mit-plv/fiat-crypto/issues/1582 solved first.

    In regards to “tell cryptopt to emit general code (i.e. no mulx, adox, adcx)”: that is not too trivial. I’ll see what I can do, but I don’t think we should expect anything here too soon.

    Oh okay, we assumed it’s a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.

  7. dderjoel commented at 0:56 am on April 20, 2023: none

    fiat_c beats default_c on SUPERCOP, but not in the other benchmarks. Any idea why this is the case?

    SUPERCOP tries many different compilers / compiler settings (see below), whereas for the ./bench_{} runs I’ve just used the default flags (see the ./configure output below). I’m happy to do that with other compilers, too, if you think that’s useful.

    Oh okay, we assumed it’s a simple change. Anyway, should we decide that we want asm, which is anyway CPU-specific, then I think it makes sense to go 100% for it and use CPUID.

    I’ll keep it in mind. I’m a bit divided on this, though. CPUs since Oct ‘14 (Broadwell) support BMI2 / ADX, so it feels wrong not to use them. It would be interesting to see how the usage of this library is distributed… On the other hand, I do understand the wish to be as flexible as possible.

    I’m using Clang 15 and GCC 12.


    SUPERCOP compiler settings

    gcc -march=native -mtune=native -O3 -fomit-frame-pointer -fwrapv -fPIC -fPIE
    gcc -march=native -mtune=native -Os -fomit-frame-pointer -fwrapv -fPIC -fPIE
    gcc -march=native -mtune=native -O2 -fomit-frame-pointer -fwrapv -fPIC -fPIE
    gcc -march=native -mtune=native -O -fomit-frame-pointer -fwrapv -fPIC -fPIE
    clang -march=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
    clang -march=native -Os -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
    clang -march=native -O2 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
    clang -march=native -O -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
    clang -mcpu=native -O3 -fomit-frame-pointer -fwrapv -Qunused-arguments -fPIC -fPIE
    

    output of ./configure on i7-11700K, ./fiat_cryptopt

    Build Options:
      with external callbacks = no
      with benchmarks         = yes
      with tests              = yes
      with ctime tests        = yes
      with coverage           = no
      with examples           = no
      module ecdh             = yes
      module recovery         = no
      module extrakeys        = yes
      module schnorrsig       = yes

      asm                     = x86_64
      ecmult window size      = 15
      ecmult gen prec. bits   = 4

      valgrind                = yes
      CC                      = gcc
      CPPFLAGS                =
      SECP_CFLAGS             = -O2 -std=c89 -pedantic -Wno-long-long -Wnested-externs -Wshadow -Wstrict-prototypes -Wundef -Wno-overlength-strings -Wall -Wno-unused-function -Wextra -Wcast-align -Wcast-align=strict -fvisibility=hidden
      CFLAGS                  = -g -O2
      LDFLAGS                 =
    
  8. real-or-random commented at 8:07 am on April 20, 2023: contributor

    SUPERCOP tries many different compilers / compiler settings (see below), whereas for the ./bench_{} runs I’ve just used the default flags (see the ./configure output below).

    Makes sense.

    I’m happy to do that with other compilers, too, if you think that’s useful.

    I’m not convinced that it will provide much more insight right now.

    CPUs since Oct ‘14 (Broadwell) support BMI2 / ADX, so it feels wrong not to use them. It would be interesting to see how the usage of this library is distributed…

    Yep, exactly my thoughts. If you really have an older CPU, the C code will serve you well.

  9. jonasnick commented at 2:36 pm on April 21, 2023: contributor

    Concept ACK on investigating fiat-crypto and (longer term) cryptopt. Would be great to have a verified implementation that’s only slightly slower or even faster.

    Thanks @dderjoel for the benchmarks.

  10. dderjoel commented at 0:34 am on May 3, 2023: none

    Owen implemented #810 in the Fiat repo (see https://github.com/mit-plv/fiat-crypto/issues/1582); it’s not yet fully merged, but I could not wait to benchmark it.

    I’ve updated secp_bench and supercop with the new C and CryptOpt-optimized versions.

    I’ve also added our 12th Gen machine back into the benchmark suite, here the updated numbers (copied post from above, indicating changes):

    I’ve used these ~~nine~~ ten machines:

    CPU 𝜇-architecture
    Intel Core i7-6770HQ Skylake-H
    Intel Core i7-10710U Comet Lake-U
    Intel Core i9-10900K Comet Lake-S
    Intel Core i7-11700KF Rocket Lake-S
    Intel Core i9-12900KF Alder Lake-S
    Intel Core i9-13900KF Raptor Lake-S
    AMD Ryzen Threadr. 1900X Zen 1
    AMD Ryzen 7 5800X Zen 3
    AMD Ryzen 9 5950X Zen 3
    AMD Ryzen 9 7950X Zen 4

    I’ve created a little project here, which uses this library as the base:

    • default_asm compiles with ./configure --with-asm=x86_64, using the current asm implementation.
    • default_c compiles with ./configure --with-asm=no, using the current C implementation.
    • default_c52 replaces the C implementation with the version before PR #810, then compiles with ./configure --with-asm=no. I’ve included it for comparison with fiat_c.
    • fiat_c replaces the C implementation with the ~~Fiat Cryptography implementation (which does not feature the #810 optimization)~~ merge-in-progress Fiat Cryptography implementation, now featuring the #810 optimization.
    • fiat_cryptopt uses Fiat as a starting point, then optimizes mul and square on all those machines; I’ve included the ‘on average best’ one (implicitly also now using the optimization).

    So on average, we see that for ecmult_gen, fiat_c is more performant than default_asm and slightly more performant than default_c (plus giving formal assurance), and fiat_cryptopt is ~~even faster than that and the current default_c~~ the fastest (plus formal assurance).

    Numbers for the secp256k1 benchmarks ./bench_internal simple and ./bench_ecmult (I’ve commented out the multi_benches so the run finishes in a timely manner).

    Geometric mean over ~~nine~~ ten machines (per-machine tables at the end of the post)

    implementation default_asm default_c default_c52 fiat_c fiat_cryptopt
    ecmult_gen 16.1257 14.9874 15.8764 14.9714 14.1071
    ecmult_const 30.3317 28.2041 30.3634 28.1846 26.1778
    ecmult_1p 23.9575 22.0809 23.641 22.1608 20.5615
    ecmult_0p_g 16.9942 15.8604 16.9429 15.9223 14.7684
    ecmult_1p_g 14.0591 12.8636 13.7521 12.9906 12.0012
    field_half 0.0024964 0.0024537 0.00237626 0.0023697 0.00237259
    field_normalize 0.00737954 0.00744617 0.00736519 0.00732155 0.00733252
    field_normalize_weak 0.00294988 0.00295974 0.00293822 0.00295422 0.00294591
    field_sqr 0.014059 0.0126383 0.0137789 0.0123009 0.0119273
    field_mul 0.0170241 0.0144868 0.0156579 0.014689 0.0144257
    field_inverse 1.40048 1.40896 1.39608 1.38928 1.3986
    field_inverse_var 0.915942 0.919022 0.912739 0.907769 0.910547
    field_is_square_var 1.20313 1.21123 1.20451 1.19417 1.19385
    field_sqrt 3.87547 3.56868 3.82354 3.44445 3.35331

    In addition to the benchmarks of the library, I also used the SUPERCOP framework, in which I’ve included implementations with the same mul/square. This is similar to the table in the paper, but instead of optimizing and comparing on one single machine (not shown in the table below), I’ve included one particular implementation, which I ran on all machines. The numbers are cycles for four scalar multiplications (two base point, two variable point). We see that on the geometric mean, fiat_cryptopt is again the most performant one (in fact everywhere except on the 1900X, where Fiat-C is the most performant one).

    Implementation Lang. 1900X 5800X 5950X 7950X i7 6G i7 10G i9 10G i7 11G i9 12G i9 13G G.M.
    (default_asm) libsecp256k1 [Bitcoin Core 2021] asm 719k (1.11x) 571k (1.11x) 567k (1.10x) 560k (1.11x) 565k (1.07x) 565k (1.08x) 564k (1.08x) 540k (1.13x) 438k (1.09x) 439k (1.11x) 548k (1.09x)
    libsecp256k1+Best (Opt) asm 712k (1.10x) 543k (1.06x) 545k (1.06x) 544k (1.08x) 541k (1.03x) 541k (1.03x) 542k (1.04x) 512k (1.07x) 449k (1.12x) 447k (1.13x) 533k (1.07x)
    (default_c52) libsecp256k1 [Bitcoin Core 2021] (w/o 810) C 699k (1.08x) 560k (1.09x) 561k (1.09x) 547k (1.09x) 574k (1.09x) 574k (1.10x) 573k (1.10x) 536k (1.12x) 459k (1.14x) 456k (1.15x) 550k (1.10x)
    (default_c) libsecp256k1 [Bitcoin Core 2021] (w/ 810) C 671k (1.04x) 538k (1.05x) 537k (1.04x) 531k (1.06x) 567k (1.08x) 563k (1.07x) 562k (1.08x) 504k (1.06x) 440k (1.09x) 439k (1.11x) 531k (1.06x)
    (fiat_c) libsecp256k1 + Fiat-C C 648k (1.00x) 537k (1.05x) 536k (1.04x) 536k (1.07x) 549k (1.04x) 546k (1.04x) 548k (1.05x) 492k (1.03x) 418k (1.04x) 418k (1.05x) 519k (1.04x)
    (fiat_cryptopt) libsecp256k1+Best (Fiat) asm 675k (1.04x) 514k (1.00x) 515k (1.00x) 503k (1.00x) 526k (1.00x) 524k (1.00x) 523k (1.00x) 477k (1.00x) 402k (1.00x) 397k (1.00x) 500k (1.00x)

    Intel Core i7-6770HQ

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 21.9 21.6 22.4 22.2 20.0
    ecmult_const 40.7 40.8 42.7 41.7 36.4
    ecmult_1p 31.8 31.7 33.2 32.4 28.1
    ecmult_0p_g 22.7 22.9 24.0 23.4 20.2
    ecmult_1p_g 18.6 18.5 19.4 18.9 16.5
    field_half 0.00470 0.00441 0.00364 0.00436 0.00378
    field_normalize 0.0106 0.0108 0.0107 0.0107 0.0107
    field_normalize_weak 0.00420 0.00420 0.00421 0.00422 0.00421
    field_sqr 0.0184 0.0175 0.0180 0.0169 0.0162
    field_mul 0.0224 0.0198 0.0213 0.0221 0.0181
    field_inverse 1.99 2.00 2.01 1.99 2.00
    field_inverse_var 1.36 1.34 1.34 1.34 1.34
    field_is_square_var 1.73 1.74 1.74 1.72 1.73
    field_sqrt 5.04 4.83 4.90 4.66 4.44

    Intel Core i7-10710U

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 19.8 19.3 20.1 19.7 17.8
    ecmult_const 35.8 35.7 37.4 36.4 31.9
    ecmult_1p 28.1 28.2 29.4 28.6 24.9
    ecmult_0p_g 20.3 20.6 21.4 21.0 18.2
    ecmult_1p_g 16.5 16.4 17.2 16.8 14.6
    field_half 0.00327 0.00370 0.00356 0.00280 0.00368
    field_normalize 0.00956 0.00956 0.00932 0.00954 0.00932
    field_normalize_weak 0.00373 0.00373 0.00364 0.00387 0.00364
    field_sqr 0.0167 0.0157 0.0157 0.0151 0.0144
    field_mul 0.0201 0.0177 0.0188 0.0186 0.0163
    field_inverse 1.71 1.72 1.72 1.71 1.72
    field_inverse_var 1.21 1.21 1.21 1.22 1.22
    field_is_square_var 1.60 1.54 1.56 1.57 1.56
    field_sqrt 4.50 4.22 4.30 4.09 3.90

    Intel Core i9-10900K

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 14.9 14.8 15.3 15.1 13.6
    ecmult_const 27.9 27.9 29.5 28.5 24.9
    ecmult_1p 21.8 21.7 22.7 22.2 19.3
    ecmult_0p_g 15.0 15.2 15.8 15.4 13.3
    ecmult_1p_g 12.8 12.6 13.3 13.0 11.3
    field_half 0.00222 0.00218 0.00218 0.00215 0.00218
    field_normalize 0.00739 0.00731 0.00731 0.00733 0.00731
    field_normalize_weak 0.00285 0.00285 0.00286 0.00285 0.00285
    field_sqr 0.0125 0.0120 0.0123 0.0115 0.0111
    field_mul 0.0154 0.0136 0.0145 0.0142 0.0125
    field_inverse 1.31 1.31 1.31 1.31 1.31
    field_inverse_var 0.920 0.920 0.919 0.920 0.917
    field_is_square_var 1.18 1.19 1.19 1.17 1.17
    field_sqrt 3.44 3.30 3.35 3.18 3.04

    Intel Core i7-11700KF

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 17.1 14.4 15.1 13.1 12.7
    ecmult_const 31.5 27.3 28.0 24.5 23.5
    ecmult_1p 23.7 21.2 20.6 19.5 19.5
    ecmult_0p_g 16.6 14.7 14.4 14.3 13.6
    ecmult_1p_g 14.0 12.2 12.1 11.6 11.4
    field_half 0.00239 0.00237 0.00240 0.00223 0.00227
    field_normalize 0.00779 0.00837 0.00805 0.00730 0.00740
    field_normalize_weak 0.00320 0.00327 0.00325 0.00296 0.00313
    field_sqr 0.0155 0.0136 0.0142 0.0125 0.0114
    field_mul 0.0175 0.0145 0.0150 0.0128 0.0144
    field_inverse 1.48 1.50 1.48 1.36 1.36
    field_inverse_var 0.951 0.969 0.961 0.877 0.877
    field_is_square_var 1.32 1.33 1.32 1.20 1.20
    field_sqrt 4.18 3.74 3.92 3.37 3.15

    Intel Core i9-12900KF

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 11.6 10.6 11.0 10.4 9.96
    ecmult_const 22.6 20.2 21.3 19.9 18.9
    ecmult_1p 18.1 16.0 17.0 15.8 15.0
    ecmult_0p_g 13.5 11.8 12.5 11.6 11.9
    ecmult_1p_g 10.6 9.37 9.92 9.23 8.80
    field_half 0.00197 0.00198 0.00196 0.00197 0.00199
    field_normalize 0.00638 0.00626 0.00639 0.00641 0.00637
    field_normalize_weak 0.00266 0.00261 0.00264 0.00266 0.00266
    field_sqr 0.0114 0.00948 0.0114 0.00946 0.00978
    field_mul 0.0152 0.0115 0.0122 0.0113 0.0122
    field_inverse 1.25 1.23 1.25 1.25 1.25
    field_inverse_var 0.807 0.794 0.814 0.807 0.814
    field_is_square_var 1.10 1.08 1.10 1.10 1.10
    field_sqrt 3.26 2.91 3.25 2.82 2.84

    Intel Core i9-13900KF

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 10.7 9.23 9.61 9.10 8.70
    ecmult_const 20.1 17.7 18.7 17.5 16.6
    ecmult_1p 15.9 14.1 14.9 13.9 13.2
    ecmult_0p_g 11.4 9.95 10.5 9.66 9.18
    ecmult_1p_g 9.30 8.23 8.72 8.12 7.74
    field_half 0.00183 0.00183 0.00171 0.00178 0.00183
    field_normalize 0.00562 0.00602 0.00571 0.00568 0.00595
    field_normalize_weak 0.00235 0.00247 0.00233 0.00233 0.00247
    field_sqr 0.00978 0.00932 0.00986 0.00837 0.00905
    field_mul 0.0132 0.0113 0.0107 0.0102 0.0113
    field_inverse 1.10 1.16 1.10 1.10 1.16
    field_inverse_var 0.705 0.747 0.709 0.707 0.745
    field_is_square_var 0.970 1.02 0.967 0.972 1.02
    field_sqrt 2.80 2.71 2.88 2.44 2.63

    AMD Ryzen Threadripper 1900X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 23.6 21.7 23.9 22.1 22.5
    ecmult_const 46.6 42.1 47.1 42.7 43.4
    ecmult_1p 36.1 32.4 36.4 33.0 32.9
    ecmult_0p_g 25.8 23.4 26.2 23.6 23.2
    ecmult_1p_g 21.0 18.9 21.2 19.2 19.1
    field_half 0.00264 0.00273 0.00270 0.00270 0.00275
    field_normalize 0.00960 0.00941 0.00962 0.00942 0.00942
    field_normalize_weak 0.00330 0.00332 0.00333 0.00355 0.00332
    field_sqr 0.0196 0.0184 0.0204 0.0175 0.0172
    field_mul 0.0241 0.0214 0.0240 0.0219 0.0218
    field_inverse 1.82 1.86 1.84 1.84 1.86
    field_inverse_var 1.11 1.11 1.11 1.10 1.10
    field_is_square_var 1.36 1.39 1.40 1.39 1.41
    field_sqrt 5.38 5.08 5.56 4.82 4.77

    AMD Ryzen 5800X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 16.3 15.4 16.8 15.6 14.5
    ecmult_const 30.1 28.1 31.9 28.8 26.6
    ecmult_1p 24.7 22.0 24.8 22.8 20.8
    ecmult_0p_g 16.5 15.5 17.5 15.7 14.5
    ecmult_1p_g 14.6 12.9 14.4 13.2 12.1
    field_half 0.00255 0.00219 0.00221 0.00238 0.00199
    field_normalize 0.00661 0.00646 0.00647 0.00655 0.00647
    field_normalize_weak 0.00275 0.00270 0.00270 0.00272 0.00270
    field_sqr 0.0142 0.0117 0.0140 0.0124 0.0115
    field_mul 0.0161 0.0136 0.0155 0.0144 0.0142
    field_inverse 1.28 1.26 1.25 1.27 1.26
    field_inverse_var 0.833 0.822 0.815 0.826 0.816
    field_is_square_var 1.10 1.09 1.09 1.10 1.07
    field_sqrt 3.81 3.35 3.84 3.47 3.30

    AMD Ryzen 5950X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 16.1 15.5 16.7 15.3 14.6
    ecmult_const 30.6 28.4 31.6 28.1 26.7
    ecmult_1p 25.4 22.4 25.0 22.1 20.8
    ecmult_0p_g 17.7 16.5 18.0 16.3 15.3
    ecmult_1p_g 15.0 13.1 14.1 13.2 12.2
    field_half 0.00223 0.00221 0.00224 0.00224 0.00218
    field_normalize 0.00649 0.00642 0.00642 0.00643 0.00645
    field_normalize_weak 0.00270 0.00267 0.00267 0.00267 0.00267
    field_sqr 0.0142 0.0115 0.0139 0.0121 0.0114
    field_mul 0.0158 0.0132 0.0154 0.0140 0.0140
    field_inverse 1.26 1.24 1.24 1.25 1.25
    field_inverse_var 0.819 0.808 0.817 0.816 0.811
    field_is_square_var 1.08 1.08 1.08 1.08 1.07
    field_sqrt 3.75 3.33 3.81 3.39 3.27

    AMD Ryzen 7950X

    bench asm c c52 fiat_c fiat_cryptopt
    ecmult_gen 14.0 12.7 13.8 13.1 12.3
    ecmult_const 26.4 23.9 26.7 24.8 23.0
    ecmult_1p 20.7 18.6 20.9 19.5 18.2
    ecmult_0p_g 15.1 13.6 15.4 14.3 13.2
    ecmult_1p_g 12.1 10.8 12.1 11.4 10.4
    field_half 0.00213 0.00196 0.00189 0.00189 0.00189
    field_normalize 0.00556 0.00564 0.00549 0.00561 0.00559
    field_normalize_weak 0.00228 0.00229 0.00228 0.00230 0.00229
    field_sqr 0.0115 0.0106 0.0112 0.0104 0.00989
    field_mul 0.0137 0.0118 0.0136 0.0123 0.0123
    field_inverse 1.09 1.10 1.07 1.10 1.10
    field_inverse_var 0.670 0.680 0.657 0.681 0.675
    field_is_square_var 0.855 0.889 0.860 0.881 0.857
    field_sqrt 3.34 2.98 3.17 2.95 2.82
  11. dderjoel cross-referenced this on May 3, 2023 from issue Possible optimization for secp256k1 by sipa
  12. dderjoel commented at 5:50 am on May 9, 2023: none

    Hi everyone, I’ve integrated the new implementations (CryptOpt assembly and Fiat-C) into my fork, and it works with ./configure --with-asm=no and ./configure --with-asm=x86_64. I am unsure how to run the CI to see if it works (especially on Windows).

    Also, the current commits don’t check for the CPU feature flags BMI2 or ADX. Is the idea to check that at runtime, at compile time, or both? Also, I can’t figure out how I would make the field arithmetic static and inline again (if that is needed).

    I’m happy to do the bulk of the work, but I’d need some guidance on the following:

    • CPU flags check on compile / run time
    • Where to put implementations (I’ve currently put them into src/third_party) and whether to force them to be static
    • If not static, then we need CMake integration (to compile ./third_party/*.c)
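    For the runtime side of the first point, one common pattern is to resolve the choice once (e.g. at context creation) and route calls through a function pointer. This is a hedged sketch only: none of these names exist in the library, and the two implementations and the CPUID check are stand-ins.

```c
#include <stdint.h>

/* Hypothetical runtime-dispatch sketch: pick the BMI2/ADX assembly at
 * startup if the CPU supports it, otherwise fall back to the C code. */
typedef void (*fe_mul_fn)(uint64_t *r, const uint64_t *a, const uint64_t *b);

static void fe_mul_c(uint64_t *r, const uint64_t *a, const uint64_t *b) {
    r[0] = a[0] * b[0];  /* placeholder for the portable C field mul */
}

static void fe_mul_adx(uint64_t *r, const uint64_t *a, const uint64_t *b) {
    r[0] = a[0] * b[0];  /* placeholder for the CryptOpt asm field mul */
}

static int cpu_has_adx_bmi2(void) { return 0; }  /* stand-in for a CPUID check */

static fe_mul_fn fe_mul;

static void fe_mul_init(void) {
    /* Resolve once; every later call goes through the chosen pointer. */
    fe_mul = cpu_has_adx_bmi2() ? fe_mul_adx : fe_mul_c;
}
```

    One downside worth noting: the indirection defeats inlining, which is exactly the static/inline question raised above, so a compile-time check may still be preferable where the target CPU is known.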
  13. sipa cross-referenced this on May 9, 2023 from issue Rework of 10x26 field mul/sqr by peterdettman
  14. real-or-random added the label assurance on May 10, 2023
  15. dderjoel commented at 7:29 am on May 22, 2023: none

    Hi everyone,

    I’ve updated my fork in three ways:

    1. I’ve included the up-to-date version from Fiat, i.e. without the unnecessary ANDs
    2. I’ve included the CryptOpt-generated versions into the header files, (as opposed to earlier where I’ve put them into separate C files). One problem was that the inline assembly did not specify that caller-save registers will be overwritten. Another problem was that rdx and rdi were also overwritten. Now, I’ve added two instructions to the end, which restore rdx and rdi. (Does anyone know hot to tell CC that those are clobbered AND read at the same time)?
    3. I’ve included CPU flag detection macros into ./build-aux/m4/bitcoin_secp.m4. I’ve found existing macros to check common cpu features and to check via builtin already, but I was unable to reliably detect adx (especially with clang, which does not support those in __builtin_cpu_supports. c.f. the list of supported flags

    I believe the last thing needed for a PR would be to check during the CMake build process whether BMI2 and ADX are available. Does anyone have other ideas on what we need here?

  16. sipa commented at 12:59 pm on May 22, 2023: contributor

    @dderjoel Thanks for the updates, I’m very glad to see the progress here.

    My thinking is that we’ll actually want to split this into two separate efforts, which we can discuss and consider with different timelines:

    • Replacing the pure-C code with Fiat-Crypto generated C code.
    • Introducing CryptoOpt-generated assembly code.

    The first is straightforward, I think: your benchmarks show that the Fiat-Crypto generated C code (with the #810 optimization) is competitive with the C code we already have, so I think there is little reason not to just outright replace that code. Since there is no reason to keep the old code too, we don’t need any decision logic or build system support to choose one over the other. I have some comments on the implementation, but that can wait for a PR. Overall, this feels very close to usable.

    Introducing CryptoOpt is a more complex matter, because of the use of ADCX/ADOX/MULX it cannot be a drop-in replacement for the assembly code we already have. While I have no problem with optimizing exclusively for architectures which do have these, we can’t just drop support for those who don’t (but falling back to the C implementation, or another less-optimized asm implementation, on those is ok, IMO). There are a number of options:

    • Modify CryptoOpt to not use the new instructions, allowing us to use it as a drop-in replacement for the existing asm code. There are of course alternatives to this, like using the output of a well-known, stable, C compiler on the Fiat-Crypto generated C code, and include that output directly as asm code. Either of these options would essentially give us the same level of assurance in the asm code as fiat-crypto-output-fed-to-c-compiler, which is more than what we currently have for the asm code.
    • No autodetection, and only enable it when explicitly requested at configure time, which someone can do if they know it’ll be run on a machine that supports the new instructions. This could be a first step to introduce the CryptoOpt code to the codebase, but it won’t see widespread use in this mode.
    • Have runtime autodetection, to determine whether the machine the code is running on (not the compiler host machine as I have the impression your branch is trying to do) has support for the relevant instructions, and use that to determine which code to use. The CPU feature detection code isn’t hard to write, but it may involve some changes to the library to have any kind of runtime-switchable features (e.g. when does the autodetect code run, does it need to go into the context for thread-safety, …) that we’d need to have a discussion about.
  17. sipa commented at 1:07 pm on May 22, 2023: contributor

    Separate comment, as it feels unrelated to the rest.

    In inline asm blocks, all variable templates fall in one of these categories:

    • Input variables
    • Early-clobber output variable (=&)
    • Output variables (=)
    • Input-output variables (+)
    • (not actually variables) explicitly clobbered registers (which won’t be assigned to input or output variables)

    Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won’t.

    If there is a PR, I’m happy to look over it.
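
    To make the categories above concrete, here is a minimal, illustrative sketch (not code from the library or from the PR): a register that is both read and overwritten is declared as an input-output ("+") operand, which is exactly the situation with rdx/rdi described above, and avoids restoring registers by hand.

```c
/* Illustrative sketch: "+r" marks the operand as read AND written, so
 * the compiler knows the original value is destroyed. A portable
 * fallback is included so this compiles off x86_64 as well. */
#include <stdint.h>

uint64_t add_demo(uint64_t a, uint64_t b) {
#if defined(__x86_64__)
    __asm__ ("addq %1, %0"   /* a += b */
             : "+r"(a)       /* input-output: read and overwritten */
             : "r"(b)
             : "cc");        /* condition codes are clobbered */
#else
    a += b;                  /* portable fallback, same result */
#endif
    return a;
}
```

    The same idea works with specific-register constraints (e.g. "+d" for rdx): the register then counts as an output, so the compiler no longer assumes its value survives the asm block.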

  18. real-or-random commented at 2:57 pm on May 22, 2023: contributor

    I agree that we should add Fiat-Crypto first separately. I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.

    Introducing CryptoOpt is a more complex matter […]

    I think a reasonable path forward is to first add a compile-time option (maybe with a simple abort at run time, e.g., just executing an ADX instruction in the self-check), and then later add runtime autodetection. Modifying CryptOpt to emit generic x86_64 sounds more complicated, and we’d probably leave performance on the table.

  19. sipa commented at 3:19 pm on May 22, 2023: contributor

    I wonder what would be necessary for reviewers to be convinced that the code is correct and what exact guarantees the Fiat-Crypto proofs provide.

    Indeed. I wonder if @roconnor-blockstream would be interested in having a look at that too…

    (Thinking out loud) would it be possible to use Klee to prove the C code that comes out is always correct, independent from the Fiat-crypto proofs? Actually, if that’s feasible, perhaps we could try to do that right now with the current C code too.

    I think a reasonable path forward is to first add a compile-time option (maybe with a simple abort at run time, e.g., just executing an ADX instruction in the self-check), and then later add runtime autodetection.

    That seems reasonable.

    Modifying CryptoOpt to be generic x86_64 sounds more complicated, and we’d probably leave performance on the table.

    I certainly wouldn’t have non-ADCX-cryptopt-asm as the only x86_64 asm code in the library longer term; that seems like a waste. But if we don’t, what do we do for non-ADCX systems in the runtime autodetection world?

    • Fall back to the C code implementation (and deleting the existing ASM code)?
    • Fall back to the existing non-proven ASM code?
    • Replace the existing ASM code with a new one that incorporates #810 (e.g. by taking compiler output from the fiat-crypto C code)?
  20. sipa commented at 3:29 pm on May 22, 2023: contributor
    @dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don’t think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.
  21. dderjoel commented at 0:53 am on May 23, 2023: none
    • Replacing the pure-C code with Fiat-Crypto generated C code.

    I agree that we should add Fiat-Crypto first separately.

    Will create a PR later which only replaces the 64-bit C code. We can then continue to discuss there.

    maybe with a simple abort at run time, e.g., just call ADX in the self-check

    That seems reasonable.

    Would that be in src/selftest.h:29? Something like

     static int secp256k1_selftest_passes(void) {
         int ret = 1;
     #if defined(USE_ASM_X86_64)
         int subleaf = 0;
         __asm__ __volatile__("cpuid\n"
                              "mov %%rbx, %%rax\n"
                              "shr $19, %%rbx\n"
                              "shr $8, %%rax\n"
                              "and %%rbx, %%rax\n"
                              "and $1, %%rax\n"
                              : "=a"(ret), "+c"(subleaf) /* cpuid also writes ecx */
                              : "a"(7)
                              : "rdx", "rbx");
     #endif
         return secp256k1_selftest_sha256() && ret;
     }
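
    For comparison, the same feature check can be written with the compiler’s <cpuid.h> helper instead of hand-written asm. This is a sketch, not code from the fork: CPUID leaf 7, subleaf 0, where EBX bit 8 is BMI2 and EBX bit 19 is ADX; the function name is illustrative.

```c
/* Sketch: BMI2/ADX detection via __get_cpuid_count (GCC/clang).
 * Guarded so it still compiles on non-x86 targets. */
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

int have_bmi2_adx(void) {
    unsigned int eax, ebx, ecx, edx;
    /* Returns 0 if leaf 7 is not supported at all. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return 0;
    return ((ebx >> 8) & 1) && ((ebx >> 19) & 1);  /* BMI2 && ADX */
}
#else
int have_bmi2_adx(void) { return 0; }
#endif
```
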
    

    (but falling back to the C implementation, or another less-optimized asm implementation, on those is ok, IMO)

    Yes. According to my benchmarks, the C code is faster than the current asm implementation, so performance-wise it does make sense to fall back to that.

    Modify CryptoOpt to not use the new instructions,

    As @real-or-random already pointed out, that is a bit more complex, see also Issue 143 in the CryptOpt repo, where I explain why.

    No autodetection […], but it won’t see widespread use in this mode.

    Yes, I agree, which would be sad. As far as I understood, the current autodetection assumes that the code is compiled for the native system, as it checks whether the local machine is x86_64. Hence, I thought it might as well also check whether the flags are available. The only situation in which this (semantically) fails is if I compile on a (newish) x86 machine and run it on an older x86 machine. Currently this would still work, but with the flag detection only in the build pipeline it would not. I wonder how likely that is.

    Have runtime autodetection

    Yes, this is quite common in other libraries too, like OpenSSL. If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each generation (at some point I believe it’s a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))

    @dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don’t think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.

    That is a question for Owen and the Fiat-Crypto team. If I look at the current 32-bit implementation and its limbs, I think it’s 10 limbs, and the last limb has a width of 22 bits.

    When I call ./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273' It fails with

    Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])).
    The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]
    

    and a lot of intermediate code.

    On the CryptOpt side, I have not tried incorporating 32-bit code yet. This is probably another engineering task, probably on the order of “get CryptOpt to generate code without ADX / BMI2”, as everything is tailored to 64-bit registers and carry bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.

  22. dderjoel cross-referenced this on May 23, 2023 from issue Replace the 64-bit C field implementation by fiat-crypto output by dderjoel
  23. OwenConoly commented at 1:51 am on May 23, 2023: none

    When I call ./src/ExtractionOCaml/dettman_multiplication dettman 32 10 22 '2^256 - 4294968273' It fails with

    Computed bounds (Some [Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x3ffffff], Some [0x0 ~> 0x800401ea7]]) are not tight enough expected bounds not looser than (Some [Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x7fffffe], Some [0x0 ~> 0x5fffff]])).
    The bounds [0x0 ~> 0x800401ea7] are looser than the expected bounds [0x0 ~> 0x5fffff]
    

    Interesting. Apparently there’s a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn’t be extremely surprising.) It looks like there’s some extra reduction on the final limb? I’ll take a closer look at the 32-bit code to see what the difference is.

  24. dderjoel commented at 1:53 am on May 23, 2023: none

    Interesting. Apparently there’s a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn’t be extremely surprising.) It looks like there’s some extra reduction on the final limb? I’ll take a closer look at the 32-bit code to see what the difference is.

    Thanks for looking into that, Owen. FWIW, I’m not 100% sure whether my invocation is correct; it was just a (somewhat educated) guess.

  25. dderjoel commented at 2:02 am on May 23, 2023: none

    In inline asm blocks, all variable templates fall in one of these categories:

    • Input variables

    • Early-clobber output variable (=&)

    • Output variables (=)

    • Input-output variables (+)

    • (not actually variables) explicitly clobbered registers (which won’t be assigned to input or output variables)

    Normal output variables may be assigned the same register as an input variable, but early-clobber output variables won’t.

    If there is a PR, I’m happy to look over it.

    So I’ve created a branch only-asm which features the asm implementation here. Lines 202 and 203 are the instructions to restore r and b into rdi and rdx, respectively. That is not always necessary, and it also did not run through the equivalence checker. As these are sometimes dead operations, depending on when/how the code is used, it would be more elegant to tell the compiler that those registers are simply clobbered. I’m also unsure whether all the memory reads/writes are needed.

    Now I’m having a hard time sorting "D"(r) and "d"(b) into your description above.

  26. sipa commented at 3:30 am on May 23, 2023: contributor

    @dderjoel

    Would that be in src/selftest.h:29? something like …

    I think @real-or-random meant something even simpler (just some asm code that uses adcx, which presumably crashes on non-supporting systems), but your suggestion is better (and already introduces detection logic which can later be moved from selftesting to autodetection once that is feasible).

    Yes. According to my benchmarks, the C code is faster than the current asm implementation, so performance-wise it does make sense to fall back to that.

    Great.

    As @real-or-random already pointed out, that is a bit more complex, see also https://github.com/0xADE1A1DE/CryptOpt/issues/143, where I explain why.

    Ah, I see now why it’s nontrivial to avoid those instructions. I assumed you already had support for mul/imul, and it was just a matter of disabling mulx… but I get why mul/imul are actually a lot harder to deal with than mulx.

    In either case, we still do have the option of replacing the asm code with fiat-crypto-c-fed-through-c-compiler, if we find a particularly fast combination of compiler/version/options.

    Yes, I agree, which would be sad. As far as I understood, the current autodetection assumes that the code is compiled for the native system, as it checks whether the local machine is x86_64. Hence, I thought it might as well also check whether the flags are available. The only situation in which this (semantically) fails is if I compile on a (newish) x86 machine and run it on an older x86 machine. Currently this would still work, but with the flag detection only in the build pipeline it would not. I wonder how likely that is.

    I see where you’re coming from now, but no, there is absolutely no assumption that the compilation is for the native system. You tell the libsecp256k1 build system what compiler suite to use, and it targets whatever that compiler is for. By default, that’s your standard system compiler, which is most likely for the same architecture (x86/x86_64/arm32/aarch64/…) as the system you’re running on, but there is no requirement for that. And even if it’s the same architecture, it’s not necessarily for your own system. In fact, I believe that there is a significant fraction of libsecp256k1 use that is built through cross-compiling (as compilation for another system than your own is called), as Bitcoin Core’s release binaries (which include libsecp256k1) are created on a Linux x86_64 environment, cross-compiled for everything else (including arm32, aarch64, windows x86_64, mac osx, …).

    We absolutely can’t do autodetection at compile or configure time. The options are manual configuration, or runtime autodetection.

    If we have that, we could even go one step further and include one version for AMD and one for Intel, or even one for each Generation (at some point I believe it’s a bit too much to maintain, but as CryptOpt makes it very easy to generate such implementations, it becomes feasible at least :))

    Indeed. That’s the advantage of using formal methods, that they lower the review barrier significantly for having multiple options like this.

    It’d also open the door for things like #967, which adds a field with a more standard 4x64 limb representation, but which is only faster with asm optimization (because it much more crucially relies on fast carries, which are hard to get the compiler to emit for pure C code).

    This is probably another engineering task, probably on the order of “get CryptOpt to generate code without ADX / BMI2”, as everything is tailored to 64-bit registers and carry bits. But as it currently does not seem that urgent here either, this is not priority 1 on my list.

    32-bit x86 is all but dead, I absolutely wouldn’t suggest you spend any time on asm optimizing that.

    The 32-bit C code is there for weak hardware, from hardware wallet to (old) Raspberry Pi like devices. But none of those are x86. If you’re looking for more work, 32-bit ARM and (increasingly) 64-bit ARM are far more interesting targets than 32-bit x86. @OwenConoly

    Interesting. Apparently there’s a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn’t be extremely surprising.) It looks like there’s some extra reduction on the final limb? I’ll take a closer look at the 32-bit code to see what the difference is.

    Thank you for looking into it!

  27. dderjoel commented at 4:18 am on May 23, 2023: none

    which presumably crashes on non-supporting systems

    Yes, it should trigger a SIGILL in those cases.

    re, build:

    Yes, I forgot about cross compilation and the releases. In the only-asm branch, I’ve hooked the detection into the case where the build script is checking anyway, and it is conservative enough not to use the asm then. In other words, where it used to check whether it could use x86_64, it now additionally checks whether the flags are set. However, now I see that this is not enough: in the case where one wants to build for generic x86_64, one would set --with-asm=x86_64, and this would result in code which needs the flags. So as an alternative, we could add switches --has-adx and --has-bmi2 to ./configure; then, if --with-asm=x86_64 is also set (regardless of whether it was auto-detected or explicitly given), the new asm can be used. The question then is what the default should be (also auto-detect, or conservatively false?)

    If we want to use runtime detection-and-selection, then I believe the functions must be non-inlined, and e.g. here we could check a global flag to decide which one to jump to: either the ADX asm version or the C-compiled fallback.
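
    Such runtime detection-and-selection could look roughly like the following function-pointer dispatch sketch. All names here are illustrative and not libsecp256k1 API; the feature probe and both implementations are stand-ins.

```c
/* Hypothetical dispatch sketch: probe the CPU once, then route all
 * calls through a function pointer. Not real libsecp256k1 code. */
#include <stdint.h>

typedef uint64_t (*mul_fn)(uint64_t, uint64_t);

static uint64_t mul_portable(uint64_t a, uint64_t b) { return a * b; }
static uint64_t mul_adx(uint64_t a, uint64_t b)      { return a * b; } /* stand-in */

static int cpu_has_bmi2_adx(void) { return 0; } /* stand-in for a CPUID probe */

static mul_fn mul_impl;

void init_dispatch(void) {
    /* Run once, e.g. at context creation, before any field operation. */
    mul_impl = cpu_has_bmi2_adx() ? mul_adx : mul_portable;
}

uint64_t mul(uint64_t a, uint64_t b) { return mul_impl(a, b); }
```

    The open questions raised in the thread remain with this shape: where init_dispatch runs (context creation vs. a lazy check) and how to keep the global pointer thread-safe.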

  28. peterdettman commented at 6:52 am on May 23, 2023: contributor

    Interesting. Apparently there’s a qualitative difference between the 64-bit implementation and the 32-bit implementation. (I suppose this shouldn’t be extremely surprising.) It looks like there’s some extra reduction on the final limb? I’ll take a closer look at the 32-bit code to see what the difference is.

    Possibly relevant: the current 10x26 field mul/sqr leave their residual carry in limbs[2] (thus exceeding 26 bits), while the most-significant limb will be fully reduced (22 bits). The 5x52 (C code, not asm) leaves theirs in the most-significant limb (limbs[4]).

    See this PR: #815 , in particular the first commit: https://github.com/bitcoin-core/secp256k1/pull/815/commits/9388b77b03c6e0c09a950e84ba34e9a2d8e0d433 . In that PR @sipa also anticipates that this may be better for fiat-crypto integration: #815 (comment) .

    I have previously mentioned a side-benefit to our magnitude tracking from rectifying this: #1060 (comment) (“Secondly…”)

  29. dderjoel commented at 10:17 pm on May 24, 2023: none

    Okay, I’m more than happy to do the work and let the core team here review, but I’d need some guidance in a few corners, because I’m not 100% sure what we want. Currently, I see two options:

    • At compile time, include both the asm version and the C version, and at runtime check the necessary bits and decide which one to use.
    • At compile time, decide which one to compile (with switches; which ones would we want, and what would the defaults be?), and then have the check in the self-test (sort of what we have now in my fork’s only-asm branch; maybe it needs a more verbose error message).
  30. sipa commented at 10:20 pm on May 24, 2023: contributor

    @dderjoel

    I think eventually we want runtime check and proper dispatch to the right version, having both compiled in. But at this stage, I think it’s more interesting to just have a configure flag that enables either the asm or C (defaulting to C code), plus a selfcheck.

    This allows experimentation, benchmarking, and testing, but avoids the need to make whatever restructuring is needed for autodetection. Moreover, it makes it easier for someone other than you to make the autodetection happen, as all the actual code would already be in.

  31. dderjoel cross-referenced this on May 25, 2023 from issue Replace ASM with CryptOpt generated by dderjoel
  32. dderjoel commented at 5:06 am on May 25, 2023: none
    Just created a PR for exactly this. Can I trigger CI without removing the draft status? EDIT: never mind, it just took a minute to trigger by itself
  33. dderjoel commented at 2:53 am on May 30, 2023: none

    I’ve re-run the sec_bench suite on the exact fork on my 10 test machines, and the results are still very similar to the ones before.

    implementation default_asm default_c default_c52 fiat_c fiat_cryptopt
    ecmult_gen 15.8059 14.9232 15.7527 15.0524 14.3984
    ecmult_const 30.0159 28.1553 30.2437 28.3729 26.7443
    ecmult_1p 23.7137 21.99 23.5329 22.3725 20.9531
    ecmult_0p_g 16.851 15.7264 16.8364 15.926 14.9667
    ecmult_1p_g 13.8695 12.8423 13.7058 12.9949 12.2609
    field_half 0.00246964 0.00245409 0.0024321 0.00240483 0.00248599
    field_normalize 0.00738854 0.00741554 0.0073913 0.00737073 0.00733137
    field_normalize_weak 0.00294516 0.00294647 0.00294284 0.0029337 0.00292048
    field_sqr 0.0141365 0.0125201 0.0137808 0.0123229 0.0118652
    field_mul 0.0170084 0.01443 0.0156856 0.014615 0.0145313
    field_inverse 1.40032 1.39717 1.39222 1.39458 1.61234
    field_inverse_var 0.912255 0.913256 0.908407 0.91439 0.909508
    field_is_square_var 1.194 1.20539 1.20361 1.20455 1.19346
    field_sqrt 3.87147 3.55153 3.83592 3.44752 3.34927
  34. OwenConoly commented at 3:53 am on June 1, 2023: none

    @dderjoel Another question: what about the 32-bit code? Can fiat-crypto generate C code for that too? If so, does that incorporate #810 now? I don’t think we care enough about the 32-bit code to bother with asm optimizations for that, but replacing the existing code with formally-verified code would still be nice.

    An update on this: as Joel mentioned, using the same template we used for the 64-bit code did not work to generate the 32-bit code. There are two reasons that it didn’t work.

    1. When writing the 64-bit template, I hadn’t considered the fact that we might need to split modular reductions into pieces, like this. Note that we have both R1 and R0, instead of just a single constant R like in the 64-bit code. It was simple enough to make this generalization, so this isn’t a problem anymore.
    2. Conceptually, the 32-bit and 64-bit algorithms are almost identical—but there’s one difference. In the 64-bit mul, the first reduction is from p8 (the highest partial product). So, naively extrapolating, I would guess that in the 32-bit mul, the first reduction would be from p18 (the highest partial product). If we try to generate the 32-bit mul and square using Joel’s command-line invocation above, the generated code makes this “naive extrapolation” and starts out with a reduction from p18. This doesn’t work nicely at all: in the end, the value in the most significant limb ends up (potentially) being too big to even fit in a 32-bit register. The handwritten 32-bit code avoids this issue by starting out with a reduction from p17, before moving on to the reduction from p18. The main difference between the generated code and the handwritten code is just that, in the handwritten code, this reduction from p17 appears at the beginning instead of (i.e., not in addition to) later in the function, as in the generated code. This is basically the only conceptual difference.
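
    For concreteness, the split mentioned in point 1 can be checked in a few lines of C. This is a sketch under the assumption that the constants match the existing handwritten 10x26 code (R1 = 0x400, R0 = 0x3D10); it just verifies that the split reassembles the reduction multiplier.

```c
/* Sketch: p = 2^256 - 0x1000003D1, so 2^256 ≡ 0x1000003D1 (mod p) and,
 * with 10x26 limbs (10*26 = 260 bits), 2^260 ≡ 0x1000003D1 << 4 =
 * 0x1000003D10 (mod p). That value exceeds 26 bits, so the 32-bit code
 * splits it as R1*2^26 + R0 with R1 = 0x400 and R0 = 0x3D10 (assumed
 * here to match the handwritten field_10x26 constants). */
#include <stdint.h>

uint64_t reduction_constant(void) {
    return (uint64_t)0x1000003D1ULL << 4;    /* 2^260 mod p = 0x1000003D10 */
}

uint64_t split_r1_r0(void) {
    return ((uint64_t)0x400 << 26) + 0x3D10; /* R1*2^26 + R0 */
}
```
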

    I am curious whether there is a simple explanation (other than “it makes the bounds work out right”) for why we need to start the 32-bit mul with this reduction from p17, but we don’t need to start the 64-bit mul with a reduction from p7. Ideally, I’d like to be able to answer questions like “Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?” Of course it isn’t vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.

  35. OwenConoly commented at 3:56 am on June 1, 2023: none

    I overcame issue (2) above simply by creating a Coq implementation for the 32-bit mul which is separate from the implementation of the 64-bit mul. I followed along the C implementation from the first commit 9388b77 of #815, which @peterdettman mentioned above.

    Here’s the C code I generated for the 32-bit mul and sqr: secp256k1_dettman_32.c. Would you be interested in including this code in the library as well? I assume we’d want some benchmarks before considering that too seriously, and I gather from the discussion in #815 that not many people have 32-bit machines on which to benchmark things like this.

    Note that the generated code does not include the Karatsuba optimization from the third commit of #815. I think it would be interesting to add the Karatsuba multiplication to fiat-crypto, so I would be willing to add that optimization as well.

    And the 32-bit code I generated does incorporate the optimization from #810. The handwritten 32-bit code from #815 also has this optimization.

  36. peterdettman commented at 9:45 am on June 1, 2023: contributor

    I am curious whether there is a simple explanation (other than “it makes the bounds work out right”) for why we need to start the 32-bit mul with this reduction from p17, but we don’t need to start the 64-bit mul with a reduction from p7. Ideally, I’d like to be able to answer questions like “Hypothetically, if we were on a 16-bit machine (with a 20-limb implementation of this algorithm), then which partial product is our first reduction from? p36 (the third-highest partial product), perhaps?” Of course it isn’t vital to answer questions like this, but it would be nice to view both the 32-bit mul and the 64-bit mul as different instances of the same general algorithm.

    Well, it pretty much is just “it makes the bounds work out right”, but the crux is which of the high partial products we can do last so that our residual carry lands neatly in the most-significant limb. We then start one higher than that.

    For 10x26:

    • p18 produces a 91-bit reduced result that doesn’t fit in r8, r9, i.e., 49 bits.
    • p17 produces a 95-bit reduced result that doesn’t fit in r7, r8, r9, i.e., 75 bits.
    • p16 produces a 98-bit reduced result that does fit in r6, r7, r8, r9, i.e., 101 bits.

    So our best option for the last partial products step is p6/p16, and a final carry chain landing in r9 as desired.

    For 5x52:

    • p8 produces a 143-bit reduced result that doesn’t fit in r3, r4, i.e., 101 bits.
    • p7 produces a 147-bit reduced result that does fit in r2, r3, r4, i.e., 153 bits.

    So our best option for the last partial products step is p2/p7, and a final carry chain landing in r4 as desired.

    For our field there’s (conservatively) an extra bit coming from the lower partial product in each pair and the carry-in (that I ignored above), but for a field with a prime very (very) close to a power of 2, the lower partial product would need to be included in the analysis properly.

    Regarding 16-bit, I doubt a practical algorithm would try to follow the pattern of these implementations. If a reduced radix is even tenable I would guess it would have to use Karatsuba, perhaps with 20 limbs in four groups of 64 bits: (13 13 13 13 12). (Note that this is already incompatible with the current assumption of 4 extra bits for carry-free linear field ops).

  37. peterdettman commented at 9:48 am on June 1, 2023: contributor
    Also, +1 in principle to including also the 32-bit code.
  38. dderjoel cross-referenced this on Jun 22, 2023 from issue Alternative to uint128 by davidben

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-30 03:15 UTC
