Implement the modular inverse using unsigned 256-bit integers addition and shifts #1073

pull k06a wants to merge 13 commits into bitcoin-core:master from 1inch:feature/new-modinv changing 4 files +311 −7
  1. k06a commented at 8:24 pm on February 7, 2022: none

    Original algorithm borrowed from this paper (LS3): https://www.researchgate.net/publication/304417579_Modular_Inverse_Algorithms_Without_Multiplications_for_Cryptographic_Applications

    Was improved by Anton Bukov (@k06a) and Mikhail Melnik (@zumzoom) to use unsigned 256-bit integers: https://gist.github.com/k06a/b990b7c7dda766d4f661e653d6804a53

    This would allow to avoid usage of 62-bit signed representation and compute modinv without 256-bit multiplications and division.

    Code is passing all the tests, but we injected some code to benchmark new also, need to move it to the bench target for sure, maybe someone could help with this?

    Running tests on ARM64 (Apple M1 Max) gives 20% improvement for modinv() method:

     0t1 = clock();
     1for (i = 0; i < 1000000; i++) {
     2    secp256k1_scalar_to_signed62(&s, x);
     3    secp256k1_modinv64(&s, &secp256k1_const_modinfo_scalar);
     4    secp256k1_scalar_from_signed62(r, &s);
     5    CHECK(secp256k1_scalar_eq(r, &rrr));
     6}
     7t2 = clock();
     8for (i = 0; i < 1000000; i++) {    
     9    secp256k1_modinv64_scalar(r, x, &secp256k1_const_mod_scalar);
    10    CHECK(secp256k1_scalar_eq(r, &rrr));
    11}
    12t3 = clock();
    13printf("Time old: %f\n", ((double)(t2 - t1)) / CLOCKS_PER_SEC);
    14printf("Time new: %f\n", ((double)(t3 - t2)) / CLOCKS_PER_SEC);
    15printf("Improvement: %.2f%%\n", (1 - (double)(t3 - t2) / (double)(t2 - t1)) * 100);
    

    =>

    0Time old: 3.284568
    1Time new: 2.639177
    2Improvement: 19.65%
    

    Please help to test in on x86 and x64 architectures, we tested only ARM64.

  2. New modinv implementation 9769931b3e
  3. Introduce constant-time versions of MSB and CMP methods for scalar ab689a5a6e
  4. Introduce non-modular add and neg for secp256k1_scalar bdc7cce196
  5. Fix bugs and add minus operation for secp256k1_scalar 5a11a9a8fb
  6. Benchmarking 7fa7e59b84
  7. Some refactoring 5dc4a30232
  8. Added comment 206a75a010
  9. Remove unnecessary parameter 9758f4d658
  10. Introduce secp256k1_scalar_msb_hint f3daaf6819
  11. Comment out modinv benchmarks 879ecba353
  12. k06a marked this as ready for review on Feb 7, 2022
  13. Remove secp256k1_scalar_print 2737a6177a
  14. sipa commented at 8:59 pm on February 7, 2022: contributor

    Benchmarks on AMD Ryzen 9 5950x:

    On master (85b00a1c65c28fb6bc4381e7bf24cc349569b814):

    0$ ./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2
    3scalar_inverse                ,     1.28      ,     1.29      ,     1.44   
    4scalar_inverse_var            ,     0.873     ,     0.875     ,     0.881  
    

    This PR (879ecba353f1346bd03c0ed41656884f201bde27):

    0$ ./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2
    3scalar_inverse                ,     1.29      ,     1.31      ,     1.43   
    4scalar_inverse_var            ,     4.69      ,     4.71      ,     4.72   
    

    So this looks like a ~5x slowdown. Is your benchmark realistic? Is the number being inverted perhaps highly structured in your benchmark?

  15. Revert 8x32 changes 19f524b92f
  16. sipa commented at 9:09 pm on February 7, 2022: contributor

    On Developerbox 1GHz 24-core Cortex-A53 (ARM64):

    On master (85b00a1c65c28fb6bc4381e7bf24cc349569b814):

    0$ ./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2
    3scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
    4scalar_inverse_var            ,     7.54      ,     7.54      ,     7.55   
    

    This PR (2737a6177a3521fca937db01434124ab02783b78):

    0$ ./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2
    3scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
    4scalar_inverse_var            ,    36.1       ,    36.1       ,    36.1    
    

    Also a ~5x slowdown.

  17. k06a commented at 9:13 pm on February 7, 2022: none

    @sipa running benchmark for me gave same numbers in both branches somehow, please try to run tests under 64-bit architecture and uncomment following lines:

    1. scalar_4x64_impl.h#L981: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L981
    2. scalar_4x64_impl.h#L985-L1006: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L985-L1006
  18. k06a commented at 9:13 pm on February 7, 2022: none
    Ah I see you run scalar_inverse_var benchmark directly!
  19. k06a commented at 9:15 pm on February 7, 2022: none

    For my laptop (M1 Max CPU) results are the following:

    On master:

    0./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2scalar_inverse                ,     1.87      ,     1.91      ,     1.96   
    3scalar_inverse_var            ,     4.46      ,     4.49      ,     4.52   
    4field_inverse                 ,     1.87      ,     1.90      ,     1.91   
    5field_inverse_var             ,     1.15      ,     1.17      ,     1.26   
    

    On this PR:

    0./bench_internal inverse
    1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    2scalar_inverse                ,     1.89      ,     1.97      ,     2.32   
    3scalar_inverse_var            ,     4.49      ,     4.54      ,     4.60   
    4field_inverse                 ,     1.91      ,     1.93      ,     1.96   
    5field_inverse_var             ,     1.16      ,     1.17      ,     1.19 
    
  20. sipa commented at 9:17 pm on February 7, 2022: contributor
    @k06a Sure you recompiled in between running the benchmarks? Vartime scalar inverse should be faster than constant-time, and in your benchmark it seems slower for both master and your PR.
  21. k06a commented at 9:23 pm on February 7, 2022: none
    0make && ./bench_internal inverse
    

    On master:

    0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    1scalar_inverse_var            ,     1.16      ,     1.17      ,     1.17   
    

    On this PR:

    0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
    1scalar_inverse_var            ,     4.45      ,     4.52      ,     4.62   
    
  22. k06a commented at 9:24 pm on February 7, 2022: none
    Curious why we saw 20% improvement running tests, maybe different compilation options or compiler optimised some iterations…
  23. sipa commented at 9:26 pm on February 7, 2022: contributor

    @k06a My guess is that you were benchmarking inversion of the number 1 or so, or some other highly structured number.

    It makes no sense to me why an algorithm like this would be faster than what we have. It seems designed for hardware in which multiplication is expensive, which is increasingly not the case in modern CPUs.

  24. Some cleanup 787e165e10
  25. k06a commented at 9:30 pm on February 7, 2022: none
    @sipa it was 265-bit value, actually value from first test.
  26. k06a commented at 9:36 pm on February 7, 2022: none
    Now we at least know how to run benchmarks properly :)
  27. sipa commented at 9:37 pm on February 7, 2022: contributor
    @k06a It’s also possible that you’re training the CPU’s branch predictor non-representatively by inverting the same number over and over again (or alternating between the two same numbers). The scalar_inverse_var benchmark in bench_internal avoids that by adding a constant randomish term after every inversion.
  28. k06a commented at 9:41 pm on February 7, 2022: none
    @sipa we saw strange performance fluctuations during adding some changes, very likely we became victims of branch prediction.
  29. mratsim commented at 9:46 am on February 9, 2022: none

    Your algorithm is similar to the classic Euclid algorithm. You can easily use 256-bit full-width numbers, see the (constant-time) algorithm from GMP by Niels Moller.

    invmod Niels Moller

    It’s even simpler than your algorithm because there is no need for absolute value. There is no check for less than 0, it only operates on the relative difference of unsigned u and v. You might note a u < 0 on line 12, but this can be tested on the previous line by getting the borrow from u-v and then conditionally add the modulus so it’s just regular unsigned modular addition.

    Performance

    Compared to algorithms based on divsteps/transition matrix (Bernstein-Yang and Pornin’s), both yours and Moller full-width operations at each iteration while divsteps only use full-width operations once every 62 iterations. Even if the full-width operations are 4~5x more costly (applying a transition matrix to 4 bigints) that’s still a theoretical 12~15x speedup. This leaves a lot of time for “book-keeping” operations like change of base to 2^62.

    See more details here #767 (comment) (warning: rabbit hole).

    Besides conversion to/from 62-bit representation is costless, x86-64 can issue at least 2 shifts per cycle and there are 10 shifts to issue for a total of 5 cycles at most.

  30. k06a commented at 12:17 pm on February 9, 2022: none
    @mratsim thx for the detailed explanation!
  31. sipa closed this on Apr 16, 2022


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-10-30 03:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me