Implement the modular inverse using unsigned 256-bit integers addition and shifts

k06a commented at 8:24 pm on February 7, 2022: none

Original algorithm borrowed from this paper (LS3): https://www.researchgate.net/publication/304417579_Modular_Inverse_Algorithms_Without_Multiplications_for_Cryptographic_Applications

Was improved by Anton Bukov (@k06a) and Mikhail Melnik (@zumzoom) to use unsigned 256-bit integers: https://gist.github.com/k06a/b990b7c7dda766d4f661e653d6804a53

This would allow to avoid usage of 62-bit signed representation and compute modinv without 256-bit multiplications and division.

Code is passing all the tests, but we injected some code to benchmark new also, need to move it to the bench target for sure, maybe someone could help with this?

Running tests on ARM64 (Apple M1 Max) gives 20% improvement for modinv() method:

 0t1 = clock();
 1for (i = 0; i < 1000000; i++) {
 2    secp256k1_scalar_to_signed62(&s, x);
 3    secp256k1_modinv64(&s, &secp256k1_const_modinfo_scalar);
 4    secp256k1_scalar_from_signed62(r, &s);
 5    CHECK(secp256k1_scalar_eq(r, &rrr));
 6}
 7t2 = clock();
 8for (i = 0; i < 1000000; i++) {    
 9    secp256k1_modinv64_scalar(r, x, &secp256k1_const_mod_scalar);
10    CHECK(secp256k1_scalar_eq(r, &rrr));
11}
12t3 = clock();
13printf("Time old: %f\n", ((double)(t2 - t1)) / CLOCKS_PER_SEC);
14printf("Time new: %f\n", ((double)(t3 - t2)) / CLOCKS_PER_SEC);
15printf("Improvement: %.2f%%\n", (1 - (double)(t3 - t2) / (double)(t2 - t1)) * 100);

=>

0Time old: 3.284568
1Time new: 2.639177
2Improvement: 19.65%

Please help to test in on x86 and x64 architectures, we tested only ARM64.

New modinv implementation 9769931b3e

Introduce constant-time versions of MSB and CMP methods for scalar ab689a5a6e

Introduce non-modular add and neg for secp256k1_scalar bdc7cce196

Fix bugs and add minus operation for secp256k1_scalar 5a11a9a8fb

Benchmarking 7fa7e59b84

Some refactoring 5dc4a30232

Added comment 206a75a010

Remove unnecessary parameter 9758f4d658

Introduce secp256k1_scalar_msb_hint f3daaf6819

Comment out modinv benchmarks 879ecba353

k06a marked this as ready for review on Feb 7, 2022

Remove secp256k1_scalar_print 2737a6177a

sipa commented at 8:59 pm on February 7, 2022: contributor

Benchmarks on AMD Ryzen 9 5950x:

On master (85b00a1c65c28fb6bc4381e7bf24cc349569b814):

0$ ./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2
3scalar_inverse                ,     1.28      ,     1.29      ,     1.44   
4scalar_inverse_var            ,     0.873     ,     0.875     ,     0.881

This PR (879ecba353f1346bd03c0ed41656884f201bde27):

0$ ./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2
3scalar_inverse                ,     1.29      ,     1.31      ,     1.43   
4scalar_inverse_var            ,     4.69      ,     4.71      ,     4.72

So this looks like a ~5x slowdown. Is your benchmark realistic? Is the number being inverted perhaps highly structured in your benchmark?

Revert 8x32 changes 19f524b92f

sipa commented at 9:09 pm on February 7, 2022: contributor

On Developerbox 1GHz 24-core Cortex-A53 (ARM64):

On master (85b00a1c65c28fb6bc4381e7bf24cc349569b814):

0$ ./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2
3scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
4scalar_inverse_var            ,     7.54      ,     7.54      ,     7.55

This PR (2737a6177a3521fca937db01434124ab02783b78):

0$ ./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2
3scalar_inverse                ,    12.8       ,    12.8       ,    12.8    
4scalar_inverse_var            ,    36.1       ,    36.1       ,    36.1

Also a ~5x slowdown.

k06a commented at 9:13 pm on February 7, 2022: none

@sipa running benchmark for me gave same numbers in both branches somehow, please try to run tests under 64-bit architecture and uncomment following lines:

scalar_4x64_impl.h#L981: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L981
scalar_4x64_impl.h#L985-L1006: https://github.com/1inch/secp256k1/blob/19f524b92fcc0ef3df99c266ce75547ae8b03a3e/src/scalar_4x64_impl.h#L985-L1006

k06a commented at 9:13 pm on February 7, 2022: none

Ah I see you run scalar_inverse_var benchmark directly!

k06a commented at 9:15 pm on February 7, 2022: none

For my laptop (M1 Max CPU) results are the following:

On master:

0./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2scalar_inverse                ,     1.87      ,     1.91      ,     1.96   
3scalar_inverse_var            ,     4.46      ,     4.49      ,     4.52   
4field_inverse                 ,     1.87      ,     1.90      ,     1.91   
5field_inverse_var             ,     1.15      ,     1.17      ,     1.26

On this PR:

0./bench_internal inverse
1Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
2scalar_inverse                ,     1.89      ,     1.97      ,     2.32   
3scalar_inverse_var            ,     4.49      ,     4.54      ,     4.60   
4field_inverse                 ,     1.91      ,     1.93      ,     1.96   
5field_inverse_var             ,     1.16      ,     1.17      ,     1.19

sipa commented at 9:17 pm on February 7, 2022: contributor

@k06a Sure you recompiled in between running the benchmarks? Vartime scalar inverse should be faster than constant-time, and in your benchmark it seems slower for both master and your PR.

k06a commented at 9:23 pm on February 7, 2022: none

0make && ./bench_internal inverse

On master:

0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
1scalar_inverse_var            ,     1.16      ,     1.17      ,     1.17

On this PR:

0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)    
1scalar_inverse_var            ,     4.45      ,     4.52      ,     4.62

k06a commented at 9:24 pm on February 7, 2022: none

Curious why we saw 20% improvement running tests, maybe different compilation options or compiler optimised some iterations…

sipa commented at 9:26 pm on February 7, 2022: contributor

@k06a My guess is that you were benchmarking inversion of the number 1 or so, or some other highly structured number.

It makes no sense to me why an algorithm like this would be faster than what we have. It seems designed for hardware in which multiplication is expensive, which is increasingly not the case in modern CPUs.

Some cleanup 787e165e10

k06a commented at 9:30 pm on February 7, 2022: none

@sipa it was 265-bit value, actually value from first test.

k06a commented at 9:36 pm on February 7, 2022: none

Now we at least know how to run benchmarks properly :)

sipa commented at 9:37 pm on February 7, 2022: contributor

@k06a It’s also possible that you’re training the CPU’s branch predictor non-representatively by inverting the same number over and over again (or alternating between the two same numbers). The scalar_inverse_var benchmark in bench_internal avoids that by adding a constant randomish term after every inversion.

k06a commented at 9:41 pm on February 7, 2022: none

@sipa we saw strange performance fluctuations during adding some changes, very likely we became victims of branch prediction.

mratsim commented at 9:46 am on February 9, 2022: none

Your algorithm is similar to the classic Euclid algorithm. You can easily use 256-bit full-width numbers, see the (constant-time) algorithm from GMP by Niels Moller.

invmod Niels Moller

It’s even simpler than your algorithm because there is no need for absolute value. There is no check for less than 0, it only operates on the relative difference of unsigned u and v. You might note a u < 0 on line 12, but this can be tested on the previous line by getting the borrow from u-v and then conditionally add the modulus so it’s just regular unsigned modular addition.

Performance

Compared to algorithms based on divsteps/transition matrix (Bernstein-Yang and Pornin’s), both yours and Moller full-width operations at each iteration while divsteps only use full-width operations once every 62 iterations. Even if the full-width operations are 4~5x more costly (applying a transition matrix to 4 bigints) that’s still a theoretical 12~15x speedup. This leaves a lot of time for “book-keeping” operations like change of base to 2^62.

See more details here #767 (comment) (warning: rabbit hole).

Besides conversion to/from 62-bit representation is costless, x86-64 can issue at least 2 shifts per cycle and there are 10 shifts to issue for a total of 5 cycles at most.

k06a commented at 12:17 pm on February 9, 2022: none

@mratsim thx for the detailed explanation!

sipa closed this on Apr 16, 2022

Implement the modular inverse using unsigned 256-bit integers addition and shifts #1073

Performance