WIP: “safegcd” field and scalar inversion #767

peterdettman wants to merge 34 commits into bitcoin-core:master from peterdettman:safegcd_inv, changing 8 files (+1971 −0)
  1. peterdettman commented at 8:08 am on July 15, 2020: contributor
    • see “Fast constant-time gcd computation and modular inversion” by Daniel J. Bernstein and Bo-Yin Yang https://gcd.cr.yp.to

    Implements constant-time field and scalar inversion using a Euclidean-style algorithm operating on the least-significant bits. It follows the curve25519 case study as outlined in section 12 of the above paper, with some small tweaks. I do not have the actual curve25519 code to compare to, but this PR appears to be already 15-20% faster than their reported 10050 haswell cycles for field inversion. Actually, this initial measurement was in error, but we did reach 5-10% faster in the end, without any asm.

    Performance comparison (gcc 10.1.0, -O3, haswell, endo=yes, asm=no, field=64bit scalar=64bit):

    master: scalar_inverse: min 12.4us field_inverse: min 5.04us ecdh: min 59.7us ecdsa_sign: min 47.9us

    safegcd_inv: scalar_inverse: min 3.45us field_inverse: min 3.23us ecdh: min 56.2us ecdsa_sign: min 37.7us

    I’ve not done a 32bit version yet, although it is mostly analogous and should have an even greater relative advantage over Fermat. The field and scalar implementations have substantial common code that is duplicated here for simplicity.

  2. "safegcd" field and scalar inversion
    - see "Fast constant-time gcd computation and modular inversion" by Daniel J. Bernstein and Bo-Yin Yang https://gcd.cr.yp.to()
    9fbe4854fe
  3. Fix secp256k1_scalar_is_even/scalar_low issue 4fab082c9a
  4. real-or-random commented at 10:01 am on July 15, 2020: contributor

    I don’t have time to look at this right now but this is very neat!

    related to https://github.com/bitcoin-core/secp256k1/pull/730

  5. peterdettman commented at 10:17 am on July 15, 2020: contributor

    I should add that for variable time inversion we can just make a simple rewrite of _divsteps_62 and it’s already faster than the existing one:

    field_inverse: min 2.10us <- (actually a variable-time implementation) field_inverse_var: min 2.22us <- (unmodified secp256k1_fe_inv_var)

    _divsteps_62 is also an ideal target for asm with a bunch of conditional swaps and negates (although I’m not sure how smart gcc is already being about them).

  6. real-or-random commented at 1:13 pm on July 15, 2020: contributor

    I don’t have time to look at this right now but this is very neat! I see a few TODOs in the code but nothing substantial. Can you comment on how much this is “WIP” (except for the fact that 32-bit is missing)?

    _divsteps_62 is also an ideal target for asm with a bunch of conditional swaps and negates (although I’m not sure how smart gcc is already being about them).

    An interesting data point is #726 (comment) which shows that current versions of gcc -O3 outperform our ASM (but we switched to a default of -O2 a few months ago). So ASM is in general getting less and less interesting for us.

  7. peterdettman commented at 1:50 pm on July 15, 2020: contributor

    It’s conceptually complete at the top level. So besides general review, the main TODO is a thorough bounds analysis (they are quite tight for performance) and in a couple of places I expect that could require an extra carry step or two. The code also uses arithmetic right shift as things stand (although it’s not used in the main inner loop), which I guess will provoke discussion.

    Of course there’s the 32bit version, and some organisational work to share common code in a sensible way.

    Apart from the code itself, we would probably want to get some outside opinions from mathematicians on the paper’s proofs (meaning no offence to the authors), as it is less obvious than Fermat inversion. Indeed, in Appendix F.2 of the paper the authors show that an earlier analysis of a similar algorithm (“A binary recursive gcd algorithm”, Stehlé and Zimmermann) was incorrect.

  8. mratsim commented at 3:14 pm on July 15, 2020: none

    I wonder if you measured against Niels Möller’s constant-time inversion in Nettle/Botan/GMP?

    The Yang-Bernstein GCD seems to require significantly more operations compared to algorithm 5 of “Fast Software Polynomial Multiplication on ARM Processors Using the NEON Engine”

    [image]

    I’ve outlined my performance concerns in the Botan issue https://github.com/randombit/botan/issues/1479#issuecomment-601646479

  9. peterdettman commented at 4:32 pm on July 15, 2020: contributor

    @mratsim I have not, but frankly I don’t think it could improve on these results because it is performing full-length operations on the operands in each loop iteration. This has been the fundamental performance barrier for all of the binary GCD style algorithms that several contributors have attempted here and elsewhere over the years. So I am focused now on the 2-adic division, “bottom bits” algorithms.

    This PR performs 62 iterations at a time on the “bottom bits” only (see the …_divsteps_62 method(s)), and only updates the full sized operands 12 times. It’s true safegcd has more total iterations (744 vs 512), but they are much much cheaper. safegcd also defers the full u/v calculation to the post-processing phase (about 25% of total time in this PR), where the matrices can be combined very efficiently.
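
    To make that structure concrete, here is a hedged, non-constant-time sketch of what a single divsteps_62 call computes (the PR’s actual code is branchless “bit-twiddling” and its exact matrix convention may differ; all names below are illustrative). It runs 62 divsteps on the low bits of f and g while accumulating the 2x2 transition matrix, which is then applied once to the full-length operands:

    #include <stdint.h>

    /* Sketch only: f and g hold the low 62-bit limbs of the full values
     * (|f|, |g| < 2^62, so none of the additions below can overflow), and
     * eta is -delta from the paper, so it starts at -1.  After the loop,
     *   2^62 * f_out = t[0]*f_in + t[1]*g_in
     *   2^62 * g_out = t[2]*f_in + t[3]*g_in
     * where f_in/g_in and f_out/g_out are the full-length values before and
     * after these 62 divsteps. */
    static int64_t divsteps_62_sketch(int64_t eta, int64_t f, int64_t g, int64_t t[4]) {
        int64_t u = 1, v = 0, q = 0, r = 1;
        int i;
        for (i = 0; i < 62; ++i) {
            if (eta < 0 && (g & 1)) {
                /* delta > 0 and g odd: swap the roles of f and g. */
                int64_t of = f, ou = u, ov = v;
                eta = -eta - 1;
                f = g;            u = 2 * q;  v = 2 * r;
                g = (g - of) / 2; q -= ou;    r -= ov;
            } else if (g & 1) {
                eta -= 1;
                g = (g + f) / 2;  q += u;     r += v;
                u *= 2;           v *= 2;
            } else {
                eta -= 1;
                g /= 2;
                u *= 2;           v *= 2;
            }
        }
        t[0] = u; t[1] = v; t[2] = q; t[3] = r;
        return eta;
    }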

    At 256 bits, I suppose it’s possible that the NEON vector instructions dramatically change the accounting, and perhaps something similar is possible with x86_64 SIMD extensions; I would be very interested to see the results of either. Of course there are opportunities for SIMD to improve this PR also.

  10. TODOs and comments 0b90a57f7e
  11. VERIFY_CHECK _divsteps_62 loop invariant 0c3869a46d
  12. gmaxwell commented at 9:36 am on July 20, 2020: contributor
    VERY COOL. One comment on correctness is that inversions are easy and fairly cheap to test at runtime, so at least flaws can be turned into clean crashes instead of incorrect results– for whatever consolation that is. :)
  13. gmaxwell commented at 9:47 am on July 20, 2020: contributor

    One could even implement code that multiplies to check the inverse, and if it’s wrong redoes the operation using the ladder.

    This would be a fringe timing channel if it were triggerable (but it’s easy to test to be confident that it’s rare, and to randomize it if there are untrusted inputs to a constant-time version), so I think that wouldn’t be a problem.

    This would essentially guarantee algorithmic correctness at little cost, assuming the software just didn’t have an ordinary programming flaw.

    So I think concerns about correctness and a shortage of computational number theorists need not stand in the way of this work.

  14. More checks and comments 11b525c71c
  15. Update f,g at full length until proper analysis 3ae7179ad7
  16. Initial 32bit safegcd
    - definitely needs bounds analysis
    2f643ad31d
  17. in src/field_5x52_impl.h:633 in 0c3869a46d outdated
    628+    r[2] = (a2 >> 20 | a3 << 32) & M62;
    629+    r[3] = (a3 >> 30 | a4 << 22) & M62;
    630+    r[4] =  a4 >> 40;
    631+}
    632+
    633+static int secp256k1_fe_divsteps_62(uint16_t eta, uint64_t f0, uint64_t g0, int64_t *t) {
    


    real-or-random commented at 12:29 pm on July 20, 2020:
    Just one thing I noticed when skimming through it: The types of eta are pretty inconsistent. You use uint16_t inside the function, return an int and then store it in an int16_t.

    peterdettman commented at 7:22 am on July 22, 2020:
    Sure, it should be cleaned up. It’s logically a signed value (~ in [-744,744]) in the sense that the algorithm as-written switches on the sign of it, but the actual constant-time code for _divsteps is in “bit-twiddling” style, and I wanted to make it clear that there’s no reliance (in this method) on arithmetic right-shift. Conceptually maybe it’s (u)int_fast16_t, but I was a bit lazy about dealing with the variable sizes before things worked. Possibly uint64_t (resp. uint32_t for 32bit) makes more sense.
  18. peterdettman commented at 5:25 pm on July 21, 2020: contributor

    There’s now 32-bit versions for field and scalar inversion. It needs bounds analysis, carry chains checked, etc., but these are the current timing comparisons with 32bit scalar and field configured (bearing in mind the test machine is x86_64):

    master: scalar_inverse: min 38.9us field_inverse: min 7.36us

    safegcd_inv: scalar_inverse: min 4.35us field_inverse: min 3.70us

  19. in src/field_5x52_impl.h:655 in 0c3869a46d outdated
    650+        u ^= y; q ^= y; q ^= c1; q -= c1;
    651+
    652+        z = (v ^ r) & c1;
    653+        v ^= z; r ^= z; r ^= c1; r -= c1;
    654+
    655+        eta = (eta ^ (uint16_t)c1) - (uint16_t)c1 - 1;
    


    real-or-random commented at 10:18 pm on July 21, 2020:
    On a related note, seeing integer types smaller than int makes me somewhat nervous. Most people looking at this line for example won’t notice that all the arithmetic here is performed on (signed) int due to integer promotion rules. (I believe that’s not an issue in this line, it’s just a general remark.)

    peterdettman commented at 7:24 am on July 22, 2020:
    Am I just missing a U i.e. 1U or else could you explain?

    real-or-random commented at 11:04 am on July 22, 2020:

    The way integer operations (and this includes bitwise as well as unary operators, shifts and comparisons) are done in C can be pretty unexpected for types smaller than int: Before any other rules apply, “Integer types smaller than int are promoted when an operation is performed on them. If all values of the original type can be represented as an int, the value of the smaller type is converted to an int; otherwise, it is converted to an unsigned int” [1]. On our targets, int is the same as int32_t, so a uint16_t for sure fits in an int. The possibility that an unsigned integer is converted to a signed integer is fucked up because signed overflow is undefined behavior. And compilers have been reported to rely on this [2, 3].

    For this line here, this means eta ^ (uint16_t)c1 is evaluated by first converting eta and (uint16_t)c1 to int and then performing the ^. Then for the -, the right operand (uint16_t)c1 is converted to int. In those cases, it’s easy to see that it doesn’t make a difference, at least for our targets.

    In general, if you ignore the theoretical cases like int 17 is bits wide, it seems (without a proof) that the only interesting operations are multiplication, left shifts, right shifts, and comparisons, the former two because of possible overflows, comparisons because of issues with comparing signed and unsigned values, and right shifts because it’s relevant whether the left operand is signed or unsigned.

    The real fuckup here is that the stdint.h types were supposed to be used for portable arithmetic but portable arithmetic is not possible in C because all integer arithmetic, no matter what type, implicitly depends on the size of int by the above rule. Our code would probably blow up on a target where int is 64 bits. (Yes, this exists.)

    The typical trick to work around this is to start the arithmetic with an unsigned int value, e.g., (1U * eta) ^ (uint16_t)c1. No need to point out that this is ugly.
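
    As a standalone illustration of the hazard and the 1U trick (hypothetical example code, not from this PR):

    #include <stdint.h>

    /* With 32-bit int, both operands are promoted from uint16_t to (signed)
     * int, so e.g. 0xFFFF * 0xFFFF overflows int: that is undefined behavior
     * even though the declared types are unsigned. */
    uint16_t mul_promoted(uint16_t a, uint16_t b) {
        return (uint16_t)(a * b);
    }

    /* Workaround: start the arithmetic on an unsigned int value so any
     * wraparound is well-defined; only the (well-defined) narrowing back to
     * uint16_t remains. */
    uint16_t mul_workaround(uint16_t a, uint16_t b) {
        return (uint16_t)(1U * a * b);
    }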

    [1] https://wiki.sei.cmu.edu/confluence/display/c/INT02-C.+Understand+integer+conversion+rules
    [2] https://stackoverflow.com/questions/24795651/whats-the-best-c-way-to-multiply-unsigned-integers-modularly-safely (the link is for C++ but the same rules apply in this case)
    [3] https://www.cryptologie.net/article/418/integer-promotion-in-c/

  20. gmaxwell commented at 0:40 am on July 22, 2020: contributor

    iMX.6 (arm7)

    Before: scalar_inverse: min 338us / avg 338us / max 342us scalar_inverse_var: min 13.7us / avg 13.8us / max 13.9us field_inverse: min 209us / avg 210us / max 210us field_inverse_var: min 13.6us / avg 13.6us / max 13.7us

    After: scalar_inverse: min 33.7us / avg 33.8us / max 34.3us scalar_inverse_var: min 13.6us / avg 13.6us / max 13.7us field_inverse: min 31.5us / avg 31.5us / max 31.6us field_inverse_var: min 13.5us / avg 13.5us / max 13.6us

    I suppose I shouldn’t let the fact that it only matched GMP for the variable time here distract me from the fact that it’s 10x faster for the constant time scalar inverse. This will be a big speedup for GMP-less ECDSA validation and general signing on arm.

    Edit: ah, I guess it’s not actually using a variable-time version of this algorithm for the variable-time ones? Even using the constant-time one would still be a big speedup for the non-GMP builds. It would sure be nice to get rid of optional linking against GMP.

  21. gmaxwell commented at 4:44 am on July 22, 2020: contributor

    Gmp-less comparison.

    iMX.6 (arm7) no-asm Before: ecdsa_verify: min 2810us / avg 2815us / max 2832us After: ecdsa_verify: min 2504us / avg 2506us / max 2510us

    iMX.6 (arm7) with-asm Before: ecdsa_verify: min 1383us / avg 1387us / max 1396us After: ecdsa_verify: min 1074us / avg 1075us / max 1076us

    Obviously this speedup would be greater w/ a variable time version of the new code in use.

  22. peterdettman commented at 9:35 am on July 22, 2020: contributor

    @gmaxwell Agreed that random blinding can be useful in the ways you describe. Still it starts to lean towards your point elsewhere that one could just use the variable-time alg. with blinding anyway.

    Yes, there’s no variable-time in the PR yet. I gave an example timing above from minor changes to _divsteps_62. I did experiment a bit further with the 64bit version and got var-time field inverse down to around ~1.75us, vs. ~3.25us for const-time and ~2.2us for the gmp one. That’s already enough to dispense with gmp for 64bit, but I’m sure it can be improved further.

    For the arm7 numbers (thanks - I should really try and get a hardware setup for this), I am actually a bit surprised at how slow the “Before” field_inverse is (especially relative to gmp for inv_var). I guess this has to be down to fe_mul, fe_sqr being slow (slow mul instruction?), which partially affects the timings for this PR. As things stand, beating 32bit gmp on inv_var would appear to require about 2.5x speedup over this PR.

    Could you please satisfy my curiosity and give the full output of bench_internal on that arm7? I also have some possible variations of _fe_mul that I could PR which it would be very helpful to get measured on real devices.

  23. JeanLucPons commented at 2:37 pm on July 22, 2020: none

    Hello, VanitySearch uses a very similar implementation of modular inversion (called DRS62, delayed right shift 62 bits). It operates on the 62 least-significant bits of the integer and maintains a 2x2 matrix in the inner XCD loop. It is a variable time implementation.

    https://github.com/JeanLucPons/VanitySearch/blob/3ba22e9072db540a445d37471e7969bff92fa504/IntMod.cpp#L380 It is between 2 and 3 times faster than Fermat/Euler method on Secp256K1 prime.

  24. gmaxwell commented at 5:10 pm on July 22, 2020: contributor

    iMX.6 with ASM: scalar_add: min 0.157us / avg 0.158us / max 0.160us scalar_negate: min 0.0697us / avg 0.0699us / max 0.0705us scalar_sqr: min 1.10us / avg 1.13us / max 1.31us scalar_mul: min 1.07us / avg 1.10us / max 1.30us scalar_inverse: min 33.9us / avg 34.1us / max 34.3us scalar_inverse_var: min 33.9us / avg 33.9us / max 34.0us field_normalize: min 0.0764us / avg 0.0771us / max 0.0777us field_normalize_weak: min 0.0327us / avg 0.0329us / max 0.0332us field_sqr: min 0.291us / avg 0.291us / max 0.292us field_mul: min 0.361us / avg 0.366us / max 0.368us field_inverse: min 27.6us / avg 27.7us / max 27.9us field_inverse_var: min 27.6us / avg 27.8us / max 28.4us field_sqrt: min 78.9us / avg 79.2us / max 80.2us group_double_var: min 2.44us / avg 2.45us / max 2.49us group_add_var: min 5.70us / avg 5.73us / max 5.82us group_add_affine: min 4.90us / avg 4.91us / max 4.93us group_add_affine_var: min 4.08us / avg 4.08us / max 4.10us group_jacobi_var: min 79.3us / avg 79.9us / max 83.8us wnaf_const: min 2.44us / avg 2.45us / max 2.48us ecmult_wnaf: min 2.60us / avg 2.62us / max 2.63us hash_sha256: min 1.56us / avg 1.56us / max 1.57us hash_hmac_sha256: min 6.35us / avg 6.36us / max 6.40us hash_rfc6979_hmac_sha256: min 34.9us / avg 35.2us / max 35.5us context_verify: min 54297us / avg 54598us / max 55353us context_sign: min 501us / avg 503us / max 510us

    iMX.6 with no-asm: scalar_add: min 0.157us / avg 0.157us / max 0.158us scalar_negate: min 0.0697us / avg 0.0703us / max 0.0720us scalar_sqr: min 1.10us / avg 1.10us / max 1.10us scalar_mul: min 1.05us / avg 1.05us / max 1.05us scalar_inverse: min 33.7us / avg 33.7us / max 33.7us scalar_inverse_var: min 33.7us / avg 34.1us / max 34.4us field_normalize: min 0.0771us / avg 0.0774us / max 0.0778us field_normalize_weak: min 0.0326us / avg 0.0327us / max 0.0329us field_sqr: min 0.817us / avg 0.818us / max 0.819us field_mul: min 0.939us / avg 0.942us / max 0.962us field_inverse: min 31.5us / avg 31.6us / max 32.2us field_inverse_var: min 31.5us / avg 31.7us / max 32.5us field_sqrt: min 207us / avg 208us / max 211us group_double_var: min 5.97us / avg 5.98us / max 6.02us group_add_var: min 14.5us / avg 14.5us / max 14.6us group_add_affine: min 11.1us / avg 11.2us / max 11.4us group_add_affine_var: min 10.0us / avg 10.0us / max 10.1us group_jacobi_var: min 208us / avg 209us / max 210us wnaf_const: min 2.44us / avg 2.44us / max 2.44us ecmult_wnaf: min 2.60us / avg 2.61us / max 2.65us hash_sha256: min 1.56us / avg 1.59us / max 1.60us hash_hmac_sha256: min 6.38us / avg 6.40us / max 6.44us hash_rfc6979_hmac_sha256: min 35.0us / avg 35.4us / max 37.2us context_verify: min 124901us / avg 125307us / max 125731us context_sign: min 904us / avg 914us / max 958us

  25. peterdettman commented at 6:04 pm on July 22, 2020: contributor

    Hello, VanitySearch uses a very similar implementation of modular inversion (called DRS62, delayed right shift 62 bits). It operates on the 62 least-significant bits of the integer and maintains a 2x2 matrix in the inner XCD loop. It is a variable time implementation.

    https://github.com/JeanLucPons/VanitySearch/blob/3ba22e9072db540a445d37471e7969bff92fa504/IntMod.cpp#L380 It is between 2 and 3 times faster than Fermat/Euler method on Secp256K1 prime.

    Yes it’s quite similar, but the “divstep” used here should ultimately be faster than the approach there. Section 8 of the safegcd paper discusses some other variants and why they’re slower (at least for constant-time case, but probably in general). I’d suggest you check particularly 8.4 which shows that you have to be very careful about the “divstep” you use, since merely clearing 0s and combining to get more 0s is not sufficient to guarantee termination.

    The comments suggest this is Binary xGCD but I would usually take that to include a size comparison b/w u,v on each iteration, which that code isn’t doing. So I guess it’s actually a variant of Stehlé and Zimmermann’s “Binary Recursive Gcd”?

  26. JeanLucPons commented at 7:23 am on July 23, 2020: none
    The “divstep62” used in VanitySearch is a slightly optimized signed bxcd loop to get zeros a bit faster. The transition matrix starts with the identity and simply records the steps done, on native 64-bit signed integers. It ends (on average) in 62 x 9.84 ≈ 610 steps (against 62 x 12 = 744 steps for the constant-time implementation described in the paper). A standard bxcd “divstep62” gives (on average) 62 x 10.08 = 624.96 steps.
  27. JeanLucPons commented at 2:26 pm on July 24, 2020: none

    Yes, there’s no variable-time in the PR yet. I gave an example timing above from minor changes to _divsteps_62. I did experiment a bit further with the 64bit version and got var-time field inverse down to around ~1.75us, vs. ~3.25us for const-time and ~2.2us for the gmp one. That’s already enough to dispense with gmp for 64bit, but I’m sure it can be improved further.

    Hello, could you tell me which hardware you got these numbers on? On my i5-8500, my implementation of field inverse gives 2.73us (on average, for uniformly distributed 256-bit numbers). I think I can gain a few cycles by optimizing the vector/matrix product (something similar to what you did in secp256k1_fe_update_fg) but I’m not sure it would gain a lot. Thanks ;)

  28. gmaxwell commented at 5:53 pm on July 24, 2020: contributor
    @JeanLucPons aside, you can benchmark this code easily by applying the patch and running bench_internal. :) This doesn’t totally answer your question since he hasn’t posted those variable time optimizations but you could scale his figures based on comparing the constant time numbers.
  29. peterdettman commented at 7:49 pm on July 24, 2020: contributor

    @JeanLucPons It’s an i7 (Haswell) @2.6GHz. The constant-time algorithm does 744 iterations because that’s (more than) enough to guarantee completion, but the average number of iterations to send g to 0 is actually ~532.

    In your code I would first try making the inner loop branchless. If you are able to use __builtin_ctzll or equivalent, then the while(even) loop can be rewritten very simply.
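
    For illustration, a minimal sketch of that rewrite (hypothetical helper, assuming a nonzero 64-bit g and the GCC/Clang intrinsic):

    #include <stdint.h>

    /* Replaces "while (!(g & 1)) { g >>= 1; ++shift; }" with a single step
     * that strips all trailing zero bits at once.  g must be nonzero. */
    static uint64_t strip_trailing_zeros(uint64_t g, int *shift) {
        int z = __builtin_ctzll(g);
        *shift += z;
        return g >> z;
    }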

    However I would suggest you turn your attention first to whether the algorithm is correct i.e. is it guaranteed to terminate for all inputs? A typical Euclidean algorithm always reduces the larger of two variables, and so will terminate. This is the case e.g. for Binary GCD also. A weirder way of saying that is that each iteration produces 0s in the high bits of whichever variable currently has fewer.

    The 2-adic (“bottom bits”) algorithms - including this PR, “binary recursive GCD”, Algorithm PM (“Plus-or-Minus”) - instead produce 0s in the low bits of whichever variable currently has fewer. This is what the “eta” variable is tracking in this PR (in the paper it’s “delta”; eta is just -delta). Even with this property, one has to be careful as section 8.4 of the safegcd paper explains.

    AFAICT your algorithm doesn’t have a property like this; it alternates between u,v without regard for which of the two variables has made more progress. Are you aware of a proof that it does always terminate? The same question applies to the “BXCD” algorithm there.

  30. sipa commented at 3:52 am on July 25, 2020: contributor

    This is really nice!

    I’ve skimmed the paper and the code, and this is my understanding (which may help other reviewers):

    • We have a “micro iteration” function divstep(eta,f,g) that returns updated eta,f,g. When calling with arguments (1,f=oddinteger,g=integer), it will after a bounded number of iterations converge to (something,+-gcd(f,g),0).
    • Every iteration of divstep actually corresponds with a matrix multiplication with the input, and that matrix has small magnitude coefficients, and only depends on eta, and the bottom bit of f and g.
    • N iterations of divstep(eta,f,g) is thus a matrix multiplication consisting of the product of all those individual matrices, whose coefficients’ bitlengths scale with N. It turns out that the combined matrix only depends on eta and the lower N bits of f and g.
    • This gives rise to macroiterations consisting of a fast way of applying N microiterations. Each first computes the combined matrix for N microiterations (as many as result in matrix coefficients that fit in a 64-bit or 32-bit integer; N=62 or 31), and then applies that matrix to the entire f/g values.
    • The overall algorithm is encoding the input in correctly-sized limbs (62 bits or 31 bits), applying 12 or 24 macroiterations on those limbs, remembering the 2x2 64/32-bit matrices from each, and then combining those matrices in order to extract the inverse from it, after a scalar multiplication to correct for the 744 powers of 2 the result was divided by.

    Some overall comments:

    • Would it be possible to introduce simple 2x2 matrix types (with 32/64/128/256 bit coefficients, i assume?) and operations on it? The many coefficients are a bit hard to follow. Or is this hard because they start off being 2x2 32/64-bit matrices and are then reused to serve as coefficients for 128/256 bit matrices, and it would be excessive to use separate memory for those?
    • Alternatively, can you explain the combine_Ns, and decode_matrix functions more?
    • Would it make sense to merge the inner core (the macroiterations applying on the limbs) as they seem to be shared by fe/scalar code?

    Follow-up questions for understanding:

    • The algorithm could be made variable time fairly easily by checking if g=0 after every macro iteration? I suspect it will also make the correction term to undo the power of 2 multiplications dependent on the number of iterations (but there are at most 12/24, so precomputing them all would not be a problem).
    • Could jacobi be implemented effectively using the same code, except using a different extraction from the matrix coefficients?
  31. JeanLucPons commented at 5:51 am on July 25, 2020: none

    @peterdettman Yes, my divstep62 always ends. It is true that the way this bxcd is written might be a bit confusing. Roughly speaking, if u does not make progress, u and v may be swapped 2 times. My divstep62 ends in 62 loops + 31 extra swaps in the worst case. I tried a constant-time implementation but it is ~20% slower.

    Anyway, in 99.9999999…% of cases my implementation ends in less than 11 divstep62 calls, and I know there exists a small class of numbers where this limit is exceeded and can even exceed 12. I don’t know if it is the same in your implementation; I must admit that Theorem 11.2 is unclear to me. I do not have time this weekend to compile the stuff and test your program. I will do this next Monday ;) Keep up the good work ;)

  32. real-or-random commented at 12:41 pm on July 26, 2020: contributor

    The code also uses arithmetic right shift as things stand (although it’s not used in the main inner loop), which I guess will provoke discussion.

    Technically that’s implementation-defined, but I can’t find a reference to a platform where a right shift of a negative value is not implemented as an arithmetic shift. We already rely on it here since a recent PR: https://github.com/bitcoin-core/secp256k1/blob/d567b779fe446fd18820a9d2968ecb703c8dea19/src/ecmult_const_impl.h#L19 (Okay, there it could easily be avoided by casting to unsigned first.)

    So I don’t think that’s a big issue. If this can be avoided for free, yes, we should do it. But otherwise I’d simply accept this. My understanding of the algorithm is not good enough but I suspect it cannot be avoided at zero cost?

    We have multiple other undocumented assumptions in the code base, e.g., about sizes of some types. I guess we should just document those using a CHECK somewhere that asserts this, e.g., like Bitcoin Core does: https://github.com/bitcoin/bitcoin/blob/master/src/compat/assumptions.h

  33. peterdettman commented at 6:43 am on July 27, 2020: contributor

    So I don’t think that’s a big issue. If this can be avoided for free, yes, we should do it. But otherwise I’d simply accept this. My understanding of the algorithm is not good enough but I suspect it cannot be avoided at zero cost?

    e.g. the matrix outputs of divstep_62 are signed and they are used to update the full length f and g, which are signed values. Also the individual matrices are later multiplied together producing more multi-limb signed values. Thus signed multiplications are the natural choice, meaning signed accumulators which are arithmetic right shifted as the output limbs are produced. I don’t see a zero-cost way to avoid it, but I’m all ears of course.

    +1 for CHECKing arithmetic right shift and other assumptions.

  34. JeanLucPons commented at 6:52 am on July 27, 2020: none

    Hello, I did a comparison on my Xeon(R) CPU X5647 @ 2.93GHz

    Build Options:
      with endomorphism       = no
      with ecmult precomp     = yes
      with external callbacks = no
      with benchmarks         = yes
      with coverage           = no
      module ecdh             = no
      module recovery         = no

      asm                     = x86_64
      bignum                  = no
      field                   = 64bit
      scalar                  = 64bit
      ecmult window size      = 15
      ecmult gen prec. bits   = 4

      valgrind                = no
      CC                      = gcc
      CFLAGS                  = -O2 -fvisibility=hidden -std=c89 -pedantic -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -Wno-unused-function -Wno-long-long -Wno-overlength-strings -W -g

    master: field_inverse: min 9.33us safegcd_inv: field_inverse: min 5.73us

    My implementation (also with -O2): 5.10us. Considering peterdettman’s experiments on variable time, I expect that this implementation should be ~65% faster than mine :)

    However I know my code is not well optimized for Linux, as I use only a very few assembly instructions for missing intrinsic functions. On Windows, only intrinsic functions are allowed, and I know that with gcc intrinsics are capricious ;)

    I’m not very familiar with the libsecp256k1 coding style; I will try to do more tests and stats with this safegcd.

  35. JeanLucPons commented at 7:17 am on July 27, 2020: none
    master: field_inverse: min 9.33us / avg 9.33us / max 9.34us safegcd_inv: field_inverse: min 5.73us / avg 5.78us / max 6.16us A question: safegcd_inv is supposed to be constant time, so why do I observe such a difference between min and max? Other timings for constant-time functions do not show such a large difference…
  36. peterdettman commented at 7:27 am on July 27, 2020: contributor

    Anyway, in 99.9999999…% of case my implementation ends in less than 11 divstep62 and I know it exists a small class of number where this limit is overpassed and also exceed 12. I don’t know if it is the same in your implementation, I must admit that the theorem 11.2 is unclear for me.

    I have results for a curve25519 32-bit implementation of safegcd, 1000000 random inputs: the number of divstep_31 calls needed to send g to 0 was (16, 17, 18, 19) a total of (9, 430015, 569880, 96) times respectively. To be clear, this is the const-time code, it always does the full 24 iterations, but g is sent to 0 much earlier.

  37. Minor cleanup b29e51e1ee
  38. JeanLucPons commented at 8:09 am on July 27, 2020: none
    In my case the trap was tricky, not linked to the divstep62 itself or to the inputs of a single divstep62 but to the overall loop and to the 256-bit number. I got a fault (more than 12 divstep62 calls to complete) while running my Kangaroo program on a render farm (GPU code). This fault didn’t change the time needed to solve the 114-bit private key challenge much (so it is rare) and the cases where it happens are rather unexpected (~2^52.36 inversions done during the challenge). As I said, I don’t know if your implementation has the same issue. I will test it ASAP.
  39. gmaxwell commented at 10:27 am on July 27, 2020: contributor

    why do I observe such a difference between min and max

    Measurement noise is normal: Even when a function is constant time, nothing else your computer is doing is constant time. This code passes our constant time tester (valgrind_ctime_test binary), so it’s likely to have no secret-input dependent branches or memory accesses. The purpose of the min/max in the benchmark is mostly to tell you if something is going wrong with the measurement (e.g. if they’re very different you probably want to measure again).

    If you have any questions please feel free to ask. It’s always good to have more eyes looking at this stuff.

    (~2^52.36 inversion done during the chalenge).

    If you do implement a variant of this algorithm and run it on an absurd number of inputs it would be extremely useful if you saved any inputs that took an unusually large number of steps to invert– so that we can add them as test cases! If you do not, I will do this kind of search myself eventually, but since this isn’t GPU code and I only have a dozen gpus (and only about 1000 x86_64 cores) I am unlikely to test >2^52 inversions myself. :) If we got some good test cases as a side effect of some other computation that would be nice. It might even be the case that inputs which take many steps with your algorithm also take many steps for this one.

    (For other contributors here: JeanLucPons has written a GPU discrete log solver for secp256k1 which uses all-affine arithmetic and enormous batched inversions, recently it was used to solve a 114-bit ECDLP challenge– which is I believe the record for prime field ECDLP)

  40. peterdettman commented at 10:45 am on July 27, 2020: contributor

    @sipa An accurate summary; I envy your skimming powers!

    This PR deviates from the paper in some minor ways. As noted previously, eta == -delta, which just simplifies the sign test. Also, the transition matrices are negated from what’s in the paper, allowing a divstep_31 instead of 30 (u can reach -1 << 31). update_fg is adjusted for this negation. The even number of transition matrices cancels out the (-1)^12. The paper mentions a 32bit implementation of curve25519 using divsteps_30, but this requires 25 top-level iterations and only 24 will combine cleanly into 3 2x2ofFE matrices. Still, it’s possible trying divsteps_31 is overly optimistic.

    (matrix types) Or is this hard because…

    Yes, probably hard or overly-complicated, but I wouldn’t object if it helps readers. We’re combining 12 2x2of64bit matrices pairwise, then the 6 products pairwise again to get 3 2x2of256bit values, in place (actually I used some temp space). Those are then converted to 3 2x2ofFE matrices for the final combination, using optimised field ops. Note that there are still some small calculations in the matrix combination that are ultimately not used, but accessing this small performance gain would cost some extra code size.

    Would it make sense to merge the inner core (the macroiterations applying on the limbs)

    Yes, that’s what I have in mind. Note though that it’s currently possible to configure 32bit scalar and 64bit field, so you can’t just choose 32bit or 64bit “inverse” module.

    More sharing is possible at some performance cost (I assume, but have not actually measured); e.g. look at @JeanLucPons’s code for an approach that:

    • applies the transition matrices to (r, s) immediately
    • clears the bottom 62 bits of (r, s) by adding a multiple of P (like a Montgomery reduction step) and divides by 2^62 (a single-limb sketch of this step follows below)
    • this naturally prevents (r, s) from growing past 256 bits, and also automatically scales the output according to the actual number of macro-iterations performed (instead of a fixed 2^744), a good fit for the var-time case.

    Conceivably this could reach an implementation taking the prime as an argument. I guess it should probably be tested so we can actually quantify the performance difference.
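
    As a concrete illustration of the Montgomery-reduction-like step in the list above, here is a hedged single-limb sketch (the real code works limb-by-limb with carries; p_inv62 is assumed to be a precomputed -P^-1 mod 2^62, and all names are mine):

    #include <stdint.h>

    /* Make the low 62 bits of x vanish by adding a suitable multiple of the
     * odd modulus p, then divide exactly by 2^62.  Assumes |x| is small
     * enough that neither the sum nor the quotient overflows, and relies on
     * arithmetic right shift of negative values (as discussed elsewhere in
     * this thread). */
    static int64_t reduce62_sketch(__int128 x, uint64_t p, uint64_t p_inv62) {
        const uint64_t M62 = UINT64_MAX >> 2;
        uint64_t m = ((uint64_t)x * p_inv62) & M62; /* x + m*p == 0 (mod 2^62) */
        return (int64_t)((x + (__int128)m * p) >> 62);
    }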

    The algorithm could be made variable time fairly easily by checking if g=0 after every macro iteration?

    Yes, early termination is the primary macro-optimization. A table of finalisers is a reasonable approach, but also see the section just above. There are also some micro-optimisations possible in divsteps_62:

    • multiple zeros can be dealt with in one iteration (either __builtin_ctzll or similar, or else a tiny lookup table that fits in a platform integer).
    • no swapping of f and g happens while eta is positive, so we can know in some cases that there will be several swapless iterations, and those can be compressed into one with a little mathemagic.

    Could jacobi be implemented effectively using the same code

    I’ve been trying to understand that myself. Unfortunately, allowing (f, g) to go negative (as in this PR) means that the jacobi symbol is not defined for some reduction steps. It can work with a different divstep that keeps both (f, g) non-negative (but take care per safegcd paper 8.4 - although gcd/jacobi can actually terminate when f == g).

    I recommend reading An O(M(n) log n) algorithm for the Jacobi symbol (there are slides too).

  41. peterdettman commented at 10:55 am on July 27, 2020: contributor

    In my case the trap was tricky, not linked to the divstep62 itself or to the inputs of a single divstep62 but to the overall loop and to the 256-bit number. I got a fault (more than 12 divstep62 calls to complete) while running my Kangaroo program on a render farm (GPU code). This fault didn’t change the time needed to solve the 114-bit private key challenge much (so it is rare) and the cases where it happens are rather unexpected (~2^52.36 inversions done during the challenge). As I said, I don’t know if your implementation has the same issue. I will test it ASAP.

    I don’t want to be a broken record, but this tends to confirm my suspicion that maybe the overall algorithm doesn’t terminate for all inputs (although you say here only that it didn’t terminate before 744 iterations). As @gmaxwell said, it would be very valuable to somehow record those “fault” inputs, but they may well be specific to the precise divstep in use (not to mention the prime).

  42. JeanLucPons commented at 3:03 pm on July 27, 2020: none
    OK, I did a few checks with the records I have, and with your implementation it seems not to exceed 10 divstep62() calls. I tweaked the code a bit to record when g goes to 0 and it didn’t exceed 10 iterations. However, I’m far from familiar with this lib and I’m not 100% sure of what I’m testing; I didn’t manage to access the number (I used SECP256K1_FE_CONST() by hand for testing a few cases). I need to learn the libsecp256k1 style first ;) I never observed an infinite loop with my implementation, and in some rare cases it exceeds 12. When I try to draw random numbers in the “window” which fails, it goes up to 24 steps and produces the result in ]-3P,3P[, probably linked to the divstep I use.
  43. Initial _inv_var implementations 3519dccfe4
  44. peterdettman commented at 6:50 pm on July 27, 2020: contributor

    Added var-time inverses. Currently only changes the divsteps method (uses __builtin_ctz intrinsic). There’s still a fixed number of macro-iterations, although once g is 0, the divsteps calls are almost free.

    (All measurements on x86_64)

    master (32bit): scalar_inverse: min 39.0us / avg 39.5us / max 40.5us scalar_inverse_var: min 2.30us / avg 2.37us / max 2.50us field_inverse: min 7.34us / avg 7.45us / max 7.59us field_inverse_var: min 2.25us / avg 2.30us / max 2.44us

    master (64bit): scalar_inverse: min 12.3us / avg 12.5us / max 13.1us scalar_inverse_var: min 2.23us / avg 2.31us / max 2.49us field_inverse: min 5.04us / avg 5.13us / max 5.31us field_inverse_var: min 2.24us / avg 2.28us / max 2.33us

    safegcd_inv (32bit): scalar_inverse: min 4.38us / avg 4.52us / max 4.98us scalar_inverse_var: min 2.91us / avg 3.06us / max 3.52us field_inverse: min 3.69us / avg 3.75us / max 3.82us field_inverse_var: min 2.25us / avg 2.29us / max 2.36us

    safegcd_inv (64bit): scalar_inverse: min 3.56us / avg 3.78us / max 4.05us scalar_inverse_var: min 1.96us / avg 2.01us / max 2.22us field_inverse: min 3.26us / avg 3.33us / max 3.45us field_inverse_var: min 1.73us / avg 1.77us / max 1.85us

  45. gmaxwell commented at 7:33 pm on July 27, 2020: contributor

    arm7

    master (asm): scalar_inverse: min 342us / avg 347us / max 356us scalar_inverse_var: min 13.6us / avg 14.5us / max 17.6us field_inverse: min 80.9us / avg 82.0us / max 83.8us field_inverse_var: min 13.0us / avg 13.2us / max 13.4us

    master (noasm): scalar_inverse: min 349us / avg 359us / max 371us scalar_inverse_var: min 14.1us / avg 14.3us / max 14.6us field_inverse: min 215us / avg 220us / max 226us field_inverse_var: min 13.8us / avg 13.9us / max 14.0us

    safegcd_inv (asm): scalar_inverse: min 34.7us / avg 34.7us / max 34.8us scalar_inverse_var: min 25.5us / avg 25.5us / max 25.5us field_inverse: min 28.1us / avg 28.1us / max 28.1us field_inverse_var: min 18.9us / avg 18.9us / max 18.9us

    safegcd_inv (noasm): scalar_inverse: min 34.7us / avg 34.8us / max 34.9us scalar_inverse_var: min 25.3us / avg 25.4us / max 25.5us field_inverse: min 32.2us / avg 32.2us / max 32.3us field_inverse_var: min 23.0us / avg 23.0us / max 23.1us

    Way better than no GMP, but it’s still being edged out by GMP somewhat on arm.

  46. sipa commented at 10:07 pm on July 27, 2020: contributor

    @peterdettman Another possibility, I think, is making the update_fg code take in a width parameter, which would always be 5/9 in the novar version, but depend on the iteration count in the var version (presumably the bounds can be computed, and shrink as a function of the iteration count).

    EDIT: actually, this can even be done in the novar version.

  47. peterdettman commented at 4:59 am on July 28, 2020: contributor

    Another possibility I think, is making the update_fg code take in a width parameter

    Not as straightforward as it looks. Actually this PR originally did attempt to do it, but I removed it in https://github.com/bitcoin-core/secp256k1/pull/767/commits/3ae7179ad78398c23f27b41dd94baa7c0662c964. Considering the const-time case:

    • I got it wrong; see if you can spot why.
    • The possible width reduction isn’t linear but only towards the end; it isn’t based on whether (f, g) are actually getting smaller, rather it’s based on knowing we only need the result mod 2^(62 * 4), mod 2^(62 * 3), etc. (in theory we only need a sign bit from the final update_fg). See also the central image of Figure 12.2 in the safegcd paper. Note that “At most X iterations/bit are needed” cannot be inverted to “Y iterations left, therefore current bits <= Y/X”.
    • e.g. in the 64bit case the fixed-width updates are loop-unrolled and so faster throughout, at least partially offsetting any gains.

    In the var-time case it could work better because in practice (f, g) do shrink linearly for typical inputs. I think the paper shows that f, g can’t increase in (signed) bitlength from a divstep, so it should be safe to dynamically detect when shortening the width will preserve the (signed) value of f and g.

  48. real-or-random commented at 9:12 am on July 28, 2020: contributor

    Added var-time inverses. Currently only changes the divsteps method (uses __builtin_ctz intrinsic).

    For future reference, here are a few crazy alternative implementations if the intrinsic is not available: https://github.com/hcs0/Hackers-Delight/blob/master/ntz.c.txt (notably including a branch-free variant which is not interesting for us)

    Sorry, I’d like to be more helpful but I’m currently too busy with other things to understand the algorithm and do a proper review….

    Way better than no GMP, but it’s still being edged out by GMP somewhat on arm.

    This suggests the question how much slowdown we’d accept (if any) for the advantage of being able to get rid of GMP entirely.

  49. gmaxwell commented at 9:23 am on July 28, 2020: contributor

    This suggests the question how much slowdown we’d accept (if any) for the advantage of being able to get rid of GMP entirely.

    Oh, I think it’s there already. Bitcoin doesn’t use GMP due to a strong preference against dependencies (esp. in consensus code). The work to refine and validate this implementation for constant-time use, where it provides an enormous speedup (42% for signing) with no alternative, is 95% of the work needed for using it variable time (or even 100% if the constant-time version is used).

  50. sipa commented at 1:48 am on July 29, 2020: contributor

    It seems GMP has an (experimental) constant-time modular inverse function too now, mpn_sec_invert. It may be interesting to benchmark that as well.

    This text in the documentation:

    It is required that M is odd, and that nbcnt >= ceil(log(A+1)) + ceil(log(M+1)).

    makes me think it may be using a similar style algorithm.

  51. real-or-random commented at 5:11 am on July 29, 2020: contributor

    It seems GMP has an (experimental) constant-time modular inverse function too now, mpn_sec_invert_itch. It may be interesting to benchmark that as well.

    As noted by @mratsim above, this is Niels Möller’s algorithm (and implementation) from 2013; also see the GMP ChangeLog (it was called mpn_sec_minvert earlier).

  52. JeanLucPons commented at 1:08 pm on July 29, 2020: none
    Hello, I implemented the var-time version in my program (not using a global matrix), still doing the matrix/vector product at each step. I gained ~50% compared to my former implementation. Nice! I hope the gain will be the same on the GPU side. Many thanks to @peterdettman for this work :) https://github.com/JeanLucPons/Kangaroo/blob/0042e1a3672bb12461c09d969daa6fc37b918724/SECPK1/IntMod.cpp#L366
  53. peterdettman commented at 1:16 pm on July 29, 2020: contributor

    @JeanLucPons Awesome! I actually cloned VanitySearch and was benchmarking basically the same change already, but very happy to see you work it through yourself. I may be borrowing your r/s updating code in return.

    Oh, and the same basic code appears to work on the GPU with __ffsll(x) - 1 replacing __builtin_ctzll (BitScanForward64), but it wasn’t clear to me how to benchmark it. I look forward to your results.

  54. JeanLucPons commented at 2:01 pm on July 29, 2020: none
    Thanks for the tips ;) On the GPU side for VanitySearch it will not bring a significant speedup, as modular inversions can be batched in large groups, but for Kangaroo it is an actual advantage, as the smaller the group the better (due to the DP overhead).
  55. JeanLucPons commented at 9:25 am on July 31, 2020: none

    @peterdettman

    Hello,

    I tried the var-time implementation on the GPU (using __ffsll()) and it has similar performance to my former implementation (even a bit slower), likely due to the fact that branching is costly on GPUs. Note that on the GPU I optimize for the secp256k1 prime, but not on the CPU side. So I’m trying to port the constant-time implementation and I didn’t manage to make it work; the port of the var-time version was rather straightforward, but I’m struggling to port the constant-time version of divstep62. The first number I’m inverting in my program is C9BD1905155383999C46C2C295F2B761BCB223FEDC24A059D838091D0868192A, which is equal to 2^-256 (mod p). The var-time version gives the correct result (1000003D1). With the constant-time version, only one of the two variables (f, g) makes progress. I’m likely doing something wrong. Could you print out the matrices that should be produced for that number?

    Many thanks again, and feel free to borrow anything you want from my code and especially to optimize anything you want ;)

  56. peterdettman commented at 2:16 pm on July 31, 2020: contributor
    @JeanLucPons I think we’d better move that conversation over to the Kangaroo project. Could you open a PR there with the initial attempt and then mention me to take a look. It will probably be something relatively simple, but if not I will get the transition matrices for you.
  57. JeanLucPons commented at 2:19 pm on July 31, 2020: none

    Thanks, I found the issue; I had to do this with Visual Studio:

    for(i = 0; i < 62; ++i) {

      c1 = -(g & ((((uint64_t)eta) & 0xFFFFULL) >> 15));
  58. Simplify type of 'eta' bd184711c8
  59. sipa commented at 10:45 pm on July 31, 2020: contributor

    I doubt this is useful, but if you get to observe the sign of f during the steps of the algorithm, you can compute the jacobi symbol:

    def divstep(delta, f, g):
        assert(f & 1)
        if delta > 0 and (g & 1):
            return (1-delta, g, (g-f)//2)
        elif g & 1:
            return (1+delta, f, (g+f)//2)
        else:
            return (1+delta, f, g//2)

    def jacobi(a, m):
        ret = 1
        delta = 1
        f = m
        g = a
        while g != 0:
            old_f = f
            delta, f, g = divstep(delta, f, g)
            r = (abs(f) & 6) + (abs(old_f) & 2) + 2 * ((f < 0) != (old_f < 0))
            if r == 4 or r == 6:
                ret *= -1
        if abs(f) == 1:
            return ret
        return 0
  60. JeanLucPons commented at 10:30 am on August 5, 2020: none

    Still, it’s possible trying divsteps_31 is overly optimistic.

    Hello, could you tell us more about this divstep31? Are you running a ‘computer proof’ for the overflow checking?

  61. peterdettman commented at 4:45 am on August 6, 2020: contributor

    Could you tell us more about this divstep31? Are you running a ‘computer proof’ for the overflow checking?

    I’m not applying any automated proof tools, no. The code analysis I don’t think is so difficult, it’s more about understanding the limits for the transition matrices from each divsteps_31 call (u, v, q, r).

    divsteps_31 is just 31 iterations of the safegcd divstep, operating on the least-significant 31 bits of (f, g). In the 64-bit case, using groups of 64 -2 = 62 iterations (i.e. 2 bits of “headroom”) makes the bounds/overflow analysis fairly simple, and 12 * 62 == 744 is sufficient iterations for the constant-time algorithm. For the 32-bit case you could choose 32 - 2 == 30 iterations and things are similarly simple, but then you would have 25 * 30 == 750 iterations. It would be neater to have 24 * 31 == 744, because 25 don’t combine neatly into 3 field element matrices (thus extra fixup code), and also just to save the extra matrix multiplications on the full-length values.

    I still mean to test your (@JeanLucPons) approach for combining the transition matrices on-the-fly using the Montgomery-reduction steps (I mentioned this in #767 (comment)) also. If that is fast enough, maybe we’d just settle for divsteps_30.

  62. JeanLucPons commented at 8:05 am on August 6, 2020: none

    Thanks for the clarification.

    If I’m not mistaken, concerning the update of R/S:

    With the method in the paper you have (counting 64-bit x 64-bit multiplies, noted M, per divstep62):

    • 48 M => 6 x 128-bit matrices
    • 96 M => 3 x 256-bit matrices
    • 6 modmul (6 x 21 = 126 M) => Bézout coefficients

    You exploit the fact that the initial (r, s) vector is (1, 0) => 270 M.

    By updating r, s at each step (considering an optimized multiplication by the secp256k1 prime):

    12 * (20 + 2 + 2) => 288 M

    A bit faster with matrix/matrix products, but a few multiplications can also be saved in the second method by also exploiting the initial (r, s), and it has the advantage that you can stop when you want. Edit: I forgot the initial modmul for the normalisation (2^-744), so speed should be very similar with both methods.

  63. field_5x52: update Bézout coefficients on-the-fly b8c7390bc3
  64. peterdettman commented at 10:09 am on August 8, 2020: contributor

    64-bit field now rewritten to update the Bézout coefficients on-the-fly. Simpler, faster: pick any 2.

    field_inverse: min 2.94us / avg 2.97us / max 3.04us field_inverse_var: min 1.44us / avg 1.47us / max 1.57us

    For reference, my earlier results:

    master (64bit): field_inverse: min 5.04us / avg 5.13us / max 5.31us field_inverse_var: min 2.24us / avg 2.28us / max 2.33us

    safegcd_inv (64bit): field_inverse: min 3.26us / avg 3.33us / max 3.45us field_inverse_var: min 1.73us / avg 1.77us / max 1.85us

  65. field_10x26: update Bézout coefficients on-the-fly 64a4912c43
  66. peterdettman commented at 12:42 pm on August 8, 2020: contributor

    32-bit field updated. New results:

    field_inverse: min 3.32us / avg 3.38us / max 3.47us field_inverse_var: min 1.74us / avg 1.75us / max 1.81us

    Earlier results:

    master (32bit): field_inverse: min 7.34us / avg 7.45us / max 7.59us field_inverse_var: min 2.25us / avg 2.30us / max 2.44us

    safegcd_inv (32bit): field_inverse: min 3.69us / avg 3.75us / max 3.82us field_inverse_var: min 2.25us / avg 2.29us / max 2.36us

  67. scalar_4x64: update Bézout coefficients on-the-fly e5f2d29cbb
  68. peterdettman commented at 3:23 pm on August 8, 2020: contributor

    64-bit scalar updated. New results:

    scalar_inverse: min 3.00us / avg 3.07us / max 3.14us scalar_inverse_var: min 1.50us / avg 1.53us / max 1.57us

    Earlier results:

    master (64bit): scalar_inverse: min 12.3us / avg 12.5us / max 13.1us scalar_inverse_var: min 2.23us / avg 2.31us / max 2.49us

    safegcd_inv (64bit): scalar_inverse: min 3.56us / avg 3.78us / max 4.05us scalar_inverse_var: min 1.96us / avg 2.01us / max 2.22us

  69. JeanLucPons commented at 3:59 pm on August 8, 2020: none

    Nice :)

    Here is a CPU profile of the var-time divstep62 by @peterdettman (Visual Studio), ~1.57us on my i5. The mul by P is not optimized for the secp256k1 prime in this profile, and ~1% is used for the “weak normalization”, not displayed here.

    [image: modinv_perf]

    On the GPU side, for counting trailing zeros, log2(x&-x) is faster than __ffsll() (at least on a 1050 Ti); I will do more tests on other GPUs…

  70. scalar_8x32: update Bézout coefficients on-the-fly 34bec400bf
  71. peterdettman commented at 4:47 pm on August 8, 2020: contributor

    32-bit scalar updated. All results for new approach:

    master (32bit): scalar_inverse: min 39.0us / avg 39.5us / max 40.5us scalar_inverse_var: min 2.30us / avg 2.37us / max 2.50us field_inverse: min 7.34us / avg 7.45us / max 7.59us field_inverse_var: min 2.25us / avg 2.30us / max 2.44us

    master (64bit): scalar_inverse: min 12.3us / avg 12.5us / max 13.1us scalar_inverse_var: min 2.23us / avg 2.31us / max 2.49us field_inverse: min 5.04us / avg 5.13us / max 5.31us field_inverse_var: min 2.24us / avg 2.28us / max 2.33us

    safegcd_inv (32bit): scalar_inverse: min 3.37us / avg 3.41us / max 3.48us scalar_inverse_var: min 1.84us / avg 1.85us / max 1.93us field_inverse: min 3.32us / avg 3.38us / max 3.47us field_inverse_var: min 1.74us / avg 1.75us / max 1.81us

    safegcd_inv (64bit): scalar_inverse: min 3.00us / avg 3.07us / max 3.14us scalar_inverse_var: min 1.50us / avg 1.53us / max 1.57us field_inverse: min 2.94us / avg 2.97us / max 3.04us field_inverse_var: min 1.44us / avg 1.47us / max 1.57us

  72. peterdettman commented at 4:57 pm on August 8, 2020: contributor

    On the GPU side, for counting the trailing zero, log2(x&-x) is faster than __ffsll() (at least on a 1050 Ti)

    You can also try __clzll(__brevll(x)), i.e. count leading zeros of the bit-reversed value, although the __ffsll intrinsic may already be implemented that way in the absence of hardware support.

  73. sipa commented at 5:36 pm on August 8, 2020: contributor

    Nice, it seems to easily beat GMP now (at least on the systems you tested). This is a much nicer approach than deconstructing the full coefficients through matrix multiplications.

    A few comments again before I start reviewing in more detail.

    • I’m not sure that int128 -> int64 truncation (or int64 -> int32) is well-defined. It’s not UB, but I believe it is implementation defined what it means. I doubt any reasonable platform does anything but reinterpreting the bottom 64/32 bits in two’s complement, but it would be nice to not rely on this. I believe a int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1 : (int64_t)x; } works and is optimized out on all reasonable platforms I can test on godbolt (gcc, clang, msvc, icc, on arm/ppc/x86 at least). I hate such stupid workarounds, and would love an argument why this isn’t needed if anyone has one.

    • Only encode, decode, and the inner core’s P/I62 depend on the modulus, I think? Would you mind trying what performance impact there is from merging the scalar/fe operations (but pass in a pointer to an object with precomputed P/I62?). Otherwise I’m happy to do that.

    • __builtin_ctzl[l] isn’t always available, so a workaround will be needed. Is it better to just not have this level of granularity (i.e. just do one bit at a time per iteration, and perhaps break when g=0 is reached), or to manually count the zero bits but keep the existing algorithm? (A simple portable fallback is sketched below.)
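
    If the intrinsic is unavailable, a trivial portable fallback (a hedged sketch of the one-bit-at-a-time option) could look like:

    #include <stdint.h>

    /* Portable trailing-zero count; x must be nonzero.  One bit per loop
     * iteration, so slower than __builtin_ctzll. */
    static int ctz64_fallback(uint64_t x) {
        int n = 0;
        while (!(x & 1)) {
            x >>= 1;
            ++n;
        }
        return n;
    }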

  74. sipa commented at 5:43 pm on August 8, 2020: contributor
    Eh, and left shifting a negative number is UB apparently; right shifting is implementation defined. I think you can work around these by doing most operations on unsigned, and then using the makesigned() function above to get signed versions when needed, but ugh…
  75. gmaxwell commented at 5:58 pm on August 8, 2020: contributor

    Here are updated ARM benchmarks. I can’t benchmark right now on the iMX.7 that I used for arm7 numbers before, due to a snafu here (flash card corruption :( ), so this time I’m testing on a much slower arm7 (though with a much newer compiler).

    All of the below are GCC 9.3.0.

    armv8 (S905 A53) 64-bit master: scalar_inverse: min 61.1us / avg 61.3us / max 62.3us scalar_inverse_var: min 5.67us / avg 5.68us / max 5.71us field_inverse: min 31.6us / avg 31.6us / max 31.7us field_inverse_var: min 5.59us / avg 5.61us / max 5.71us

    armv8 (S905 A53) 64-bit old-safegcd: scalar_inverse: min 14.4us / avg 14.5us / max 15.6us scalar_inverse_var: min 7.80us / avg 7.80us / max 7.81us field_inverse: min 13.7us / avg 13.7us / max 13.8us field_inverse_var: min 7.09us / avg 7.10us / max 7.20us

    armv8 (S905 A53) 64-bit safegcd: scalar_inverse: min 12.8us / avg 12.8us / max 12.9us scalar_inverse_var: min 6.26us / avg 6.27us / max 6.27us field_inverse: min 12.2us / avg 12.2us / max 12.3us field_inverse_var: min 5.75us / avg 5.76us / max 5.86us

    armv7 i.MX53 32-bit (asm) master: scalar_inverse: min 452us / avg 452us / max 453us scalar_inverse_var: min 20.4us / avg 20.4us / max 20.7us field_inverse: min 165us / avg 165us / max 165us field_inverse_var: min 20.0us / avg 20.0us / max 20.1us

    armv7 i.MX53 32-bit (asm) old-safegcd: scalar_inverse: min 50.7us / avg 50.8us / max 51.0us scalar_inverse_var: min 38.4us / avg 38.5us / max 38.6us field_inverse: min 43.7us / avg 43.7us / max 43.7us field_inverse_var: min 31.3us / avg 31.3us / max 31.4us

    armv7 i.MX53 32-bit (asm) safegcd: scalar_inverse: min 43.3us / avg 43.4us / max 43.5us scalar_inverse_var: min 25.6us / avg 25.6us / max 25.7us field_inverse: min 38.6us / avg 38.6us / max 38.6us field_inverse_var: min 22.1us / avg 22.1us / max 22.6us

    It’s still getting edged out by GMP, though it’s pretty neck and neck now.

  76. real-or-random commented at 6:15 pm on August 8, 2020: contributor
    * I'm not sure that int128 -> int64 truncation (or int64 -> int32) is well-defined. It's not UB, but I believe it is implementation defined what it means. I doubt any reasonable platform does anything but reinterpreting the bottom 64/32 bits in two's complement, but it would be nice to not rely on this. I believe a `int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1LL : (int64_t)x; }` works and is optimized out on all reasonable platforms I can test on godbolt (gcc, clang, msvc, icc, on arm/ppc/x86 at least). I hate such stupid workarounds, and would love an argument why this isn't needed if anyone has one.
    

    I don’t see why int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1LL : (int64_t)x; } should be much better. That’s also implementation-defined. Yeah, you could argue it relies only on twos complement. But it could be compiled to a branch.

    For ctzl, see #767 (comment)

    Eh, and left shifting a negative number is UB apparently;

    Urghs, okay, we should definitely avoid UB. The easiest way is to cast to unsigned, shift, and cast back to signed. That’s still implementation-defined because of the cast back to signed, but at least it’s not UB.

  77. gmaxwell commented at 6:24 pm on August 8, 2020: contributor
    Yeah, freaking C99 copied that garbage from C++; it was implementation defined previously. I believe there is discussion of making it implementation defined again– because it at least theoretically makes it unsafe to use shifts on signed numbers, period. Casting and casting back gets you the intended behaviour and just requires a twos-complement implementation (I believe everything with a C89 compiler is twos-complement; the PDP-1 was ones-complement but there is no C89 compiler for it).
  78. sipa commented at 6:53 pm on August 8, 2020: contributor

    @real-or-random int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1LL : (int64_t)x; } only performs a ~ on unsigned input, which is well-defined (and unary/binary minus on signed numbers, within range, which is also well-defined) as far as I know.

    I edited my comment; an earlier version was using ~ on signed numbers too.

  79. sipa commented at 6:59 pm on August 8, 2020: contributor
    I think it’s reasonable to require a two’s complement implementation (we could verify it in a configure script, and in unit tests), and just do a cast-to-unsigned, shift, cast-to-signed for signed left shifts.
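
    For concreteness, a minimal sketch of that idiom (the helper name is hypothetical, not code from this PR; it assumes <stdint.h> and a two’s complement unsigned-to-signed conversion):

      /* Shift a signed value left without invoking UB: do the shift on the
       * unsigned representation and convert back. */
      static int64_t shl_signed(int64_t x, unsigned int n) {
          return (int64_t)((uint64_t)x << n);
      }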
  80. real-or-random commented at 10:21 pm on August 8, 2020: contributor

    @real-or-random int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1LL : (int64_t)x; } only performs a ~ on unsigned input, which is well-defined (and unary/binary minus on signed numbers, within range, which is also well-defined) as far as I know.

    I edited my comment; an earlier version was using ~ on signed numbers too.

    Oh yes, indeed.

    But if, for the left shifts, we anyway require that unsigned-to-signed conversions of the same width are a two’s-complement noop (and we should, of course!), then (int64_t)(uint64_t)x should be a much simpler way to truncate x of type int128_t. (And depending on how pedantic you are, you may just as well write (int64_t)x. Who would be crazy enough to come up with an implementation where this makes a difference?)

    I think it’s reasonable to require a two’s complement implementation (we could verify it in a configure script, and in unit tests)

    If you really have some exotic system, then you might not run ./configure or the tests either. I think the safest way is to use some C static assert hack, e.g., https://stackoverflow.com/a/56742411. These look very ugly but they work, and the ugliness is hidden behind a macro. We should also check other simple assumptions such as sizeof(int) == 4 etc. Alternatively, we could perform some self-tests at context creation time, and call the error callback if they fail. The simple ones will be optimized away by the compiler, and we should have some kind of self-test anyway if we want to implement a SHA256 override (see the discussion in #702).
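
    For illustration, a rough sketch of the kind of compile-time check being discussed (macro and typedef names are hypothetical; the negative-array-size trick needs no compiler extensions, assuming <stdint.h>):

      /* Compilation fails if cond is false, because the array size becomes -1. */
      #define STATIC_CHECK(name, cond) typedef char name[(cond) ? 1 : -1]

      STATIC_CHECK(assumption_int_is_32bit, sizeof(int) == 4);
      /* Asserts that out-of-range unsigned-to-signed conversion wraps (two's complement). */
      STATIC_CHECK(assumption_conversion_wraps, (int64_t)UINT64_MAX == (int64_t)-1);

    These cost nothing at runtime; the self-test-at-context-creation route mentioned above is the alternative that does not lean on compile-time trickery.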

  81. gmaxwell commented at 10:37 pm on August 8, 2020: contributor
    Those “static assert” hacks tend to not work well outside of GCC/Clang in my experience– usually they use some language extension that doesn’t exist everywhere. So you add some code to detect some config no user will ever have like ones complement, but then break compiles on ordinary stuff. Self-tests are better generally.
  82. real-or-random cross-referenced this on Aug 8, 2020 from issue Document / assert non-portable assumptions by real-or-random
  83. sipa commented at 10:55 pm on August 8, 2020: contributor

    @real-or-random If we assume two’s complement nothing is needed for truncation, as (int64_t)int128 == (int64_t)(uint64_t)int128. Under that assumption I think all the code in this PR is fine, except the left shift of signed values (which needs a conversion to unsigned and back).

    My suggestion to use makesigned is for the case where we don’t want to assume two’s complement. Doing that would require a lot of changes in this PR (basically making all signed values unsigned, and using makesigned whenever a “signed” interpretation is needed).

  84. real-or-random commented at 11:06 pm on August 8, 2020: contributor

    @real-or-random If we assume two’s complement nothing is needed for truncation, as (int64_t)int128 == (int64_t)(uint64_t)int128. Under that assumption I think all the code in this PR is fine, except the left shift of signed values (which needs a conversion to unsigned and back).

    Two’s complement doesn’t say anything about truncation. I was drawing a distinction between conversion from unsigned to signed and truncation of signed, which are both implementation-defined but different issues. So I believe there’s nothing in the standard that guarantees that (int64_t)int128 == (int64_t)(uint64_t)int128 holds for your implementation.

    But anyway, I think this is a somewhat useless discussion. We should just assume and assert that both are the same value (that’s why I added that sentence in parentheses) and that we have two’s complement, and the problem is solved.

  85. peterdettman commented at 8:52 am on August 9, 2020: contributor
    Are there current or planned uses of modular inversion where we know the input size in bits is (always or usually) shorter than the full 256 bits?
  86. gmaxwell commented at 9:21 am on August 9, 2020: contributor

    @apoelstra Anything in bulletproofs?

    I recall making a prior proposal for an accountable multiparty signature with size O(missing_parties) where in a naive implementation the verifier needed to invert small values (the coefficients of a Vandermonde matrix)– but I was able to reorder it to get the inversions into the signer and setup. It doesn’t seem impossible that there would be some cause for it, esp. in the scalar field.

    One case where we commonly deal with non-full-size numbers is the reduced scalars that show up as a result of the endomorphism decomposition, but nothing is coming to mind as a reason to invert one of them.

  87. Alternate var-time divsteps code bfd7a0fbd6
  88. peterdettman commented at 10:05 am on August 9, 2020: contributor

    @gmaxwell I pushed some alternate code under preprocessor tags for divsteps_62_var and divsteps_30_var (so 4 places total) that I’d like you to try on the ARM systems please. All the blocks look like this:

      #if 1
              /* Handle up to 3 divsteps at once, subject to eta and i. */
              /* ... */
      #else
              g += f;
              q += u;
              r += v;
      #endif
    

    The “3 divsteps” is faster everywhere on my machine, but not by a huge margin, so maybe that’s a deoptimization for the ARM systems?
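
    For readers following along: the aggregated branch is only elided here, but its shape can be inferred from the “& 7U” mask visible in the diff further down and the “f * f == 1 mod 8” note in the GPU variant below. Roughly (a sketch, not the PR’s literal code; variable names as in _divsteps_30_var):

      /* Since f is odd, f*f == 1 (mod 8), so -1/f == -f (mod 8).  Choosing
       * w = (-f * g) & m makes the low min(limit, 3) bits of g + w*f zero,
       * i.e. it performs up to 3 "g is even" divsteps in a single update. */
      limit = ((int)eta + 1) > i ? i : ((int)eta + 1);
      m = (UINT32_MAX >> (32 - limit)) & 7U;
      w = (-f * g) & m;
      g += f * w;
      q += u * w;
      r += v * w;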

  89. real-or-random commented at 10:12 am on August 9, 2020: contributor

    @apoelstra Anything in bulletproofs?

    Not natively, but I could imagine that one could optimize some of the inversions there. The inner product proofs need scalar inversions of (Fiat-Shamir) challenges a few times, and at least in the case of Schnorr sigs, a challenge of 128 bits is sufficient there to guarantee soundness and extractability. (This is just a random idea. This needs some careful investigation, and I haven’t looked at the details.)

  90. JeanLucPons commented at 10:12 am on August 9, 2020: none

    On GPU, this code is a bit faster.

         /*
          int64_t m,w;
          // Handle up to 3 divsteps at once, subject to eta and bitCount
          int limit = (*eta + 1) > bitCount ? bitCount : (*eta + 1);
          m = (UINT64_MAX >> (64 - limit)) & 7U;
          // Note that f * f == 1 mod 8, for any f
          w = (-u0 * v0) & m;
          v0 += u0 * w;
          *vu += *uu * w;
          *vv += *uv * w;
          */

          v0 += u0;
          *vu += *uu;
          *vv += *uv;
    
  91. gmaxwell commented at 10:37 am on August 9, 2020: contributor

    @peterdettman Pieter had me try exactly that 15 hours ago. :P It was faster on the 32-bit arm, but not by a huge margin and slower on the 64-bit arm, and slower on x86_64 (obviously or you wouldn’t have had the optimization in the first place).

    (see above for the pre-deoptimization numbers)

    armv7 i.MX53 32-bit (asm) sipa-safegcd: scalar_inverse: min 43.3us / avg 43.4us / max 43.5us scalar_inverse_var: min 24.2us / avg 24.2us / max 24.2us field_inverse: min 38.6us / avg 38.6us / max 38.6us field_inverse_var: min 20.6us / avg 20.6us / max 21.1us

    armv8 (S905 A53) 64-bit sipa-safegcd: scalar_inverse: min 12.8us / avg 12.8us / max 12.8us scalar_inverse_var: min 6.33us / avg 6.34us / max 6.34us field_inverse: min 12.2us / avg 12.2us / max 12.2us field_inverse_var: min 5.83us / avg 5.83us / max 5.83us

    epyc 64-bit safegcd: scalar_inverse: min 3.20us / avg 3.20us / max 3.21us scalar_inverse_var: min 1.61us / avg 1.61us / max 1.62us field_inverse: min 3.16us / avg 3.16us / max 3.16us field_inverse_var: min 1.59us / avg 1.59us / max 1.59us

    epyc 64-bit sipa-safegcd: scalar_inverse: min 3.20us / avg 3.21us / max 3.22us scalar_inverse_var: min 1.69us / avg 1.70us / max 1.70us field_inverse: min 3.16us / avg 3.16us / max 3.16us field_inverse_var: min 1.65us / avg 1.66us / max 1.66us

  92. Add comments regarding small inputs f873c3b503
  93. peterdettman commented at 10:58 am on August 9, 2020: contributor

    For small inputs, I’ve added a comment explaining an optimization when len(g) is bounded (const-time) or short (var-time). It’s based on the simple observation that you could imagine g1 = g << clz(g) as the input instead, in which case the first clz(g) divsteps would simply be shifting g1 down and decrementing eta, taking us back to the original input g, yet safegcd(g1) would still complete within 741 divsteps total.

    Probably with some math you could show that the worst-case iteration count is reduced by close to double that (i.e. nearly 2*clz(g) fewer divsteps).

  94. Avoid left shift of signed values 17982d820e
  95. gmaxwell commented at 12:03 pm on August 9, 2020: contributor
    @peterdettman you can always conditionally negate a number if it shaves off bits and negate the inverse if you did. Doing so reduces the maximum (and average) by 1 bit. Does that pay for itself?
  96. Add alternative to __builtin_ctz intrinsics
    - lookup tables based on  de Bruijn sequences
    06d568a7e6
  97. Write primes in signed-digit form 16509ca068
  98. gmaxwell commented at 3:01 pm on August 9, 2020: contributor

    armv7 i.MX53 32-bit (asm) safegcd-altdivstep-no-ctz: scalar_inverse: min 43.7us / avg 43.8us / max 43.9us scalar_inverse_var: min 28.0us / avg 28.0us / max 28.1us field_inverse: min 37.7us / avg 37.7us / max 37.7us field_inverse_var: min 23.5us / avg 23.6us / max 24.1us

    So not faster even on the arm7 which doesn’t have a ctz (it uses bitrev and clz)… but it’s good to have a fallback handy for where the builtin doesn’t exist.

  99. Unify _update_de_ methods 40c815ebe1
  100. gmaxwell commented at 3:56 pm on August 9, 2020: contributor

    Unify _update_de appears to be a 17% performance loss on arm7 field var inverse? Scalar doesn’t appear to have been hurt.

    I didn’t see why at a glance, so I retested several times. 17% is a bit large for compiler noise. :P

    Before: (with clz on and merged divstep off) scalar_inverse: min 43.5us / avg 43.5us / max 43.6us scalar_inverse_var: min 24.0us / avg 24.1us / max 24.1us field_inverse: min 38.1us / avg 38.1us / max 38.2us field_inverse_var: min 19.7us / avg 19.7us / max 19.7us

    After: scalar_inverse: min 43.7us / avg 43.7us / max 43.8us scalar_inverse_var: min 24.2us / avg 24.2us / max 24.3us field_inverse: min 43.3us / avg 43.3us / max 43.4us field_inverse_var: min 23.8us / avg 23.8us / max 23.8us

  101. peterdettman commented at 4:01 pm on August 9, 2020: contributor
    • I’m not sure that int128 -> int64 truncation (or int64 -> int32) is well-defined. It’s not UB, but I believe it is implementation defined what it means. I doubt any reasonable platform does anything but reinterpreting the bottom 64/32 bits in two’s complement, but it would be nice to not rely on this. I believe a int64_t makesigned(uint64_t x) { return (x >> 63) ? -(int64_t)(~x)-1 : (int64_t)x; } works and is optimized out on all reasonable platforms I can test on godbolt (gcc, clang, msvc, icc, on arm/ppc/x86 at least). I hate such stupid workarounds, and would love an argument why this isn’t needed if anyone has one.

    Happy to change things, but it sounds from later comments like the plan might be to self-test for reasonable behaviour.

    • Only encode, decode, and the inner core’s P/I62 depend on the modulus, I think? Would you mind trying what performance impact there is from merging the scalar/fe operations (but pass in a pointer to an object with precomputed P/I62?). Otherwise I’m happy to do that.

    Yep, encode, decode and now the field/scalar versions of _update_de_(30/62) are the same except for P and I(30/62). gcc is unrolling the loop in _update_de_ methods so by writing the primes with signed digits we retain optimal performance.

    According to my testing, P and I(30/62) could now be made into arguments without performance loss iff the _update_de_ methods are inlined. If the intention was to share the actual inv/inv_var methods then you start to need deep inlining.

    EDIT: Okay, maybe not, looks like this was a “works for me at -O3” thing.

    • __builtin_ctzl[l] isn’t always available, so a workaround will be needed. Is it better to just not have this level of granulary (i.e. just do one bit at a time per iteration, and perhaps break when g=0 is reached), or to manually count the zero bits but keep the existing algorithm?

    Changed the code to default to the portable alternative. If anyone wants to work on providing CTZ32(uint32_t) and CTZ64(uint64_t) wrappers, that would be helpful.
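
    One possible shape for such a wrapper (a sketch only; the fallback below uses the widely known 0x077CB531 de Bruijn constant and its table, not the 0x04D7651F constant from the commit, and it assumes x != 0):

      static const uint8_t debruijn32[32] = {
          0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
          31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
      };

      /* Count trailing zeros of a nonzero 32-bit value. */
      static SECP256K1_INLINE int CTZ32(uint32_t x) {
      #if defined(__GNUC__) || defined(__clang__)
          return __builtin_ctz(x);
      #else
          return debruijn32[((x & -x) * 0x077CB531U) >> 27];
      #endif
      }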

  102. peterdettman commented at 4:03 pm on August 9, 2020: contributor

    Unify _update_de appears to be a 17% performance loss on arm7 field var inverse?

    Can you check if it’s unrolling the loop in _update_de_? Also please mention the compiler version and options so I can take a look in godbolt.

  103. gmaxwell commented at 4:17 pm on August 9, 2020: contributor

    gcc version 9.3.0 (GCC)

    ./configure --with-asm=arm --enable-experimental && make clean && make bench_internal && ./bench_internal scalar && ./bench_internal field

      diff --git a/src/field_10x26_impl.h b/src/field_10x26_impl.h
      index 48551a8..43d3d9e 100644
      --- a/src/field_10x26_impl.h
      +++ b/src/field_10x26_impl.h
      @@ -1301,7 +1301,7 @@ static uint32_t secp256k1_fe_divsteps_30_var(uint32_t eta, uint32_t f0, uint32_t
               /* Use a sentinel bit to count zeros only up to i. */
               x = g | (UINT32_MAX << i);
       
      -#if 0
      +#if 1
               zeros = __builtin_ctzl(x);
       #else
               zeros = debruijn[((x & -x) * 0x04D7651F) >> 27];
      @@ -1329,7 +1329,7 @@ static uint32_t secp256k1_fe_divsteps_30_var(uint32_t eta, uint32_t f0, uint32_t
                   z = v; v = r; r = -z;
               }
       
      -#if 1
      +#if 0
               /* Handle up to 3 divsteps at once, subject to eta and i. */
               limit = ((int)eta + 1) > i ? i : ((int)eta + 1);
               m = (UINT32_MAX >> (32 - limit)) & 7U;
      diff --git a/src/scalar_8x32_impl.h b/src/scalar_8x32_impl.h
      index 6912e51..c5c80e5 100644
      --- a/src/scalar_8x32_impl.h
      +++ b/src/scalar_8x32_impl.h
      @@ -866,7 +866,7 @@ static uint32_t secp256k1_scalar_divsteps_30_var(uint32_t eta, uint32_t f0, uint
               /* Use a sentinel bit to count zeros only up to i. */
               x = g | (UINT32_MAX << i);
       
      -#if 0
      +#if 1
               zeros = __builtin_ctzl(x);
       #else
               zeros = debruijn[((x & -x) * 0x04D7651F) >> 27];
      @@ -894,7 +894,7 @@ static uint32_t secp256k1_scalar_divsteps_30_var(uint32_t eta, uint32_t f0, uint
                   z = v; v = r; r = -z;
               }
       
      -#if 1
      +#if 0
               /* Handle up to 3 divsteps at once, subject to eta and i. */
               limit = ((int)eta + 1) > i ? i : ((int)eta + 1);
               m = (UINT32_MAX >> (32 - limit)) & 7U;
    

    I’ll look at the disassembly in a moment.

    After: https://0bin.net/paste/-dl8vA0ken4U05vv#qEgv2IAZnsgqVO2ObALYjSxXuybDR4O5EfvLEug8lPp / Before: https://0bin.net/paste/2eYUwTpkaGfL-981#W9bTuyRifxZTcAAHWAAqXCrD0srfLxFAe9KEJEl6TlF

    Neither looks unrolled to me (other than the one copy that was unrolled in the “before”, of course).

  104. real-or-random commented at 4:26 pm on August 9, 2020: contributor

    Can you check if it’s unrolling the loop in _update_de_?

    GCC 9.3 SHOULD not unroll loops on -O2 (but yeah, I wouldn’t bet on it).

  105. peterdettman commented at 4:32 pm on August 9, 2020: contributor
    OK, so mystery solved I guess. I’ve been testing at -O3 and gcc (10.2.0, x86_64) is unrolling the loop and constant-folding the array elements intelligently AFAICT. This may mean that we need to maintain the update_de method separately for performance.
  106. gmaxwell commented at 4:47 pm on August 9, 2020: contributor

    Just for amusement, -O3 on the same compiler, platform, and config:

    scalar_inverse: min 44.3us / avg 44.4us / max 44.5us scalar_inverse_var: min 27.4us / avg 27.4us / max 27.5us field_inverse: min 44.1us / avg 44.2us / max 44.2us field_inverse_var: min 27.1us / avg 27.2us / max 27.6us

    Probably blows out its poor caches. :P

  107. peterdettman commented at 5:13 pm on August 9, 2020: contributor
    So, ignoring this hiccup, do we know why this PR struggles to match GMP on the ARM chips, when it is comparing so well on x86_64? If it’s just down to some asm magic, then maybe we’d also need that to beat it here.
  108. gmaxwell commented at 5:26 pm on August 9, 2020: contributor

    GMP does have piles of asm for arm7 and arm8. … but interesting question as to what it could be doing more efficiently that helps. For MSB oriented extgcd the obvious answer was multi-limb arithmetic. That isn’t as much of a factor here.

    FWIW, you’re almost tied with it… that’s pretty good, but it’s funny that the same algorithm is faster on x86_64. It might also just be a relative operation speed issue. I think the LSB half-GCD takes a lot more steps but works with smaller numbers, while GMP uses an MSB-oriented half-GCD that takes fewer steps but is really inefficient unless implemented with fancy techniques for the numbers. Maybe on arm more operations with slower multiplies makes it worse off?

    The performance is good enough – though it’s a little hard to shake the feeling that there is some amazing optimization being missed. :)

  109. sipa cross-referenced this on Aug 9, 2020 from issue Make scalar/field choice depend on C-detected __int128 availability by sipa
  110. peterdettman commented at 5:39 am on August 10, 2020: contributor

    For the var-time case I think there’s a little more blood to squeeze from the stone. The transition matrices from each _divsteps_62_var contain 63-bit signed values (-2^62 <= u,v,q,r < 2^62) in the worst case, but on random inputs the minimum number of (redundant) sign bits over all matrix entries averages more than 28 (measured on function exit), which should safely allow another 26 iterations (without per-iteration checks on the matrix element sizes).

    So if we also pass in f1, g1 (i.e. the second 62-bit limb of each), then after the first 62 iterations we can check the actual size of the matrix elements. For some minimum of available bits we would then get new f, g from f0, f1, g0, g1 and the matrix, and restart a second loop to continue updating the matrix.

    This should result in the outer loop typically finishing in 7 (often 6) iterations instead of 9, saving 2 full updates of d, e, f, g. Saving those full updates is probably even more important for 32bit, although it can probably save proportionally fewer iterations there.

  111. peterdettman commented at 6:17 am on August 10, 2020: contributor

    OK, the flaw with that idea is that the full updates assume 62 divsteps were done. Drat.

    EDIT: I mean, it can still work, but you’d need new full update methods that shift down by the 1 fixed 62-bit limb plus an extra number of bits, guaranteed 0 <= bits < 62. So not too bad, but extra code.

  112. Redo update_de methods dc58f4f094
  113. gmaxwell commented at 8:35 pm on August 10, 2020: contributor
    Oh good, that performs well (on 64-bit) without any -O3 required.
  114. Faster 64bit _inv_var, why not? 132c76dc3a
  115. peterdettman commented at 7:41 pm on August 11, 2020: contributor

    64-bit var-time inverses updated. New results:

    safegcd_inv (64bit): scalar_inverse_var: min 1.293us / avg 1.309us / max 1.347us field_inverse_var: min 1.249us / avg 1.266us / max 1.321us

    Earlier results:

    master (64bit): scalar_inverse_var: min 2.23us / avg 2.31us / max 2.49us field_inverse_var: min 2.24us / avg 2.28us / max 2.33us

    safegcd_inv (64bit) previous: scalar_inverse_var: min 1.50us / avg 1.53us / max 1.57us field_inverse_var: min 1.44us / avg 1.47us / max 1.57us

  116. sipa commented at 8:53 pm on August 11, 2020: contributor
    @peterdettman That’s amazing. It’s a similar speedup for me (but that’s not surprising on x86_64; would be interesting to see ARM numbers too as it adds multiplications which may not be worth it). Any reason why this couldn’t be done for 32-bit?
  117. gmaxwell commented at 9:24 pm on August 11, 2020: contributor

    arm8 (“Redo update_de methods”) scalar_inverse: min 12.3us / avg 12.3us / max 12.3us scalar_inverse_var: min 6.88us / avg 6.88us / max 6.88us field_inverse: min 12.0us / avg 12.0us / max 12.0us field_inverse_var: min 6.60us / avg 6.61us / max 6.61us

    arm8 (“why not”) scalar_inverse: min 12.3us / avg 12.3us / max 12.4us scalar_inverse_var: min 6.76us / avg 6.76us / max 6.76us field_inverse: min 12.0us / avg 12.0us / max 12.1us field_inverse_var: min 6.48us / avg 6.50us / max 6.60us

    arm8 (master) scalar_inverse: min 61.1us / avg 61.1us / max 61.1us scalar_inverse_var: min 5.61us / avg 5.62us / max 5.67us field_inverse: min 31.5us / avg 31.5us / max 31.5us field_inverse_var: min 5.62us / avg 5.62us / max 5.62us

  118. peterdettman commented at 8:40 am on August 12, 2020: contributor

    Any reason why this couldn’t be done for 32-bit?

    Oh it should work, and I will get to it, but it’s largely guesswork without the actual hardware; I will try to push a selectable set of options to make it simple to test some variants.

    Keep in mind that when the multiplications “work” the number of loop iterations is reduced, so it’s not initially a linear increase in cost when adding more. It’s about fishing for potential saved loop iterations as cheaply as possible.

    Note that the multiplications used to determine w are only 8-bit (even less) non-widening multiplications. “f * g” and “f * f” can be done in parallel. Then there are 3 non-widening multiplications by w, not to mention the 3 parallel swaps/negates. So there is real potential for vector instructions to speed this up. u, v, q, r can easily live in SIMD registers and, according to godbolt, x86-64 clang (trunk) is already doing that (I looked at _fe_divsteps_62).

    I’m particularly keen to see someone apply NEON magic for the 32 bit version (because I’m a sore loser and want to beat gmp everywhere).

  119. real-or-random commented at 10:08 am on August 12, 2020: contributor

    So there is real potential for vector instructions to speed this up. u, v, q, r can easily live in SIMD registers and according to godbolt, x86-64 clang (trunk) is already doing that (I looked at _fe_divsteps_62).

    Neat! You may want to try -ftree-vectorize (included in -O3) in GCC.

  120. Get better control over the range of d, e 2f6dfa2146
  121. real-or-random referenced this in commit 887bd1f8b6 on Aug 12, 2020
  122. sipa commented at 1:02 am on August 13, 2020: contributor
    @peterdettman To formalize assumptions about conversions to signed, and right shifts of signed numbers: https://github.com/bitcoin-core/secp256k1/pull/798
  123. Verify the expected zeros are produced 90743d29ab
  124. _inv_var conditional negations
    - tighten result magnitude for _fe_decode methods
    5de2c83390
  125. Experiment with f,g shortening in inv_var 308fd32e00
  126. peterdettman commented at 7:14 am on August 15, 2020: contributor

    @gmaxwell Your help would be appreciated measuring (for 32bit only) whether the “Experiment with f,g shortening in inv_var” commit is an improvement (you can toggle IS_THIS_FASTER flag in the code).

    On my x86-64 machine, this change slows down the 32bit version, but the equivalent 64bit code is slightly faster, so I am still hopeful that it should be an improvement on real 32bit hardware.

  127. f,g shortening for 64bit field ff0cf1124c
  128. peterdettman commented at 7:31 am on August 15, 2020: contributor
    Might as well test the 64bit one too.
  129. gmaxwell commented at 6:46 pm on August 15, 2020: contributor

    @peterdettman hurrah for optimizations

    arm7 IS_THIS_FASTER 1 field_inverse: min 38.6us / avg 38.6us / max 38.6us field_inverse_var: min 22.7us / avg 22.8us / max 23.2us

    arm7 IS_THIS_FASTER 0 field_inverse: min 38.6us / avg 38.6us / max 38.6us field_inverse_var: min 24.0us / avg 24.1us / max 24.4us

    arm8 IS_THIS_FASTER 1 field_inverse: min 12.0us / avg 12.1us / max 12.1us field_inverse_var: min 6.19us / avg 6.19us / max 6.19us

    arm8 IS_THIS_FASTER 0 field_inverse: min 12.0us / avg 12.1us / max 12.1us field_inverse_var: min 6.50us / avg 6.50us / max 6.51us

  130. THIS_IS_FASTER b51a1b55d9
  131. sipa commented at 6:02 am on August 16, 2020: contributor
    4bm9ot
  132. Accentuate the positive
    (Eliminate the negative)
    1baff2caec
  133. sipa commented at 7:55 pm on August 16, 2020: contributor

    @peterdettman I wonder, is there any literature on (variable-time variants of) this or other LSB-based GCD algorithms, that outperform HGCD-based ones for modular inversions?

    As in: how novel is the fact that this beats libgmp?

  134. peterdettman commented at 8:24 am on August 17, 2020: contributor

    @peterdettman I wonder, is there any literature on (variable-time variants of) this or other LSB-based GCD algorithms, that outperform HGCD-based ones for modular inversions?

    Well, I don’t have too many papers about it. A search for “2-adic gcd modular inversion” returns this safegcd paper as the top result. The (to me) most relevant references are to Brent–Kung and Stehlé–Zimmermann. Indeed my first attempt at this several years ago was based on “A Binary Recursive GCD Algorithm”, except minus the recursive structure (and no constant-time version). The main difference here is a better divstep, and much clearer code (I hope).

    As in: how novel is the fact that this beats libgmp?

    Be a little careful. We’re beating libgmp at 256 bits, and I think even a generalized version of this PR (for arbitrary primes) will still be significantly faster than libgmp at this size (that’s how my Java version is written and it is about the same speed as libgmp even though the Java version uses divsteps30 and 32bit muls). However this PR generalized would be quadratic; any subquadratic algorithm will eventually win out as the size increases.

    Of course the full safegcd algorithm is also subquadratic (read section 1.4 and compare the figure 12.2 image on the right), and one might expect it to typically perform better than the other subquadratic algorithms for a wide range of input sizes for the reasons discussed in 1.4 (at least in the integer case). I’m uncertain about the asymptotic performance comparison though.

    One could reasonably expect safegcd to eventually be adopted by libgmp for var-time modular inversion. Probably it’s even more compelling for the constant-time case.

    My cunning plan:

    1. Convince GMP to implement safegcd.
    2. Copy their hand-optimized, vectorized asm version of divsteps30
    3. Finally our 32bit version is faster than GMP!… umm, wait…
  135. JeanLucPons commented at 2:11 pm on August 17, 2020: none

    I managed to win a few percent on my i5 by pre-computing an array of inverses (mod 64) rather than computing them on the fly.

      int64_t INV64[] = { 
        -0LL,-1LL,-0LL,-43LL,-0LL,-13LL,-0LL,-55LL,-0LL,-57LL,-0LL,-35LL,-0LL,-5LL,-0LL,-47LL,
        -0LL,-49LL,-0LL,-27LL,-0LL,-61LL,-0LL,-39LL,-0LL,-41LL,-0LL,-19LL,-0LL,-53LL,-0LL,-31LL,
        -0LL,-33LL,-0LL,-11LL,-0LL,-45LL,-0LL,-23LL,-0LL,-25LL,-0LL,-3LL,-0LL,-37LL,-0LL,-15LL,
        -0LL,-17LL,-0LL,-59LL,-0LL,-29LL,-0LL,-7LL,-0LL,-9LL,-0LL,-51LL,-0LL,-21LL,-0LL,-63LL };

      ...

            // Handle up to 6 divsteps at once, subject to eta and bitCount.
            limit = (*eta + 1) > bitCount ? bitCount : (*eta + 1);
            m = (UINT64_MAX >> (64 - limit)) & 63U;
            //w = (u0 * v0 * (u0 * u0 - 2)) & m; // w = v0 * -u0^-1 mod 2^6  (1 Newton step => 6bit)
            w = (v0 * INV64[u0 & 63U]) & m;

          } else {

            // Handle up to 4 divsteps at once, subject to eta and bitCount.
            limit = (*eta + 1) > bitCount ? bitCount : (*eta + 1);
            m = (UINT64_MAX >> (64 - limit)) & 15U;
            //w = u0 + (((u0 + 1) & 4) << 1); 
            //w = (-w * v0) & m;   // w = v0 * -u0^1 mod 2^4
            w = (v0 * INV64[u0 & 15U]) & m;
    
  136. peterdettman commented at 4:35 pm on August 17, 2020: contributor
    Yes, I was intending to try something like this, especially for 32bit. Note that u0 should always be odd, so the table could be half-size if you shift u0 down 1 bit. Also, if you have a table for 6 divsteps anyway, you might as well use it in both branches (or rather move it out of the if/else).
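
    To make the half-size idea concrete, a sketch of how such a table could be generated (hypothetical names; this is not the commit’s table): for odd u, u*u == 1 (mod 8), so u is its own inverse mod 8, and one Newton step lifts that to an inverse mod 2^6.

      /* Table of -u^-1 mod 2^6 for odd u, indexed by (u >> 1) & 31. */
      static int64_t neg_inv64[32];
      static void build_neg_inv64(void) {
          int i;
          for (i = 0; i < 32; ++i) {
              uint64_t u = 2 * (uint64_t)i + 1;  /* odd u in [1, 63] */
              uint64_t inv = u;                  /* u^-1 mod 2^3, since u*u == 1 (mod 8) */
              inv *= 2 - u * inv;                /* Newton step: now u^-1 mod 2^6 */
              neg_inv64[i] = -(int64_t)(inv & 63);
          }
      }
      /* usage, mirroring the snippet above: w = (v0 * neg_inv64[(u0 >> 1) & 31]) & m; */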
  137. Try 128 byte table of inverses 65550c1f6d
  138. peterdettman commented at 6:20 pm on August 17, 2020: contributor
    Per @JeanLucPons suggestion, I’ve updated the 32bit version to use a table for calculating w.
  139. gmaxwell commented at 9:21 pm on August 17, 2020: contributor

    I regret to inform you that perhaps 8 was too audacious (or not audacious enough):

    On ARM7,

    Table: scalar_inverse: min 43.3us / avg 43.4us / max 43.5us scalar_inverse_var: min 26.0us / avg 26.0us / max 26.2us field_inverse: min 38.3us / avg 38.3us / max 38.6us field_inverse_var: min 22.2us / avg 22.2us / max 22.2us

    Commit under the table: scalar_inverse: min 43.3us / avg 43.4us / max 43.6us scalar_inverse_var: min 25.9us / avg 25.9us / max 26.0us field_inverse: min 38.3us / avg 38.3us / max 38.4us field_inverse_var: min 22.1us / avg 22.1us / max 22.3us

  140. JeanLucPons commented at 2:09 pm on August 18, 2020: none
    On my i5 the best gain was obtained with 6 bits (outside the eta test), without the >>1 (I store the 0s in the table to avoid the right shift) and using an int64_t array, but the gain is only ~3%.
  141. sipa commented at 11:31 pm on August 18, 2020: contributor

    Just saw this: https://twitter.com/BearSSLnews/status/1295717508012539911

    If I understand it correctly, it’s an MSB-based constant-time extgcd modular inversion which, using BMI2 intrinsics on x86_64, sets a speed record.

  142. sipa commented at 1:38 am on August 19, 2020: contributor

    Looked into it a bit more. Code can be found on https://github.com/pornin/bingcd/blob/main/src/gf25519.c, and a paper on https://eprint.iacr.org/2020/972.

    Where safegcd uses just the lower bits to make step decisions, the algorithm there uses both the top and lower bits (but ignores the middle ones!), and then similarly aggregates them and applies them in bulk to the entire number. Dropping the middle bits means that subtractions may result in the wrong sign when applied to the full number, but this is apparently correctable.

    For constant-timeness it critically relies on a way to count leading zero bits and to shift in constant time. That’s not something that can be guaranteed on every platform, annoyingly. Perhaps this approach is useful as a specialized implementation, but safegcd seems more useful generally as it is constant time and has decent performance on generic hardware.

    For performance it relies on quick computation of leading zeroes, and a quick way to compute (ab+cd) mod p (with p the modulus for the inverse). His numbers for ed25519 on BMI2 hardware (where leading zeroes can be computed very quickly with lzcnt, and multiplications mod p are easy, p being close to a power of 2) definitely beat the (non-assembly-optimized) constant-time code here, by almost a factor of 1.5. But as said, it doesn’t seem generally usable for that purpose.

    As a competitor for a variable-time algorithm, the code in this PR is already ~20% faster. Of course, a variable-time variant of the algorithm there may be better (and much more applicable as a fast variable-time leading-zero-bit counter isn’t hard), but would need work to see what variable-time optimizations apply.

  143. peterdettman commented at 7:57 am on August 19, 2020: contributor

    @sipa Thanks for linking this paper (henceforth “optbingcd”), but how are you reading an (almost) 1.5 factor advantage? The only thing I can think is that you have compared 744 divsteps to 508 iterations.

    His numbers for ed25519 on BMI2 hardware (where leading zeroes can be computed very quickly with lzcnt, and multiplications mod p are easy (p being close to a power of 2) definitely beat the (non-assembly-optimized) constant-time code here, by almost a factor 1.5.

    The optbingcd paper claims 7490 vs 8520 cycles for safegcd on Coffee Lake (Kaby Lake). safegcd reported “10050 cycles, 8778 cycles, and 8543 cycles” for “Haswell, Skylake, and Kaby Lake respectively”.

    Now, I really need to get a proper cycles benchmark for this PR (help certainly appreciated here), but based only on my latest 2.6GHz Haswell timings (field_inverse in 2.87us, -O3 to match their CFLAGS), scaled to Coffee Lake based on the safegcd results, one could make a very rough guess of ~6343 cycles on Coffee Lake - 18% faster than optbingcd. Winning!
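
    (The rough arithmetic behind that guess, assuming the safegcd paper’s Kaby Lake/Haswell ratio carries over to this code:)

      2.87us x 2.6GHz          =  7462 cycles on Haswell
      7462 x (8543 / 10050)    ~  6343 cycles scaled to Kaby/Coffee Lake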

    As a sanity check (against possible turbo or H/T), field_inverse_var (gmp) takes ~2.20us for essentially the same benchmark, variable-time.

    As you note, this PR is still pure C, whereas optbingcd relies on inline assembly and intrinsics for its speed (from the paper: “This highlights the importance of optimizing that inner loop as much as possible”). Bo-Yin Yang was also kind enough to send me the safegcd source code for comparison, and it should not surprise anyone that its inner loop is written entirely in qhasm. The optbingcd paper reports 4572 of its 7490 cycles are spent in the inner loop (61%) on Coffee Lake. _fe_divsteps_62 is more like 80-90% of the total time for _fe_inv. Any/every cycle saved per divstep (inner loop iteration) is 744 fewer cycles total, although there are maybe only 8-9 cycles there in the first place.

    Can we get an x86-64 asm wizard to help us knock this one out of the park? (ARM/NEON too, obviously). #SaveACycle.

    I will note though that optbingcd could probably be made faster by adopting the (r,s)-updating approach from this PR (thanks again to @JeanLucPons), instead of the up-to-double-length (u,v) plus final multiplication that appears to be used for the same purpose.

  144. JeanLucPons commented at 7:58 am on August 19, 2020: none
      (compared with the classic binary GCD, the use of the approximations
      may make the code "miss one bit", but the conditional subtraction
      regains it)
    

    This reminds me of my former implementation, where I do not use eta or even the MSB bits to take the swap decision; I do an extra swap and subtract if u0 remains odd. That results in losing one bit in two in the worst case and can end up taking 2 times more iterations. Here, as they have a 32-bit approximation of the MSB, the probability of losing a bit is really low.

    In this code, they seem to do as in this PR for the update of the Bezout coefficients, but performing a weak normalization at each step (they do not perform a Montgomery reduction) and normalizing at the end (modmul by GF_INVT508). I’m not convinced that this method is faster than the method used in this PR, and I’m not sure that a var-time optimization of this code will beat this PR.

  145. real-or-random commented at 10:07 am on August 19, 2020: contributor

    _fe_divsteps_62 is more like 80-90% of the total time for _fe_inv. Any/every cycle saved per divstep (inner loop iteration) is 744 fewer cycles total, although there are maybe only 8-9 cycles there in the first place.

    It will be interesting to benchmark the difference between clang and gcc because clang uses SIMD as you noted above.

    https://godbolt.org/z/sd18M9

  146. peterdettman commented at 12:47 pm on August 19, 2020: contributor
    Add e.g. -march=haswell in compiler explorer to see clang go “full vector”. However, I have tested clang 10.0.1 for the 64bit case, and the performance is not very different. Actually the constant-time code benchmarks slightly faster by targeting x86-64 instead of haswell, and the opposite is true for the var-time version.
  147. sipa commented at 11:21 pm on August 21, 2020: contributor

    @peterdettman

    @sipa Thanks for linking this paper (henceforth “optbingcd”), but how are you reading an (almost) 1.5 factor advantage? The only thing I can think is that you have compared 744 divsteps to 508 iterations.

    I just benchmarked this PR’s constant time version (on an i7-7820HQ Kaby Lake, with CPU frequency locked at 2.6GHz) against the optbingcd’s reported numbers. That wasn’t very fair, as I don’t have a Coffee Lake CPU, and was using the default -O2, and without -march.

    I redid the numbers now, by running the optbingcd code myself, and using -O3 -march=native for both this PR and optbingcd. I couldn’t get all of optbingcd to work (the asm multiplication test fails, so I had to revert to using the __int128 version - but this was only a ~1-2% performance penalty). Still:

    field_inverse: min 4.10us / avg 4.10us / max 4.10us

    inversion (binary GCD): 9437.84 (9424.17 .. 9449.19)

    At 2.6GHz, 4.10us = 10660 cycles, so optbingcd is around 13% faster still.

  148. sipa commented at 11:26 pm on August 21, 2020: contributor
    Ping @pornin. In case you’re interested, we’re discussing fast modular inversion algorithms (constant and variable time), in particular safegcd-based code, but also your binary gcd algorithm.
  149. peterdettman commented at 7:57 am on August 22, 2020: contributor

    @sipa For completeness, could you please report a gmp (i.e. master branch) benchmark under the same conditions.

    The optbingcd paper says “[..] Coffee Lake and Kaby Lake use the same internal scheduler and execution units, and differ only in external pieces that should not matter much for a compact, L1-cache only computational piece of code.”. I’m thus still a bit confused as to how you can get ~9400 cycles when the paper itself reports ~7500 cycles.

  150. peterdettman commented at 11:27 am on August 22, 2020: contributor

    Although I’m having trouble figuring out how to disable turbo boost on macos catalina, I was at least able to use Intel Power Gadget to observe the actual CPU frequency. Then running one bench_internal per core I’m able to get the CPU to settle at 2.8GHz and I get timings around 3.87us in this way, indicating ~10,850 cycles. Very rough, but broadly confirms your result.

    That’s somewhat disappointing, but I guess it makes sense, all things considered. On the bright side, the divstep inner loop being more like 12-13 cycles makes it more reasonable to expect a 2-3 cycle saving from an asm version.

  151. real-or-random commented at 3:28 pm on August 22, 2020: contributor

    On the bright side, the divstep inner loop being more like 12-13 cycles makes it more reasonable to expect a 2-3 cycle saving from an asm version.

    Moving the eta assignment to the end of the loop already makes it ~3% faster on my machine (Kaby Lake, gcc, -O2).

  152. sipa commented at 11:13 pm on August 22, 2020: contributor

    @peterdettman

    All benchmarks on i7-7820HQ pinned at 2.6 GHz, compiled with GCC 9.3.0.

    On master (670cdd3f8be25f81472b2d16dcd228b0d24a5c45), -O3 -march=native

    • field_inverse: min 6.78us / avg 6.79us / max 6.82us
    • field_inverse_var: min 2.80us / avg 2.81us / max 2.85us

    On master (670cdd3f8be25f81472b2d16dcd228b0d24a5c45), -O2

    • field_inverse: min 6.77us / avg 6.78us / max 6.79us
    • field_inverse_var: min 2.81us / avg 2.82us / max 2.83us

    On safegcd (65550c1f6d44da2bd9d72d6fd2256a6cba0fd828), -O3 -march=native

    • field_inverse: min 4.10us / avg 4.11us / max 4.13us
    • field_inverse_var: min 2.12us / avg 2.13us / max 2.13us

    On safegcd (65550c1f6d44da2bd9d72d6fd2256a6cba0fd828), -O2

    • field_inverse: min 4.23us / avg 4.23us / max 4.24us
    • field_inverse_var: min 2.35us / avg 2.35us / max 2.36us

    optbingcd, -O3 -march=native

    • inversion (binary GCD): 9442.39 (9424.53 .. 9526.05)

    optbingcd, -O2

    • inversion (binary GCD): 9860.63 (9853.78 .. 9949.89)
  153. pornin commented at 6:53 pm on August 23, 2020: none

    Hello! I have been summoned…

    It turned out that there were some edge cases that my algorithm mishandled (and the proof was wrong at one point). However, the fixed version of the algorithm is even faster. On my Coffee Lake core, the cost is down to 6253 cycles.

    The problem was in doing 32 iterations: as values shrink in the algorithm, the approximation loses accuracy. After 31 iterations, the loss can be severe enough that it may induce the algorithm to take a very wrong decision which makes one of the values substantially grow. This is quite improbable in practice, but in cryptography we cannot rely on luck because adversaries may try to exercise such cases on purpose.

    The repaired algorithm does only 31 iterations in the inner loop; this means that “updates” (application of the computed update factors to the big integers) must happen once every 31 iterations instead of once every 32 iterations; this avoids the bad edge cases, at the cost of basically one extra update overall (about 220 cycles in my code). However, doing only 31 iterations means that update factors can now be stored more efficiently: I can put two of them in the same 64-bit register. This speeds up the inner loop, now down to about 6.17 cycles per iteration (instead of 9 cycles per iteration). Hence the better performance (and with an algorithm which is now, hopefully, correct).

    If you want to stick to “pure C”, the main problem should be convincing the C compiler to emit a code sequence that uses the appropriate cmov opcodes. Usually, when writing constant-time code, we use boolean combinations, but cmov should be faster. This unfortunately depends quite a lot on the compiler brand (GCC and Clang don’t behave the same here), version, and optimization flags.

  154. sipa commented at 1:02 am on August 24, 2020: contributor

    @pornin I realized I was compiling with GCC, while you were intending (and probably optimized for) compilation with clang. The difference is huge (and bigger than before, with your latest change). GCC needs 9600 cycles for me; clang needs only 6970. Any intuition as to what GCC is doing wrong? @peterdettman So optbingcd is now actually 1.5x faster than the (constant time) safegcd in this PR, but variable-time safegcd is another 1.25x faster than that still.

    If you want to stick to “pure C”, the main problem should be convincing the C compiler to emit a code sequence that uses the appropriate cmov opcodes. Usually, when writing constant-time code, we use boolean combinations, but cmov should be faster. This unfortunately depends quite a lot on the compiler brand (GCC and Clang don’t behave the same here), version, and optimization flags.

    Yeah, we’re relying on cmovs in this codebase already for certain constant-time algorithms, and have valgrind-based tests that the result is in fact constant-time on a number of platforms. Though, is cmov sufficient for your algorithm? It seems to me you need constant-time leading-zero-counting as well, which I’m not sure is possible on every system.

    One thing I’d be interested to hear is whether you think that a variable-time version of your algorithm would be possible that improves upon the runtime significantly. The variable-time safegcd variant here beats libgmp on x86_64 already, so it does seem there are possibilities.

    PS: we’ve met at HACS

  155. peterdettman commented at 6:26 am on August 24, 2020: contributor

    update factors can now be stored more efficiently: I can put two of them in the same 64-bit register.

    @pornin I’m still trying to understand how to cleanly manage the operations on the shared registers, but it sounds like a neat idea that might be worth trying in this PR by splitting divsteps_62 into 2 divsteps_31 or something similar. It will be offset by some extra matrix multiplications, but on paper it looks net positive.

    If you want to stick to “pure C”, the main problem should be convincing the C compiler to emit a code sequence that uses the appropriate cmov opcodes.

    Yes, and it’s quite annoying. The only idiom I can find that reliably generates cmov-like instructions is an actual if, and when you have several of them in a group, I’ve found e.g. gcc likes to guess an actual branch will be faster (hard to complain about that when your C code has literal branches).

    @sipa If one manages to convince clang 10.0.1 to generate conditional moves (https://godbolt.org/z/Mzafe7), then I measure an improvement of >30% for _fe_inv, ~8200 haswell cycles. I expect on your CPU this might be closer to ~15% slower than this updated optbingcd.

  156. sipa commented at 7:52 am on August 24, 2020: contributor
    @peterdettman That seems easy to do with a bit of inline asm. We could have functions for “cmov_var” operations that translate to native cmov instructions on x86_64 (and possibly others), and conditionals otherwise.
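
    One possible shape for such a helper (a hypothetical sketch, not a tested implementation): a native cmov via inline asm on x86_64, and a plain conditional elsewhere.

      static SECP256K1_INLINE void cmov64_var(uint64_t *r, uint64_t a, uint64_t flag) {
      #if defined(__x86_64__) && defined(__GNUC__)
          /* If flag != 0, move a into *r with a conditional move instead of a branch. */
          __asm__ ("test %2, %2\n\t"
                   "cmovne %1, %0"
                   : "+r" (*r)
                   : "r" (a), "r" (flag)
                   : "cc");
      #else
          if (flag) *r = a;   /* variable time is fine here; the asm path just avoids a mispredictable branch */
      #endif
      }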
  157. JeanLucPons commented at 8:40 am on August 24, 2020: none

    @pornin I’m trying your optbingcd, especially the case of starting with len(a)==len(b) and msb(a)==msb(b), which should increase the number of wrong decisions. Here are my results. Maybe I’m doing something wrong…

    0Test add: ......................... done.
    1Test mul: ......................... done.
    2Test inv: ...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................ERR: INV:
    3a = 57896044618658097711785492504343953926634992332820282019710345259882855656616
    4b = 51314689404857564656982019773616219587230925979562315155017551773576055083230
    5a = 0x7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF000000000005ECA8
    6b = 0x717315BF9C729435E3DB956DCD6164B5CA4C6371C6891AC722B267CFF0B9D0DE
    7c = 0x8000000000000000000000000000000000000000000000000000000000000016
    
  158. real-or-random commented at 9:06 am on August 24, 2020: contributor

    Given that (at least for safegcd) a variable-time algorithm is much faster, I wonder whether blinding is a better strategy to resist timing channels.

    Given input x, draw a pseudorandom blinding factor b, and compute (x * b)^-1 * b = x^-1. The overhead is 2 muls and deriving b. The muls are cheap and an optimized implementation of ChaCha12 (should be enough for this purpose) needs around 300-600 cycles for a single block depending on the architecture.
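
    A minimal sketch of that idea at the field level (hypothetical function name; it assumes a uniformly random nonzero blinding factor b is already available, and deriving b is the part that would need ChaCha12 or similar):

      static void fe_inv_blinded(secp256k1_fe *r, const secp256k1_fe *x, const secp256k1_fe *b) {
          secp256k1_fe t, u;
          secp256k1_fe_mul(&t, x, b);     /* t = x*b, uniformly random if b is */
          secp256k1_fe_inv_var(&u, &t);   /* variable-time inversion, but only on the blinded value */
          secp256k1_fe_mul(r, &u, b);     /* (x*b)^-1 * b = x^-1 */
      }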

    Anyway, I don’t think we should be crazy about the performance of constant-time inversions. Variable-time/verification performance is more important for this library.

  159. elichai commented at 10:35 am on August 24, 2020: contributor

    Damn, this is quite a bit faster than libgmp. I remember looking into #290 and it was really hard to beat them, and this is 50+% faster than libgmp :O

    master + libgmp:

      scalar_inverse: min 15.8us / avg 15.9us / max 16.1us
      scalar_inverse_var: min 3.14us / avg 3.17us / max 3.22us
      field_inverse: min 7.39us / avg 7.42us / max 7.52us
      field_inverse_var: min 3.11us / avg 3.12us / max 3.15us
    

    with this PR:

      scalar_inverse: min 4.61us / avg 4.66us / max 4.73us
      scalar_inverse_var: min 2.06us / avg 2.09us / max 2.13us
      field_inverse: min 4.60us / avg 4.65us / max 4.76us
      field_inverse_var: min 2.04us / avg 2.04us / max 2.06us
    
  160. pornin commented at 11:47 am on August 24, 2020: none

    @JeanLucPons On my machine, with your example, my code returns the correct result:

    0a = 0x7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF000000000005ECA8
    1b = 0x2E7970EBB2A522D59B69421568E9665BD93A0EE9CCA5AF7B7DB96CADA8367BB4
    2c = 0x7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEE
    

    I don’t know what went wrong in your case, but the algorithm should work fine for this one too.

  161. JeanLucPons commented at 11:54 am on August 24, 2020: none

    @JeanLucPons On my machine, with your example, my code returns the correct result: I don’t know what went wrong in your case, but the algorithm should work fine for this one too.

    Thanks for the answer. It was because I had to split the _lzcnt_u64() into 2 __builtin_clz() calls, but I wasn’t aware that __builtin_clz(0) was undefined (I would expect 32). That’s ok, now I get the correct result too :)
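
    For reference, a guarded split along those lines (hypothetical helper; since __builtin_clz(0) is undefined, the zero cases are handled explicitly):

      static inline int clz64_via_clz32(uint64_t x) {
          uint32_t hi = (uint32_t)(x >> 32);
          uint32_t lo = (uint32_t)x;
          if (hi != 0) return __builtin_clz(hi);
          if (lo != 0) return 32 + __builtin_clz(lo);
          return 64;  /* both halves zero: define clz(0) = 64 rather than hitting UB */
      }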

  162. gmaxwell commented at 12:58 pm on August 24, 2020: contributor

    @real-or-random

    Given that (at least for safegcd), a variable-time algorithm is much faster, I wonder whether blinding is a better strategy to resist timing channels.

    I think I suggested doing that in an issue but this work has made it less attractive. One advantage of a truly constant time rather than blinded algorithm is that it makes it much easier to test hardware for EMI/power/timing sidechannels directly, since the traces will naturally be aligned and, assuming alignment and cache state are managed, they should take exactly the same amount of time down to the cycle for every run. The safegcd and variable time safegcd are also extremely similar so it’s not entirely redundant.

    When the performance difference was a factor of 5 I was leaning towards blinding plus the variable-time code being worth the above costs… at 2.2 it’s at least a little less attractive.

    Blinding might be slightly less costly than you think– right now the ecmult_gen uses a random projection for the initial point (secp256k1_gej_rescale). I believe that alone actually renders the input to the final inversion uniformly random, though it would deserve careful analysis. If it does, then it’s already blinded. (well, almost, since the rescale currently only happens on randomize– but that is already something that should get fixed independent of anything being done with the inversion).

  163. pornin commented at 1:10 pm on August 24, 2020: none

    @sipa For constant-time counting of leading zeros, the usual idiom is something like this:

      static inline uint64_t
      count_leading_zeros(uint64_t x)
      {
              uint64_t r, c;

              r = 0;
              c = -(((x >> 32) - 1) >> 63);
              r += c & 32;
              x ^= c & (x ^ (x << 32));
              c = -(((x >> 48) - 1) >> 63);
              r += c & 16;
              x ^= c & (x ^ (x << 16));
              c = -(((x >> 56) - 1) >> 63);
              r += c & 8;
              x ^= c & (x ^ (x << 8));
              c = -(((x >> 60) - 1) >> 63);
              r += c & 4;
              x ^= c & (x ^ (x << 4));
              c = -(((x >> 62) - 1) >> 63);
              r += c & 2;
              x ^= c & (x ^ (x << 2));
              c = -(((x >> 63) - 1) >> 63);
              r += c & 1;
              x ^= c & (x ^ (x << 1));
              r += 1 - ((x | -x) >> 63);
              return (int)r;
      }
    

    Experimentally, this leads to an inversion cost of about 6912 cycles, i.e. an overhead of about 659 cycles compared to the version with _lzcnt_u64() (since this routine is invoked 15 times, this would put its cost at about 44 cycles). There might be better methods.

    In a non-constant-time setup, there is a fast algorithm described by Harley in a newsgroup in 1996 (https://groups.google.com/g/comp.arch/c/daG3ld_SCA0/m/6rpQegohhF8J). It basically involves first propagating the highest 1 so that you get the value 2**n-1 (where the highest non-zero bit is at index n-1), then finding out which of the 32 possible values (for 32-bit words) you got with a multiplication and a lookup in a small table. The lookup makes it non-constant-time; it might be possible to make it constant-time by using the lookup index as a shift count and doing a few shifts of constant values. I suppose a 64-bit version is feasible along the same lines.

  164. pornin commented at 1:19 pm on August 24, 2020: none

    @sipa About a variable-time shortcut: as the algorithm advances, the internal a and b values shrink, and the algorithm converges when a reaches zero. You could make a shortcut at that point. This can be done efficiently at the start of the outer loop:

          if ((a.v0 | a.v1 | a.v2 | a.v3) == 0) {
              // algorithm has converged, exit now
          }
    

    Take care that if you do a lower number of iterations, then the corrective factor (the 1/(2**508)) must be adjusted accordingly. If you allow an exit at the start of each outer iteration, then that’s 15 possible corrective factors, so 15 hardcoded values (which will take 480 bytes of read-only data, nothing too wild here).

    I don’t know how much time can be saved that way on average. I suppose you can shave off a couple of iterations, so maybe 10-15% on average? This should be measured.

    PS: HACS is very cool, too bad there won’t be a physical one this year. Speaking of which, this kind of code is exactly what should be formally verified; there are so many possibilities for rare edge cases with carry propagation!

  165. JeanLucPons commented at 2:54 pm on August 24, 2020: none

    I’m trying to evaluate the number of steps needed before a reaches 0 with the optbingcd method, and it seems very good. Here is a histogram for random inputs (no fail): [0,0,0,0,0,0,0,0,0,0,21548,875516,102932,4,0] (INV_INNER_FAST count, starting at 1; the last item is the 43xINV_INNER). divstep_62() needs ~9 loops on average.

    However, while trying to see how this distribution evolves around the boundaries, I got another failure :( I will try to find a host with the required instructions….

    0a = 0x7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFED00
    1b = 0x2B404DC70A68B991255B404DC70A68B991255B404DC70A68B991255B404DC704
    2c = 0x6FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBA
    
  166. real-or-random commented at 3:03 pm on August 24, 2020: contributor

    @pornin

    https://github.com/hcs0/Hackers-Delight/blob/master/ntz.c.txt is a nice collection of ctzl algorithms. One of those algorithms is in this PR (https://github.com/bitcoin-core/secp256k1/pull/767/commits/06d568a7e6596aae9a6837bb2eee3108347cbc3a) and ntz7 is a constant-time algorithm

  167. pornin commented at 4:08 pm on August 24, 2020: none

    @JeanLucPons My bad, that bug was mine. Algorithm is fine, implementation needed a couple extra lines. Now it goes fine, and, for some reason, it’s slightly faster (by about 20 cycles).

    The bug was that, when reaching the final stages, I was reusing the approximate values xa and xb from the previous round, instead of reading the “true” ones (which are at most 44 bits at that point). The bug would happen when, at that stage, the values were larger than 32 bits, which does not in fact happen often (and was not triggered by the randomized tests).

  168. real-or-random commented at 11:39 pm on August 24, 2020: contributor

    I think I suggested doing that in an issue but this work has made it less attractive. One advantage of a truly constant time rather than blinded algorithm is that it makes it much easier to test hardware for EMI/power/timing sidechannels directly, since the traces will naturally be aligned and assuming alignment and cache state are managed they should take exactly the same amount of time down to the cycle for every run.

    Yeah, though I’m not aware of anybody doing these tests currently. Yes, we have the valgrind check but here, a declassify/classify is ok. And an argument in favor of blinding is that it does not rely on the compiler gods not introducing branches etc.

    The safegcd and variable time safegcd are also extremely similar so it’s not entirely redundant.

    I don’t understand. What is not redundant?

    When the performance difference was a factor of 5 I was leaning towards blinding and using the variable time being worth the above costs… at 2.2 it’s at least a little less attractive.

    Right but I believe 2.2 is still a lot and a blinded version would also be simpler (less code).

    Blinding might be slightly less costly than you think– right now the ecmult_gen uses a random projection for the initial point (secp256k1_gej_rescale). I believe that alone actually renders the input to the final inversion uniformly random, though it would deserve careful analysis. If it does, then it’s already blinded. (well, almost, since the rescale currently only happens on randomize– but that is already something that should get fixed independent of anything being done with the inversion).

    Hm, indeed. Note that we could derive the blinding factor via the secret key that we anyway get in signing (like deterministic signing), so there’s no need to update the context.

  169. sipa commented at 2:01 am on August 25, 2020: contributor

    I believe that alone actually renders the input to the final inversion uniformly random, though it would deserve careful analysis.

    @gmaxwell I think that’s almost always true, but nontrivial to reason about. Our constant-time group addition is weird, and the output Z coordinate is 2(Z1·Y1 + Z1^4·Y2), so it’s not just propagating a factor with Z nicely (like doubling does). I suspect this formula for uniform Z1 isn’t exactly hiding all information about Y1 and Y2 (perhaps the quadratic residue of some function of Y1 and Y2 gets through, or so), and the fact that Y1 (especially early in the algorithm) is correlated with Z1 makes it even harder. Then there is the weird edge case to deal with additions of opposing Y, which cmovs another formula in place.

    @real-or-random I’m not sure how I feel about relying exclusively on blinding for protection against timing leaks. It suddenly makes a lot of code (including the (re)blinding logic itself) much more critical, and only protected by algebraic reasoning that’s much harder to test for.

    As far as it reducing complexity… true, having only variable-time inverse would gain us something. But if safegcd is the approach we take, there is also code shared between the variable-time and constant-time version, making it less “extra”.

  170. peterdettman commented at 3:45 am on August 25, 2020: contributor

    Blinding might be slightly less costly than you think– right now the ecmult_gen uses a random projection for the initial point (secp256k1_gej_rescale). I believe that alone actually renders the input to the final inversion uniformly random, though it would deserve careful analysis.

    If during ecmult_gen the accumulator passes through infinity, then the random projective factor is lost, at least as the current code stands. _gej_add_ge seems to deal with possible a->z == 0, but it’s unclear to me whether it can otherwise produce an r->z == 0 output. If we never set z to 0, then the “fixup” code in the last few lines could set the result to (u2, s2, a->z) instead of (b->x, b->y, 1) and the projective randomization might be preserved in this way.

  171. sipa commented at 4:05 am on August 25, 2020: contributor

    @sipa If one manages to convince clang 10.0.1 to generate conditional moves (https://godbolt.org/z/Mzafe7), then I measure an improvement of >30% for _fe_inv, ~8200 haswell cycles. I expect on your CPU this might be closer to ~15% slower than this updated optbingcd.

    ~@peterdettman Where exactly do you need cmovs?~

    EDIT: nevermind, it’s obvious from your godbolt link.

  172. mratsim commented at 7:26 am on August 25, 2020: none

    @real-or-random

    Given that (at least for safegcd), a variable-time algorithm is much faster, I wonder whether blinding is a better strategy to resist timing channels.

    I think I suggested doing that in an issue but this work has made it less attractive. One advantage of a truly constant time rather than blinded algorithm is that it makes it much easier to test hardware for EMI/power/timing sidechannels directly, since the traces will naturally be aligned and assuming alignment and cache state are managed they should take exactly the same amount of time down to the cycle for every run. The safegcd and variable time safegcd are also extremely similar so it’s not entirely redundant.

    When the performance difference was a factor of 5 I was leaning towards blinding and using the variable time being worth the above costs… at 2.2 it’s at least a little less attractive.

    Blinding might be slightly less costly than you think– right now the ecmult_gen uses a random projection for the initial point (secp256k1_gej_rescale). I believe that alone actually renders the input to the final inversion uniformly random, though it would deserve careful analysis. If it does, then it’s already blinded. (well, almost, since the rescale currently only happens on randomize– but that is already something that should get fixed independent of anything being done with the inversion).

    There are recent advances in attack vectors (i.e. machine learning and de-blinding techniques) against blinded ECC; while I don’t think any paper has been written on scalar inversion specifically, the techniques might be relevant.

  173. Avoid redundant calculation 5ccfc30aaf
  174. peterdettman commented at 9:34 am on August 25, 2020: contributor
    Small tweak to const-time divsteps methods gives significant speed boost.
  175. real-or-random commented at 11:31 am on August 25, 2020: contributor

    Small tweak to const-time divsteps methods gives significant speed boost.

    Nice, I missed that one when looking at the code.

    Moving the eta assignment to the end of the loop makes it already ~3% faster on my machine (Kabylake, gcc, -O2).

    Did you try this one too? As I said, it resulted in faster code on my machine (at least with the previous commit).

  176. peterdettman commented at 11:58 am on August 25, 2020: contributor

    Moving the eta assignment to the end of the loop makes it already ~3% faster on my machine (Kabylake, gcc, -O2).

    Did you try this one too? As I said, it resulted in faster code on my machine (at least with the previous commit).

    Sorry for not replying to that. I have tried it and the differences are in the noise for me. If I am alone in that I’m happy to move it of course.

  177. pornin commented at 1:46 pm on August 25, 2020: none

    @sipa I just added some GCC compatibility to my code. Three things were required:

    • On my machine (Ubuntu 20.04), GCC defaults to building position-independent executables, but Clang does not. PIE implies some constraints on relocations; in the (assembly) routines for squaring and multiplications, I had opcodes that referenced a constant value in RAM. In order to make it work with PIE, I made that access relative to rip (which, in particular, removes any relocation). This does not change performance.

    • My GCC installation does not seem to have the __rdtsc() intrinsic; I have to use inline assembly for that (this is in the test code only, for benchmarks).

    • GCC was optimizing the multiplications and squarings a bit too much: the generated functions were empty! Apparently, declaring that a piece of inline assembly receives some pointers to data as input and clobbers "memory" is not enough for GCC to assume that the assembly code may indeed modify the data which is pointed to. It works with Clang, but GCC removes everything. A few __volatile__ qualifiers fix that.

    None of the changes impacts performance with Clang. Performance with GCC is not as good; with -O2 -march=native, I get about 7400 cycles for inversion (with -O3 it goes over 8000). My guess is that GCC has trouble with the multiplications into 128-bit integers (unsigned __int128) and/or with the additions-with-carry (_addcarry_u64(), _subborrow_u64()).
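    For readers unfamiliar with those two constructs, a standalone illustration (not code from either implementation) of what the compiler has to digest here:

     #include <stdint.h>
     #include <immintrin.h>   /* _addcarry_u64 on GCC/Clang */

     /* Compute a*b + c as a 128-bit result: a full 64x64->128 multiply via
      * unsigned __int128, then a carried addition via _addcarry_u64. */
     static void mul_add_128(uint64_t a, uint64_t b, uint64_t c, uint64_t out[2]) {
         unsigned __int128 p = (unsigned __int128)a * b;
         unsigned long long lo = (unsigned long long)p;
         unsigned long long hi = (unsigned long long)(p >> 64);
         unsigned char carry = _addcarry_u64(0, lo, c, &lo);
         _addcarry_u64(carry, hi, 0, &hi);
         out[0] = lo;
         out[1] = hi;
     }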

    One may argue that if I write some routines in assembly, then maybe I should write everything in assembly. At least I would get performance that does not depend on compiler version and flags…

  178. sipa commented at 2:50 pm on August 25, 2020: contributor
    @pornin I actually opened a PR against your repo fixing all 3 issues ;)
  179. pornin commented at 3:08 pm on August 25, 2020: none
    @sipa Ah! Sorry, I hadn’t seen that. I tend to use GitHub as a push-only publication system.
  180. real-or-random commented at 10:01 pm on August 25, 2020: contributor

    There are recent advances in attack vectors (i.e. machine learning and de-blinding techniques) against blinded ECC; while I don’t think any paper has been written on scalar inversion specifically, the techniques might be relevant.

    * Side-Channel Attacks on Blinded Scalar Multiplications Revisited
      Thomas Roche, Laurent Imbert, Victor Lomné
      https://eprint.iacr.org/2019/1220
    
    * One Trace Is All It Takes: Machine Learning-Based Side-Channel Attack on EdDSA
      Léo Weissbart, Stjepan Picek, Lejla Batina, 2019
      https://eprint.iacr.org/2019/358
    

    I don’t think these papers are relevant. The second is in fact an attack on a constant-time implementation and the first is an attack on blinding using small blinding factors. What I suggested is to blind with a “full” scalar (chosen uniformly from the set of all scalars). Assuming the blinding itself is correct and constant-time, and a new blinding factor is derived for every inversion, this should not leave any further potential for side channels (or any channels – you could in fact send the blinded input of the inversion algorithm to the attacker and the attacker would still not learn anything).
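    Concretely, the kind of blinding being discussed is just (x·r)^-1 · r = x^-1 for a fresh uniformly random nonzero scalar r. A rough sketch, using the library's scalar functions for concreteness (the helper itself and random_scalar_nonzero are hypothetical, not part of this PR):

     /* Sketch only: invert x by first multiplying it with a uniformly random
      * nonzero scalar r, so the (possibly variable-time) inversion never sees
      * secret-dependent data, then unblind by multiplying with r again. */
     static void scalar_inverse_blinded(secp256k1_scalar *res, const secp256k1_scalar *x) {
         secp256k1_scalar r, t;
         random_scalar_nonzero(&r);             /* hypothetical: fresh uniform blinding factor */
         secp256k1_scalar_mul(&t, x, &r);       /* t = x*r is uniformly random                 */
         secp256k1_scalar_inverse_var(&t, &t);  /* variable-time inverse of the blinded value  */
         secp256k1_scalar_mul(res, &t, &r);     /* (x*r)^-1 * r = x^-1                         */
     }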

    Blinding has other issues for sure, e.g., what @sipa said.

    @real-or-random I’m not sure how I feel about relying exclusively on blinding for protection against timing leaks. It suddenly makes a lot of code (including the (re)blinding logic itself) much more critical, and only protected by algebraic reasoning that’s much harder to test for.

    Yeah, it’s entirely true that testing this is hard. It’s like testing the pseudorandomness of a hash function, and there you can at least have test vectors (admittedly with all the problems that they have). I see your point but I don’t think there’s a clear case in favor of constant-time algorithms. One reason not to use blinding is simply that we don’t use it anywhere else. Keeping everything constant time is just a simple and clear design.

    (I wanted to argue that blinding protects against other classes of bugs. If there is a bug that is triggered by a negligible set of inputs, e.g., in the carry handling, then the attacker has a hard time finding an input. But this argument is pointless for our use case of secret inversions, where the input is uniformly random anyway.)

  181. sipa commented at 10:33 pm on August 25, 2020: contributor

    @peterdettman patch to make 5x52 field use cmovs:

     --- a/src/field_5x52_impl.h
     +++ b/src/field_5x52_impl.h
     @@ -563,12 +563,33 @@ static uint64_t secp256k1_fe_divsteps_62(uint64_t eta, uint64_t f0, uint64_t g0,
          int i;
      
          for (i = 0; i < 62; ++i) {
     +        uint64_t neta;
      
              VERIFY_CHECK((f & 1) == 1);
              VERIFY_CHECK((u * f0 + v * g0) == f << i);
              VERIFY_CHECK((q * f0 + r * g0) == g << i);
      
              c2 = -(g & 1);
     +#ifdef USE_ASM_X86_64
     +        c1 = eta >> 63;
     +        x = -f;
     +        y = -u;
     +        z = -v;
     +        neta = -eta;
     +        __asm__ volatile (
     +           "test %[c1], %[g];\n"
     +           "cmovneq %[g], %[f];\n"
     +           "cmovneq %[x], %[g];\n"
     +           "cmovneq %[q], %[u];\n"
     +           "cmovneq %[y], %[q];\n"
     +           "cmovneq %[r], %[v];\n"
     +           "cmovneq %[z], %[r];\n"
     +           "cmovneq %[neta], %[eta];\n"
     +           : [f]"+&r"(f), [g]"+&r"(g), [u]"+&r"(u), [q]"+&r"(q), [v]"+&r"(v), [r]"+&r"(r), [eta]"+&r"(eta)
     +           : [c1]"r"(c1), [x]"r"(x), [y]"r"(y), [z]"r"(z), [neta]"r"(neta)
     +           : "cc");
     +        --eta;
     +#else
              c1 = c2 & ((int64_t)eta >> 63);
      
              x = (f ^ g) & c1;
     @@ -581,6 +602,7 @@ static uint64_t secp256k1_fe_divsteps_62(uint64_t eta, uint64_t f0, uint64_t g0,
              v ^= z; r ^= z; r ^= c1; r -= c1;
      
              eta = (eta ^ c1) - c1 - 1;
     +#endif
      
              g += (f & c2); g >>= 1;
              q += (u & c2); u <<= 1;
    
  182. sipa commented at 10:55 pm on August 25, 2020: contributor

    Benchmarks on i7-7820HQ pinned at 2.6 GHz (all -O2, with GMP):

                 field_inv                 field_inv_var
                 master safgcd cmov        master safgcd

     gcc-4.7:    6.75us 3.93us 4.28us      2.80us 2.29us
     gcc-4.9:    6.76us 3.75us 3.24us      2.81us 2.30us
     gcc-5:      6.76us 3.87us 3.23us      2.79us 2.41us
     gcc-6:      6.74us 3.80us 3.14us      2.79us 2.30us
     gcc-7:      6.76us 3.85us 3.11us      2.79us 2.32us
     gcc-8:      6.77us 3.85us 3.21us      2.80us 2.36us
     gcc-9:      6.76us 3.88us 3.09us      2.80us 2.34us
     gcc-10:     6.82us 3.85us 3.09us      2.80us 1.88us

     clang-3.7:  6.80us 3.74us 3.45us      2.79us 2.30us
     clang-3.8:  6.80us 3.71us 3.44us      2.79us 2.38us
     clang-4.0:  6.80us 3.70us 3.52us      2.79us 2.31us
     clang-6.0:  6.80us 3.74us 3.41us      2.79us 2.28us
     clang-7:    6.79us 3.74us 3.42us      2.78us 2.30us
     clang-8:    6.80us 3.77us 3.34us      2.79us 2.28us
     clang-9:    6.79us 3.74us 3.33us      2.81us 2.25us
     clang-10:   6.79us 3.75us 3.33us      2.79us 2.25us
    

    Same benchmarks with -O3 -march=native:

                 field_inv                 field_inv_var
                 master safgcd cmov        master safgcd

     gcc-4.7:    6.75us 3.93us 3.15us      2.80us 2.20us
     gcc-4.9:    6.77us 3.78us 3.24us      2.81us 2.45us
     gcc-5:      6.76us 3.80us 4.35us      2.80us 2.41us
     gcc-6:      6.76us 3.79us 3.10us      2.79us 2.47us
     gcc-7:      6.81us 3.79us 3.18us      2.79us 2.15us
     gcc-8:      6.80us 3.84us 4.08us      2.80us 2.19us
     gcc-9:      6.76us 3.77us 3.77us      2.80us 2.12us
     gcc-10:     6.78us 3.71us 4.34us      2.80us 1.79us

     clang-3.7:  6.82us 3.68us 3.40us      2.79us 2.33us
     clang-3.8:  6.81us 3.69us 3.43us      2.79us 2.22us
     clang-4.0:  6.81us 3.69us 3.52us      2.79us 2.16us
     clang-6.0:  6.83us 3.70us 3.39us      2.79us 2.19us
     clang-7:    6.82us 3.71us 3.42us      2.78us 2.19us
     clang-8:    6.82us 3.72us 3.33us      2.79us 2.22us
     clang-9:    6.82us 3.74us 3.32us      2.78us 2.14us
     clang-10:   6.82us 3.74us 3.32us      2.80us 2.15us
    

    @pornin’s (constant time) code takes 2.68us on the same system, though for a different (and 1 bit smaller) field.

  183. gmaxwell commented at 11:52 pm on August 25, 2020: contributor

    But this argument is pointless for our use case of secret inversions, where the input is uniformly random anyway.)

    That consideration is not irrelevant to the use-case in consensus-critical verification, which could use a constant random value generated at context creation, making blinding very cheap. … but checking the inversion can also be done for that application and is even cheaper than blinding (one element mul and a compare, vs two element multiplications).
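    For concreteness, “checking the inversion” for the verification use-case could look roughly like the following (an illustrative helper, not code from this PR; it borrows existing field function names):

     /* Sketch only: compute the inverse with the fast variable-time algorithm,
      * then verify x * x^-1 == 1 with one extra multiplication and a compare. */
     static int fe_inv_checked(secp256k1_fe *r, const secp256k1_fe *x) {
         static const secp256k1_fe one = SECP256K1_FE_CONST(0, 0, 0, 0, 0, 0, 0, 1);
         secp256k1_fe prod;
         secp256k1_fe_inv_var(r, x);
         secp256k1_fe_mul(&prod, r, x);
         return secp256k1_fe_equal_var(&prod, &one);
     }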

    I like the idea of doing both of these things for validation. I think that, plus testing with a billion random values, gives sufficient confidence that formal validation isn’t needed (though it certainly would be good).

    though I’m not aware of anybody doing these tests currently.

    I have made a few attempts; we previously got power sidechannel traces that td-linux captured. It’s just extremely difficult to work out the kinks and get traces that are clean enough to automate the analysis and use in a CI setup without getting spurious alerts. The approach is to take many different secrets, repeat each many times, then scale and compare the correlation between traces of the same secret (after adjusting for offset/amplitude error) vs the correlation of traces of different secrets, and consider the test failed if these two sets become too distinguishable: MSE(secret1_trace1, secret2_trace1) - MSE(secret1_trace1, secret1_trace2) > threshold.
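    As a rough sketch of that pass/fail criterion (illustrative only; trace capture, alignment and the threshold are assumed, not part of the library):

     #include <stddef.h>

     /* Mean squared error between two equally long, already-aligned traces. */
     static double trace_mse(const double *a, const double *b, size_t n) {
         double acc = 0.0;
         size_t i;
         for (i = 0; i < n; i++) {
             double d = a[i] - b[i];
             acc += d * d;
         }
         return acc / (double)n;
     }

     /* Flag a leak if traces of different secrets are too distinguishable from
      * repeated traces of the same secret. */
     static int traces_leak(const double *s1_t1, const double *s1_t2,
                            const double *s2_t1, size_t n, double threshold) {
         return trace_mse(s1_t1, s2_t1, n) - trace_mse(s1_t1, s1_t2, n) > threshold;
     }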

    This is a lot stronger test in some ways than actual attacks, but actual attacks require hand massaging to target specific parts of algorithms and aren’t suitable for a CI sort of thing. This will at least alert if someone makes a change that creates a new honking-big side channel.

    This approach isn’t workable though if there is any variable time component, as the misalignment will dominate the difference… even though that vartime part wouldn’t be a big impediment for an attacker who would just manually adjust their tooling to ignore it.

    I also found a cycle-accurate risc-v simulator that simulates everything including cache, branch prediction, and DRAM. I haven’t yet figured out the appropriate test harness, but I should be able to have one where it sets different secrets, flushes the caches, starts the simulation mode, and measures that it takes exactly the same number of cycles independent of the secrets. Not as sensitive as something that measures sidechannels, but it’s all software.

    It would also be broken by a variable time inverse.

    For the approach in the PR the constant time and var time inverses are essentially the same– variable time just has early termination and maybe makes some arbitrary choices among optimizations differently because the early termination works better with some choices that are slower than the constant time optimal ones.

  184. peterdettman commented at 8:29 am on August 27, 2020: contributor
    @sipa Thanks, but I generally can’t reproduce those results. With a small tweak I can get the assembler version to be about 5% faster than the current C version, but otherwise it’s not even close (at least 10% slower) to the ~8200 cycles I can get from clang for my example code (convenience re-link: https://godbolt.org/z/Mzafe7). Not sure why and can’t spare the time to investigate much for the next week or so. Perhaps you can try compiling for and/or testing on haswell too, otherwise I can’t explain the difference.
  185. JeanLucPons commented at 10:30 am on September 1, 2020: none
    Hello, I implemented a vartime divstep62 based on @pornin’s method. It ends (on average) in 6.13 divstep62 calls against 9.00 for safegcd. On my I5 (compiled with Visual Studio 2019), both methods have a similar speed. However on my old Xeon (compiled with gcc 7 -O2), pornin’s method is 1.22 times faster. https://github.com/JeanLucPons/Kangaroo/blob/e313da10d5fd3380e34c532a554ce987202d23b3/SECPK1/IntMod.cpp#L184
  186. sipa commented at 9:52 pm on September 1, 2020: contributor

    @peterdettman I’m happy with the knowledge that at least in some cases, simple asm optimizations can be used to improve things further. I don’t think we need to include any of that in this PR.

    Do you have any thoughts on next steps here? Do you want to keep experimenting with variations of the code, or do you think it’s time to get it closer to a mergeable shape? I’m happy to help with either.

  187. JeanLucPons commented at 10:35 am on September 3, 2020: none

    I compared @pornin’s method and the safegcd method (vartime) using the cycle counter (__rdtsc). My code does not have particular optimizations for a specific prime, so I tried with 2^255-19 and 2^256-0x1000003D1 and it didn’t have a significant impact.

    VS2019: I5-8500 4600 (pornin) 4680 (safegcd)

    gcc-8: Xeon X5647 6788 (pornin) 8172 (safegcd)

  188. JeanLucPons commented at 10:03 am on September 5, 2020: none

    I managed to get a significant speedup (for the vartime impl) with VS2019 by using the _tzcnt_u64() intrinsic instead of _BitScanForward64(). I also checked the usage of AVX2.

    VS2019: I5-8500 + _tzcnt_u64 + AVX2 3780 (pornin) 4532 (safegcd)

    VS2019: I5-8500 + _tzcnt_u64 + no SIMD 4150 (pornin) 4310 (safegcd)

    I cannot explain why VS2019 generates less efficient code with AVX2 for safegcd.

  189. Faster const-time divsteps cbd2d57dce
  190. peterdettman commented at 7:43 am on September 9, 2020: contributor

    Do you have any thoughts on next steps here? Do you want to keep experimenting with variations of the code, or do you think it’s time to get it closer to a mergeable shape? I’m happy to help with either. @sipa With this last commit we’re around 5-10% faster on haswell than the safegcd paper (really this time), with no asm. I think that’s a good place to leave off and focus on merging. I’d be very glad to simply hand that over to you.

  191. mratsim cross-referenced this on Sep 10, 2020 from issue Research zkSNARKS blocker: Benchmark and optimize proof time by oskarth
  192. real-or-random commented at 10:24 am on September 19, 2020: contributor
    We discussed random blinding here. I just found this talk which demonstrates a sidechannel attack on the subsequent gcd algorithm if the blinding multiplication is done over the integers (without mod): https://youtu.be/ORGCHkSInjs?list=PLeeS-3Ml-rpqyNMiXWuheOmKAgCkUaass&t=1450 It’s not directly related to the discussion here but I found it interesting.
  193. sipa cross-referenced this on Oct 12, 2020 from issue Safegcd inverses, drop Jacobi symbols, remove libgmp by sipa
  194. sipa commented at 7:39 am on October 12, 2020: contributor
    See rebased/squashed version in #831.
  195. in src/field_10x26_impl.h:1395 in cbd2d57dce outdated
    1390+    cd = (int64_t)u * di + (int64_t)v * ei;
    1391+    ce = (int64_t)q * di + (int64_t)r * ei;
    1392+
    1393+    /* Calculate the multiples of P to add, to zero the 30 bottom bits. We choose md, me
    1394+     * from the centred range [-2^29, 2^29) to keep d, e within [-2^256, 2^256). */
    1395+    md = (I30 * 4 * (int32_t)cd) >> 2;
    


    sipa commented at 4:31 am on October 14, 2020:
    Is there a point to multiplying by 4 and then shifting down again?

    peterdettman commented at 4:55 am on October 14, 2020:
    See the comment for this line: shift up/down is sign extension, so that the result is in [-2^29, 2^29].

    sipa commented at 5:00 am on October 14, 2020:
    Ah, I see!
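    To make the trick concrete, a standalone illustration (not the PR’s code; it relies on arithmetic right shift of negative values, as the surrounding code already does):

     #include <stdint.h>
     #include <stdio.h>

     int main(void) {
         int32_t cd30 = 0x3FFFFFFF;  /* some 30-bit quantity, all bits set */
         /* Shift up by 2 (into the sign bit) and arithmetically back down:
          * bit 29 is reinterpreted as the sign, giving a value in [-2^29, 2^29). */
         int32_t centred = (int32_t)((uint32_t)cd30 << 2) >> 2;
         printf("%d\n", centred);    /* prints -1 */
         return 0;
     }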
  196. in src/field_10x26_impl.h:1617 in cbd2d57dce outdated
    1612+        }
    1613+
    1614+        fn = f[len - 1];
    1615+        gn = g[len - 1];
    1616+
    1617+        cond = ((int32_t)len - 2) >> 31;
    


    sipa commented at 4:32 am on October 14, 2020:
    Am I right in thinking that (cond == 0) corresponds to (len >= 2 && (fn == 0 || fn == -1) && (gn == 0 || gn == -1))?

    peterdettman commented at 5:00 am on October 14, 2020:
    Yes. i.e. if the length is more than 1, and the top words of both f and g contain only a sign bit, then we can shorten the length and keep the sign(s) in the top bits of the new top word(s).

    peterdettman commented at 5:09 am on October 14, 2020:
    I should clarify that whilst d, e, f, g are arrays of signed values, only the top word is ever negative.

    sipa commented at 11:14 pm on October 18, 2020:
    Is there an easy argument why that’s always the case, even with a modulus that’s encoded using negative limbs?

    peterdettman commented at 3:16 am on October 19, 2020:
    I overstated it; f can be initialised with negative limbs from the modulus. The update methods only leave a negative value in the top limb. _update_fg is always called at least once.

    sipa commented at 4:49 am on October 20, 2020:
    Got it.
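    As a side note on the test being discussed: for a 32-bit word w, w ^ (w >> 31) is zero exactly when w is 0 or -1, i.e. when the limb holds nothing but copies of the sign bit. A standalone spot-check (assuming arithmetic right shift, as the surrounding code does):

     #include <assert.h>
     #include <stdint.h>

     int main(void) {
         int32_t w;
         w = 0;   assert((w ^ (w >> 31)) == 0);   /* only sign bits: shrinkable */
         w = -1;  assert((w ^ (w >> 31)) == 0);   /* only sign bits: shrinkable */
         w = 5;   assert((w ^ (w >> 31)) != 0);   /* carries real data          */
         w = -7;  assert((w ^ (w >> 31)) != 0);   /* carries real data          */
         return 0;
     }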
  197. in src/field_10x26_impl.h:1621 in cbd2d57dce outdated
    1616+
    1617+        cond = ((int32_t)len - 2) >> 31;
    1618+        cond |= fn ^ (fn >> 31);
    1619+        cond |= gn ^ (gn >> 31);
    1620+
    1621+        if (cond == 0) {
    


    sipa commented at 4:34 am on October 14, 2020:

    This clause is there so that when the top nonzero element is 0 or -1, we stuff it into bit 29 of the element below?

    How is it guaranteed that at the end this 29th bit is gone (in scalar_encode)?


    peterdettman commented at 5:05 am on October 14, 2020:
    The sign bit is stuffed into bits 30, 31 - I’m not sure where 29 comes from. Are you asking about secp256k1_fe_decode_30 (since we are in field code here)? f, g are never decoded.

    sipa commented at 11:18 pm on October 18, 2020:

    Sorry for the confusion, let me restate:

    • All the inner arithmetic supports 31 bits values (+ sign bit)
    • Generally both d/e and g/f only store 30 significant bits per limb
      • Except when the top limb of f/g are both 0 or -1, in which case it’s stuffed into the previous limb, possibly pushing that one to 31 bits.
      • d/e strictly only have 30 bits per limb, and only the top limb can be negative (otherwise the decode functions would fail).

    Analogously for the 64-bit version, with s/30/62/.


    peterdettman commented at 3:06 am on October 19, 2020:

    My bounds analysis for the update methods was based on s31 values in the matrices produced by _divsteps_30, and s30 values in d, e, f, g. The update methods currently only leave negative values in the top limb of d, e, f, g, but the modulus (and therefore initial f) might have negatives elsewhere.

    Your questions have made me realize that the dynamic shortening of f, g can result in a top limb of f or g (or both) that is s31. Probably not an issue, but it might be best to redo bounds analysis for the update methods.

  198. sipa commented at 4:34 am on October 14, 2020: contributor
    A few comments to test my understanding.
  199. real-or-random commented at 9:21 am on October 14, 2020: contributor

    A few comments to test my understanding.

    It would be nice to add code comments for all these clarifications.

  200. fanquake cross-referenced this on Oct 15, 2020 from issue configure: Allow users to explicitly enable libsecp256k1's GMP bignum support by luke-jr
  201. sipa commented at 8:23 pm on October 15, 2020: contributor
    @real-or-random Definitely.
  202. sipa cross-referenced this on Oct 17, 2020 from issue Update secp256k1 by str4d
  203. in src/scalar_4x64_impl.h:1309 in cbd2d57dce outdated
    1304+     * If the maximum bitlength of g is known to be less than 256, then eta can be set
    1305+     * initially to -(1 + (256 - maxlen(g))), and only (741 - (256 - maxlen(g))) total
    1306+     * divsteps are needed. */
    1307+    eta = -(uint64_t)1;
    1308+
    1309+    for (i = 0; i < 12; ++i) {
    


    sipa commented at 11:24 pm on October 18, 2020:

    Is there no point in taking advantage of the fact that 741 iterations is enough? (This loop looks like it supports 12*62 = 744 iterations.) @gmaxwell and I have been trying to find example inputs that need a large number of iterations (using a beam search that tries to find low bits that increase the average iteration count over many high bits that follow), but the best so far is 0x4212888e98e92e1c5384b37d0bf0655ec3bbad64381e713d6d6b0490d4858c83 (modulo the field order p), which needs 637 iterations.

    I’ve also worked on some code that uses partial evaluation and interval reasoning to split up the range of inputs that have a large bound (using the log2(sqrt(f^2+4*g^2)) formula from the paper), until a better bound for the entire range is found. If my code is correct, both the field and scalar logic only need 737 iterations at most - but proving anything better will take a long time.

    Some scratch code is here: https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea


    peterdettman commented at 3:54 am on October 19, 2020:

    For a limit of 741 I doubt there’s much to be gained by saving 3 divsteps, and I prefer the simplicity. Maybe for the 32-bit version saving 9 divsteps could be more viable. Two example approaches: 1) more code (alternate divsteps method plus extra calling code) -> careful about benchmark results neglecting e.g. the instruction cache impact, or branch prediction slot. 2) pass the iteration count to divsteps -> careful not to break compiler’s ability to unroll divsteps loop (or handcoded asm).

    I do hope that mathematical developments could reduce the max iterations. Checking the limit for the specific prime could gain a few bits as you say. It’s unclear to me whether the worst-case input(s) can be generated, but it would be wonderful if “bad” inputs could be cheaply characterised (maybe based on something like the Hamming weight in NAF), then cheaply “fixed” in constant-time (something simple like multiplication by a small constant), so that a new limit would apply.


    sipa commented at 6:39 pm on October 20, 2020:

    If we call b(x,y) = ⌊(49/17) log2(√(x^2 + 4y^2))⌋, with appropriate corrections for small inputs, then I believe these bounds are correct:

    • δ = 1: iter ≤ b(|f|,|g|) [from the paper]
    • δ < 1: iter ≤ 1 - δ + b(|f|, ((2^(1-δ) - 1)|f| + |g|) / 2^(1-δ)) [δ will only go up for the next 1-δ steps, and in each, the g argument is either halved or averaged with f; after that many steps, this formula is an upper bound (corresponding to always averaging)]
    • δ > 1: iter ≤ 1 + δ + b(|g|, |g| + (|f| + |g|)/2^(1+δ)) [by assuming the next step will have an odd g and using the previous rule, but correcting for the fact that more δ-incrementing even-g steps can happen in between]

    With those rules, it seems like I’ve proven that no input (for either field or scalar order modulus) ever needs more than 736 iterations, using the code above (which finds a partition of the input range based on fixing low bits, and uses partial evaluation + the bounds above to find a bound <= 736 for each).


    sipa commented at 2:28 am on October 25, 2020:

    Update: we’ve proven that no inputs for either field or scalar order need more than 735 iterations. We could do 734 but I don’t think it’s worth the effort, as by extrapolation it would take around 16 years of CPU time (which is doable, but needs some effort to distribute the work across network nodes etc), and even if proven, that would make the computation hard to repeat. The ≤736 bound takes a couple of minutes to prove with the (somewhat involved) C++ code, and a few hours in Python.

    Bound   Proof steps for field   Proof steps for scalar
    ≤741    1                       1
    ≤740    21                      21
    ≤739    85                      85
    ≤738    17835                   17835
    ≤737    6753276                 6753276
    ≤736    2879322241              2879322241
    ≤735    1216906236685           1216976994782

    gmaxwell commented at 3:04 am on October 25, 2020:

    The ≤735 proof takes 80 minutes on a 128 core host.

    I wouldn’t mind doing the ≤734, even without networking code, if it would actually be even slightly useful, but it’ll take me about a month realtime that way (for each of field and scalar; I could do it in 4 or 5 days with a networked version). ≤733, by extrapolation, would be about 7000 CPU-years and reasonably beyond my ability to do, but it could be done (e.g. with the help of an actual supercomputer).

    I don’t think ≤732 is currently provable using all the king’s cycles and all the king’s mems, at least not using these techniques.


    sipa commented at 8:38 am on October 25, 2020:

    I don’t think ≤732 is currently provable using all the king’s cycles and all the king’s mems, at least not using these techniques.

    Surely all the king’s men can muster 1M c6g.16xlarge instances for a week (~ 350M USD).


    elichai commented at 9:15 am on October 25, 2020:
    @sipa curious to see the code used to prove this

    sipa commented at 5:21 pm on October 25, 2020:

    @elichai https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea

    • fgcd.cpp is a heuristic search for large inputs
    • fgcd_recurse.{py,cpp} are provers for a maximum bound
  204. Rework _update_de I/O bounds 85da7a9e4d
  205. Rework _update_de for 32bit c9b7717827
  206. in src/scalar_4x64_impl.h:990 in cbd2d57dce outdated
    985+    r->d[2] = r2;
    986+    r->d[3] = r3;
    987+
    988+    secp256k1_scalar_reduce(r, secp256k1_scalar_check_overflow(r));
    989+
    990+    secp256k1_scalar_add(&u, r, &SECP256K1_SCALAR_NEG_TWO_POW_256);
    


    sipa commented at 8:06 am on October 29, 2020:
    I’m wondering if there is a good reason to do this in the scalar/field representation rather than the signed30/62 representation? There could be a single addition chain that negates (based on the sign of f[0]) and (conditionally) adds the modulus if the computed inverse is negative.

    peterdettman commented at 8:16 am on October 29, 2020:
    No good reason - I do it as part of the inverse in the BouncyCastle implementation. As written that requires D to be in [-P, P), which I believe is correct, but so far in secp256k1 I only assumed [-2^256,2^256).

    sipa commented at 1:14 am on October 30, 2020:

    I had a hard time verifying those bounds, so I wrote a (hopefully faithful) reimplementation of the algorithm here in Python, only replacing the limb decomposition with native integers (which are arbitrary length in Python): https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-fgcd-py

    Based on the choice of the limb size and moduli, better bounds may hold, but if I make no assumptions (allowing any modulus, and any limb size from 1 bit up to however many bits the modulus is), I can only verify that the “d” value at the end of the divsteps iterations is in [-2P, P). Care to share your reasoning for why it’d be tighter than that?


    sipa commented at 2:30 am on October 30, 2020:

    Hmm, when I restrict it to even limb sizes of at least 6 bits, [-P,P) seems to hold.

    Scratch that, inverse of 19934 modulo 21163, with 6 bit limbs, gives d=-23436.


    peterdettman commented at 2:53 am on October 30, 2020:
    It was a while ago now… I recall doing some exhaustive testing for small values and finding an asymptote for the worst-case at around +/- 5*P/6 (actually just a bit larger), but I’m foggy on the details. We could certainly use more analysis here and it’s probably best to go with [-2P, P) for the moment if you can prove that.

    peterdettman commented at 2:59 am on October 30, 2020:
    BTW, we are only actually interested in limbs of 30+ bits, right? Does your test show a clear trend for the worst-case as the limb size increases?

    sipa commented at 3:43 am on October 30, 2020:

    Yes, we only care about limbs of 30 or 62 bits in practice, but for exhaustive tests with small values it seemed better to also look at the effect of other/smaller limb sizes.

    And indeed, there is somewhat of a downtrend. Odd limb sizes seem to be slightly worse than even ones. For even limb sizes >= 8 I can’t find inputs that result in d outside of [-P, P). Though, this may be just because the search space is larger, and while such inputs are rarer, they don’t actually stop existing.


    sipa commented at 4:19 am on October 30, 2020:

    After iterating over all moduli up to 33301, and all inputs in [0, modulus): “iterations” is the number of limb-sized multi-divsteps that were needed to reach g=0. The table entries are the ratio -d/m, for the lowest d seen.

    limb bits 1 iteration 2 iterations 3 iterations 4 iterations 5 iterations 6 iterations 7 iterations max
    1 0.333333 0.600000 0.714286 1.153846 1.294118 1.988523
    2 0.428571 0.727273 0.894737 1.073171 1.033708 1.179104 1.273476
    3 0.333333 0.714286 0.742268 1.228916 0.893238 1.273476 1.238157 1.353553
    4 0.428571 0.727273 0.821782 0.931232 0.948829 1.255086 1.271807 1.271807
    5 0.790698 0.872404 1.138999 0.961705 1.274614 0.935210 1.274614
    6 0.210526 0.790698 0.908851 0.940290 1.107404 0.919449 0.755455 1.107404
    7 0.232558 0.794521 0.904947 1.181315 0.940741 0.849245 1.181315
    8 0.821782 0.914713 0.934919 0.863095 0.934919
    9 0.821656 0.918805 1.165892 0.731644 1.165892
    10 0.248021 0.830424 0.925502 0.857774 0.925502
    11 0.830892 0.922294 0.850083 0.922294
    12 0.249510 0.828776 0.926862 0.731644 0.926862
    13 0.249755 0.832723 0.926357 0.926357
    14 0.832906 0.862055 0.862055
    15 0.832906 0.845425 0.845425
    16 0.249969 0.833227 0.834568 0.834568

    peterdettman commented at 6:05 am on October 30, 2020:
    I guess we need to look at (find) the worst-case matrix (4x4x30bits) M and some fixed-point X so that update_de_(X, X, M) leaves D, E <= X on output. Maybe an interesting first question is: why is there an asymptote (seemingly) at ~5/6 - can we check behaviour for “infinite” limb size?

    sipa commented at 6:57 am on October 30, 2020:
    I’m a bit surprised that the “1 iteration” column has entries for some and not others. That implies that the number of iterations needed for a given modulus/input combination depends on the limb size (more than just rounding up to the next limbsize multiple). Is that possible?

    peterdettman commented at 8:25 am on October 30, 2020:
    It doesn’t sound right, no. If (1 iteration of) n divsteps sends g to 0, then g will be 0 forevermore. So the first iteration of (n + 1) divsteps must also send g to 0.

    sipa commented at 2:00 am on October 31, 2020:

    I was wrong. In a new version I’m verifying that the number of big steps is exactly (n+l-1)/l, where n is the number of steps needed with limb size 1 bit, and l the limb size.

    I’ve run more numbers, but I’m unconvinced there is much we can be certain about. “groupsize” is the limb size in what follows (as I don’t really have limbs, but bits are still processed in groups):

    All prime moduli up to 110947:

    groupsize=1: any=-2.055085..1.000000 end=-1.993407..0.422482
    groupsize=2: any=-1.491790..1.044510 end=-1.397171..0.803614
    groupsize=3: any=-1.552709..1.170796 end=-1.369628..0.946939
    groupsize=4: any=-1.374500..1.106852 end=-1.274800..0.878979
    groupsize=5: any=-1.448697..1.071038 end=-1.274614..0.888029
    groupsize=6: any=-1.253247..1.066610 end=-1.249308..0.848182
    groupsize=7: any=-1.242349..0.985827 end=-1.181315..0.883249
    groupsize=8: any=-1.235770..0.991852 end=-0.935896..0.848735
    groupsize=9: any=-1.228235..0.989970 end=-1.176310..0.884596
    groupsize=10: any=-1.038560..0.995537 end=-0.927074..0.846394
    groupsize=11: any=-0.998062..0.983944 end=-0.922294..0.848562
    groupsize=12: any=-0.994568..0.982599 end=-0.927107..0.846037
    groupsize=13: any=-0.993722..0.986827 end=-0.928188..0.847768
    groupsize=14: any=-0.972259..0.967022 end=-0.926749..0.845945
    groupsize=15: any=-0.976746..0.975281 end=-0.899866..0.836040
    groupsize=16: any=-0.947529..0.899753 end=-0.862346..0.833838

    10% of prime moduli up to 349507:

    groupsize=1: any=-2.051133..0.995659 end=-1.995117..0.421344
    groupsize=2: any=-1.492478..1.031685 end=-1.396499..0.803922
    groupsize=3: any=-1.558648..1.170698 end=-1.371851..0.948689
    groupsize=4: any=-1.381529..1.077632 end=-1.274312..0.879803
    groupsize=5: any=-1.502681..1.078220 end=-1.272138..0.888213
    groupsize=6: any=-1.313554..1.083599 end=-1.250655..0.857256
    groupsize=7: any=-1.249881..1.005733 end=-1.249881..0.884644
    groupsize=8: any=-1.230708..0.989838 end=-0.958233..0.849045
    groupsize=9: any=-1.219221..0.992666 end=-1.176310..0.883021
    groupsize=10: any=-1.142527..0.994789 end=-0.930372..0.848842
    groupsize=11: any=-1.166794..0.989934 end=-1.166794..0.882834
    groupsize=12: any=-0.995544..0.988289 end=-0.926862..0.836935
    groupsize=13: any=-0.993748..0.988289 end=-0.915995..0.845237
    groupsize=14: any=-0.984945..0.975717 end=-0.926749..0.839390
    groupsize=15: any=-0.995626..0.975281 end=-0.925214..0.847219
    groupsize=16: any=-0.987976..0.985138 end=-0.927226..0.836313

    1% of prime moduli up to 1131973:

    groupsize=1: any=-2.052378..0.993162 end=-1.994926..0.422179
    groupsize=2: any=-1.494688..1.033515 end=-1.396643..0.802153
    groupsize=3: any=-1.562146..1.169104 end=-1.371607..0.946862
    groupsize=4: any=-1.378231..1.114226 end=-1.273925..0.879529
    groupsize=5: any=-1.475837..1.083105 end=-1.271721..0.888214
    groupsize=6: any=-1.295239..1.047611 end=-1.254773..0.848775
    groupsize=7: any=-1.330274..0.986786 end=-1.250550..0.885003
    groupsize=8: any=-1.248294..0.991616 end=-1.248294..0.853758
    groupsize=9: any=-1.228581..0.995544 end=-1.176310..0.872221
    groupsize=10: any=-1.177726..0.992804 end=-0.925239..0.848618
    groupsize=11: any=-1.167367..0.987836 end=-1.167367..0.883213
    groupsize=12: any=-0.998108..0.990108 end=-0.928186..0.841764
    groupsize=13: any=-1.150210..0.977471 end=-1.150210..0.849172
    groupsize=14: any=-0.993729..0.977471 end=-0.911387..0.839390
    groupsize=15: any=-0.985101..0.970516 end=-0.913996..0.835745
    groupsize=16: any=-0.957962..0.979279 end=-0.922496..0.842599

    0.1% of prime moduli up to 3464117:

    groupsize=1: any=-2.050473..1.003521 end=-1.989695..0.421928
    groupsize=2: any=-1.497404..1.043420 end=-1.396404..0.804456
    groupsize=3: any=-1.564505..1.162865 end=-1.371172..0.948631
    groupsize=4: any=-1.377905..1.079789 end=-1.274672..0.879191
    groupsize=5: any=-1.496653..1.110779 end=-1.271000..0.887810
    groupsize=6: any=-1.297679..1.077534 end=-1.251278..0.848526
    groupsize=7: any=-1.300602..1.021607 end=-1.248854..0.884289
    groupsize=8: any=-1.248034..1.005337 end=-1.248034..0.848858
    groupsize=9: any=-1.228581..0.995544 end=-1.176310..0.848870
    groupsize=10: any=-1.207779..0.995366 end=-0.929434..0.847704
    groupsize=11: any=-1.153446..0.990005 end=-1.153446..0.882861
    groupsize=12: any=-0.996765..0.989516 end=-0.927898..0.847193
    groupsize=13: any=-1.163140..0.987821 end=-1.163140..0.883305
    groupsize=14: any=-0.995132..0.984791 end=-0.893015..0.835079
    groupsize=15: any=-0.995132..0.967950 end=-0.907344..0.834382
    groupsize=16: any=-0.984029..0.961038 end=-0.906420..0.834617

    0.01% of moduli up to 11039339:

    groupsize=1: any=-2.051940..0.996391 end=-1.996305..0.421290
    groupsize=2: any=-1.499505..1.051901 end=-1.396416..0.802318
    groupsize=3: any=-1.561363..1.166551 end=-1.368984..0.948631
    groupsize=4: any=-1.383586..1.091238 end=-1.274289..0.880152
    groupsize=5: any=-1.481463..1.085957 end=-1.260109..0.888202
    groupsize=6: any=-1.299034..1.060646 end=-1.251278..0.857331
    groupsize=7: any=-1.260355..0.992207 end=-1.250235..0.883328
    groupsize=8: any=-1.231910..0.992368 end=-0.956577..0.848064
    groupsize=9: any=-1.216104..0.991101 end=-1.199865..0.884526
    groupsize=10: any=-1.225255..0.993749 end=-0.934115..0.847749
    groupsize=11: any=-1.148017..0.991232 end=-0.924975..0.846513
    groupsize=12: any=-1.018515..0.987497 end=-0.927898..0.846874
    groupsize=13: any=-1.160253..0.983809 end=-1.160253..0.882765
    groupsize=14: any=-1.003507..0.986345 end=-0.921619..0.844235
    groupsize=15: any=-0.996078..0.971918 end=-0.916266..0.836446
    groupsize=16: any=-0.993275..0.965574 end=-0.866950..0.834224

    0.001% of prime moduli up to 28747003:

    groupsize=1: any=-2.053982..0.996391 end=-1.992506..0.421290
    groupsize=2: any=-1.493148..1.041476 end=-1.395867..0.800439
    groupsize=3: any=-1.556455..1.154373 end=-1.369605..0.945930
    groupsize=4: any=-1.384990..1.072617 end=-1.272547..0.879611
    groupsize=5: any=-1.470865..1.075154 end=-1.270969..0.887815
    groupsize=6: any=-1.260852..1.065870 end=-1.251268..0.848581
    groupsize=7: any=-1.313924..1.017953 end=-1.250235..0.883669
    groupsize=8: any=-1.246336..0.992604 end=-0.963704..0.848223
    groupsize=9: any=-1.241720..0.992726 end=-1.241720..0.884562
    groupsize=10: any=-1.183389..0.992054 end=-0.928432..0.845884
    groupsize=11: any=-1.185953..0.986327 end=-0.927732..0.853218
    groupsize=12: any=-1.090875..0.986342 end=-0.927898..0.843522
    groupsize=13: any=-1.000580..0.977187 end=-0.914065..0.843360
    groupsize=14: any=-0.995132..0.989678 end=-0.928494..0.846937
    groupsize=15: any=-1.164714..0.963789 end=-1.164714..0.881029
    groupsize=16: any=-0.992100..0.975921 end=-0.914312..0.835900

    0.0001% of prime moduli up to 87960151:

    groupsize=1: any=-2.052892..0.993336 end=-1.995083..0.422630
    groupsize=2: any=-1.495467..1.047303 end=-1.396386..0.795697
    groupsize=3: any=-1.555671..1.163660 end=-1.371843..0.946223
    groupsize=4: any=-1.382615..1.119554 end=-1.272758..0.878755
    groupsize=5: any=-1.475855..1.092415 end=-1.259864..0.887520
    groupsize=6: any=-1.319167..1.058888 end=-1.249887..0.848697
    groupsize=7: any=-1.313924..1.010592 end=-1.250286..0.884678
    groupsize=8: any=-1.231136..0.991234 end=-0.939565..0.848686
    groupsize=9: any=-1.200750..0.989698 end=-1.166432..0.884064
    groupsize=10: any=-1.227392..0.989845 end=-0.937704..0.847799
    groupsize=11: any=-1.153800..0.988057 end=-1.153800..0.846207
    groupsize=12: any=-1.088226..0.991479 end=-0.924114..0.846646
    groupsize=13: any=-0.997867..0.988057 end=-0.914374..0.845121
    groupsize=14: any=-0.997096..0.980230 end=-0.916255..0.836382
    groupsize=15: any=-1.157445..0.989845 end=-1.157445..0.842492
    groupsize=16: any=-0.994511..0.975921 end=-0.914883..0.843662

    The any=-a..b indicates the range of d and e over the modulus (a=min(d/m, e/m), b=max(d/m, e/m)) during the execution (after any big step, including the last one). The end=-a..b indicates the range of d (not e) after the last big step. For small values there appears to be a convergence to -5/6..5/6 for the end range, but this is far less clear for larger moduli. I believe it may be a property that only holds as long as the modulus is small compared to the group size.

    The only thing that seems easy to prove is that if max(abs(d),abs(e)) <= α·modulus before a big step, then max(abs(d'),abs(e')) <= (α+1/2)·modulus, due to the fact that all entries in the 2^-groupsize·t matrix are between -1 and 1, and the sum and difference of any row in that matrix is also between -1 and 1. But that only bounds the final abs(d) below 7·modulus (on 64 bit) and 13.5·modulus (on 32 bit).
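    Spelling out the arithmetic behind those 7 and 13.5 figures (same assumptions as above: at most half a modulus added per big step, starting from (d,e) = (0,1) so α_0 ≤ 1):

     \alpha_{k+1} \le \alpha_k + \tfrac{1}{2}, \qquad \alpha_0 \le 1
     \alpha_{12} \le 1 + 12 \cdot \tfrac{1}{2} = 7 \quad \text{(64-bit: 12 big steps)}
     \alpha_{25} \le 1 + 25 \cdot \tfrac{1}{2} = 13.5 \quad \text{(32-bit: 25 big steps)}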


    peterdettman commented at 3:43 am on October 31, 2020:

    D, E begin as (0,1) and are then multiplied by some number of n-divstep matrices. The cases we are most interested in are those where the total number of divsteps (ignoring any “0” steps at the end) is an exact multiple of n (groupsize), so we can simplify this to just: start with (D,E) = (0,1) and find worst-case matrices (I mean worst-case using the actual possible divsteps, not a vague upper bound). It should be either a specific calculable matrix depending only on n, or at most a cycle of 2 or 4 (since there are signs and 2 variables D, E). Then just iterate applying the matrix (matrices) to D,E (in this way you can easily study very large inputs too). If for some experiment you want the initial F, G, you can just invert the matrices and apply to F,G = (1,0) or (-1,0).

    EDIT: I suppose the first iteration is special and more tricky to determine a worst-case for.


    peterdettman commented at 4:34 am on October 31, 2020:
    I guess it’s more complicated than that because you don’t have the modulus initially, so you’d have to produce the non-modular D,E, chain first, then back-calculate the modulus, then run it again with update_DE, etc.

    sipa commented at 2:55 am on November 1, 2020:

    I’m not convinced there is a tractable way of finding the actual worst-case matrices. If there was, the bounds analysis in the paper would be significantly simpler.

    This is the best I’ve got, showing that d is in range [-0x16000000180000006000000180000006000000180000006000000180000006000, 0x15ffffff67ffffff9ffffffe7ffffff9ffffffe7ffffff9ffffffe7ffffffa000] (for 25 30-bit iterations) and in range [-0x16000000000000001800000000000000600000000000000180000000000000040, 0x15ffffffffffffff67ffffffffffffff9ffffffffffffffe7ffffffffffffff80] (for 12 62-bit iterations):

    • Start with the ((at least one odd g since start of big step, delta) -> list of (d,e) pairs) map {(False, 1): [(0, 1)]}.
    • Iterate 25 times:
      • Iterate 30 times:
        • Apply the divstep rule (ignoring the division by 2) on the input map, both taking the “g is even” branch and the “g is odd” branch (which depends on delta>0 or not), for every input (d,e), constructing a new map. This is the equivalent of doing 1 bit’s worth of divsteps, and immediately applying the resulting t matrix to d,e.
        • Minimize the list of points in every map entry by computing its convex hull.
      • Apply the “add multiple of modulus and shift down by 30 bits” logic from update_de, taking into account that [-2^29, 2^29) times the modulus is added, except for d when no odd g was ever encountered (which would just have been multiplied by 2^30 in that case, and will thus certainly still be a multiple of the modulus after shifting down by 30 steps).
    • Find the minimal and maximal d in any list produced anywhere above.

    Code is here: https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-hull-py . It’s somewhat surprising to me that the convex hull of the images of (0, 1) after 62 divsteps is tractable to compute, but the result looks plausible.


    peterdettman commented at 4:03 am on November 1, 2020:
    How about I modify update_DE so that if u*D and v*E (resp. q*D and r*E) have the same sign then we force the sign of md (resp. me) to be the opposite?

    sipa commented at 4:21 am on November 1, 2020:
    I’ll think about that… but is that easy/cheap to do? You may need to scan for the first non-zero limb to find the sign; doing that (especially in constant time) may not be that much cheaper than an addition chain to add/subtract a modulus.

    peterdettman commented at 4:22 am on November 1, 2020:
    It’s cheap, the sign is always correct in the top limb for D, E. EDIT: They are in 2’s complement basically, on a 2^30 radix to avoid overflow issues in the updates and so that the implicit shift is free.

    peterdettman commented at 5:56 am on November 1, 2020:
    BTW, I’ve kind of accepted [-2P, 2P) for the moment. Just looking for a sign-based fixup for update_de; the ones I’ve tried so far actually give larger worst-case values in random trials, but perhaps they allow a locally-reasoned proof.

    sipa commented at 3:05 am on November 3, 2020:

    It seems this convex hull approximation technique also works for bounding the number of divsteps needed! https://gist.github.com/sipa/5736b83903336a1e6f3ccdeaa4cfbfea#file-hull-bound-py

    If the code is correct (and it looks like it is for sufficiently small inputs), it proves that for any 256-bit modulus M and input in [0..M), no more than 723 divsteps are needed. Still not enough to reduce our number of big step iterations in either 32 or 64 bit, but perhaps there are other gains?

    EDIT: if the input is restricted to 0..MODULUS/2, 721 is enough. So close to only needing 24 iterations in 32 bit mode.


    peterdettman commented at 3:56 pm on November 10, 2020:
    I’ve pushed some changes that should address this issue of the size constraints on D, E. See comments in _update_de method(s). I’ve only done 64bit, but the pattern should be clear. This latest commit also does the final normalization within the 62-bit format.
  207. peterdettman commented at 6:34 am on November 12, 2020: contributor
    @sipa Just so we don’t duplicate efforts: I am working on porting this into your new PR.
  208. sipa commented at 7:12 am on November 12, 2020: contributor
    @peterdettman Cool, thanks! I was planning on doing so myself soon, but hadn’t started.
  209. real-or-random cross-referenced this on Jan 27, 2021 from issue Use blinded instead of constant time inverse in secret gej->ge? by gmaxwell
  210. real-or-random cross-referenced this on Jan 28, 2021 from issue Rescale initial point before every ecmult_gen by real-or-random
  211. sipa commented at 8:05 am on January 30, 2021: contributor
    Closing in favor of #831.
  212. sipa closed this on Jan 30, 2021

  213. sipa referenced this in commit 26de4dfeb1 on Mar 18, 2021
  214. sipa cross-referenced this on Apr 2, 2021 from issue Update libsecp256k1 subtree to latest master by sipa
  215. sipa cross-referenced this on Apr 4, 2021 from issue Safegcd-based modular inverses in MuHash3072 by sipa
  216. laanwj referenced this in commit 359f72105b on Jun 7, 2021
  217. UdjinM6 referenced this in commit bc61867454 on Aug 10, 2021
  218. 5tefan referenced this in commit dd45c616b6 on Aug 12, 2021
  219. JeanLucPons cross-referenced this on Jan 19, 2022 from issue DRS62 by ViktYusk
  220. mratsim cross-referenced this on Feb 8, 2022 from issue Fast modular inversion by mratsim
  221. mratsim cross-referenced this on Feb 9, 2022 from issue Implement the modular inverse using unsigned 256-bit integers addition and shifts by k06a
