Use the extened euclidian algorithm to compute scalar inverse in variable time

deadalnix commented at 1:54 am on March 29, 2020: none

This is faster than the multiple exponentiation method that is used now. It remains way slower than libgmp, so probably not a full replacement, but it provides a nice speedup when libgmp is not available.

These are the benchmark time I measure.

scalar_inverse: min 14.4us / avg 16.4us / max 20.0us scalar_inverse_var: min 11.5us / avg 11.6us / max 11.6us

While there is a fair bag of trick that is used to make this fast, there is still a lot of untapped potential. For instance, doing numerous additions before reducing x0 and x1. While more could be done, this is already an improvement, so it is time for a PR.

deadalnix force-pushed on Mar 29, 2020

elichai commented at 7:20 am on March 29, 2020: contributor

This seem like a simpler approach than #290 even though it’s slower and doesn’t have the jacobi implementation, but the bignum approach is quite hard to implement correctly.

(haven’t reviewed the actual logic here)

deadalnix commented at 3:14 pm on March 29, 2020: none

I wasn’t aware of #290 , this is unfortunate. A few comments.

The approach is slightly different. This uses an approach with no divisions - technically, divisions by powers of 2 only, but these can be performed as simple shifts and additions while #290 tries to limit the number of bigint divisions using lehmer’s optimization. I do not know which is best and it is hard to compare because the implementations details are quite different.

I was considering using a 5x64bits or 9x32bits big int for x0, x1 and negx going forward, just like #290. A lot of time is spent keeping them between 0 and p, but we could do it only once at the end. Everything else do not really needs big int. This is the next logical step here.

On a more high level note, I think proceeding is small steps is preferable. It is easier to review, avoids rebase hell and delivers value to users now. In addition, it allows for better fan out down the road, as it is possible to work and experiment with the various optimisations and variations discussed independently on top of a common base.

Implementing jacobi, further optimizing this and making it work with field element in addition to scalars is on my todo list, but I think it is better to agree on some approach before investing that much effort into this.

real-or-random commented at 6:11 pm on March 30, 2020: contributor

Let me just note for now that I haven’t had a closer look so far but this looks very interesting. Is this the algorithm described in pseudocode here? https://crypto.stackexchange.com/a/54623/12020 A pseudocode variant will make it easier to review this.

deadalnix commented at 6:51 pm on March 30, 2020: none

Not quite, but it is similar. The way i approached is starting with algorithm 2.23 from “Guide to Elliptic Curve Cryptography” by Hankerson, Menezes and Vanstone for which I added a few optimizations:

a can only be even early on, and at that time, x1 is always zero, so no need to double it.
when b is even, they accumulate more factors 2 in x0 and x1, but I eliminate them right away - it is worth it due to the optimized division by a power of 2 routine.

Pseudocode is as follow:

 0inverse a:
 1  b = p
 2  x0 = 1
 3  x1 = 0
 4  k = 0
 5
 6  while a is even:
 7    a /= 2
 8    k++
 9
10  while a > 1:
11    b -= a
12    x1 -= x0
13    while b is even:
14      b /= 2
15      x1 = x1 / 2 if x1 is even else (x1 + p) / 2
16    if a > b:
17      swap a, b
18      swap x0, x1
19
20  repeat k times:
21    x0 = x0 / 2 if x0 is even else (x0 + p) / 2
22
23  return x0

This is the hist of it, but there are various optimizations done along the way.

deadalnix force-pushed on Apr 15, 2020

Add the capability to divide scalars by a power of 2. b0ae4ed30a

Add secp256k1_scalar_cmp_var to compare scalars

Not adding it to scalar.h because it is not meant to be a public API. Scalars do not have ordering per se, but this is meant to be able to implement the extended euclidian algorithm in order to compute inverse in variable time effisciently.

5abdeff8cf

Move secp256k1_scalar_is_even to make things clearer 90557613e1

Implement the extended euclidian algorithm to compute the inverse of scalars in variable time.

This is faster than the exponentiation method. It is still significantly slower than GMP, so we keep that option as well.

0a4741a461

deadalnix force-pushed on Apr 15, 2020

deadalnix commented at 5:05 pm on April 15, 2020: none

I updated secp256k1_scalar_pow2_div to add zero instea of skipping the fixup with a branch as it turns out it is faster.

mratsim commented at 5:32 pm on April 15, 2020: none

FYI, EEA can be made constant-time with very reasonable speed (for a constant-time impl, it would still be an order of magnitude slower than variable time GCD like in GMP) following the algorithm 5 from paper: https://link.springer.com/content/pdf/10.1007%2F978-3-642-40588-4_10.pdf

Implementation in C are available in GMP (mirror) and Nettle, in C++ in Botan and in Nim in my own library Constantine

There is a lengthy discussion on constant-time modular inverse in Botan repo: https://github.com/randombit/botan/issues/1479

In particular there is discussion on a recent paper by Bernstein and Yang that claims a very fast constant time GCD and inversion (https://eprint.iacr.org/2019/266).

deadalnix commented at 1:13 am on April 16, 2020: none

That’s interesting, but considering the benchmark I have, I highly doubt this can be made faster that the exponentiation technique that is used today.

mratsim commented at 9:18 am on April 16, 2020: none

The Bernstein-Yang paper claims that their GCD-based inversion is faster than using Little Fermat on Curve25519. Similarly, the Möller algorithm for inversion is faster than the addition-chains from secp256k1 in my own benchmark though I use a generic Montgomery mul/square algorithm and the difference is about 20% so faster/specialized mul/square will probably change the balance). Anyway if there is a dedicated thread on constant-time inversion I would be happy to discuss there instead of in this variable time EEA thread.

roconnor-blockstream cross-referenced this on Apr 21, 2020 from issue Scalarinversevar by roconnor-blockstream

roconnor-blockstream commented at 3:29 pm on April 21, 2020: contributor

I’ve made a PR with some suggested amendments @ https://github.com/deadalnix/secp256k1/pull/1

https://github.com/roconnor-blockstream/secp256k1/commit/2446b1e79bfb739c58fc181d2b83cafdbcf707e3 replaces the scalar_complement and scalar_binadd with a scalar_sub function. I’ve been told that my scalar_sub function has managed to coax the compiler into emitting a SBB (subtract with borrow) instruction which is very good. I haven’t verified that fact myself yet. I believe this substitution is an objective improvement.

I suppose there is an argument over which style is better, using subtraction or addition with complement as is done in scalar_negate. My opinion is that subtraction is marginally better and arguably the scalar_negate should be using this subtraction technique.

https://github.com/roconnor-blockstream/secp256k1/commit/13e07abfa17d4f52515a3361cf42c54387eb360d revises the main secp256k1_scalar_eea_inverse a little. My main goal there is to remove the use of goto. My reasons are a little selfish. The Verifiable C project cannot reason about C code with gotos, and I would like to be able to do a formal verification of this code at some point in the future. That said, to date I’ve only done a very limited about of verification of libsecp256k1 using Verifiable C, only enough to evaluate the tool. Otherwise this would be the only use of goto in the entire codebase. I think this revision to secp256k1_scalar_eea_inverse is a subjective improvement.

That all said, it would be a huge improvement to generalize this so that it can be used to compute field inversions in addition to scalar inversions. ECDSA verification requires a single scalar inversion, and Schnorr verification requires no scalar inversions at all, whereas field inversion occurs in a moderate number of places during verification.

I think we can generalize this by building a static table of “remainders”, as the lookup table found in secp256k1_scalar_pow2_div, for both the field order and group order, and pass a pointer to required table into a generalized eea_inverse function. For field inversion, we can compact the field value into the scalar type (or perhaps better, a new equivalent 256-bit type), perform the eea inversion, and copy the result back into the field limb format.

Let me know if I’ve made any errors.

deadalnix commented at 4:23 pm on April 21, 2020: none

Awesome!

I’m not sure if it is best to comment here or in your PR, I’ll do it here and we can move the discussion there if need be.

I was working myself on the next iteration of this thing and I got some more nice speedup - mostly by avoiding to reduce at every steps. Unfortunately, we stepped on each others toe a bit, but that’s okay, there are no way you’d be able to know.

You move some variable declarations to have smaller scope. This is awesome. I was uncertain about the C89 rules and don’t wanted to fight the compiler over this so I went extra large.
I’m not sure about the goto removal. I get it, goto is frowned upon, and for the most part for good reasons. But once in a while comes this situation where a goto really is better and I think we are in this case. This is because the goto effectively skips a good chunk of the first loop iteration - and in fact it can skip more. If you have a solution to achieve this without goto, then I’m taking, but it is unclear to me if this is possible at all.
Your implementation of scalar sub should be called binsub instead, because it skips the modulo reduction. It should also return the leftover so that the caller can do something with it. If nothing else, check that expectations about overflow are maintained.
Please do not remove the binadd operation. There are optimizations of secp256k1_scalar_pow2_div that can greatly benefit from it. It is also useful as a building block to avoid reduction at every step. I have code for all this, but not in a shape I think is good enough for review ATM. I can upload it anyways if you want to check.
I expect that there won’t be any major roadblock to generalize this to field element and I intend to do this if there is interest. I made the choice of going for scalar only so that I can focus on extracting all the juice from there - there is a lot more to extract. Doing everything at once would make review more difficult, as well as iterating on various ways to do things to see which is faster slower, both of which are undesirable initially.

roconnor-blockstream commented at 4:54 pm on April 21, 2020: contributor

Regarding variable scoping. I wasn’t sure either. Jonas suggested to me that it was okay to narrow the scope, so I guess we should do it unless we hear an objection.

Regarding goto: I believe my revision does no more work than your goto version. Roughly speaking the call secp256k1_scalar_negate(&v, a); was eliminated in favour of entering the loop from the top an running secp256k1_scalar_sub which does more or less the identical amount of work.

In light of your overall comments, I suggest closing my PR against your repo and I can revisit it once you have some new code. Feel free to incorporate any of my suggestions that you agree with into your next update. I have a couple of other, more questionable tricks in there, such as using *r as scratch space.

real-or-random cross-referenced this on Jul 15, 2020 from issue WIP: "safegcd" field and scalar inversion by peterdettman

roconnor-blockstream commented at 12:58 pm on March 24, 2021: contributor

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

deadalnix commented at 3:15 pm on March 24, 2021: none

Yes. Closing.

deadalnix closed this on Mar 24, 2021

Use the extened euclidian algorithm to compute scalar inverse in variable time #730