Implement batch inversion of field elements

peterdettman commented at 11:39 am on May 19, 2014: contributor

Adds secp256k1_fe_inv_all and secp256k1_fe_inv_all_var methods for batch inversion of field elements, implemented using “Montgomery’s Trick”, which uses only one actual field inversion, trading off the others for 3 field multiplications each.

Includes randomised tests covering the new methods and also coverage for the existing inv/inv_var methods.

sipa commented at 6:10 pm on May 19, 2014: contributor

Code looks good, but is there are use case for this? I’d like to avoid code bloat if not necessary.

peterdettman commented at 2:01 pm on May 20, 2014: contributor

It’s useful to convert a batch of points from jacobian to affine coordinates with a single field inversion, e.g. the common case of precomputing points for a scalar multiplication in jacobian, then converting the entire table to affine (the scalar point multiplication of course being faster when one of the points is given in affine coordinates). Once we have a batch gej->ge method, secp256k1_ecmult_table_precomp_ge and secp256k1_ecmult_table_precomp_gej become somewhat redundant since a single pre comp method can have the best of both worlds.

Okeya and Sakurai discuss a more intricate variation in their paper “Fast Multi-scalar Multiplication Methods on Elliptic Curves with Precomputation Strategy Using Montgomery Trick” (couldn’t immediately find a link but I can forward you a copy if interested). They take advantage of batch inversion to perform multiple affine point additions in parallel (using the basic affine addition formulae directly, but with the inversion shared).

Less significantly (depending how slow inversion actually is), one can think of batching multiple ECDSA operations so as to be able to share the inversion required by the final jacobian -> affine transformation. This would just use the same batch gej->ge method from above.

Just leave this pull request open for the moment if you like, and I’ll add some of that higher-level functionality to this branch, then you can weigh up the benefits once there’s something useful to show.

sipa commented at 6:02 pm on May 23, 2014: contributor

Interesting. That may actually provide a measurable speedup, if we do both the A and G precomputed multiples in affine coordinates. It may or may not weigh up to the cost of an extra inversion (which, IIRC, is around 10% of the total signature validation time when usin GMP + x86_64_asm field code).

Thinking a bit further, one reason for using GMP is exactly because its modular inversion is so fast. If we can reduce the importance of inversions (for batch validation), perhaps it becomes doable to use a more naive self-implemented inversion.

peterdettman commented at 12:49 pm on May 25, 2014: contributor

The changes are all in this branch now, so we can do some performance analysis. The G precomputation is an obvious win.

1000 * start/stop fe/ge/ecmult (–with-field=64bit): Before: 0m30.472s After: 0m6.301s

A very rough stab at the cost of an inversion then: 12M + 4S + I ~= 5 * (12M + 4S + 3M), since each I in building the table is replaced by 3M. Using S = 0.8M, then I ~ 76M. If S = 0.6M, then I ~ 73M. Probably should write a little benchmark to get an accurate value.

To get some idea on the A precomputation, I give results for “bench”. Note that to give a fair comparison I’m comparing to the branch from https://github.com/bitcoin/secp256k1/pull/20, since this multi-inv branch has the equivalent optimisation.

bench (./configure –enable-endomorphism=no –with-field=64bit): Before: 1m36.579s After: 1m35.329s

bench (./configure –enable-endomorphism=yes –with-field=64bit): Before: 1m11.284s After: 1m12.136s

Unfortunately, I am not able to test the 64bit_asm field implementation (on OSX, for some reason the asm methods don’t link correctly for me, perhaps something to do with the ABI), so I would certainly appreciate someone checking the 64bit_asm field, although it seems predictable it will make the inversion even more costly relatively speaking.

So, it’s not there yet for A, which sort of surprises me. Back of the envelope: with WINDOW_A == 5, there are 8 precomputed points. To convert the jacobian points to affine costs an extra 7 * 3M + I. Using affine values in the table should save 1S + 4M for each actual point addition in the main loop, so with I ~= 75M, the conversion should pay for itself after about 20 point additions - but there are over 40 point additions (for the A part) in a typical secp256k1_ecmult! I think my estimate of I based on the G tests above must be much too low. There ought to be a measurable saving when I < ~160M!!

I think I will next look at a modular inversion implementation, since there appears to be interest in removing use of GMP anyway? In BouncyCastle we use Algorithm 12 from the paper “Software Implementation of the NIST Elliptic Curves Over Prime Fields” (Brown, Hankerson, Lopez, Menezes), called there “Binary inversion”, so I’ll take a stab at writing a version of that to work with the secp256k1_num_t type and we’ll see how it stacks up. In BC, I ~= 100M for this curve.

peterdettman commented at 1:54 pm on May 25, 2014: contributor

Correction: the cost for converting the table is actually 7 * (6M + S) + I (I was ignoring the cost of actually adjusting x/y after the batch inversion). So a batch inversion for the A precomp should pay off around I <~ 140M (but apparently doesn’t).

Revisiting the G calculation: 12M + 4S + I ~= 5 * (12M + 4S + 6M + S). So I is ~ 90 to 95 depending on the cost of S.

There’s still a bit of a gap between theory and results!

peterdettman commented at 1:24 pm on June 3, 2014: contributor

Urgh, the G calculation once again (accounting that the existing version uses mixed additions): 8M + 3S + I ~= 5 * (12M + 4S + 6M + S) So that gives I ~100M, which is a fairly common estimate in the literature.

I have been measuring the A case some more and I think the problem is actually that secp256k1_fe_gej_add_ge isn’t as fast (relative to …gej_add) as it should be. The first thing that jumped out is a redundant normalize call for ‘u1’. With that removed and the latest normalize method changes, this branch is now faster (i.e. with a batch inversion to get all affine precomputed points) even with the endomorphism enabled.

sipa commented at 8:41 pm on June 3, 2014: contributor

Can you rebase?

I benchmarked –disable-endomorphism –with-bignum=gmp –with-field=64bit_asm:

Before: 115s
After: 112s

When benchmarking –with-bignum=openssl however, it seems slower. That seems reasonable, as the field inversion we use with OpenSSL’s bignum is significantly slower than GMP’s, so trading one extra inversion for the batch code for slightly faster additions may not be worth it.

gmaxwell commented at 8:46 pm on June 3, 2014: contributor

Perhaps we should be looking for a internal inversion, might be able to do one faster than GMP from specializing for our field?

sipa commented at 8:48 pm on June 3, 2014: contributor

We have an internal one, which is faster than OpenSSL’s, but slower than GMP’s (even with the overhead of converting between formats). It’s only used when GMP is not available.

gmaxwell commented at 8:52 pm on June 3, 2014: contributor

Oh. I … somehow missed that.

sipa commented at 9:04 pm on June 3, 2014: contributor

To clarify, the internal inversion code is a simple static (and constant time) exponentiation ladder (the current version of which was even contributed by Peter). GMP uses some extgcd based algorithm, I believe (which is variable time, and has dynamic memory overhead).

Implementing our own variable time inversion that competes with GMP’s would be nice.

peterdettman commented at 1:43 am on June 4, 2014: contributor

@gmaxwell A new inversion method is on my todo list. I have in mind to port across the basic alg we use in BouncyCastle which is an extgcd variant, using only shifts and adds. The special prime form for secp256k1 could be helpful, see e.g. http://pdf.aminer.org/000/087/131/low_power_elliptic_curve_cryptography_using_scaled_modular_arithmetic.pdf for ideas. @sipa Note that it is a simple matter to blind an inversion with a random fe multiplication as an alternative to constant time.

gmaxwell commented at 1:47 am on June 4, 2014: contributor

Consuming randomness is a little unfortunate from an interface/portability perspective (e.g. using the library on a small embedded device which may lack strong sources of randomness). If the performance was close I’d prefer not to… obviously we don’t need to blind anything that isn’t operating on secret data, however.

peterdettman commented at 6:25 am on June 4, 2014: contributor

@gmaxwell Somewhat off-topic… but unfortunately it’s not only secret data that warrants blinding! Differential side-channel analysis can make use of the fact that known data is being operated on; notably this includes the basepoint G. One possible countermeasure is point-blinding, which on secp256k1 (with curve parameter a=0) is very straightforward to implement and has low runtime cost, and because there are no points with x=0 or y=0, safe.

I do appreciate there are issues around consuming randomness, but ironically it’s those small embedded devices that have the most need for DPA countermeasures.

I have some prototype code for point blinding already (via random curve isomorphism: see e.g. http://joye.site88.net/papers/TJ10coordblind.pdf); perhaps we can pick up the randomness discussion once I clean that up into a pull request?

gmaxwell commented at 6:51 am on June 4, 2014: contributor

@peterdettman I’m not sure if you’ve tried actually measuring the power side-channel on small devices, but it seems hopeless. Even in code that is constant time can have drastically different power usage, even copying memory will leak data in some cases. I think we’d mostly be hoping to achieve timing sidechannel freeness.

I’m not sure if we’re on the same page wrt sidechannels, though. For example, Verification is not a secret operation in the applications we’ve cared about so far, so I do not care about side-channel immunity in verification. I do, however, care about it in multiplication with the generator in pubkey generation and in signing, and in the secret addition in signing. e.g. any process which handles secret data in some manner. — Though there are perhaps some nice non-bitcoin applications for this curve that might want better secrecy for other operations, so I suppose it would be nice if it were cheap.

Blinding for the multiplies is straight forward though has a performance penalty (at least implemented in the simple way), and in signing we can both take a slowdown and can yield deterministic randomness as we already need to so for the nonce. So I’m not too concerned there, but in verification it’s another matter.

Sipa and I had already talked about using point blinding in pubkey generation for robustness reasons, e.g. computing blinded and non-blinding and making sure we get the same key. (This is a concern because a miscalculation of a public key in bitcoin can cause devastating irreversible losses)

peterdettman commented at 5:29 am on June 25, 2014: contributor

@sipa Writing a modular inversion competitive with GMP turns out to be fairly challenging (read: it will take me a while longer). If I remove the part of this PR that uses batch inversion in the A precomp (and just use it for the G precomp), do you think it could be merged in that state?

sipa commented at 4:57 pm on July 8, 2014: contributor

@peterdettman that sounds fine to me.

peterdettman commented at 12:07 pm on July 16, 2014: contributor

Patch reduced to only use the batch conversion in the G precomp, and rebased.

Just to summarise the situation:

Precomp A can also be converted to affine if inversion is fast enough, which it isn’t for bignum=openssl (at least for a single operation).
It’d be nice to have an internal implementation of (non-constant-time) inversion that was competitive with GMP. It’s probably the biggest remaining obstacle to removing the GMP dependency, and the above openssl issue would become moot.
In the higher level API, we could now look at batch ECDSA operations, that can share inversions for converting the results to affine (and where applicable for converting the precomp tables to affine).

sipa commented at 2:47 pm on July 16, 2014: contributor

Code changes look good.

Please use CHECK() instead of assert() in tests.

Use batch inversion in G precomputation f16be77ffc

peterdettman commented at 1:08 pm on July 17, 2014: contributor

Changed to use CHECK().

sipa merged this on Jul 17, 2014

sipa closed this on Jul 17, 2014

sipa referenced this in commit 5e53856862 on Jul 17, 2014

peterdettman deleted the branch on Jul 18, 2014

practicalswift cross-referenced this on Jan 21, 2021 from issue UB in tests: Invalid pointer comparison in secp256k1_fe_inv_all_var by practicalswift

Implement batch inversion of field elements #16