Effectively affine precomp A for _ecmult without inversion!

peterdettman commented at 9:02 am on December 21, 2014: contributor

Builds on top of https://github.com/bitcoin/secp256k1/pull/41 (Co-Z arithmetic for precomputation).
Instead of inversion, scales all precomp A points to the same Z (usually != 1).
Store the precomp A as affine by discarding the z coordinate (whilst noting the single Z value for later), which is equivalent to taking them as affine points on an isogenous curve.
In secp256k1_ecmult, we treat the accumulator (“r”) as operating on this isogenous curve.
Doubling formula needs no changes, adding of precomp A points is a simple _gej_add_ge_var.
G precomp remains the same, but each add of these requires an extra 1M to bring the accumulator temporarily back to the “default isogeny”. For simplicity here I’ve modified _gej_add_ge_var to support this.
The z value of the final result needs to be scaled (1M) also.
Optimal at WINDOW_A=5.

Anyway, the upshot of this is an ~5% performance improvement for bench_verify (64bit, endo=yes), over and above the Co-Z precomputation itself, so >8% total vs master. Rough math suggests it’s saving ~116M vs the Co-Z arithmetic PR, which appears to make any inversion approach obsolete.

Questions welcome, as I’m not sure how to explain this in a straight-forward way (as far as I know this is a novel idea). I guess it’s important to understand why the isogeny works out so neatly; it’s a rather nice property of secp256k1 that stems from it having a==0 in the curve equation; otherwise, operating on an isogenous curve would require changes to the doubling formula. I recommend e.g. http://joye.site88.net/papers/TJ10coordblind.pdf, where this is discussed in the context of blinding (which reminds me I was meaning to PR a demo of that), albeit the a==0 case is not explicitly called out. It may be easier just to think of it as playing games with the z values…

EDIT: I’ll leave it as I wrote it, but I think the above should be talking about isomorphisms rather than isogenies.

gmaxwell commented at 9:05 am on December 21, 2014: contributor

It intutively makes sense to me. This is quite awesome. I’d already thought about using an isogenous projection for blinding (having seen it mentioned in a list of blinding techinues)– but doing scalar blinding seems more interesting for most of our need for blinding, never would have thought of it for this.

Darn ECDSA: With schnorr this would make our verification completely inversion free.

peterdettman commented at 2:38 am on December 22, 2014: contributor

Perhaps a better reference: http://joye.site88.net/papers/JT01edpa.pdf @gmaxwell Im still a bit giddy that this works so well. I had the z-ratio trick and the isomorphism stuff in mind for a while, and yesterday it just gelled for me. I wasn’t quite sure how it would work out , but I coded it up and… 1M per G add? Awesome, indeed.

I actually now think it will work fine for a!=0 curves too, using “modified Jacobian” coordinates to avoid increasing the cost of doubling.

Why not both scalar and point blinding? Simple point blinding amounts to just an extra 2M per scalar mult, assuming the G precomp points have been moved to some random curve isomorphism in setup.

gmaxwell commented at 6:50 pm on December 22, 2014: contributor

Why not both scalar and point blinding?

Not unresonable, just updating the point bilinding takes a fair amount more computation and we already have the unknown-discrete-log blinding. I’m all for implementing all reasonably simple, reasonably fast countermeasures for the secret data handling cases. Though actual immunity to things like high resolution power sidechannel analysis seem hopeless on general purposes computers, thats no reason to make an attacker’s life easier.

peterdettman force-pushed on Dec 23, 2014

sipa commented at 1:42 pm on December 23, 2014: contributor

This is just awesome. It seems we’re around 8% slower than Ed25519 here with it (without using GLV, with gmp).

I’m going to play around with the math, try to see if I can derive that it’s correct :)

peterdettman commented at 2:22 am on December 24, 2014: contributor

FYI, I managed to apply this same technique to a P-521 random-base-mult implementation (http://indigo.ie/~mscott/ws521.cpp), with a similar ~8% performance boost. That curve has a==-3, which has to be scaled for the isomorphism, so to retain doubling performance (3M+5S) the formula has to be changed to a modified-Jacobian scheme. The adds (104) then cost 7M+6S instead of 11M+5S. This saving is offset by (4M+S) per precomputed point (15) to make them “co-Z”.

So we can tentatively conclude that this is applicable to any short Weierstrass form curve, with (only) a==0 being particularly efficient. I don’t know enough about other curve forms to comment yet.

The question arises whether the scheme can still be used for general (aP + bQ + cR + …), and indeed I see no fundamental obstacle, just an extra few steps involved in the precomputation to arrange for all precomp-tables to be globally co-Z.

Use Co-Z arithmetic for precomputations

- Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et. al.) added as group methods with new type sep256k1_coz_t.
- Co-Z methods used for A and G point precomputations.
- WINDOW_A size increased to 6 since the precomputation is much faster per-point.
- DBLU cost: 3M+4S, ZADDU cost: 5M+2S.
- Take advantage of z-ratios from Co-Z to speed up table inversion.

9a0ea65f8e

Demonstrate "split-Z" secp256k1_ecmult 7d3e82fe93

peterdettman force-pushed on Dec 24, 2014

sipa commented at 2:34 pm on December 27, 2014: contributor

I’m going to add some comments and unit tests on top of this, if you’re done making changes.

sipa cross-referenced this on Dec 29, 2014 from issue Co-Z + effective affine precomputation + tests by sipa

sipa commented at 0:35 am on February 13, 2015: contributor

See rebased/refactored version in #210.

sipa closed this on Feb 13, 2015

Effectively affine precomp A for _ecmult without inversion! #171