Use Co-Z arithmetic for precomputations #41

pull peterdettman wants to merge 1 commit into bitcoin-core:master from peterdettman:coz-arithmetic changing 3 files +139 −21
  1. peterdettman commented at 12:51 pm on June 25, 2014: contributor
    • Selected Co-Z formulas from “Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic” (Goundar, Joye, et al.) added as group methods with new type secp256k1_coz_t.
    • Co-Z methods used for point precomputations in secp256k1_ecmult.
    • WINDOW_A size increased to 6 since the precomputation is much faster per-point.
    • DBLU cost: 3M+4S, ZADDU cost: 5M+2S.
    • From 2.4% to 3.8% faster ‘bench’ results, depending on configuration.

    (UPDATE: Technique explicitly described in https://eprint.iacr.org/2008/051)

    It’s probably best to have a read of the paper (http://joye.site88.net/papers/GJMRV11regpm.pdf) if not familiar with Co-Z arithmetic, but roughly it’s based on the observations that

    1. points with equal Z coordinate (Co-Z) can be added very cheaply.
    2. an input to a Co-Z double or add can be updated to have the same Z coordinate as the result, almost for free, allowing “chaining” of operations.

    This works out well for generating a table of 1P, (2P), 3P, 5P, etc., since 2P can be created with the same Z as 1P, then added to 1P, 3P, etc., in each case being updated to have the same Z as the previous output, which is also the next input. The total cost for a table of 8 values is then 3M+4S + 7(5M+2S) = 38M+18S, compared to 3M+4S + 7(12M+4S) = 87M+32S. Alternatively, as this PR does, a table of 16 values can be built instead for 3M+4S + 15(5M+2S) = 78M+34S, which is still cheaper than the current table generation; then ~6 point additions are saved during the scalar multiplication (72M+24S), for a total saving of 72M+24S + 87M+32S − (78M+34S) = 81M+22S, or ~100M.
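The chaining described above can be sketched concretely. Below is a hedged toy illustration of the ZADDU step over a small invented prime field (the modulus, coordinates, and helper names are made up for the example; the real code operates on secp256k1's 256-bit field with the library's field functions). Since the co-Z add and the textbook chord formula are the same rational map, the check does not even need actual curve points, just distinct x coordinates:

```python
# Toy sketch (NOT the library code): ZADDU over a small invented prime field.
P_MOD = 10007  # illustrative prime, far smaller than secp256k1's p

def inv(a):
    return pow(a, P_MOD - 2, P_MOD)  # Fermat inversion in F_p

def to_affine(x, y, z):
    """Jacobian (X, Y, Z) -> affine (X/Z^2, Y/Z^3)."""
    zi = inv(z)
    return x * zi * zi % P_MOD, y * zi * zi * zi % P_MOD

def affine_add(p, q):
    """Textbook chord addition of points with distinct x, for checking."""
    (x1, y1), (x2, y2) = p, q
    lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD

def zaddu(x1, y1, x2, y2, z):
    """Co-Z addition with update: P and Q share Z; returns P+Q together
    with P rescaled to the result's Z.  Field cost: 5M + 2S."""
    c  = (x1 - x2) * (x1 - x2) % P_MOD         # S
    w1 = x1 * c % P_MOD                        # M  (P.x rescaled)
    w2 = x2 * c % P_MOD                        # M
    d  = (y1 - y2) * (y1 - y2) % P_MOD         # S
    a1 = y1 * (w1 - w2) % P_MOD                # M  (P.y rescaled)
    x3 = (d - w1 - w2) % P_MOD
    y3 = ((y1 - y2) * (w1 - x3) - a1) % P_MOD  # M
    z3 = z * (x1 - x2) % P_MOD                 # M  (new shared Z)
    return (x3, y3), (w1, a1), z3
```

Chaining the table of odd multiples then just feeds the updated point back in as the next input, which is where the 5M+2S-per-entry cost in the totals above comes from.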

    Timing results for ‘bench’ across a variety of configurations (bignum=gmp throughout):

    Field   Endo?   Before      After       %Perf
    64bit   yes     1m10.623s   1m8.147s    +3.6%
    64bit   no      1m35.524s   1m32.660s   +3.1%
    32bit   yes     1m43.756s   1m39.973s   +3.8%
    32bit   no      2m21.871s   2m17.690s   +3.0%
    gmp     yes     1m41.278s   1m37.710s   +3.7%
    gmp     no      2m24.717s   2m21.326s   +2.4%
  2. gmaxwell commented at 1:01 am on June 27, 2014: contributor
    If this gives a fast point tripling, I seem to recall seeing some work using multi-base numbers, e.g. where a scalar is expressed as sum(2^n * x_n) + sum(3^n * y_n), and there was some fast encoding into that overcomplete basis. The result of doing so tended to render the numbers much sparser (more zeros) and thus save a lot of additions.
  3. peterdettman commented at 6:32 am on June 27, 2014: contributor
    @gmaxwell See e.g. https://eprint.iacr.org/2008/285.pdf . It might be worth trying, but it looks like the results would be marginal (vs 5-NAF, or especially the 6-NAF above). Tripling with the Co-Z formulas above is just dblu/zaddu, which I would guess corresponds pretty closely to the direct tripling formulas (http://www.hyperelliptic.org/EFD/g1p/auto-shortw-jacobian-0.html), although you get 2P as an output too.
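For reference, the double-base idea can be sketched with a simple greedy encoder. This is a hedged toy, not this PR's code nor the cited papers' exact algorithm: repeatedly subtract the largest {2,3}-integer 2^a * 3^b not exceeding the remainder.

```python
def greedy_double_base(n):
    """Greedy double-base expansion: write n >= 1 as a sum of numbers of
    the form 2**a * 3**b.  Toy illustration of the overcomplete basis."""
    terms = []
    while n > 0:
        best, p3 = 1, 1
        while p3 <= n:               # scan powers of 3 up to n
            p2 = p3
            while p2 * 2 <= n:       # pad each with powers of 2
                p2 *= 2
            best = max(best, p2)
            p3 *= 3
        terms.append(best)           # largest 2^a * 3^b <= n
        n -= best
    return terms
```

The resulting expansion is typically far sparser than the binary representation (e.g. 10 = 9 + 1), which is where the saved additions would come from; the encodings in the literature are more refined than this greedy sketch.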
  4. sipa commented at 4:08 pm on July 8, 2014: contributor
    With unconditional 2-3% performance increase, I certainly want this, but I’ll need some time to read up on the math :)
  5. peterdettman commented at 12:36 pm on July 16, 2014: contributor
    @sipa Let me know if you have any questions; I think I stuck fairly close to the algorithms as given in the main body of the paper, but perhaps a sign was switched here or there to reduce the number of _negate calls. It wouldn’t surprise me if they could yet be improved a little; I was pretty happy with the initial results and didn’t spend much time past that.
  6. peterdettman renamed this:
    Use Co-Z arithmetic for WINDOW_A precomputations
    Use Co-Z arithmetic for precomputations
    on Jul 18, 2014
  7. peterdettman commented at 4:30 am on July 18, 2014: contributor
    Patch updated to cover G precomputations also, and rebased.
  8. sipa commented at 10:02 pm on August 3, 2014: contributor
    Some unit tests for verifying co-Z vs normal addition/doubling would be nice; especially one that covers the doubling branch inside secp256k1_coz_zaddu (which currently has no coverage in the tests).
  9. sipa commented at 10:09 pm on August 3, 2014: contributor

    The code duplication between secp256k1_coz_dblu and dblu_a is painful to see. I see two solutions:

    • Change secp256k1_gej_t to be a secp256k1_coz_t + z coordinate + infinity flag.
    • Add casting from secp256k1_gej_t* to secp256k1_coz_t*

    and then instead of secp256k1_coz_dblu taking coz + gej + const gej, have it take coz + coz + z + inf + const gej.

    But let’s keep that for later.

  10. peterdettman force-pushed on Aug 20, 2014
  11. peterdettman commented at 4:47 am on August 20, 2014: contributor

    @sipa Given the current restricted usage for the coz stuff (i.e. precomputation of P, 3P, 5P… and P, 2P, 3P… sequences with P != INF), I am thinking it might be simpler for the moment to raise errors if infinities or double-via-add crop up, since they should never happen given the large order of all valid points.

    Regarding the code duplication in dblu(_a), I had pretty much the same thoughts on alternatives, but I don’t have a strong favourite amongst them (of course the duplication isn’t great either). Can we kick that can down the road?

  12. sipa commented at 4:40 pm on August 20, 2014: contributor
    @peterdettman I don’t like relying on relatively complex reasoning to show that a code path shouldn’t be reachable, even if it just contains an assert. At least having some basic testing in place that it works even when it is reached feels much more reassuring.
  13. sipa commented at 4:40 pm on August 20, 2014: contributor
    I don’t mind having the duplication for now.
  14. peterdettman commented at 2:25 pm on August 21, 2014: contributor
    @sipa OK, I’ll add some test coverage in the next few days.
  15. peterdettman force-pushed on Nov 1, 2014
  16. peterdettman force-pushed on Nov 4, 2014
  17. peterdettman force-pushed on Nov 4, 2014
  18. peterdettman force-pushed on Nov 4, 2014
  19. peterdettman force-pushed on Nov 5, 2014
  20. peterdettman force-pushed on Nov 9, 2014
  21. peterdettman force-pushed on Nov 13, 2014
  22. peterdettman force-pushed on Nov 15, 2014
  23. peterdettman force-pushed on Nov 15, 2014
  24. peterdettman force-pushed on Nov 18, 2014
  25. peterdettman force-pushed on Nov 19, 2014
  26. sipa commented at 8:12 pm on December 4, 2014: contributor
    @peterdettman The code has been changed quite a bit lately, and you’ll have to rebase again. I’m still very interested in this, but can you limit the changes to just what is necessary for runtime performance? Adding complexity to speed up the precomputation (which takes only a few milliseconds here) isn’t very interesting.
  27. peterdettman commented at 11:12 am on December 11, 2014: contributor
    @sipa Yeah, there’s a lot of changes, but I’ll try to get it rebased. I’ll keep in mind the runtime/precomputation distinction, but I’ve been assuming we could end up using an affine precomputation in secp256k1_ecmult at some point, thus the opts for secp256k1_ecmult_table_precomp_ge_var.
  28. peterdettman force-pushed on Dec 12, 2014
  29. peterdettman force-pushed on Dec 12, 2014
  30. gmaxwell commented at 10:37 am on December 12, 2014: contributor

    @peterdettman When I recently looked again at your patch I thought you actually were doing the conversion to affine for the multiply, until pieter pointed out to me that it was precomp only… so I guess that intent was clear. Have you benchmarked it with that?

    I estimate from the mul/sqr counts that gej+ge should be about 30% faster than gej+gej, assuming a square takes 0.7x as long as a mul. With endomorphism, combined costs in the multiply loop would be gej+ge 10.1 * 16 * 2 + gej+gej 14.8 * 16 * 2 + double 127 * 5.8 = 1533.4, vs 1383 using all gej+ge adds, so about 10% fewer weighted field operations in the inner loop. Given that saving the final inversion was about a 3% speedup, the affine version could reasonably be expected to be a win, especially with the known z-ratio trick. The affine points also use only 2/3rds the memory, so potentially this allows for a larger window.
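The weighted counts above can be reproduced. The 10.1 / 14.8 / 5.8 weights are consistent with 8M+3S, 12M+4S, and 3M+4S per-operation breakdowns at S = 0.7M; those breakdowns are my reconstruction, not stated in the comment:

```python
# Reproduce the weighted field-op estimate, squaring costed at 0.7x a mul.
M, S = 1.0, 0.7
add_gej_ge  = 8 * M + 3 * S     # mixed add,         weight 10.1
add_gej_gej = 12 * M + 4 * S    # full Jacobian add, weight 14.8
double      = 3 * M + 4 * S     # Jacobian doubling, weight  5.8

# Half mixed / half Jacobian adds vs. all mixed adds, with 127 doublings:
mixed_total  = add_gej_ge * 16 * 2 + add_gej_gej * 16 * 2 + double * 127
affine_total = add_gej_ge * 16 * 4 + double * 127
```

This gives 1533.4 vs 1383.0, i.e. roughly 10% fewer weighted field operations in the inner loop.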

  31. peterdettman force-pushed on Dec 12, 2014
  32. peterdettman force-pushed on Dec 13, 2014
  33. peterdettman commented at 6:27 am on December 13, 2014: contributor

    @gmaxwell Yeah, I keep a branch around that uses affine-A in _ecmult(). I measure about a 1.5% improvement for bench_verify, but:

    • That’s against the GMP inversion (unless I missed some developments); it will be hard to match its performance.
    • _gej_add_ge_var is still a little suboptimal: the weak-normalisation PR improves it relatively, and I have some code that eliminates two _negates. It’s not clear exactly how it will balance out.

    Of course, for any sort of batching, it becomes an easy win, even with the exponentiation inversion.

    Note that the optimal window size is 6 for co-z alone, but 5 with the inversion. Using affine precomp lowers the per-add cost by 4M+S, but raises the per-precomp cost by the same amount (and it was worse by 2M before I noticed the z-ratios trick). I calculate the theoretical break-even cost of inversion as ~131M (and our multiplication got faster recently :( ).

    I tried an alternative scheme where instead of scaling the precomp x,y values by zInv (which costs 3M+S), we just record the zInv. Then a modified gej_add_ge can be used that only needs 1 extra M (i.e. 9M+3S), since we can just scale the gej.z by zInv (for purposes of scaling the “ge” x, y). Yeah, that actually does work, but it still works out slower (the math has it costing just 9M more in total, optimal at w=6).

    I also tried caching the z2, z3 values for the precomp points (when using gej_add, of course), since each point is used 2-3 times, but again this was slower in practice (partly due to copying the entire struct when negating the precomp entry; even after I hacked around that, it seems the larger memory footprint was an issue), although it might bear a second attempt in case I fluffed it somehow.
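The zInv variant described two paragraphs up can be sketched as follows. This is a hedged toy over a small invented field, not the library's code; the point is that storing zi = 1/Z2 for a table entry lets a mixed-style add consume the entry still in Jacobian form, at the cost of a single extra multiplication Z = Z1*zi (the squaring and cubing of Z replace the Z1 powers the mixed add computes anyway):

```python
# Toy sketch of the "record zInv" trick; field and values are illustrative.
P_MOD = 10007

def inv(a):
    return pow(a, P_MOD - 2, P_MOD)

def to_affine(x, y, z):
    zi = inv(z)
    return x * zi * zi % P_MOD, y * zi * zi * zi % P_MOD

def affine_add(p, q):
    """Textbook chord addition, for checking only."""
    (x1, y1), (x2, y2) = p, q
    lam = (y2 - y1) * inv(x2 - x1) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return x3, (lam * (x1 - x3) - y1) % P_MOD

def add_zinv(x1, y1, z1, x2, y2, zi):
    """Add a Jacobian table point (x2, y2, .) with stored zi = 1/z2 into
    the Jacobian accumulator (x1, y1, z1)."""
    z = z1 * zi % P_MOD              # the one extra M vs. a true affine entry
    u2 = x2 * z * z % P_MOD          # = affine(Q).x * z1^2
    s2 = y2 * z * z * z % P_MOD      # = affine(Q).y * z1^3
    h = (u2 - x1) % P_MOD
    r = (s2 - y1) % P_MOD
    h2 = h * h % P_MOD
    h3 = h * h2 % P_MOD
    x3 = (r * r - h3 - 2 * x1 * h2) % P_MOD
    y3 = (r * (x1 * h2 - x3) - y1 * h3) % P_MOD
    z3 = z1 * h % P_MOD              # same result Z as the plain mixed add
    return x3, y3, z3
```

From there the h/r/x3/y3/z3 steps are identical to an ordinary Jacobian + affine addition, which is why the total overhead is just the one multiplication.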

  34. sipa commented at 2:30 am on December 14, 2014: contributor
    I benchmarked this after rebasing on top of #154 (and using its new normalization functions in the coz functions): an extra 3.5%-4% (with GMP enabled).
  35. sipa commented at 3:25 am on December 14, 2014: contributor
    Can you add unit tests for this? In test_ge there are already many additions between random and non-random points; adding comparisons with equivalent additions using the co-z functions would be nice.
  36. peterdettman commented at 8:18 am on December 14, 2014: contributor
    @sipa Just to be clear, I also measure around +3.5%-4% for this PR as is, and then a further ~1.5% if I additionally enable the affine precomp and change (back) to WINDOW_A=5. I’ll get onto the tests!
  37. peterdettman force-pushed on Dec 18, 2014
  38. peterdettman force-pushed on Dec 21, 2014
  39. peterdettman cross-referenced this on Dec 21, 2014 from issue Effectively affine precomp A for _ecmult without inversion! by peterdettman
  40. peterdettman force-pushed on Dec 23, 2014
  41. Use Co-Z arithmetic for precomputations
    - Selected Co-Z formulas from "Scalar Multiplication on Weierstraß Elliptic Curves from Co-Z Arithmetic" (Goundar, Joye, et al.) added as group methods with new type secp256k1_coz_t.
    - Co-Z methods used for A and G point precomputations.
    - WINDOW_A size increased to 6 since the precomputation is much faster per-point.
    - DBLU cost: 3M+4S, ZADDU cost: 5M+2S.
    - Take advantage of z-ratios from Co-Z to speed up table inversion.
    47a7652384
  42. peterdettman force-pushed on Dec 24, 2014
  43. peterdettman commented at 5:08 am on December 29, 2014: contributor
    I have found a (~7 year old) paper that actually describes this precomputation technique (including the z-ratios): https://eprint.iacr.org/2008/051 . They develop it slightly further (“Scheme 2”) to save 1S per precomp-point, which I’ve now replicated, but there’s enough stacked on this PR for the moment, so I’ll leave it for now.
  44. peterdettman cross-referenced this on Dec 30, 2014 from issue Co-Z + effective affine precomputation + tests by sipa
  45. sipa cross-referenced this on Feb 11, 2015 from issue Effective affine precomputation (by Peter Dettman) by sipa
  46. sipa commented at 11:20 pm on February 11, 2015: contributor
    Closing in favor of #210.
  47. sipa closed this on Feb 11, 2015

  48. sipa cross-referenced this on Feb 11, 2015 from issue Co-Z based precomputation (by Peter Dettman) by sipa
  49. peterdettman cross-referenced this on Dec 24, 2021 from issue Try a non-uniform group law (e.g., for ecmult_gen)? by real-or-random

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-22 21:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me