Strauss-wNAF multimul #464

pull: sipa wants to merge 2 commits into bitcoin-core:master from sipa:strauss_mult, changing 3 files (+187 −55)
  1. sipa commented at 7:44 pm on July 7, 2017: contributor

    This extends secp256k1_ecmult to support arbitrary numbers of input points.

    Features:

    • Any point can be infinity (resulting in it being ignored).
    • The G factor can be NULL.
    • Will split into batches of a maximum size and combine them.
    • Uses effective-affine addition across all inputs and to build the odd-multiples tables.
    • Supports endomorphism (TODO: figure out up to what size it is worth it).
    • Uses Strauss’ algorithm to only do 255 doublings (129 with endomorphism) in each batch.
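
    To make the batching concrete, here is a toy, self-contained sketch of the Strauss-wNAF structure described above, with plain integers standing in for group elements (so "add" is + and "double" is *2, and the result can be checked directly). It mirrors the shape of the algorithm: per-point odd-multiples tables, per-point wNAF digits, and one doubling chain shared by all points. None of the names or code below are the PR's actual internals.

        /* Toy Strauss-wNAF multi-multiplication over plain integers. */
        #include <stdio.h>

        #define WINDOW 5                    /* window size, as in WINDOW_A */
        #define TABLE  (1 << (WINDOW - 2))  /* 8 odd multiples: 1P, 3P, ..., 15P */
        #define MAXLEN 40                   /* enough wNAF digits for the toy scalars */

        /* Sliding-window signed recoding (wNAF): odd digits, |d| < 2^(WINDOW-1). */
        static int wnaf(int *digits, unsigned long k) {
            int i = 0, len = 0;
            while (k) {
                int d = 0;
                if (k & 1) {
                    d = (int)(k & ((1UL << WINDOW) - 1));
                    if (d >= (1 << (WINDOW - 1))) d -= 1 << WINDOW;
                    if (d > 0) k -= (unsigned long)d; else k += (unsigned long)(-d);
                    len = i + 1;
                }
                digits[i++] = d;
                k >>= 1;
            }
            return len;
        }

        /* r = sum(k[j] * p[j]) for up to 4 points, with one shared doubling chain. */
        static long long strauss_multi(size_t n, const unsigned long *k, const long long *p) {
            int digits[4][MAXLEN] = {{0}}, len = 0, i, m;
            long long table[4][TABLE], r = 0;
            size_t j;

            for (j = 0; j < n; j++) {
                long long twop = 2 * p[j];                 /* one "doubling"... */
                table[j][0] = p[j];
                for (m = 1; m < TABLE; m++) {
                    table[j][m] = table[j][m - 1] + twop;  /* ...and 7 "additions" per table */
                }
                m = wnaf(digits[j], k[j]);
                if (m > len) len = m;
            }
            for (i = len - 1; i >= 0; i--) {
                r *= 2;                                    /* shared doubling for this bit */
                for (j = 0; j < n; j++) {
                    int d = digits[j][i];
                    if (d > 0) r += table[j][(d - 1) / 2];
                    if (d < 0) r -= table[j][(-d - 1) / 2];
                }
            }
            return r;
        }

        int main(void) {
            unsigned long k[3] = { 123456789UL, 987654321UL, 5555555UL };
            long long p[3] = { 7, 11, 13 };
            long long want = 123456789LL * 7 + 987654321LL * 11 + 5555555LL * 13;
            printf("%lld (expect %lld)\n", strauss_multi(3, k, p), want);
            return 0;
        }
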
  2. Introduce secp256k1_ecmult_mult 94c7b48338
  3. Add multi point ecmult test 579def2d85
  4. sipa commented at 8:13 pm on July 7, 2017: contributor
    @bbuenz This implements a faster multi-multiplication, though there is no intention of exposing it in the public API (which we keep restricted to higher-level safe-by-default functionality). It is useful for building other things on top of in derived projects, though.
  5. bbuenz commented at 9:10 pm on July 7, 2017: none
    @sipa Thanks, I’ll consider forking it.
  6. gmaxwell commented at 11:48 am on July 8, 2017: contributor

    I suggested to Pieter today that we switch to using the split-G table at all times, w/ endo or without. I believe(tm) that given equal WNAF dimensions for a 2-point multiexp, split-G will be faster even when used without the endomorphism by about 1 doubling, because half the time the G term has the highest non-zero value. Making this change would make it easier for an endomorphism-enabled codebase to drop the endomorphism when it isn’t a help.

    The endo will stop being helpful very quickly… (a fact that will probably radically lower the crossover point to bos-coster when endo is in use, since endo helps bos-coster out to a very high count).

  7. peterdettman commented at 4:37 am on July 10, 2017: contributor

    @sipa Nice to see the cross-table effective-affine working, since previously I had only proved it would!

    So IIUC, this PR precomputes an odd-multiples table for each point in the batch, and uses wNAF scalar recoding, so that there’s one addition per input point per (w+1) bits (or something). I don’t know if you recall the FourQ paper Patrick Longa presented at RWC this year (https://eprint.iacr.org/2015/565.pdf). There they are using 2 endos, so for a standard scalar multiplication they end up with 4 points in a multi-exp. Instead of precomputing a table for each of those points, they make one table with 8 values, each combination of P1 with +/-P2, +/-P3, +/-P4 (negation is used if you need the -P1 combinations). The scalars are recoded (“signed-digit recoding”) so that each bit is +/-1. The main ladder then performs 1 addition per bit, assembling the +/-1 from each scalar into a table lookup.

    An idea then is to take batches and group them by 4s (or 5s?). On paper it looks like it would save quite a bit of precomputation time and memory usage, while possibly costing a few extra additions in the ladder (wNAF is sliding window?). Unfortunately this is not as easy to add/test as the current scheme since we lack various pieces (the table(s) can still be made effective-affine, at a slightly higher cost). It also might still need the current wNAF approach for “leftovers” that don’t make a full 4-or-5 (and maybe G too).
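
    To make the combined-table idea concrete, here is a toy sketch over plain integers (so the result is checkable): one 8-entry table holding P1 +/- P2 +/- P3 +/- P4, each scalar recoded so every digit is +1 or -1, and one table addition per scalar bit. The scalars are taken to be odd here; real schemes handle even scalars with a parity fix-up and recode jointly (e.g. GLV-SAC). None of this is FourQ's or this PR's actual code.

        /* Toy combined-table, signed-digit ladder over plain integers. */
        #include <stdio.h>

        #define POINTS 4
        #define BITS   32                    /* common recoding length */

        /* Recode an odd k < 2^BITS into BITS digits, each +1 or -1. */
        static void signed_digits(int *d, unsigned long k) {
            int i;
            for (i = 0; i < BITS - 1; i++) d[i] = ((k >> (i + 1)) & 1) ? 1 : -1;
            d[BITS - 1] = 1;
        }

        static long long combined_ladder(const unsigned long *k, const long long *p) {
            long long table[1 << (POINTS - 1)], r = 0;
            int d[POINTS][BITS], i, j, m;

            for (m = 0; m < (1 << (POINTS - 1)); m++) {   /* T[m] = P1 +/- P2 +/- P3 +/- P4 */
                table[m] = p[0];
                for (j = 1; j < POINTS; j++)
                    table[m] += ((m >> (j - 1)) & 1) ? p[j] : -p[j];
            }
            for (j = 0; j < POINTS; j++) signed_digits(d[j], k[j]);

            for (i = BITS - 1; i >= 0; i--) {
                int idx = 0, s = d[0][i];                 /* sign taken from the first scalar */
                for (j = 1; j < POINTS; j++)
                    if (d[j][i] == s) idx |= 1 << (j - 1);
                r = 2 * r + s * table[idx];               /* one double + one add per bit */
            }
            return r;
        }

        int main(void) {
            unsigned long k[POINTS] = { 1000003UL, 999961UL, 777777UL, 123457UL };  /* all odd */
            long long p[POINTS] = { 7, 11, 13, 17 };
            long long want = 1000003LL * 7 + 999961LL * 11 + 777777LL * 13 + 123457LL * 17;
            printf("%lld (expect %lld)\n", combined_ladder(k, p), want);
            return 0;
        }
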

  8. gmaxwell commented at 9:39 am on July 10, 2017: contributor

    Though that kind of signed digit recoding and joint table entries moves the endomorphism ahead of the precomputation, instead of after.

    IIRC… Current code, without endo, does ~42 additions per non-G point plus one double and 7 adds for the precomp, plus the multiplies for the EA conversion. G does ~15 adds. So unless I’m off on the numbers, if you would have to do 256 adds, it’ll need to be adding in more than 5 points; and the precomp cost grows. I don’t see how to get it particularly close.
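
    (Back-of-the-envelope, assuming the current WINDOW_A = 5, which is where the 8-entry odd-multiples table and the "one double and 7 adds" come from:

        ladder adds per non-G point  ~ 256/(w+1) = 256/6 ~ 42.7
        precomp per non-G point      = 1 doubling + (2^(w-2) - 1) = 1D + 7A
        one-add-per-bit ladder       ~ 256 adds, so it matches ~42.7 adds per point
                                       only once a chunk holds about 256/42.7 ~ 6 points

    which is consistent with the "more than 5 points" estimate above.)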

  9. peterdettman commented at 12:33 pm on July 10, 2017: contributor

    > Though that kind of signed digit recoding and joint table entries moves the endomorphism ahead of the precomputation, instead of after.

    I think you’d just “chunk” the points first, then build the precomp table for each chunk, then if you want to use the endomorphism, you take the (lambda*P[1..n])s as a new chunk, and get its precomp table directly from the original precomp (in the same way secp256k1_ecmult does).
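
    For reference, deriving the lambda chunk's table from the original one is cheap because on secp256k1 lambda*(x, y) = (beta*x, y): each table entry needs only one field multiplication of its x coordinate by the constant beta. A hypothetical sketch follows; the ge/fe types and the fe_mul/beta names here are placeholders, not necessarily the library's exact identifiers.

        /* Derive the table for lambda*P[1..n] from the table for P[1..n]. */
        static void derive_lambda_table(ge *lam_table, const ge *table, size_t len,
                                        const fe *beta) {
            size_t i;
            for (i = 0; i < len; i++) {
                lam_table[i] = table[i];                        /* y (and infinity) unchanged */
                fe_mul(&lam_table[i].x, &table[i].x, beta);     /* x -> beta * x */
            }
        }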

    > IIRC… Current code, without endo, does ~42 additions per non-G point plus one double and 7 adds for the precomp, plus the multiplies for the EA conversion. G does ~15 adds. So unless I’m off on the numbers, if you would have to do 256 adds, it’ll need to be adding in more than 5 points; and the precomp cost grows. I don’t see how to get it particularly close.

    I think I over-estimated the size of the precomp tables in the current code. Still, if we chunked by 6s (to get the ladder additions to about the same number), then the precomp table would be 32 points, instead of 48 (8*6 separate tables). The form of the table suggests you need 31 calculations of the form P +/- Q (producing both points); it happens that each of those is only a little more expensive than a single addition (and they’re mostly mixed additions if the input points are affine). I haven’t worked out the details, but I would guess the cost of precomputation is about the same as current code.

    I find that interesting because the sliding wNAF relies on “skipping zeroes” in the scalar (an average of 1 extra bit per window), whereas this alternative is more compatible with a constant-time multiexp. Now I’m wondering if there are multi-point versions of the Joint Sparse Form, or equivalents with wider window sizes.

  10. peterdettman commented at 1:25 pm on July 10, 2017: contributor
    I should add that the signed recoding scheme only needs one “windows” array per chunk instead of one wNAF array per point, which is a further significant memory saving (possibly it could even be calculated left-to-right on-the-fly).
  11. gmaxwell commented at 6:05 pm on July 10, 2017: contributor

    I don’t think it’s possible to do the 6 point precomp in 32 adds. Am I missing something?

    Edit: Yes, it would be 32 entries, but I don’t think you can construct it with 32 adds.

  12. peterdettman commented at 2:03 am on July 11, 2017: contributor

    Depends on how you’re counting. I’m supposing the use of an “extended” addition that calculates XA(P,Q) = (P+Q,P-Q). This needs 1M+1S more than a regular addition, but gives 2 output points.

    If C(n) is the number of XA needed to precompute the table T(n) for P[1..n], then:

        C(1) = 0
        C(n+1) = C(n) + 2^(n-1)

    i.e. T(n+1) is built from T(n) using 1XA per entry in T(n).

    So, C(n) = 0 + 1 + … + 2^(n-2) = 2^(n-1) - 1. T(6) has 32 entries, needing 31XA, which is ~ 37A in terms of cost. Or you could say it’s 62 additions, half of which are very cheap. This is better than 6D+42A for 6 wNAF, but I expect the effective-affine transformation to be a little more clunky, and the wNAF tables could eventually use coZ, so it might be a wash on costs.
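
    As a toy illustration of that recurrence (plain integers again, so XA(E, Q) = (E+Q, E-Q) is just two ordinary additions), building T(6) touches each existing entry once per newly added point, 31 XAs in total:

        /* Build T(n) = { P1 +/- P2 +/- ... +/- Pn }, 2^(n-1) entries, using
         * one "extended addition" per existing entry: 2^(n-1) - 1 XAs total. */
        #include <stdio.h>

        static void build_comb_table(long long *t, const long long *p, int n) {
            int size = 1, i, j;
            t[0] = p[0];                                  /* T(1) = { P1 } */
            for (i = 1; i < n; i++) {
                for (j = 0; j < size; j++) {              /* one XA per entry of T(i) */
                    long long e = t[j];
                    t[j]        = e + p[i];               /* E + P[i+1] */
                    t[size + j] = e - p[i];               /* E - P[i+1] */
                }
                size *= 2;                                /* |T(i+1)| = 2 * |T(i)| */
            }
        }

        int main(void) {
            long long p[6] = { 7, 11, 13, 17, 19, 23 };
            long long t[32];                              /* 2^(6-1) = 32 entries */
            int i;
            build_comb_table(t, p, 6);                    /* 1 + 2 + 4 + 8 + 16 = 31 XAs */
            for (i = 0; i < 32; i++) printf("%lld%c", t[i], i == 31 ? '\n' : ' ');
            return 0;
        }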

    Chunks of 7 might actually be optimal speed-wise, and taking into account the reduced window array usage, total memory still comes in less than for the current scheme.

  13. peterdettman commented at 2:26 am on July 11, 2017: contributor
    It appears there are also width-3 variants of the Joint Sparse Form, which is something like chunking by 2s, but giving a sparser recoding (joint Hamming weight ~0.36), with a 12-entry lookup table and a single window array (or even none) per 2 input points. Probably the same performance as the current scheme, with less memory needed.
  14. gmaxwell commented at 8:29 am on July 12, 2017: contributor
    Ah, the extended addition is what I was missing. On memory: on desktops and servers the amount needed for the tables isn’t really significant up to the point where bos-coster is faster… though smaller tables might have better cache behavior and end up faster due to that.
  15. sipa cross-referenced this on Jul 21, 2017 from issue [WIP] Aggregate signature module implementation by apoelstra
  16. in src/ecmult_impl.h:469 in 94c7b48338 outdated
    461@@ -403,4 +462,31 @@ static void secp256k1_ecmult(const secp256k1_ecmult_context *ctx, secp256k1_gej
    462     }
    463 }
    464 
    465+static void secp256k1_ecmult(const secp256k1_ecmult_context *ctx, secp256k1_gej *r, const secp256k1_gej *a, const secp256k1_scalar *na, const secp256k1_scalar *ng) {
    466+    secp256k1_ecmult_strauss_wnaf(ctx, r, 1, a, na, ng);
    467+}
    468+
    469+static void secp256k1_ecmult_multi(const secp256k1_ecmult_context *ctx, secp256k1_gej *r, size_t n, const secp256k1_gej *a, const secp256k1_scalar *na, const secp256k1_scalar *ng) {
    


    apoelstra commented at 4:05 pm on July 27, 2017:
    Can you split this function out into a separate commit so that it can easily be replaced with a dispatch function?
  17. sipa cross-referenced this on Aug 17, 2017 from issue Multi-point multiplication support by sipa
  18. sipa commented at 1:31 am on August 17, 2017: contributor
    Superseded by #473.
  19. sipa closed this on Aug 17, 2017

