Add pippenger_wnaf for multi-multiplication

jonasnick commented at 2:45 pm on November 13, 2017: contributor

This PR is based on #473 and adds a variant of “Pippengers algorithm” (see Bernstein et al., Faster batch forgery identification, page 15 and https://github.com/scipr-lab/libff/pull/10) for point multi-multiplication that performs better with a large number of points than Strauss’ algorithm.

aggsig

Thanks to @sipa for providing wnaf_fixed, benchmarking, and the crucial suggestion to use affine addition.

The PR also makes ecmult_multi decide which algorithm to use, based on the number of points and the available scratch space. For restricted scratch spaces this can be further optimized in the future (f.e. a 35kB scratch space allows batches of 11 points with strauss or 95 points with pippenger; choosing pippenger would be 5% faster).

As soon as this PR has received some feedback I’ll repeat the benchmarks to determine the optimal pippenger_bucket_window with the new benchmarking code in #473.

apoelstra commented at 2:47 pm on November 13, 2017: contributor

Hooray!

Please rebase – should be straightforward, we merged only some minor things last week.

apoelstra cross-referenced this on Nov 13, 2017 from issue Multi-point multiplication support by sipa

jonasnick force-pushed on Nov 14, 2017

jonasnick commented at 10:45 am on November 14, 2017: contributor

Rebased

in src/ecmult_impl.h:783 in be3248f50a outdated

778+    } else if(bucket_window == 10) {
779+        return 17400;
780+    } else if(bucket_window == 11) {
781+        return 28600;
782+    } else if(bucket_window == PIPPENGER_MAX_BUCKET_WINDOW) {
783+        return INT_MAX;

sipa commented at 10:39 pm on November 14, 2017:

INT_MAX is not defined.

apoelstra commented at 4:52 pm on November 15, 2017:

#include <limits.h> ?

jonasnick force-pushed on Nov 16, 2017

apoelstra commented at 5:53 pm on November 16, 2017: contributor

In secp256k1_ecmult_multi_split_strauss_wnaf we immediately convert a secp256k1_ge received from the callback to a secp256k1_gej. Would be better to change the callback API to just return a gej in the first place, otherwise the caller may be forced to do an expensive and pointless gej->ge conversion.

Edit Oh, I was looking at old code. I see that Strauss does the ge->gej conversion but Pippenger does not (and Pippenger derives a speed benefit from using ge). And our two main applications, aggsig and bulletproofs, don’t need to do a gej->ge conversion anyway, at least during verification. Never mind.

sipa commented at 5:45 am on November 17, 2017: contributor

@apoelstra Yes, I believe it was a conscious choice to making the interface pass affine points because no callers need more.

in include/secp256k1.h:264 in be8abb5ad1 outdated

255@@ -243,6 +256,28 @@ SECP256K1_API void secp256k1_context_set_error_callback(
256     const void* data
257 ) SECP256K1_ARG_NONNULL(1);
258 
259+/** Create a secp256k1 scratch space object.
260+ *
261+ *  Returns: a newly created scratch space.
262+ *  Args: ctx:  an existing context object (cannot be NULL)
263+ *  In:  init_size: initial amount of memory to allocate
264+ *        max_size: maximum amount of memory to allocat

sipa commented at 10:19 pm on November 23, 2017:

Nit: allocat

in src/scratch_impl.h:16 in be8abb5ad1 outdated

 7+#ifndef _SECP256K1_SCRATCH_IMPL_H_
 8+#define _SECP256K1_SCRATCH_IMPL_H_
 9+
10+#include "scratch.h"
11+
12+#define ALIGNMENT 16

sipa commented at 10:22 pm on November 23, 2017:

Add a comment to explain that we assume that common architectures never have alignment requirements above 8 for any of the types we care about (and make it 16 because who cares about a few byes)?

sipa commented at 10:19 am on November 25, 2017:

Perhaps add a TODO that we should try to determine this at configure time (using sizeof(max_align_t), if available, for example).

in src/ecmult_impl.h:502 in 541ea3fc14 outdated

497+    if (max_points == 0) return 0;
498+    if (max_points > 160) max_points = 160; /* At this point, gains are not longer compensating for locality degradation */
499+    n_batches = (n + max_points - 1) / max_points;
500+    points_per_batch = (n + n_batches - 1) / n_batches;
501+
502+    /* Attempt to allocate sufficient space for Strauss */

sipa commented at 10:36 pm on November 23, 2017:

@apoelstra Is this halving loop still necessary? The max allocation call above should be perfect, so I think we can just assert that this never fails.

apoelstra commented at 0:30 am on December 7, 2017:

Agreed, please change this to a CHECK on the return value of secp256k1_scratch_resize.

apoelstra commented at 3:57 pm on December 7, 2017:

Oh, derp, Jonas points out that this is fixed in a later commit. Never mind.

jonasnick commented at 4:01 pm on December 7, 2017:

This is outdated by “Add pippenger_wnaf ecmult_multi”: if resize fails then return 0.

in src/ecmult_impl.h:2 in 118ad20838 outdated

0@@ -1,5 +1,5 @@
1 /**********************************************************************
2- * Copyright (c) 2013, 2014 Pieter Wuille                             *
3+ * Copyright (c) 2013, 2014, 2017 Pieter Wuille, Andrew Poelstra      *

sipa commented at 3:17 am on November 25, 2017:

This needs a “Jonas Nick”, I think.

in src/ecmult_impl.h:487 in 118ad20838 outdated

483@@ -403,4 +484,540 @@ static void secp256k1_ecmult(const secp256k1_ecmult_context *ctx, secp256k1_gej
484     }
485 }
486 
487+static void secp256k1_ecmult(const secp256k1_ecmult_context *ctx, secp256k1_gej *r, const secp256k1_gej *a, const secp256k1_scalar *na, const secp256k1_scalar *ng) {

sipa commented at 6:27 am on November 25, 2017:

At some point this should be renamed to _var as well, but let’s leave that for later.

in src/ecmult_impl.h:564 in 118ad20838 outdated

559+}
560+
561+/** Convert a number to WNAF notation.
562+ *  The number becomes represented by sum(2^{wi} * wnaf[i], i=0..WNAF_SIZE(w)+1) - return_val.
563+ *  It has the following guarantees:
564+ *  - each wnaf[i] an odd integer between -(1 << w) and (1 << w)

sipa commented at 6:31 am on November 25, 2017:

These two lines are more accurately and correctly written as each wnaf[i] is either zero or an odd integer between -(1 << w) and (1 << w).

jonasnick commented at 1:43 pm on November 27, 2017:

Specifically, do you suggest to replace

0 *  - each wnaf[i] is an odd integer between -(1 << w) and (1 << w)
1 *  - each wnaf[i] is nonzero, unless the given number is 0

with

0* wnaf[i] is either zero or an odd integer between -(1 << w) and (1 << w)

? If so, that sounds just as accurate but less helpful.

apoelstra commented at 3:07 pm on November 27, 2017:

To be clear: is it possible for a nonzero number to have 0 wnaf digits in our representation? I believe the answer is yes, though for wnaf_const (for example) it is not, so this is an important distinction.

sipa commented at 7:32 pm on November 27, 2017:

@jonasnick The first statement “each wnaf[i] is an odd integer between -(1 « w) and (1 « w)” is just incorrect. 0 is not an odd integer, and 0 is possible in case the input is 0.

jonasnick commented at 10:47 pm on November 27, 2017:

Oh, of course 0 is even. But I believe the second statement should stay because at least empirically it doesn’t happen that a nonzero number has 0 wnaf digits (we test for this, copied from wnaf_const). @sipa can you confirm?

sipa commented at 2:10 am on November 28, 2017:

I don’t care that it doesn’t happen often. The first sentence is stated as a condition which is always true, and it is not. You could say “There result satisfies one of the two following conditions:” and leave the two in. But leaving them as-is is just wrong.

in src/ecmult_impl.h:719 in 118ad20838 outdated

708+ * Returns optimal bucket_window (number of bits of a scalar represented by a
709+ * set of buckets) for a given number of points.
710+ */
711+static int secp256k1_pippenger_bucket_window(size_t n) {
712+#ifdef USE_ENDOMORPHISM
713+    if (n <= 4) {

sipa commented at 6:40 am on November 25, 2017:

Maybe write this in a tree structure rather than linear (as in: top level is is a bisection for first half/second half) etc.

jonasnick commented at 2:00 pm on November 27, 2017:

What’s the goal of this?

sipa commented at 7:30 pm on November 27, 2017:

Only having a logarithmic number of conditionals on average.

jonasnick commented at 2:43 pm on November 30, 2017:

While being better in the average case I think this is worse for small batches (where this code has more impact compared to larger batches) and looks ugly. Maybe that’s something for another PR.

sipa commented at 11:49 pm on December 4, 2017:

Fair enough.

in src/ecmult_impl.h:767 in 118ad20838 outdated

762+#endif
763+}
764+
765+static size_t secp256k1_pippenger_bucket_window_inv(int bucket_window) {
766+#ifdef USE_ENDOMORPHISM
767+    if(bucket_window == 1) {

sipa commented at 6:41 am on November 25, 2017:

Nit: space after if.

sipa commented at 7:00 am on November 25, 2017:

I think this can be written more efficiently using a switch case.

in src/ecmult_impl.h:692 in 118ad20838 outdated

688+                secp256k1_ge_neg(&tmp, &pt[point_state.input_pos]);
689+                secp256k1_gej_add_ge_var(&buckets[idx], &buckets[idx], &tmp, NULL);
690+            }
691+        }
692+        secp256k1_gej_set_infinity(&running_sum);
693+        secp256k1_gej_set_infinity(&walking_sum);

sipa commented at 6:51 am on November 25, 2017:

Do you need a separate variable for walking_sum? Can’t you reuse r for this?

jonasnick commented at 1:58 pm on November 27, 2017:

I don’t see how. If r stores the result of previous iterations it can’t be used in the addition ladder.

in src/ecmult_impl.h:961 in 118ad20838 outdated

956+        n_points = space_for_points/entry_size;
957+        n_points = n_points > max_points ? max_points : n_points;
958+        if (n_points > res) {
959+            res = n_points;
960+        }
961+        if(n_points < max_points) {

sipa commented at 6:55 am on November 25, 2017:

Nit: space after if.

in src/ecmult_impl.h:999 in 118ad20838 outdated

994+        max_points = ECMULT_MAX_POINTS_PER_BATCH;
995+    }
996+    n_batches = (n+max_points-1)/max_points;
997+    n_batch_points = (n+n_batches-1)/n_batches;
998+
999+    if(n_batch_points >= ECMULT_PIPPENGER_THRESHOLD) {

sipa commented at 6:56 am on November 25, 2017:

Nit: space after if.

in src/ecmult_impl.h:1014 in 118ad20838 outdated

1009+    }
1010+    for(i = 0; i < n_batches; i++) {
1011+        size_t nbp = n < n_batch_points ? n : n_batch_points;
1012+        size_t offset = n_batch_points*i;
1013+        secp256k1_gej tmp;
1014+        if(!(*f)(ctx, scratch, error_callback, &tmp, i == 0 ? inp_g_sc : NULL, cb, cbdata, nbp, offset)) {

sipa commented at 6:56 am on November 25, 2017:

Nit: you can just use f(...) instead of (*f)(...).

in src/ecmult_impl.h:1008 in 118ad20838 outdated

1003+        if (max_points == 0) {
1004+            return 0;
1005+        }
1006+        n_batches = (n+max_points-1)/max_points;
1007+        n_batch_points = (n+n_batches-1)/n_batches;
1008+        f = &secp256k1_ecmult_strauss_batch;

sipa commented at 6:57 am on November 25, 2017:

Nit: ampersand is optional here.

in src/scratch.h:11 in 118ad20838 outdated

 7+#ifndef _SECP256K1_SCRATCH_
 8+#define _SECP256K1_SCRATCH_
 9+#include <sys/types.h>
10+
11+/* The typedef is used internally; the struct name is used in the public API
12+ * (where it is exposed as a different typedef) */

sipa commented at 10:16 am on November 25, 2017:

The typedef and struct name are the same publicly; the struct is just not defined there.

apoelstra commented at 3:02 pm on November 27, 2017:

I’m not sure what you mean, the typedef here is secp256k1_scratch, the public one is secp256k1_scratch_space.

sipa commented at 7:38 pm on November 27, 2017:

Oops, ignore me!

jonasnick force-pushed on Nov 30, 2017

jonasnick commented at 3:16 pm on November 30, 2017: contributor

Pushed fixes

peterdettman commented at 8:01 pm on November 30, 2017: contributor

@jonasnick I opened a PR here: https://github.com/jonasnick/secp256k1/pull/1 for suggested changes to save some additions in each _pippenger_wnaf window.

jonasnick commented at 4:46 pm on December 1, 2017: contributor

As expected, @peterdettman delivers. I see about 2% to 4% speedup. Merged the commit into this PR.

jonasnick force-pushed on Dec 1, 2017

sipa commented at 11:54 pm on December 4, 2017: contributor

utACK c7723997941576dd03bb1db77dad8d5dc6255924 tree 6f9376c5ebdfe589593fabde10c5d61f5d07ebcc

Please squash fixes.

jonasnick force-pushed on Dec 6, 2017

yeastplume referenced this in commit 9eb90c9fd9 on Dec 6, 2017

yeastplume cross-referenced this on Dec 6, 2017 from issue Aggsig work to date - merging by yeastplume

jonasnick force-pushed on Dec 6, 2017

jonasnick commented at 4:38 pm on December 6, 2017: contributor

Squashed fixes and updated bucket windows. Previous bucket windows did not not include peterdettman’s optimization, did not use bench_ecmult and had a bug where the number of points for a bucket window were twice of what they should have been. The result is that the performance without endomorphism is very similar, with endomorphism it’s up to 8% faster for some number of points. The performance graph looks very similar to the one above.

sipa commented at 11:56 pm on December 6, 2017: contributor

ACK 50f0779be9c8d67fc5c8ca3a7a21b17da74cfa33

in src/scratch.h:9 in 56fb2ccbe9 outdated

0@@ -0,0 +1,35 @@
1+/**********************************************************************
2+ * Copyright (c) 2017 Andrew Poelstra	                              *
3+ * Distributed under the MIT software license, see the accompanying   *
4+ * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
5+ **********************************************************************/
6+
7+#ifndef _SECP256K1_SCRATCH_
8+#define _SECP256K1_SCRATCH_
9+#include <sys/types.h>

apoelstra commented at 0:10 am on December 7, 2017:

We can remove this #include. There are two places that ssize_t are used in ecmult_impl.h which need to be removed. The first, ssize_t max_alloc = can be changed to size_t because it is assigned a size_t anyway (we changed secp256k1_scratch_max_allocation to not return a sign value).

The second use is ssize_t space_for_points. This can also be made into a size_t. The check on line 938, if (space_for_points < 0), will need to be rearranged to compare the two things that are subtracted on line 937 to get space_for_points.

in src/scratch.h:16 in 56fb2ccbe9 outdated

12+ * (where it is exposed as a different typedef) */
13+typedef struct secp256k1_scratch_space_struct {
14+    void *data;
15+    size_t offset;
16+    size_t init_size;
17+    size_t max_size;

apoelstra commented at 0:13 am on December 7, 2017:

Can we add an error_callback pointer to this structure and remove it from the arguments of secp256k1_scratch_resize? It would avoid the need to pass error callbacks into a bunch of higher level functions.

in src/ecmult.h:35 in da61b77821 outdated

29@@ -28,4 +30,9 @@ static int secp256k1_ecmult_context_is_built(const secp256k1_ecmult_context *ctx
30 /** Double multiply: R = na*A + ng*G */
31 static void secp256k1_ecmult(const secp256k1_ecmult_context *ctx, secp256k1_gej *r, const secp256k1_gej *a, const secp256k1_scalar *na, const secp256k1_scalar *ng);
32 
33+typedef int (secp256k1_ecmult_multi_callback)(secp256k1_scalar *sc, secp256k1_ge *pt, size_t idx, void *data);
34+
35+/** Multi-multiply: R = inp_g_sc * G + sum_i ni * Ai. */

apoelstra commented at 0:29 am on December 7, 2017:

Update comment to say that this function will clobber the scratch space that gets passed in? I’ve nearly been burned by that in the Bulletproofs code a couple times.

in src/tests.c:2702 in deabf0d88f outdated

2697+    memset(&sc[0], 0, sizeof(sc[0]));
2698+    CHECK(secp256k1_ecmult_multi(&ctx->ecmult_ctx, scratch, &ctx->error_callback, &r, &szero, ecmult_multi_callback, &data, 20));
2699+    memset(&sc[1], 0, sizeof(sc[0]));
2700+    memset(&sc[2], 0, sizeof(sc[0]));
2701+    memset(&sc[3], 0, sizeof(sc[0]));
2702+    memset(&sc[4], 0, sizeof(sc[0]));

apoelstra commented at 0:45 am on December 7, 2017:

Can we use secp256k1_scalar_clear instead of memsetting?

jonasnick force-pushed on Dec 7, 2017

jonasnick commented at 3:49 pm on December 7, 2017: contributor

Pushed fixes

in src/ecmult_impl.h:869 in 396615f1b5 outdated

907+static int secp256k1_ecmult_pippenger_batch_single(const secp256k1_ecmult_context *actx, secp256k1_scratch *scratch, const secp256k1_callback* error_callback, secp256k1_gej *r, const secp256k1_scalar *inp_g_sc, secp256k1_ecmult_multi_callback cb, void *cbdata, size_t n) {
908+    return secp256k1_ecmult_pippenger_batch(actx, scratch, error_callback, r, inp_g_sc, cb, cbdata, n, 0);
909+}
910+
911+#define MAX_BATCH_SIZE 1024
912+typedef int (*secp256k1_ecmult_multi_func)(const secp256k1_ecmult_context*, secp256k1_scratch*, const secp256k1_callback*, secp256k1_gej*, const secp256k1_scalar*, secp256k1_ecmult_multi_callback cb, void*, size_t);

apoelstra commented at 4:29 pm on December 7, 2017:

I’d like this definition and the two wrapper functions to be moved to src/ecmult.h rather than the _impl file. @sipa do you have thoughts on that? These are used (only) in bench_ecmult.c and tests.c.

apoelstra commented at 5:00 pm on December 7, 2017: contributor

ACK https://github.com/bitcoin-core/secp256k1/pull/486/commits/5979a573829b166f05ea8e03747bad9225451919 .. please squash. The only nit is the position of the secp256k1_ecmult_multi_func typedef, which I don’t care about that much.

add resizeable scratch space API

Alignment support by Pieter Wuille.

548de42ecf

Generalize Strauss to support multiple points

API by Andrew Poelstra.

8c1c831bdb

Add ecmult_multi tests dba5471b69

Add bench_ecmult bc65aa794e

Add pippenger_wnaf ecmult_multi 355a38f113

Use scratch space dependent batching in ecmult_multi 36b22c9337

Add flags for choosing algorithm in ecmult_multi benchmark a58f543f5a

Save some additions per window in _pippenger_wnaf 4c950bbeaf

Use more precise pippenger bucket windows d2f9c6b5dc

jonasnick force-pushed on Dec 7, 2017

jonasnick commented at 8:13 pm on December 7, 2017: contributor

Squashed

sipa merged this on Dec 8, 2017

sipa closed this on Dec 8, 2017

sipa referenced this in commit c77fc08597 on Dec 8, 2017

str4d cross-referenced this on Dec 11, 2017 from issue Merge current secp256k1 subtree by str4d

anargle cross-referenced this on May 29, 2018 from issue Kernel half-aggregation by anargle

Add pippenger_wnaf for multi-multiplication #486