sipa
commented at 2:44 pm on December 30, 2021:
contributor

It occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.

peterdettman
commented at 3:07 pm on December 30, 2021:
contributor

It occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.

The scalar halving also ensures that the low bit (the shifted-away bit) is 0.

sipa
commented at 3:12 pm on December 30, 2021:
contributor

I don’t think that matters? Even/odd has no special meaning when working modulo a prime.

The PR currently uses the bits of scalar (input + 2^COMB_BITS - 1 - blind) * 2^-1 to select multiples of G to add together.

My suggestion is that instead it could use the bits of (input + 2^COMB_BITS - 1 - blind) to select multiples of G/2 to add together.

sipa
commented at 3:33 pm on December 30, 2021:
contributor

It seems I’m wrong, but I’m confused why!

peterdettman
commented at 3:35 pm on December 30, 2021:
contributor

I think confusion is creeping in since after the halving the scalar isn’t a scalar value anymore; it’s in signed-digit form, which can only represent an odd value. In particular the bits of a scalar s in signed-digit form represent the scalar value 2*s+1. I think you should also be careful not to reason modulo the order in signed-digit form.

Edit: Actually, even that might not be quite complete since the high bit is being treated specially here.

sipa
commented at 3:43 pm on December 30, 2021:
contributor

It works, but you need to use the bits of input - blind + (2^COMB_BITS - 1)/2 instead. That’s what you get when you substitute 2*(input-blind) for e in the formula in the paper (because now we’re trying to compute input*G = 2*(input-blind)*(G/2) + blind*G).

peterdettman
commented at 3:56 pm on December 30, 2021:
contributor

Oh I see, using double the input and the blind (of G/2) lets the halving be moved to precomputation. Nice.

in
src/ecmult_gen_impl.h:59
in
4a4d16e83c (outdated)

83+ * The value (2^COMB_BITS - 1 - b) is precomputed as ctx->scalar_offset, and bG is
84+ * precomputed as ctx->final_point_add. Thus recoded can be written as
85+ * recoded = (gn + scalar_offset)/2, and R becomes the sum of (2*bit_i-1)*2^i*G
86+ * values plus final_point_add. */
87+
88+ /* Compute (gn + 2^COMB_BITS - 1 - b)/2 value as a scalar. */

peterdettman
commented at 4:16 pm on December 30, 2021:

sipa
commented at 5:08 pm on December 30, 2021:
contributor

Updated to use the avoid-halving-scalar trick.

sipa force-pushed
on Dec 30, 2021

in
src/ecmult_gen.h:92
in
a7f0d0e5de (outdated)

94- /* Blinding values used when computing (n-b)G + bG. */
95- secp256k1_scalar blind; /* -b */
96- secp256k1_gej initial; /* bG */
97+ /* Blinding values used when computing nG as (n-b)G + bG. */
98+ secp256k1_scalar scalar_offset; /* -b */
99+ secp256k1_ge final_point_add; /* bG */

peterdettman
commented at 5:03 am on December 31, 2021:

These comments are all a bit off. I would suggest a short explanation here and possibly defer to the fuller explanations in ecmult_gen_impl.h. e.g.:

Values chosen such that n*G == comb(n + scalar_offset, G) + final_point_add. The latter expression lets us use scalar blinding and optimize the comb precomputation.

Done, renamed it to ge_offset across all commits. f, final1, final2 were renamed to p, p1, p2 respectively.

in
configure.ac:310
in
a7f0d0e5de (outdated)

305@@ -307,20 +306,31 @@ case $set_ecmult_window in
306 esac
307
308 # Set ecmult gen precision
309-if test x"$req_ecmult_gen_precision" = x"auto"; then
310- set_ecmult_gen_precision=4
311+if test x"$req_ecmult_gen_kb" = x"auto"; then
312+ set_ecmult_gen_kb=86

peterdettman
commented at 6:40 am on December 31, 2021:

I’m a little bit of the mind that 22 should be the default, but perhaps that’s my old brain stuck in 64kB L1 world. Still, would it be reasonable to default to 22 for 32bit?

real-or-random
commented at 10:43 am on December 31, 2021:

With #929, it may be a good idea to handle the logic in the preprocessor in the case of “auto”.

If you agree, I think it’s ok to either do it in this PR or postpone it to a separate PR that also takes care of all the other settings/parameters (which we should do anyway).

I’m not convinced about what the defaults should be, but given that ecmult already defaults to 1 MiB precomputed table, I see little harm in picking an 86 KiB one here.

I agree we can easily revisit defaults later (especially in combination with #929).

I changed the configure-driver default to also be 22 kB; we can revisit the defaults later.

in
src/ecmult_gen.h:73
in
a7f0d0e5de (outdated)

70+#endif
71+
72+/* The remaining COMB_* parameters are derived values, don't modify these. */
73+/* - The distance between the teeth of the comb. */
74+#define COMB_SPACING ((255 + COMB_BLOCKS * COMB_TEETH) / (COMB_BLOCKS * COMB_TEETH))
75+/* - The number of bits covered by all the combs; must be at least 256. */

peterdettman
commented at 6:46 am on December 31, 2021:

68+#if !(1 <= COMB_TEETH && COMB_TEETH <= 8)
69+# error "COMB_TEETH must be in the range [1, 8]"
70+#endif
71+
72+/* The remaining COMB_* parameters are derived values, don't modify these. */
73+/* - The distance between the teeth of the comb. */

peterdettman
commented at 6:47 am on December 31, 2021:

peterdettman
commented at 7:06 am on December 31, 2021:
contributor

As an exercise I added a 2 blocks, 4 teeth configuration:

add {2,4} to CONFIGS table in precompute_ecmult_gen.c; nice and declarative.

make precompute_ecmult_gen and run to get new precomputed_ecmult_gen.c which correctly includes new {2,4} option.

modify configure.ac to support new “1” option for ecmult-gen-kb

configure –with-ecmult-gen-kb=1, confirm in libsecp256k1-config.h that COMB_BLOCKS, COMB_TEETH are correct.

make clean, make, tests to confirm it’s working

This all worked without issues and was reasonably minimal effort. This particular example also satisfied me that there is no issue with combs covering exactly 256 bits.
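As a rough cross-check of the size options discussed in this thread, the mapping from (blocks, teeth) to table size can be sketched in a few lines of Python. This is a back-of-the-envelope model, assuming one 64-byte ge_storage entry per stored point and that only half of each table is stored (negations are derived, as explained later in the thread):

```python
def comb_table_kib(blocks, teeth):
    # COMB_SPACING = ceil(256 / (blocks*teeth)), matching the derived macro
    spacing = (255 + blocks * teeth) // (blocks * teeth)
    # Half of each of the `blocks` per-block tables is stored.
    entries = blocks * 2 ** (teeth - 1)
    return spacing, entries * 64 / 1024  # 64 bytes per stored point (assumed)

# These line up with the --with-ecmult-gen-kb=1|22|86 options in the thread:
assert comb_table_kib(2, 4) == (32, 1.0)
assert comb_table_kib(11, 6) == (4, 22.0)
assert comb_table_kib(43, 6) == (1, 86.0)
```

In particular, the {2,4} configuration exercised above comes out at exactly 1 KiB, matching the new “1” option.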

sipa
commented at 3:05 pm on December 31, 2021:
contributor

@peterdettman FWIW, the easiest way of achieving the same would be:

Modify configure.ac to support a new option

Run configure with that option

make clean-precomp && make normally

Because precompute_ecmult_gen is automatically built & run when a precomputed_*.c file is missing, and because precompute_ecmult_gen will always include the table for whatever the configuration is.

sipa force-pushed
on Dec 31, 2021

peterdettman
commented at 12:04 pm on January 1, 2022:
contributor

Having some confusion over possible bad interactions between modular reduction and signed-digit form, I wanted to reason things out in reverse.

For point P of order N, scalar s in [0,N), and an L-bit comb (2^L > N), let C(s,P) be the value calculated by our comb, which considers the bits of s, zero-extended to L bits as signed digits. Then:

C(s,P) == (2.s - (2^L - 1)) * P

Therefore in order to use the comb to calculate k * G, we solve for the scalar t to use:

   k * G == C(t,G) == (2.t - (2^L - 1)) * G
=> k == 2.t - (2^L - 1) mod N
=> t == (k + (2^L - 1))/2 mod N

Can we skip that halving and use G/2 instead?

C(2t,G/2) == (4.t - (2^L - 1)) * G/2
          == (2.t - (2^L - 1)/2) * G
          != C(t,G) unless 2^L == 1 mod N

So no, but let’s back up a step and ask what scalar u to use in the comb to calculate k * G as 2.k * (G/2):

   2.k * (G/2) == C(u,G/2) == (2.u - (2^L - 1)) * G/2
=> 2.k == 2.u - (2^L - 1) mod N
=> u == k + (2^L - 1)/2 mod N

and since L is constant, the halving is now only needed in the precomputation.

Scalar blinding (using b * G == B):

k * G == (k - b) * G + B == 2.(k - b) * (G/2) + B == C(k - b + (2^L - 1)/2, G/2) + B

where -b + (2^L - 1)/2 can be precomputed.
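The derivation above is easy to check numerically. A small Python sketch, emulating points by their discrete logarithms modulo a toy odd order N (so the point p*G is represented by the scalar p, and G/2 by 1/2 mod N):

```python
N = 199               # small odd group order, as in the exhaustive tests
L = 10                # comb bit length, with 2^L > N
half = pow(2, -1, N)  # the scalar 1/2 mod N; "G/2" is represented by half

def comb(s, p):
    """C(s,P): interpret the L bits of s as signed digits -1/+1 times 2^i*P."""
    return sum((2 * ((s >> i) & 1) - 1) * pow(2, i, N) * p for i in range(L)) % N

for k in range(N):
    # Direct form: t = (k + (2^L - 1))/2 mod N, comb over multiples of G.
    t = (k + (1 << L) - 1) * half % N
    assert comb(t, 1) == k
    # Halving-free form: u = k + (2^L - 1)/2 mod N, comb over multiples of G/2.
    u = (k + ((1 << L) - 1) * half) % N
    assert comb(u, half) == k
```

Both forms recover every k in [0,N), confirming that the halving can indeed be folded into the precomputed offset.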

sipa
commented at 12:13 pm on January 1, 2022:
contributor

Nice, that’s much better explained than my current comments. I’ll try to include it.

peterdettman
commented at 12:16 pm on January 1, 2022:
contributor

Although the above satisfies me mathematically, I would like to see an explicit test case where the comb_offset (2^L - 1)/2 causes a modular reduction (relative to k-b). e.g. arrange for k-b == N + 1 - comb_offset. I hope that’s not too painful, but otherwise random testing seems unlikely to hit such a case.

sipa
commented at 4:13 pm on January 1, 2022:
contributor

@peterdettman Making this change causes instant failure during the tests, at least:

--- a/src/ecmult_gen_impl.h
+++ b/src/ecmult_gen_impl.h
@@ -78,7 +78,7 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
      */

     /* Compute the scalar (gn + ctx->scalar_offset). */
-    secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn);
+    CHECK(!secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn));
     /* Convert to recoded array. */
     for (i = 0; i < 8; ++i) {
         recoded[i] = secp256k1_scalar_get_bits(&tmp, 32 * i, 32);

Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

peterdettman
commented at 4:28 pm on January 1, 2022:
contributor

scalar_offset includes the randomly-distributed blind, so there will be modular reductions. This isn’t concerning at all because we can reason about the blind independently of the comb (just input offset + output offset).

However the comb offset is small: 2^(COMB_BITS-256) * (2^256 - N). So I would like a test that involves the comb_offset itself causing a modular reduction. The math checks out of course, but an explicit test can’t hurt.

peterdettman
commented at 4:42 pm on January 1, 2022:
contributor

Actually, I think (2^L - 1)/2 mod N is only small (128 bits) if L == 256.

sipa
commented at 4:44 pm on January 1, 2022:
contributor

For the 43/6/1 configuration it is 0x80000000000000000000000000000001e7f9b4a5f9130fa66044722cc7ae9e1e
For the 11/6/4 configuration it is 0x8000000000000000000000000000000987e0873ddd5f4e3fe1563adfe6691698
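The observation above that (2^L - 1)/2 mod N is only small when L == 256, as well as the first of these constants, can be reproduced in Python (assuming the standard secp256k1 group order, and COMB_BITS = 43*6*1 = 258 for the 43/6/1 configuration):

```python
# secp256k1 group order
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

def comb_offset(L):
    """(2^L - 1)/2 mod n, the comb offset for an L-bit comb."""
    return (pow(2, L, n) - 1) * pow(2, -1, n) % n

# Only for L == 256 is the offset small (128 bits)...
assert comb_offset(256).bit_length() == 128
# ...while e.g. the 43/6/1 configuration (COMB_BITS = 258) gives the full
# 256-bit constant quoted above:
assert comb_offset(258) == 0x80000000000000000000000000000001e7f9b4a5f9130fa66044722cc7ae9e1e
```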

peterdettman
commented at 4:59 pm on January 1, 2022:
contributor

Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

So, these and maybe -1 would be enough. I will check a 256-bit config manually if they are added.

sipa force-pushed
on Jan 1, 2022

sipa
commented at 6:36 pm on January 1, 2022:
contributor

@peterdettman I’ve incorporated your derivation in the comments in ecmult_gen_impl.h, and added a test case for recoded={-1,0,1}.

in
configure.ac:183
in
a98735b09a (outdated)

183-[A larger table size usually results in possible faster signing.]
184-["auto" is a reasonable setting for desktop machines (currently 4). [default=auto]]
185+AC_ARG_WITH([ecmult-gen-kb], [AS_HELP_STRING([--with-ecmult-gen-kb=2|22|86|auto],
186+[The size of the precomputed table for signing in multiples of 1024 bytes (on typical platforms).]
187+[Larger values result in possibly better signing/keygeneration performance at the cost of a larger table.]
188+["auto" is a reasonable setting for desktop machines (currently 86). [default=auto]]

peterdettman
commented at 5:22 am on January 2, 2022:

Fixed. I thought of “comb()” here as a more abstract “the comb multiplication algorithm”, but I’ve now just aligned the ecmult_gen.h and ecmult_gen_impl.h descriptions (also switched over from c() to comb()).

sipa force-pushed
on Jan 2, 2022

sipa
commented at 4:41 pm on January 4, 2022:
contributor

I’ve worked on an additional change that introduces a COMB_RANGE which is normally 256, but in exhaustive test mode corresponds to the number of bits in EXHAUSTIVE_TEST_ORDER. Then COMB_BITS only has to cover COMB_RANGE etc, instead of always being at least 256.

And I’m seeing a suspicious failure in certain configurations with that (even making sure that the precomputed table only contains non-infinity points).

I’ll retry this approach to make sure I haven’t missed something.

sipa force-pushed
on Jan 4, 2022

sipa
commented at 11:58 pm on January 4, 2022:
contributor

I added a commit that permits COMB_BITS < 256 in exhaustive test mode. However, it doesn’t work in a lot of configurations, and I don’t understand what’s causing it.

Here is a list of (blocks teeth spacing) tuples and whether they work (for both order 13 and 199): groups.txt

sipa
commented at 3:34 pm on January 5, 2022:
contributor

Update: it appears that EXHAUSTIVE_TEST_ORDER < 2**(BLOCKS * TEETH * (SPACING - 1)) perfectly predicts which configurations work.

sipa force-pushed
on Jan 5, 2022

sipa
commented at 7:52 pm on January 5, 2022:
contributor

Final update: I was being very dumb, and precompute_ecmult_gen just had spacing = (COMB_RANGE + blocks * teeth) / (blocks * teeth) hardcoded, leading to an inconsistency between the table and the actual multiplication code.

False alarm.

sipa force-pushed
on Nov 23, 2022

sipa
commented at 4:29 am on November 23, 2022:
contributor

Rebased.

sipa
commented at 1:26 pm on November 23, 2022:
contributor

I believe this PR is ready for review (the discussion above was just observing some configurations not working due to a fairly stupid bug which was fixed).

sipa force-pushed
on Dec 29, 2022

sipa
commented at 4:21 am on December 29, 2022:
contributor

Rebased after merge of #1178. Added changelog entry.

sipa force-pushed
on Dec 29, 2022

sipa force-pushed
on Dec 29, 2022

sipa force-pushed
on Jan 11, 2023

sipa
commented at 9:50 pm on January 11, 2023:
contributor

stratospher
commented at 5:26 am on December 1, 2023:

764efa4: the new function introduced - ge_cmp_ge(a, b) is ge_equals_ge(a, b) with more safety checks to make sure that the first ge argument (a) passed has both a->x and a->y normalised right? (for secp256k1_fe_equal verify checks)

can’t we use ge_cmp_ge here?

if we’re using ge_equals_ge, shouldn’t final1.x and final1.y be normalised instead of final1.y and final2.y?

Ok, this comment made me take the time to just get rid of these strange test-only functions. See #1450.

in
src/scalar_low_impl.h:187
in
0cfa60b503 (outdated)

188+ }
189 /* If this VERIFY_CHECK triggers we were given a noninvertible scalar (and thus
190 * have a composite group order; fix it in exhaustive_tests.c). */
191- VERIFY_CHECK(*r != 0);
192+ VERIFY_CHECK(res != 0);
193+ *r = res;

stratospher
commented at 5:56 am on December 1, 2023:

0cfa60b: we could keep secp256k1_scalar_verify(r); in the end.

101+ * Thus (gn-b)*G can be written as comb(s,G) if gn-b = 2*s - (2^COMB_BITS - 1), or
102+ * s = (gn - b + (2^COMB_BITS - 1))/2 mod order.
103+ *
104+ * We use an alternative that avoids the modular division by two: we write (gn-b)*G =
105+ * comb(d,G/2). For that to hold it must be the case that
106+ * (gn-b)*G = (2*d + 2^COMB_BITS - 1) * (G/2), or

stratospher
commented at 6:37 am on December 4, 2023:

in
src/ecmult_gen_impl.h:106
in
da73fd9d51 (outdated)

122+
123+ /* In secp256k1_ecmult_gen_prec_table we have precomputed sums of the
124+ * (2*d_i-1) * 2^(i-1) * G points, for various combinations of i positions.
125+ * We rewrite our equation in terms of these table entries.
126+ *
127+ * Let mask(b) = sum(2^(b*COMB_TEETH + t)*COMB_SPACING for t=0..COMB_TEETH-1),

stratospher
commented at 2:10 am on January 11, 2024:

da73fd9:

* Let mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1),

stratospher
commented at 2:08 pm on March 10, 2024:

da73fd9: so for vs_pos = [256, 263] - since those bits are always set to 0. their possible combinations are computed and kept but never end up being used? also wondering if those big powers of 2 (maybe something like - 2**263 - 2**259 - 2**255 - 2**251 - 2**247 - 2**243) are safe to add in secp256k1_gej_add_var.

(talking about default configuration with blocks=11, teeth=6)

so for vs_pos = [256, 263] - since those bits are always set to 0. their possible combinations are computed and kept but never end up being used?

I believe that’s correct. We could avoid computing them, but this is just table generation code. There is no particular reason for it to be efficient (I’ve suggested replacing it with a Python script in the past…).

also wondering if those big powers of 2 (maybe something like - 2**263 - 2**259 - 2**255 - 2**251 - 2**247 - 2**243) are safe to add in secp256k1_gej_add_var.

That function operates on points, and clearly cannot know what multiple of G is being added (doing so would require breaking DLP!). The group of points is cyclic with order equal to the curve order, so e.g. 2**264 * gen/2 is $2^{263} G = (2^{263} \mod n) G$ = 55349809480404436077109870898555922636672 G.
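The reduction mentioned here can be confirmed in a couple of lines of Python, using the secp256k1 group order n:

```python
# secp256k1 group order
n = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141

# 2^263 = 128*2^256 = 128*n + 128*(2^256 - n), so modulo n only the small
# remainder 128*(2^256 - n) survives:
assert pow(2, 263, n) == 128 * (2**256 - n)
assert pow(2, 263, n) == 55349809480404436077109870898555922636672
```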

stratospher
commented at 4:38 pm on April 2, 2024:

oh makes sense.

in
src/ecmult_gen_impl.h:88
in
65f580e931 (outdated)

84@@ -85,19 +85,17 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
85 * (gn-b)*G = (2*d + 2^COMB_BITS - 1) * (G/2), or
86 * d = (gn + (2^COMB_BITS - 1)/2 - b) mod order.
87 *
88- * -b is precomputed as ctx->scalar_offset, so our final equations become:
89+ * 2^COMB_BITS - 1)/2 - b is precomputed as ctx->scalar_offset, so our final equations become:

stratospher
commented at 2:54 pm on March 10, 2024:

65f580e:

* (2^COMB_BITS - 1)/2 - b is precomputed as ctx->scalar_offset, so our final equations become:

sipa
commented at 7:47 pm on March 17, 2024:
contributor

Addressed comments and rebased.

real-or-random
commented at 3:22 pm on March 18, 2024:
contributor

CI reports segfaults in the sanitizer jobs, but I’m not sure if this is a problem with the compiler, e.g., see
https://github.com/bitcoin-core/secp256k1/actions/runs/8317860245/job/22759343772?pr=1058#step:9:148 where even a test binary built and run by ./configure segfaults. Note that this happens with the stable compilers, not with the nightly compilers… Our usage of sanitizers is quite heavy: we use multiple sanitizers at once, and we enable some special options for extra checks; maybe those don’t work any longer. (My feeling is that upstream maintenance of the sanitizers isn’t great currently; there are many open issues without responses from the maintainers.)

sipa force-pushed
on Mar 19, 2024

sipa force-pushed
on Mar 20, 2024

sipa
commented at 2:24 am on March 29, 2024:
contributor

@armfazh said he might be interested in looking at this. “Project Palm”

in
src/ecmult_gen_impl.h:64
in
1b218b3dca (outdated)

58+ secp256k1_scalar tmp;
59+ /* Array of uint32_t values large enough to store COMB_BITS bits. Only the bottom
60+ * 8 are ever nonzero, but having the zero padding at the end if COMB_BITS>256
61+ * avoids the need to deal with out-of-bounds reads from a scalar. */
62+ uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
63+ int first = 1, i;

stratospher
commented at 8:28 am on April 2, 2024:

1b218b3: accidental but i’m surprised it compiles.

Well, a scalar cannot contain more than 256 bits, and asking about higher bits than that is illegal, so i cannot exceed 7. The recoded array may be larger if the combs straddle the 256-bit boundary, but in that case the last recoded element will just always be 0, and thus the loop doesn’t need to touch it.

In commit “Permit COMB_BITS < 256 for exhaustive tests” a condition is added to this loop that makes it i < 8 && i < ((COMB_BITS + 31) >> 5).

in
src/ecmult_gen.h:132
in
b368086dae (outdated)

108@@ -109,6 +109,10 @@ typedef struct {
109 * ecmult_gen_impl.h for more details. */
110 secp256k1_scalar scalar_offset;
111 secp256k1_ge final_point_add;
112+
113+ /* Factor used for projective blinding. This value is used to rescale the Z
114+ * coordinate of the first table lookup. */
115+ secp256k1_fe proj_blind;

stratospher
commented at 3:06 pm on April 2, 2024:

b368086: we could clear this in secp256k1_ecmult_gen_context_clear

in
src/ecmult_gen_compute_table_impl.h:30
in
0a63916f36 (outdated)

58- (void)r;
59- VERIFY_CHECK(r);
60- secp256k1_gej_set_ge(&nums_gej, &nums_ge);
61- /* Add G to make the bits in x uniformly distributed. */
62- secp256k1_gej_add_ge_var(&nums_gej, &nums_gej, gen, NULL);
63+ /* u is the running power of two times gen we're working with, initially gen/2. */

I’m not sure that we want assertions inside the table generation tool code. We don’t have VERIFY and non-VERIFY modes of those etc. Do you suggest just using normal C assert? I guess we could do that.

Yeah, okay, it had simply occurred to me that this piece is easy to check without adding a dependency on ecmult… But checking that particular condition is not too interesting. So never mind, ignore my comment.

But having said this, do you think it makes sense to add some code in tests.c that does some basic checks on the generated table values? The advantage is that ecmult/ecmult_const won’t add a dependency there.

I think the run_ecmult_constants tests should generally test every table entry several times already, regardless of configuration. We could add an explicit test that basically recomputes the table using existing ecmult, but perhaps that’s overkill?

Given that the precompute binary is built with -DVERIFY, I’ve added your suggestion here to VERIFY_CHECK that u*2 = gen.

real-or-random
commented at 9:13 am on April 3, 2024:
contributor

More to follow, I’m in the middle of the main commit…

sipa force-pushed
on Apr 3, 2024

in
src/ecmult_gen_impl.h:69
in
62601920be (outdated)

85+ * a value b determined by the context. b*G is precomputed as ctx->ge_offset, so we're
86+ * left with computing R = (gn-b)*G + ctx->ge_offset.
87+ *
88+ * The multiplication (gn-b)*G will be performed using a signed-digit multi-comb (see Section
89+ * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg
90+ * (https://eprint.iacr.org/2012/309).

89+ * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg
90+ * (https://eprint.iacr.org/2012/309).
91+ *
92+ * Let comb(s,P) = sum((2*s_i-1)*2^i*P for i=0..COMB_BITS-1), where s_i is the i'th bit of the
93+ * binary representation of scalar s. So the s_i values determine whether -2^i*P (s_i=0) or
94+ * +2^i*P (s_i=1) are added together. By manipulating:

It’s the difference between the multiplicand scalar and the scalar whose encoding the table lookup bits are drawn from (before adding the blinding offset).

in
src/ecmult_gen_impl.h:118
in
62601920be (outdated)

134+ * mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260
135+ *
136+ * Imagine we have a table(b,m) function which can look up, given b and
137+ * m=(recoded & mask(b)), the sum of (2*d_i-1)*2^(i-1)*G for all bit positions
138+ * i set in mask(b). In our example, table(0, 1 + 2^8 + 2^20) would be equal to
139+ * (2^-1 - 2^3 + 2^7 - 2^11 - 2^15 + 2^19)*G.

Hm, this took me a while to follow. Let me explain in my own words. So you assume that the relevant bits of d are 1...0...0...1...0...1, or 1...0...1...0...0...1 in reverse, which is 1 + 2^8 + 2^20, which determines the positive summands in the G factor (decremented because the formula says so). Is this right? Want to elaborate in the comment?
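The table(b,m) example being discussed can also be checked numerically, emulating points by their discrete logs modulo a toy odd order N (so “2^-1” is the scalar inverse of two, standing in for G/2; TEETH=6 and SPACING=4 are taken from the example configuration):

```python
N = 199
half = pow(2, -1, N)
TEETH, SPACING = 6, 4  # the example (11/6/4-style) configuration

def mask_positions(b):
    # The bit positions that make up mask(b).
    return [(b * TEETH + t) * SPACING for t in range(TEETH)]

def table(b, m):
    # Sum of (2*m_i - 1) * 2^(i-1) over the bit positions i in mask(b).
    return sum((2 * ((m >> i) & 1) - 1) * pow(2, i, N) * half
               for i in mask_positions(b)) % N

# table(0, 1 + 2^8 + 2^20) == (2^-1 - 2^3 + 2^7 - 2^11 - 2^15 + 2^19)*G:
expected = (half - 2**3 + 2**7 - 2**11 - 2**15 + 2**19) % N
assert table(0, 1 + 2**8 + 2**20) == expected
```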

in
src/ecmult_gen_impl.h:144
in
62601920be (outdated)

160+ * return R
161+ *
162+ * The last question is how to implement the table(b,m) function. For any value of
163+ * b, m=(recoded & mask(b)) can only take on at most 2^COMB_TEETH possible values
164+ * (the last one may have fewer, as mask(b) may extend beyond the top of the range). So we could
165+ * create COMB_BLOCKS tables which contain a value for each such m value.

in
src/ecmult_gen_impl.h:150
in
62601920beoutdated

166+ *
167+ * Due to the fact that every table entry is a sum of positive and negative powers
168+ * of two multiplied by G, every table will contain pairs of negated points:
169+ * if all the masked bits in m flip, the table value is negated. We can exploit this
170+ * to only store the first half of every table. If an entry from the second half is
171+ * needed, we look up its bit-flipped version instead, and conditionally negate it.
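The bit-flip negation symmetry that lets half of every table be dropped can be verified with the same discrete-log toy model (a sketch modulo a small odd order N; mask(0) with TEETH=6, SPACING=4 assumed for illustration):

```python
N = 199
half = pow(2, -1, N)
TEETH, SPACING = 6, 4
positions = [t * SPACING for t in range(TEETH)]  # the bits of mask(0)
mask0 = sum(1 << i for i in positions)

def table0(m):
    # Sum of (2*m_i - 1) * 2^(i-1) over the bit positions i in mask(0).
    return sum((2 * ((m >> i) & 1) - 1) * pow(2, i, N) * half
               for i in positions) % N

for bits in range(2 ** TEETH):
    m = sum(((bits >> t) & 1) << positions[t] for t in range(TEETH))
    # Flipping all masked bits in m negates the table entry:
    assert table0(mask0 ^ m) == (-table0(m)) % N
```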

in
src/ecmult_gen_impl.h:172
in
62601920beoutdated

188+ uint32_t bit = secp256k1_scalar_get_bits(&recoded, bit_pos, 1);
189+ bits |= bit << tooth;
190+ bit_pos += COMB_SPACING;
191+ }
192+
193+ /* If the top bit of bits is 1, conditionally flip them all (corresponding

IIUC this relies on COMB_TEETH <= 32. Want to VERIFY_CHECK this?

I guess it will be more natural to declare tooth as int, it’s not involved in all the bit magic/masking. I thought the same is true for index, but it’s compared against abs, so it’s probably better to keep it.

On a related note, it may be a good idea to change the return value of secp256k1_scalar_get_bits to uint32_t? It’s currently unsigned int. We kind of assert in assumptions.h that this type is at least 32 bits, but I think the implicit conversion to uint32_t is unexpected for readers. Plus, only the _t types are guaranteed to be “sane” (2’s complement, no padding bits, …). (Sorry, I haven’t checked the other callers…)

in
src/ecmult_gen_impl.h:178
in
c30860aa26 (outdated)

192+ * = (mask(b) - m - mask(b)/2)*G
193+ * = (-m - mask(b)/2)*G
194+ * = -(m - mask(b)/2)*G
195+ * = -table(b,m)
196+ *
197+ * Thus flipping all the (relevant) bits in m means negating the table result. Because of this

It makes sense to me that flipping the bits means flipping the signs of the summands, and thus flipping the result; I think this argument alone would convince me.

I can follow the 1st equal sign assuming m=(d & mask(b)). Maybe say that the 1st equality needs this, or just substitute (d & mask(b) for m.

84f5ce761ced5da6c9332279ff491b139e1510c7 Always generate tables for current (blocks,teeth) config

I think this is fine for the table generation code in the end, but I smell a refactor here. This code is kind of abusing the loop. I think what you want is to extract the main part into a function and have something like this in main:

I think in the spirit of trying to make manual builds, i.e., without build system, easy (https://github.com/bitcoin-core/secp256k1/issues/929), it would be better to move the selection logic here to the C code, and have ECMULT_GEN_KB as the “exposed” configuration macro. Then configure, cmake, and manual builds will have the same interface. (We could still allow passing COMB_BLOCKS and COMB_TEETH directly as long as ECMULT_GEN_KB is not defined, to retain the possibility to select a non-standard config.)

This shouldn’t hold up this PR. It could also be done in a follow-up PR, if you feel that’s easier.

Makes sense, let me leave a link to #189 here, just as a reminder.

edit: I took the freedom to unresolve this conversation, as a reminder for follow-ups.

in
src/ecmult_gen_impl.h:258
in
9d917a1499 (outdated)

225@@ -225,7 +226,13 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
226 secp256k1_fe_cmov(&add.y, &neg, sign);
227228 /* Add the looked up and conditionally negated value to r. */
229- secp256k1_gej_add_ge(r, r, &add);
230+ if (EXPECT(first, 0)) {
231+ /* If this is the first table lookup, we can skip addition. */

you’re right, I misremembered the surrounding style, sorry

in
src/scalar.h:29
in
1921fd8bca (outdated)

22@@ -23,10 +23,10 @@
23 static void secp256k1_scalar_clear(secp256k1_scalar *r);
2425 /** Access bits from a scalar. All requested bits must belong to the same 32-bit limb. */
26-static unsigned int secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
27+static uint32_t secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
2829 /** Access bits from a scalar. Not constant time in offset and count. */
30-static unsigned int secp256k1_scalar_get_bits_var(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
31+static uint32_t secp256k1_scalar_get_bits_var(const secp256k1_scalar *a, unsigned int offset, unsigned int count);

Not this PR, but let me note that the names are a bit confusing. Just from the names, one would expect that secp256k1_scalar_get_bits_var is faster… But yeah, that’s not the case. I was about to suggest that you use the _var for performance.

We should probably rename the constant-time one to secp256k1_scalar_get_bits_limb32 or something like that. As you have this commit here anyway, you could add this, but feel free to ignore.

If (bit_pos & 0x1f) happens to be 31, we still extract a single bit. Could it help to ensure that there’s always some junk in the high bits of bitdata? (Not sure what’s the best way to achieve this.)

–
Also, compilers are not helpful here: gcc keeps the xor, but clang 17.0.6 manages to move the xor out of the loop and replace it with a mov.

To inspect this, try objdump-nice .libs/libsecp256k1.so | less and search for bitdata, where objdump-nice is a shell alias for objdump --disassemble --wide --debugging --line-numbers --source-comment --visualize-jumps=extended-color --disassembler-options=intel --no-show-raw-insn --no-addresses

volatile helps, but I’m not convinced:

bits = *((volatile uint32_t*)&bits) ^ (bitdata << tooth); keeps the xor, but moves bits from a register to memory. Not sure if this better in terms of leakage.

bits ^= bitdata << *((volatile uint32_t*)&tooth); also keeps the xor. This moves tooth to memory and prevents clang from unrolling the loop.

Squashed it in (and added you as co-author of the commit).

in
src/tests.c:5619
in
8ec3b9f0eb (outdated)

5614+ /* Run test with gn = i - scalar_offset (so that the ecmult_gen recoded value represents i). */
5615+ secp256k1_ecmult_gen(&CTX->ecmult_gen_ctx, &res1, &gn);
5616+ secp256k1_ecmult(&res2, NULL, &secp256k1_scalar_zero, &gn);
5617+ secp256k1_gej_neg(&res2, &res2);
5618+ secp256k1_gej_add_var(&res1, &res1, &res2, NULL);
5619+ CHECK(secp256k1_gej_is_infinity(&res1));

5611+ secp256k1_scalar_negate(&gn, &gn); /* gn = -1 - scalar_offset */
5612+
5613+ for (i = -1; i < 2; ++i) {
5614+ /* Run test with gn = i - scalar_offset (so that the ecmult_gen recoded value represents i). */
5615+ secp256k1_ecmult_gen(&CTX->ecmult_gen_ctx, &res1, &gn);
5616+ secp256k1_ecmult(&res2, NULL, &secp256k1_scalar_zero, &gn);

in
src/ecmult_gen_impl.h:201
in
7dec5ab865 (outdated)

225+ uint32_t bit_pos = comb_off;
226+ /* Inner loop: for each block, add table entries to the result. */
227+ for (block = 0; block < COMB_BLOCKS; ++block) {
228+ /* Gather the mask(block)-selected bits of d into bits. They're packed
229+ * together: bit (tooth) of bits = bit
230+ * ((block*COMB_TEETH + tooth)*COMB_SPACING + comb_off) of d. */

It may look overkill, but I don’t think it is upon a closer look: This saves ~1.1 us on my machine (lazy benchmark with turboboost enabled), which is 40% in context_create with NULL seed (because that has an early return to skip the ecmult_gen). But also ecmult_gen is just ~10 us, so this will also save ~10% in context_randomize, which is not entirely unimportant.

Hmm, I realize we also have secp256k1_scalar_half now (added after this PR was opened), which would make most of the cost of the diff calculation go away too. WDYT?

yeah that’s much simpler, I didn’t remember scalar_half…

sipa force-pushed
on Apr 5, 2024

sipa force-pushed
on Apr 6, 2024

sipa force-pushed
on Apr 6, 2024

in
src/ecmult_gen_compute_table_impl.h:30
in
a61f0e5921 (outdated)

38-
39- /* get the generator */
40- secp256k1_gej_set_ge(&gj, gen);
41-
42- /* Construct a group element with no known corresponding scalar (nothing up my sleeve). */
43+ /* u is the running power of two times gen we're working with, initially gen/2. */

Yep, and I had to find this GCC bug to understand that GCC talks about the memory pointed to by ret (inside checked_malloc) being uninitialized, and not ret itself…

I think we should expand a bit on the difference to the old table sizes. (Not sure to which section that belongs then. Your current approach of using three sections is technically correct. But maybe just merge all of these items into one item under [Changed]; then it’s less scattered.)

nit: Want to give also the CMake options (see older changelog items for examples)?

real-or-random
commented at 5:10 pm on April 12, 2024:
contributor

As part of testing this, I played around with the edge values for the config parameters. This led me to a series of minor nits/fixups in the code that handles the parameters, see https://github.com/real-or-random/secp256k1/tree/202112_sdmc, that you can pick. Most of these commits, if not all, should probably be squashed into your commits (and a regeneration of the precomputed files is necessary, which is why CI fails on my repo).

Apart from that ACK. I did all the review and testing that I planned to do.

sipa force-pushed
on Apr 15, 2024

sipa
commented at 3:33 pm on April 15, 2024:
contributor

@real-or-random I’ve adopted almost all your commits, except for:

I don’t see the need for “check for value of COMB_RANGE in precomputed file”, as COMB_RANGE doesn’t affect the table.

I have turned “Introduce CEIL_DIV macro and use it” into a separate, first, commit, which is used in a few more places in the codebase. This also necessitated switching to a different formula that avoids an overflow.

I did things a bit differently for “add todo for you”.

This is clearly better than my suggestion, but this can now underflow for unsigned x == 0, so the docs need to be changed:

/* ceil(x/y) for integers x > 0 and y > 0. Here, / denotes rational division. */
#define CEIL_DIV(x, y) (1 + ((x) - 1) / (y))

It’s a bit sad that there’s no real way to add a VERIFY_CHECK in a macro (or at least not one which is compatible with the way we use this macro in WNAF_SIZE_BITS(bits, w) as a compile time integer constant…). If you think that’s better, we could drop that use, and make it a real function size_t ceil_div(size_t, size_t) with a VERIFY_CHECK instead, but no strong opinions here.

Nice catch. I’ve adopted your suggested change. All current invocation sites have clearly non-zero input, so I don’t think we need to needlessly complicate.

real-or-random
commented at 4:46 pm on April 15, 2024:
contributor

ACK mod nit

Introduce CEIL_DIV macro and use it aa00a6b892

sipa force-pushed
on Apr 15, 2024

real-or-random approved

real-or-random
commented at 5:39 pm on April 15, 2024:
contributor

64+ *diff = secp256k1_scalar_one;
65+ for (i = 0; i < COMB_BITS - 1; ++i) {
66+ secp256k1_scalar_add(diff, diff, diff);
67+ }
68+
69+ /* The result is the sum of 2^(COMB_BITS - 1) + (-1/2). */

15+
16+/* Configuration parameters for the signed-digit multi-comb algorithm:
17+ *
18+ * - COMB_BLOCKS is the number of blocks the input is split into. Each
19+ * has a corresponding table.
20+ * - COMB_TEETH is the number of bits simultaneously covered by one table.

jonasnick
commented at 6:42 pm on April 16, 2024:
contributor

Concept ACK

Can confirm a 13% speedup with default parameters on my dev laptop.

Initial gej blinding -> final ge blinding

Instead of having the starting point of the ecmult_gen computation be
offset, do it with the final point. This enables reasoning over the
set of points reachable in intermediate computations, which can be
leveraged by potential future optimizations.
Because the final point is in affine coordinates, its projective
blinding is no longer possible. It will be reintroduced in a
different way in a later commit.
Also introduce some more comments and more descriptive names.

ab45c3e089

Make exhaustive tests' scalar_inverse(&x,&x) work

The old code overwrote the input at the start of the function,
making a call like secp256k1_scalar_inverse(&x,&x) always fail.

486518b350

sipa force-pushed
on Apr 19, 2024

jonasnick approved

jonasnick
commented at 3:18 pm on April 19, 2024:
contributor

ACK 6a3362a1dcb556a84a0876e393f4d2f3fba34a32

Signed-digit multi-comb ecmult_gen algorithm

This introduces the signed-digit multi-comb multiplication algorithm
for constant-time G multiplications (ecmult_gen). It is based on
section 3.3 of "Fast and compact elliptic-curve cryptography" by
Mike Hamburg (see https://eprint.iacr.org/2012/309).
Original implementation by Peter Dettman, with changes by Pieter Wuille
to use scalars for recoding, and additional comments.

fde1dfcd8d

Always generate tables for current (blocks,teeth) config 5f7be9f6a5

Provide 3 configurations accessible through ./configure ed2a056f3d

Optimization: move (2^COMB_BITS-1)/2 term into ctx->scalar_offset

It is unnecessary to recompute this term needed by the SDMC algorithm
for every multiplication; move it into the context scalar_offset value
instead.

7a33db35cd

Optimization: first table lookup needs no point addition 15d0cca2a6

Optimization: avoid unnecessary doublings in precomputation 6247f485b6

Make secp256k1_scalar_get_bits support 32-bit reads

The old code would trigger UB when count=32.

e03dcc44b5

Optimization: use Nx32 representation for recoded bits

The existing code needs to deal with the edge case that bit_pos >= 256,
which would lead to an out-of-bounds read from secp256k1_scalar.
Instead, recode the scalar into an array of uint32_t with enough zero
padding at the end to alleviate the issue. This also simplifies the
code, and is necessary for a security improvement in a follow-up
commit.
Original code by Peter Dettman, with modifications by Pieter Wuille.

a0d32b597d

Reduce side channels from single-bit reads

Co-authored-by: Tim Ruffing <crypto@timruffing.de>

07810d9abb

Reintroduce projective blinding 644e86de9a

Add test case for ecmult_gen recoded = {-1,0,1} 39b2f2a321

Permit COMB_BITS < 256 for exhaustive tests a043940253

Add changelog entry for SDMC 4c341f89ab

sipa force-pushed
on Apr 19, 2024

real-or-random approved

real-or-random
commented at 4:15 pm on April 19, 2024:
contributor

reACK 4c341f89ab704205a89b5c5d404ed60f381f7c48

jonasnick approved

jonasnick
commented at 5:49 pm on April 19, 2024:
contributor

ACK 4c341f89ab704205a89b5c5d404ed60f381f7c48

stratospher
commented at 4:25 pm on April 20, 2024:
contributor

ACK 4c341f8. Did these benchmarks and saw a 12.4% speedup on gcc 13.2.0 and 11.5% on clang 15.0.0. Also summarised how the precomputed table generation works here for future me :)

This is a metadata mirror of the GitHub repository
bitcoin-core/secp256k1.
This site is not affiliated with GitHub.
Content is generated from a GitHub metadata backup.
generated: 2024-09-20 03:15 UTC

This site is hosted by @0xB10C. More mirrored repositories can be found on mirror.b10c.me