Signed-digit multi-comb ecmult_gen algorithm #1058

sipa wants to merge 17 commits into bitcoin-core:master from sipa:202112_sdmc (changing 22 files, +2418 −9970)
  1. sipa commented at 8:56 pm on December 29, 2021: contributor

    A third iteration of the signed-digit multi-comb ecmult_gen algorithm (earlier attempts: #693, and #546 by Peter Dettman). Short summary:

    • A new constant-time point multiplication algorithm with precomputation (so only used for multiply with G).
    • Based on section 3.3 of https://eprint.iacr.org/2012/309 by Mike Hamburg.
    • Configurable through two parameters: COMB_BLOCKS and COMB_TEETH
      • Currently only 3 predefined configurations are reachable through ./configure. All three are included in precomputed_ecmult_gen.c and tested in CI.
        • --with-ecmult-gen-kb=2: 2 kB table with COMB_BLOCKS=2 COMB_TEETH=5
        • --with-ecmult-gen-kb=22: 22 kB table with COMB_BLOCKS=11 COMB_TEETH=6
        • --with-ecmult-gen-kb=86: 86 kB table with COMB_BLOCKS=43 COMB_TEETH=6
      • Many more configurations can be reached by manually setting the macros. These are not tested.
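
    For intuition about the three curated sizes: the table holds COMB_BLOCKS * 2^(COMB_TEETH-1) precomputed points, and each affine point takes 64 bytes on typical platforms (a secp256k1_ge_storage). A minimal standalone sketch of that arithmetic (the 64-byte entry size is an assumption about typical builds, not something stated above):

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Estimated table footprint: blocks * 2^(teeth-1) entries of 64 bytes each. */
    static size_t table_bytes(size_t blocks, size_t teeth) {
        return blocks * ((size_t)1 << (teeth - 1)) * 64;
    }

    int main(void) {
        assert(table_bytes(2, 5) == 2 * 1024);    /* --with-ecmult-gen-kb=2  */
        assert(table_bytes(11, 6) == 22 * 1024);  /* --with-ecmult-gen-kb=22 */
        assert(table_bytes(43, 6) == 86 * 1024);  /* --with-ecmult-gen-kb=86 */
        printf("ok\n");
        return 0;
    }
    ```

    The three configure options reproduce exactly the 2 kB, 22 kB, and 86 kB figures above.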

    Compared with the previous PR #693:

    • Updated to the new static-precomputation-only model (#893).
    • Just 3 curated configurations reachable through configure.
    • Removed some optimizations that do not matter (much).
    • Do blinding through a final correction add rather than an initial starting point, which may later permit using incomplete addition formulae (#1051).
    • The recoding of the input scalar to signed bit representation is done slightly differently, which needs fewer special cases.
  2. sipa commented at 11:07 pm on December 29, 2021: contributor

    Benchmarks:

    AMD Ryzen 5950X, GCC 11.2.0, default compile options

    Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign

    master:       schnorrsig_sign               ,    18.1       ,    18.1       ,    18.1
    pr1058 kb=2:  schnorrsig_sign               ,    17.9       ,    17.9       ,    17.9
    pr1058 kb=22: schnorrsig_sign               ,    15.1       ,    15.1       ,    15.1
    pr1058 kb=86: schnorrsig_sign               ,    14.4       ,    14.4       ,    14.4
    

    Intel Core I7-7820HQ @ 2.3 GHz, GCC 10.3.0, default compile options

    Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=1000000 ./bench schnorrsig_sign

    master:       schnorrsig_sign               ,    38.8       ,    38.9       ,    38.9
    pr1058 kb=2:  schnorrsig_sign               ,    39.4       ,    39.4       ,    39.4
    pr1058 kb=22: schnorrsig_sign               ,    33.2       ,    33.3       ,    33.3
    pr1058 kb=86: schnorrsig_sign               ,    32.4       ,    32.4       ,    32.4
    

    ARM Cortex-A53 @ 1 GHz, GCC 9.3.0, default compile options

    Using ./autogen.sh && ./configure --enable-experimental --enable-module-schnorrsig && make clean && make -j check && SECP256K1_BENCH_ITERS=100000 ./bench schnorrsig_sign

    master:       schnorrsig_sign               ,   249.0       ,   249.0       ,   249.0
    pr1058 kb=2:  schnorrsig_sign               ,   250.0       ,   250.0       ,   250.0
    pr1058 kb=22: schnorrsig_sign               ,   200.0       ,   200.0       ,   200.0
    pr1058 kb=86: schnorrsig_sign               ,   192.0       ,   192.0       ,   192.0
    
  3. sipa commented at 2:44 pm on December 30, 2021: contributor
    It occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.
  4. peterdettman commented at 3:07 pm on December 30, 2021: contributor

    It occurs to me that we could actually avoid the cost of doing the scalar halving at ecmult_gen time, by instead having precomputed tables with multiples of G/2 instead of G.

    The scalar halving also ensures that the low bit (the shifted-away bit) is 0.

  5. sipa commented at 3:12 pm on December 30, 2021: contributor

    I don’t think that matters? Even/odd has no special meaning when working modulo a prime.

    The PR currently uses the bits of the scalar (input + 2^COMB_BITS - 1 - blind) * 2^-1 to select multiples of G to add together.

    My suggestion is that instead it could use the bits of (input + 2^COMB_BITS - 1 - blind) to select multiples of G/2 to add together.

  6. sipa commented at 3:33 pm on December 30, 2021: contributor
    It seems I’m wrong, but I’m confused why!
  7. peterdettman commented at 3:35 pm on December 30, 2021: contributor

    I think confusion is creeping in since after the halving the scalar isn’t a scalar value anymore; it’s in signed-digit form, which can only represent an odd value. In particular the bits of a scalar s in signed-digit form represent the scalar value 2*s+1. I think you should also be careful not to reason modulo the order in signed-digit form.

    Edit: Actually, even that might not be quite complete since the high bit is being treated specially here.

  8. sipa commented at 3:43 pm on December 30, 2021: contributor
    It works, but you need to use the bits of input - blind + (2^COMB_BITS - 1)/2 instead. That’s what you get when you substitute 2*(input-blind) for e in the formula in the paper (because now we’re trying to compute input*G = 2*(input-blind)*(G/2) + blind*G).
  9. peterdettman commented at 3:56 pm on December 30, 2021: contributor
    Oh I see, using double the input and the blind (of G/2) lets the halving be moved to precomputation. Nice.
  10. in src/ecmult_gen_impl.h:59 in 4a4d16e83c outdated
    +     * The value (2^COMB_BITS - 1 - b) is precomputed as ctx->scalar_offset, and bG is
    +     * precomputed as ctx->final_point_add. Thus recoded can be written as
    +     * recoded = (gn + scalar_offset)/2, and R becomes the sum of (2*bit_i-1)*2^i*G
    +     * values plus final_point_add. */
    +
    +    /* Compute (gn + 2^COMB_BITS - 1 - 1)/2 value as a scalar. */
    


    peterdettman commented at 4:16 pm on December 30, 2021:
    Nit: -1 -b

    sipa commented at 5:09 pm on December 30, 2021:
    Superseded by new description.
  11. sipa force-pushed on Dec 30, 2021
  12. sipa commented at 5:08 pm on December 30, 2021: contributor
    Updated to use the avoid-halving-scalar trick.
  13. sipa force-pushed on Dec 30, 2021
  14. in src/ecmult_gen.h:92 in a7f0d0e5de outdated
    -    /* Blinding values used when computing (n-b)G + bG. */
    -    secp256k1_scalar blind; /* -b */
    -    secp256k1_gej initial;  /* bG */
    +    /* Blinding values used when computing nG as (n-b)G + bG. */
    +    secp256k1_scalar scalar_offset; /* -b */
    +    secp256k1_ge final_point_add;  /* bG */
    


    peterdettman commented at 5:03 am on December 31, 2021:

    These comments are all a bit off. I would suggest a short explanation here and possibly defer to the fuller explanations in ecmult_gen_impl.h. e.g.:

    Values chosen such that n*G == comb(n + scalar_offset, G) + final_point_add. The latter expression lets us use scalar blinding and optimize the comb precomputation.


    sipa commented at 3:52 pm on December 31, 2021:
    Done.

    real-or-random commented at 7:19 am on April 3, 2024:

    16beb95db60757bbf559e97c3d0cef519355f410 Initial gej blinding -> final ge blinding

    nit: I suggest renaming to ge_offset.

    • s/_add/_offset to make this consistent with scalar_offset.
    • drop final to uncouple the naming from whether the blinding is applied at the beginning or at the end
    • s/point/ge because that’s more common in our code

    (feel free to pick just any of these transforms if you don’t like all of them)

    If you accept this, you’ll probably also want to rename f, final1, final2 in the tests.c then.

    edit: After looking at all the commits, I see that this could be a rebase mess… I think, it’s okay to add a rename commit if you like the suggestion.


    sipa commented at 5:28 pm on April 3, 2024:
    Done, renamed it to ge_offset across all commits. f, final1, final2 were renamed to p, p1, p2 respectively.
  15. in configure.ac:310 in a7f0d0e5de outdated
    @@ -307,20 +306,31 @@ case $set_ecmult_window in
     esac
     
     # Set ecmult gen precision
    -if test x"$req_ecmult_gen_precision" = x"auto"; then
    -  set_ecmult_gen_precision=4
    +if test x"$req_ecmult_gen_kb" = x"auto"; then
    +  set_ecmult_gen_kb=86
    


    peterdettman commented at 6:40 am on December 31, 2021:
    I’m a little bit of the mind that 22 should be the default, but perhaps that’s my old brain stuck in a 64 kB L1 world. Still, would it be reasonable to default to 22 for 32-bit?

    real-or-random commented at 10:43 am on December 31, 2021:

    With #929, it may be a good idea to handle the logic in the preprocessor in the case of “auto”.

    If you agree, I think it’s ok to either do it in this PR, or postpone it to a separate PR that also takes care of all the other settings/parameters (which we should do anyway).


    sipa commented at 3:09 pm on December 31, 2021:

    I’m not convinced about what the defaults should be, but given that ecmult already defaults to 1 MiB precomputed table, I see little harm in picking an 86 KiB one here.

    I agree we can easily revisit defaults later (especially in combination with #929).


    sipa commented at 3:52 pm on December 31, 2021:
    I changed the configure-driven default to also be 22 kB; we can revisit the defaults later.
  16. in src/ecmult_gen.h:57 in a7f0d0e5de outdated
    +#    define COMB_TEETH 1
    +#  endif
    +#else /* !defined(EXHAUSTIVE_TEST_ORDER) */
    +/* Use (11, 6) as default configuration, which results in a 22 kB table. */
    +#  ifndef COMB_BLOCKS
    +#    define COMB_BLOCKS 11
    


    peterdettman commented at 6:41 am on December 31, 2021:
    Just noting that this isn’t the same as the configure-driven default, not sure that it matters.

    sipa commented at 3:52 pm on December 31, 2021:
    I changed the configure-driven default to also be 22 kB; we can revisit the defaults later.
  17. in src/ecmult_gen.h:73 in a7f0d0e5de outdated
    +#endif
    +
    +/* The remaining COMB_* parameters are derived values, don't modify these. */
    +/* - The distance between the teeth of the comb. */
    +#define COMB_SPACING ((255 + COMB_BLOCKS * COMB_TEETH) / (COMB_BLOCKS * COMB_TEETH))
    +/* - The number of bits covered by all the combs; must be at least 256. */
    


    peterdettman commented at 6:46 am on December 31, 2021:
    “by all the blocks” is maybe better.

    sipa commented at 3:52 pm on December 31, 2021:
    Done.
  18. in src/ecmult_gen.h:71 in a7f0d0e5de outdated
    +#if !(1 <= COMB_TEETH && COMB_TEETH <= 8)
    +#  error "COMB_TEETH must be in the range [1, 8]"
    +#endif
    +
    +/* The remaining COMB_* parameters are derived values, don't modify these. */
    +/* - The distance between the teeth of the comb. */
    


    peterdettman commented at 6:47 am on December 31, 2021:
    “in each comb”?

    sipa commented at 3:51 pm on December 31, 2021:
    Done.
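
    As an aside on the derived macro in the diffs above: COMB_SPACING is just a ceiling division, picking the smallest spacing such that COMB_BLOCKS * COMB_TEETH * COMB_SPACING covers all 256 scalar bits. A small standalone sketch of that property (mirroring the macro, not library code):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Mirrors the COMB_SPACING macro: ((255 + B*T) / (B*T)) == ceil(256 / (B*T)). */
    static int comb_spacing(int blocks, int teeth) {
        return (255 + blocks * teeth) / (blocks * teeth);
    }

    int main(void) {
        int blocks, teeth;
        /* The three curated configurations. */
        assert(comb_spacing(2, 5) == 26);   /* COMB_BITS = 2*5*26 = 260 */
        assert(comb_spacing(11, 6) == 4);   /* COMB_BITS = 11*6*4 = 264 */
        assert(comb_spacing(43, 6) == 1);   /* COMB_BITS = 43*6*1 = 258 */
        /* For any (blocks, teeth) in range, the combs cover at least 256 bits. */
        for (blocks = 1; blocks <= 256; blocks++) {
            for (teeth = 1; teeth <= 8; teeth++) {
                assert(blocks * teeth * comb_spacing(blocks, teeth) >= 256);
            }
        }
        printf("ok\n");
        return 0;
    }
    ```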
  19. peterdettman commented at 7:06 am on December 31, 2021: contributor

    As an exercise I added a 2 blocks, 4 teeth configuration:

    1. add {2,4} to CONFIGS table in precompute_ecmult_gen.c; nice and declarative.
    2. make precompute_ecmult_gen and run to get new precomputed_ecmult_gen.c which correctly includes new {2,4} option.
    3. modify configure.ac to support new “1” option for ecmult-gen-kb
    4. ./configure --with-ecmult-gen-kb=1, confirm in libsecp256k1-config.h that COMB_BLOCKS, COMB_TEETH are correct.
    5. make clean, make, tests to confirm it’s working

    This all worked without issues and was reasonably minimal effort. This particular example also satisfied me that there is no issue with combs covering exactly 256 bits.

  20. sipa commented at 3:05 pm on December 31, 2021: contributor

    @peterdettman FWIW, the easiest way of achieving the same would be:

    1. Modify configure.ac to support a new option
    2. Run configure with that option
    3. make clean-precomp && make normally

    Because precompute_ecmult_gen is automatically built & run when a precomputed_*.c file is missing, and because precompute_ecmult_gen will always include the table for whatever the configuration is.

  21. sipa force-pushed on Dec 31, 2021
  22. peterdettman commented at 12:04 pm on January 1, 2022: contributor

    Having some confusion over possible bad interactions between modular reduction and signed-digit form, I wanted to reason things out in reverse.

    For point P of order N, scalar s in [0,N), and an L-bit comb (2^L > N), let C(s,P) be the value calculated by our comb, which considers the bits of s, zero-extended to L bits, as signed digits. Then:

        C(s,P) == (2.s - (2^L - 1)) * P
    

    Therefore in order to use the comb to calculate k * G, we solve for the scalar t to use:

        k * G == C(t,G) == (2.t - (2^L - 1)) * G
    =>  k == 2.t - (2^L - 1) mod N
    =>  t == (k + (2^L - 1))/2 mod N
    

    Can we skip that halving and use G/2 instead?

        C(2t,G/2) == (4.t - (2^L - 1)) * G/2
                  == (2.t - (2^L - 1)/2) * G
                  != C(t,G) unless 2^L == 1 mod N
    

    So no, but let’s back up a step and ask what scalar u to use in the comb to calculate k * G as 2.k * (G/2):

        2.k * (G/2) == C(u,G/2) == (2.u - (2^L - 1)) * G/2
    =>  2.k == 2.u - (2^L - 1) mod N
    =>  u == k + (2^L - 1)/2 mod N
    

    and since L is constant, the halving is now only needed in the precomputation.

    Scalar blinding (using b * G == B):

        k * G == (k - b) * G + B == 2.(k - b) * (G/2) + B == C(k - b + (2^L - 1)/2, G/2) + B
    

    where -b + (2^L - 1)/2 can be precomputed.
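
    The derivation above can be sanity-checked with plain modular arithmetic over a toy prime order, treating the comb purely as the scalar map C(s) = 2.s - (2^L - 1) and P = G/2 as an extra factor of 1/2 mod N (a self-contained sketch with made-up small parameters, not library code):

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    static int64_t mod(int64_t x, int64_t n) {
        int64_t r = x % n;
        return r < 0 ? r + n : r;
    }

    int main(void) {
        const int64_t N = 199;                 /* toy prime group order */
        const int64_t L = 8;                   /* comb bit width, 2^8 = 256 > N */
        const int64_t two_L_m1 = (1 << L) - 1; /* 2^L - 1 */
        const int64_t inv2 = (N + 1) / 2;      /* 1/2 mod N */
        /* Comb offset (2^L - 1)/2 mod N, computable at precomputation time. */
        const int64_t comb_off = mod(mod(two_L_m1, N) * inv2, N);
        int64_t k, b;
        for (k = 0; k < N; k++) {
            for (b = 0; b < N; b++) {
                /* Scalar fed to the comb: u = k - b + (2^L - 1)/2 mod N. */
                int64_t u = mod(k - b + comb_off, N);
                /* C(u, G/2) in scalar terms: (2.u - (2^L - 1)) * inv2; then add blind b. */
                int64_t got = mod(mod(2 * u - two_L_m1, N) * inv2 + b, N);
                assert(got == k);              /* k*G == C(u, G/2) + b*G */
            }
        }
        printf("ok\n");
        return 0;
    }
    ```

    This exercises every (k, b) pair, including the ones where adding the comb offset wraps around N.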

  23. sipa commented at 12:13 pm on January 1, 2022: contributor
    Nice, that’s much better explained than my current comments. I’ll try to include it.
  24. peterdettman commented at 12:16 pm on January 1, 2022: contributor
    Although the above satisfies me mathematically, I would like to see an explicit test case where the comb_offset (2^L - 1)/2 causes a modular reduction (relative to k-b). e.g. arrange for k-b == N + 1 - comb_offset. I hope that’s not too painful, but otherwise random testing seems unlikely to hit such a case.
  25. sipa commented at 4:13 pm on January 1, 2022: contributor

    @peterdettman Making this change causes instant failure during the tests, at least:

     --- a/src/ecmult_gen_impl.h
     +++ b/src/ecmult_gen_impl.h
     @@ -78,7 +78,7 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
           */
      
          /* Compute the scalar (gn + ctx->scalar_offset). */
     -    secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn);
     +    CHECK(!secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn));
          /* Convert to recoded array. */
          for (i = 0; i < 8; ++i) {
              recoded[i] = secp256k1_scalar_get_bits(&tmp, 32 * i, 32);
    

    Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

  26. peterdettman commented at 4:28 pm on January 1, 2022: contributor

    scalar_offset includes the randomly-distributed blind, so there will be modular reductions. This isn’t concerning at all because we can reason about the blind independently of the comb (just input offset + output offset).

    However the comb offset is small: 2^(COMB_BITS-256) * (2^256 - N). So I would like a test that involves the comb_offset itself causing a modular reduction. The math checks out of course, but an explicit test can’t hurt.

  27. peterdettman commented at 4:42 pm on January 1, 2022: contributor
    Actually, I think (2^L - 1)/2 mod N is only small (128 bits) if L == 256.
  28. sipa commented at 4:44 pm on January 1, 2022: contributor
    For the 43/6/1 configuration it is 0x80000000000000000000000000000001e7f9b4a5f9130fa66044722cc7ae9e1e. For the 11/6/4 configuration it is 0x8000000000000000000000000000000987e0873ddd5f4e3fe1563adfe6691698.
  29. peterdettman commented at 4:59 pm on January 1, 2022: contributor

    Including a test case that results in tmp=0 or tmp=1 or so may be useful though.

    So, these and maybe -1 would be enough. I will check a 256-bit config manually if they are added.

  30. sipa force-pushed on Jan 1, 2022
  31. sipa commented at 6:36 pm on January 1, 2022: contributor
    @peterdettman I’ve incorporated your derivation in the comments in ecmult_gen_impl.h, and added a test case for recoded={-1,0,1}.
  32. in configure.ac:183 in a98735b09a outdated
    -[A larger table size usually results in possible faster signing.]
    -["auto" is a reasonable setting for desktop machines (currently 4). [default=auto]]
    +AC_ARG_WITH([ecmult-gen-kb], [AS_HELP_STRING([--with-ecmult-gen-kb=2|22|86|auto],
    +[The size of the precomputed table for signing in multiples of 1024 bytes (on typical platforms).]
    +[Larger values result in possibly better signing/keygeneration performance at the cost of a larger table.]
    +["auto" is a reasonable setting for desktop machines (currently 86). [default=auto]]
    


    peterdettman commented at 5:22 am on January 2, 2022:
    86 -> 22?

    sipa commented at 6:36 pm on January 2, 2022:
    Fixed.
  33. in src/ecmult_gen.h:92 in a98735b09a outdated
         int built;
     
    -    /* Blinding values used when computing (n-b)G + bG. */
    -    secp256k1_scalar blind; /* -b */
    -    secp256k1_gej initial;  /* bG */
    +    /* Values chosen such that n*G == comb(n + scalar_offset, G) + final_point_add.
    


    peterdettman commented at 5:24 am on January 2, 2022:
    I guess it should be G/2 in the comb now?

    sipa commented at 6:35 pm on January 2, 2022:
    Fixed. I thought of “comb()” here as a more abstract “the comb multiplication algorithm”, but I’ve now just aligned the ecmult_gen.h and ecmult_gen_impl.h descriptions (and also switched from c() to comb()).
  34. sipa force-pushed on Jan 2, 2022
  35. sipa commented at 4:41 pm on January 4, 2022: contributor

    I’ve worked on an additional change that introduces a COMB_RANGE which is normally 256, but in exhaustive test mode corresponds to the number of bits in EXHAUSTIVE_TEST_ORDER. Then COMB_BITS only has to cover COMB_RANGE etc, instead of always being at least 256.

    And I’m seeing a suspicious failure in certain configurations with that (even making sure that the precomputed table only contains non-infinity points).

    I’ll retry this approach to make sure I haven’t missed something.

  36. sipa force-pushed on Jan 4, 2022
  37. sipa commented at 11:58 pm on January 4, 2022: contributor

    I added a commit that permits COMB_BITS < 256 in exhaustive test mode. However, it doesn’t work in a lot of configurations, and I don’t understand what’s causing it.

    Here is a list of (blocks teeth spacing) tuples and whether they work (for both order 13 and 199): groups.txt

  38. sipa commented at 3:34 pm on January 5, 2022: contributor
    Update: it appears that EXHAUSTIVE_TEST_ORDER < 2**(BLOCKS * TEETH * (SPACING - 1)) perfectly predicts which configurations work.
  39. sipa force-pushed on Jan 5, 2022
  40. sipa commented at 7:52 pm on January 5, 2022: contributor

    Final update: I was being very dumb; precompute_ecmult_gen just had spacing = (COMB_RANGE + blocks * teeth) / (blocks * teeth) hardcoded, leading to an inconsistency between the table and the actual multiplication code.

    False alarm.

  41. sipa force-pushed on Nov 23, 2022
  42. sipa commented at 4:29 am on November 23, 2022: contributor
    Rebased.
  43. sipa commented at 1:26 pm on November 23, 2022: contributor
    I believe this PR is ready for review (the discussion above was just observing some configurations not working due to a fairly stupid bug which was fixed).
  44. sipa force-pushed on Dec 29, 2022
  45. sipa commented at 4:21 am on December 29, 2022: contributor
    Rebased after merge of #1178. Added changelog entry.
  46. sipa force-pushed on Dec 29, 2022
  47. sipa force-pushed on Dec 29, 2022
  48. sipa force-pushed on Jan 11, 2023
  49. sipa commented at 9:50 pm on January 11, 2023: contributor
    Rebased after #1187 merge.
  50. sipa force-pushed on Jan 19, 2023
  51. sipa commented at 8:30 pm on January 19, 2023: contributor
    Rebased after merge of #1170 and #1190.
  52. sipa force-pushed on Apr 10, 2023
  53. sipa force-pushed on May 23, 2023
  54. sipa force-pushed on May 24, 2023
  55. sipa commented at 1:00 am on May 24, 2023: contributor
    Rebased, and added to cmake.
  56. real-or-random referenced this in commit 40f50d0fbd on Nov 7, 2023
  57. sipa force-pushed on Nov 8, 2023
  58. sipa commented at 2:11 pm on November 8, 2023: contributor
    Rebased.
  59. sipa force-pushed on Nov 8, 2023
  60. sipa force-pushed on Nov 8, 2023
  61. sipa force-pushed on Nov 11, 2023
  62. real-or-random added the label performance on Nov 13, 2023
  63. in src/tests.c:5592 in 764efa4eb4 outdated
    -    CHECK(secp256k1_scalar_eq(&b, &CTX->ecmult_gen_ctx.blind));
    -    CHECK(gej_xyz_equals_gej(&initial, &CTX->ecmult_gen_ctx.initial));
    +    CHECK(secp256k1_scalar_eq(&b, &CTX->ecmult_gen_ctx.scalar_offset));
    +    final2 = CTX->ecmult_gen_ctx.final_point_add;
    +    secp256k1_fe_normalize_var(&final2.y);
    +    ge_equals_ge(&final1, &final2);
    


    stratospher commented at 5:26 am on December 1, 2023:

    764efa4: the new function introduced - ge_cmp_ge(a, b) is ge_equals_ge(a, b) with more safety checks to make sure that the first ge argument (a) passed has both a->x and a->y normalised right? (for secp256k1_fe_equal verify checks)

    1. can’t we use ge_cmp_ge here?
    2. if we’re using ge_equals_ge, shouldn’t final1.x and final1.y be normalised instead of final1.y and final2.y?

    sipa commented at 4:45 pm on December 1, 2023:
    Ok, this comment made me take the time to just get rid of these strange test-only functions. See #1450.
  64. in src/scalar_low_impl.h:187 in 0cfa60b503 outdated
    +    }
         /* If this VERIFY_CHECK triggers we were given a noninvertible scalar (and thus
          * have a composite group order; fix it in exhaustive_tests.c). */
    -    VERIFY_CHECK(*r != 0);
    +    VERIFY_CHECK(res != 0);
    +    *r = res;
    


    stratospher commented at 5:56 am on December 1, 2023:
    0cfa60b: we could keep secp256k1_scalar_verify(r); at the end.

    sipa commented at 2:57 pm on December 2, 2023:
    Done.
  65. sipa force-pushed on Dec 2, 2023
  66. sipa commented at 2:57 pm on December 2, 2023: contributor
    Rebased after #1450.
  67. in src/ecmult_gen_impl.h:89 in 9fccdf3529 outdated
    +     * Thus (gn-b)*G can be written as comb(s,G) if gn-b = 2*s - (2^COMB_BITS - 1), or
    +     * s = (gn - b + (2^COMB_BITS - 1))/2 mod order.
    +     *
    +     * We use an alternative that avoids the modular division by two: we write (gn-b)*G =
    +     * comb(d,G/2). For that to hold it must be the case that
    +     * (gn-b)*G = (2*d + 2^COMB_BITS - 1) * (G/2), or
    


    stratospher commented at 6:37 am on December 4, 2023:

    9fccdf3:

         * (gn-b)*G = (2*d - 2^COMB_BITS + 1) * (G/2), or
    

    sipa commented at 7:47 pm on March 17, 2024:
    Done, as (gn-b)*G = (2*d - (2^COMB_BITS - 1)) * (G / 2), or
  68. sipa force-pushed on Dec 6, 2023
  69. in src/ecmult_gen_compute_table_impl.h:59 in da73fd9d51 outdated
    -            if (j == n - 2) {
    -                /* In the last iteration, numsbase is (1 - 2^j) * nums instead. */
    -                secp256k1_gej_neg(&numsbase, &numsbase);
    -                secp256k1_gej_add_var(&numsbase, &numsbase, &nums_gej, NULL);
    +        }
    +        /* Now u = 2^(block*(teeth + 1)*spacing) * gen/2. */
    


    stratospher commented at 7:32 am on January 10, 2024:

    da73fd9: block+1 would become next iteration’s block

            /* Now u = 2^((block + 1)*teeth*spacing) * gen/2. */
    

    sipa commented at 7:47 pm on March 17, 2024:
    Indeed! Done.
  70. in src/ecmult_gen_impl.h:106 in da73fd9d51 outdated
    +
    +    /* In secp256k1_ecmult_gen_prec_table we have precomputed sums of the
    +     * (2*d_i-1) * 2^(i-1) * G points, for various combinations of i positions.
    +     * We rewrite our equation in terms of these table entries.
    +     *
    +     * Let mask(b) = sum(2^(b*COMB_TEETH + t)*COMB_SPACING for t=0..COMB_TEETH-1),
    


    stratospher commented at 2:10 am on January 11, 2024:

    da73fd9:

         * Let mask(b) = sum(2^((b*COMB_TEETH + t)*COMB_SPACING) for t=0..COMB_TEETH-1),
    

    sipa commented at 7:46 pm on March 17, 2024:
    Done.
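
    The corrected mask(b) expression above implies that, sweeping each block's mask over its COMB_SPACING shifts, every one of the COMB_BITS bit positions is selected exactly once. A standalone sketch of that tiling property for the three curated configurations (my own check, not library code):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Bit (b*TEETH + t)*SPACING + s, over all blocks b, teeth t and shifts s,
     * must hit each of the BLOCKS*TEETH*SPACING positions exactly once. */
    static void check_tiling(int blocks, int teeth, int spacing) {
        int count[512] = {0};
        int bits = blocks * teeth * spacing;
        int b, t, s, i;
        assert(bits <= 512);
        for (b = 0; b < blocks; b++) {
            for (t = 0; t < teeth; t++) {
                for (s = 0; s < spacing; s++) {
                    count[(b * teeth + t) * spacing + s]++;
                }
            }
        }
        for (i = 0; i < bits; i++) assert(count[i] == 1);
    }

    int main(void) {
        check_tiling(2, 5, 26);  /* kb=2,  COMB_BITS = 260 */
        check_tiling(11, 6, 4);  /* kb=22, COMB_BITS = 264 */
        check_tiling(43, 6, 1);  /* kb=86, COMB_BITS = 258 */
        printf("ok\n");
        return 0;
    }
    ```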
  71. in src/ecmult_gen_compute_table_impl.h:54 in da73fd9d51 outdated
    +            secp256k1_gej_add_var(&sum, &sum, &u, NULL);
    +            /* Make u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
    +            secp256k1_gej_double_var(&u, &u, NULL);
    +            /* Make ds[tooth] = u = 2^((block*teeth + tooth)*spacing + 1) * gen/2. */
    +            ds[tooth] = u;
    +            /* Make u = 2^((block*teeth + tooth + 1)*spacing). */
    


    stratospher commented at 8:15 am on March 9, 2024:

    da73fd9: micro-nit:

                /* Make u = 2^((block*teeth + tooth + 1)*spacing) * gen/2. */
    

    sipa commented at 7:46 pm on March 17, 2024:
    Done.
  72. in src/ecmult_gen_compute_table_impl.h:85 in da73fd9d51 outdated
    +         * table entry and adding ds[tooth]. */
    +        for (tooth = 0; tooth < teeth - 1; ++tooth) {
    +            size_t stride = ((size_t)1) << tooth;
    +            size_t index;
    +            for (index = 0; index < stride; ++index, ++vs_pos) {
    +                secp256k1_gej_add_var(&vs[vs_pos], &vs[vs_pos - stride], &ds[tooth], NULL);
    


    stratospher commented at 2:08 pm on March 10, 2024:

    da73fd9: so for vs_pos = [256, 263] - since those bits are always set to 0, their possible combinations are computed and kept but never end up being used? Also wondering if those big powers of 2 (maybe something like - 2**263 - 2**259 - 2**255 - 2**251 - 2**247 - 2**243) are safe to add in secp256k1_gej_add_var.

    (talking about default configuration with blocks=11, teeth=6)


    sipa commented at 7:39 pm on March 17, 2024:

    so for vs_pos = [256, 263] - since those bits are always set to 0. their possible combinations are computed and kept but never end up being used?

    I believe that’s correct. We could avoid computing them, but this is just table generation code. There is no particular reason for it to be efficient (I’ve suggested replacing it with a Python script in the past…).

    also wondering if those big powers of 2 (maybe something like - 2**263 - 2**259 - 2**255 - 2**251 - 2**247 - 2**243) are safe to add in secp256k1_gej_add_var.

    That function operates on points, and clearly cannot know what multiple of G is being added (doing so would require breaking DLP!). The group of points is cyclic with order equal to the curve order, so e.g. 2**264 * gen/2 is $2^{263} G = (2^{263} \mod n) G$ = 55349809480404436077109870898555922636672 G.


    stratospher commented at 4:38 pm on April 2, 2024:
    oh makes sense.
  73. in src/ecmult_gen_impl.h:88 in 65f580e931 outdated
    @@ -85,19 +85,17 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
          * (gn-b)*G = (2*d + 2^COMB_BITS - 1) * (G/2), or
          * d = (gn + (2^COMB_BITS - 1)/2 - b) mod order.
          *
    -     * -b is precomputed as ctx->scalar_offset, so our final equations become:
    +     * 2^COMB_BITS - 1)/2 - b is precomputed as ctx->scalar_offset, so our final equations become:
    


    stratospher commented at 2:54 pm on March 10, 2024:

    65f580e:

         * (2^COMB_BITS - 1)/2 - b is precomputed as ctx->scalar_offset, so our final equations become:
    

    sipa commented at 7:46 pm on March 17, 2024:
    Done.
  74. sipa force-pushed on Mar 17, 2024
  75. sipa commented at 7:47 pm on March 17, 2024: contributor
    Addressed comments and rebased.
  76. real-or-random commented at 3:22 pm on March 18, 2024: contributor
    CI reports segfaults in the sanitizer jobs, but I’m not sure if this is a problem with the compiler. See, e.g., https://github.com/bitcoin-core/secp256k1/actions/runs/8317860245/job/22759343772?pr=1058#step:9:148, where even a test binary built and run by ./configure segfaults. Note that this happens with the stable compilers, not with the nightly compilers… Our usage of sanitizers is quite heavy: we use multiple sanitizers at once, and we enable some special options for extra checks; maybe those don’t work any longer. (My feeling is that upstream maintenance of the sanitizers isn’t great currently; there are many open issues without responses from the maintainers.)
  77. sipa force-pushed on Mar 19, 2024
  78. sipa force-pushed on Mar 20, 2024
  79. sipa commented at 2:24 am on March 29, 2024: contributor
    @armfazh said he might be interested in looking at this. “Project Palm”
  80. in src/ecmult_gen_impl.h:64 in 1b218b3dca outdated
    +    secp256k1_scalar tmp;
    +    /* Array of uint32_t values large enough to store COMB_BITS bits. Only the bottom
    +     * 8 are ever nonzero, but having the zero padding at the end if COMB_BITS>256
    +     * avoids the need to deal with out-of-bounds reads from a scalar. */
    +    uint32_t recoded[(COMB_BITS + 31) >> 5] = {0};
    +    int first = 1, i;
    


    stratospher commented at 8:28 am on April 2, 2024:
    1b218b3: accidental but i’m surprised it compiles.

    sipa commented at 5:04 pm on April 3, 2024:
    This line defines two variables, first and i, and initializes first to 1. Is this surprising?

    stratospher commented at 2:36 am on April 4, 2024:
    oops never mind! confused it with something else.
  81. in src/ecmult_gen_impl.h:104 in 1b218b3dca outdated
    @@ -95,7 +99,12 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
          */
     
         /* Compute the scalar (gn + ctx->scalar_offset). */
    -    secp256k1_scalar_add(&recoded, &ctx->scalar_offset, gn);
    +    secp256k1_scalar_add(&tmp, &ctx->scalar_offset, gn);
    +    /* Convert to recoded array. */
    +    for (i = 0; i < 8; ++i) {
    


    stratospher commented at 8:55 am on April 2, 2024:
    1b218b3: don’t we still need to read the last 8 non-zero bits here? maybe just i < (COMB_BITS + 31) >> 5 instead of i < 8.

    sipa commented at 5:09 pm on April 3, 2024:

    Well, a scalar cannot contain more than 256 bits, and asking about higher bits than that is illegal, so i cannot exceed 7. The recoded array may be larger if the combs straddle the 256-bit boundary, but in that case the last recoded element will just always be 0, and thus the loop doesn’t need to touch it.

    In commit “Permit COMB_BITS < 256 for exhaustive tests” a condition is added to this loop that makes it i < 8 && i < ((COMB_BITS + 31) >> 5).

  82. in src/ecmult_gen.h:132 in b368086dae outdated
    @@ -109,6 +109,10 @@ typedef struct {
          * ecmult_gen_impl.h for more details. */
         secp256k1_scalar scalar_offset;
         secp256k1_ge final_point_add;
    +
    +    /* Factor used for projective blinding. This value is used to rescale the Z
    +     * coordinate of the first table lookup. */
    +    secp256k1_fe proj_blind;
    


    stratospher commented at 3:06 pm on April 2, 2024:
    b368086: we could clear this in secp256k1_ecmult_gen_context_clear

    sipa commented at 5:27 pm on April 3, 2024:
    Good point, done.
  83. in src/ecmult_gen_impl.h:319 in b368086dae outdated
    268@@ -265,7 +269,11 @@ static void secp256k1_ecmult_gen_blind(secp256k1_ecmult_gen_context *ctx, const
    269     secp256k1_rfc6979_hmac_sha256_initialize(&rng, keydata, 64);
    270     memset(keydata, 0, sizeof(keydata));
    271 
    272-    /* TODO: reintroduce projective blinding. */
    273+    /* Compute projective blinding factor (cannot be 0). */
    274+    secp256k1_rfc6979_hmac_sha256_generate(&rng, nonce32, 32);
    275+    secp256k1_fe_set_b32_mod(&f, nonce32);
    276+    secp256k1_fe_cmov(&f, &secp256k1_fe_one, secp256k1_fe_normalizes_to_zero(&f));
    277+    ctx->proj_blind = f;
    


    stratospher commented at 3:31 pm on April 2, 2024:
    b368086: maybe clear f too.

    sipa commented at 5:27 pm on April 3, 2024:
    Done.
  84. in src/ecmult_gen_compute_table_impl.h:30 in 0a63916f36 outdated
    58-        (void)r;
    59-        VERIFY_CHECK(r);
    60-        secp256k1_gej_set_ge(&nums_gej, &nums_ge);
    61-        /* Add G to make the bits in x uniformly distributed. */
    62-        secp256k1_gej_add_ge_var(&nums_gej, &nums_gej, gen, NULL);
    63+    /* u is the running power of two times gen we're working with, initially gen/2. */
    


    real-or-random commented at 8:55 am on April 3, 2024:

    https://github.com/bitcoin-core/secp256k1/commit/0a63916f36b92a0508aa7f1df7b8e659bc42bfa4 Signed-digit multi-comb ecmult_gen algorithm

    Want to assert that 2*u == gen?


    sipa commented at 5:30 pm on April 3, 2024:
    I’m not sure that we want assertions inside the table generation tool code. We don’t have VERIFY and non-VERIFY modes of those etc. Do you suggest just using normal C assert? I guess we could do that.

    real-or-random commented at 6:36 pm on April 3, 2024:

    Yeah, okay, it had simply occurred to me that this piece is easy to check without adding a dependency on ecmult… But checking that particular condition is not too interesting. So never mind, ignore my comment.

    But having said this, do you think it makes sense to add some code in tests.c that does some basic checks on the generated table values? The advantage is that ecmult/ecmult_const won’t add a dependency there.


    sipa commented at 1:41 pm on April 5, 2024:
    I think the run_ecmult_constants tests should generally test every table entry several times already, regardless of configuration. We could add an explicit test that basically recomputes the table using existing ecmult, but perhaps that’s overkill?

    sipa commented at 5:39 pm on April 5, 2024:
    Given that the precompute binary is built with -DVERIFY, I’ve added your suggestion here to VERIFY_CHECK that u*2 = gen.
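    The invariant being checked (u starts at gen/2, so u*2 == gen) rests on halving being exact modulo the odd group order; a quick Python sanity check of that scalar fact:

    ```python
    # Order of the secp256k1 group (odd), so 1/2 mod N exists and equals (N+1)/2.
    N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
    half = (N + 1) // 2  # the scalar 1/2 mod N
    assert N % 2 == 1
    assert (2 * half) % N == 1  # doubling u = (1/2)*gen yields gen again
    ```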
  85. real-or-random commented at 9:13 am on April 3, 2024: contributor
    More to follow, I’m in the middle of the main commit…
  86. sipa force-pushed on Apr 3, 2024
  87. in src/ecmult_gen_impl.h:69 in 62601920be outdated
    85+     * a value b determined by the context. b*G is precomputed as ctx->ge_offset, so we're
    86+     * left with computing R = (gn-b)*G + ctx->ge_offset.
    87+     *
    88+     * The multiplication (gn-b)*G will be performed using a signed-digit multi-comb (see Section
    89+     * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg
    90+     * (https://eprint.iacr.org/2012/309).
    


    real-or-random commented at 3:34 pm on April 4, 2024:
    nit: you want a , instead of a ( before see Section

    sipa commented at 3:35 am on April 5, 2024:
    Done.
  88. in src/ecmult_gen_impl.h:73 in 62601920be outdated
    89+     * 3.3 of "Fast and compact elliptic-curve cryptography" by Mike Hamburg
    90+     * (https://eprint.iacr.org/2012/309).
    91+     *
    92+     * Let comb(s,P) = sum((2*s_i-1)*2^i*P for i=0..COMB_BITS-1), where s_i is the i'th bit of the
    93+     * binary representation of scalar s. So the s_i values determine whether -2^i*P (s_i=0) or
    94+     * +2^i*P (s_1) are added together. By manipulating:
    


    real-or-random commented at 3:35 pm on April 4, 2024:
    s/s_1/s_0=1

    sipa commented at 3:36 am on April 5, 2024:
    Done. Also changed notation to match that used in #1184 (s[i] instead of s_i).
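    The signed-bit recoding referred to here satisfies a simple integer identity, which a few lines of Python can confirm (toy check, not library code):

    ```python
    # comb(s, P) adds (2*s_i - 1)*2^i*P for each bit s_i of s, i.e. +2^i*P when
    # s_i = 1 and -2^i*P when s_i = 0. Over the integers this collapses to
    # 2*s - (2^n - 1), which is why the algorithm only needs a fixed offset.
    def signed_digit_sum(s, n):
        return sum((2 * ((s >> i) & 1) - 1) << i for i in range(n))

    n = 16
    for s in (0, 1, 0x5a5a, (1 << n) - 1):
        assert signed_digit_sum(s, n) == 2 * s - ((1 << n) - 1)
    ```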
  89. in src/ecmult_gen_impl.h:36 in 62601920be outdated
    45-static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp256k1_gej *r, const secp256k1_scalar *gn) {
    46-    int bits = ECMULT_GEN_PREC_BITS;
    47-    int g = ECMULT_GEN_PREC_G(bits);
    48-    int n = ECMULT_GEN_PREC_N(bits);
    49+/* Compute the scalar (2^COMB_BITS - 1) / 2. */
    50+static void secp256k1_ecmult_gen_scalar_diff(secp256k1_scalar* diff) {
    


    real-or-random commented at 3:47 pm on April 4, 2024:
    I’m not saying that it matters, but why is this called diff?

    sipa commented at 5:34 pm on April 4, 2024:
    It’s the difference between the multiplicand scalar and the scalar from whose encoding the table lookup bits are drawn (before adding the blinding offset).

    real-or-random commented at 10:22 am on April 5, 2024:
    Makes sense, maybe add this explanation to the “docstring”.

    sipa commented at 1:52 pm on April 5, 2024:
    Done.
  90. in src/ecmult_gen_impl.h:118 in 62601920be outdated
    134+     *   mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260
    135+     *
    136+     * Imagine we have a table(b,m) function which can look up, given b and
    137+     * m=(recoded & mask(b)), the sum of (2*d_i-1)*2^(i-1)*G for all bit positions
    138+     * i set in mask(b). In our example, table(0, 1 + 2^8 + 2^20) would be equal to
    139+     * (2^-1 - 2^3 + 2^7 - 2^11 - 2^15 + 2^19)*G.
    


    real-or-random commented at 4:14 pm on April 4, 2024:
    Hm, this took me a while to follow. Let me explain in my own words. So you assume that the relevant bits of d are 1...0...0...1...0...1, or 1...0...1...0...0...1 in reverse, which is 1 + 2^8 + 2^20, which determines the positive summands in the G factor (decremented because the formula says so). Is this right? Want to elaborate in the comment?

    sipa commented at 3:36 am on April 5, 2024:
    I’ve made some significant changes to this whole explanation. Let me know if it’s any clearer now.

    real-or-random commented at 10:22 am on April 5, 2024:
    Yes!
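    The mask layout in the quoted comment is easy to reproduce; a toy Python sketch assuming the 22 kB configuration (COMB_BLOCKS=11, COMB_TEETH=6, hence COMB_SPACING=4 and COMB_BITS=264):

    ```python
    BLOCKS, TEETH, SPACING = 11, 6, 4  # assumed 22 kB config; COMB_BITS = 264

    def mask(b):
        # Block b selects bit positions (b*TEETH + t)*SPACING for t = 0..TEETH-1.
        return sum(1 << ((b * TEETH + t) * SPACING) for t in range(TEETH))

    # Matches the example: mask(10) = 2^240 + 2^244 + 2^248 + 2^252 + 2^256 + 2^260.
    assert mask(10) == sum(1 << e for e in (240, 244, 248, 252, 256, 260))

    # Shifting each mask by comb_off = 0..SPACING-1 covers every one of the
    # COMB_BITS positions exactly once, so the plain sum is the all-ones pattern.
    covered = sum(mask(b) << off for b in range(BLOCKS) for off in range(SPACING))
    assert covered == (1 << (BLOCKS * TEETH * SPACING)) - 1
    ```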
  91. in src/ecmult_gen_impl.h:126 in 62601920be outdated
    142+     *   1*(table(0, recoded & mask(0)) + table(1, recoded & mask(1)) + ...)
    143+     * + 2*(table(0, (recoded/2) & mask(0)) + table(1, (recoded/2) & mask(1)) + ...)
    144+     * + 4*(table(0, (recoded/4) & mask(0)) + table(1, (recoded/4) & mask(1)) + ...)
    145+     * + ...
    146+     * + 2^(COMB_SPACING-1)*(table(0, (recoded/2^(COMB_SPACING-1)) & mask(0)) + ...)
    147+     * + ctx->ge_offset.
    


    real-or-random commented at 4:15 pm on April 4, 2024:
    nit: indent the display formula

    sipa commented at 3:37 am on April 5, 2024:
    Rewrote this.
  92. in src/ecmult_gen_impl.h:139 in 62601920be outdated
    155+     *     for block in range(COMB_BLOCKS):
    156+     *       R += table(block, (recoded >> comb_off) & mask(block))
    157+     *     if comb_off > 0:
    158+     *       R = 2*R
    159+     *   R += ge_offset
    160+     *   return R
    


    real-or-random commented at 4:21 pm on April 4, 2024:
    It may be a good idea to explain here why ge_offset is added at the end, instead of being used as the initial value for the variable R.

    sipa commented at 3:37 am on April 5, 2024:
    Done.
  93. in src/ecmult_gen_impl.h:144 in 62601920be outdated
    160+     *   return R
    161+     *
    162+     * The last question is how to implement the table(b,m) function. For any value of
    163+     * b, m=(recoded & mask(b)) can only take on at most 2^COMB_TEETH possible values
    164+     * (the last one may have fewer as there mask(b) may the curve order). So we could
    165+     * create COMB_BLOCK tables which contain a value for each such m value.
    


    real-or-random commented at 4:22 pm on April 4, 2024:
    nit: grammar in parentheses

    sipa commented at 1:51 pm on April 5, 2024:
    Fixed.
  94. in src/ecmult_gen_impl.h:150 in 62601920be outdated
    166+     *
    167+     * Due to the fact that every table entry is a sum of positive and negative powers
    168+     * of two multiplied by G, every table will contains pairs of negated points:
    169+     * if all the masked bits in m flip, the table value is negated. We can exploit this
    170+     * to only store the first half of every table. If an entry from the second half is
    171+     * needed, we look up its bit-flipped version instead, and conditionally negate it.
    


    real-or-random commented at 4:27 pm on April 4, 2024:
    nit: remove “conditionally”? (the entire clause is already conditional)

    sipa commented at 3:37 am on April 5, 2024:
    Indeed, fixed.
  95. in src/ecmult_gen_impl.h:172 in 62601920be outdated
    188+                uint32_t bit = secp256k1_scalar_get_bits(&recoded, bit_pos, 1);
    189+                bits |= bit << tooth;
    190+                bit_pos += COMB_SPACING;
    191+            }
    192+
    193+            /* If the top bit of bits is 1, conditionally flip them all (corresponding
    


    real-or-random commented at 4:31 pm on April 4, 2024:
    nit: same here, remove “conditionally” (or do I have this wrong?)

    sipa commented at 3:37 am on April 5, 2024:
    Fixed.
  96. in src/ecmult_gen_impl.h:228 in 62601920be outdated
    186+            uint32_t bits = 0, sign, abs, index, tooth;
    187+            for (tooth = 0; tooth < COMB_TEETH && bit_pos < 256; ++tooth) {
    188+                uint32_t bit = secp256k1_scalar_get_bits(&recoded, bit_pos, 1);
    189+                bits |= bit << tooth;
    190+                bit_pos += COMB_SPACING;
    191+            }
    


    real-or-random commented at 4:37 pm on April 4, 2024:

    IIUC this relies on COMB_TEETH <= 32. Want to VERIFY_CHECK this?

    I guess it will be more natural to declare tooth as int, it’s not involved in all the bit magic/masking. I thought the same is true for index, but it’s compared against abs, so it’s probably better to keep it.

    On a related note, it may be a good idea to change the return value of secp256k1_scalar_get_bits to uint32_t ? It’s currently unsigned int. We kind of assert in assumptions.h that this type is at least 32 bits, but I think the implicit conversion to uint32_t is unexpected for readers. Plus, only the _t types are guaranteed to be “sane” (2’s complement, no padding bits, …). (Sorry, I haven’t checked the other callers…)


    sipa commented at 1:44 pm on April 5, 2024:

    src/ecmult_gen.h has an #error for COMB_TEETH > 8. Should I repeat it?

    I’ve changed secp256k1_scalar_get_bits and friends to return uint32_t.


    real-or-random commented at 2:04 pm on April 5, 2024:

    src/ecmult_gen.h has an #error for COMB_TEETH > 8. Should I repeat it?

    Never mind, the code is anyway replaced in later commits, so no need to bother.

  97. real-or-random commented at 4:52 pm on April 4, 2024: contributor
    I forgot to say that all the comments I’ve just posted are in the main commit https://github.com/bitcoin-core/secp256k1/pull/1058/commits/62601920bee3b57456a130d1c10a4e7fc08f3866.
  98. sipa force-pushed on Apr 5, 2024
  99. in src/ecmult_gen_impl.h:178 in c30860aa26 outdated
    192+     *   = (mask(b) - m - mask(b)/2)*G
    193+     *   = (-m - mask(b)/2)*G
    194+     *   = -(m - mask(b)/2)*G
    195+     *   = -table(b,m)
    196+     *
    197+     * Thus flipping all the (relevant) bits in m means negating the table result. Because of this
    


    real-or-random commented at 9:44 am on April 5, 2024:

    https://github.com/bitcoin-core/secp256k1/commit/62601920bee3b57456a130d1c10a4e7fc08f3866

    It makes sense to me that flipping the bits means flipping the signs of the summands, and thus flipping the result; I think this argument alone would convince me.

    • I can follow the 1st equal sign assuming m=(d & mask(b)). Maybe say that the 1st equality needs this, or just substitute (d & mask(b)) for m.
    • The 4th line should be = (-m + mask(b)/2)*G.

    sipa commented at 1:47 pm on April 5, 2024:
    Fixed, and made some more editorial changes.
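    The negation identity being discussed can be checked numerically. A toy Python sketch (scalars are doubled so everything stays in plain integers; TEETH/SPACING are taken as 6/4 to match the examples earlier in the thread):

    ```python
    TEETH, SPACING = 6, 4  # assumed values matching the examples above

    def mask(b):
        return sum(1 << ((b * TEETH + t) * SPACING) for t in range(TEETH))

    def table_scalar_x2(b, m):
        # Twice the scalar multiple of G that table(b, m) represents:
        # 2 * sum((2*d_i - 1)*2^(i-1), over bit positions i in mask(b))
        #   = 2*m - mask(b).
        assert m & mask(b) == m
        return 2 * m - mask(b)

    # The worked example: table(0, 1 + 2^8 + 2^20)
    #   = (2^-1 - 2^3 + 2^7 - 2^11 - 2^15 + 2^19)*G.
    m = 1 + 2**8 + 2**20
    assert table_scalar_x2(0, m) == 2**0 - 2**4 + 2**8 - 2**12 - 2**16 + 2**20

    # Flipping all masked bits negates the entry: table(b, m ^ mask(b)) = -table(b, m).
    assert table_scalar_x2(0, m ^ mask(0)) == -table_scalar_x2(0, m)
    ```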
  100. in src/precompute_ecmult_gen.c:64 in 84f5ce761c outdated
    65+            blocks = CONFIGS[config][0];
    66+            teeth = CONFIGS[config][1];
    67+            if (blocks == COMB_BLOCKS && teeth == COMB_TEETH) did_current_config = 1;
    68+        } else {
    69+            /* In the last iteration, output table for (COMB_BLOCKS, COMB_TEETH) if not
    70+             * already done. */
    


    real-or-random commented at 10:05 am on April 5, 2024:

    84f5ce761ced5da6c9332279ff491b139e1510c7 Always generate tables for current (blocks,teeth) config

    I think this is fine for the table generation code in the end, but I smell a refactor here. This code is kind of abusing the loop. I think what you want is to extract the main part into a function and have something like this in main:

    0for (config = 0; config < sizeof(CONFIGS) / sizeof(*CONFIGS); ++config) {
    1    blocks = CONFIGS[config][0];
    2    teeth = CONFIGS[config][1];
    3    if (blocks == COMB_BLOCKS && teeth == COMB_TEETH) did_current_config = 1;      
    4    print_table(fp, blocks, teeth); 
    5}
    6if (!did_current_config) {
    7    print_table(fp, COMB_BLOCKS, COMB_TEETH);
    8}
    

    The name print_table is consistent with src/precompute_ecmult.c.


    sipa commented at 1:47 pm on April 5, 2024:
    Done, much cleaner.
  101. in configure.ac:375 in 556938cd27 outdated
    377+  ;;
    378+22)
    379+  SECP_CONFIG_DEFINES="$SECP_CONFIG_DEFINES -DCOMB_BLOCKS=11 -DCOMB_TEETH=6"
    380+  ;;
    381+86)
    382+  SECP_CONFIG_DEFINES="$SECP_CONFIG_DEFINES -DCOMB_BLOCKS=43 -DCOMB_TEETH=6"
    


    real-or-random commented at 10:19 am on April 5, 2024:

    I think in the spirit of trying to make manual builds, i.e., without build system, easy (https://github.com/bitcoin-core/secp256k1/issues/929), it would be better to move the selection logic here to the C code, and have ECMULT_GEN_KB as the “exposed” configuration macro. Then configure, cmake, and manual builds will have the same interface. (We could still allow passing COMB_BLOCKS and COMB_TEETH directly as long as ECMULT_GEN_KB is not defined, to retain the possibility to select a non-standard config.)

    This shouldn’t hold up this PR. It could also be done in a follow-up PR, if you feel that’s easier.


    sipa commented at 1:46 pm on April 5, 2024:
    Let’s leave this for a follow-up. I feel there is some discussion to be had about the interaction with precomputation etc.

    real-or-random commented at 2:05 pm on April 5, 2024:

    Makes sense, let me leave a link to #189 here, just as a reminder.

    edit: I took the freedom to unresolve this conversation, as a reminder for follow-ups.

  102. in src/ecmult_gen_impl.h:258 in 9d917a1499 outdated
    225@@ -225,7 +226,13 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
    226             secp256k1_fe_cmov(&add.y, &neg, sign);
    227 
    228             /* Add the looked up and conditionally negated value to r. */
    229-            secp256k1_gej_add_ge(r, r, &add);
    230+            if (EXPECT(first, 0)) {
    231+                /* If this is the first table lookup, we can skip addition. */
    


    real-or-random commented at 11:28 am on April 5, 2024:
    I checked: EXPECT actually yields a jne here instead of a je on both gcc and clang.
  103. in src/scalar_8x32_impl.h:67 in 1921fd8bca outdated
    61@@ -62,23 +62,24 @@ SECP256K1_INLINE static void secp256k1_scalar_set_int(secp256k1_scalar *r, unsig
    62     SECP256K1_SCALAR_VERIFY(r);
    63 }
    64 
    65-SECP256K1_INLINE static unsigned int secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count) {
    66+SECP256K1_INLINE static uint32_t secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count) {
    67     SECP256K1_SCALAR_VERIFY(a);
    68+    VERIFY_CHECK(count > 0 && count <= 32);
    


    real-or-random commented at 11:34 am on April 5, 2024:

    https://github.com/bitcoin-core/secp256k1/commit/1921fd8bcaf29f23303411a64988bd178b8e7c5b Make secp256k1_scalar_get_bits support 32-bit reads

    nit: Add empty line after SECP256K1_SCALAR_VERIFY(a); Also 1x further below, and the same in the 4x64 file


    sipa commented at 5:11 pm on April 5, 2024:
    I think we usually have all the VERIFY checks together?

    real-or-random commented at 5:32 pm on April 5, 2024:
    you’re right, I misremembered the surrounding style, sorry
  104. in src/scalar.h:29 in 1921fd8bca outdated
    22@@ -23,10 +23,10 @@
    23 static void secp256k1_scalar_clear(secp256k1_scalar *r);
    24 
    25 /** Access bits from a scalar. All requested bits must belong to the same 32-bit limb. */
    26-static unsigned int secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
    27+static uint32_t secp256k1_scalar_get_bits(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
    28 
    29 /** Access bits from a scalar. Not constant time in offset and count. */
    30-static unsigned int secp256k1_scalar_get_bits_var(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
    31+static uint32_t secp256k1_scalar_get_bits_var(const secp256k1_scalar *a, unsigned int offset, unsigned int count);
    


    real-or-random commented at 11:53 am on April 5, 2024:

    https://github.com/bitcoin-core/secp256k1/commit/1921fd8bcaf29f23303411a64988bd178b8e7c5b Make secp256k1_scalar_get_bits support 32-bit reads

    Not this PR, but let me note that the names are a bit confusing. Just from the names, one would expect that secp256k1_scalar_get_bits_var is faster… But yeah, that’s not the case. I was about to suggest that you use the _var for performance.

    We should probably rename the constant-time one to secp256k1_scalar_get_bits_limb32 or something like that. As you have this commit here anyway, you could add this, but feel free to ignore.


    sipa commented at 5:39 pm on April 5, 2024:
    I’ve done the renaming.
  105. in src/ecmult_gen_impl.h:114 in 1777e0a507 outdated
    103@@ -100,6 +104,11 @@ static void secp256k1_ecmult_gen(const secp256k1_ecmult_gen_context *ctx, secp25
    104 
    105     /* Compute the scalar d = (gn + ctx->scalar_offset). */
    106     secp256k1_scalar_add(&d, &ctx->scalar_offset, gn);
    107+    /* Convert to recoded array. */
    108+    for (i = 0; i < 8; ++i) {
    109+        recoded[i] = secp256k1_scalar_get_bits(&d, 32 * i, 32);
    110+    }
    


    real-or-random commented at 12:02 pm on April 5, 2024:
    Have you considered using scalar_get_b32, reading into a byte array instead?

    sipa commented at 5:19 pm on April 5, 2024:
    That would weaken the side-channel protection, and also incur a cost for byte-swapping.
  106. sipa force-pushed on Apr 5, 2024
  107. in src/ecmult_gen_impl.h:226 in b71105dd4f outdated
    208+                 * above that contain junk. This reduces leakage from single bits. See
    209+                 * https://www.usenix.org/system/files/conference/usenixsecurity18/sec18-alam.pdf
    210+                 */
    211+                uint32_t bitdata = recoded[bit_pos >> 5] >> (bit_pos & 0x1f);
    212+                bits &= ~(1 << tooth);
    213+                bits ^= bitdata << tooth;
    


    real-or-random commented at 2:01 pm on April 5, 2024:

    Hm, this is difficult to analyze and review.

    If (bit_pos & 0x1f) happens to be 31, we still extract a single bit. Could it help to ensure that there’s always some junk in the high bits of bitdata? (Not sure what’s the best way to achieve this.)

    – Also, compilers are not helpful here: gcc keeps the xor, but clang 17.0.6 manages to move the xor out of the loop and replace by a mov.

    To inspect this, try objdump-nice .libs/libsecp256k1.so | less and search for bitdata, where objdump-nice is a shell alias for objdump --disassemble --wide --debugging --line-numbers --source-comment --visualize-jumps=extended-color --disassembler-options=intel --no-show-raw-insn --no-addresses

    volatile helps, but I’m not convinced:

    • bits = *((volatile uint32_t*)&bits) ^ (bitdata << tooth); keeps the xor, but moves bits from a register to memory. Not sure if this is better in terms of leakage.
    • bits ^= bitdata << *((volatile uint32_t*)&tooth); also keeps the xor. This moves tooth to memory and prevents clang from unrolling the loop.

    By the way, the mitigation in OpenSSL is https://github.com/openssl/openssl/pull/6276, but looking at that commit doesn’t seem particularly helpful in our case.


    sipa commented at 2:07 pm on April 5, 2024:
    I admit that I never actually analyzed how much this mitigation does. I’m ok just dropping it.

    real-or-random commented at 5:30 pm on April 5, 2024:
    See https://github.com/real-or-random/secp256k1/commit/50ef2da3df16e7e610098bd50cacb60e5264cf7c for an improved version. I’ve manually verified that gcc, clang, msvc actually use rots and don’t optimize that away.

    sipa commented at 5:39 pm on April 5, 2024:
    Squashed it in (and added you as co-author of the commit).
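    The idea behind the rotation-based variant (surrounding the extracted bit with junk data so single-bit leakage is reduced) can be sketched in Python; this is illustrative only, not the code that was squashed in:

    ```python
    def rotr32(x, by):
        # 32-bit rotate right; masking the count avoids the shift-by-32 case
        # that would be UB in C.
        by &= 31
        return ((x >> by) | (x << ((32 - by) & 31))) & 0xffffffff

    # Rotating the limb so the target bit lands at position 0 still extracts
    # the right bit, but the other 31 positions now hold neighboring (junk)
    # bits instead of zeros, which was the point of the mitigation.
    limb = 0x5a5aa5a5
    for pos in range(32):
        assert rotr32(limb, pos) & 1 == (limb >> pos) & 1
    ```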
  108. in src/tests.c:5619 in 8ec3b9f0eb outdated
    5614+        /* Run test with gn = i - scalar_offset (so that the ecmult_gen recoded value represents i). */
    5615+        secp256k1_ecmult_gen(&CTX->ecmult_gen_ctx, &res1, &gn);
    5616+        secp256k1_ecmult(&res2, NULL, &secp256k1_scalar_zero, &gn);
    5617+        secp256k1_gej_neg(&res2, &res2);
    5618+        secp256k1_gej_add_var(&res1, &res1, &res2, NULL);
    5619+        CHECK(secp256k1_gej_is_infinity(&res1));
    


    real-or-random commented at 2:16 pm on April 5, 2024:
    nit: use secp256k1_gej_eq_var instead

    sipa commented at 5:40 pm on April 5, 2024:
    Done.
  109. in src/tests.c:5614 in 8ec3b9f0eb outdated
    5611+    secp256k1_scalar_negate(&gn, &gn); /* gn = -1 - scalar_offset */
    5612+
    5613+    for (i = -1; i < 2; ++i) {
    5614+        /* Run test with gn = i - scalar_offset (so that the ecmult_gen recoded value represents i). */
    5615+        secp256k1_ecmult_gen(&CTX->ecmult_gen_ctx, &res1, &gn);
    5616+        secp256k1_ecmult(&res2, NULL, &secp256k1_scalar_zero, &gn);
    


    real-or-random commented at 2:20 pm on April 5, 2024:
    nit: Add secp256k1_ecmult_const(&res3, &secp256k1_ge_const_g, &gn); ?

    sipa commented at 5:40 pm on April 5, 2024:
    Added.
  110. in src/ecmult_gen_compute_table_impl.h:87 in 0b8deaac02 outdated
    83@@ -85,6 +84,7 @@ static void secp256k1_ecmult_gen_compute_table(secp256k1_ge_storage* table, cons
    84     for (block = 0; block < blocks; ++block) {
    85         size_t index;
    86         for (index = 0; index < points; ++index) {
    87+            CHECK(!secp256k1_ge_is_infinity(&prec[block * points + index]));
    


    real-or-random commented at 4:56 pm on April 5, 2024:
    note: the precompute binaries are built with -DVERIFY (but it can’t be wrong to keep this here, I guess)

    sipa commented at 5:40 pm on April 5, 2024:
    Changed to VERIFY_CHECK for consistency.
  111. in src/ecmult_gen_impl.h:201 in 7dec5ab865 outdated
    225+        uint32_t bit_pos = comb_off;
    226+        /* Inner loop: for each block, add table entries to the result. */
    227+        for (block = 0; block < COMB_BLOCKS; ++block) {
    228+            /* Gather the mask(block)-selected bits of d into bits. They're packed
    229+             * together: bit (tooth) of bits = bit
    230+             * ((block*COMB_TEETH + tooth)*COMB_SPACING + comb_off) of d. */
    


    real-or-random commented at 5:03 pm on April 5, 2024:

    nit:

    0            /* Gather the mask(block)-selected bits of d into bits. They're packed:
    1             * bits[tooth] = d[(block*COMB_TEETH + tooth)*COMB_SPACING + comb_off]. */
    

    This took me a while to parse due to the linebreaks.


    sipa commented at 5:40 pm on April 5, 2024:
    Done.
  112. sipa force-pushed on Apr 5, 2024
  113. in src/util.h:401 in d4ec830386 outdated
    396+#if defined(_MSC_VER)
    397+    return _rotr(x, by);  /* needs <stdlib.h> */
    398+#else
    399+    /* Reduce rotation amount to avoid UB when shifting. */
    400+    const unsigned int mask = CHAR_BIT * sizeof(x) - 1;
    401+    /* Turned it into a rot instruction by GCC and clang. */
    


    real-or-random commented at 6:06 pm on April 5, 2024:
    -.- remove “it”, sorry.

    sipa commented at 6:10 pm on April 5, 2024:
    Done.
  114. sipa force-pushed on Apr 5, 2024
  115. in src/ecmult_gen_impl.h:42 in dafb392b40 outdated
    56+    int i;
    57+
    58+    /* Compute scalar -1/2. */
    59+    secp256k1_scalar neghalf = SECP256K1_SCALAR_CONST(0, 0, 0, 0, 0, 0, 0, 2);
    60+    secp256k1_scalar_inverse_var(&neghalf, &neghalf);
    61+    secp256k1_scalar_negate(&neghalf, &neghalf);
    


    real-or-random commented at 6:57 pm on April 5, 2024:

    Optimized this here, if you want to cherry-pick: https://github.com/real-or-random/secp256k1/commit/b71226b71e02ae7c0e67ff9a9546933399032da8

    It may look overkill, but I don’t think it is upon a closer look: This saves ~1.1 us on my machine (lazy benchmark with turboboost enabled), which is 40% in context_create with NULL seed (because that has an early return to skip the ecmult_gen). But also ecmult_gen is just ~10 us, so this will also save ~10% in context_randomize, which is not entirely unimportant.


    sipa commented at 7:01 pm on April 5, 2024:
    I predict this will fail in exhaustive test mode.

    real-or-random commented at 7:09 pm on April 5, 2024:
    Argh, ok, I see… I’ll try again later…


    sipa commented at 10:37 pm on April 5, 2024:
    Great! Included.

    sipa commented at 10:47 pm on April 5, 2024:
    Hmm, I realize we also have secp256k1_scalar_half now (added after this PR was opened), which would make most of the cost of the diff calculation go away too. WDYT?

    sipa commented at 11:05 am on April 6, 2024:

    I’ve reverted that commit, and instead switched to computing neghalf through secp256k1_scalar_half + secp256k1_scalar_negate.

     0diff --git a/src/ecmult_gen_compute_table_impl.h b/src/ecmult_gen_compute_table_impl.h
     1index a861d95b..17a46771 100644
     2--- a/src/ecmult_gen_compute_table_impl.h
     3+++ b/src/ecmult_gen_compute_table_impl.h
     4@@ -22,11 +22,11 @@ static void secp256k1_ecmult_gen_compute_table(secp256k1_ge_storage* table, cons
     5     secp256k1_gej* vs = checked_malloc(&default_error_callback, points_total * sizeof(*vs));
     6     secp256k1_gej u;
     7     size_t vs_pos = 0;
     8-    secp256k1_scalar half = SECP256K1_SCALAR_CONST(0, 0, 0, 0, 0, 0, 0, 2);
     9+    secp256k1_scalar half;
    10     int block, i;
    11 
    12     /* u is the running power of two times gen we're working with, initially gen/2. */
    13-    secp256k1_scalar_inverse_var(&half, &half);
    14+    secp256k1_scalar_half(&half, &secp256k1_scalar_one);
    15     secp256k1_gej_set_infinity(&u);
    16     for (i = 255; i >= 0; --i) {
    17         /* Use a very simple multiplication ladder to avoid dependency on ecmult. */
    18diff --git a/src/ecmult_gen_impl.h b/src/ecmult_gen_impl.h
    19index 4bb70f52..517db105 100644
    20--- a/src/ecmult_gen_impl.h
    21+++ b/src/ecmult_gen_impl.h
    22@@ -37,12 +37,12 @@ static void secp256k1_ecmult_gen_scalar_diff(secp256k1_scalar* diff) {
    23     int i;
    24 
    25     /* Compute scalar -1/2. */
    26-    secp256k1_scalar neghalf = SECP256K1_SCALAR_CONST(0, 0, 0, 0, 0, 0, 0, 2);
    27-    secp256k1_scalar_inverse_var(&neghalf, &neghalf);
    28+    secp256k1_scalar neghalf;
    29+    secp256k1_scalar_half(&neghalf, &secp256k1_scalar_one);
    30     secp256k1_scalar_negate(&neghalf, &neghalf);
    31 
    32     /* Compute offset = 2^(COMB_BITS - 1). */
    33-    secp256k1_scalar_set_int(diff, 1);
    34+    *diff = secp256k1_scalar_one;
    35     for (i = 0; i < COMB_BITS - 1; ++i) {
    36         secp256k1_scalar_add(diff, diff, diff);
    37     }
    

    real-or-random commented at 10:01 am on April 7, 2024:
    yeah that’s much simpler, I didn’t remember scalar_half
  116. sipa force-pushed on Apr 5, 2024
  117. sipa force-pushed on Apr 6, 2024
  118. sipa force-pushed on Apr 6, 2024
  119. in src/ecmult_gen_compute_table_impl.h:30 in a61f0e5921 outdated
    38-
    39-    /* get the generator */
    40-    secp256k1_gej_set_ge(&gj, gen);
    41-
    42-    /* Construct a group element with no known corresponding scalar (nothing up my sleeve). */
    43+    /* u is the running power of two times gen we're working with, initially gen/2. */
    


    real-or-random commented at 9:26 am on April 7, 2024:

    Add VERIFY_CHECK(points_total > 0) here to tell GCC that the loop is at least executed once.

    See https://github.com/real-or-random/secp256k1/commit/854723f1e30d0e7e71967d0c225645275d169883#diff-2d5b5da541c539e4be4fc1506b554895bd68ef20bf8ae17879c8af6c948dbe54 for working CI runs on MacOS GCC.


    sipa commented at 10:09 am on April 7, 2024:

    Wow, thanks, I didn’t consider this possibility.

    So bizarre that GCC doesn’t detect this when invoking secp256k1_scalar_inverse


    real-or-random commented at 10:25 am on April 7, 2024:
    Yep, and I had to find this GCC bug to understand that GCC talks about the memory pointed to by ret (inside checked_malloc) being uninitialized, and not ret itself…
  120. sipa force-pushed on Apr 7, 2024
  121. sipa force-pushed on Apr 7, 2024
  122. in CHANGELOG.md:1 in dcc846f185


    real-or-random commented at 5:05 pm on April 12, 2024:

    changelog:

    • I think we should expand a bit on the difference to the old table sizes. (Not sure to which section that belongs then. Your current approach of using three sections is technically correct. But maybe just merge all of these items into one item under [Changed]. Then it’s less scattered
    • nit: Want to give also the CMake options (see older changelog items for examples)?
  123. real-or-random commented at 5:10 pm on April 12, 2024: contributor

    As part of testing this, I played around with the edge values for the config parameters. This led me to a series of minor nits/fixups in the code that handles the parameters, see https://github.com/real-or-random/secp256k1/tree/202112_sdmc, that you can pick. Most of these commits, if not all, should probably be squashed into your commits (and a regeneration of the precomputed files is necessary, which is why CI fails on my repo.)

    Apart from that ACK. I did all the review and testing that I planned to do.

  124. sipa force-pushed on Apr 15, 2024
  125. sipa commented at 3:33 pm on April 15, 2024: contributor

    @real-or-random I’ve adopted almost all your commits, except for:

    • I don’t see the need for “check for value of COMB_RANGE in precomputed file”, as COMB_RANGE doesn’t affect the table.
    • I have turned “Introduce CEIL_DIV macro and use it” into a separate, first, commit, which is used in a few more places in the codebase. This also necessitated switching to a different formula that avoids an overflow.
    • I did things a bit differently for “add todo for you”.
  126. in src/util.h:174 in 51196df976 outdated
    169@@ -170,7 +170,10 @@ static SECP256K1_INLINE void *checked_malloc(const secp256k1_callback* cb, size_
    170 #define ALIGNMENT 16
    171 #endif
    172 
    173-#define ROUND_TO_ALIGN(size) ((((size) + ALIGNMENT - 1) / ALIGNMENT) * ALIGNMENT)
    174+/* ceil(x/y) for integers x >= 0 and y > 0. Here, / denotes rational division. */
    175+#define CEIL_DIV(x, y) (1 + ((x) - 1) / (y))
    


    real-or-random commented at 4:45 pm on April 15, 2024:

    This is clearly better than my suggestion, but this can now underflow for unsigned x == 0, so the docs need to be changed:

    0/* ceil(x/y) for integers x > 0 and y > 0. Here, / denotes rational division. */
    1#define CEIL_DIV(x, y) (1 + ((x) - 1) / (y))
    

    It’s a bit sad that there’s no real way to add a VERIFY_CHECK in a macro (or at least not one which is compatible with the way we use this macro in WNAF_SIZE_BITS(bits, w) as a compile-time integer constant…). If you think that’s better, we could drop that use, and make it a real function size_t ceil_div(size_t, size_t) with a VERIFY_CHECK instead, but no strong opinions here.


    sipa commented at 5:21 pm on April 15, 2024:
    Nice catch. I’ve adopted your suggested change. All current invocation sites have clearly non-zero input, so I don’t think we need to complicate things needlessly.
  127. real-or-random commented at 4:46 pm on April 15, 2024: contributor
    ACK mod nit
  128. Introduce CEIL_DIV macro and use it aa00a6b892
  129. sipa force-pushed on Apr 15, 2024
  130. real-or-random approved
  131. real-or-random commented at 5:39 pm on April 15, 2024: contributor
    ACK 8ab2f5ee3344150e103050f578659592c2e5cc92
  132. in src/tests.c:5601 in 8ab2f5ee33 outdated
    5606     secp256k1_ecmult_gen_blind(&CTX->ecmult_gen_ctx, 0);
    5607-    CHECK(secp256k1_scalar_eq(&b, &CTX->ecmult_gen_ctx.blind));
    5608-    CHECK(gej_xyz_equals_gej(&initial, &CTX->ecmult_gen_ctx.initial));
    5609+    CHECK(secp256k1_scalar_eq(&b, &CTX->ecmult_gen_ctx.scalar_offset));
    5610+    p2 = CTX->ecmult_gen_ctx.ge_offset;
    5611+    secp256k1_fe_normalize_var(&p2.y);
    


    jonasnick commented at 7:08 am on April 16, 2024:
    both normalize calls are unnecessary

    sipa commented at 2:39 pm on April 19, 2024:
    Gone.
  133. in src/ecmult_gen_impl.h:50 in 8ab2f5ee33 outdated
    64+    *diff = secp256k1_scalar_one;
    65+    for (i = 0; i < COMB_BITS - 1; ++i) {
    66+        secp256k1_scalar_add(diff, diff, diff);
    67+    }
    68 
    69+    /* The result is the sum of 2^(COMB_BITS - 1) + (-1/2). */
    


    jonasnick commented at 7:09 am on April 16, 2024:
    micronit: remove “of”

    sipa commented at 2:39 pm on April 19, 2024:
    Gone.
  134. in src/ecmult_gen_compute_table_impl.h:73 in 8ab2f5ee33 outdated
    117-                secp256k1_gej_add_var(&numsbase, &numsbase, &nums_gej, NULL);
    118+        }
    119+        /* Now u = 2^((block*teeth + teeth)*spacing) * gen/2
    120+         *       = 2^((block+1)*teeth*spacing) * gen/2       */
    121+
    122+        /* Next, compute the table entries for block block in Jacobian coordinates.
    


    jonasnick commented at 7:10 am on April 16, 2024:
    “block block”?

    sipa commented at 2:29 pm on April 19, 2024:
    Yes, the block with index block.

    sipa commented at 2:39 pm on April 19, 2024:
    Reworded to “block number block”.
  135. in src/ecmult_gen_impl.h:321 in 8ab2f5ee33 outdated
    322-    secp256k1_fe_clear(&s);
    323+    secp256k1_fe_set_b32_mod(&f, nonce32);
    324+    secp256k1_fe_cmov(&f, &secp256k1_fe_one, secp256k1_fe_normalizes_to_zero(&f));
    325+    ctx->proj_blind = f;
    326+
    327+    /* For a random blinding value b, set scalar_offset=diff-n, ge_offset=bG */
    


    jonasnick commented at 7:10 am on April 16, 2024:
    s/diff-n/diff-b ?

    sipa commented at 2:39 pm on April 19, 2024:
    Fixed.
  136. in src/ecmult_gen_compute_table_impl.h:56 in 8ab2f5ee33 outdated
    90+        /* Here u = 2^(block*teeth*spacing) * gen/2. */
    91+        secp256k1_gej sum;
    92+        secp256k1_gej_set_infinity(&sum);
    93+        for (tooth = 0; tooth < teeth; ++tooth) {
    94+            /* Here u = 2^((block*teeth + tooth)*spacing) * gen/2. */
    95+            /* Make sum = sum(2^((block*teeth + t)*spacing), t=0..tooth). */
    


    jonasnick commented at 7:14 am on April 16, 2024:
    missing * gen/2

    sipa commented at 2:39 pm on April 19, 2024:
    Added.
  137. in src/ecmult_gen_impl.h:102 in 8ab2f5ee33 outdated
    126+     *
    127+     * Adding precomputation, our final equations become:
    128+     *
    129+     *     ctx->scalar_offset = (2^COMB_BITS - 1)/2 - b (mod order)
    130+     *     ctx->ge_offset = b*G
    131+     *     d = gn + ctx->scalar_offset
    


    jonasnick commented at 7:16 am on April 16, 2024:
    missing (mod order)?

    sipa commented at 2:39 pm on April 19, 2024:
    Added.
  138. in src/ecmult_gen.h:56 in 8ab2f5ee33 outdated
    49+#    define COMB_RANGE 8
    50+#    define COMB_BLOCKS 2
    51+#    define COMB_TEETH 3
    52+#  else
    53+#    error "Unknown exhaustive test order"
    54+#  endif
    


    jonasnick commented at 7:42 am on April 16, 2024:
    missing exhaustive test order 7?

    sipa commented at 2:40 pm on April 19, 2024:
    Added.
  139. in src/ecmult_gen.h:18 in 8ab2f5ee33 outdated
    15+
    16+/* Configuration parameters for the signed-digit multi-comb algorithm:
    17+ *
    18+ * - COMB_BLOCKS is the number of blocks the input is split into. Each
    19+ *   has a corresponding table.
    20+ * - COMB_TEETH is the number of bits simultaneously covered by one table.
    


    jonasnick commented at 6:26 pm on April 16, 2024:

    Not sure if it’s helpful to mention separate tables here. In precompute_ecmult_gen.c it’s all just in one big table:

    0secp256k1_ge_storage* table = checked_malloc(&default_error_callback, blocks * points * sizeof(secp256k1_ge_storage));
    

    sipa commented at 2:40 pm on April 19, 2024:
    Reworded to “table”.
  140. jonasnick commented at 6:42 pm on April 16, 2024: contributor

    Concept ACK

    Can confirm a 13% speedup with default parameters on my dev laptop.

  141. Initial gej blinding -> final ge blinding
    Instead of having the starting point of the ecmult_gen computation be
    offset, do it with the final point. This enables reasoning over the
    set of points reachable in intermediary computations, which can be
    leveraged by potential future optimization.
    
    Because the final point is in affine coordinates, its projective
    blinding is no longer possible. It will be reintroduced again in
    a different way, in a later commit.
    
    Also introduce some more comments and more descriptive names.
    ab45c3e089
  142. Make exhaustive tests' scalar_inverse(&x,&x) work
    The old code overwrote the input at the start of the function,
    making a call like secp256k1_scalar_inverse(&x,&x) always fail.
    486518b350
  143. sipa force-pushed on Apr 19, 2024
  144. jonasnick approved
  145. jonasnick commented at 3:18 pm on April 19, 2024: contributor
    ACK 6a3362a1dcb556a84a0876e393f4d2f3fba34a32
  146. Signed-digit multi-comb ecmult_gen algorithm
    This introduces the signed-digit multi-comb multiplication algorithm
    for constant-time G multiplications (ecmult_gen). It is based on
    section 3.3 of "Fast and compact elliptic-curve cryptography" by
    Mike Hamburg (see https://eprint.iacr.org/2012/309).
    
    Original implementation by Peter Dettman, with changes by Pieter Wuille
    to use scalars for recoding, and additional comments.
    fde1dfcd8d
  147. Always generate tables for current (blocks,teeth) config 5f7be9f6a5
  148. Provide 3 configurations accessible through ./configure ed2a056f3d
  149. Optimization: move (2^COMB_BITS-1)/2 term into ctx->scalar_offset
    It is unnecessary to recompute this term needed by the SDMC algorithm
    for every multiplication; move it into the context scalar_offset value
    instead.
    7a33db35cd
  150. Optimization: first table lookup needs no point addition 15d0cca2a6
  151. Optimization: avoid unnecessary doublings in precomputation 6247f485b6
  152. Rename scalar_get_bits -> scalar_get_bits_limb32; return uint32_t 5005abee60
  153. Make secp256k1_scalar_get_bits support 32-bit reads
    The old code would trigger UB when count=32.
    e03dcc44b5
  154. Optimization: use Nx32 representation for recoded bits
    The existing code needs to deal with the edge case that bit_pos >= 256,
    which would lead to an out-of-bounds read from secp256k1_scalar.
    
    Instead, recode the scalar into an array of uint32_t with enough zero
    padding at the end to alleviate the issue. This also simplifies the
    code, and is necessary for a security improvement in a follow-up
    commit.
    
    Original code by Peter Dettman, with modifications by Pieter Wuille.
    a0d32b597d
  155. Reduce side channels from single-bit reads
    Co-authored-by: Tim Ruffing <crypto@timruffing.de>
    07810d9abb
  156. Reintroduce projective blinding 644e86de9a
  157. Add test case for ecmult_gen recoded = {-1,0,1} 39b2f2a321
  158. Permit COMB_BITS < 256 for exhaustive tests a043940253
  159. Add changelog entry for SDMC 4c341f89ab
  160. sipa force-pushed on Apr 19, 2024
  161. real-or-random approved
  162. real-or-random commented at 4:15 pm on April 19, 2024: contributor
    reACK 4c341f89ab704205a89b5c5d404ed60f381f7c48
  163. jonasnick approved
  164. jonasnick commented at 5:49 pm on April 19, 2024: contributor
    ACK 4c341f89ab704205a89b5c5d404ed60f381f7c48
  165. stratospher commented at 4:25 pm on April 20, 2024: contributor
    ACK 4c341f8. Did these benchmarks and saw a 12.4% speedup on gcc 13.2.0 and 11.5% on clang 15.0.0. Also summarised how the precomputed table generation works here for future me :)
  166. jonasnick merged this on Apr 22, 2024
  167. jonasnick closed this on Apr 22, 2024


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-21 14:15 UTC
