Faster const-time normalization

peterdettman commented at 1:58 pm on December 3, 2021: contributor

Use a “borrowing” trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn’t that important.

Applying a similar idea to _fe_normalizes_to_zero also results in a performance boost, and in this case it appears to boost signing speed by 2% or more (also ECDH), due to three calls in _gej_add_ge.

in src/field_5x52_impl.h:85 in 4bbb6612dc outdated

61+     * x is incremented before the first pass and then decremented before the second pass
62+     * to ensure that the result doesn't fall into the range [P, 2^256). */
63+    uint64_t x = (t4 >> 48) + 1; t4 &= 0x0FFFFFFFFFFFFULL;
64 
65     /* The first pass ensures the magnitude is 1, ... */
66     t0 += x * 0x1000003D1ULL;

robot-dreams commented at 12:03 pm on December 5, 2021:

Given that x is bigger than before, how we know that t0 += x * 0x1000003D1ULL won’t overflow the uint64_t?

peterdettman commented at 6:42 pm on December 5, 2021:

Most, if not all, callers use field elements with a maximum magnitude of 8, so there’s plenty of room. The practical limit is actually around 31 (because field_10x26 can only represent magnitude 32 and several methods assume there’s a free bit available for carries on entry). I think it would be a good idea to actually enforce some exact magnitude limit on input to field methods though (probably 16).

robot-dreams commented at 4:43 pm on December 5, 2021: contributor

This is a really nice idea! I just had one inline question.

The updated versions of fe_normalize and fe_normalizes_to_zero indeed look faster, and a local benchmark run seems to agree:

Benchmarks before

0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
1
2field_normalize               ,     0.00968   ,     0.00973   ,     0.00983
3group_add_affine              ,     0.189     ,     0.196     ,     0.205
4ecdsa_sign                    ,    22.3       ,    22.7       ,    23.9

Benchmarks after

0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
1
2field_normalize               ,     0.00752   ,     0.00756   ,     0.00763
3group_add_affine              ,     0.189     ,     0.189     ,     0.191
4ecdsa_sign                    ,    22.3       ,    22.4       ,    22.5

I also wrote down some notes as I was trying to understand the change, and I’ll post them here in case they’re useful for future reference.

Why it’s not possible for the result to be in [P, 2^256)

If there’s no final carry bit, then the second pass will subtract 2^256 - P from a value that’s at most 2^256 - 1, resulting in a value that’s at most P - 1.

If there was a carry, the largest possible value (not including the carry bit) is 65536 * 0x1000003D1 - 1 for 5x52 and 1024 * 0x1000003D1 - 1 for 10x26, both of which are well below P.

(If there was a carry bit, clearing it is equivalent (mod P) to decrementing by 2^256 - P, which cancels out the increment before the first pass.)

Where the 63-bit shift (in the 5x52 version) came from

The 64th bit is nonzero if and only if there was underflow after subtraction (2^256 - P is only a 33-bit value, so after subtracting it it’s not possible to underflow so much that the 64th bit is still cleared afterwards).

Thus subtracting t » 63 is the same logic as “if subtraction on the previous limb caused an underflow then subtract 1, otherwise subtract 0”.

If we store a 52-bit value in a 64-bit word, underflowing it and then keeping the bottom 52 bits is equivalent to underflowing the same value stored in a 52-bit word.

(An analogous explanation works for where the 31-bit shift came from in the 10x26 case.)

robot-dreams commented at 5:28 pm on December 6, 2021: contributor

ACK 4bbb6612dc0581ded87d911039729c4ec1a143c8

My only concern from before was magnitude / overflow. I agree it’d be a good idea to enforce some exact magnitude limit and/or more carefully audit maximum possible magnitudes. But I don’t think it should block this PR—especially since I also tried running tests with VERIFY_CHECK(r->magnitude <= 16) at the top of fe_normalize and it passed.

peterdettman force-pushed on Dec 21, 2021

peterdettman force-pushed on Dec 23, 2021

sipa cross-referenced this on Dec 31, 2021 from issue Try a non-uniform group law (e.g., for ecmult_gen)? by real-or-random

peterdettman force-pushed on Jan 1, 2022

real-or-random cross-referenced this on Jan 4, 2022 from issue Separate magnitude/normalization/... checking/propagation from implementations by sipa

Faster const-time normalization 8932956fe5

peterdettman force-pushed on Apr 20, 2022

real-or-random added the label performance on May 11, 2023

Faster const-time normalization #1028

Benchmarks before

Benchmarks after

Why it’s not possible for the result to be in [P, 2^256)

Where the 63-bit shift (in the 5x52 version) came from