Faster const-time normalization #1028

pull peterdettman wants to merge 1 commits into bitcoin-core:master from peterdettman:opt_normalize changing 2 files +68 −72
  1. peterdettman commented at 1:58 pm on December 3, 2021: contributor

    Use a “borrowing” trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn’t that important.

    Applying a similar idea to _fe_normalizes_to_zero also results in a performance boost, and in this case it appears to boost signing speed by 2% or more (also ECDH), due to three calls in _gej_add_ge.

  2. in src/field_5x52_impl.h:85 in 4bbb6612dc outdated
    61+     * x is incremented before the first pass and then decremented before the second pass
    62+     * to ensure that the result doesn't fall into the range [P, 2^256). */
    63+    uint64_t x = (t4 >> 48) + 1; t4 &= 0x0FFFFFFFFFFFFULL;
    64 
    65     /* The first pass ensures the magnitude is 1, ... */
    66     t0 += x * 0x1000003D1ULL;
    


    robot-dreams commented at 12:03 pm on December 5, 2021:
    Given that x is bigger than before, how we know that t0 += x * 0x1000003D1ULL won’t overflow the uint64_t?

    peterdettman commented at 6:42 pm on December 5, 2021:
    Most, if not all, callers use field elements with a maximum magnitude of 8, so there’s plenty of room. The practical limit is actually around 31 (because field_10x26 can only represent magnitude 32 and several methods assume there’s a free bit available for carries on entry). I think it would be a good idea to actually enforce some exact magnitude limit on input to field methods though (probably 16).
  3. robot-dreams commented at 4:43 pm on December 5, 2021: contributor

    This is a really nice idea! I just had one inline question.


    The updated versions of fe_normalize and fe_normalizes_to_zero indeed look faster, and a local benchmark run seems to agree:

    Benchmarks before

    0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
    1
    2field_normalize               ,     0.00968   ,     0.00973   ,     0.00983
    3group_add_affine              ,     0.189     ,     0.196     ,     0.205
    4ecdsa_sign                    ,    22.3       ,    22.7       ,    23.9
    

    Benchmarks after

    0Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)
    1
    2field_normalize               ,     0.00752   ,     0.00756   ,     0.00763
    3group_add_affine              ,     0.189     ,     0.189     ,     0.191
    4ecdsa_sign                    ,    22.3       ,    22.4       ,    22.5
    

    I also wrote down some notes as I was trying to understand the change, and I’ll post them here in case they’re useful for future reference.

    Why it’s not possible for the result to be in [P, 2^256)

    If there’s no final carry bit, then the second pass will subtract 2^256 - P from a value that’s at most 2^256 - 1, resulting in a value that’s at most P - 1.

    If there was a carry, the largest possible value (not including the carry bit) is 65536 * 0x1000003D1 - 1 for 5x52 and 1024 * 0x1000003D1 - 1 for 10x26, both of which are well below P.

    (If there was a carry bit, clearing it is equivalent (mod P) to decrementing by 2^256 - P, which cancels out the increment before the first pass.)

    Where the 63-bit shift (in the 5x52 version) came from

    The 64th bit is nonzero if and only if there was underflow after subtraction (2^256 - P is only a 33-bit value, so after subtracting it it’s not possible to underflow so much that the 64th bit is still cleared afterwards).

    Thus subtracting t » 63 is the same logic as “if subtraction on the previous limb caused an underflow then subtract 1, otherwise subtract 0”.

    If we store a 52-bit value in a 64-bit word, underflowing it and then keeping the bottom 52 bits is equivalent to underflowing the same value stored in a 52-bit word.

    (An analogous explanation works for where the 31-bit shift came from in the 10x26 case.)

  4. robot-dreams commented at 5:28 pm on December 6, 2021: contributor

    ACK 4bbb6612dc0581ded87d911039729c4ec1a143c8

    My only concern from before was magnitude / overflow. I agree it’d be a good idea to enforce some exact magnitude limit and/or more carefully audit maximum possible magnitudes. But I don’t think it should block this PR—especially since I also tried running tests with VERIFY_CHECK(r->magnitude <= 16) at the top of fe_normalize and it passed.

  5. peterdettman force-pushed on Dec 21, 2021
  6. peterdettman force-pushed on Dec 23, 2021
  7. sipa cross-referenced this on Dec 31, 2021 from issue Try a non-uniform group law (e.g., for ecmult_gen)? by real-or-random
  8. peterdettman force-pushed on Jan 1, 2022
  9. real-or-random cross-referenced this on Jan 4, 2022 from issue Separate magnitude/normalization/... checking/propagation from implementations by sipa
  10. Faster const-time normalization 8932956fe5
  11. peterdettman force-pushed on Apr 20, 2022
  12. real-or-random added the label performance on May 11, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-20 03:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me