Faster const-time modinv divsteps #1031

pull peterdettman wants to merge 5 commits into bitcoin-core:master from peterdettman:opt_modinv changing 3 files +202 −165
  1. peterdettman commented at 6:25 pm on December 4, 2021: contributor

    Changes to _divsteps_59 (_30) that give maybe 4% speed improvement to const-time modinv on 64 bit. I see a larger gain on 32 bit but measured on 64 bit so might not be real.

    Start the result matrix scaled by 2^62 (resp. 2^30) and shift q, r down instead of u, v up at each step (should make life easier for vectorization). Since we’re always shifting away the LSB of g, q, r, we can avoid doing a full negation for x, y, z (after a few tweaks). Then it makes sense to switch zeta back to delta (I confined this change to the local method for the moment).

  2. sipa commented at 7:41 pm on December 8, 2021: contributor
    Wowwow, calm down!
  3. sipa commented at 11:33 pm on December 8, 2021: contributor

    See https://github.com/sipa/secp256k1/commits/pr1031 for two commits on top of this. One renames your delta to theta (as the variable operated on is really defined as delta-1/2 compared to delta as defined in the paper). The other updates the writeup to reflect the improvements here.

    Concept ACK in any case. I’ll look over the code in more detail, but at least this convinces me it’s correct.

  4. real-or-random commented at 10:07 am on December 9, 2021: contributor

    As I understand, this should always be faster and we don’t need lots of benchmarks to confirm.

    The other updates the writeup to reflect the improvements here.

    Nice, I wanted to ask for this.

  5. sipa commented at 6:09 pm on December 9, 2021: contributor

    Updated my branch with additional assertions and a few more improvements to the documentation.

    Code review ACK. I’m happy to do my changes on top in a separate PR, though it’s probably better to include them, as a few code comments are out of that as-is in this PR. Feel free to cherry-pick or otherwise include my suggested changes.

  6. peterdettman commented at 11:17 am on December 10, 2021: contributor

    Wowwow, calm down! @sipa Never!

    So, after I PR’ed this, I sat there looking at the code and thinking: you know, there’s a lot of low zeros in u/v/q/r… you could almost fit an f/g in there…

    (a few moments later)

    https://github.com/peterdettman/secp256k1/tree/vec_modinv

    I won’t spoiler it, but you might like to benchmark const-time inverses in that branch…

  7. sipa commented at 3:09 pm on December 10, 2021: contributor

    @peterdettman Nice!

    Benchmarks on Ryzen 5950x (default compilation options, GCC 11.2.0, SECP256K1_BENCH_ITERS=1000000 ./bench_internal inverse:

     0master:
     1field_inverse                 ,     1.27      ,     1.27      ,     1.28   
     2field_inverse_var             ,     0.867     ,     0.871     ,     0.874  
     3pr1031:
     4field_inverse                 ,     1.26      ,     1.26      ,     1.27   
     5field_inverse_var             ,     0.846     ,     0.860     ,     0.868  
     6vec_modinv:
     7field_inverse                 ,     0.986     ,     0.995     ,     1.01   
     8field_inverse_var             ,     0.868     ,     0.870     ,     0.873  
     9vec_modinv, using the const-time matrix in the vartime code:
    10field_inverse                 ,     1.00      ,     1.00      ,     1.01   
    11field_inverse_var             ,     0.866     ,     0.871     ,     0.876  
    

    This seems to at least suggest the option of throwing out the vartime matrix computation code, and using this new constant-time code for both?

    The last option (if someone else wants to benchmark) is simply this on top of https://github.com/peterdettman/secp256k1/tree/vec_modinv:

    0@@ -568,7 +568,7 @@ static void secp256k1_modinv64_var(secp256k1_modinv64_signed62 *x, const secp256
    1     while (1) {
    2         /* Compute transition matrix and new eta after 62 divsteps. */
    3         secp256k1_modinv64_trans2x2 t;
    4-        eta = secp256k1_modinv64_divsteps_62_var(eta, f.v[0], g.v[0], &t);
    5+        eta = secp256k1_modinv64_divsteps_59(eta, f.v[0], g.v[0], &t);
    6         /* Update d,e using that transition matrix. */
    
  8. peterdettman commented at 4:01 pm on December 10, 2021: contributor

    Wow, that’s even more improvement than on my machine. Around 4900 cycles, right?

    For the _var version, if copying the vec approach at minimum the vartime version would want to escape at the halfway mark if g1 is 0 (mod remaining steps) already, so I think there would still be a separate version. I haven’t really looked at it though.

  9. sipa commented at 4:02 pm on December 10, 2021: contributor
    @peterdettman A vartime version of the vec matrix code could perhaps also do more than 59 iterations.
  10. peterdettman commented at 4:06 pm on December 10, 2021: contributor
    It could trivially do 60. The current core vec loop tops out at 30. In theory I think it can do 31, but it’s not a given until I work it through.
  11. sipa commented at 4:28 pm on December 10, 2021: contributor

    Benchmarks on ARM64 (Cortex-A53, default compilation options, GCC 9.3.0, ./bench_internal inverse):

     0master:
     1field_inverse                 ,    12.5       ,    12.5       ,    12.5    
     2field_inverse_var             ,     7.26      ,     7.27      ,     7.27
     3pr1031:
     4field_inverse                 ,    11.9       ,    11.9       ,    11.9    
     5field_inverse_var             ,     7.26      ,     7.27      ,     7.27 
     6vec_modinv:
     7field_inverse                 ,    11.1       ,    11.1       ,    11.1    
     8field_inverse_var             ,     7.26      ,     7.27      ,     7.27   
     9vec_modinv using const-time matrix for vartime:
    10field_inverse                 ,    11.2       ,    11.2       ,    11.2    
    11field_inverse_var             ,     9.83      ,     9.83      ,     9.83  
    

    Looks like we shouldn’t throw out the vartime matrix code just yet…

  12. peterdettman force-pushed on Dec 21, 2021
  13. sipa commented at 4:21 pm on December 21, 2021: contributor
    ACK 6afd499f53e94d48a7fd90ff345d47101a6c6e41 @robot-dreams Perhaps you’re interested in reviewing this too? @peterdettman Given that your later (and much more significant) improvement builds on top of this one (at least conceptually, using initially-shifted u,v,q,r variables) and doesn’t work for 32 bit yet, I’d vote to get this change merged as-is, and do the rest in a follow-up. WDYT?
  14. peterdettman commented at 4:50 pm on December 21, 2021: contributor
    @sipa Yeah, we may as well merge this, since the other one is still under experimentation and will in any case require significant write-up and VERIFY checks.
  15. in doc/safegcd_implementation.md:158 in 6afd499f53 outdated
    154@@ -155,14 +155,14 @@ do one division by *2<sup>N</sup>* as a final step:
    155 ```python
    156 def divsteps_n_matrix(delta, f, g):
    157     """Compute delta and transition matrix t after N divsteps (multiplied by 2^N)."""
    158-    u, v, q, r = 1, 0, 0, 1 # start with identity matrix
    159+    u, v, q, r = 2**N, 0, 0, 2**N # start with identity matrix
    


    robot-dreams commented at 4:35 am on December 22, 2021:

    Nits (to be consistent with below):

    • use shift instead of power
    • add (scaled by 2^N). to comment

    peterdettman commented at 5:16 am on January 11, 2022:
    Done.
  16. in doc/safegcd_implementation.md:442 in 6afd499f53 outdated
    439 
    440 in constant-time form as:
    441 
    442 ```python
    443-    c1 = (-delta) >> 63
    444+    c1 = delta >> 63
    


    robot-dreams commented at 5:45 am on December 22, 2021:

    Does this need to be (delta - 1) >> 63 here (and line 508 below)?

    We want to negate f (i.e. set c1 = -1) whenever delta <= 0, but the way it is now, for delta == 0 we wouldn’t negate, right?


    peterdettman commented at 5:17 am on January 11, 2022:
    Done (2 occurrences).
  17. in doc/safegcd_implementation.md:686 in 6afd499f53 outdated
    688 ```python
    689-def divsteps_n_matrix(zeta, f, g):
    690-    """Compute zeta and transition matrix t after N divsteps (multiplied by 2^N)."""
    691-    u, v, q, r = 1, 0, 0, 1 # start with identity matrix
    692+def divsteps_n_matrix(theta, f, g):
    693+    """Compute delta and transition matrix t after N divsteps (multiplied by 2^N)."""
    


    robot-dreams commented at 6:57 am on December 22, 2021:
    Nit: theta, since that’s what’s returned?
  18. in doc/safegcd_implementation.md:696 in 6afd499f53 outdated
    705-        c1 &= c2                     # reusing c1 here for the earlier c3 variable
    706-        zeta = (zeta ^ c1) - 1       # inlining the unconditional zeta decrement here
    707+        # Conditionally subtract x, y, z from g, q, r.
    708+        g, q, r = g - (x & c2), q - (y & c2), r - (z & c2)
    709+        c3 = ~c1 & c2
    710+        # Conditionally negate delta, and then increment it by 1.
    


    robot-dreams commented at 7:07 am on December 22, 2021:

    Nit: emphasize that you’re in fact negating delta (not theta)?

    0        # Conditionally negate delta (theta = delta - 1/2), and then increment it by 1.
    

    peterdettman commented at 5:19 am on January 11, 2022:
    Now “Conditionally complement theta, and then increment it by 1.”, since we are now using theta directly.
  19. robot-dreams commented at 7:21 am on December 22, 2021: contributor

    Thanks for the tag @sipa ! You’re right, I’m definitely interested in this change 😄

    I took an initial look at the new writeup and it looks correct so far, but I still need to sleep on it and then review the new C implementation carefully.

  20. in src/modinv64_impl.h:149 in 6afd499f53 outdated
    144@@ -145,72 +145,72 @@ typedef struct {
    145     int64_t u, v, q, r;
    146 } secp256k1_modinv64_trans2x2;
    147 
    148-/* Compute the transition matrix and eta for 59 divsteps (where zeta=-(delta+1/2)).
    149+/* Compute the transition matrix and theta for 59 divsteps (where theta=delta-1/2)
    150  * Note that the transformation matrix is scaled by 2^62 and not 2^59.
    


    robot-dreams commented at 4:24 pm on December 22, 2021:

    Is it worth elaborating on some of these parameter choices?

    0/* Compute the transition matrix and theta for 59 divsteps (where theta=delta-1/2)
    1 * We use 59 instead of 62 because we need 590 total divsteps.
    2 * However, we scale the transformation matrix by 2^62 and not 2^59 to
    3 * use the precomputed inverse of modulus mod 2^62.
    4 */
    

    peterdettman commented at 5:22 pm on December 24, 2021:

    How’s this?

    0/* Compute the transition matrix and theta for 59 divsteps (where theta=delta-1/2)
    1 * Although only 59 divsteps are performed, the resulting transformation matrix
    2 * is scaled by 2^62 to allow reuse of _update_de_62 and _update_fg_62 between
    3 * _modinv64 and _modinv64_var.
    
  21. in src/modinv32_impl.h:194 in 6afd499f53 outdated
    208+        VERIFY_CHECK((f & 1) == 1);
    209+        /* Minimum trailing zeros count for matrix elements decreases in each iteration */
    210+        VERIFY_CHECK(!((u | v | q | r) & (0xFFFFFFFFULL >> (i + 2))));
    211+        /* Applying the matrix so far to the initial f,g gives current f,g. */
    212+        VERIFY_CHECK((u >> (30 - i)) * f0 + (v >> (30 - i)) * g0 == f << i);
    213+        VERIFY_CHECK((q >> (30 - i)) * f0 + (r >> (30 - i)) * g0 == g << i);
    


    robot-dreams commented at 4:32 pm on December 22, 2021:
    Just wondering, why don’t these (and the analogous checks in modinv64_impl.h) cast to higher precision, like line 234 below?

    peterdettman commented at 5:18 pm on December 22, 2021:

    They are equalities modulo 2^32 (resp. 2^64) only, because that’s the precision we are given f0 and g0 to. Technically maybe we only even care about modulo 2^30 (resp. 2^62), but the extra bits are there and should agree anyway. If divsteps were being done at full precision (i.e. directly on the full-size F,G) the equality would still hold, but a significant feature of safegcd is to be able to operate in limited precision most of the time, only updating the full F,G values when we need more bits to work on.

    The latter matrix determinant check is full precision because, well, we’re checking the exact value and not making any assumptions that we got u/v/q/r right.


    robot-dreams commented at 5:29 pm on December 22, 2021:
    Thanks! Yes, the determinant check doesn’t involve the low-order-only f0, g0, that makes sense.
  22. robot-dreams commented at 4:45 pm on December 22, 2021: contributor

    ACK 6afd499f53e94d48a7fd90ff345d47101a6c6e41

    I checked the writeup updates, made sure the updated Python code produced the same transition matrices as before (using various small parameters), and then compared the Python vs. C implementations line by line.

    Am I understanding correctly that removing 3 subtractions from the inner loop gives such a dramatic speedup?

  23. peterdettman commented at 5:22 pm on December 22, 2021: contributor

    Am I understanding correctly that removing 3 subtractions from the inner loop gives such a dramatic speedup? @sipa gave numbers from a different branch of mine that is more radical. This PR probably only gives a few percent, and maybe nothing if the processor could have executed them in parallel anyway.

    Edit: “in parallel” meaning using otherwise idle ALUs/registers, although it’s also possible to have f/u/v, g/q/r, x/y/z be vectorized, in which case there might only even be the one operation saved.

  24. peterdettman force-pushed on Dec 23, 2021
  25. peterdettman force-pushed on Dec 24, 2021
  26. peterdettman commented at 5:27 pm on December 24, 2021: contributor
    @sipa I’m currently assuming that you will get to the nits that @robot-dreams had above about safegcd_implementation.md, but I’ll sort them out next week if not.
  27. sipa commented at 6:44 pm on January 4, 2022: contributor
    See my updated branch https://github.com/sipa/secp256k1/commits/pr1031 . I’ve edited the “Update safegcd writeup to reflect the code” commit to address @robot-dreams’s comments.
  28. peterdettman force-pushed on Jan 11, 2022
  29. peterdettman force-pushed on Jan 11, 2022
  30. peterdettman commented at 5:24 am on January 11, 2022: contributor
    Thanks, @sipa. Squashed one spelling fix into my last commit.
  31. Faster const-time modinv divsteps d0a4b212cb
  32. modinv: introduce theta=delta-1/2 f027e1f441
  33. Update safegcd writeup to reflect the code 5271894b4d
  34. Improve matrix computation assertions ab4900d9d8
  35. Improve verify check and comments
    - Add explanation for 59 divsteps vs 2^62 scaling
    7e2acdae49
  36. peterdettman force-pushed on Apr 20, 2022
  37. sipa commented at 8:52 pm on November 21, 2022: contributor
  38. sipa commented at 9:02 pm on January 19, 2023: contributor
    I’ve taken the liberty to open a rebased PR of this #1197.
  39. real-or-random commented at 4:19 pm on May 11, 2023: contributor
    Closing in favor of #1197
  40. real-or-random closed this on May 11, 2023


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-22 21:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me