Avoid division when decomposing scalars

peterdettman commented at 2:49 pm on May 25, 2014: contributor

In secp256k1_gej_split_exp, there are two divisions used. Since the denominator is a constant known at compile-time, each can be replaced by a multiplication followed by a right-shift (and rounding).
This change adds the constants g1, g2 for this purpose and rewrites secp256k1_gej_split_exp accordingly.

The technique is discussed, amongst other places, in the paper “Efficient Software Implementation of Public-Key Cryptography on Sensor Networks Using the MSP430X Microcontroller” (Gouvea, Oliveira, Lopez) - in section 4.3.

sipa commented at 5:27 pm on June 2, 2014: contributor

Nice, this removes the need for secp256k1_num_div entirely.

sipa commented at 11:11 pm on June 15, 2014: contributor

Can you:

Remove the secp256k1_num_div function and its implementations.
Add some commentary about how you came up with the numbers?

peterdettman commented at 2:36 am on June 21, 2014: contributor

@sipa Apologies for the delay, I’m a bit pushed for time at the moment. I’ll return to these pull requests hopefully later this week.

peterdettman commented at 4:21 am on June 24, 2014: contributor

I’ve added some comments and removed _num_div, but not ready for merge yet as it’s not giving the expected performance improvement. Bear with me while I investigate!

peterdettman commented at 9:54 am on June 24, 2014: contributor

The split itself is a lot faster, but there’s some kind of one-off-error vs the original “bnc1/bnc2” values on a small proportion of inputs, so looking into that now.

peterdettman commented at 1:27 pm on July 16, 2014: contributor

This one’s got me a bit confused. Not the “one-off-error”, which was actually just the rough edges of what is inherently an approximation (I’ve increased the precision so now the split is virtually always exactly the same as the old code). Rather, I can’t see that performance (of bench) is actually improved.

Naively an improvement is expected since two divisions are replaced by simple truncations (right-shift). If the split code is wrapped in a (pointless) loop of say 10 iterations to try and give it a heavier “weight”, then the difference is pretty clear: the new split is something like 1.8x as fast, and it shows in the bench results. However, for the unmodified bench, I think I am measuring the new code as very slightly slower (it’s hard to pick the signal out from the noise here though).

So, I’m thinking it has something to do with needing to load g1/g2 (vs ‘order’) at each call i.e. some sort of memory/cacheing problem. This seems plausible as the split is only needed once per scalar mult, and so the constants could quite well be getting evicted between split calls. (Note that ‘order’, which is not used in the updated split, is used in the ECDSA code, so perhaps is being cached for the existing split code).

On that theory I tried a few changes like copying things into local variables, using gcc prefetch intrinsics, changing the layout in memory to correspond to the order of usage (g1,g2,a1b2,a2,b1), etc., but without much luck and usually just making things worse!

Anyway, I’d appreciate it if others could perhaps report their timings with/without this patch (of course it’s only relevant when the endomorphism is used), and if you have any thoughts on what’s going on here performance-wise, I’m all ears.

in src/num_gmp_impl.h: in 941d4ffe99 outdated

332+    r->limbs = a->limbs - limbs;
333+    if (left == 0) {
334+        mpn_copyi(r->data, a->data + limbs, r->limbs);
335+    } else {
336+        mpn_rshift(r->data, a->data + limbs, r->limbs, left);
337+        while (r->limbs>1 && r->data[r->limbs-1]==0) r->limbs--;

sipa commented at 3:02 pm on July 16, 2014:

Who not >0?

peterdettman commented at 2:00 pm on July 20, 2014:

I’m guessing you’re referring to the “r->limbs>1” test? _num_trunc is just _num_split modified to ignore the low bits (perhaps I could have just extended _num_shift to support large shifts). I gather that a zero valued num still has limbs==1, but in any case, I’ve just copied and modified… Please clarify if I’ve misunderstood the question.

peterdettman force-pushed on Aug 20, 2014

sipa commented at 9:44 pm on September 1, 2014: contributor

Indeed, I notice a slight slowdown with this (around 0.4%). I wonder whether the advantage of not needing division in the bignum API isn’t already worth it…

theuni cross-referenced this on Sep 11, 2014 from issue WIP: Internal bignum by theuni

sipa commented at 11:18 am on October 26, 2014: contributor

@gmaxwell review requested

peterdettman commented at 11:30 am on October 26, 2014: contributor

@sipa I’m still very curious about the performance gap, but I think it’s reasonable to merge this since the native num stuff appears to be progressing rapidly. I still expect there will be a net performance gain in batch operations, and we may yet find a way around the single-op deficit.

peterdettman force-pushed on Oct 31, 2014

peterdettman force-pushed on Nov 3, 2014

peterdettman force-pushed on Nov 4, 2014

peterdettman force-pushed on Nov 13, 2014

peterdettman commented at 3:39 am on November 13, 2014: contributor

Re-measuring with -O3 -march=native, it looks like this PR actually does improve performance (I measure very roughly 0.5% faster on bench_verify).

peterdettman force-pushed on Nov 15, 2014

peterdettman force-pushed on Nov 18, 2014

Avoid division when decomposing scalars

- In secp256k1_gej_split_exp, there are two divisions used. Since the denominator is a constant known at compile-time, each can be replaced by a multiplication followed by a right-shift (and rounding).
- Add the constants g1, g2 for this purpose and rewrite secp256k1_gej_split_exp accordingly.
- Remove secp256k1_num_div since no longer used

960f9a2e15

peterdettman force-pushed on Nov 19, 2014

sipa cross-referenced this on Dec 1, 2014 from issue Multiplication-only lambda splitter without bignum by sipa

gmaxwell closed this on Dec 7, 2014

roconnor-blockstream cross-referenced this on Sep 21, 2020 from issue Increase precision of g1 and g2. by roconnor-blockstream

tecnovert referenced this in commit f1c601ed9d on Apr 18, 2022

Avoid division when decomposing scalars #21