Minor optimizations to _scalar_inverse to save 4M #452

peterdettman wants to merge 2 commits into bitcoin-core:master from peterdettman:sinv_opt, changing 1 file (+67 −104)

peterdettman commented at 5:57 am on April 18, 2017: contributor

The existing secp256k1_scalar_inverse uses a straightforward “blocks-of-1” approach, but the exponent here admits a few tweaks.

First, the bit pattern 01010 occurs 3 times, so if we have u5 = x^5 available we can save 3M during accumulation. u5 can be precalculated cheaply, replacing 1S with 1M, which leaves a net saving of 2M.
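A minimal Python sketch of the 01010 trick described above. It is not the PR's C code: the helper names (`sqr_n`, `append_01010_blocks_of_1`, `append_01010_with_u5`) are hypothetical, and plain integers modulo the secp256k1 group order stand in for the library's scalar type. It only shows that appending the bits 01010 to an exponent costs 5S + 2M with one multiply per set bit, but 5S + 1M once x^5 is at hand.

```python
# Group order of secp256k1; scalar arithmetic is done modulo this prime.
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def sqr_n(t, k):
    """Square t k times modulo N (k squarings, i.e. kS)."""
    for _ in range(k):
        t = t * t % N
    return t


def append_01010_blocks_of_1(t, x):
    """Append bits 01010 to the exponent of t, one multiply per set bit: 5S + 2M."""
    t = sqr_n(t, 2) * x % N  # bits 0, 1
    t = sqr_n(t, 2) * x % N  # bits 0, 1
    return sqr_n(t, 1)       # bit 0


def append_01010_with_u5(t, u5):
    """Same result using u5 = x^5: 0101 is the exponent 5, then one more 0. 5S + 1M."""
    t = sqr_n(t, 4) * u5 % N  # bits 0101 in one multiply
    return sqr_n(t, 1)        # trailing 0
```

Both helpers turn x^e into x^(32e + 10); the second does it with one multiplication fewer, which is where the 3M saved over three occurrences comes from.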

Secondly, the addition chain invites optimization since it ends with 127 (a critical threshold) and also contains unnecessary (to the chain) x4 and un-accumulated x7. The addition chain 2, 3, 6, 8, 14, 28, 56, 112, 126 looks inviting… and fortuitously, armed with u5, we can replace [x127 0 x] with [x126 u5] and [x4 0 x] with [x3 u5] (two occurrences) - for free. The new shorter addition chain saves 2M.
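The chain and the two block substitutions can be checked with a short Python sketch. It assumes, as a reading of the notation rather than something stated in the thread, that xk means x^(2^k - 1) (a run of k one-bits) and that u5 means x^5. All function names are hypothetical, and the step order chosen by `build_runs` is just one valid decomposition, not necessarily the one the PR uses.

```python
# Group order of secp256k1; scalar arithmetic is done modulo this prime.
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141
CHAIN = [2, 3, 6, 8, 14, 28, 56, 112, 126]


def is_addition_chain(chain):
    """True if every entry is the sum of two earlier entries (1 is the implicit start)."""
    seen = [1]
    for k in chain:
        if not any(a + b == k for a in seen for b in seen):
            return False
        seen.append(k)
    return True


def ones(k):
    """Integer whose binary form is k one-bits, i.e. 2^k - 1."""
    return (1 << k) - 1


def build_runs(x):
    """Compute x^(2^k - 1) mod N for each k in CHAIN.

    Each step uses x_(a+b) = (x_a)^(2^b) * x_b: b squarings and one multiply.
    """
    runs = {1: x % N}
    for k in CHAIN:
        a = next(a for a in runs if (k - a) in runs)  # any valid split works
        b = k - a
        t = runs[a]
        for _ in range(b):
            t = t * t % N
        runs[k] = t * runs[b] % N
    return runs


def block(run_len, tail_bits, tail_value):
    """Exponent bits: run_len ones followed by tail_value written in tail_bits bits."""
    return (ones(run_len) << tail_bits) | tail_value
```

With these helpers, `block(127, 2, 0b01) == block(126, 3, 0b101)` expresses [x127 0 x] = [x126 u5], and `block(4, 2, 0b01) == block(3, 3, 0b101)` expresses [x4 0 x] = [x3 u5]. The bit strings are identical, so x127 and x4 no longer need to be computed.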

There should be ~256 squares and I counted 44 multiplies in the existing version, and a saving of 1S + 4M with this PR. Since performance of scalar mul/sqr are very similar in my config, I’d expect about 5/300 ~= 1.7% performance improvement, and I measure roughly 1.5%.

Minor optimizations to _scalar_inverse to save 4M

- Precalculate x^5 and use for "01010" patterns during accumulation. (net -2M)
- Further use of x^5 to allow shorter addition chain (net -2M)

cf12fa13cb

peterdettman cross-referenced this on Apr 18, 2017 from issue Scalar 4x64 performance improvements by peterdettman
briansmith commented at 7:21 pm on April 24, 2017: contributor

ICYMI: I submitted a PR to improve upon this by 1 square and 3 multiplications (total savings 2 squares + 7 multiplications), but because it was based on Peter’s unmerged PR, the PR ended up being a PR against his repo instead of this one: https://github.com/peterdettman/secp256k1/pull/1.

Further shorten the addition chain for scalar inversion.

Reduce the number of squarings by one and reduce the number of
multiplications by three.

465159c278

gmaxwell commented at 4:45 pm on April 25, 2017: contributor

These changes are awesome. Thanks. Will test and review carefully soon.
sipa commented at 10:04 pm on April 25, 2017: contributor

Benchmarked bench_verify with GMP disabled on an i7-6820HQ CPU, pegged to 2.6GHz.

- master: 97.7μs
- this PR: 97.3μs
- https://github.com/peterdettman/secp256k1/pull/1: 97.0μs

ACK on both. The result is just a sequence of mul/sqr calls, which by definition is an exponentiation ladder. If the result were incorrect, it would implement an incorrect exponent, which nearly every test run should catch.
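The argument that a wrong exponent would be caught can be illustrated in Python. This sketch assumes the inversion is exponentiation by n - 2 modulo the prime group order n (Fermat's little theorem); the thread does not spell the exponent out. `ladder_is_inverse` is a hypothetical helper, and Python's built-in `pow` stands in for the hand-written mul/sqr sequence.

```python
import random

# Group order of secp256k1 (prime), so x^(N-2) is the inverse of x modulo N.
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def ladder_is_inverse(exponent, trials=64, seed=1):
    """Check x * x^exponent == 1 (mod N) for several pseudo-random nonzero x."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randrange(1, N)
        if x * pow(x, exponent, N) % N != 1:
            return False
    return True


correct = ladder_is_inverse(N - 2)
# A ladder with a single wrong bit implements a different exponent.
corrupted = ladder_is_inverse((N - 2) ^ (1 << 40))
```

Here `correct` is true, while the corrupted exponent fails on random inputs, since only a negligible fraction of scalars satisfy the wrong relation.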
sipa cross-referenced this on Apr 25, 2017 from issue Further shorten the addition chain for scalar inversion. by briansmith
briansmith commented at 10:12 pm on April 25, 2017: contributor

FWIW, I also reviewed Peter’s PR and it is correct. I also agree with @sipa’s assessment that many things would fail if the sequence of multiplications and squarings were incorrect. Nonetheless, I also manually verified that Peter’s addition chain is correct when improving it.
gmaxwell commented at 11:14 pm on April 25, 2017: contributor

@peterdettman Can you pull briansmith’s commit into this PR?
peterdettman commented at 5:04 am on April 26, 2017: contributor

@gmaxwell Done. Cumulative 255S+44M -> 253S+37M, measured speed of scalar_inverse at +3% as expected.
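As a rough sanity check of the "+3% as expected" figure, the operation counts quoted in this thread can be plugged into a small Python calculation. It assumes squarings and multiplications cost about the same, as stated earlier for the author's configuration; the function name is hypothetical.

```python
def ops_saved_fraction(before, after):
    """Fraction of operations removed; before/after are (squarings, multiplications).

    Assumes one squaring costs the same as one multiplication.
    """
    return 1 - sum(after) / sum(before)


# First commit alone, using the approximate counts from the PR description: 5/300.
first_commit = ops_saved_fraction((256, 44), (255, 40))
# Both commits, using the cumulative counts above: 9/299.
cumulative = ops_saved_fraction((255, 44), (253, 37))

print(f"first commit: {first_commit:.2%}, cumulative: {cumulative:.2%}")
```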
gmaxwell approved
gmaxwell commented at 10:40 pm on April 26, 2017: contributor

ACK.
sipa merged this on Apr 26, 2017
sipa closed this on Apr 26, 2017
sipa referenced this in commit cbc20b8c34 on Apr 26, 2017

Contributors
peterdettman briansmith gmaxwell sipa