Rework of 10x26 field mul/sqr #815

pull peterdettman wants to merge 3 commits into bitcoin-core:master from peterdettman:rework_10x26_mul, changing 1 file (+440 −497)
  1. peterdettman commented at 4:38 pm on September 13, 2020: contributor
    • avoid overly-wide multiplications
    • save a few multiplies, masks and shifts
    • final residual left in r[9] instead of r[2]

    @gmaxwell It looks faster to me, but if you could collect some results for real 32-bit hardware that would be very helpful.
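
    As a rough illustration of the first bullet (a sketch only, not the actual diff): the reduction constants R0 = 0x3D10 and R1 = 0x400 in field_10x26_impl.h are 32-bit values, but if the extracted limb is held in a 64-bit variable the product is a 64x64->64 multiply on 32-bit targets; keeping the limb in 32 bits lets the compiler use a single widening 32x32->64 multiply.

    ```c
    #include <stdint.h>

    /* Sketch only, not the actual diff.  M and R0 are the constants already used
     * by field_10x26_impl.h (M masks a 26-bit limb). */
    #define M  0x3FFFFFFUL
    #define R0 0x3D10UL

    /* "Wide": u0 is 64-bit, so u0 * R0 is a 64x64->64 multiply on 32-bit targets,
     * even though u0 only ever holds 26 bits. */
    static uint64_t fold_wide(uint64_t d, uint64_t c) {
        uint64_t u0 = d & M;
        return c + u0 * R0;
    }

    /* "Narrow": keeping the extracted limb in 32 bits lets the compiler emit a
     * single widening 32x32->64 multiply (e.g. umull on ARM). */
    static uint64_t fold_narrow(uint64_t d, uint64_t c) {
        uint32_t u0 = (uint32_t)(d & M);
        return c + (uint64_t)u0 * R0;
    }
    ```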
  2. Rework of 10x26 field mul/sqr
    - avoid overly-wide multiplications
    - save a few multiplies, masks and shifts
    - final residual left in r[9] instead of r[2]
    9388b77b03
  3. peterdettman cross-referenced this on Sep 13, 2020 from issue Make 64x64->64 bit multiplications constant-time with MSVC on 32bit x86 by real-or-random
  4. real-or-random commented at 9:02 pm on September 13, 2020: contributor

    @peterdettman That’s great; I didn’t see the widemul optimizations even though I was staring at this code for some time.

    I get the general idea, but can you explain the approach more systematically? E.g., what procedure did you follow to arrive at these changes? I think this would help reviewers, so we don’t need to rediscover your thoughts.

  5. peterdettman commented at 5:11 am on September 14, 2020: contributor

    The overall algorithm is “schoolbook” multiplication to calculate 19 products p00..p18, with the high part reduced modulo the field prime P using multiplication by (2^256 - P), and a final carry propagation to ensure a magnitude 1 output.

    Operationally, we run two accumulators, 10 limbs (260 bits of “weight”) apart, and interleave the modular reduction: the least-significant limb of the upper accumulator can be accumulated into the lower accumulator after multiplying by [ R1, R0 ], where (R1 << 26) + R0 == (2^256 - P) << 4 (the << 4 is because 10 limbs is 260 bits, not 256).
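
    A simplified sketch of one such interleaved step (illustrative only; the real code starts the two accumulators at p07 and p17, as described next, and tracks verification bounds throughout):

    ```c
    #include <stdint.h>

    /* One reduction step, for illustration: fold the least-significant limb of
     * the upper accumulator d into the lower accumulator c, then extract one
     * output limb.  M, R0, R1 as in field_10x26_impl.h, with
     * (R1 << 26) + R0 == (2^256 - P) << 4. */
    static uint32_t reduce_step(uint64_t *d, uint64_t *c) {
        const uint32_t M = 0x3FFFFFFUL, R0 = 0x3D10UL, R1 = 0x400UL;
        uint32_t u, t;

        u   = (uint32_t)(*d & M); *d >>= 26;  /* low limb of the upper accumulator */
        *c += (uint64_t)u * R0;               /* fold it in at the current weight  */
        t   = (uint32_t)(*c & M); *c >>= 26;  /* one finished output limb          */
        *c += (uint64_t)u * R1;               /* the rest lands one limb higher    */
        return t;
    }
    ```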

    The accumulation begins with p07 and p17: d accumulates p07..p16, while c accumulates p17..p18 as the upper accumulator and then p00..p06 as the lower accumulator. There are a couple of final carry-propagation steps for c. The choice of starting positions determines where the final carry lands. There is an ulterior motive for finishing the carry propagation at r[9]: it will support raising the maximum magnitude of field elements to 16 (currently 8). a[9] and b[9] also have their input bounds relaxed to 27 bits (from 26) in anticipation of this.

    Squaring is the same, modified only so that duplicate limb multiplications are consolidated when calculating the products p00..p18.

  6. peterdettman commented at 5:14 am on September 14, 2020: contributor
    I’m intending to push arbitrary-degree Karatsuba changes in this PR shortly, but I’d like to get some baseline performance numbers in place first.
  7. Formatting 85db039d39
  8. Arbitrary-degree Karatsuba for 10x26 mul 9618abcc87
  9. peterdettman commented at 2:05 pm on September 14, 2020: contributor

    I’m intending to push arbitrary-degree Karatsuba changes in this PR shortly

    The latest commit adds this (only relevant to mul, not sqr). It saves 45 limb multiplies, but adds the equivalent of 52 limb additions (counting 64-bit adds as 2). However, that may not be the full story; from the paper:

    However an inspection of the generated code also confirmed our suspicion that the ADK code generated more register-register move instructions and memory accesses. On some processors this could offset the ADK advantage.

    On my 64-bit machine this is slower (as expected), but on real 32-bit hardware there’s a good chance it is faster. I’d need someone to help with measuring that, though. From what I can see on godbolt, the choice of compiler leads to large differences in instruction count.
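
    (For reference, the multiply saving matches the usual ADK rearrangement: for i != j, a[i]*b[j] + a[j]*b[i] == (a[i] + a[j])*(b[i] + b[j]) - a[i]*b[i] - a[j]*b[j], so the 100 schoolbook products are replaced by the 10 diagonal products a[i]*b[i] plus the 45 sum products (a[i] + a[j])*(b[i] + b[j]) with i < j, i.e. 55 multiplies in total, at the cost of the extra additions and subtractions counted above.)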

  10. peterdettman commented at 1:59 pm on October 16, 2020: contributor
    @gmaxwell I don’t want to impose on you, but I had hoped you might collect some benchmarks for this PR.
  11. gmaxwell commented at 1:40 am on October 17, 2020: contributor
    Whoops! Will do. I’m excited about this work, but I’ve been redoing all the power at home and have had systems in various states of being off or disconnected, so I couldn’t immediately try it and then forgot about it. Give me a day or two.
  12. laanwj cross-referenced this on Oct 17, 2020 from issue Experimental features should be deprecated by rustyrussell
  13. real-or-random cross-referenced this on Nov 24, 2020 from issue Tighter verification bounds in mul/sqr by peterdettman
  14. peterdettman commented at 8:34 am on May 7, 2021: contributor

    @gmaxwell Still hoping for some benchmarks here (separately for the first “rework” commit and the latter “Karatsuba” one).

    If/when this is accepted I am hoping someone can update the corresponding asm to leave its final carry in the top limb (or in case Karatsuba proves fast, either abandon the asm or reimplement it along those lines).

    At that point field implementations could be modified to support a maximum magnitude of 16. Some easy (but small) performance improvements then follow, and I am expecting it will give a little more room for per-group-operation magnitude limits (on input and output field elements) to be profitable.
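
    (For context on these bounds: limb 9 of a magnitude-m field element is allowed to be as large as 2*m*(2^22 - 1) under the verify checks, so magnitude 8 keeps a[9] and b[9] just below 2^26, while magnitude 16 allows values just below 2^27, matching the relaxed 27-bit input bound mentioned earlier.)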

  15. gmaxwell commented at 7:44 pm on May 7, 2021: contributor

    @peterdettman Oh crap. I am terribly sorry for completely forgetting your request.

    I’ve tested on two different 32-bit ARM systems: a faster superscalar cortex-a9 and a slower in-order cortex-a8 (which can dual-issue, but only if the instructions are paired right).

    Common hardware wallets usually use a Cortex-M4, which like the a8 is an in-order core with dual issue, but due to pipeline differences I’m not sure how similar the results would be. Some use the M3, whose 32x32->64 mul takes 3-5 cycles “depending on values” (uh oh). The slower M0 has a 32-cycle multiply… I’m not aware of anything using this library on those, though I suppose karatsuba would be a huge win there. (I’d kinda like to see those numbers!)

    Results seem a little paradoxical. On a8, rework seems to slow things down and karatsuba speeds the mul back up. On a9 rework is a slight speedup, but karatsuba is a slight slowdown (though still faster than master).

    With the older GCC the rework is a huge speedup.

    As the ASM (which is IIRC the same algorithm as master, just with hand scheduling and register allocation) and the older GCC results show, the performance is really sensitive to compiler decisions. This might make it hard to draw conclusions on the algorithms overall. I wouldn’t be surprised if the small disadvantage for karatsuba on a9 went away with somewhat more careful scheduling.

    May be of interest to @laanwj too.


    i.MX53 (Cortex-a8) gcc version 9.3.0 (GCC)

    Master w/ arm asm:
        field_sqr:    min 0.598us / avg 0.598us / max 0.599us
        field_mul:    min 0.810us / avg 0.811us / max 0.812us
        ecdsa_verify: min 1556us / avg 1556us / max 1557us

    Master:
        field_sqr:    min 0.666us / avg 0.666us / max 0.668us
        field_mul:    min 1.10us / avg 1.10us / max 1.11us
        ecdsa_verify: min 1895us / avg 1896us / max 1896us

    Rework:
        field_sqr:    min 0.716us / avg 0.718us / max 0.728us
        field_mul:    min 1.18us / avg 1.18us / max 1.19us
        ecdsa_verify: min 2025us / avg 2025us / max 2027us

    Karatsuba:
        field_sqr:    min 0.715us / avg 0.715us / max 0.717us
        field_mul:    min 0.983us / avg 0.984us / max 0.986us
        ecdsa_verify: min 1829us / avg 1829us / max 1829us


    i.MX6 (Cortex-a9) gcc version 9.3.0 (GCC)

    Master w/ arm asm:
        field_sqr:    min 0.292us / avg 0.292us / max 0.292us
        field_mul:    min 0.361us / avg 0.361us / max 0.362us
        ecdsa_verify: min 740us / avg 740us / max 740us

    Master:
        field_sqr:    min 0.342us / avg 0.343us / max 0.343us
        field_mul:    min 0.496us / avg 0.496us / max 0.497us
        ecdsa_verify: min 919us / avg 919us / max 920us

    Rework:
        field_sqr:    min 0.329us / avg 0.330us / max 0.330us
        field_mul:    min 0.495us / avg 0.495us / max 0.496us
        ecdsa_verify: min 906us / avg 906us / max 907us

    Karatsuba:
        field_sqr:    min 0.331us / avg 0.331us / max 0.331us
        field_mul:    min 0.500us / avg 0.501us / max 0.501us
        ecdsa_verify: min 910us / avg 910us / max 911us


    i.MX6 (Cortex-a9) gcc version 4.9.2 (Debian 4.9.2-10+deb8u2)

    Master w/ arm asm:
        field_sqr:    min 0.292us / avg 0.292us / max 0.292us
        field_mul:    min 0.361us / avg 0.361us / max 0.361us
        ecdsa_verify: min 739us / avg 739us / max 740us

    Master:
        field_sqr:    min 0.815us / avg 0.815us / max 0.816us
        field_mul:    min 0.940us / avg 0.941us / max 0.942us
        ecdsa_verify: min 1755us / avg 1756us / max 1757us

    Rework:
        field_sqr:    min 0.554us / avg 0.555us / max 0.556us
        field_mul:    min 0.597us / avg 0.598us / max 0.599us
        ecdsa_verify: min 1164us / avg 1165us / max 1171us

    Karatsuba:
        field_sqr:    min 0.555us / avg 0.556us / max 0.557us
        field_mul:    min 0.662us / avg 0.665us / max 0.668us
        ecdsa_verify: min 1228us / avg 1229us / max 1229us

  16. gmaxwell commented at 2:48 am on May 9, 2021: contributor

    Given that it was so compiler-sensitive, I thought it would be useful to upgrade to a newer GCC and try clang too:


    Cortex-a8 gcc version 10.2.0 (GCC)

    Master:
        field_sqr: min 0.658us / avg 0.658us / max 0.668us
        field_mul: min 1.13us / avg 1.13us / max 1.14us

    Rework:
        field_sqr: min 0.666us / avg 0.668us / max 0.671us
        field_mul: min 1.03us / avg 1.03us / max 1.04us

    Karatsuba:
        field_sqr: min 0.667us / avg 0.667us / max 0.675us
        field_mul: min 1.07us / avg 1.07us / max 1.08us


    Cortex-a8 clang version 11.1.0

    Master:
        field_sqr: min 0.614us / avg 0.620us / max 0.643us
        field_mul: min 1.33us / avg 1.34us / max 1.35us

    Rework:
        field_sqr: min 0.747us / avg 0.750us / max 0.760us
        field_mul: min 1.35us / avg 1.36us / max 1.37us

    Karatsuba:
        field_sqr: min 0.746us / avg 0.752us / max 0.786us
        field_mul: min 1.01us / avg 1.02us / max 1.03us

    …chaos

  17. peterdettman commented at 2:10 pm on May 11, 2021: contributor

    Thanks, @gmaxwell. Somewhat irksome results though.

    In the rework I tried loading a0..a9 so that the results can be written out to r immediately, but maybe it’s worth testing a variation that doesn’t attempt that change (so no a0..a9; instead restore t0..t6 and write to r only after all reads from a), roughly as sketched below.
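
    For concreteness, the two store strategies look something like this (limb0 is just a made-up stand-in for “compute one output limb”; the point is that r may alias a, while b is SECP256K1_RESTRICT):

    ```c
    #include <stdint.h>

    /* Hypothetical stand-in for "compute one output limb"; only here to make
     * the two store strategies concrete. */
    static uint32_t limb0(uint32_t a0, uint32_t a1, const uint32_t *b) {
        return (uint32_t)(((uint64_t)a0 * b[0] + (uint64_t)a1 * b[1]) & 0x3FFFFFF);
    }

    /* Variant A (as in the rework): copy a into locals up front, so results
     * can be stored to r immediately even if r aliases a. */
    static void variant_a(uint32_t *r, const uint32_t *a, const uint32_t *b) {
        uint32_t a0 = a[0], a1 = a[1];   /* ... a2..a9 in the real thing */
        r[0] = limb0(a0, a1, b);
    }

    /* Variant B (the variation suggested above): read a directly, buffer
     * outputs in temporaries, and write r only after the last read of a. */
    static void variant_b(uint32_t *r, const uint32_t *a, const uint32_t *b) {
        uint32_t t0 = limb0(a[0], a[1], b);
        /* ... all remaining reads of a happen here ... */
        r[0] = t0;
    }
    ```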

    Karatsuba might do better with manual scheduling, as you say, so I hope someone can try an asm version of it (well, of both versions, ideally).

  18. real-or-random cross-referenced this on Dec 1, 2022 from issue 64x64->64 muls are not constant-time with MSVC on 32bit x86 by real-or-random
  19. real-or-random added the label performance on May 9, 2023
  20. sipa commented at 3:06 pm on May 9, 2023: contributor

    If this isn’t faster on new compilers, it’s unclear whether we should spend time on this. @peterdettman is there a smaller set of changes that we could consider for benchmarking?

    With work progressing on fiat-crypto (https://github.com/bitcoin-core/secp256k1/issues/1261), changes like this may become more appealing if they can be integrated into their optimization framework, as that would simplify our job of convincing ourselves they are correct.

  21. peterdettman cross-referenced this on May 23, 2023 from issue fiat-crypto + CryptOpt tracking issue by real-or-random
