Change the type of uX in secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner to uint32_t, and cast to uint64_t only when necessary.
On x86 the difference was within the noise, but on ARM this reduces code size and required stack size, and increases performance significantly. Apparently GCC is less smart there, and treats the narrower type as a hint that uint32_t*uint32_t→uint64_t multiply instructions can be used.
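
To illustrate the pattern (a minimal sketch, not the actual field code; the helper names `mul_wide_operands` and `mul_narrow_operands` are made up for this example): both functions compute the same product for inputs that fit in 32 bits, but only the second makes the operand width explicit in the types, which is the kind of hint described above.

```c
#include <stdint.h>

/* Hypothetical "before" shape: operands are held in 64-bit variables.
 * Unless the compiler can prove the upper halves are zero, it has to
 * treat this as a full 64x64-bit multiplication. */
static uint64_t mul_wide_operands(uint64_t a, uint64_t b) {
    return a * b;
}

/* Hypothetical "after" shape: operands are 32-bit, and one of them is
 * cast to uint64_t only at the point of multiplication. The other operand
 * is promoted by the usual arithmetic conversions, so the result is the
 * exact 64-bit product of two 32-bit values. */
static uint64_t mul_narrow_operands(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}
```

Note that the cast has to be applied to an operand, not to the result: `(uint64_t)(a * b)` would multiply in 32 bits and truncate before widening.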
Before (asm):
| test | min | avg | max |
|---|---|---|---|
| inv | 384.321us | 384.537us | 385.125us |
| recover | 2845.161us | 2845.341us | 2845.426us |
| sign | 1324.141us | 1324.300us | 1324.453us |
| verify | 2672.787us | 2673.188us | 2673.498us |
After (asm):
| test | min | avg | max |
|---|---|---|---|
| inv | 384.400us | 384.614us | 385.357us |
| recover | 2549.572us | 2550.015us | 2552.387us |
| sign | 1223.006us | 1223.169us | 1223.313us |
| verify | 2410.431us | 2410.598us | 2410.885us |
E.g., verification goes from 2673us to 2411us on average, a speedup of about 10%.
I still intend to do a NEON implementation, but this picks a bit of low-hanging fruit that I expect to be significant on other embedded 32-bit architectures as well.