The existing secp256k1_scalar_inverse uses a straight-forward “blocks-of-1” approach, but the exponent here admits of a few tweaks.
Firstly, the pattern 01010 occurs 3 times, so if we have (u5=)x^5 available we can save 3M during accumulation. Actually u5 can be precalculated cheaply, replacing 1S with 1M.
Secondly, the addition chain invites optimization since it ends with 127 (a critical threshold) and also contains unnecessary (to the chain)x4 and un-accumulated x7. The addition chain 2, 3, 6, 8, 14, 28, 56, 112, 126 looks inviting… and fortuitously, armed with u5, we can replace [x127 0 x] with [x126 u5] and [x4 0 x] with [x3 u5](two occurrences) - for free. The new shorter addition chain saves 2M.
There should be ~256 squares and I counted 44 multiplies in the existing version, and a saving of 1S + 4M with this PR. Since performance of scalar mul/sqr are very similar in my config, I’d expect about 5/300 ~= 1.6% performance improvement, and measure roughly 1.5%.