This is faster than the exponentiation-based method that is used now. It remains much slower than libgmp, so it is probably not a full replacement, but it provides a nice speedup when libgmp is not available.
These are the benchmark times I measured:
scalar_inverse: min 14.4us / avg 16.4us / max 20.0us
scalar_inverse_var: min 11.5us / avg 11.6us / max 11.6us
Although a fair number of tricks are used to make this fast, there is still a lot of untapped potential, for instance doing several additions before reducing x0 and x1. More could be done, but this is already an improvement, so it is time for a PR.
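
For readers unfamiliar with the two families of inversion algorithms, here is a toy sketch, not the code in this PR. It contrasts inversion by exponentiation (Fermat's little theorem) with inversion by the extended Euclidean algorithm. It assumes a small 31-bit prime `P` so that everything fits in 64-bit integers, and the function names `inv_pow` and `inv_euclid` are made up for illustration. The `x0`/`x1` names mirror the ones mentioned above, but whether they play the same role in the actual implementation is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Toy modulus (assumption for illustration): the prime 2^31 - 1.
 * It is small enough that every product below fits in 64 bits. */
#define P 2147483647ULL

/* Inverse by exponentiation: a^(P-2) mod P, via square-and-multiply. */
static uint64_t inv_pow(uint64_t a) {
    uint64_t result = 1, base = a % P, e = P - 2;
    while (e != 0) {
        if (e & 1) result = result * base % P;
        base = base * base % P;
        e >>= 1;
    }
    return result;
}

/* Inverse by the extended Euclidean algorithm (variable time).
 * Loop invariant: r0 == x0 * a (mod P) and r1 == x1 * a (mod P). */
static uint64_t inv_euclid(uint64_t a) {
    int64_t r0 = (int64_t)P, r1 = (int64_t)(a % P);
    int64_t x0 = 0, x1 = 1;
    assert(r1 != 0); /* zero has no inverse */
    while (r1 != 0) {
        int64_t q = r0 / r1;
        int64_t t;
        t = r0 - q * r1; r0 = r1; r1 = t;
        t = x0 - q * x1; x0 = x1; x1 = t;
    }
    /* Here r0 == gcd(a, P) == 1, so x0 is the inverse up to sign. */
    if (x0 < 0) x0 += (int64_t)P;
    return (uint64_t)x0;
}
```

Both functions return the same value for any non-zero input, for example `inv_euclid(3) == inv_pow(3)`. The exponentiation version always performs about 31 squarings here, while the Euclidean version performs a data-dependent number of much cheaper steps, which is one plausible reason a GCD-style approach can come out ahead.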