Builds on #362 as an alternative to #262.
Compressed points are moved onto an isomorphism (maybe the twist) instead of calling sqrt. After the scalar mult, parallel rsqrt/inv is used to reconstruct the correct output y-coordinate in parallel with the inversion of the z-coordinate. Total overhead vs uncompressed input is ~1%.
This implementation foregoes the jacobi check, meaning that the actual scalar multiplication could take place on a quadratic twist. This admits points with x == 0 as a possibility, but still not y == 0; the twist has a 37-bit, odd, cofactor. If this occurs, the parallel rsqrt/inv at the end will catch it and reject it. I think ECDH peers can already waste each other’s time at will, so I don’t think the delayed catch is a problem.
Do we know whether our formulae work “correctly” for the twist? Here we actually just need them not to crash (e.g. if an x==0 or P==INF turns up) in the scalar mult. Of course, we could always put the jacobi test back in, which has ~2% additional overhead.
Note: The initial “decompression” produces (x,y) == (K.X, K^2) on the curve y^2 = x^3 + 7.K^3, with K != 0. If K is not a quadratic residue then this is a twist of the original curve.