Algorithm by Peter Dettman, with original comments:
Changes to `_divsteps_59` (`_30`) that give maybe a 4% speed improvement to const-time modinv on 64-bit. I see a larger gain for the 32-bit version, but it was measured on 64-bit, so it might not be real.
Start the result matrix scaled by 2^62 (resp. 2^30) and shift q, r down instead of u, v up at each step (should make life easier for vectorization). Since we’re always shifting away the LSB of g, q, r, we can avoid doing a full negation for x, y, z (after a few tweaks).
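The scaling change in the first sentence above can be illustrated with a minimal sketch. It is not the constant-time C code: it uses plain branches and the textbook $\delta$-based divstep rule (starting from $\delta = 1$), and the function names `divsteps_shift_up` and `divsteps_shift_down` are made up for this illustration. It only shows that starting from $2^N$ times the identity and shifting q, r down produces the same final matrix as starting from the identity and shifting u, v up.

```python
def _step(delta, f, g, u, v, q, r):
    """One divstep without the final halving (plain branches, not constant time)."""
    if delta > 0 and g & 1:
        # Swap case: the new f-row is the old g-row; the new g-row is (g-row - f-row).
        return 1 - delta, g, g - f, q, r, q - u, r - v
    if g & 1:
        # g is odd: add the f-row to the g-row.
        return 1 + delta, f, g + f, u, v, q + u, r + v
    # g is even: nothing to add.
    return 1 + delta, f, g, u, v, q, r


def divsteps_shift_up(n, delta, f, g):
    """Start from the identity; double u, v at every step."""
    u, v, q, r = 1, 0, 0, 1
    for _ in range(n):
        delta, f, g, u, v, q, r = _step(delta, f, g, u, v, q, r)
        g, u, v = g >> 1, u << 1, v << 1
    return delta, f, g, (u, v, q, r)


def divsteps_shift_down(n, delta, f, g):
    """Start from 2^n times the identity; halve q, r at every step."""
    u, v, q, r = 1 << n, 0, 0, 1 << n
    for _ in range(n):
        delta, f, g, u, v, q, r = _step(delta, f, g, u, v, q, r)
        # g, q and r are all even at this point, so the shifts are exact.
        g, q, r = g >> 1, q >> 1, r >> 1
    return delta, f, g, (u, v, q, r)
```

Both functions return a matrix that satisfies `u*f0 + v*g0 == f << n` and `q*f0 + r*g0 == g << n` for the input pair `(f0, g0)` with `f0` odd. In the shift-down form the per-step shifts act on g, q and r together, which matches the remark about vectorization.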
A new variable $\theta = \delta - 1/2$ is then introduced, which is slightly cheaper than the $\zeta = -\delta - 1/2$ used before.
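For orientation, the two variables are related by simple algebra. The following assumes that $\delta$ only takes half-integer values (so that both $\theta$ and $\zeta$ are integers), which is what the two definitions suggest but is not stated explicitly here:

$$
\theta + \zeta = \left(\delta - \tfrac{1}{2}\right) + \left(-\delta - \tfrac{1}{2}\right) = -1
\quad\Longrightarrow\quad
\theta = -\zeta - 1,
$$

$$
\delta > 0 \iff \theta \ge 0 \iff \zeta < 0 .
$$

Under that assumption, the sign test on $\delta$ becomes a sign test on either integer variable.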