Changes to _divsteps_59 (and _30) that give roughly a 4% speed improvement to constant-time modinv on 64-bit. I see a larger gain for the 32-bit code, but that was measured on 64-bit, so it might not be real.
Start the result matrix scaled by 2^62 (resp. 2^30) and, at each step, shift q, r down instead of shifting u, v up (this should make life easier for vectorization). Since we're always shifting away the LSB of g, q, r, we can avoid doing a full negation for x, y, z (after a few tweaks). With that in place, it makes sense to switch zeta back to delta (I have confined this change to the local method for the moment).
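
To illustrate just the matrix-scaling idea, here is a minimal, variable-time sketch. It is not the code in this PR: it branches on secret data, uses plain int64_t arithmetic, and runs an illustrative 30 steps with the textbook delta rule starting at 1. All names (NSTEPS, mat2x2, swap_and_add, divsteps_shift_up, divsteps_shift_down, variants_agree) are hypothetical. It compares a baseline that starts from the identity and doubles u, v each step against a variant that starts from 2^NSTEPS times the identity and halves q, r each step. Under these assumptions both should end with the same matrix, because q and r stay even until the last step, so the halving is exact.

```c
#include <stdint.h>

#define NSTEPS 30  /* illustrative step count; the matrix scale is 2^NSTEPS */

typedef struct { int64_t u, v, q, r; } mat2x2;

/* Conditional swap and conditional add, shared by both variants.
 * Variable-time on purpose: this is a readability sketch only. */
static void swap_and_add(int64_t *delta, int64_t *f, int64_t *g, mat2x2 *t) {
    int64_t tmp;
    if ((*g & 1) && *delta > 0) {
        *delta = -*delta;
        tmp = *f;   *f = *g;     *g = -tmp;
        tmp = t->u; t->u = t->q; t->q = -tmp;
        tmp = t->v; t->v = t->r; t->r = -tmp;
    }
    if (*g & 1) { *g += *f; t->q += t->u; t->r += t->v; }
}

/* Baseline: identity start, u and v are shifted up at every step. */
static void divsteps_shift_up(int64_t *delta, int64_t *f, int64_t *g, mat2x2 *t) {
    int i;
    t->u = 1; t->v = 0; t->q = 0; t->r = 1;
    for (i = 0; i < NSTEPS; ++i) {
        swap_and_add(delta, f, g, t);
        *g /= 2;               /* g is even here, so this is exact */
        t->u *= 2; t->v *= 2;
        ++*delta;
    }
}

/* Variant: pre-scaled start, q and r are shifted down at every step. */
static void divsteps_shift_down(int64_t *delta, int64_t *f, int64_t *g, mat2x2 *t) {
    int i;
    t->u = (int64_t)1 << NSTEPS; t->v = 0;
    t->q = 0; t->r = (int64_t)1 << NSTEPS;
    for (i = 0; i < NSTEPS; ++i) {
        swap_and_add(delta, f, g, t);
        *g /= 2;               /* g, q, r are all even here */
        t->q /= 2; t->r /= 2;
        ++*delta;
    }
}

/* Returns 1 if both variants give the same state and the matrix satisfies
 * M * (f0, g0) = 2^NSTEPS * (f, g). Requires f0 odd and |f0|, |g0| < 2^31. */
static int variants_agree(int64_t f0, int64_t g0) {
    const int64_t scale = (int64_t)1 << NSTEPS;
    int64_t da = 1, fa = f0, ga = g0;
    int64_t db = 1, fb = f0, gb = g0;
    mat2x2 a, b;
    divsteps_shift_up(&da, &fa, &ga, &a);
    divsteps_shift_down(&db, &fb, &gb, &b);
    return a.u == b.u && a.v == b.v && a.q == b.q && a.r == b.r
        && fa == fb && ga == gb && da == db
        && a.u * f0 + a.v * g0 == fa * scale
        && a.q * f0 + a.r * g0 == ga * scale;
}
```

In the shift-down form the u, v row is only ever swapped or added, never shifted, and the three values that get shifted (g, q, r) all move in the same direction by the same amount.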