Use a “borrowing” trick in _fe_normalize to simplify the handling of values in [P, 2^256). This is significantly faster too (according to bench_internal), though performance of _fe_normalize isn’t that important.
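For illustration, here is a minimal sketch of one way such a trick can handle values in [P, 2^256). It is not the code in this PR: the four 64-bit little-endian limb layout and the name `fe_normalize_sketch` are assumptions made for the example. Since P = 2^256 - C with C = 2^32 + 977, adding C carries out of bit 256 exactly when the value is at least P, and the low 256 bits of the sum are then already the reduced value.

```c
#include <stdint.h>

/* C = 2^256 - P = 2^32 + 977 for the secp256k1 field prime P. */
#define FE_C 0x1000003D1ULL

/* Hypothetical helper (not the library's function): reduce a 256-bit value,
 * held as four little-endian 64-bit limbs, into the range [0, P). */
static void fe_normalize_sketch(uint64_t r[4]) {
    uint64_t t[4], carry, mask;
    int i;

    /* t = (r + C) mod 2^256, tracking the carry through each limb. */
    t[0] = r[0] + FE_C;
    carry = t[0] < r[0];
    for (i = 1; i < 4; i++) {
        t[i] = r[i] + carry;
        carry = t[i] < carry;
    }

    /* A carry out of the top limb means r >= P, and then t == r - P.
     * Select t or r without branching on the (possibly secret) value. */
    mask = (uint64_t)0 - carry;
    for (i = 0; i < 4; i++) {
        r[i] = (t[i] & mask) | (r[i] & ~mask);
    }
}
```

The point of the sketch is that no limb-by-limb comparison against P is needed: a single addition both detects the out-of-range case and produces the reduced result.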
Applying a similar idea to _fe_normalizes_to_zero also gives a speedup. In this case signing (and ECDH) appears to get 2% or more faster, because _gej_add_ge calls _fe_normalizes_to_zero three times.
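As a rough illustration of why the zero test can be cheaper than a full normalization: it only has to decide whether the value is 0 or P, not produce the reduced result. The sketch below assumes a value below 2^256 stored as four little-endian 64-bit limbs; that layout and the name `fe_normalizes_to_zero_sketch` are hypothetical, and this is not the code in this PR.

```c
#include <stdint.h>

/* Hypothetical helper (not the library's function): returns 1 if the 256-bit
 * value r (four little-endian 64-bit limbs) is congruent to 0 mod P, that is,
 * if r == 0 or r == P, and 0 otherwise. */
static int fe_normalizes_to_zero_sketch(const uint64_t r[4]) {
    /* z0 is zero only when every limb is zero. */
    uint64_t z0 = r[0] | r[1] | r[2] | r[3];

    /* P = 2^256 - (2^32 + 977): its low limb is ~0x1000003D0 and its other
     * limbs are all ones, so z1 is all ones only when r == P. */
    uint64_t z1 = (r[0] ^ 0x1000003D0ULL) & r[1] & r[2] & r[3];

    return (z0 == 0) | (z1 == 0xFFFFFFFFFFFFFFFFULL);
}
```

Both accumulators are built with plain bitwise operations on the limbs, so the test needs no conditional subtraction and no write-back of a normalized value.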