Please refer to field_5x52_int128_impl.h, secp256k1_fe_sqr_inner method, the reduction part at the end (comments should also apply to secp256k1_fe_mul_inner, and the issue may exist for other field implementations, but here I discuss ‘64bit’).
The docs/contract for secp256k1_fe_sqr say that the output will be magnitude 1, and other comments explain that the elements are therefore limited to M_(2^53-1) for all limbs except the first (M_(2^49-1)). Yet, by inspection the reduction in _sqr_inner appears insufficient (t1’s value is unconstrained when “r[1] = t1 + c”). Indeed, inputs can be easily found to reveal the problem, as demonstrated in the test case with this PR.
(I emphasise that the output value is mathematically correct, but violates the limits for magnitude 1.)
e.g. sqr(-65537) leads to a carry into t1, which is already 0xFFFFFFFFFFFFF, and the carry does not propagate. Indeed, in this case, it would have to propagate all the way up to t4 and back to t0 again! The test case samples some other problematic inputs across a large range of bit lengths; it is not an isolated case, and the carry into t1(==0xFF….) can be larger than 1.
It’s not clear whether this can actually cause higher-level failures. Negation uses (M+1), normalisation seems unaffected. In theory, mul_int(x, 4096) would overflow unexpectedly! mul_int(x, 8) could lead to a value that was in theory too large for mul/sqr input, but in practice those methods tolerate larger values. If there were a 9x29 field, there would definitely be problems, with only 3 bits to spare.
It would be unfortunate to have to add an entire extra reduction round in sqr/mul, but I think an earlier reduction of t9 might help (similar to the normalization changes from https://github.com/bitcoin/secp256k1/pull/24). Not sure if I’d be competent to attempt the asm version, assuming it is similarly affected.
I suppose it might be considered to “fix” this by adjusting the magnitude system or method contracts, so I await a discussion before proceeding.