Suggested here by Andy Polyakov.
This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.
Good news: tests pass, including valgrind ct tests.
Bad news: benchmarks. In each pair of runs below, the first set of numbers is before the patch and the second is after.
x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us

scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us

scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us

scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us

scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us
@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and maybe that's the right way to do it.
These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...
x86_64 32-bit GCC 10.2.1
At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)
Bad news:
Life is compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch, and double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.
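To make that tradeoff concrete, here is a minimal sketch (hypothetical helper names, not taken from the actual patch) contrasting a conditional carry with a double-width addition:

```c
#include <stdint.h>

/* Conditional form: the comparison gives the compiler an
 * incentive to emit a branch for the carry. */
static void add_carry_cond(uint64_t *lo, uint64_t *hi, uint64_t x) {
    *lo += x;
    if (*lo < x) (*hi)++; /* carry out of the low limb */
}

/* Double-width form: a single 128-bit addition, with no
 * condition for the compiler to turn into a branch. */
static void add_carry_wide(uint64_t *lo, uint64_t *hi, uint64_t x) {
    unsigned __int128 t = (unsigned __int128)*lo + x;
    *lo = (uint64_t)t;
    *hi += (uint64_t)(t >> 64);
}
```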
What’s the exact x86 CPU that you used to run the benchmark?
EPYC 7742
@real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact.
Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).
I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKP protocols, etc.
I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.
get rid of uint128_t in the field code,
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.
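For illustration, this is the kind of widening multiply in question; a sketch assuming the GCC/Clang __int128 extension:

```c
#include <stdint.h>

/* 64x64->128 multiply, keeping only the high 64 bits. On x86_64
 * and ARMv8 this typically compiles to a single widening multiply
 * instruction (e.g. UMULH on ARMv8). */
static uint64_t mul_high64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```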
I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.
Indeed, and then #453 is the way to go.
get rid of uint128_t in the field code,
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.
Ok, true.
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance.
Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another.
But I don't think that's a problem - even MSVC has _umul128 that effectively implements that; just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and to _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.
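A hedged sketch of what such configuration-dependent macros could look like (the MUL128 name is made up for illustration; _umul128 is the real MSVC x64 intrinsic from <intrin.h>):

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC x64: _umul128 returns the low 64 bits of the product and
 * writes the high 64 bits through the out-pointer. */
#define MUL128(hi, lo, a, b) do { \
    (lo) = _umul128((a), (b), &(hi)); \
} while (0)
#else
/* GCC/Clang: use the __int128 extension directly. */
#define MUL128(hi, lo, a, b) do { \
    unsigned __int128 mul128_t = (unsigned __int128)(a) * (b); \
    (lo) = (uint64_t)mul128_t; \
    (hi) = (uint64_t)(mul128_t >> 64); \
} while (0)
#endif
```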