Suggested here by Andy Polyakov.
This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.
Good news: tests pass, including valgrind ct tests.
Bad news: benchmarks. In each pair of runs below, the first set of numbers is before the patch and the second is after.
x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
scalar_sqr: min 0.0421us / avg 0.0422us / max 0.0423us
scalar_mul: min 0.0410us / avg 0.0410us / max 0.0411us
scalar_inverse: min 12.2us / avg 12.2us / max 12.2us

scalar_sqr: min 0.0456us / avg 0.0457us / max 0.0457us
scalar_mul: min 0.0433us / avg 0.0434us / max 0.0434us
scalar_inverse: min 13.2us / avg 13.2us / max 13.2us

x86_64 32-bit GCC 10.2.1
scalar_sqr: min 0.122us / avg 0.122us / max 0.123us
scalar_mul: min 0.109us / avg 0.109us / max 0.109us
scalar_inverse: min 35.0us / avg 35.0us / max 35.0us

scalar_sqr: min 0.0961us / avg 0.0962us / max 0.0965us
scalar_mul: min 0.110us / avg 0.110us / max 0.110us
scalar_inverse: min 28.6us / avg 28.6us / max 28.6us

ARMv8 64-bit GCC 9.3
scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.222us / avg 0.222us / max 0.222us
scalar_inverse: min 61.1us / avg 61.1us / max 61.2us

scalar_sqr: min 0.209us / avg 0.209us / max 0.209us
scalar_mul: min 0.237us / avg 0.237us / max 0.237us
scalar_inverse: min 61.6us / avg 61.6us / max 61.7us

ARMv8 32-bit GCC 9.3
scalar_sqr: min 0.399us / avg 0.399us / max 0.400us
scalar_mul: min 0.388us / avg 0.388us / max 0.388us
scalar_inverse: min 115us / avg 115us / max 115us

scalar_sqr: min 0.392us / avg 0.393us / max 0.394us
scalar_mul: min 0.471us / avg 0.473us / max 0.483us
scalar_inverse: min 117us / avg 117us / max 118us
@sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and maybe that's the right way to do it.
These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well...
x86_64 32-bit GCC 10.2.1
At least it's faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason why MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)
Bad news:
Life is compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch, and double-width additions achieve that, unlike conditionals. So you have to ask yourself what's more important. Besides, formally speaking, a data point from just one compiler isn't that conclusive.
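To make that tradeoff concrete, here is a minimal sketch (hypothetical helper names, not taken from the actual patch) contrasting a conditional carry with a double-width addition:

```c
#include <stdint.h>

/* Conditional form: the comparison gives the compiler an
 * incentive to emit a branch for the carry. */
static void add_carry_cond(uint64_t *lo, uint64_t *hi, uint64_t x) {
    *lo += x;
    if (*lo < x) (*hi)++; /* carry out of the low limb */
}

/* Double-width form: a single 128-bit addition, with no
 * condition for the compiler to turn into a branch. */
static void add_carry_wide(uint64_t *lo, uint64_t *hi, uint64_t x) {
    unsigned __int128 t = (unsigned __int128)*lo + x;
    *lo = (uint64_t)t;
    *hi += (uint64_t)(t >> 64);
}
```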
What’s the exact x86 CPU that you used to run the benchmark?
EPYC 7742
@real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact.
Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).
I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKP protocols, etc.
I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.
get rid of uint128_t in the field code,
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.
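For illustration, this is the kind of widening multiply in question; a sketch assuming the GCC/Clang __int128 extension:

```c
#include <stdint.h>

/* 64x64->128 multiply, keeping only the high 64 bits. On x86_64
 * and ARMv8 this typically compiles to a single widening multiply
 * instruction (e.g. UMULH on ARMv8). */
static uint64_t mul_high64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}
```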
I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.
Indeed, and then #453 is the way to go.
get rid of uint128_t in the field code,
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.
Ok, true.
AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance.
Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another.
But I don't think that's a problem - even MSVC has _umul128 that effectively implements that; just not as a neat 128-bit data type.

I also think it's orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and to _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.
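A hedged sketch of what such configuration-dependent macros could look like (the MUL128 name is made up for illustration; _umul128 is the real MSVC x64 intrinsic from <intrin.h>):

```c
#include <stdint.h>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
/* MSVC x64: _umul128 returns the low 64 bits of the product and
 * writes the high 64 bits through the out-pointer. */
#define MUL128(hi, lo, a, b) do { \
    (lo) = _umul128((a), (b), &(hi)); \
} while (0)
#else
/* GCC/Clang: use the __int128 extension directly. */
#define MUL128(hi, lo, a, b) do { \
    unsigned __int128 mul128_t = (unsigned __int128)(a) * (b); \
    (lo) = (uint64_t)mul128_t; \
    (hi) = (uint64_t)(mul128_t >> 64); \
} while (0)
#endif
```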