Use double-wide types for additions in scalar code #786

pull sipa wants to merge 1 commit into bitcoin-core:master from sipa:202008_wideadd changing 2 files +66 −96
  1. sipa commented at 1:22 am on August 7, 2020: contributor

    Suggested here by Andy Polyakov.

    This is shorter, easier to reason about, less likely to contain branches, and likely (slightly) faster too.
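
    As a rough illustration of the pattern (a minimal sketch, not the actual secp256k1 scalar macros; the function names and 64-bit limb width are assumptions here): adding two limbs and extracting the carry, first via a comparison and then via a double-wide type.

    #include <stdint.h>

    /* Conditional form: the carry comes from a comparison, which a compiler
     * is free to lower to a branch. */
    static uint64_t add_carry_cond(uint64_t *r, uint64_t a, uint64_t b) {
        *r = a + b;
        return *r < a ? 1 : 0;      /* carry out */
    }

    /* Double-wide form: the addition is done in unsigned __int128, so the
     * carry is simply the high half; no comparison is needed. */
    static uint64_t add_carry_wide(uint64_t *r, uint64_t a, uint64_t b) {
        unsigned __int128 t = (unsigned __int128)a + b;
        *r = (uint64_t)t;
        return (uint64_t)(t >> 64); /* carry out */
    }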

  2. sipa cross-referenced this on Aug 7, 2020 from issue Improve constant-timeness on PowerPC by real-or-random
  3. sipa commented at 1:25 am on August 7, 2020: contributor
    @gmaxwell You may want to test performance/branchlessness on various platforms.
  4. sipa force-pushed on Aug 7, 2020
  5. sipa force-pushed on Aug 7, 2020
  6. Use double-wide types for additions in scalar code 3bfb85a7a8
  7. sipa force-pushed on Aug 7, 2020
  8. gmaxwell commented at 2:56 am on August 7, 2020: contributor

    Good news: tests pass, including valgrind ct tests.

    Bad news:

    First is before the patch.

    x86_64 64-bit GCC 10.2.1 (asm disabled, obviously)
      before: scalar_sqr:     min 0.0421us / avg 0.0422us / max 0.0423us
              scalar_mul:     min 0.0410us / avg 0.0410us / max 0.0411us
              scalar_inverse: min 12.2us   / avg 12.2us   / max 12.2us
      after:  scalar_sqr:     min 0.0456us / avg 0.0457us / max 0.0457us
              scalar_mul:     min 0.0433us / avg 0.0434us / max 0.0434us
              scalar_inverse: min 13.2us   / avg 13.2us   / max 13.2us

    x86_64 32-bit GCC 10.2.1
      before: scalar_sqr:     min 0.122us  / avg 0.122us  / max 0.123us
              scalar_mul:     min 0.109us  / avg 0.109us  / max 0.109us
              scalar_inverse: min 35.0us   / avg 35.0us   / max 35.0us
      after:  scalar_sqr:     min 0.0961us / avg 0.0962us / max 0.0965us
              scalar_mul:     min 0.110us  / avg 0.110us  / max 0.110us
              scalar_inverse: min 28.6us   / avg 28.6us   / max 28.6us

    ARMv8 64-bit GCC 9.3
      before: scalar_sqr:     min 0.209us / avg 0.209us / max 0.209us
              scalar_mul:     min 0.222us / avg 0.222us / max 0.222us
              scalar_inverse: min 61.1us  / avg 61.1us  / max 61.2us
      after:  scalar_sqr:     min 0.209us / avg 0.209us / max 0.209us
              scalar_mul:     min 0.237us / avg 0.237us / max 0.237us
              scalar_inverse: min 61.6us  / avg 61.6us  / max 61.7us

    ARMv8 32-bit GCC 9.3
      before: scalar_sqr:     min 0.399us / avg 0.399us / max 0.400us
              scalar_mul:     min 0.388us / avg 0.388us / max 0.388us
              scalar_inverse: min 115us   / avg 115us   / max 115us
      after:  scalar_sqr:     min 0.392us / avg 0.393us / max 0.394us
              scalar_mul:     min 0.471us / avg 0.473us / max 0.483us
              scalar_inverse: min 117us   / avg 117us   / max 118us

  9. real-or-random commented at 8:20 am on August 7, 2020: contributor

    @sipa Do you remember #453? It goes in the same direction but takes a more radical approach, and maybe that’s the right way to do it.

    These benchmarks also remind me of my comment in #436. Optimizers seem to like these macros particularly well…

    x86_64 32-bit GCC 10.2.1

    At least it’s faster here. Hm, now I (somewhat seriously) wonder whether we should do the opposite and get rid of uint128_t in the field code, which is the only reason MSVC does not support our 64-bit code. (Even though there are other ways to fix this.)

  10. dot-asm commented at 8:50 am on August 7, 2020: none

    Bad news:

    Life is compromise. Looking back at the originating question, the main idea behind the suggestion was to leave the compiler no incentive to throw in a branch, and double-width additions achieve that, unlike conditionals. So you have to ask yourself what’s more important. Besides, formally speaking, a data point from just one compiler isn’t that conclusive.

  11. real-or-random commented at 9:25 am on August 7, 2020: contributor
    By the way, if anyone is interested: https://godbolt.org/z/8GThbv (no. 1 before, no. 2 after the patch). It’s interesting to see how far llvm-mca’s prediction is from reality. @gmaxwell What’s the exact x86 CPU that you used to run the benchmark?
  12. gmaxwell commented at 3:54 pm on August 7, 2020: contributor

    What’s the exact x86 CPU that you used to run the benchmark?

    EPYC 7742

  13. real-or-random commented at 5:41 pm on August 7, 2020: contributor
    I forgot to mention that I don’t think these benchmarks are a showstopper. If we think that the code is better (for readability and no branching), I’m fine with this.
  14. sipa commented at 5:50 pm on August 7, 2020: contributor

    @real-or-random Agree, though it may inform us about trying variants that perhaps have less performance impact.

    Another point is that with the divstep inversion code, scalar mul/sqr performance will be almost entirely irrelevant for inversion (which I expect is the only way scalar operations matter to high-level functions like signing and verification).

  15. gmaxwell commented at 6:17 pm on August 7, 2020: contributor

    I think inversion is just interesting to look at because it gives a single number. Scalar mul performance is more important in the context of batch validation, some ZKPs, etc.

    I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.

    get rid of uint128_t in the field code,

    AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

  16. real-or-random commented at 6:28 pm on August 7, 2020: contributor

    I think it’s worth tweaking to see if the performance can be recovered. But also: the ASM is clearly a fair bit faster, so I think that worrying about the exact fastest form of the C code isn’t that important.

    Indeed, and then #453 is the way to go.

    get rid of uint128_t in the field code,

    AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance. The above argument I gave for “just use asm” doesn’t apply because it’s a big difference and there is not likely to be ASM written for ARM8 (or Riscv64, Power9, etc.) any time soon.

    Ok, true.

  17. sipa commented at 8:24 pm on August 7, 2020: contributor

    AFAIK you can’t do that without losing access to the 64,64->high64 widening multiply which is important for performance.

    Yes, there is no way we can have true 64-bit operation without the ability to do a 64x64->128 bit multiplication in one way or another.

    But I don’t think that’s a problem - even MSVC has _umul128 that effectively implements that; just not as a neat 128-bit data type.

    I also think it’s orthogonal to the changes here. If we wanted to support non-__int128 platforms, the code would be replaced by some set of configuration-dependent macros that translate to __int128 operations on supporting platforms, and _umul128 on pairs of uint64_t on MSVC - but the changes here would still matter.
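
    As a rough sketch of what such a macro layer could look like (the macro name here is hypothetical, not actual secp256k1 code), mapping a 64x64->128 multiply to either unsigned __int128 or MSVC’s _umul128:

    #include <stdint.h>

    #if defined(_MSC_VER) && defined(_M_X64)
    #include <intrin.h>
    /* MSVC x64: _umul128 returns the low 64 bits and stores the high 64 bits
     * through the pointer argument. */
    #define MUL_64X64(hi, lo, a, b) do { (lo) = _umul128((a), (b), &(hi)); } while (0)
    #else
    /* Compilers with unsigned __int128 (GCC, Clang): use the wide type directly. */
    #define MUL_64X64(hi, lo, a, b) do { \
        unsigned __int128 t_ = (unsigned __int128)(a) * (b); \
        (hi) = (uint64_t)(t_ >> 64); \
        (lo) = (uint64_t)t_; \
    } while (0)
    #endif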

  18. sipa commented at 3:51 pm on May 8, 2023: contributor
    This would need a redo on top of #1000, but it’s not clear to me that it’s worth it. Closing.
  19. sipa closed this on May 8, 2023

