[WIP] Add intel simd #1703

Raimo33 wants to merge 3 commits into bitcoin-core:master from Raimo33:simd, changing 11 files (+666 −225)
  1. Raimo33 commented at 11:26 am on July 18, 2025: none

    This adds AVX and AVX2 intrinsics across the library, as discussed in #1700, wherever the benchmarks show an improvement.

    Why not sse and avx512?

    • sse is only useful in the 32-bit code path (USE_FORCE_WIDEMUL_INT64=1), but in practice almost all 32-bit targets are armv7 CPUs, which are not x86 and have no SSE.
    • avx512 would not be beneficial anywhere, and thermal throttling is a problem on most CPUs that support it.

    arm has a different SIMD instruction set; it would be nice to have a separate PR implementing that as well, maybe after this is merged.

    Tasks

    • Add CI/CD flows with permutations of -mavx, -mavx2, -mno-avx, -mno-avx2 when building for amd64
    • Precompute vectors at startup (the ones marked with TODO: precompute )

    Commits

    I’ve split this PR into multiple commits with the following criteria:

    • non-simd optimizations & style changes
    • simd optimizations
    • temporary scripts for development
    • CI flows

    Test & Benchmark

    To reproduce the following results, I temporarily added three scripts (for building, testing, and benchmarking) as well as a Jupyter notebook to visualize the results. You can verify them yourself by running ./simd-build.sh && ./simd-test.sh && ./simd-bench.sh and executing the notebook as is.

    The baseline is compiled with "-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__", so that gcc's own auto-vectorization is still allowed but my manual vectorization is not compiled in.
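
    The effect of these flags can be illustrated with a minimal sketch. This is not code from the PR, and add_u64x4 is a hypothetical name. The intrinsics path is guarded by __AVX2__, so undefining the macro selects the scalar fallback, while -mavx2 still lets the compiler vectorize that fallback on its own.

```c
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical helper: r[i] = a[i] + b[i] for four 64-bit limbs. */
static void add_u64x4(uint64_t r[4], const uint64_t a[4], const uint64_t b[4]) {
#ifdef __AVX2__
    /* Manual vectorization: compiled only when __AVX2__ is defined. */
    __m256i va = _mm256_loadu_si256((const __m256i *)(const void *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)(const void *)b);
    _mm256_storeu_si256((__m256i *)(void *)r, _mm256_add_epi64(va, vb));
#else
    /* Scalar fallback: with -O3 -mavx2 the compiler may still vectorize this loop. */
    int i;
    for (i = 0; i < 4; i++) {
        r[i] = a[i] + b[i];
    }
#endif
}
```

    Both paths produce the same result, including wraparound on overflow, so a benchmark of the two builds compares only the code generation.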

    Results

  2. Raimo33 renamed this:
    Add simd
    Add intel simd
    on Jul 18, 2025
  3. Raimo33 commented at 3:18 pm on July 18, 2025: none

    To precompute simd constants at the start, the best solution I found was doing something like this:

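    #ifdef __SSE2__
    #include <emmintrin.h>
    /* Constant vector, filled in once at load time by simd_init(). */
    static __m128i _128_vec_ones;
    #endif

    /* Runs before main() when CONSTRUCTOR expands to a constructor attribute. */
    CONSTRUCTOR void simd_init(void)
    {
    #ifdef __SSE2__
      /* Note: '1' is the character constant (0x31 in ASCII), not the integer 1. */
      _128_vec_ones = _mm_set1_epi8('1');
    #endif
    }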
    

    where CONSTRUCTOR is __attribute__((constructor))

  4. Raimo33 commented at 5:00 pm on July 18, 2025: none

    I’m constantly getting these warnings. Apparently they’re harmless since I always use loadu and storeu, but for some reason the compiler doesn’t like them.

    warning: cast increases required alignment of target type [-Wcast-align]
      653 |         _mm256_storeu_si256((__m256i *)r->v, out);
    

    The only fixes I found are:

    1. align everything to 64 bytes (impossible; it even breaks some of my avx logic)
    2. suppress the warning globally
    3. suppress the warning inline each time
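
    A fourth option, sketched here as an assumption rather than as what the PR does, is to route the cast through void *, which is a common way to keep -Wcast-align quiet for unaligned load/store intrinsics. The wrapper name store_u64x4 is hypothetical. An inline suppression with #pragma GCC diagnostic push / ignored "-Wcast-align" / pop around the call would be the pragma-based equivalent.

```c
#include <stdint.h>
#include <string.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical wrapper: unaligned 32-byte store into a uint64_t array. */
static void store_u64x4(uint64_t *dst, const uint64_t src[4]) {
#ifdef __AVX2__
    __m256i v = _mm256_loadu_si256((const __m256i *)(const void *)src);
    /* Going through void * avoids the direct uint64_t * -> __m256i * cast
     * that -Wcast-align reports; storeu itself has no alignment requirement. */
    _mm256_storeu_si256((__m256i *)(void *)dst, v);
#else
    /* Portable fallback when AVX2 is not enabled at compile time. */
    memcpy(dst, src, 4 * sizeof(uint64_t));
#endif
}
```

    Keeping the cast in one small wrapper means the workaround lives in a single place instead of at every call site.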
  5. Raimo33 force-pushed on Aug 1, 2025
  6. Raimo33 renamed this:
    Add intel simd
    [WIP] Add intel simd
    on Aug 23, 2025
  7. Raimo33 force-pushed on Aug 27, 2025
  8. Raimo33 force-pushed on Aug 27, 2025
  9. Raimo33 force-pushed on Aug 28, 2025
  10. Raimo33 force-pushed on Aug 28, 2025
  11. Raimo33 force-pushed on Aug 30, 2025
  12. Raimo33 force-pushed on Aug 30, 2025
  13. Raimo33 force-pushed on Aug 30, 2025
  14. Raimo33 force-pushed on Aug 30, 2025
  15. Raimo33 force-pushed on Aug 31, 2025
  16. Raimo33 force-pushed on Aug 31, 2025
  17. Raimo33 force-pushed on Sep 1, 2025
  18. real-or-random commented at 9:15 am on September 2, 2025: contributor

    I built this and ran the benchmarks on my machine (12th Gen Intel(R) Core(TM) i7-1260P)

    Signing became faster, but verification became slower. After looking at the bench_internal results, I figured that the culprit is secp256k1_fe_impl_negate_unchecked, which makes the field_sqrt benchmark slower and hence verification. When I disable your ifdefs in secp256k1_fe_impl_negate_unchecked, I get consistently faster results across benchmarks (except schnorrsig_verify, not sure what’s going on there).

    Benchmark results for INT128
    Generated on 2025-09-02T09:56:29 CEST
    Iterations: 20000

    Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

    ecdsa_verify                  ,    54.8       ,    55.2       ,    56.7
    ecdsa_sign                    ,    34.8       ,    34.9       ,    35.0
    ec_keygen                     ,    24.2       ,    24.2       ,    24.2
    ecdh                          ,    51.0       ,    51.4       ,    53.6
    schnorrsig_sign               ,    25.6       ,    25.6       ,    25.7
    schnorrsig_verify             ,    55.5       ,    56.0       ,    57.5
    ellswift_encode               ,    31.1       ,    31.1       ,    31.2
    ellswift_decode               ,    13.4       ,    13.4       ,    13.4
    ellswift_keygen               ,    55.0       ,    55.0       ,    55.0
    ellswift_ecdh                 ,    56.5       ,    56.8       ,    58.7


    Benchmark results for INT128_SSE2_AVX2
    Generated on 2025-09-02T10:00:33 CEST
    Iterations: 20000

    Benchmark                     ,    Min(us)    ,    Avg(us)    ,    Max(us)

    ecdsa_verify                  ,    54.6       ,    54.9       ,    56.4
    ecdsa_sign                    ,    33.5       ,    33.6       ,    33.6
    ec_keygen                     ,    23.0       ,    23.0       ,    23.1
    ecdh                          ,    50.6       ,    50.6       ,    50.7
    schnorrsig_sign               ,    24.5       ,    24.5       ,    24.5
    schnorrsig_verify             ,    55.5       ,    56.1       ,    59.5
    ellswift_encode               ,    31.2       ,    31.2       ,    31.2
    ellswift_decode               ,    13.4       ,    13.4       ,    13.4
    ellswift_keygen               ,    54.2       ,    54.2       ,    54.2
    ellswift_ecdh                 ,    56.2       ,    56.3       ,    56.4
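
    The relative difference between the two runs can be derived from the Avg(us) columns. A minimal sketch, with a hypothetical helper name:

```c
/* Relative saving in percent when a benchmark goes from `before` to `after`
 * (both in microseconds). Positive means the second run is faster. */
static double saving_percent(double before, double after) {
    return (before - after) / before * 100.0;
}
```

    With the averages above, this gives roughly 0.5% for ecdsa_verify (55.2 vs 54.9), 3.7% for ecdsa_sign (34.9 vs 33.6) and 5.0% for ec_keygen (24.2 vs 23.0).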
    

    This currently saves ~0.5% in ecdsa_verify, ~4% in ecdsa_sign and ~5% in ec_keygen. The latter is nice, but I was hoping for more in verification. Perhaps with negation “fixed”, things can be improved further. At that point, my feeling is that it’s hard to say whether this is worth the hassle. We’d also need some CPU id code etc. to make this useful in practice.
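
    As a rough idea of what such CPU id code could look like, here is a minimal sketch that assumes GCC or Clang on x86 and uses their __builtin_cpu_supports builtin. The function name cpu_has_avx2 is hypothetical, and other compilers or architectures simply report "no AVX2" in this sketch.

```c
/* Returns 1 if the running CPU reports AVX2 support, 0 otherwise
 * (including when the compiler or architecture is not covered here). */
static int cpu_has_avx2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    /* Initializes the CPU feature data used by __builtin_cpu_supports. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}
```

    A caller could evaluate this once at startup and select the AVX2 or the portable code path accordingly. How that would be wired into the library is left open here.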

    Some more comments:

    • I don’t think we should bother with 32-bit x86. The 32-bit code is certainly interesting to some users (hardware wallets, raspberry pi, … ) but they are almost certainly not on 32-bit x86. (Unless I’m mistaken.)
    • There’s a lot to gain for SHA256, but then we should rely on the SHA instruction set directly. This is probably the lowest hanging fruit for performance in the library, but that’s a separate project. Bitcoin Core has a nice SHA256 implementation, with many backends (SHA, SHANI, SSE4, …), see https://github.com/bitcoin/bitcoin/blob/master/src/crypto/sha256.cpp . The clean way to go is either to give the caller a way to bring their own implementation at runtime, or to extract the C++ code in Bitcoin Core into a separate library (perhaps after converting it to C) that could be linked to libsecp256k1.
  19. Raimo33 commented at 9:34 am on September 2, 2025: none
    • You’re right. Raspberry Pis are 32-bit armv7. I agree we shouldn’t bother. Removed sse2.
    • I was hoping for more on verify as well; that’s the main goal.
    • I think we should copy Bitcoin Core’s sha256 optimizations, but in another PR.

    Beware that benchmarks are completely unreliable as of right now, see #1701. I’m waiting on #1732 before proceeding.

  20. Raimo33 force-pushed on Sep 2, 2025
  21. Raimo33 force-pushed on Sep 3, 2025
  22. Raimo33 force-pushed on Sep 3, 2025
  23. Raimo33 force-pushed on Sep 3, 2025
  24. Raimo33 force-pushed on Sep 3, 2025
  25. Raimo33 closed this on Sep 3, 2025

  26. Raimo33 deleted the branch on Sep 3, 2025
  27. Raimo33 restored the branch on Sep 3, 2025
  28. Raimo33 reopened this on Sep 3, 2025

  29. Raimo33 force-pushed on Sep 4, 2025
  30. Raimo33 force-pushed on Sep 4, 2025
  31. Raimo33 force-pushed on Sep 4, 2025
  32. Raimo33 force-pushed on Sep 4, 2025
  33. Raimo33 force-pushed on Sep 4, 2025
  34. Raimo33 force-pushed on Sep 4, 2025
  35. Raimo33 force-pushed on Sep 4, 2025
  36. Raimo33 force-pushed on Sep 5, 2025
  37. Raimo33 force-pushed on Sep 5, 2025
  38. Raimo33 force-pushed on Sep 5, 2025
  39. Raimo33 force-pushed on Sep 5, 2025
  40. Raimo33 force-pushed on Sep 5, 2025
  41. Raimo33 force-pushed on Sep 5, 2025
  42. Add generic optimizations 7a05d18fd4
  43. Raimo33 force-pushed on Sep 5, 2025
  44. Add intel simd cb16f5cd80
  45. Add dev scripts [skip ci] 056feb157a
  46. Raimo33 force-pushed on Sep 5, 2025
  47. kmk142789 approved

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-10-13 19:15 UTC
