This PR adds AVX and AVX2 intrinsics support to the library in general, as discussed in #1700, wherever it yields an improvement per the benchmarks.
Why not SSE and AVX-512?
- SSE would only be useful in the 32-bit code path (`USE_FORCE_WIDEMUL_INT64=1`), but in practice almost all 32-bit CPUs running this code are ARMv7, not x86, so an SSE path would rarely be exercised.
- AVX-512 would not be beneficial anywhere, and thermal throttling is a problem on most CPUs that support it.
ARM has a different SIMD instruction set (NEON); it would be nice to have a separate PR implementing that as well. Maybe after this is merged…
Tasks
- Add CI/CD flows with permutations of `-mavx`, `-mavx2`, `-mno-avx`, `-mno-avx2` when building for amd64
- Precompute vectors at startup (the ones marked with `TODO: precompute`)
Commits
I’ve split this PR into multiple commits with the following criteria:
- non-simd optimizations & style changes
- simd optimizations
- temporary scripts for development
- CI flows
Test & Benchmark
To reproduce the following results, I temporarily added three scripts for building, testing, and benchmarking, as well as a Jupyter notebook to visualize the results. You can verify them yourself by running `./simd-build.sh && ./simd-test.sh && ./simd-bench.sh` and executing the notebook as is.
The baseline is compiled with `-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__`: the `-m` flags still allow gcc's spontaneous auto-vectorization, while undefining the feature macros means the manual vectorization paths that check them are not compiled in.