Use SIMD? #1700

issue Raimo33 opened this issue on July 13, 2025
  1. Raimo33 commented at 11:29 pm on July 13, 2025: none

    Ever thought about using SIMD intrinsics to speed up some functions?

    https://github.com/sipa/secp256k1/blob/master/src%2Ffield_10x26_impl.h

    This code, for example, is full of cases where SIMD would offer great benefit

  2. Raimo33 commented at 11:29 pm on July 13, 2025: none
    I’m open to implement it myself if it gets decided
  3. real-or-random commented at 6:51 am on July 14, 2025: contributor

    Ever thought about using SIMD intrinsics to speed up some functions?

    Does this issue answer your question? #1110

    sipa/secp256k1@master/src%2Ffield_10x26_impl.h

    By the way, this link points to a 10-year-old version of the library code (because it points to the wrong repo).

    I’m open to implement it myself if it gets decided

    I would be happy to see experimentation with SIMD, and I think we’re in general open to the idea, but be aware that we have very high coding and reviewing standards, and not a lot of bandwidth. Reviewing such code will take a long time, and no one can give you a “decision” right now.

  4. real-or-random added the label performance on Jul 14, 2025
  5. Raimo33 commented at 7:20 am on July 14, 2025: none
    Ok, I’ll experiment then. Do you think I should make separate files, or put #ifdef blocks and embed the SSE2, AVX2, and AVX-512 versions directly alongside the existing functions?
  6. real-or-random commented at 7:26 am on July 14, 2025: contributor
    I’d start with #ifdef blocks for experimentation. This gets you started quicker when some functions use intrinsics and some don’t, because you won’t need to worry about organizing the files so that all the right functions get included.
  7. Raimo33 commented at 10:33 am on July 14, 2025: none

    hey, quick question: are the VERIFY blocks just for debugging? In other words, should I optimize them too? For example:

     #ifdef VERIFY
     static void secp256k1_fe_impl_verify(const secp256k1_fe *a) {
         const uint64_t *d = a->n;
         int m = a->normalized ? 1 : 2 * a->magnitude;
         /* secp256k1 'p' value defined in "Standards for Efficient Cryptography" (SEC2) 2.7.1. */
         VERIFY_CHECK(d[0] <= 0xFFFFFFFFFFFFFULL * m);
         VERIFY_CHECK(d[1] <= 0xFFFFFFFFFFFFFULL * m);
         VERIFY_CHECK(d[2] <= 0xFFFFFFFFFFFFFULL * m);
         VERIFY_CHECK(d[3] <= 0xFFFFFFFFFFFFFULL * m);
         VERIFY_CHECK(d[4] <= 0x0FFFFFFFFFFFFULL * m);
         if (a->normalized) {
             if ((d[4] == 0x0FFFFFFFFFFFFULL) && ((d[3] & d[2] & d[1]) == 0xFFFFFFFFFFFFFULL)) {
                 VERIFY_CHECK(d[0] < 0xFFFFEFFFFFC2FULL);
             }
         }
     }
     #endif
    
  8. real-or-random commented at 1:09 pm on July 14, 2025: contributor
    Yes, essentially. The VERIFY blocks and the VERIFY_CHECK macros are assertions enabled only in the tests. No need to add SIMD there.
  9. Raimo33 commented at 4:29 pm on July 14, 2025: none

    I’ve added SIMD to field_5x52_impl.h. Please share feedback and let me know if I should continue with the other files.

    I ran the benchmarks (both builds with AVX2 enabled, to see the difference between compiler-generated SIMD and manual SIMD). I ran them thoroughly to ensure every change was meaningful. I don’t have an AVX-512 CPU, so I’m unable to run some of the tests & benchmarks for the secp256k1_fe_impl_get_b32 function, but it should be much faster as well.

    Code: https://github.com/Raimo33/secp256k1/blob/simd/src/field_5x52_impl.h Benchmarks:

    bench_diff.pdf bench.zip

    Keep in mind that the only file I changed was field_5x52_impl.h. Imagine the possible speedup from applying SIMD to all the other files as well. I see a lot of room for improvement, and I would have a lot of fun implementing it.

  10. Raimo33 commented at 10:11 am on July 16, 2025: none
    As a side note, if you decide to integrate SIMD you’d need to change the CI pipeline and tests to compile with different flags, and ensure that the running machine supports AVX intrinsics, or maybe use an emulator if one exists.
  11. Raimo33 commented at 3:27 pm on July 17, 2025: none

    it seems that no function in field_10x26_impl.h ever gets compiled into the final binaries, and none of it is tested. I can add garbage to field_10x26_impl.h, run ctest, and everything still passes.

    I’m avoiding optimizing field_10x26_impl.h as I see no way to test correctness.

  12. real-or-random commented at 3:53 pm on July 17, 2025: contributor

    We have two finite field implementations:

    • One that represents a field element by 5 limbs of uint64_t, where 52 bits are used if elements are reduced (5x52). This is used on 64-bit platforms.
    • One that represents a field element by 10 limbs of uint32_t where 26 bits are used if elements are fully reduced (10x26). This is used on 32-bit platforms.

    When using cmake, create a build32 dir and run CC="$CC -m32" cmake -B build32. This should set up a 32-bit build on x86_64.
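Assuming an x86_64 host with a 32-bit multilib toolchain installed, the whole cycle might look like this (build-tree names are illustrative):

```shell
# Configure a separate 32-bit build tree; -m32 makes the library
# select the 10x26 field and related 32-bit code paths.
CC="$CC -m32" cmake -B build32
cmake --build build32
# Run the test suite against the 32-bit build (requires CMake >= 3.20).
ctest --test-dir build32
```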

  13. real-or-random commented at 7:24 pm on July 17, 2025: contributor
    Oh, but in case this was not obvious: You won’t find an x86 (32-bit) CPU with AVX…
  14. Raimo33 commented at 7:27 pm on July 17, 2025: none

    Not a problem for compiling: I can still set the -mavx2 flag. But I guess it’s a problem for running. I’m not familiar with how my 64-bit CPU runs 32-bit programs, but I thought it still used AVX2.

    But at this point do you think I should avoid adding SIMD to the whole field_10x26?

  15. real-or-random commented at 7:32 pm on July 17, 2025: contributor

    I’ve added SIMD to field_5x52_impl.h. Please share feedback and let me know if I should continue with the other files.

    I think it would be better to open a draft pull request. This makes it easier for people to look at the changes.

    I ran the benchmarks (both builds with AVX2 enabled, to see the difference between compiler-generated SIMD and manual SIMD). I ran them thoroughly to ensure every change was meaningful. I don’t have an AVX-512 CPU, so I’m unable to run some of the tests & benchmarks for the secp256k1_fe_impl_get_b32 function, but it should be much faster as well.

    Code: Raimo33/secp256k1@simd/src/field_5x52_impl.h Benchmarks:

    bench_diff.pdf

    Hm, that doesn’t draw a very consistent picture. Did you disable turbo boost? Do you know that you can increase the number of benchmark iterations by setting SECP256K1_BENCH_ITERS?
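For example, a benchmark run on Linux might be stabilized like this (the no_turbo path is Intel-specific and varies by machine; the binary path depends on your build setup):

```shell
# Disable turbo boost on Intel CPUs using the intel_pstate driver,
# so clock frequency stays constant across runs.
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Increase the iteration count so per-call noise averages out.
SECP256K1_BENCH_ITERS=200000 ./bench_internal
```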

    Keep in mind that the only file I changed was field_5x52_impl.h. Imagine the possible speedup from applying SIMD to all the other files as well. I see a lot of room for improvement, and I would have a lot of fun implementing it.

    I’m not sure. I assume this is the file with the biggest potential. Bigger improvements might be possible by changing the algorithms or even the data structure so that they’re more amenable to vectorization. (No idea if this is possible; I haven’t thought about this or read up on it.)

  16. real-or-random commented at 7:35 pm on July 17, 2025: contributor

    But at this point do you think I should avoid adding SIMD to the whole field_10x26?

    Yeah, I mean the only reason why 64-bit Intel CPUs support the old 32-bit instruction set is compatibility. If you want good performance on 64-bit, you’ll need to use the 5x52 code.

    The reason why we have 32-bit code is for entirely different CPUs.

  17. Raimo33 commented at 7:38 pm on July 17, 2025: none

    I’ll open a draft PR.

    Benchmarks are not consistent, true. That’s why I wanted them to measure CPU cycles instead of time, so I don’t have to bother with clock frequency. But I’ll disable turbo boost.

    I think a great improvement can come from the sha256_transform as well, but I’ll dig into it later.

    Right now I’m adding SIMD to a flow which is inherently sequential, but the great benefit would come from having parallel functions, for instance mul4, add4, etc…

    The problem with SIMD is the instruction latency of load operations, which can be smoothed out by doing 4 adjacent loads, for example.

    Also, having unfriendly numbers such as 52 and 5 doesn’t help.

  18. Raimo33 commented at 8:36 pm on July 18, 2025: none
    • One that represents a field element by 10 limbs of uint32_t where 26 bits are used if elements are fully reduced (10x26). This is used on 32-bit platforms.

    I assume it’s the same drill for scalar_8x32: it only gets used when 64-bit is not available, correct?

  19. Raimo33 commented at 8:38 pm on July 18, 2025: none
    By the way, many 32-bit CPUs support SSE2 SIMD (128-bit registers), so I think I’ll add that somewhere.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-08-06 06:15 UTC
