Ever thought about using SIMD intrinsics to speed up some functions?
https://github.com/sipa/secp256k1/blob/master/src%2Ffield_10x26_impl.h
This code, for example, is full of cases where SIMD would offer great benefit.
> Ever thought about using SIMD intrinsics to speed up some functions?
Does this issue answer your question? #1110
By the way, this link points to a 10-year-old version of the library code (because it points to the wrong repo).
I’m open to implement it myself if it gets decided
I would be happy to see experimentation with SIMD, and I think we’re in general open to the idea, but be aware that we have very high coding and reviewing standards, and not a lot of bandwidth. Reviewing such code will take a long time, and no one can give you a “decision” right now.
You can use `#ifdef` blocks for experimentation. This gets you started more quickly if some functions use intrinsics and some don’t, because you won’t need to care about organizing files so that you’ll have all the right functions included.
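A minimal sketch of that gating pattern, assuming an AVX2 target; the function and names here are purely illustrative, not taken from the library:

```c
/* Hypothetical sketch: gate a hand-written SIMD path behind an #ifdef,
 * keeping the portable scalar code as the fallback. */
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Add two arrays of four 64-bit limbs (illustrative helper). */
static void limbs_add4(uint64_t *r, const uint64_t *a, const uint64_t *b) {
#if defined(__AVX2__)
    /* SIMD path: one 256-bit addition instead of four scalar additions. */
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)r, _mm256_add_epi64(va, vb));
#else
    /* Portable fallback, compiled when AVX2 is unavailable. */
    for (int i = 0; i < 4; i++) r[i] = a[i] + b[i];
#endif
}
```

Both paths compute the same result, so the `#ifdef` can be flipped freely while experimenting without reorganizing any files.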
The `VERIFY` blocks and the `VERIFY_CHECK` macros are for assertions enabled only in the tests. No need to add SIMD there.
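A simplified sketch of that assertion pattern (the library’s actual macros differ in detail; this is just to show why such blocks need no SIMD, since they compile away outside test builds):

```c
/* Sketch of a VERIFY-gated assertion: the check only exists when the
 * build defines VERIFY (as the test builds do), so release code pays
 * no cost for it. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef VERIFY
#define VERIFY_CHECK(cond) do { \
    if (!(cond)) { \
        fprintf(stderr, "check failed: %s\n", #cond); \
        abort(); \
    } \
} while (0)
#else
#define VERIFY_CHECK(cond) do { (void)0; } while (0)
#endif

/* Illustrative function: halves an even number. */
static uint32_t half(uint32_t x) {
    VERIFY_CHECK(x % 2 == 0); /* enforced only in VERIFY builds */
    return x / 2;
}
```

Since the macro expands to nothing in normal builds, optimizing its body would not affect any production benchmark.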
I’ve added SIMD to `field_5x52_impl.h`.
Please share feedback and let me know if I should continue with the other files.
I ran the benchmarks (both with AVX2 enabled, to see the difference between the compiler’s auto-generated SIMD and manual SIMD), and I ran them thoroughly to make sure every change was meaningful. I don’t have an AVX-512 CPU, so I’m unable to run some of the tests and benchmarks for the `secp256k1_fe_impl_get_b32` function, but it should be much faster as well.
Code: https://github.com/Raimo33/secp256k1/blob/simd/src/field_5x52_impl.h
Benchmarks:
Keep in mind that the only file I changed was `field_5x52_impl.h`. Imagine the possible speedup from applying SIMD to all the other files as well. I see a lot of room for improvement, and I would have a lot of fun implementing it.
We have two field implementations (the 64-bit `field_5x52` and the 32-bit `field_10x26`).
When using CMake, create a `build32` dir and run `CC="$CC -m32" cmake -B build32`. This should set up a 32-bit build on x86_64.
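Spelled out as commands (assuming a 32-bit multilib toolchain and libraries are installed; the configure step is the one quoted above, the build step is standard CMake):

```shell
# Configure a 32-bit build tree on x86_64:
CC="$CC -m32" cmake -B build32
# Build it:
cmake --build build32
```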
> I’ve added SIMD to `field_5x52_impl.h`.
> Please share feedback and let me know if I should continue with the other files.
I think it would be better to open a draft pull request. This makes it easier for people to look at the changes.
> I ran the benchmarks (both with AVX2 enabled, to see the difference between the compiler’s auto-generated SIMD and manual SIMD), and I ran them thoroughly to make sure every change was meaningful. I don’t have an AVX-512 CPU, so I’m unable to run some of the tests and benchmarks for the `secp256k1_fe_impl_get_b32` function, but it should be much faster as well.
> Code: Raimo33/secp256k1@simd/src/field_5x52_impl.h
> Benchmarks:
Hm, that doesn’t draw a very consistent picture. Did you disable turbo boost? Do you know that you can increase the number of benchmark iterations by setting `SECP256K1_BENCH_ITERS`?
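For concreteness, a sketch of how one might reduce benchmark noise on a Linux box; the sysfs path assumes an Intel CPU using the `intel_pstate` driver, and the benchmark binary name and iteration count are illustrative:

```shell
# Disable turbo boost (intel_pstate driver; path may differ on other systems):
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# Raise the iteration count via the environment variable mentioned above:
SECP256K1_BENCH_ITERS=100000 ./bench_internal
```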
> Keep in mind that the only file I changed was `field_5x52_impl.h`. Imagine the possible speedup from applying SIMD to all the other files as well. I see a lot of room for improvement, and I would have a lot of fun implementing it.
I’m not sure. I assume this is the file with the biggest potential. Bigger improvements might be possible by changing the algorithms or even the data structure so that they’re more amenable to vectorization. (No idea if this is possible; I haven’t thought about this or read up on it.)
But at this point, do you think I should avoid adding SIMD to the whole `field_10x26` implementation?
Yeah, I mean the only reason why 64-bit Intel CPUs support the old 32-bit instruction set is compatibility. If you want good performance on 64-bit, you’ll need to use the 5x52 code.
The reason why we have 32-bit code is for entirely different CPUs.
Labels: performance