This PR adds AVX and AVX2 intrinsics support to the library in general, as discussed in #1700, wherever it yields an improvement per the benchmarks.
Why not SSE and AVX-512?
- SSE would only be useful in the 32-bit code path (`USE_FORCE_WIDEMUL_INT64=1`), but in practice almost all 32-bit CPUs running this code are ARMv7, not x86, so an SSE path would rarely be exercised.
- AVX-512 would not be beneficial anywhere, and thermal throttling is a problem on most CPUs that support it.
ARM has a different SIMD instruction set (NEON); it would be nice to have a separate PR implementing that as well. Maybe after this is merged…
Tasks
- Add CI/CD flows with permutations of `-mavx`, `-mavx2`, `-mno-avx`, `-mno-avx2` when building for amd64
- Precompute vectors at startup (the ones marked with `TODO: precompute`)
Commits
I’ve split this PR into multiple commits with the following criteria:
- non-simd optimizations & style changes
- simd optimizations
- temporary scripts for development
- CI flows
Test & Benchmark
To reproduce the following results, I temporarily added three scripts for building, testing, and benchmarking, as well as a Jupyter notebook to visualize the results. You can verify them yourself by running `./simd-build.sh && ./simd-test.sh && ./simd-bench.sh` and executing the notebook as is.
The baseline is compiled with `-O3 -mavx -mavx2 -U__AVX__ -U__AVX2__`: the `-m` flags still allow gcc's spontaneous auto-vectorization, while undefining the feature macros means the manual vectorization paths that check them are not compiled in.