This adds SSE2 and AVX2 support across the library, as discussed in #1700, wherever the benchmarks show an improvement.
ARM has a different SIMD instruction set; it would be nice to have a separate PR implementing that as well, perhaps after this one is merged.
Tasks:
- Add 2 CI/CD flows with permutations of `-msse2`, `-mno-sse2` when building for amd32
- Add 4 CI/CD flows with permutations of `-mavx2`, `-msse2`, `-mno-avx2`, `-mno-sse2` when building for amd64
- Precompute vectors at startup (the ones marked with `TODO: precompute`)
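To illustrate why each flag permutation needs its own CI flow, here is a minimal sketch of compile-time path selection. It is not code from this PR: `add_bytes` and `simd_path` are hypothetical names, and the only assumption is that the compiler is GCC or Clang, which predefine `__SSE2__` and `__AVX2__` when the corresponding instruction sets are enabled.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>  /* AVX2 intrinsics */
#elif defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: reports which path this translation unit was built with. */
static const char *simd_path(void) {
#if defined(__AVX2__)
    return "avx2";
#elif defined(__SSE2__)
    return "sse2";
#else
    return "scalar";
#endif
}

/* Hypothetical kernel: dst[i] = a[i] + b[i] with wrapping 8-bit addition. */
static void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    size_t i = 0;
#if defined(__AVX2__)
    /* 32 bytes per iteration, unaligned loads and stores. */
    for (; i + 32 <= n; i += 32) {
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_add_epi8(va, vb));
    }
#elif defined(__SSE2__)
    /* 16 bytes per iteration, unaligned loads and stores. */
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
#endif
    /* Scalar tail; also the whole loop when no SIMD path is compiled in. */
    for (; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]);
    }
}
```

Each flag combination compiles a different branch of the `#if` chain, so a single CI flow would only ever exercise one of them. Every path is expected to produce identical results, which is what running the same tests under each permutation checks.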
Test & Benchmark
To reproduce the following results, I temporarily added three scripts (for building, testing, and benchmarking) and a Jupyter notebook to visualize the results.
You can verify this yourself by running `./simd-build.sh && ./simd-test.sh && ./simd-bench.sh` and then executing the notebook as is.