Exploit modern simd to calculate 2/4/6/8/16 states at a time depending on the size of the input.
This demonstrates a 2x speedup on x86-64 and 3x on arm+neon. Platforms that require runtime detection (avx2/avx512) improve performance even further; support for those will come as a follow-up.
Rather than hand-writing assembly or using arch-specific intrinsics, this is written using compiler built-ins understood by gcc and clang.
In practice (at least on x86_64 and armv8), the compilers are able to produce assembly that's not much worse than hand-written.
This means that every architecture can benefit from its own vectorized instructions without us having to write and maintain an implementation for each one. But because architectures vary in their ability to exploit the parallelism, we allow each to opt out (via ifdefs) of some or all multi-state calculation at compile time.
As a starting point, x86-64 and arm+neon have been enabled here based on local benchmarks.
Local profiling revealed that chacha20 accounts for a substantial amount of the network thread's time. It's not clear to me if speeding up chacha20 will improve network performance/latency, but it will definitely make it more efficient.
This is part 1 of a series of PRs for chacha20. I think it makes sense to review the generic implementation and tune the architecture-specific defines for parallel blocks before adding the platforms that require runtime detection.
My WIP branch which includes avx2/avx512 can be seen here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/
I've been hacking on this for quite a while, trying every imaginable tweak and comparing lots of resulting asm/ir. I'm happy to answer any questions about any choices made which aren't immediately obvious.
Edit: some more implementation details:
I wrestled with gcc/clang a good bit, tweaking one thing at a time and comparing the generated code. A few things I found, which may explain some of the decisions I made:
- gcc really wanted to inline some of the helpers, which comes at a very substantial performance cost due to register clobbering (and, with avx2, `vzeroupper`). Hence, all helpers are decorated with `ALWAYS_INLINE`.
- gcc/clang do well with the `vec256` loads/stores with minimal fussing. Though loading each element with `[]` is clumsy and verbose, it avoids compiler-specific layout assumptions. Other things I tried (which produced the same asm):
  - casting directly to `using unaligned_vec256 __attribute__((aligned (1))) = vec256`
  - memcpy into ^^
  - clang's `__builtin_masked_load`
- Loop unrolling was hit-or-miss without `#pragma GCC unroll n`, and I tried to avoid macros for loops, hence the awkward recursive inline template loops. In practice, I see those unrolled 100% of the time.
- I used `std::get` in the helpers for some extra compile-time safety (this actually pointed out some off-by-ones that would've been annoying to track down).
- I avoided using any lambdas or classes for fear of compilers missing obvious optimizations.
- All `vec256` are passed by reference to avoid an annoying clang warning about returning a vector changing the ABI (this is specific to x86 when not compiling with `-mavx`). Even though our functions are all inlined, I didn't see any harm in making that adjustment.