Currently, master contains two implementations of SHA256 for SSE4:
- A generic one written using GCC inline assembly (converted from Intel NASM code), added in #10821.
- A specialized double-SHA256 for 64-byte inputs written using intrinsics, added in #13191.
The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn’t even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.
This patch converts the former to intrinsics as well, making its operation clearer and hopefully easier to adapt for other specialized implementations.
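To illustrate the readability argument, here is a minimal sketch of what intrinsics-style code looks like. It is not taken from this patch. It computes SHA256's message-schedule function sigma0 (`rotr(x,7) ^ rotr(x,18) ^ (x >> 3)`) on four 32-bit words at once. The names `Sigma0Scalar`, `Sigma0x4` and `Sigma0x4MatchesScalar` are hypothetical, and only SSE2 intrinsics are used so that it builds without extra compiler flags on x86-64; a scalar fallback covers other targets.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar reference: rotate right by n bits (0 < n < 32).
inline uint32_t RotR(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// SHA256 sigma0 as defined in FIPS 180-4.
inline uint32_t Sigma0Scalar(uint32_t x) { return RotR(x, 7) ^ RotR(x, 18) ^ (x >> 3); }

// Hypothetical helper: sigma0 applied to four words in parallel.
void Sigma0x4(const uint32_t in[4], uint32_t out[4])
{
#if defined(__SSE2__)
    __m128i x = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    // SSE has no 32-bit rotate, so each rotate is a pair of shifts OR'ed together.
    __m128i r7 = _mm_or_si128(_mm_srli_epi32(x, 7), _mm_slli_epi32(x, 25));
    __m128i r18 = _mm_or_si128(_mm_srli_epi32(x, 18), _mm_slli_epi32(x, 14));
    __m128i s3 = _mm_srli_epi32(x, 3);
    __m128i res = _mm_xor_si128(_mm_xor_si128(r7, r18), s3);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), res);
#else
    // Fallback for targets without SSE2: plain scalar loop.
    for (int i = 0; i < 4; ++i) out[i] = Sigma0Scalar(in[i]);
#endif
}

// Cross-check the vector path against the scalar reference.
bool Sigma0x4MatchesScalar()
{
    const uint32_t in[4] = {1u, 0x80000000u, 0x12345678u, 0xffffffffu};
    uint32_t out[4];
    Sigma0x4(in, out);
    for (int i = 0; i < 4; ++i) {
        if (out[i] != Sigma0Scalar(in[i])) return false;
    }
    return true;
}
```

Unlike an inline assembly block, each step here is an ordinary C++ expression that can be factored into helpers and reused, at the cost of leaving register allocation and instruction scheduling to the compiler.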
The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer, though, and can change the speed by as much as a few percent.