This adds an SSE4 assembly version of the SHA256 transform by Intel, and uses it at run time if SSE4 instructions are available, and use a fallback C++ implementation otherwise. Nearly every x86_64 CPU supports SSE4. The feature is only enabled when compiled with --enable-experimental-asm
.
In order to avoid build dependencies and other complications, the original Intel YASM code was translated to GCC extended asm syntax.
This gives around a 50% speedup on the SHA256 benchmark for me.
It is based on an earlier patch by @laanwj, though only includes a single assembly version (for now), and removes the YASM dependency.