This introduces a framework for specialized double-SHA256 with 64-byte inputs. Four implementations are provided:
- Generic C++ (reusing the normal SHA256 code)
- Specialized C++ for 64-byte inputs, but no special instructions
- 4-way using SSE4.1 intrinsics
- 8-way using AVX2 intrinsics
On my own system (AVX2 capable), I get these benchmarks for computing the Merkle root of 9001 leaves. Each result is labeled as supported lengths / special instructions / parallelism:
- 7.2 ms with varsize/naive/1way (master, non-SSE4 hardware)
- 5.8 ms with size64/naive/1way (this PR, non-SSE4 hardware)
- 4.8 ms with varsize/SSE4/1way (master, SSE4 hardware)
- 2.9 ms with size64/SSE4/4way (this PR, SSE4 hardware)
- 1.1 ms with size64/AVX2/8way (this PR, AVX2 hardware)
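
To illustrate why Merkle root computation benefits from the multi-way variants, here is a minimal sketch of how one tree level turns into a batch of independent 64-byte inputs (two concatenated 32-byte hashes each) that can be consumed 8, 4, or 1 at a time. All names here (`ToyHash64`, `HashLanes`, `HashBlocks`, `MerkleRoot`, `BatchStats`) are hypothetical and not taken from this PR, and `ToyHash64` is a non-cryptographic stand-in for double-SHA256 so the example stays short. The sketch assumes the usual Bitcoin rule of duplicating the last node when a level has an odd count.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Counts how often each batch width was used (illustration only).
struct BatchStats { size_t way8 = 0, way4 = 0, way1 = 0; };

// Hypothetical stand-in for double-SHA256 of one 64-byte input.
// NOT a cryptographic hash; it only mixes bytes so the sketch runs.
static void ToyHash64(unsigned char* out, const unsigned char* in)
{
    for (int i = 0; i < 32; ++i) {
        out[i] = static_cast<unsigned char>(in[i] * 31 + in[i + 32] + i);
    }
}

// Hash `lanes` independent 64-byte inputs. A real N-way implementation
// would process all lanes at once with SIMD; here it is a plain loop.
static void HashLanes(unsigned char* out, const unsigned char* in, size_t lanes)
{
    for (size_t l = 0; l < lanes; ++l) ToyHash64(out + 32 * l, in + 64 * l);
}

// Hash `blocks` 64-byte inputs into `blocks` 32-byte outputs, preferring
// the widest batch that still fits. `out` and `in` must not overlap.
static void HashBlocks(unsigned char* out, const unsigned char* in,
                       size_t blocks, BatchStats& stats)
{
    while (blocks >= 8) {
        HashLanes(out, in, 8); out += 8 * 32; in += 8 * 64; blocks -= 8; ++stats.way8;
    }
    while (blocks >= 4) {
        HashLanes(out, in, 4); out += 4 * 32; in += 4 * 64; blocks -= 4; ++stats.way4;
    }
    while (blocks >= 1) {
        HashLanes(out, in, 1); out += 32; in += 64; blocks -= 1; ++stats.way1;
    }
}

// Reduce concatenated 32-byte leaves to a single 32-byte root.
// Each level of `count` nodes becomes count/2 independent 64-byte inputs.
static std::vector<unsigned char> MerkleRoot(std::vector<unsigned char> nodes,
                                             BatchStats& stats)
{
    size_t count = nodes.size() / 32;
    while (count > 1) {
        if (count % 2 == 1) {
            // Odd level: duplicate the last node (assumed Bitcoin-style rule).
            nodes.resize((count + 1) * 32);
            std::memcpy(&nodes[count * 32], &nodes[(count - 1) * 32], 32);
            ++count;
        }
        std::vector<unsigned char> next(count / 2 * 32);
        HashBlocks(next.data(), nodes.data(), count / 2, stats);
        nodes.swap(next);
        count /= 2;
    }
    return nodes;
}
```

Calling `MerkleRoot` on a buffer of 9001 concatenated 32-byte leaves returns a 32-byte vector, and the `BatchStats` counters show that nearly all of the work lands in the 8-wide path, with only the small upper levels falling back to narrower batches.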