Convert the 1-way SSE4 SHA256 code from asm to intrinsics

sipa commented at 11:36 pm on June 11, 2018: member

Currently, master contains 2 implementations of SHA256 for SSE4:

A generic one written using GCC inline assembly (converted from Intel NASM code), added in #10821.
A specialized double-SHA256 for 64-byte inputs written using intrinsics, added in #13191.

The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn’t even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.

This patch converts the former also to intrinsics - making its operation more clear, while hopefully lending itself to being adaptable for other specialized implementations.

The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer though, and have as much as a few % impact on speed.

fanquake added the label Validation on Jun 12, 2018

theuni commented at 3:00 am on June 12, 2018: member

Nice! @sipa See https://github.com/theuni/bitcoin/commit/d79fb1d9328ad793932f7f8c29bd907803db4c73 for clang compile fixes, and https://github.com/theuni/bitcoin/commit/4ee6fbb8b7525988783030eb0799bbc7293d50a0 for a change that may or may not be needed to avoid a performance hit on AMD.

sipa force-pushed on Jun 12, 2018

in src/crypto/sha256_sse41.cpp:107 in b5464f73ce outdated

102+    _mm_store_si128((__m128i*)Ws, W);
103+
104+    Round(a, b, c, d, e, f, g, h, Ws[0]);
105+    XTMP0 = _mm_alignr_epi8(X3, X2, 4);
106+    XTMP0 = _mm_add_epi32(XTMP0, X0);
107+    XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);

theuni commented at 3:33 am on June 12, 2018:

Is there some voodoo here, why not just use XTMP3 below? Does this avoid a pipeline stall or something?

sipa commented at 4:50 pm on June 12, 2018:

No idea. It’s just a translation of the existing assembly code.

sipa force-pushed on Jun 12, 2018

sipa commented at 4:47 pm on June 12, 2018: member

@theuni Included the clang compile fixes. I’m going to benchmark to see whether to include the other changes.

sipa force-pushed on Jun 12, 2018

sipa commented at 7:43 pm on June 14, 2018: member

It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.

sipa commented at 10:20 pm on June 18, 2018: member

Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the SHA256 benchmark (i7-7820HQ, fixed to 2.2 GHz).

GCC, master: 4.4 ms
GCC, this PR: 4.3 ms
clang, master: 4.4 ms
clang, this PR: 4.8 ms

Unfortunately, it seems that clang isn’t as good in producing as performant code from intrinsics.

DrahtBot added the label Needs rebase on Jul 9, 2018

theuni commented at 7:59 pm on July 19, 2018: member

@sipa Mind rebasing? I’d like to add the lib-per-cpu changes on top of this.

sipa force-pushed on Jul 19, 2018

sipa commented at 9:12 pm on July 19, 2018: member

Rebased, though I don’t think this PR is acceptable until we have a way to avoid the performance loss in clang.

DrahtBot removed the label Needs rebase on Jul 20, 2018

DrahtBot commented at 8:56 pm on July 28, 2018: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#13789 (crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang by luke-jr)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

in src/crypto/sha256_sse41.cpp:98 in 8655e7829c outdated

93+    __m128i t0, t1, t2, t3, t4;
94+
95+    w = _mm_add_epi32(w, x0);
96+    _mm_store_si128((__m128i*)w32, w);
97+
98+    Round(a, b, c, d, e, f, g, h, w32[0]);

practicalswift commented at 8:39 pm on October 4, 2018:

Is the value read from w32[0] is guaranteed to be initialised here?

sipa commented at 6:13 pm on May 20, 2019:

Yes, in the line above.

in src/crypto/sha256_sse41.cpp:104 in 8655e7829c outdated

 99+    t0 = _mm_add_epi32(_mm_alignr_epi8(x3, x2, 4), x0);
100+    t3 = t2 = t1 = _mm_alignr_epi8(x1, x0, 4);
101+    t2 = _mm_srli_epi32(t2, 7);
102+    t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);
103+
104+    Round(h, a, b, c, d, e, f, g, w32[1]);

practicalswift commented at 8:40 pm on October 4, 2018:

Same here and throughout this function :-)

in src/crypto/sha256.cpp:617 in 8655e7829c outdated

614@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
615 #endif
616 
617     if (have_sse4) {

practicalswift commented at 8:44 pm on October 4, 2018:

Move the if statement inside of the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) to remove the possibility of an empty if statement.

bitcoin deleted a comment on Oct 4, 2018

Add 1-way SSE4 SHA256 implementation using intrinsics 797f6c1efe

Switch 1-way SSE4 SHA256 to intrinsics based implementation 7ebae4d40a

Remove SSE4 assembly implementation 4a221cef1b

sipa force-pushed on Oct 12, 2018

maflcko commented at 5:53 pm on May 20, 2019: member

Are you still working on this?

maflcko closed this on May 20, 2019

maflcko reopened this on May 20, 2019

sipa commented at 5:55 pm on May 20, 2019: member

What version of clang are we using now? It’s probably not a good idea to proceed with this unless it can be shown it doesn’t have negative impact on performance on all release platforms.

maflcko commented at 6:12 pm on May 20, 2019: member

3.7 (#16052) for gitian
FreeBSD 12: clang version 6.0.1
not sure what the default is on macos when self-compiled

sipa commented at 6:24 pm on May 20, 2019: member

I’ll close this for now, then.

sipa closed this on May 20, 2019

fanquake commented at 4:55 am on May 21, 2019: member

@MarcoFalke, Clang on my macOS machine (Xcode 10.2.1) is:

0Apple LLVM version 10.0.1 (clang-1001.0.46.4)
1Target: x86_64-apple-darwin18.6.0
2Thread model: posix
3InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

DesWurstes commented at 12:46 pm on May 21, 2019: contributor

…which is actually Clang 5.

fanquake commented at 6:34 pm on May 23, 2019: member

I’m going to reopen this, as we will be switching to a newer version of Clang in gitian.

fanquake reopened this on May 23, 2019

sipa commented at 7:05 pm on May 23, 2019: member

I’ll benchmark again in clang-7.

maflcko commented at 7:16 pm on May 23, 2019: member

I see a slowdown in SHA256 and SHA256_32b with both gcc-9 and clang-8

fanquake commented at 8:09 am on June 24, 2019: member

Futher benchmarking reported here and outside this PR have shown that there are likely slowdown issues with this change and recent versions of Clang. Closing again for now.

fanquake closed this on Jun 24, 2019

fanquake added the label Future on Jun 24, 2019

laanwj added this to the milestone Future on Sep 30, 2019

laanwj removed the label Future on Sep 30, 2019

bitcoin locked this on Dec 16, 2021

hebasto commented at 2:19 pm on September 24, 2023: member

Picked up for MSVC builds in #28526.

maflcko removed this from the milestone Future on Jul 23, 2025

maflcko added the label Up for grabs on Jul 23, 2025

Convert the 1-way SSE4 SHA256 code from asm to intrinsics #13442

Conflicts