Convert the 1-way SSE4 SHA256 code from asm to intrinsics #13442

pull sipa wants to merge 3 commits into bitcoin:master from sipa:201806_sse4intrin changing 4 files +215 −1519
  1. sipa commented at 11:36 pm on June 11, 2018: member

    Currently, master contains 2 implementations of SHA256 for SSE4:

    • A generic one written using GCC inline assembly (converted from Intel NASM code), added in #10821.
    • A specialized double-SHA256 for 64-byte inputs written using intrinsics, added in #13191.

    The advantage of the inline assembly is that its performance is not affected by compiler optimizations (and doesn’t even need compiler support for SSE4). The downside is that it is an opaque, unreadable, non-reusable blob of code.

    This patch converts the former also to intrinsics - making its operation more clear, while hopefully lending itself to being adaptable for other specialized implementations.

    The resulting implementation is slightly faster on my system (i7-7820HQ) when compiled with GCC 7.3. Small variations in the code can affect the optimizer though, and have as much as a few % impact on speed.

  2. fanquake added the label Validation on Jun 12, 2018
  3. theuni commented at 3:00 am on June 12, 2018: member
    Nice! @sipa See https://github.com/theuni/bitcoin/commit/d79fb1d9328ad793932f7f8c29bd907803db4c73 for clang compile fixes, and https://github.com/theuni/bitcoin/commit/4ee6fbb8b7525988783030eb0799bbc7293d50a0 for a change that may or may not be needed to avoid a performance hit on AMD.
  4. sipa force-pushed on Jun 12, 2018
  5. in src/crypto/sha256_sse41.cpp:107 in b5464f73ce outdated
    102+    _mm_store_si128((__m128i*)Ws, W);
    103+
    104+    Round(a, b, c, d, e, f, g, h, Ws[0]);
    105+    XTMP0 = _mm_alignr_epi8(X3, X2, 4);
    106+    XTMP0 = _mm_add_epi32(XTMP0, X0);
    107+    XTMP3 = XTMP2 = XTMP1 = _mm_alignr_epi8(X1, X0, 4);
    


    theuni commented at 3:33 am on June 12, 2018:
    Is there some voodoo here, why not just use XTMP3 below? Does this avoid a pipeline stall or something?

    sipa commented at 4:50 pm on June 12, 2018:
    No idea. It’s just a translation of the existing assembly code.
  6. sipa force-pushed on Jun 12, 2018
  7. sipa force-pushed on Jun 12, 2018
  8. sipa force-pushed on Jun 12, 2018
  9. sipa commented at 4:47 pm on June 12, 2018: member
    @theuni Included the clang compile fixes. I’m going to benchmark to see whether to include the other changes.
  10. sipa force-pushed on Jun 12, 2018
  11. sipa commented at 7:43 pm on June 14, 2018: member
    It would be worthwhile to benchmark this on reasonably recent clang versions as well - the performance impact may be very different depending on how good the compiler is at ordering parallel instruction paths.
  12. sipa commented at 10:20 pm on June 18, 2018: member

    Some more benchmarks, comparing GCC 7.3 and clang 6.0, for the SHA256 benchmark (i7-7820HQ, fixed to 2.2 GHz).

    • GCC, master: 4.4 ms
    • GCC, this PR: 4.3 ms
    • clang, master: 4.4 ms
    • clang, this PR: 4.8 ms

    Unfortunately, it seems that clang isn’t as good in producing as performant code from intrinsics.

  13. DrahtBot added the label Needs rebase on Jul 9, 2018
  14. theuni commented at 7:59 pm on July 19, 2018: member
    @sipa Mind rebasing? I’d like to add the lib-per-cpu changes on top of this.
  15. sipa force-pushed on Jul 19, 2018
  16. sipa commented at 9:12 pm on July 19, 2018: member
    Rebased, though I don’t think this PR is acceptable until we have a way to avoid the performance loss in clang.
  17. DrahtBot removed the label Needs rebase on Jul 20, 2018
  18. DrahtBot commented at 8:56 pm on July 28, 2018: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #13789 (crypto/sha256: Use pragmas to enforce necessary intrinsics for GCC and Clang by luke-jr)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  19. in src/crypto/sha256_sse41.cpp:98 in 8655e7829c outdated
    93+    __m128i t0, t1, t2, t3, t4;
    94+
    95+    w = _mm_add_epi32(w, x0);
    96+    _mm_store_si128((__m128i*)w32, w);
    97+
    98+    Round(a, b, c, d, e, f, g, h, w32[0]);
    


    practicalswift commented at 8:39 pm on October 4, 2018:
    Is the value read from w32[0] is guaranteed to be initialised here?

    sipa commented at 6:13 pm on May 20, 2019:
    Yes, in the line above.
  20. in src/crypto/sha256_sse41.cpp:104 in 8655e7829c outdated
     99+    t0 = _mm_add_epi32(_mm_alignr_epi8(x3, x2, 4), x0);
    100+    t3 = t2 = t1 = _mm_alignr_epi8(x1, x0, 4);
    101+    t2 = _mm_srli_epi32(t2, 7);
    102+    t1 = _mm_or_si128(_mm_slli_epi32(t1, 32 - 7), t2);
    103+
    104+    Round(h, a, b, c, d, e, f, g, w32[1]);
    


    practicalswift commented at 8:40 pm on October 4, 2018:
    Same here and throughout this function :-)
  21. in src/crypto/sha256.cpp:617 in 8655e7829c outdated
    614@@ -615,12 +615,9 @@ std::string SHA256AutoDetect()
    615 #endif
    616 
    617     if (have_sse4) {
    


    practicalswift commented at 8:44 pm on October 4, 2018:
    Move the if statement inside of the #if defined(ENABLE_SSE41) && !defined(BUILD_BITCOIN_INTERNAL) to remove the possibility of an empty if statement.
  22. bitcoin deleted a comment on Oct 4, 2018
  23. Add 1-way SSE4 SHA256 implementation using intrinsics 797f6c1efe
  24. Switch 1-way SSE4 SHA256 to intrinsics based implementation 7ebae4d40a
  25. Remove SSE4 assembly implementation 4a221cef1b
  26. sipa force-pushed on Oct 12, 2018
  27. MarcoFalke commented at 5:53 pm on May 20, 2019: member
    Are you still working on this?
  28. MarcoFalke closed this on May 20, 2019

  29. MarcoFalke reopened this on May 20, 2019

  30. sipa commented at 5:55 pm on May 20, 2019: member
    What version of clang are we using now? It’s probably not a good idea to proceed with this unless it can be shown it doesn’t have negative impact on performance on all release platforms.
  31. MarcoFalke commented at 6:12 pm on May 20, 2019: member
    • 3.7 (#16052) for gitian
    • FreeBSD 12: clang version 6.0.1
    • not sure what the default is on macos when self-compiled
  32. sipa commented at 6:24 pm on May 20, 2019: member
    I’ll close this for now, then.
  33. sipa closed this on May 20, 2019

  34. fanquake commented at 4:55 am on May 21, 2019: member

    @MarcoFalke, Clang on my macOS machine (Xcode 10.2.1) is:

    0Apple LLVM version 10.0.1 (clang-1001.0.46.4)
    1Target: x86_64-apple-darwin18.6.0
    2Thread model: posix
    3InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
    
  35. DesWurstes commented at 12:46 pm on May 21, 2019: contributor
    …which is actually Clang 5.
  36. fanquake commented at 6:34 pm on May 23, 2019: member
    I’m going to reopen this, as we will be switching to a newer version of Clang in gitian.
  37. fanquake reopened this on May 23, 2019

  38. sipa commented at 7:05 pm on May 23, 2019: member
    I’ll benchmark again in clang-7.
  39. MarcoFalke commented at 7:16 pm on May 23, 2019: member
    I see a slowdown in SHA256 and SHA256_32b with both gcc-9 and clang-8
  40. fanquake commented at 8:09 am on June 24, 2019: member
    Futher benchmarking reported here and outside this PR have shown that there are likely slowdown issues with this change and recent versions of Clang. Closing again for now.
  41. fanquake closed this on Jun 24, 2019

  42. fanquake added the label Future on Jun 24, 2019
  43. laanwj added this to the milestone Future on Sep 30, 2019
  44. laanwj removed the label Future on Sep 30, 2019
  45. bitcoin locked this on Dec 16, 2021
  46. hebasto commented at 2:19 pm on September 24, 2023: member
    Picked up for MSVC builds in #28526.

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-07-01 13:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me