Add POWER8 ASM for 4-way SHA256 #13203

pull TheBlueMatt wants to merge 2 commits into bitcoin:master from TheBlueMatt:2018-05-asm changing 5 files +451 −4
  1. TheBlueMatt commented at 5:46 pm on May 9, 2018: member
    Based on #13191, This speeds up 4-way SHA256 by about 3.75x over the C impl.
  2. TheBlueMatt force-pushed on May 9, 2018
  3. laanwj commented at 6:02 pm on May 9, 2018: member
    Cool! Ping @luke-jr
  4. in configure.ac:766 in 67f6440f48 outdated
    761@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([
    762 )
    763 LDFLAGS="$TEMP_LDFLAGS"
    764 
    765+TEMP_LDFLAGS="$LDFLAGS"
    766+LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"
    


    luke-jr commented at 6:20 pm on May 9, 2018:
    Shouldn’t this be CPPFLAGS or CXXFLAGS?

    theuni commented at 7:28 pm on May 9, 2018:
    Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.
  5. in src/Makefile.am:315 in 67f6440f48 outdated
    309@@ -309,6 +310,16 @@ crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
    310 endif
    311 crypto_libbitcoin_crypto_avx2_a_SOURCES = crypto/sha256_avx2.cpp
    312 
    313+if HAVE_POWER8
    314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector
    


    luke-jr commented at 6:21 pm on May 9, 2018:
    This is probably needed on CXXFLAGS, since the man page says it changes the code generated.
  6. in src/crypto/sha256.cpp:33 in 67f6440f48 outdated
    28@@ -29,6 +29,15 @@ namespace sha256d64_avx2
    29 void Transform_8way(unsigned char* out, const unsigned char* in);
    30 }
    31 
    32+#if defined(__linux__) && defined(HAVE_POWER8)
    33+#include <sys/auxv.h>
    


    luke-jr commented at 6:27 pm on May 9, 2018:
    I don’t think we need this here, and POWER vectoring stuff seems to screw with standard C++ stuff, so maybe not a good idea to have it.

    TheBlueMatt commented at 9:01 pm on May 9, 2018:
    auxv isnt POWER- or vector-specific, its to get getauxval which is a libc function.
  7. in src/crypto/sha256.cpp:524 in 67f6440f48 outdated
    519@@ -511,6 +520,12 @@ std::string SHA256AutoDetect()
    520         ret = "sse4";
    521 #endif
    522     }
    523+#elif (defined(__linux__)) && defined(HAVE_POWER8)
    524+    if (getauxval(AT_HWCAP2) & 0x02000000) {
    


    luke-jr commented at 6:28 pm on May 9, 2018:
    Can we move this check to a function in the power8-specific file?

    TheBlueMatt commented at 9:01 pm on May 9, 2018:
    Its right below the x86-specific support check, and its a short file, it seems like overkill.
  8. in src/crypto/sha256_power8.cpp:28 in 67f6440f48 outdated
    23+
    24+namespace sha256_power8
    25+{
    26+
    27+typedef __vector uint32_t uint32x4_p8;
    28+typedef __vector uint8_t uint8x16_p8;
    


    luke-jr commented at 6:29 pm on May 9, 2018:
    __vector is a non-standard GCC extension; any reason we can’t use vector here directly?
  9. in src/crypto/sha256_power8.cpp:12 in 67f6440f48 outdated
     7+
     8+#if defined(HAVE_CONFIG_H)
     9+#include <config/bitcoin-config.h>
    10+#endif
    11+
    12+#ifdef HAVE_POWER8
    


    luke-jr commented at 6:29 pm on May 9, 2018:
    Since this file is only compiled with HAVE_POWER8, I think we don’t need this here.
  10. luke-jr changes_requested
  11. luke-jr commented at 6:30 pm on May 9, 2018: member
    I’d be surprised if this works as-is; either way, some cleanup needed to avoid breaking non-POWER builds.
  12. laanwj added the label Validation on May 9, 2018
  13. in src/Makefile.am:317 in 67f6440f48 outdated
    304+
    305+crypto_libbitcoin_crypto_avx2_a_CXXFLAGS = $(CXXFLAGS)
    306+crypto_libbitcoin_crypto_avx2_a_CPPFLAGS = $(CPPFLAGS)
    307+if ENABLE_AVX2
    308+crypto_libbitcoin_crypto_avx2_a_CXXFLAGS += $(AVX2_CXXFLAGS)
    309+crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
    


    theuni commented at 7:19 pm on May 9, 2018:

    Can’t compiler defines be used instead so that non-autoconf builds can compile with optims too?

    Specifically for this one, rather than ifdef(ENABLE_AVX2), use ifdef(__AVX2__)

  14. in configure.ac:333 in 67f6440f48 outdated
    329@@ -327,6 +330,45 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    330 )
    331 CXXFLAGS="$TEMP_CXXFLAGS"
    332 
    333+AC_DEFINE(ENABLE_AVX2, 1, [Define this symbol to build code that uses AVX2 intrinsics])
    


    theuni commented at 7:31 pm on May 9, 2018:
    Only define these in their respective check’s success cases.
  15. in configure.ac:778 in 67f6440f48 outdated
    773+  __builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf);
    774+  return 0;
    775+  }
    776+  ]])],
    777+  [
    778+    AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
    


    theuni commented at 7:34 pm on May 9, 2018:
    Please move the defines/conditionals down with the other asm flags for consistency.
  16. in src/Makefile.am:298 in 67f6440f48 outdated
    293@@ -289,6 +294,32 @@ if USE_ASM
    294 crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
    295 endif
    296 
    297+crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
    


    theuni commented at 7:35 pm on May 9, 2018:
    These should all be $(AM_CXXFLAGS) $(PIE_FLAGS).
  17. in src/Makefile.am:299 in 67f6440f48 outdated
    293@@ -289,6 +294,32 @@ if USE_ASM
    294 crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp
    295 endif
    296 
    297+crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
    298+crypto_libbitcoin_crypto_sse41_a_CPPFLAGS = $(CPPFLAGS)
    


    theuni commented at 7:36 pm on May 9, 2018:
    Ditto for the CPPFLAGS.
  18. in src/Makefile.am:319 in 67f6440f48 outdated
    314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector
    315+crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS)
    316+crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp
    317+
    318+LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
    319+EXTRA_LIBRARIES += $(LIBBITCOIN_CRYPTO_POWER8)
    


    theuni commented at 7:42 pm on May 9, 2018:
    append to EXTRA_LIBRARIES outside of the if HAVE_POWER8. These libraries are only built if something depends on them.
  19. in src/Makefile.am:319 in 67f6440f48 outdated
    313+if HAVE_POWER8
    314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector
    315+crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS)
    316+crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp
    317+
    318+LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
    


    theuni commented at 7:46 pm on May 9, 2018:
    I’m not sure whether this works as intended or not, but we don’t want to be doing it either way. Add LIBBITCOIN_CRYPTO_POWER8 to bitcoind_LDADD like the others, we can potentially aggregate them as a follow-up.

    TheBlueMatt commented at 9:05 pm on May 9, 2018:
    Why not? Its shorthand for “crypto depends on this” instead of listing it as a dep in every target that may use it. Its also much less britle, which is important given we dont have a POWER travis target.

    theuni commented at 9:53 pm on May 9, 2018:
    I agree with aggregating, but as a clean follow-up, or as part of #13191. I’d just prefer not to have one-off’s. It’s not clear, for example, if EXTRA_LIBRARIES is appended twice, as it depends on when LIBBITCOIN_CRYPTO is calculated.
  20. theuni commented at 7:48 pm on May 9, 2018: member

    Concept ACK!

    First set of build-system comments, there are few things to sort out. Will continue reviewing after the first round of cleanups.

  21. TheBlueMatt commented at 8:59 pm on May 9, 2018: member
    @luke-jr No, it definitely works, but its apparently /only/ tested on POWER, not “not yet tested” :p.
  22. TheBlueMatt force-pushed on May 9, 2018
  23. TheBlueMatt force-pushed on May 9, 2018
  24. in configure.ac:393 in 9329896877 outdated
    388+  [
    389+    AC_MSG_RESULT(no)
    390+  ]
    391+)
    392+AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
    393+LDFLAGS="$TEMP_LDFLAGS"
    


    theuni commented at 9:57 pm on May 9, 2018:
    This should reset CXXFLAGS, not LDFLAGS.
  25. TheBlueMatt force-pushed on May 9, 2018
  26. TheBlueMatt force-pushed on May 9, 2018
  27. in configure.ac:392 in 84c7c7d745 outdated
    381+  return 0;
    382+  }
    383+  ]])],
    384+  [
    385+    AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
    386+    AC_MSG_RESULT(yes)
    


    theuni commented at 11:20 pm on May 9, 2018:
    0POWER8_ASM_CXXFLAGS="-mpower8-vector"
    1AC_MSG_RESULT(yes)
    

    Then at the bottom, with the others:

    0AC_SUBST(POWER8_ASM_CXXFLAGS)
    

    Then in the Makefile.am:

    0crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) $(POWER8_ASM_CXXFLAGS)
    
  28. in configure.ac:392 in 84c7c7d745 outdated
    387+  ],
    388+  [
    389+    AC_MSG_RESULT(no)
    390+  ]
    391+)
    392+AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
    


    theuni commented at 11:24 pm on May 9, 2018:
    HAVE_POWER8 is a define that will be handled by autoheader, it can’t be treated like a variable here. Need to set one separately.
  29. TheBlueMatt force-pushed on May 10, 2018
  30. gmaxwell commented at 6:55 pm on May 15, 2018: contributor
    spiffy! will test soon
  31. TheBlueMatt force-pushed on Jun 4, 2018
  32. TheBlueMatt commented at 6:34 pm on June 4, 2018: member
    Rebased.
  33. DrahtBot added the label Needs rebase on Jun 11, 2018
  34. TheBlueMatt force-pushed on Jun 11, 2018
  35. TheBlueMatt force-pushed on Jun 11, 2018
  36. TheBlueMatt force-pushed on Jun 11, 2018
  37. Add POWER8 vector impl for 4-way SHA256
    This speeds up 4-way SHA256 by about 3.75x over the C impl.
    b04f1176ff
  38. Make checkqueue bench a realistic result
    Previously checkqueue bench would essentially just test cross-core
    latency with more threads the more threads a host had. Not only did
    this make the result worse on hosts with more CPU cores, but it
    doesn't mimic how we use the checkqueue in real life, and made
    bench_bitcoin almost unusably slow on high-core-count systems,
    especially dual-CPU systems.
    
    Instead, we use the same thread count that we use for the real
    validation queue, with a 75-microsecond busy-wait on each thread,
    which should at least vaguely mimic how we use the checkqueue.
    3b402e0738
  39. TheBlueMatt force-pushed on Jun 11, 2018
  40. TheBlueMatt commented at 4:32 pm on June 11, 2018: member
    Rebased. Should be ready for merge +/- some more review, though luckily the tests should mostly cover it.
  41. TheBlueMatt commented at 4:33 pm on June 11, 2018: member
    Note that I also had to fix the checkqueue bench as otherwise bench_bitcoin is unsuably slow on multi-cpu (or many-core) systems.
  42. MarcoFalke removed the label Needs rebase on Jun 11, 2018
  43. laanwj commented at 5:50 pm on June 14, 2018: member
    I’m unable to test this at the moment. Can someone please post the benchmark results before/after this?
  44. DrahtBot added the label Needs rebase on Jul 9, 2018
  45. DrahtBot commented at 11:26 pm on July 9, 2018: member
  46. in src/crypto/sha256_power8.cpp:6 in 3b402e0738
    0@@ -0,0 +1,401 @@
    1+// Copyright (c) 2017 The Bitcoin Core developers
    2+// Distributed under the MIT software license, see the accompanying
    3+// file COPYING or http://www.opensource.org/licenses/mit-license.php.
    4+//
    5+// This is a translation to GCC extended asm syntax from YASM code by Intel
    6+// (available at the bottom of this file).
    


    luke-jr commented at 11:22 pm on July 25, 2018:
    I don’t see any YASM code at the bottom; also, really from Intel?

    sipa commented at 11:24 pm on July 25, 2018:
    Looks like that was copied from the sse4 code.
  47. in configure.ac:387 in 3b402e0738
    382+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    383+    #include <altivec.h>
    384+    #include <stdint.h>
    385+  ]], [[
    386+    unsigned char src[16];
    387+    __builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf);
    


    luke-jr commented at 11:52 pm on July 25, 2018:
    Better to use vector here. We don’t use [the non-standard] __vector in the code itself.
  48. in configure.ac:317 in 3b402e0738
    313@@ -314,6 +314,7 @@ fi
    314 AX_CHECK_COMPILE_FLAG([-msse4.2],[[SSE42_CXXFLAGS="-msse4.2"]],,[[$CXXFLAG_WERROR]])
    315 AX_CHECK_COMPILE_FLAG([-msse4.1],[[SSE41_CXXFLAGS="-msse4.1"]],,[[$CXXFLAG_WERROR]])
    316 AX_CHECK_COMPILE_FLAG([-mavx -mavx2],[[AVX2_CXXFLAGS="-mavx -mavx2"]],,[[$CXXFLAG_WERROR]])
    317+AX_CHECK_COMPILE_FLAG([-mpower8-vector],[[POWER8_CXXFLAGS="-mpower8-vector"]],,[[$CXXFLAG_WERROR]])
    


    luke-jr commented at 11:54 pm on July 25, 2018:
    I think -faltivec is needed too?
  49. gmaxwell commented at 7:16 pm on August 21, 2018: contributor
    This needs a trivial rebase but with that works fine against master. ACK. Will post benchmarks after IBD finishes.
  50. gmaxwell commented at 2:08 am on August 23, 2018: contributor

    Before: MerkleRoot, 5, 800, 21.6763, 0.00541869, 0.00541984, 0.00541886 After: MerkleRoot, 5, 800, 6.32571, 0.00158138, 0.00158148, 0.00158142

    On power9.

    Interesting reindex speed was about the same for both.

  51. gmaxwell commented at 2:09 am on August 23, 2018: contributor
    @TheBlueMatt Rebase.
  52. TheBlueMatt commented at 7:31 pm on August 24, 2018: member
    Hmm, well if it doesnt speed up rebase then there is something funky we should look into before merging new code, no?
  53. TheBlueMatt commented at 8:29 pm on August 24, 2018: member
    Gonna go ahead and close it, if someone wants to adopt it and figure out why IBD didn’t improve where the equivalent x86 change improved it significantly, they should adopt it.
  54. TheBlueMatt closed this on Aug 24, 2018

  55. gmaxwell commented at 11:46 pm on August 24, 2018: contributor
    The comparable change made a big impact on x86 many months ago– who knows now… Synchronization primitives are more expensive on power, so it also wouldn’t be surprising if our glib use of atomics all over the place weren’t moving around the limiting factors in performance on power compared to x86.
  56. laanwj removed the label Needs rebase on Oct 24, 2019
  57. MarcoFalke locked this on Dec 16, 2021

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-21 21:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me