Add POWER8 ASM for 4-way SHA256 #13203
pull TheBlueMatt wants to merge 2 commits into bitcoin:master from TheBlueMatt:2018-05-asm changing 5 files +451 −4-
TheBlueMatt commented at 5:46 pm on May 9, 2018: memberBased on #13191, This speeds up 4-way SHA256 by about 3.75x over the C impl.
-
TheBlueMatt force-pushed on May 9, 2018
-
in configure.ac:766 in 67f6440f48 outdated
761@@ -762,6 +762,29 @@ AC_LINK_IFELSE([AC_LANG_SOURCE([ 762 ) 763 LDFLAGS="$TEMP_LDFLAGS" 764 765+TEMP_LDFLAGS="$LDFLAGS" 766+LDFLAGS="$TEMP_LDFLAGS -mpower8-vector"
luke-jr commented at 6:20 pm on May 9, 2018:Shouldn’t this be CPPFLAGS or CXXFLAGS?
theuni commented at 7:28 pm on May 9, 2018:Agree, but this should be assigned to POWER_VSX_CXXFLAGS or so above, in an AX_CHECK_LINK_FLAG test.in src/Makefile.am:315 in 67f6440f48 outdated
309@@ -309,6 +310,16 @@ crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2 310 endif 311 crypto_libbitcoin_crypto_avx2_a_SOURCES = crypto/sha256_avx2.cpp 312 313+if HAVE_POWER8 314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector
luke-jr commented at 6:21 pm on May 9, 2018:This is probably needed on CXXFLAGS, since the man page says it changes the code generated.in src/crypto/sha256.cpp:33 in 67f6440f48 outdated
28@@ -29,6 +29,15 @@ namespace sha256d64_avx2 29 void Transform_8way(unsigned char* out, const unsigned char* in); 30 } 31 32+#if defined(__linux__) && defined(HAVE_POWER8) 33+#include <sys/auxv.h>
luke-jr commented at 6:27 pm on May 9, 2018:I don’t think we need this here, and POWER vectoring stuff seems to screw with standard C++ stuff, so maybe not a good idea to have it.
TheBlueMatt commented at 9:01 pm on May 9, 2018:auxv isnt POWER- or vector-specific, its to get getauxval which is a libc function.in src/crypto/sha256.cpp:524 in 67f6440f48 outdated
519@@ -511,6 +520,12 @@ std::string SHA256AutoDetect() 520 ret = "sse4"; 521 #endif 522 } 523+#elif (defined(__linux__)) && defined(HAVE_POWER8) 524+ if (getauxval(AT_HWCAP2) & 0x02000000) {
luke-jr commented at 6:28 pm on May 9, 2018:Can we move this check to a function in the power8-specific file?
TheBlueMatt commented at 9:01 pm on May 9, 2018:Its right below the x86-specific support check, and its a short file, it seems like overkill.in src/crypto/sha256_power8.cpp:28 in 67f6440f48 outdated
23+ 24+namespace sha256_power8 25+{ 26+ 27+typedef __vector uint32_t uint32x4_p8; 28+typedef __vector uint8_t uint8x16_p8;
luke-jr commented at 6:29 pm on May 9, 2018:__vector
is a non-standard GCC extension; any reason we can’t usevector
here directly?in src/crypto/sha256_power8.cpp:12 in 67f6440f48 outdated
7+ 8+#if defined(HAVE_CONFIG_H) 9+#include <config/bitcoin-config.h> 10+#endif 11+ 12+#ifdef HAVE_POWER8
luke-jr commented at 6:29 pm on May 9, 2018:Since this file is only compiled withHAVE_POWER8
, I think we don’t need this here.luke-jr changes_requestedluke-jr commented at 6:30 pm on May 9, 2018: memberI’d be surprised if this works as-is; either way, some cleanup needed to avoid breaking non-POWER builds.laanwj added the label Validation on May 9, 2018in src/Makefile.am:317 in 67f6440f48 outdated
304+ 305+crypto_libbitcoin_crypto_avx2_a_CXXFLAGS = $(CXXFLAGS) 306+crypto_libbitcoin_crypto_avx2_a_CPPFLAGS = $(CPPFLAGS) 307+if ENABLE_AVX2 308+crypto_libbitcoin_crypto_avx2_a_CXXFLAGS += $(AVX2_CXXFLAGS) 309+crypto_libbitcoin_crypto_avx2_a_CPPFLAGS += -DENABLE_AVX2
theuni commented at 7:19 pm on May 9, 2018:Can’t compiler defines be used instead so that non-autoconf builds can compile with optims too?
Specifically for this one, rather than
ifdef(ENABLE_AVX2)
, useifdef(__AVX2__)
in configure.ac:333 in 67f6440f48 outdated
329@@ -327,6 +330,45 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ 330 ) 331 CXXFLAGS="$TEMP_CXXFLAGS" 332 333+AC_DEFINE(ENABLE_AVX2, 1, [Define this symbol to build code that uses AVX2 intrinsics])
theuni commented at 7:31 pm on May 9, 2018:Only define these in their respective check’s success cases.in configure.ac:778 in 67f6440f48 outdated
773+ __builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf); 774+ return 0; 775+ } 776+ ]])], 777+ [ 778+ AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.])
theuni commented at 7:34 pm on May 9, 2018:Please move the defines/conditionals down with the other asm flags for consistency.in src/Makefile.am:298 in 67f6440f48 outdated
293@@ -289,6 +294,32 @@ if USE_ASM 294 crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp 295 endif 296 297+crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS)
theuni commented at 7:35 pm on May 9, 2018:These should all be $(AM_CXXFLAGS) $(PIE_FLAGS).in src/Makefile.am:299 in 67f6440f48 outdated
293@@ -289,6 +294,32 @@ if USE_ASM 294 crypto_libbitcoin_crypto_a_SOURCES += crypto/sha256_sse4.cpp 295 endif 296 297+crypto_libbitcoin_crypto_sse41_a_CXXFLAGS = $(CXXFLAGS) 298+crypto_libbitcoin_crypto_sse41_a_CPPFLAGS = $(CPPFLAGS)
theuni commented at 7:36 pm on May 9, 2018:Ditto for the CPPFLAGS.in src/Makefile.am:319 in 67f6440f48 outdated
314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector 315+crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) 316+crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp 317+ 318+LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8) 319+EXTRA_LIBRARIES += $(LIBBITCOIN_CRYPTO_POWER8)
theuni commented at 7:42 pm on May 9, 2018:append to EXTRA_LIBRARIES outside of theif HAVE_POWER8
. These libraries are only built if something depends on them.in src/Makefile.am:319 in 67f6440f48 outdated
313+if HAVE_POWER8 314+crypto_libbitcoin_crypto_power8_a_CPPFLAGS = $(AM_CPPFLAGS) -mpower8-vector 315+crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) 316+crypto_libbitcoin_crypto_power8_a_SOURCES = crypto/sha256_power8.cpp 317+ 318+LIBBITCOIN_CRYPTO += $(LIBBITCOIN_CRYPTO_POWER8)
theuni commented at 7:46 pm on May 9, 2018:I’m not sure whether this works as intended or not, but we don’t want to be doing it either way. Add LIBBITCOIN_CRYPTO_POWER8 to bitcoind_LDADD like the others, we can potentially aggregate them as a follow-up.
TheBlueMatt commented at 9:05 pm on May 9, 2018:Why not? Its shorthand for “crypto depends on this” instead of listing it as a dep in every target that may use it. Its also much less britle, which is important given we dont have a POWER travis target.
theuni commented at 7:48 pm on May 9, 2018: memberConcept ACK!
First set of build-system comments, there are few things to sort out. Will continue reviewing after the first round of cleanups.
TheBlueMatt commented at 8:59 pm on May 9, 2018: member@luke-jr No, it definitely works, but its apparently /only/ tested on POWER, not “not yet tested” :p.TheBlueMatt force-pushed on May 9, 2018TheBlueMatt force-pushed on May 9, 2018in configure.ac:393 in 9329896877 outdated
388+ [ 389+ AC_MSG_RESULT(no) 390+ ] 391+) 392+AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8]) 393+LDFLAGS="$TEMP_LDFLAGS"
theuni commented at 9:57 pm on May 9, 2018:This should reset CXXFLAGS, not LDFLAGS.TheBlueMatt force-pushed on May 9, 2018TheBlueMatt force-pushed on May 9, 2018in configure.ac:392 in 84c7c7d745 outdated
381+ return 0; 382+ } 383+ ]])], 384+ [ 385+ AC_DEFINE(HAVE_POWER8,1,[Define if compiler supports POWER8 instructions.]) 386+ AC_MSG_RESULT(yes)
theuni commented at 11:20 pm on May 9, 2018:0POWER8_ASM_CXXFLAGS="-mpower8-vector" 1AC_MSG_RESULT(yes)
Then at the bottom, with the others:
0AC_SUBST(POWER8_ASM_CXXFLAGS)
Then in the Makefile.am:
0crypto_libbitcoin_crypto_power8_a_CXXFLAGS = $(AM_CXXFLAGS) $(PIE_FLAGS) $(POWER8_ASM_CXXFLAGS)
in configure.ac:392 in 84c7c7d745 outdated
387+ ], 388+ [ 389+ AC_MSG_RESULT(no) 390+ ] 391+) 392+AM_CONDITIONAL([HAVE_POWER8], [$HAVE_POWER8])
theuni commented at 11:24 pm on May 9, 2018:HAVE_POWER8 is a define that will be handled by autoheader, it can’t be treated like a variable here. Need to set one separately.TheBlueMatt force-pushed on May 10, 2018gmaxwell commented at 6:55 pm on May 15, 2018: contributorspiffy! will test soonTheBlueMatt force-pushed on Jun 4, 2018TheBlueMatt commented at 6:34 pm on June 4, 2018: memberRebased.DrahtBot added the label Needs rebase on Jun 11, 2018TheBlueMatt force-pushed on Jun 11, 2018TheBlueMatt force-pushed on Jun 11, 2018TheBlueMatt force-pushed on Jun 11, 2018Add POWER8 vector impl for 4-way SHA256
This speeds up 4-way SHA256 by about 3.75x over the C impl.
Make checkqueue bench a realistic result
Previously checkqueue bench would essentially just test cross-core latency with more threads the more threads a host had. Not only did this make the result worse on hosts with more CPU cores, but it doesn't mimic how we use the checkqueue in real life, and made bench_bitcoin almost unusably slow on high-core-count systems, especially dual-CPU systems. Instead, we use the same thread count that we use for the real validation queue, with a 75-microsecond busy-wait on each thread, which should at least vaguely mimic how we use the checkqueue.
TheBlueMatt force-pushed on Jun 11, 2018TheBlueMatt commented at 4:32 pm on June 11, 2018: memberRebased. Should be ready for merge +/- some more review, though luckily the tests should mostly cover it.TheBlueMatt commented at 4:33 pm on June 11, 2018: memberNote that I also had to fix the checkqueue bench as otherwise bench_bitcoin is unsuably slow on multi-cpu (or many-core) systems.MarcoFalke removed the label Needs rebase on Jun 11, 2018laanwj commented at 5:50 pm on June 14, 2018: memberI’m unable to test this at the moment. Can someone please post the benchmark results before/after this?DrahtBot added the label Needs rebase on Jul 9, 2018DrahtBot commented at 11:26 pm on July 9, 2018: memberin src/crypto/sha256_power8.cpp:6 in 3b402e0738
0@@ -0,0 +1,401 @@ 1+// Copyright (c) 2017 The Bitcoin Core developers 2+// Distributed under the MIT software license, see the accompanying 3+// file COPYING or http://www.opensource.org/licenses/mit-license.php. 4+// 5+// This is a translation to GCC extended asm syntax from YASM code by Intel 6+// (available at the bottom of this file).
luke-jr commented at 11:22 pm on July 25, 2018:I don’t see any YASM code at the bottom; also, really from Intel?
sipa commented at 11:24 pm on July 25, 2018:Looks like that was copied from the sse4 code.in configure.ac:387 in 3b402e0738
382+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ 383+ #include <altivec.h> 384+ #include <stdint.h> 385+ ]], [[ 386+ unsigned char src[16]; 387+ __builtin_crypto_vshasigmaw((__vector uint32_t)vec_vsx_ld(0, src), 1, 0xf);
luke-jr commented at 11:52 pm on July 25, 2018:Better to usevector
here. We don’t use [the non-standard]__vector
in the code itself.in configure.ac:317 in 3b402e0738
313@@ -314,6 +314,7 @@ fi 314 AX_CHECK_COMPILE_FLAG([-msse4.2],[[SSE42_CXXFLAGS="-msse4.2"]],,[[$CXXFLAG_WERROR]]) 315 AX_CHECK_COMPILE_FLAG([-msse4.1],[[SSE41_CXXFLAGS="-msse4.1"]],,[[$CXXFLAG_WERROR]]) 316 AX_CHECK_COMPILE_FLAG([-mavx -mavx2],[[AVX2_CXXFLAGS="-mavx -mavx2"]],,[[$CXXFLAG_WERROR]]) 317+AX_CHECK_COMPILE_FLAG([-mpower8-vector],[[POWER8_CXXFLAGS="-mpower8-vector"]],,[[$CXXFLAG_WERROR]])
luke-jr commented at 11:54 pm on July 25, 2018:I think-faltivec
is needed too?gmaxwell commented at 7:16 pm on August 21, 2018: contributorThis needs a trivial rebase but with that works fine against master. ACK. Will post benchmarks after IBD finishes.gmaxwell commented at 2:08 am on August 23, 2018: contributorBefore: MerkleRoot, 5, 800, 21.6763, 0.00541869, 0.00541984, 0.00541886 After: MerkleRoot, 5, 800, 6.32571, 0.00158138, 0.00158148, 0.00158142
On power9.
Interesting reindex speed was about the same for both.
gmaxwell commented at 2:09 am on August 23, 2018: contributor@TheBlueMatt Rebase.TheBlueMatt commented at 7:31 pm on August 24, 2018: memberHmm, well if it doesnt speed up rebase then there is something funky we should look into before merging new code, no?TheBlueMatt commented at 8:29 pm on August 24, 2018: memberGonna go ahead and close it, if someone wants to adopt it and figure out why IBD didn’t improve where the equivalent x86 change improved it significantly, they should adopt it.TheBlueMatt closed this on Aug 24, 2018
gmaxwell commented at 11:46 pm on August 24, 2018: contributorThe comparable change made a big impact on x86 many months ago– who knows now… Synchronization primitives are more expensive on power, so it also wouldn’t be surprising if our glib use of atomics all over the place weren’t moving around the limiting factors in performance on power compared to x86.laanwj removed the label Needs rebase on Oct 24, 2019MarcoFalke locked this on Dec 16, 2021
github-metadata-mirror
This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-18 18:12 UTC
This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me