add avx512 instrinsic #13989
fingera wants to merge 4 commits into bitcoin:master from fingera:1-avx512 changing 4 files +478 −5
-
fingera commented at 4:06 am on August 16, 2018: contributor
-
gmaxwell commented at 4:18 am on August 16, 2018: contributor
Interesting! Benchmark results?
-
fingera commented at 5:19 am on August 16, 2018: contributor
CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
https://github.com/fingera/bitcoin/tree/1-avx512-benchmark:
./src/bench/bench_bitcoin -filter="FINGERA.+"
FINGERA_MerkleRoot, 5, 800, 3.74455, 0.000933415, 0.000938638, 0.000937376
FINGERA_MerkleRoot8Way, 5, 800, 6.24424, 0.00155711, 0.00157051, 0.00155929
FINGERA_SHA256D64_1024, 5, 7400, 3.09212, 8.34755e-05, 8.36662e-05, 8.35919e-05
FINGERA_SHA256D64_10248Way, 5, 7400, 5.75079, 0.000155021, 0.000156514, 0.000155131
-
fingera commented at 6:28 am on August 16, 2018: contributor
-
fingera commented at 7:47 am on August 16, 2018: contributor
Need to rebase?
-
fingera force-pushed on Aug 16, 2018
-
laanwj added the label Utils/log/libs on Aug 16, 2018
-
laanwj commented at 11:23 am on August 27, 2018: member
So 40 to 46% faster? That’s quite impressive.
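The 40 to 46% figure can be reproduced from the benchmark output above. The following is a minimal sketch, assuming the fourth column is total run time in seconds and that the plain benchmark names exercise the new AVX512 path while the `8Way` names exercise the existing 8-way path; neither assumption is stated in the thread. `TimeSaved` and the two constants are hypothetical names introduced for illustration.

```cpp
// Fraction of run time saved by the faster path relative to the slower one.
constexpr double TimeSaved(double fast_secs, double slow_secs) {
    return 1.0 - fast_secs / slow_secs;
}

// Totals copied from the benchmark output above (assumed: fourth column, seconds).
constexpr double kMerkleRootSaved = TimeSaved(3.74455, 6.24424);  // roughly 0.40
constexpr double kSha256D64Saved  = TimeSaved(3.09212, 5.75079);  // roughly 0.46
```

Read this way, "faster" means less time spent on the same work: about 40% less for MerkleRoot and about 46% less for SHA256D64_1024.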
-
fingera commented at 1:37 am on August 28, 2018: contributor
Because AVX512 adds the _mm512_rol_epi32 intrinsic, it may get faster in the future. Can I do anything to help get this merged?
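To illustrate why a native rotate could help: SHA-256 relies on 32-bit rotations, and without a rotate instruction each one costs two shifts plus an OR. The sketch below is scalar C++ standing in for a single SIMD lane; it is not the PR's code, and all function names are hypothetical. It shows the shift/OR form, and how a right rotation maps onto a left rotation (the operation `_mm512_rol_epi32` provides per lane).

```cpp
#include <cstdint>

// Rotate right built from two shifts and an OR. Valid for n in 1..31.
constexpr uint32_t RotrShiftOr(uint32_t x, unsigned n) {
    return (x >> n) | (x << (32 - n));
}

// Rotate left, standing in for one lane of a native rotate. Valid for n in 1..31.
constexpr uint32_t Rotl(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

// Rotate right expressed through rotate left: ror(x, n) == rol(x, 32 - n).
constexpr uint32_t RotrViaRotl(uint32_t x, unsigned n) {
    return Rotl(x, 32 - n);
}

// SHA-256 big Sigma0: rotations by 2, 13 and 22 bits, XORed together.
constexpr uint32_t Sigma0(uint32_t x) {
    return RotrViaRotl(x, 2) ^ RotrViaRotl(x, 13) ^ RotrViaRotl(x, 22);
}
```

With a hardware rotate, each of the three terms in `Sigma0` becomes one operation instead of three, which is presumably the saving the comment has in mind.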
-
in configure.ac:380 in 2fa6fa40c3 outdated
@@ -375,6 +376,27 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
 )
 CXXFLAGS="$TEMP_CXXFLAGS"

+case $host in
+  *mingw*)
luke-jr commented at 8:13 am on August 28, 2018: Why is this disallowed on mingw? Add a comment…
fingera commented at 8:35 am on August 28, 2018:
luke-jr commented at 8:45 am on August 28, 2018: Shouldn’t your test below cleanly fail in that scenario?
fingera commented at 9:01 am on August 28, 2018: Most servers are linux and support avx512?
luke-jr commented at 9:03 am on August 28, 2018: That’s irrelevant…
fingera commented at 9:07 am on August 28, 2018: What should I do? Update the CI mingw version?
luke-jr commented at 9:10 am on August 28, 2018: You shouldn’t need to.
in configure.ac:386 in 2fa6fa40c3 outdated
+    ;;
+  *)
+    TEMP_CXXFLAGS="$CXXFLAGS"
+    CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
+    AC_MSG_CHECKING(for AVX512 intrinsics)
+    AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
luke-jr commented at 9:14 am on August 28, 2018: Use AC_LINK_IFELSE here instead
fingera commented at 1:19 pm on August 28, 2018: It’s not a link error. mingw generates SEH directives in the assembly for registers xmm16 and above, but the assembler can’t handle those directives:
.seh_savexmm %xmm20, 256
Another solution: make the AC_LANG_PROGRAM longer than 50 lines, like this: https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703
in src/crypto/sha256_avx512.cpp:25 in 2fa6fa40c3 outdated
+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
+__m512i inline ShR(__m512i x, int n) { return _mm512_srli_epi32(x, n); }
practicalswift commented at 9:04 am on October 1, 2018: Shouldn’t the second parameter to ShR (n) be an unsigned integer to match _mm512_srli_epi32(__m512i, unsigned int)?
fingera commented at 1:27 am on October 15, 2018: OK, thanks. RoL because clang.
DrahtBot commented at 8:22 am on November 30, 2018: member
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Conflicts
No conflicts as of last run.
DrahtBot added the label Needs rebase on Jan 30, 2019
dongcarl commented at 3:22 pm on May 24, 2019: member
@fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?
fingera commented at 1:23 am on May 27, 2019: contributor
> @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?
Yes. But it’s too long; I think disallowing it is better.
jamesob commented at 3:54 pm on May 29, 2019: member
Per microbenchmarks, this change appears to have some pretty substantial performance improvements.
I’ve rebased this PR and run the microbenches on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz. Notably, micro.gcc.MerkleRoot and micro.gcc.SHA256D64_1024 are, respectively, 1.76x and 1.91x slower on master relative to this branch.
Kind of curious that we only see performance improvements on gcc though - is that expected?
Benchmarking data follows, significant results bolded.
review/1-avx512 vs. master (relative)
| bench name | x | review/1-avx512 | master |
| --- | --- | --- | --- |
| build.make.22.clang.total_secs | 1 | 1.000 | 1.003 |
| build.make.22.clang.peak_rss_KiB | 1 | 1.000 | 1.000 |
| micro.clang.j=4.total_secs | 3 | 1.007 | 1.000 |
| micro.clang.MerkleRoot.total_secs | 3 | 1.008 | 1.000 |
| micro.clang.SHA1.total_secs | 3 | 1.006 | 1.000 |
| micro.clang.SHA256.total_secs | 3 | 1.000 | 1.003 |
| micro.clang.SHA256D64_1024.total_secs | 3 | 1.001 | 1.000 |
| micro.clang.SHA256_32b.total_secs | 3 | 1.000 | 1.000 |
| micro.clang.SHA512.total_secs | 3 | 1.000 | 1.000 |
| micro.gcc.j=4.total_secs | 3 | 1.000 | 1.164 |
| micro.gcc.MerkleRoot.total_secs | 3 | 1.000 | 1.760 |
| micro.gcc.SHA1.total_secs | 3 | 1.001 | 1.000 |
| micro.gcc.SHA256.total_secs | 3 | 1.008 | 1.000 |
| micro.gcc.SHA256D64_1024.total_secs | 3 | 1.000 | 1.912 |
| micro.gcc.SHA256_32b.total_secs | 3 | 1.002 | 1.000 |
| micro.gcc.SHA512.total_secs | 3 | 1.000 | 1.003 |

review/1-avx512 vs. master (absolute)
| bench name | x | review/1-avx512 | master |
| --- | --- | --- | --- |
| build.make.22.clang.total_secs | 1 | 127.3094 (± 0.0000) | 127.6333 (± 0.0000) |
| build.make.22.clang.peak_rss_KiB | 1 | 593740.0000 (± 0.0000) | 593512.0000 (± 0.0000) |
| micro.clang.j=4.total_secs | 3 | 48.1295 (± 0.4233) | 47.7760 (± 0.0253) |
| micro.clang.MerkleRoot.total_secs | 3 | 0.0019 (± 0.0000) | 0.0019 (± 0.0000) |
| micro.clang.SHA1.total_secs | 3 | 0.0029 (± 0.0000) | 0.0029 (± 0.0000) |
| micro.clang.SHA256.total_secs | 3 | 0.0049 (± 0.0000) | 0.0049 (± 0.0000) |
| micro.clang.SHA256D64_1024.total_secs | 3 | 0.0002 (± 0.0000) | 0.0002 (± 0.0000) |
| micro.clang.SHA256_32b.total_secs | 3 | 0.0000 (± 0.0000) | 0.0000 (± 0.0000) |
| micro.clang.SHA512.total_secs | 3 | 0.0045 (± 0.0000) | 0.0045 (± 0.0000) |
| micro.gcc.j=4.total_secs | 3 | 42.0600 (± 0.1458) | 48.9681 (± 0.1150) |
| micro.gcc.MerkleRoot.total_secs | 3 | 0.0011 (± 0.0000) | 0.0020 (± 0.0000) |
| micro.gcc.SHA1.total_secs | 3 | 0.0029 (± 0.0000) | 0.0029 (± 0.0000) |
| micro.gcc.SHA256.total_secs | 3 | 0.0049 (± 0.0001) | 0.0049 (± 0.0000) |
| micro.gcc.SHA256D64_1024.total_secs | 3 | 0.0001 (± 0.0000) | 0.0002 (± 0.0000) |
| micro.gcc.SHA256_32b.total_secs | 3 | 0.0000 (± 0.0000) | 0.0000 (± 0.0000) |
| micro.gcc.SHA512.total_secs | 3 | 0.0048 (± 0.0000) | 0.0048 (± 0.0000) |

dongcarl commented at 4:46 pm on May 29, 2019: member
> > @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?
> Yes. But it’s too long, I think disallowed is better
The length of the AC_LANG_PROGRAM is completely fine, I believe we’d rather have accurate detection than broad-stroke disabling. Please rebase that branch over master and I’d be happy to take a look! :smile:
add avx512 instrinsic b0e207ebab
fix avx512 ShR bf15f2abcf
mingw avx512 detection 6a76a13b89
fingera force-pushed on May 30, 2019
DrahtBot removed the label Needs rebase on May 30, 2019
jamesob commented at 3:56 pm on May 30, 2019: member
Did some additional profiling and it’s not clear to me that this change is worth pursuing.
Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:
However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:
fingera commented at 6:40 am on May 31, 2019: contributor
> Did some additional profiling and it’s not clear to me that this change is worth pursuing.
> Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:
> However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:
Because in this benchmark, merkle root
> Did some additional profiling and it’s not clear to me that this change is worth pursuing.
> Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:
> However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:
Maybe most of the time is spent on disk I/O?
in src/crypto/sha256_avx512.cpp:25 in 6a76a13b89 outdated
+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
+__m512i inline ShR(__m512i x, unsigned int n) { return _mm512_srli_epi32(x, n); }
practicalswift commented at 6:23 pm on June 1, 2019: There is an implicit conversion from unsigned int to int here. Could be made explicit to make it easier to reason about correctness?
fingera commented at 1:35 am on June 3, 2019:
gcc and Intel: __m512i _mm512_srli_epi32 (__m512i a, unsigned int imm8)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_srli_epi32&expand=5515
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx512fintrin.h
clang: __m512i _mm512_srli_epi32(__m512i __A, int __B)
https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html#l05168
Define it as a macro instead? #define ShR _mm512_srli_epi32
practicalswift commented at 6:30 am on June 3, 2019: Oh, got it. Thanks for clarifying.
Also saw this old comment of mine: #13989 (review)
Sorry for the confusion :-)
fanquake requested review from dongcarl on Jun 24, 2019
dongcarl changes_requested
dongcarl commented at 7:36 pm on June 26, 2019: member
AVX512 should be disabled if SHA-NI is enabled, like was done for AVX2 here: https://github.com/bitcoin/bitcoin/blob/1b28bca04c2767c8bca21d66bd6978f358b0a96a/src/crypto/sha256.cpp#L613
See discussion here: http://www.erisian.com.au/bitcoin-core-dev/log-2019-06-26.html#l-511
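The priority rule requested here can be sketched as a small selection function. This is a hypothetical illustration, not the code in sha256.cpp: the `CpuFeatures` struct, `SelectBackend`, and the returned labels are all invented for the example. The only behaviour taken from the comment is that when SHA-NI is available it wins, and the wide SIMD paths (AVX2, and by extension AVX512) are not used.

```cpp
#include <string>

// Hypothetical feature flags; in practice these would come from CPUID.
struct CpuFeatures {
    bool sse4 = false;
    bool avx2 = false;
    bool avx512 = false;
    bool shani = false;
};

// Pick a SHA256 backend. SHA-NI takes precedence and suppresses every wide
// SIMD path, AVX512 included; otherwise the widest available path is used.
std::string SelectBackend(const CpuFeatures& f) {
    if (f.shani) return "shani";
    if (f.avx512) return "avx512(16way)";
    if (f.avx2) return "avx2(8way)";
    if (f.sse4) return "sse4(4way)";
    return "standard";
}
```

Putting the SHA-NI check first means a machine that reports both SHA-NI and AVX512 never reaches the AVX512 branch.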
disable sha512 when shani exists 90f282f4bf
dongcarl commented at 10:06 pm on June 27, 2019: member
lightly-tested ACK 90f282f4bfe8e052c674ff829239f6a4845d7397
- Code changes look good. Checked that it matches all instances of what we do for AVX2.
- Build system changes lightly tested on Linux machine w/o AVX512
  - Correctly enables AVX512 based on compiler compatibility, not system capability
  - Correctly disables AVX512 for gcc 4.8
If someone has OSX/Windows machines, please test as well.
dongcarl commented at 6:51 pm on July 4, 2019: member
@fanquake Mostly if the build system changes work correctly. It seems clang added support for AVX512 here: https://github.com/llvm-mirror/clang/commit/dab7845798d673cabeef792451354aeb394cdd54
Not sure what version that was tho.
fanquake commented at 4:17 am on July 5, 2019: member
@dongcarl No worries.
Tested on macOS with Clang. Looks like support for AVX512 arrived in LLVM in 3.9:
clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0

./configure | grep -i -E 'AVX'
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

However my CPU does not support AVX512:
sysctl -a | grep machdep.cpu.features
AVX1.0
sysctl -a | grep machdep.cpu.leaf7_features
AVX2
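Because configure only reports what the compiler can build, a binary built this way still needs a runtime check before it executes AVX512 code on a machine like the one above. The sketch below shows the decision as a pure function over register values, so it runs anywhere. `CanUseAvx512F` and the constant names are hypothetical; the bit positions (CPUID leaf 7 EBX bit 16 for AVX512F, and XCR0 bits 1, 2, 5, 6 and 7 for the register state the OS must save) follow Intel's documented layout as I understand it. Reading the registers themselves (CPUID, XGETBV) is compiler-specific and left out.

```cpp
#include <cstdint>

// CPUID.(EAX=7,ECX=0):EBX bit 16 signals AVX512F.
constexpr uint32_t kLeaf7EbxAvx512F = 1u << 16;

// XCR0 bits the OS must enable: SSE (1), AVX (2), opmask (5),
// ZMM_Hi256 (6) and Hi16_ZMM (7).
constexpr uint64_t kXcr0Avx512State = 0xE6;

// True only if the CPU advertises AVX512F and the OS saves the full
// AVX512 register state across context switches.
constexpr bool CanUseAvx512F(uint32_t leaf7_ebx, uint64_t xcr0) {
    return (leaf7_ebx & kLeaf7EbxAvx512F) != 0 &&
           (xcr0 & kXcr0Avx512State) == kXcr0Avx512State;
}
```

A CPU that reports AVX2 but not AVX512F in leaf 7, like the one in the sysctl output, would make this return false regardless of what the compiler accepted.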
Tested inside a debian:jessie-slim Docker container that AVX512 is disabled for GCC 4.8 and enabled for GCC 4.9 (Intel AVX-512 support was added to GCC in 4.9).

GCC 4.8 - gcc-4.8 (Debian 4.8.4-1) 4.8.4:
./configure CC=gcc-4.8 CXX=g++-4.8 | grep -i 'AVX'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... no
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... no

GCC 4.9 - gcc (Debian 4.9.2-10+deb8u2) 4.9.2:
./configure --disable-wallet | grep -i 'AVX'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

Also tested with Clang inside the same box:

Clang 3.5 - Debian clang version 3.5.0-10 (tags/RELEASE_350/final) (based on LLVM 3.5.0):
./configure CC=clang-3.5 CXX=clang-3.5 | grep -i 'avx'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... no

Clang 4.0 - clang version 4.0.1-10~deb8u1 (tags/RELEASE_401/final):
./configure --disable-wallet CC=clang-4.0 CXX=clang-4.0 | grep -i 'avx'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

It’d be great to have @sipa, @gmaxwell or @theuni give this another look over.
laanwj commented at 1:14 pm on July 5, 2019: member
Concept ACK, I have no hardware to test this on.
promag commented at 2:13 pm on July 5, 2019: member
I also don’t have hardware to test, concept ACK.
fingera commented at 1:37 am on July 6, 2019: contributor
iMac Pro or Mac Pro :) Most cloud servers already support it: https://aws.amazon.com/ec2/instance-types/ Desktop adoption may have to wait. :<
in configure.ac:420 in 90f282f4bf
@@ -412,6 +413,87 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
 )
 CXXFLAGS="$TEMP_CXXFLAGS"

+TEMP_CXXFLAGS="$CXXFLAGS"
+CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
+AC_MSG_CHECKING(for AVX512 intrinsics)
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+  #include <stdint.h>
theuni commented at 8:54 pm on July 8, 2019: No need to put the whole file here. Just check the intrinsics themselves.
fingera commented at 1:33 am on July 9, 2019: A big program is needed to trigger the mingw GCC bug.
in src/crypto/sha256_avx512.cpp:6 in 90f282f4bf
@@ -0,0 +1,350 @@
+#ifdef ENABLE_AVX512
+
+#include <stdint.h>
+#include <immintrin.h>
+
+#include <crypto/sha256.h>
theuni commented at 9:02 pm on July 8, 2019: Why?
fingera commented at 1:43 am on July 9, 2019: sha256.h does not seem to be used. Keep it the same as the other files (avx2, sse4)?
theuni commented at 9:07 pm on July 8, 2019: member
Concept ACK.
Build system changes look good at a glance. I can’t test this either :(
sipa commented at 9:52 pm on July 8, 2019: member
Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?
A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.
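The startup-benchmark idea could look roughly like the sketch below: time each available implementation on the running machine and keep the fastest, instead of assuming a fixed ordering. This is an illustration only; `PickFastest` is a hypothetical helper, and the candidates are generic callables rather than the real SHA256 transforms.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Run each candidate `reps` times and return the index of the one with the
// lowest total wall-clock time. Returns candidates.size() for an empty list.
std::size_t PickFastest(const std::vector<std::function<void()>>& candidates, int reps) {
    using Clock = std::chrono::steady_clock;
    std::size_t best = candidates.size();
    Clock::duration best_time = Clock::duration::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const auto start = Clock::now();
        for (int r = 0; r < reps; ++r) candidates[i]();
        const auto elapsed = Clock::now() - start;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = i;
        }
    }
    return best;
}
```

A tight loop like this measures each candidate in isolation, so it would not capture any clock-speed effect on surrounding non-AVX512 code; a measurement intended to address that concern would need to interleave hashing with other work.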
fingera commented at 2:09 am on July 9, 2019: contributor
> Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?
> A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.
I think AVX512 will be faster on all platforms; AVX2 has frequency scaling too.
sipa commented at 2:22 am on July 9, 2019: member
@fingera Maybe, seeing benchmarks would certainly be more convincing. Given that executing AVX512 instructions (as far as I know) may slow down other instructions, it may even be the case that in a microbenchmark AVX512-based SHA256 code is a win, but in realistic load conditions where SHA256 operations are interleaved with other things, it is not.
fingera commented at 2:25 am on July 9, 2019: contributor
@sipa Yes, you are right: https://en.wikichip.org/wiki/intel/frequency_behavior
I think these AVX512 intrinsics fall under the AVX2 heavy mode.
Maybe bit scanning and multiplication push the CPU into heavy mode? I think bit rotation (Or, And) is light mode (non-AVX2?).
fingera commented at 4:45 am on July 9, 2019: contributor
https://github.com/travisdowns/avx-turbo @sipa looks good
CPUID highest leaf : [ dh]
Running as root : [YES]
MSR reads supported : [NO ]
CPU pinning enabled : [YES]
CPU supports AVX2 : [YES]
CPU supports AVX-512: [YES]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2499.9 MHz (from calibration loop)
CPU brand string: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
1 available CPUs: [0]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
1 physical cores: [0]
Will test up to 1 CPUs
Cores | ID | Description | OVRLP1 | OVRLP2 | OVRLP3 | Mops
1 | pause_only | pause instruction | 1.000 | 1.000 | 1.000 | 19260
1 | ucomis | SSE scalar ucomis loop | 1.000 | 1.000 | 1.000 | 860
1 | ucomis_vex | VEX scalar ucomis loop | 1.000 | 1.000 | 1.000 | 546
1 | scalar_iadd | Scalar integer adds | 1.000 | 1.000 | 1.000 | 2700
1 | avx128_iadd | 128-bit integer serial adds | 1.000 | 1.000 | 1.000 | 2699
1 | avx256_iadd | 256-bit integer serial adds | 1.000 | 1.000 | 1.000 | 2700
1 | avx512_iadd | 512-bit integer adds | 1.000 | 1.000 | 1.000 | 2699
1 | avx128_iadd_t | 128-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 8100
1 | avx256_iadd_t | 256-bit integer parallel adds | 1.000 | 1.000 | 1.000 | 8099
1 | avx128_mov_sparse | 128-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx256_mov_sparse | 256-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx512_mov_sparse | 512-bit reg-reg mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx128_merge_sparse | 128-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx256_merge_sparse | 256-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx512_merge_sparse | 512-bit reg-reg merge mov | 1.000 | 1.000 | 1.000 | 2700
1 | avx128_vshift | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2699
1 | avx256_vshift | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2699
1 | avx512_vshift | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2699
1 | avx128_vshift_t | 128-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 5398
1 | avx256_vshift_t | 256-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 5398
1 | avx512_vshift_t | 512-bit variable shift (vpsrld) | 1.000 | 1.000 | 1.000 | 2699
1 | avx128_imul | 128-bit integer muls | 1.000 | 1.000 | 1.000 | 540
1 | avx256_imul | 256-bit integer muls | 1.000 | 1.000 | 1.000 | 540
1 | avx512_imul | 512-bit integer muls | 1.000 | 1.000 | 1.000 | 493
1 | avx128_fma_sparse | 128-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2700
1 | avx256_fma_sparse | 256-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2700
1 | avx512_fma_sparse | 512-bit 64-bit sparse FMAs | 1.000 | 1.000 | 1.000 | 2700
1 | avx128_fma | 128-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 675
1 | avx256_fma | 256-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 675
1 | avx512_fma | 512-bit serial DP FMAs | 1.000 | 1.000 | 1.000 | 675
1 | avx128_fma_t | 128-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 5398
1 | avx256_fma_t | 256-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 5398
1 | avx512_fma_t | 512-bit parallel DP FMAs | 1.000 | 1.000 | 1.000 | 4598
1 | avx512_vpermw | 512-bit serial WORD permute | 1.000 | 1.000 | 1.000 | 450
1 | avx512_vpermw_t | 512-bit parallel WORD permute | 1.000 | 1.000 | 1.000 | 1350
1 | avx512_vpermd | 512-bit serial DWORD permute | 1.000 | 1.000 | 1.000 | 900
1 | avx512_vpermd_t | 512-bit parallel DWORD permute | 1.000 | 1.000 | 1.000 | 2699
fingera commented at 5:09 am on July 9, 2019: contributor
https://github.com/fingera/avx-turbo adds more instructions:
avx512_vshift_t speed: 50%
avx512_imul speed: 90%
Executing complex instructions causes the CPU clock’s upper and lower limits to drop for that period. I think we don’t use these instructions.
fanquake added the label Up for grabs on Feb 25, 2020
fanquake commented at 2:20 am on February 25, 2020: member
We’re still waiting on benchmarks, PR comments need addressing, and the user that opened this seems to have disappeared from GitHub. Going to mark as up for grabs and close for now.
fanquake closed this on Feb 25, 2020
DrahtBot locked this on Feb 15, 2022
github-metadata-mirror
This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-21 06:12 UTC