add avx512 instrinsic

fingera commented at 4:06 am on August 16, 2018: contributor

gmaxwell commented at 4:18 am on August 16, 2018: contributor

Interesting! Benchmark results?

fingera commented at 5:19 am on August 16, 2018: contributor

CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

https://github.com/fingera/bitcoin/tree/1-avx512-benchmark:

./src/bench/bench_bitcoin -filter="FINGERA.+"
FINGERA_MerkleRoot, 5, 800, 3.74455, 0.000933415, 0.000938638, 0.000937376
FINGERA_MerkleRoot8Way, 5, 800, 6.24424, 0.00155711, 0.00157051, 0.00155929
FINGERA_SHA256D64_1024, 5, 7400, 3.09212, 8.34755e-05, 8.36662e-05, 8.35919e-05
FINGERA_SHA256D64_10248Way, 5, 7400, 5.75079, 0.000155021, 0.000156514, 0.000155131

fingera commented at 6:28 am on August 16, 2018: contributor

ci error: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127

fingera commented at 7:47 am on August 16, 2018: contributor

Need to rebase?

fingera force-pushed on Aug 16, 2018

laanwj added the label Utils/log/libs on Aug 16, 2018

laanwj commented at 11:23 am on August 27, 2018: member

so 40 to 46% faster? that’s quite impressive
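For reference, the 40 to 46% figure can be reproduced from the reported numbers. The sketch below assumes that the fourth column of the bench_bitcoin output is total elapsed seconds, and that the rows without the 8Way suffix are the AVX512 runs; neither is stated explicitly in the thread. The function names are made up for illustration.

```cpp
// Share of wall-clock time saved by the faster run, in percent.
double TimeSavedPercent(double fast_secs, double slow_secs) {
    return (1.0 - fast_secs / slow_secs) * 100.0;
}

// Throughput ratio: how many times more work per second the faster run does.
double SpeedupFactor(double fast_secs, double slow_secs) {
    return slow_secs / fast_secs;
}
```

With the totals above, `TimeSavedPercent(3.74455, 6.24424)` is about 40.0 and `TimeSavedPercent(3.09212, 5.75079)` is about 46.2. Expressed as throughput, `SpeedupFactor` gives roughly 1.67x and 1.86x, so "40 to 46% less time" and "67 to 86% more hashes per second" describe the same measurement.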

fingera commented at 1:37 am on August 28, 2018: contributor

Because AVX512 added the _mm512_rol_epi32 intrinsic, this may get faster in the future. Is there anything I can do to help get this merged?
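To illustrate why a rotate intrinsic matters here: SHA-256 relies on 32-bit right rotations, which code without a rotate instruction has to build from two shifts and an OR. The scalar sketch below models a single 32-bit lane, not the vector code in this PR; the helper names mirror the PR's style but the rotate helpers are assumptions made for illustration.

```cpp
#include <cstdint>

// Scalar stand-ins for one 32-bit lane of the vector helpers.
inline uint32_t ShR(uint32_t x, unsigned n) { return x >> n; }
inline uint32_t ShL(uint32_t x, unsigned n) { return x << n; }
inline uint32_t Or(uint32_t x, uint32_t y) { return x | y; }

// Rotate right built from two shifts and an OR; valid for 0 < n < 32.
inline uint32_t RotRShifts(uint32_t x, unsigned n) {
    return Or(ShR(x, n), ShL(x, 32 - n));
}

// Rotate left for 0 < n < 32. A hardware rotate does this in one operation.
inline uint32_t RoL(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

// A right rotation by n is the same as a left rotation by 32 - n.
inline uint32_t RotRViaRoL(uint32_t x, unsigned n) { return RoL(x, 32 - n); }

// SHA-256 "big sigma 0": rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22).
inline uint32_t Sigma0(uint32_t x) {
    return RotRViaRoL(x, 2) ^ RotRViaRoL(x, 13) ^ RotRViaRoL(x, 22);
}
```

Both rotate variants give identical results, for example `RotRShifts(0x12345678, 8)` and `RotRViaRoL(0x12345678, 8)` are both `0x78123456`. The potential saving is that each rotation becomes one operation instead of three.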

in configure.ac:380 in 2fa6fa40c3 outdated

@@ -375,6 +376,27 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
 )
 CXXFLAGS="$TEMP_CXXFLAGS"
 
+case $host in
+  *mingw*)

luke-jr commented at 8:13 am on August 28, 2018:

Why is this disallowed on mingw? Add a comment…

fingera commented at 8:35 am on August 28, 2018:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127

luke-jr commented at 8:45 am on August 28, 2018:

Shouldn’t your below test cleanly fail in that scenario?

fingera commented at 9:01 am on August 28, 2018:

Most servers run Linux and support AVX512?

luke-jr commented at 9:03 am on August 28, 2018:

That’s irrelevant…

fingera commented at 9:07 am on August 28, 2018:

What should I do? Update the CI mingw version?

luke-jr commented at 9:10 am on August 28, 2018:

You shouldn’t need to.

in configure.ac:386 in 2fa6fa40c3 outdated

+    ;;
+  *)
+    TEMP_CXXFLAGS="$CXXFLAGS"
+    CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
+    AC_MSG_CHECKING(for AVX512 intrinsics)
+    AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[

luke-jr commented at 9:14 am on August 28, 2018:

Use AC_LINK_IFELSE here instead

fingera commented at 1:19 pm on August 28, 2018:

It’s not a link error. mingw generates SEH directives for registers xmm16 and above, but the assembler can’t handle those directives:

.seh_savexmm	%xmm20, 256

Another solution: make the AC_LANG_PROGRAM longer than 50 lines, like this: https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703

fingera commented at 9:17 am on August 31, 2018:

Would it be OK to add a little code so that mingw generates xmm registers above 16? @luke-jr

in src/crypto/sha256_avx512.cpp:25 in 2fa6fa40c3 outdated

+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
+__m512i inline ShR(__m512i x, int n) { return _mm512_srli_epi32(x, n); }

practicalswift commented at 9:04 am on October 1, 2018:

Shouldn’t the second parameter to ShR (n) be an unsigned integer to match _mm512_srli_epi32(__m512i, unsigned int)?

fingera commented at 1:27 am on October 15, 2018:

OK, thanks. RoL because of clang.

DrahtBot commented at 8:22 am on November 30, 2018: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

DrahtBot added the label Needs rebase on Jan 30, 2019

dongcarl commented at 9:38 pm on May 21, 2019: member

@fingera Would you like to rebase and keep working on this? I’d be happy to rebase for you.

fingera commented at 8:09 am on May 24, 2019: contributor

> @fingera Would you like to rebase and keep working on this? I’d be happy to rebase for you.

Thank you. Just waiting for the CI mingw update :)

dongcarl commented at 3:22 pm on May 24, 2019: member

@fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?

fingera commented at 1:23 am on May 27, 2019: contributor

> @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?

Yes. But it’s too long; I think disallowing it is better.

jamesob commented at 3:54 pm on May 29, 2019: member

Per microbenchmarks, this change appears to have some pretty substantial performance improvements.

I’ve rebased this PR and run the microbenches on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz. Notably, micro.gcc.MerkleRoot and micro.gcc.SHA256D64_1024 are, respectively, 1.76x and 1.91x slower on master relative to this branch.

Kind of curious that we only see performance improvements on gcc though - is that expected?

Benchmarking data follows, significant results bolded.

review/1-avx512 vs. master (relative)

bench name	x	review/1-avx512	master
build.make.22.clang.total_secs	1	1.000	1.003
build.make.22.clang.peak_rss_KiB	1	1.000	1.000
micro.clang.j=4.total_secs	3	1.007	1.000
micro.clang.MerkleRoot.total_secs	3	1.008	1.000
micro.clang.SHA1.total_secs	3	1.006	1.000
micro.clang.SHA256.total_secs	3	1.000	1.003
micro.clang.SHA256D64_1024.total_secs	3	1.001	1.000
micro.clang.SHA256_32b.total_secs	3	1.000	1.000
micro.clang.SHA512.total_secs	3	1.000	1.000
micro.gcc.j=4.total_secs	3	1.000	1.164
micro.gcc.MerkleRoot.total_secs	3	1.000	1.760
micro.gcc.SHA1.total_secs	3	1.001	1.000
micro.gcc.SHA256.total_secs	3	1.008	1.000
micro.gcc.SHA256D64_1024.total_secs	3	1.000	1.912
micro.gcc.SHA256_32b.total_secs	3	1.002	1.000
micro.gcc.SHA512.total_secs	3	1.000	1.003

review/1-avx512 vs. master (absolute)

bench name	x	review/1-avx512	master
build.make.22.clang.total_secs	1	127.3094 (± 0.0000)	127.6333 (± 0.0000)
build.make.22.clang.peak_rss_KiB	1	593740.0000 (± 0.0000)	593512.0000 (± 0.0000)
micro.clang.j=4.total_secs	3	48.1295 (± 0.4233)	47.7760 (± 0.0253)
micro.clang.MerkleRoot.total_secs	3	0.0019 (± 0.0000)	0.0019 (± 0.0000)
micro.clang.SHA1.total_secs	3	0.0029 (± 0.0000)	0.0029 (± 0.0000)
micro.clang.SHA256.total_secs	3	0.0049 (± 0.0000)	0.0049 (± 0.0000)
micro.clang.SHA256D64_1024.total_secs	3	0.0002 (± 0.0000)	0.0002 (± 0.0000)
micro.clang.SHA256_32b.total_secs	3	0.0000 (± 0.0000)	0.0000 (± 0.0000)
micro.clang.SHA512.total_secs	3	0.0045 (± 0.0000)	0.0045 (± 0.0000)
micro.gcc.j=4.total_secs	3	42.0600 (± 0.1458)	48.9681 (± 0.1150)
micro.gcc.MerkleRoot.total_secs	3	0.0011 (± 0.0000)	0.0020 (± 0.0000)
micro.gcc.SHA1.total_secs	3	0.0029 (± 0.0000)	0.0029 (± 0.0000)
micro.gcc.SHA256.total_secs	3	0.0049 (± 0.0001)	0.0049 (± 0.0000)
micro.gcc.SHA256D64_1024.total_secs	3	0.0001 (± 0.0000)	0.0002 (± 0.0000)
micro.gcc.SHA256_32b.total_secs	3	0.0000 (± 0.0000)	0.0000 (± 0.0000)
micro.gcc.SHA512.total_secs	3	0.0048 (± 0.0000)	0.0048 (± 0.0000)
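The relative table appears to be derived from the absolute one by dividing both columns by the smaller of the two values, so the faster side always reads 1.000. That normalization is inferred from the numbers rather than stated in the thread, and the function name below is hypothetical.

```cpp
#include <algorithm>
#include <utility>

// Returns (branch, master), each divided by the smaller of the two timings,
// so the faster side is exactly 1.0 and the other shows the slowdown factor.
std::pair<double, double> Relative(double branch_secs, double master_secs) {
    const double best = std::min(branch_secs, master_secs);
    return {branch_secs / best, master_secs / best};
}
```

For micro.gcc.j=4, `Relative(42.0600, 48.9681)` yields 1.0 for the branch and about 1.164 for master, matching the table. The per-benchmark rows cannot be reproduced this way from the absolute table, because those timings are rounded to four decimal places (0.0011 against 0.0020 would give about 1.82 instead of the reported 1.760).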

dongcarl commented at 4:46 pm on May 29, 2019: member

> > @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?

> Yes. But it’s too long; I think disallowing it is better.

The length of the AC_LANG_PROGRAM is completely fine, I believe we’d rather have accurate detection than broad-stroke disabling. Please rebase that branch over master and I’d be happy to take a look! :smile:

add avx512 instrinsic b0e207ebab

fix avx512 ShR bf15f2abcf

mingw avx512 detection 6a76a13b89

fingera force-pushed on May 30, 2019

DrahtBot removed the label Needs rebase on May 30, 2019

jamesob commented at 3:56 pm on May 30, 2019: member

Did some additional profiling and it’s not clear to me that this change is worth pursuing.

Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:

[image: microbenches]

However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:

[image: ibd local range 500000 510000]

fingera commented at 6:40 am on May 31, 2019: contributor

> Did some additional profiling and it’s not clear to me that this change is worth pursuing.

> Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:

> However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:

Because in this benchmark, merkle root

> Did some additional profiling and it’s not clear to me that this change is worth pursuing.

> Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:

> However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:

Maybe most of the time is spent on disk I/O?

in src/crypto/sha256_avx512.cpp:25 in 6a76a13b89 outdated

+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
+__m512i inline ShR(__m512i x, unsigned int n) { return _mm512_srli_epi32(x, n); }

practicalswift commented at 6:23 pm on June 1, 2019:

There is an implicit conversion from unsigned int to int here. Could be made explicit to make it easier to reason about correctness?

fingera commented at 1:35 am on June 3, 2019:

gcc and Intel: __m512i _mm512_srli_epi32 (__m512i a, unsigned int imm8)
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_srli_epi32&expand=5515
https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx512fintrin.h
clang: __m512i _mm512_srli_epi32(__m512i __A, int __B)
https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html#l05168

Define it as a macro instead? #define ShR _mm512_srli_epi32

practicalswift commented at 6:30 am on June 3, 2019:

Oh, got it. Thanks for clarifying.

Also saw this old comment of mine: #13989 (review)

Sorry for the confusion :-)

fanquake requested review from dongcarl on Jun 24, 2019

fanquake commented at 8:24 am on June 24, 2019: member

@dongcarl Were you still interested in following up here?

dongcarl changes_requested

dongcarl commented at 7:36 pm on June 26, 2019: member

AVX512 should be disabled if SHA-NI is enabled, like was done for AVX2 here: https://github.com/bitcoin/bitcoin/blob/1b28bca04c2767c8bca21d66bd6978f358b0a96a/src/crypto/sha256.cpp#L613

See discussion here: http://www.erisian.com.au/bitcoin-core-dev/log-2019-06-26.html#l-511
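The requested precedence can be sketched as a plain selection function: when SHA-NI is available it wins outright and the wider vector paths are not used. The struct, field names, and returned labels below are hypothetical; the actual selection logic lives in src/crypto/sha256.cpp and is structured differently. The ordering of the non-SHA-NI branches is also an assumption.

```cpp
#include <string>

// Hypothetical snapshot of what runtime CPU detection reported.
struct CpuFeatures {
    bool sse41;
    bool avx2;
    bool avx512f;
    bool shani;
};

// Picks one SHA-256 backend label. SHA-NI takes priority, which effectively
// disables the AVX512, AVX2, and SSE4.1 paths when it is present.
std::string SelectSha256Backend(const CpuFeatures& f) {
    if (f.shani) return "shani";
    if (f.avx512f) return "avx512";
    if (f.avx2) return "avx2";
    if (f.sse41) return "sse41";
    return "generic";
}
```

A CPU reporting every feature therefore selects "shani", while one with AVX512 but no SHA-NI selects "avx512".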

disable sha512 when shani exists 90f282f4bf

dongcarl commented at 10:06 pm on June 27, 2019: member

lightly-tested ACK 90f282f4bfe8e052c674ff829239f6a4845d7397

Code changes look good. Checked that it matches all instances of what we do for AVX2.
Build system changes lightly tested on Linux machine w/o AVX512
- Correctly enables AVX512 based on compiler compatibility, not system capability
- Correctly disables AVX512 for gcc 4.8

If someone has OSX/Windows machines, please test as well.

fanquake commented at 6:21 am on June 29, 2019: member

> If someone has OSX/Windows machines, please test as well.

@dongcarl Is there anything in particular you’d like tested? I can test on macOS.

dongcarl commented at 6:51 pm on July 4, 2019: member

@fanquake Mostly if the build system changes work correctly. It seems clang added support for AVX512 here: https://github.com/llvm-mirror/clang/commit/dab7845798d673cabeef792451354aeb394cdd54

Not sure what version that was tho.

fanquake commented at 4:17 am on July 5, 2019: member

@dongcarl No worries.

Tested on macOS with Clang. Looks like support for AVX512 arrived in LLVM in 3.9:

clang --version
Apple LLVM version 10.0.1 (clang-1001.0.46.4)
Target: x86_64-apple-darwin18.6.0

./configure | grep -i -E 'AVX'
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

However my CPU does not support AVX512:

sysctl -a | grep machdep.cpu.features
AVX1.0
sysctl -a | grep machdep.cpu.leaf7_features
AVX2

Tested inside a debian:jessie-slim Docker container that AVX512 is disabled for GCC 4.8 and enabled for GCC 4.9 (Intel AVX-512 support was added to GCC in 4.9).

GCC 4.8 - gcc-4.8 (Debian 4.8.4-1) 4.8.4:

./configure CC=gcc-4.8 CXX=g++-4.8 | grep -i 'AVX'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... no
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... no

GCC 4.9 - gcc (Debian 4.9.2-10+deb8u2) 4.9.2:

./configure --disable-wallet | grep -i 'AVX'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

Also tested with Clang inside the same box:

Clang 3.5 Debian clang version 3.5.0-10 (tags/RELEASE_350/final) (based on LLVM 3.5.0):

./configure CC=clang-3.5 CXX=clang-3.5 | grep -i 'avx'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... no

Clang 4.0 clang version 4.0.1-10~deb8u1 (tags/RELEASE_401/final):

./configure --disable-wallet CC=clang-4.0 CXX=clang-4.0 | grep -i 'avx'
configure: WARNING: Doxygen not found
checking whether C++ compiler accepts -mavx -mavx2... yes
checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
checking for AVX2 intrinsics... yes
checking for AVX512 intrinsics... yes

It’d be great to have @sipa , @gmaxwell or @theuni give this another look over.

laanwj commented at 1:14 pm on July 5, 2019: member

Concept ACK, I have no hardware to test this on.

promag commented at 2:13 pm on July 5, 2019: member

I also don’t have hardware to test, concept ACK.

fingera commented at 1:37 am on July 6, 2019: contributor

iMac Pro or Mac Pro :) Most cloud servers already support it: https://aws.amazon.com/ec2/instance-types/ Desktop adoption may have to wait for the future :<

in configure.ac:420 in 90f282f4bf

@@ -412,6 +413,87 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
 )
 CXXFLAGS="$TEMP_CXXFLAGS"
 
+TEMP_CXXFLAGS="$CXXFLAGS"
+CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
+AC_MSG_CHECKING(for AVX512 intrinsics)
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+    #include <stdint.h>

theuni commented at 8:54 pm on July 8, 2019:

No need to put the whole file here. Just check the intrinsics themselves.

fingera commented at 1:33 am on July 9, 2019:

The program is big on purpose, to trigger the mingw GCC bug.

in src/crypto/sha256_avx512.cpp:6 in 90f282f4bf

@@ -0,0 +1,350 @@
+#ifdef ENABLE_AVX512
+
+#include <stdint.h>
+#include <immintrin.h>
+
+#include <crypto/sha256.h>

theuni commented at 9:02 pm on July 8, 2019:

Why?

fingera commented at 1:43 am on July 9, 2019:

sha256.h does not seem to be used. Keep it for consistency with the other files (avx2, sse4)?

theuni commented at 9:07 pm on July 8, 2019: member

Concept ACK.

Build system changes look good at a glance. I can’t test this either :(

sipa commented at 9:52 pm on July 8, 2019: member

Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?

A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.
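A minimal sketch of that idea, under stated assumptions: each candidate implementation is wrapped in a callable that performs a fixed amount of hashing work, every candidate is timed a few times, and the one with the lowest best-of-N time wins. The `Candidate` type and `PickFastest` name are invented for illustration and do not exist in the codebase.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical wrapper: a label plus a callable that runs a fixed workload
// with one particular SHA-256 implementation.
struct Candidate {
    std::string name;
    std::function<void()> run;
};

// Times every candidate `rounds` times and returns the index of the one with
// the lowest single-round time. Callers must pass at least one candidate.
std::size_t PickFastest(const std::vector<Candidate>& candidates, int rounds) {
    using Clock = std::chrono::steady_clock;
    std::size_t best_index = 0;
    Clock::duration best_time = Clock::duration::max();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        // Best-of-N reduces noise from scheduling and cold caches.
        Clock::duration fastest_round = Clock::duration::max();
        for (int r = 0; r < rounds; ++r) {
            const auto start = Clock::now();
            candidates[i].run();
            const auto elapsed = Clock::now() - start;
            if (elapsed < fastest_round) fastest_round = elapsed;
        }
        if (fastest_round < best_time) {
            best_time = fastest_round;
            best_index = i;
        }
    }
    return best_index;
}
```

A startup probe like this is still a microbenchmark. It would not capture the case raised in this thread where AVX512 instructions lower the clock speed for other code that runs interleaved with the hashing, so it is a complement to whole-workload measurements such as IBD timing and not a replacement for them.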

fingera commented at 2:09 am on July 9, 2019: contributor

> Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?

> A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.

I think AVX512 will be faster on all platforms. AVX2 has frequency scaling too.

sipa commented at 2:22 am on July 9, 2019: member

@fingera Maybe, seeing benchmarks would certainly be more convincing. Given that executing AVX512 instructions (as far as I know) may slow down other instructions, it may even be the case that in a microbenchmark AVX512-based SHA256 code is a win, but in realistic load conditions where SHA256 operations are interleaved with other things, it is not.

fingera commented at 2:25 am on July 9, 2019: contributor

@sipa Yes, you are right: https://en.wikichip.org/wiki/intel/frequency_behavior

I think these AVX512 intrinsics fall under the AVX2 heavy mode.

Maybe bit scanning and multiplication put the CPU into heavy mode? I think bit rotation (Or, And) is light mode (non-AVX2?)

fingera commented at 4:45 am on July 9, 2019: contributor

https://github.com/travisdowns/avx-turbo @sipa looks good

CPUID highest leaf  : [ dh]
Running as root     : [YES]
MSR reads supported : [NO ]
CPU pinning enabled : [YES]
CPU supports AVX2   : [YES]
CPU supports AVX-512: [YES]
CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
tsc_freq = 2499.9 MHz (from calibration loop)
CPU brand string: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
1 available CPUs: [0]
Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
1 physical cores: [0]
Will test up to 1 CPUs
Cores | ID                  | Description                     | OVRLP1 | OVRLP2 | OVRLP3 | Mops
1     | pause_only          | pause instruction               |  1.000 |  1.000 | 1.000  | 19260
1     | ucomis              | SSE scalar ucomis loop          |  1.000 |  1.000 | 1.000  |  860
1     | ucomis_vex          | VEX scalar ucomis loop          |  1.000 |  1.000 | 1.000  |  546
1     | scalar_iadd         | Scalar integer adds             |  1.000 |  1.000 | 1.000  | 2700
1     | avx128_iadd         | 128-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2699
1     | avx256_iadd         | 256-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2700
1     | avx512_iadd         | 512-bit integer adds            |  1.000 |  1.000 | 1.000  | 2699
1     | avx128_iadd_t       | 128-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8100
1     | avx256_iadd_t       | 256-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8099
1     | avx128_mov_sparse   | 128-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
1     | avx256_mov_sparse   | 256-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
1     | avx512_mov_sparse   | 512-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
1     | avx128_merge_sparse | 128-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
1     | avx256_merge_sparse | 256-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
1     | avx512_merge_sparse | 512-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
1     | avx128_vshift       | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
1     | avx256_vshift       | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
1     | avx512_vshift       | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
1     | avx128_vshift_t     | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5398
1     | avx256_vshift_t     | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5398
1     | avx512_vshift_t     | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
1     | avx128_imul         | 128-bit integer muls            |  1.000 |  1.000 | 1.000  |  540
1     | avx256_imul         | 256-bit integer muls            |  1.000 |  1.000 | 1.000  |  540
1     | avx512_imul         | 512-bit integer muls            |  1.000 |  1.000 | 1.000  |  493
1     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
1     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
1     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
1     | avx128_fma          | 128-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
1     | avx256_fma          | 256-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
1     | avx512_fma          | 512-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
1     | avx128_fma_t        | 128-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 5398
1     | avx256_fma_t        | 256-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 5398
1     | avx512_fma_t        | 512-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 4598
1     | avx512_vpermw       | 512-bit serial WORD permute     |  1.000 |  1.000 | 1.000  |  450
1     | avx512_vpermw_t     | 512-bit parallel WORD permute   |  1.000 |  1.000 | 1.000  | 1350
1     | avx512_vpermd       | 512-bit serial DWORD permute    |  1.000 |  1.000 | 1.000  |  900
1     | avx512_vpermd_t     | 512-bit parallel DWORD permute  |  1.000 |  1.000 | 1.000  | 2699

fingera commented at 5:09 am on July 9, 2019: contributor

https://github.com/fingera/avx-turbo adds more instructions:
avx512_vshift_t speed: 50%
avx512_imul speed: 90%
Executing complex instructions lowers the CPU clock’s upper and lower limits while they run. I don’t think we use these instructions.

dongcarl commented at 7:30 pm on August 5, 2019: member

@fingera Could you show benchmarks of the instructions we’re using? I’m most interested in the epi32 instructions.

fanquake added the label Up for grabs on Feb 25, 2020

fanquake commented at 2:20 am on February 25, 2020: member

We’re still waiting on benchmarks, PR comments need addressing, and the user that opened this seems to have disappeared from GitHub. Going to mark as up for grabs and close for now.

fanquake closed this on Feb 25, 2020

DrahtBot locked this on Feb 15, 2022

add avx512 instrinsic #13989
