add avx512 instrinsic #13989

fingera wants to merge 4 commits into bitcoin:master from fingera:1-avx512, changing 4 files (+478 −5)
  1. fingera commented at 4:06 am on August 16, 2018: contributor
  2. gmaxwell commented at 4:18 am on August 16, 2018: contributor
    Interesting! Benchmark results?
  3. fingera commented at 5:19 am on August 16, 2018: contributor

    CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz

    https://github.com/fingera/bitcoin/tree/1-avx512-benchmark:

    ./src/bench/bench_bitcoin -filter="FINGERA.+"
    FINGERA_MerkleRoot, 5, 800, 3.74455, 0.000933415, 0.000938638, 0.000937376
    FINGERA_MerkleRoot8Way, 5, 800, 6.24424, 0.00155711, 0.00157051, 0.00155929
    FINGERA_SHA256D64_1024, 5, 7400, 3.09212, 8.34755e-05, 8.36662e-05, 8.35919e-05
    FINGERA_SHA256D64_10248Way, 5, 7400, 5.75079, 0.000155021, 0.000156514, 0.000155131
    
  4. fingera commented at 6:28 am on August 16, 2018: contributor
  5. fingera commented at 7:47 am on August 16, 2018: contributor
    Need to rebase?
  6. fingera force-pushed on Aug 16, 2018
  7. laanwj added the label Utils/log/libs on Aug 16, 2018
  8. laanwj commented at 11:23 am on August 27, 2018: member
    so 40 to 46% faster? that’s quite impressive
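For reference, the 40 to 46% range can be reproduced from the fourth field of the benchmark rows posted earlier in the thread. This is only a sketch: it assumes that field is total time in seconds and that the rows without the `8Way` suffix are the AVX512 runs, neither of which is stated here. `TimeSaved` is a hypothetical helper name.

```cpp
// Fraction of wall-clock time saved when a run taking `fast` seconds
// replaces a run taking `slow` seconds.
inline double TimeSaved(double fast, double slow) {
    return 1.0 - fast / slow;
}

// Totals copied from the benchmark output in the thread (assumed to be seconds).
inline double MerkleRootSaved() { return TimeSaved(3.74455, 6.24424); }
inline double Sha256D64Saved() { return TimeSaved(3.09212, 5.75079); }
```

With those inputs the two helpers come out at roughly 0.40 and 0.46, matching the range quoted above.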
  9. fingera commented at 1:37 am on August 28, 2018: contributor
    Because AVX512 adds the _mm512_rol_epi32 intrinsic, it may get faster in the future. Can I do anything to help get this merged?
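To illustrate the point about the rotate intrinsic: without a rotate instruction, a 32-bit rotation has to be built from two shifts and an OR, while AVX512F offers it as one intrinsic across 16 lanes. The sketch below is not the PR's code; `RotL32` and `RotL32x16` are hypothetical names, and the vector version is only compiled when the compiler is invoked with AVX512F enabled.

```cpp
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Scalar rotate-left built from two shifts and an OR, which is what SIMD code
// has to do per lane when no rotate instruction exists. Valid for 0 < n < 32.
inline uint32_t RotL32(uint32_t x, unsigned int n) {
    return (x << n) | (x >> (32 - n));
}

#if defined(__AVX512F__)
// With AVX512F the same rotation on sixteen 32-bit lanes is one intrinsic.
// The rotate count must be a compile-time constant, hence the template
// parameter.
template <int N>
inline __m512i RotL32x16(__m512i x) {
    return _mm512_rol_epi32(x, N);
}
#endif
```

Whether the single-instruction form actually beats shift/shift/or on a given CPU would still need to be measured.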
  10. in configure.ac:380 in 2fa6fa40c3 outdated
    375@@ -375,6 +376,27 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    376 )
    377 CXXFLAGS="$TEMP_CXXFLAGS"
    378 
    379+case $host in
    380+  *mingw*)
    


    luke-jr commented at 8:13 am on August 28, 2018:
    Why is this disallowed on mingw? Add a comment…

    fingera commented at 8:35 am on August 28, 2018:

    luke-jr commented at 8:45 am on August 28, 2018:
    Shouldn’t your below test cleanly fail in that scenario?

    fingera commented at 9:01 am on August 28, 2018:
    Most servers are linux and support avx512?

    luke-jr commented at 9:03 am on August 28, 2018:
    That’s irrelevant…

    fingera commented at 9:07 am on August 28, 2018:
    What should I do? Update the CI mingw version?

    luke-jr commented at 9:10 am on August 28, 2018:
    You shouldn’t need to.
  11. in configure.ac:386 in 2fa6fa40c3 outdated
    381+    ;;
    382+  *)
    383+    TEMP_CXXFLAGS="$CXXFLAGS"
    384+    CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
    385+    AC_MSG_CHECKING(for AVX512 intrinsics)
    386+    AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    


    luke-jr commented at 9:14 am on August 28, 2018:
    Use AC_LINK_IFELSE here instead

    fingera commented at 1:19 pm on August 28, 2018:

    It’s not a link error. mingw generates SEH directives in the assembly for registers above xmm16, but the assembler cannot handle the directive for those registers:

    .seh_savexmm	%xmm20, 256


    Another solution: make the AC_LANG_PROGRAM longer than 50 lines, like this: https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703


    fingera commented at 9:17 am on August 31, 2018:
    Would it be OK to add a little code that makes mingw use more than 16 xmm registers? @luke-jr
  12. in src/crypto/sha256_avx512.cpp:25 in 2fa6fa40c3 outdated
    20+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
    21+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
    22+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
    23+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
    24+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
    25+__m512i inline ShR(__m512i x, int n) { return _mm512_srli_epi32(x, n); }
    


    practicalswift commented at 9:04 am on October 1, 2018:
    Shouldn’t the second parameter to ShR (n) be an unsigned integer to match _mm512_srli_epi32(__m512i, unsigned int)?

    fingera commented at 1:27 am on October 15, 2018:
    ok, thks. RoL because clang.
  13. DrahtBot commented at 8:22 am on November 30, 2018: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    No conflicts as of last run.

  14. DrahtBot added the label Needs rebase on Jan 30, 2019
  15. dongcarl commented at 9:38 pm on May 21, 2019: member
    @fingera Would you like to rebase and keep working on this? I’d be happy to rebase for you.
  16. fingera commented at 8:09 am on May 24, 2019: contributor

    @fingera Would you like to rebase and keep working on this? I’d be happy to rebase for you.

    Thank you. Just waiting for the ci-mingw update :)

  17. dongcarl commented at 3:22 pm on May 24, 2019: member
    @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from https://github.com/fingera/bitcoin/commit/585a8c8b5b500ca8cb2bab4e2d231895a6298703, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?
  18. fingera commented at 1:23 am on May 27, 2019: contributor

    @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?

    Yes. But it’s too long; I think disabling it is better.

  19. jamesob commented at 3:54 pm on May 29, 2019: member

    Per microbenchmarks, this change appears to have some pretty substantial performance improvements.

    I’ve rebased this PR and run the microbenches on an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz. Notably, micro.gcc.MerkleRoot and micro.gcc.SHA256D64_1024 are, respectively, 1.76x and 1.91x slower on master relative to this branch.

    Kind of curious that we only see performance improvements on gcc though - is that expected?

    Benchmarking data follows, significant results bolded.

    review/1-avx512 vs. master (relative)

    bench name                            | x | review/1-avx512 | master
    build.make.22.clang.total_secs        | 1 | 1.000           | 1.003
    build.make.22.clang.peak_rss_KiB      | 1 | 1.000           | 1.000
    micro.clang.j=4.total_secs            | 3 | 1.007           | 1.000
    micro.clang.MerkleRoot.total_secs     | 3 | 1.008           | 1.000
    micro.clang.SHA1.total_secs           | 3 | 1.006           | 1.000
    micro.clang.SHA256.total_secs         | 3 | 1.000           | 1.003
    micro.clang.SHA256D64_1024.total_secs | 3 | 1.001           | 1.000
    micro.clang.SHA256_32b.total_secs     | 3 | 1.000           | 1.000
    micro.clang.SHA512.total_secs         | 3 | 1.000           | 1.000
    micro.gcc.j=4.total_secs              | 3 | 1.000           | 1.164
    micro.gcc.MerkleRoot.total_secs       | 3 | 1.000           | 1.760
    micro.gcc.SHA1.total_secs             | 3 | 1.001           | 1.000
    micro.gcc.SHA256.total_secs           | 3 | 1.008           | 1.000
    micro.gcc.SHA256D64_1024.total_secs   | 3 | 1.000           | 1.912
    micro.gcc.SHA256_32b.total_secs       | 3 | 1.002           | 1.000
    micro.gcc.SHA512.total_secs           | 3 | 1.000           | 1.003

    review/1-avx512 vs. master (absolute)

    bench name                            | x | review/1-avx512        | master
    build.make.22.clang.total_secs        | 1 | 127.3094 (± 0.0000)    | 127.6333 (± 0.0000)
    build.make.22.clang.peak_rss_KiB      | 1 | 593740.0000 (± 0.0000) | 593512.0000 (± 0.0000)
    micro.clang.j=4.total_secs            | 3 | 48.1295 (± 0.4233)     | 47.7760 (± 0.0253)
    micro.clang.MerkleRoot.total_secs     | 3 | 0.0019 (± 0.0000)      | 0.0019 (± 0.0000)
    micro.clang.SHA1.total_secs           | 3 | 0.0029 (± 0.0000)      | 0.0029 (± 0.0000)
    micro.clang.SHA256.total_secs         | 3 | 0.0049 (± 0.0000)      | 0.0049 (± 0.0000)
    micro.clang.SHA256D64_1024.total_secs | 3 | 0.0002 (± 0.0000)      | 0.0002 (± 0.0000)
    micro.clang.SHA256_32b.total_secs     | 3 | 0.0000 (± 0.0000)      | 0.0000 (± 0.0000)
    micro.clang.SHA512.total_secs         | 3 | 0.0045 (± 0.0000)      | 0.0045 (± 0.0000)
    micro.gcc.j=4.total_secs              | 3 | 42.0600 (± 0.1458)     | 48.9681 (± 0.1150)
    micro.gcc.MerkleRoot.total_secs       | 3 | 0.0011 (± 0.0000)      | 0.0020 (± 0.0000)
    micro.gcc.SHA1.total_secs             | 3 | 0.0029 (± 0.0000)      | 0.0029 (± 0.0000)
    micro.gcc.SHA256.total_secs           | 3 | 0.0049 (± 0.0001)      | 0.0049 (± 0.0000)
    micro.gcc.SHA256D64_1024.total_secs   | 3 | 0.0001 (± 0.0000)      | 0.0002 (± 0.0000)
    micro.gcc.SHA256_32b.total_secs       | 3 | 0.0000 (± 0.0000)      | 0.0000 (± 0.0000)
    micro.gcc.SHA512.total_secs           | 3 | 0.0048 (± 0.0000)      | 0.0048 (± 0.0000)
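The relative table looks like each row of the absolute table divided by the faster of its two columns. That reading is an inference from the numbers, not something stated in the comment, and `Relative` is a hypothetical helper used only to check it.

```cpp
#include <algorithm>
#include <utility>

// Normalize a (branch, master) pair of timings by the smaller of the two, so
// the faster side reads 1.000 and the other side reads as a slowdown factor.
inline std::pair<double, double> Relative(double branch, double master) {
    const double best = std::min(branch, master);
    return {branch / best, master / best};
}
```

Applied to the `micro.gcc.j=4.total_secs` row (42.0600 vs. 48.9681) this gives 1.000 and about 1.164, as in the relative table. Rows such as MerkleRoot cannot be reproduced exactly from the absolute table because those values are rounded to four decimals.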
  20. dongcarl commented at 4:46 pm on May 29, 2019: member

    @fingera If I understand you correctly, the current AC_LANG_PROGRAM here will not fail if $host is mingw. But if we use your longer AC_LANG_PROGRAM from fingera@585a8c8, it will fail just like gcc.gnu.org/bugzilla/show_bug.cgi?id=79127 says?

    Yes. But it’s too long, I think disallowed is better

    The length of the AC_LANG_PROGRAM is completely fine, I believe we’d rather have accurate detection than broad-stroke disabling. Please rebase that branch over master and I’d be happy to take a look! :smile:

  21. add avx512 instrinsic b0e207ebab
  22. fix avx512 ShR bf15f2abcf
  23. mingw avx512 detection 6a76a13b89
  24. fingera force-pushed on May 30, 2019
  25. DrahtBot removed the label Needs rebase on May 30, 2019
  26. jamesob commented at 3:56 pm on May 30, 2019: member

    Did some additional profiling and it’s not clear to me that this change is worth pursuing.

    Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:

    [image: microbenches]

    However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:

    [image: ibd local range 500000 510000]

  27. fingera commented at 6:40 am on May 31, 2019: contributor

    Did some additional profiling and it’s not clear to me that this change is worth pursuing.

    Subsequent benching confirms that SHA-intensive microbenchmarks show improvement:

    [image: microbenches]

    However when sampling IBD performance (500_000 -> 510_000), there’s only a minor difference in mean wall clock time taken:

    [image: ibd local range 500000 510000]

    Because in this benchmark, merkle root

    Maybe most of the time is spent on disk I/O?

  28. in src/crypto/sha256_avx512.cpp:25 in 6a76a13b89 outdated
    20+__m512i inline Inc(__m512i& x, __m512i y, __m512i z, __m512i w) { x = Add(x, y, z, w); return x; }
    21+__m512i inline Xor(__m512i x, __m512i y) { return _mm512_xor_si512(x, y); }
    22+__m512i inline Xor(__m512i x, __m512i y, __m512i z) { return Xor(Xor(x, y), z); }
    23+__m512i inline Or(__m512i x, __m512i y) { return _mm512_or_si512(x, y); }
    24+__m512i inline And(__m512i x, __m512i y) { return _mm512_and_si512(x, y); }
    25+__m512i inline ShR(__m512i x, unsigned int n) { return _mm512_srli_epi32(x, n); }
    


    practicalswift commented at 6:23 pm on June 1, 2019:
    There is an implicit conversion from unsigned int to int here. Could be made explicit to make it easier to reason about correctness?

    fingera commented at 1:35 am on June 3, 2019:

    gcc and Intel: __m512i _mm512_srli_epi32 (__m512i a, unsigned int imm8)
    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm512_srli_epi32&expand=5515
    https://github.com/gcc-mirror/gcc/blob/master/gcc/config/i386/avx512fintrin.h

    clang: __m512i _mm512_srli_epi32(__m512i __A, int __B)
    https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html#l05168

    Define it as a macro instead? #define ShR _mm512_srli_epi32


    practicalswift commented at 6:30 am on June 3, 2019:

    Oh, got it. Thanks for clarifying.

    Also saw this old comment of mine: #13989 (review)

    Sorry for the confusion :-)
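One possible way to sidestep the int vs. unsigned int difference between the headers, offered only as a sketch and not something the PR does, is to make the shift count a template parameter. The count is then a small compile-time constant that converts without loss under either declaration. `ShR32` and `ShR32x16` are hypothetical names; the vector version is only compiled when AVX512F is enabled.

```cpp
#include <cstdint>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Scalar reference: logical right shift of a single 32-bit lane.
template <unsigned int N>
inline uint32_t ShR32(uint32_t x) {
    static_assert(N < 32, "shift count must be below the lane width");
    return x >> N;
}

#if defined(__AVX512F__)
// Vector version. N is range-checked at compile time, so passing it to a
// header that declares the count as int or as unsigned int cannot change
// its value.
template <unsigned int N>
inline __m512i ShR32x16(__m512i x) {
    static_assert(N < 32, "shift count must be below the lane width");
    return _mm512_srli_epi32(x, N);
}
#endif
```

The trade-off is that every call site must know its shift count at compile time, which holds for the fixed shift amounts SHA-256 uses.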

  29. fanquake requested review from dongcarl on Jun 24, 2019
  30. fanquake commented at 8:24 am on June 24, 2019: member
    @dongcarl Were you still interested in following up here?
  31. dongcarl changes_requested
  32. dongcarl commented at 7:36 pm on June 26, 2019: member
  33. disable sha512 when shani exists 90f282f4bf
  34. dongcarl commented at 10:06 pm on June 27, 2019: member

    lightly-tested ACK 90f282f4bfe8e052c674ff829239f6a4845d7397

    • Code changes look good. Checked that it matches all instances of what we do for AVX2.
    • Build system changes lightly tested on Linux machine w/o AVX512
      • Correctly enables AVX512 based on compiler compatibility, not system capability
      • Correctly disables AVX512 for gcc 4.8

    If someone has OSX/Windows machines, please test as well.

  35. fanquake commented at 6:21 am on June 29, 2019: member

    If someone has OSX/Windows machines, please test as well.

    @dongcarl Is there anything in particular you’d like tested? I can test on macOS.

  36. dongcarl commented at 6:51 pm on July 4, 2019: member

    @fanquake Mostly if the build system changes work correctly. It seems clang added support for AVX512 here: https://github.com/llvm-mirror/clang/commit/dab7845798d673cabeef792451354aeb394cdd54

    Not sure what version that was tho.

  37. fanquake commented at 4:17 am on July 5, 2019: member

    @dongcarl No worries.

    Tested on macOS with Clang. Looks like support for AVX512 arrived in LLVM in 3.9:

    clang --version
    Apple LLVM version 10.0.1 (clang-1001.0.46.4)
    Target: x86_64-apple-darwin18.6.0

    ./configure | grep -i -E 'AVX'
    checking whether C++ compiler accepts -mavx -mavx2... yes
    checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
    checking for AVX2 intrinsics... yes
    checking for AVX512 intrinsics... yes
    

    However my CPU does not support AVX512:

    sysctl -a | grep machdep.cpu.features
    AVX1.0
    sysctl -a | grep machdep.cpu.leaf7_features
    AVX2
    

    Tested inside a debian:jessie-slim Docker container that AVX512 is disabled for GCC 4.8 and enabled for GCC 4.9 (Intel AVX-512 support was added to GCC in 4.9).

    GCC 4.8 - gcc-4.8 (Debian 4.8.4-1) 4.8.4:

    ./configure CC=gcc-4.8 CXX=g++-4.8 | grep -i 'AVX'
    configure: WARNING: Doxygen not found
    checking whether C++ compiler accepts -mavx -mavx2... yes
    checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... no
    checking for AVX2 intrinsics... yes
    checking for AVX512 intrinsics... no
    

    GCC 4.9 - gcc (Debian 4.9.2-10+deb8u2) 4.9.2:

    ./configure --disable-wallet | grep -i 'AVX'
    configure: WARNING: Doxygen not found
    checking whether C++ compiler accepts -mavx -mavx2... yes
    checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
    checking for AVX2 intrinsics... yes
    checking for AVX512 intrinsics... yes
    

    Also tested with Clang inside the same box:

    Clang 3.5 Debian clang version 3.5.0-10 (tags/RELEASE_350/final) (based on LLVM 3.5.0):

    ./configure CC=clang-3.5 CXX=clang-3.5 | grep -i 'avx'
    configure: WARNING: Doxygen not found
    checking whether C++ compiler accepts -mavx -mavx2... yes
    checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
    checking for AVX2 intrinsics... yes
    checking for AVX512 intrinsics... no
    

    Clang 4.0 clang version 4.0.1-10~deb8u1 (tags/RELEASE_401/final):

    ./configure --disable-wallet CC=clang-4.0 CXX=clang-4.0 | grep -i 'avx'
    configure: WARNING: Doxygen not found
    checking whether C++ compiler accepts -mavx -mavx2... yes
    checking whether C++ compiler accepts -mavx -mavx2 -mavx512f... yes
    checking for AVX2 intrinsics... yes
    checking for AVX512 intrinsics... yes
    

    It’d be great to have @sipa , @gmaxwell or @theuni give this another look over.

  38. laanwj commented at 1:14 pm on July 5, 2019: member
    Concept ACK, I have no hardware to test this on.
  39. promag commented at 2:13 pm on July 5, 2019: member
    I also don’t have hardware to test, concept ACK.
  40. fingera commented at 1:37 am on July 6, 2019: contributor
    iMac Pro or Mac Pro :) Most cloud servers already support it: https://aws.amazon.com/ec2/instance-types/ Widespread desktop support may have to wait for the future :<
  41. in configure.ac:420 in 90f282f4bf
    412@@ -412,6 +413,87 @@ AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    413 )
    414 CXXFLAGS="$TEMP_CXXFLAGS"
    415 
    416+TEMP_CXXFLAGS="$CXXFLAGS"
    417+CXXFLAGS="$CXXFLAGS $AVX512_CXXFLAGS"
    418+AC_MSG_CHECKING(for AVX512 intrinsics)
    419+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
    420+    #include <stdint.h>
    


    theuni commented at 8:54 pm on July 8, 2019:
    No need to put the whole file here. Just check the intrinsics themselves.

    fingera commented at 1:33 am on July 9, 2019:
    A big program is needed to trigger the mingw GCC bug.
  42. in src/crypto/sha256_avx512.cpp:6 in 90f282f4bf
    0@@ -0,0 +1,350 @@
    1+#ifdef ENABLE_AVX512
    2+
    3+#include <stdint.h>
    4+#include <immintrin.h>
    5+
    6+#include <crypto/sha256.h>
    


    theuni commented at 9:02 pm on July 8, 2019:
    Why?

    fingera commented at 1:43 am on July 9, 2019:
    sha256.h does not seem to be used. Keep it for consistency with the other files (avx2, sse4)?
  43. theuni commented at 9:07 pm on July 8, 2019: member

    Concept ACK.

    Build system changes look good at a glance. I can’t test this either :(

  44. sipa commented at 9:52 pm on July 8, 2019: member

    Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?

    A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.
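A rough sketch of what such a startup benchmark could look like, under heavy assumptions: the candidate signature, buffer size, and round count are invented for illustration, and `HashFn`, `Candidate`, and `PickFastest` are hypothetical names rather than anything in the codebase.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the signature of a hashing implementation.
using HashFn = uint32_t (*)(const unsigned char* data, size_t len);

struct Candidate {
    const char* name;
    HashFn fn;
};

// Times every candidate on the same buffer and returns the index of the one
// with the smallest elapsed time. Expects a non-empty candidate list.
inline size_t PickFastest(const std::vector<Candidate>& candidates, size_t rounds = 200) {
    const std::vector<unsigned char> buf(4096, 0x5a);
    size_t best = 0;
    auto best_time = std::chrono::steady_clock::duration::max();
    volatile uint32_t sink = 0;  // keeps the calls from being optimized away
    for (size_t i = 0; i < candidates.size(); ++i) {
        const auto start = std::chrono::steady_clock::now();
        for (size_t r = 0; r < rounds; ++r) {
            sink = sink + candidates[i].fn(buf.data(), buf.size());
        }
        const auto elapsed = std::chrono::steady_clock::now() - start;
        if (elapsed < best_time) {
            best_time = elapsed;
            best = i;
        }
    }
    return best;
}
```

A short timing loop like this measures the implementations in isolation, so it would share the limitation of any microbenchmark: it says little about behavior when hashing is interleaved with other work.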

  45. fingera commented at 2:09 am on July 9, 2019: contributor

    Given the lack of common platforms where AVX512 occurs, and because IIRC on some of them it causes a clock speed reduction even for non-AVX512 instructions, it may not be wise to enable this by default at runtime. Does anyone have benchmarks that show otherwise?

    A possibility is turning the autodetection code into an actual benchmark that’s run at startup, finding which code is fastest for what empirically rather than guessing that X is always faster than Y if X is available.

    I think AVX512 will be faster on all platforms. AVX2 has frequency scaling too.

  46. sipa commented at 2:22 am on July 9, 2019: member
    @fingera Maybe, seeing benchmarks would certainly be more convincing. Given that executing AVX512 instructions (as far as I know) may slow down other instructions, it may even be the case that in a microbenchmark AVX512-based SHA256 code is a win, but in realistic load conditions where SHA256 operations are interleaved with other things, it is not.
  47. fingera commented at 2:25 am on July 9, 2019: contributor

    @sipa Yes, you are right: https://en.wikichip.org/wiki/intel/frequency_behavior

    I think this AVX512 intrinsic code runs in the AVX2 heavy mode.

    Maybe bit scanning and multiplication push the CPU into heavy mode? I think bit rotation (Or, And) is light mode (non-AVX2?).

  48. fingera commented at 4:45 am on July 9, 2019: contributor

    https://github.com/travisdowns/avx-turbo @sipa looks good

    CPUID highest leaf  : [ dh]
    Running as root     : [YES]
    MSR reads supported : [NO ]
    CPU pinning enabled : [YES]
    CPU supports AVX2   : [YES]
    CPU supports AVX-512: [YES]
    CPUID doesn't support leaf 0x15, falling back to manual TSC calibration.
    tsc_freq = 2499.9 MHz (from calibration loop)
    CPU brand string: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
    1 available CPUs: [0]
    Can't use cpuid leaf 0xb to filter out hyperthreads, CPU too old or AMD
    1 physical cores: [0]
    Will test up to 1 CPUs
    Cores | ID                  | Description                     | OVRLP1 | OVRLP2 | OVRLP3 | Mops
    1     | pause_only          | pause instruction               |  1.000 |  1.000 | 1.000  | 19260
    1     | ucomis              | SSE scalar ucomis loop          |  1.000 |  1.000 | 1.000  |  860
    1     | ucomis_vex          | VEX scalar ucomis loop          |  1.000 |  1.000 | 1.000  |  546
    1     | scalar_iadd         | Scalar integer adds             |  1.000 |  1.000 | 1.000  | 2700
    1     | avx128_iadd         | 128-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2699
    1     | avx256_iadd         | 256-bit integer serial adds     |  1.000 |  1.000 | 1.000  | 2700
    1     | avx512_iadd         | 512-bit integer adds            |  1.000 |  1.000 | 1.000  | 2699
    1     | avx128_iadd_t       | 128-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8100
    1     | avx256_iadd_t       | 256-bit integer parallel adds   |  1.000 |  1.000 | 1.000  | 8099
    1     | avx128_mov_sparse   | 128-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
    1     | avx256_mov_sparse   | 256-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
    1     | avx512_mov_sparse   | 512-bit reg-reg mov             |  1.000 |  1.000 | 1.000  | 2700
    1     | avx128_merge_sparse | 128-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
    1     | avx256_merge_sparse | 256-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
    1     | avx512_merge_sparse | 512-bit reg-reg merge mov       |  1.000 |  1.000 | 1.000  | 2700
    1     | avx128_vshift       | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
    1     | avx256_vshift       | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
    1     | avx512_vshift       | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
    1     | avx128_vshift_t     | 128-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5398
    1     | avx256_vshift_t     | 256-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 5398
    1     | avx512_vshift_t     | 512-bit variable shift (vpsrld) |  1.000 |  1.000 | 1.000  | 2699
    1     | avx128_imul         | 128-bit integer muls            |  1.000 |  1.000 | 1.000  |  540
    1     | avx256_imul         | 256-bit integer muls            |  1.000 |  1.000 | 1.000  |  540
    1     | avx512_imul         | 512-bit integer muls            |  1.000 |  1.000 | 1.000  |  493
    1     | avx128_fma_sparse   | 128-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
    1     | avx256_fma_sparse   | 256-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
    1     | avx512_fma_sparse   | 512-bit 64-bit sparse FMAs      |  1.000 |  1.000 | 1.000  | 2700
    1     | avx128_fma          | 128-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
    1     | avx256_fma          | 256-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
    1     | avx512_fma          | 512-bit serial DP FMAs          |  1.000 |  1.000 | 1.000  |  675
    1     | avx128_fma_t        | 128-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 5398
    1     | avx256_fma_t        | 256-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 5398
    1     | avx512_fma_t        | 512-bit parallel DP FMAs        |  1.000 |  1.000 | 1.000  | 4598
    1     | avx512_vpermw       | 512-bit serial WORD permute     |  1.000 |  1.000 | 1.000  |  450
    1     | avx512_vpermw_t     | 512-bit parallel WORD permute   |  1.000 |  1.000 | 1.000  | 1350
    1     | avx512_vpermd       | 512-bit serial DWORD permute    |  1.000 |  1.000 | 1.000  |  900
    1     | avx512_vpermd_t     | 512-bit parallel DWORD permute  |  1.000 |  1.000 | 1.000  | 2699
    
  49. fingera commented at 5:09 am on July 9, 2019: contributor
    I added more instructions in https://github.com/fingera/avx-turbo. avx512_vshift_t speed: 50%; avx512_imul speed: 90%. Executing complex instructions lowers the upper and lower limits of the CPU clock while they run. I don’t think we use these instructions.
  50. dongcarl commented at 7:30 pm on August 5, 2019: member
    @fingera Could you show benchmarks of the instructions we’re using? I’m most interested in the epi32 instructions.
  51. fanquake added the label Up for grabs on Feb 25, 2020
  52. fanquake commented at 2:20 am on February 25, 2020: member
    We’re still waiting on benchmarks, PR comments need addressing, and the user that opened this seems to have disappeared from GitHub. Going to mark as up for grabs and close for now.
  53. fanquake closed this on Feb 25, 2020

  54. DrahtBot locked this on Feb 15, 2022
