ARMv8 SHA2 Intrinsics #24115

pull prusnak wants to merge 4 commits into bitcoin:master from prusnak:armv8-shani changing 5 files +1012 −30
  1. prusnak commented at 6:17 pm on January 20, 2022: contributor

    This PR adds support for ARMv8 SHA2 Intrinsics.

    Fixes #13401 and #17414

  2. prusnak force-pushed on Jan 20, 2022
  3. prusnak marked this as a draft on Jan 20, 2022
  4. laanwj commented at 6:55 pm on January 20, 2022: member

    Concept ACK!

    detection when the feature can be used

    On Linux (the only system we care about for ARM, i guess), the following would be the way to do detection:

    #include <sys/auxv.h>
    #include <asm/hwcap.h>

    #ifdef __arm__
    /* ARM 32 bit */
    if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
        have_arm_shani = true;
    }
    #endif
    #ifdef __aarch64__
    /* ARM 64 bit */
    if (getauxval(AT_HWCAP) & HWCAP_SHA2) {
        have_arm_shani = true;
    }
    #endif
    

    Note that the capability bit is on a different HWCAP word on 32 and 64 bit (dunno if you even want to support 32 bit here).

  5. laanwj added the label Utils/log/libs on Jan 20, 2022
  6. prusnak force-pushed on Jan 20, 2022
  7. prusnak commented at 7:26 pm on January 20, 2022: contributor

    the following would be the way to do detection:

    Added in f7dd1ef

  8. prusnak force-pushed on Jan 20, 2022
  9. prusnak force-pushed on Jan 20, 2022
  10. sipa commented at 8:43 pm on January 20, 2022: member

    On commit f7dd1efae715593f5c9ff8186d518d25d1c9023c

    On a Linux aarch64 Cortex-A53 system with:

    $ cat /proc/cpuinfo
    processor       : 0
    BogoMIPS        : 200.00
    Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
    CPU implementer : 0x41
    CPU architecture: 8
    CPU variant     : 0x0
    CPU part        : 0xd03
    CPU revision    : 4
    

    which I presume means it has the necessary SHA2 extensions.

    The GCC 9.3.0 compiler used supports the extensions (crypto/libbitcoin_crypto_arm_shani.a is being built):

    checking for x86 SHA-NI intrinsics... no
    checking whether C++ compiler accepts -march=armv8-a+crc+crypto... yes
    checking whether C++ compiler accepts -march=armv8-a+crc+crypto... (cached) yes
    checking for AArch64 CRC32 intrinsics... yes
    checking for AArch64 SHA-NI intrinsics... yes
    

    Still, the extension doesn’t seem to be detected. debug.log says:

    2022-01-20T20:37:05Z Using the 'standard' SHA256 implementation
    
  11. prusnak force-pushed on Jan 20, 2022
  12. prusnak commented at 9:42 pm on January 20, 2022: contributor
    @sipa should be fixed in c0849fc4fd9aa7e70973445a1b962a9270fd859c
  13. sipa commented at 10:16 pm on January 20, 2022: member

    On c0849fc4fd9aa7e70973445a1b962a9270fd859c:

    2022-01-20T22:15:44Z Using the 'arm_shani(1way)' SHA256 implementation
    
  14. prusnak marked this as ready for review on Jan 20, 2022
  15. sipa commented at 11:01 pm on January 20, 2022: member

    This PR (c0849fc4fd9aa7e70973445a1b962a9270fd859c):

    ns/byte             byte/s   err%   total   benchmark
       2.56     390,886,099.70   0.9%    0.03   SHA256
      10.27      97,329,584.86   0.0%    0.01   SHA256D64_1024
      14.60      68,478,670.64   0.4%    0.01   SHA256_32b

    On master (e3ce019667fba2ec50a59814a26566fb67fa9125):

    ns/byte             byte/s   err%   total   benchmark
      15.69      63,715,155.54   0.0%    0.17   SHA256
      43.42      23,029,615.98   0.5%    0.03   SHA256D64_1024
      41.67      23,995,549.21   0.1%    0.01   SHA256_32b
  16. PastaPastaPasta commented at 5:08 am on January 21, 2022: contributor

    On Linux (the only system we care about for ARM, i guess)

    M1 macs would like a word with you.

  17. PastaPastaPasta commented at 5:31 am on January 21, 2022: contributor

    Speaking of m1, I was able to compile this locally on my m1 pro 10 core, ./configure realized that SHA2 intrinsics could be used. See benchmarks below.

    on https://github.com/bitcoin/bitcoin/commit/c0849fc4fd9aa7e70973445a1b962a9270fd859c

    ns/byte             byte/s   err%   total   benchmark
       0.47   2,148,996,364.97   0.3%    0.01   SHA256
       1.48     676,354,727.08   0.3%    0.01   SHA256D64_1024
       1.15     873,060,380.08   0.1%    0.01   SHA256_32b

    on master

    ns/byte             byte/s   err%   total   benchmark
       3.10     322,550,263.01   1.3%    0.03   SHA256
       9.26     107,941,088.30   0.7%    0.01   SHA256D64_1024
       6.22     160,743,377.26   1.6%    0.01   SHA256_32b
  18. prusnak commented at 10:22 am on January 21, 2022: contributor

    Speaking of m1, I was able to compile this locally on my m1 pro 10 core, ./configure realized that SHA2 intrinsics could be used. See benchmarks below.

    Yes, support for Apple Silicon is included in this PR.

  19. in src/crypto/sha256_arm_shani.cpp:50 in d47c5fb566 outdated
    45+
    46+    // Load state
    47+    STATE0 = vld1q_u32(&s[0]);
    48+    STATE1 = vld1q_u32(&s[4]);
    49+
    50+    const uint8x16_t* input32 = reinterpret_cast<const uint8x16_t*>(chunk);
    


    sipa commented at 5:38 pm on January 21, 2022:

    I’m a bit concerned this is not well-defined. We can’t guarantee that chunk has an alignment that is compatible with the uint8x16_t type, and even if it is, whether such a reinterpretation is permitted would be highly architecture-specific (of course, this is already architecture-specific code). Do you know of documentation that specifically permits this?

    If not, I’d use this patch:

    diff --git a/src/crypto/sha256_arm_shani.cpp b/src/crypto/sha256_arm_shani.cpp
    index c051d87042..a783be9068 100644
    --- a/src/crypto/sha256_arm_shani.cpp
    +++ b/src/crypto/sha256_arm_shani.cpp
    @@ -47,8 +47,6 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
         STATE0 = vld1q_u32(&s[0]);
         STATE1 = vld1q_u32(&s[4]);
     
    -    const uint8x16_t* input32 = reinterpret_cast<const uint8x16_t*>(chunk);
    -
         while (blocks--)
         {
             // Save state
    @@ -56,10 +54,14 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
             CDGH_SAVE = STATE1;
     
             // Load and convert input chunk to Big Endian
    -        MSG0 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
    -        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
    -        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
    -        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
    +        MSG0 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 0)));
    +        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 16)));
    +        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 32)));
    +        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48)));
    +        chunk += 64;
     
             // Original implemenation preloaded message and constant addition which was 1-3% slower.
             // Now included as first step in quad round code saving one Q Neon register
    

    prusnak commented at 6:08 pm on January 21, 2022:

    Better safe than sorry - applied in f06f46c

    I do not see any performance hit with this change.
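
    As a portable illustration of the concern raised above (a sketch, not code from this PR): dereferencing a pointer obtained by casting the byte buffer relies on alignment and aliasing guarantees the buffer may not have, whereas an explicit load does not. The patch uses vld1q_u8 for this; the same idea in plain C++ is to copy the bytes out and assemble the big-endian words by hand. LoadBE32 and LoadBlockBE are hypothetical helper names introduced only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: read one big-endian 32-bit word from a byte pointer
// of any alignment. memcpy avoids dereferencing a miscast pointer.
inline uint32_t LoadBE32(const unsigned char* p)
{
    unsigned char b[4];
    std::memcpy(b, p, sizeof(b));
    return (uint32_t{b[0]} << 24) | (uint32_t{b[1]} << 16) |
           (uint32_t{b[2]} << 8) | uint32_t{b[3]};
}

// Hypothetical helper: load one 64-byte SHA-256 block as sixteen big-endian
// message words, analogous to the four vld1q_u8 + vrev32q_u8 steps in the patch.
inline std::array<uint32_t, 16> LoadBlockBE(const unsigned char* chunk)
{
    assert(chunk != nullptr);
    std::array<uint32_t, 16> w{};
    for (std::size_t i = 0; i < w.size(); ++i) {
        w[i] = LoadBE32(chunk + 4 * i);
    }
    return w;
}
```

    Calling LoadBlockBE on a deliberately odd address (for example buf + 1) is well-defined here, which is the property the reinterpret_cast version could not promise.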

  20. hebasto commented at 10:45 pm on January 21, 2022: member

    Tested c0849fc4fd9aa7e70973445a1b962a9270fd859c on Mac mini (M1, 2020):

    % time ./src/bitcoind -datadir=/Users/hebasto/SHANI -assumevalid=0 -stopatheight=719700 -prune=550
    2022-01-21T07:44:20Z Bitcoin Core version v22.99.0-c0849fc4fd9a (release build)
    2022-01-21T07:44:20Z Validating signatures for all blocks.
    2022-01-21T07:44:20Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
    2022-01-21T07:44:20Z Prune configured to target 550 MiB on disk for block and undo files.
    2022-01-21T07:44:20Z Using the 'arm_shani(1way)' SHA256 implementation
    ...
    2022-01-21T22:38:17Z Shutdown: done
    ./src/bitcoind -datadir=/Users/hebasto/SHANI -assumevalid=0  -prune=550  149587.28s user 11456.17s system 300% cpu 14:53:56.52 total
    

    UPDATE. The same for the master branch (e3ce019667fba2ec50a59814a26566fb67fa9125):

    % time ./src/bitcoind -datadir=/Users/hebasto/MASTER -assumevalid=0 -stopatheight=719700 -prune=550
    2022-01-21T22:49:25Z Bitcoin Core version v22.99.0-e3ce019667fb (release build)
    2022-01-21T22:49:25Z Validating signatures for all blocks.
    2022-01-21T22:49:25Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
    2022-01-21T22:49:25Z Prune configured to target 550 MiB on disk for block and undo files.
    2022-01-21T22:49:25Z Using the 'standard' SHA256 implementation
    ...
    2022-01-22T14:37:08Z Shutdown: done
    ./src/bitcoind -datadir=/Users/hebasto/MASTER -assumevalid=0  -prune=550  174110.07s user 11526.30s system 326% cpu 15:47:43.83 total
    

    54 min or 6% faster IBD.

  21. sipa commented at 11:50 pm on January 21, 2022: member

    See https://github.com/sipa/bitcoin/commits/pr24115, which adds a 2-way 64-byte optimized variant. On my Cortex-A53 it’s roughly a 2x speedup for the SHA256D64_1024 benchmark (relevant for Merkle root computation) compared to this PR. For more modern architectures I could imagine it’s more:

    ns/byte             byte/s   err%   total   benchmark
       2.60     384,105,263.28   0.3%    0.03   SHA256
       5.35     187,019,153.94   0.1%    0.01   SHA256D64_1024
      14.61      68,437,280.69   0.0%    0.01   SHA256_32b

    For reference, master again:

    ns/byte             byte/s   err%   total   benchmark
      15.69      63,715,155.54   0.0%    0.17   SHA256
      43.42      23,029,615.98   0.5%    0.03   SHA256D64_1024
      41.67      23,995,549.21   0.1%    0.01   SHA256_32b
  22. PastaPastaPasta commented at 5:04 am on January 22, 2022: contributor

    @sipa’s branch on m1:

    ns/byte             byte/s   err%   total   benchmark
       0.46   2,174,603,243.64   0.6%    0.01   SHA256
       0.95   1,053,985,898.82   0.8%    0.01   SHA256D64_1024
       1.15     871,857,965.44   0.4%    0.01   SHA256_32b

    previous results on c0849

    ns/byte             byte/s   err%   total   benchmark
       0.47   2,148,996,364.97   0.3%    0.01   SHA256
       1.48     676,354,727.08   0.3%    0.01   SHA256D64_1024
       1.15     873,060,380.08   0.1%    0.01   SHA256_32b
  23. in src/crypto/sha256.cpp:654 in f06f46cd5e outdated
    639+#ifdef __linux__
    640+#ifdef __arm__ // 32-bit
    641+    if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
    642+        have_arm_shani = true;
    643+    }
    644+#endif
    


    PastaPastaPasta commented at 6:10 am on January 22, 2022:

    I’m not sure if bitcoin core has a preference documented somewhere, but I prefer documenting what the endif is associated with such as

    #endif // __arm__
    

    prusnak commented at 8:44 am on January 22, 2022:
    I checked all the other files in src/crypto and we use the endif comments only for the include guards.
  24. prusnak commented at 11:55 am on January 22, 2022: contributor

    I confirm the numbers on M1:

    before @sipa’s improvements (f06f46cd5e2d732669e70fbfeacd7bf9b9ac8815):

    ns/byte             byte/s   err%   total   benchmark
       0.45   2,198,307,083.71   0.3%    0.01   SHA256
       1.49     671,303,457.11   0.3%    0.01   SHA256D64_1024
       1.15     866,670,334.07   0.2%    0.01   SHA256_32b

    after @sipa’s improvements (0e7299508af17014a906a951ad9b4ecfcc891bb1):

    ns/byte             byte/s   err%   total   benchmark
       0.46   2,197,198,571.82   0.3%    0.01   SHA256
       0.94   1,059,405,097.65   0.3%    0.01   SHA256D64_1024
       1.17     857,291,152.82   0.5%    0.01   SHA256_32b
  25. prusnak commented at 11:56 am on January 22, 2022: contributor

    I confirm the numbers on M1:

    I merged @sipa’s improvements into this branch => 0e7299508af17014a906a951ad9b4ecfcc891bb1 ❤️

  26. PastaPastaPasta commented at 2:11 pm on January 22, 2022: contributor

    I’m not able to build this branch on m1 at the moment

    config.status: creating libbitcoinconsensus.pc
    config.status: creating Makefile
    config.status: creating src/Makefile
    config.status: creating doc/man/Makefile
    config.status: creating share/setup.nsi
    config.status: creating share/qt/Info.plist
    config.status: creating test/config.ini
    config.status: creating contrib/devtools/split-debug.sh
    config.status: creating src/config/bitcoin-config.h
    config.status: src/config/bitcoin-config.h is unchanged
    config.status: executing depfiles commands
    config.status: executing libtool commands
     cd . && /bin/sh /Users/pasta/workspace/bitcoin/build-aux/missing automake-1.16 --foreign
    /bin/sh: /Users/pasta/workspace/bitcoin/build-aux/missing: No such file or directory
    make: *** [Makefile.in] Error 1
    

    I just checked out sipa’s branch here: https://github.com/sipa/bitcoin/commit/0e7299508af17014a906a951ad9b4ecfcc891bb1 and compilation worked trivially

  27. sipa commented at 2:30 pm on January 22, 2022: member
    @prusnak @PastaPastaPasta Perhaps you want to also benchmark with the two last commits removed (so at “Optimization: precompute a few 3rd transform intermediaries”). Whether the last two help may be very architecture-dependent. For me they contribute a ~30% speedup, but maybe on M1 that is not the case.
  28. prusnak commented at 3:04 pm on January 22, 2022: contributor

    @sipa benchmark of 38ed75f8b9fdb60ab45d07f3d4069a9c9cb636d4 Optimization: precompute a few 3rd transform intermediaries on M1:

    ns/byte             byte/s   err%   total   benchmark
       0.46   2,163,137,327.85   0.0%    0.01   SHA256
       1.28     780,225,941.01   0.3%    0.01   SHA256D64_1024
       1.18     850,467,547.93   0.2%    0.01   SHA256_32b

    The improvement of using 0e72995 is there also for M1.

  29. sipa commented at 3:06 pm on January 22, 2022: member
    Looks like the 2-way version is a clear win on M1 as well, thanks!
  30. prusnak force-pushed on Jan 22, 2022
  31. hebasto commented at 9:36 am on January 23, 2022: member

    Approach ACK 1cfacecf8203f775f5827fa225a6dccf7a28cdef.

    It seems commits need some cleanup (reorder? squash?) for the nice commit history, no?

  32. prusnak commented at 9:38 am on January 23, 2022: contributor

    It seems commits need some cleanup (reorder? squash?) for the nice commit history, no?

    I will squash when @sipa confirms he’s satisfied with the result and won’t be doing any further changes. @sipa are you OK with me doing the rebase/squash?

  33. sipa commented at 2:17 pm on January 23, 2022: member
    Feel free to squash/rebase wherever you feel it is useful.
  34. prusnak force-pushed on Jan 23, 2022
  35. prusnak commented at 2:23 pm on January 23, 2022: contributor

    Feel free to squash/rebase wherever you feel it is useful.

    Thanks! Squashed and rebased => 1f140783e1fbf818d4a6045464db3903af426bd0

  36. in src/crypto/sha256_arm_shani.cpp:1 in 1f140783e1 outdated
    0@@ -0,0 +1,898 @@
    1+// Copyright (c) 2018-2020 The Bitcoin Core developers
    


    PastaPastaPasta commented at 4:14 pm on January 23, 2022:
    was this moved from somewhere? Otherwise it should probably be updated

    prusnak commented at 5:01 pm on January 23, 2022:
    Addressed in the latest rebase
  37. in src/crypto/sha256_arm_shani.cpp:19 in 1f140783e1 outdated
    14+#include <cstddef>
    15+#include <arm_acle.h>
    16+#include <arm_neon.h>
    17+
    18+namespace {
    19+static const uint32_t K[64] =
    


    PastaPastaPasta commented at 4:16 pm on January 23, 2022:
    nit: I’d prefer this being a constexpr std::array

    prusnak commented at 5:01 pm on January 23, 2022:
    Addressed in the latest rebase
  38. in src/crypto/sha256_arm_shani.cpp:203 in 1f140783e1 outdated
    198+
    199+namespace sha256d64_arm_shani {
    200+void Transform_2way(unsigned char* output, const unsigned char* input)
    201+{
    202+    /* Initial state. */
    203+    static const uint32_t INIT[8] = {
    


    PastaPastaPasta commented at 4:18 pm on January 23, 2022:
    nit: same, constexpr std::array, and below

    prusnak commented at 5:01 pm on January 23, 2022:
    Addressed in the latest rebase
  39. prusnak force-pushed on Jan 23, 2022
  40. in src/crypto/sha256.cpp:657 in 68dbb6775c outdated
    652+        have_arm_shani = true;
    653+    }
    654+#endif
    655+#endif
    656+
    657+#if defined(__APPLE__)
    


    hebasto commented at 5:57 pm on January 23, 2022:

    a6ae9ef1626c73253aa8e6c8e406f3ee18b8017f

    style nit: #ifdef __linux__ vs #ifdef __aarch64__ vs #if defined(__APPLE__) looks inconsistent. Mind choosing one style for new code?


    prusnak commented at 6:09 pm on January 23, 2022:
    Addressed in 9560981b0dda342b79e25067132dcb37e8d9e27b
  41. hebasto commented at 5:59 pm on January 23, 2022: member
    ACK a6ae9ef1626c73253aa8e6c8e406f3ee18b8017f (first two commits, won’t ACK code change in /src/crypto/ because of lack of my expertise).
  42. prusnak force-pushed on Jan 23, 2022
  43. hebasto commented at 6:14 pm on January 23, 2022: member
    re-ACK 9560981b0dda342b79e25067132dcb37e8d9e27b (first two commits only).
  44. in src/crypto/sha256_arm_shani.cpp:20 in 5202c6dd76 outdated
    15+#include <cstddef>
    16+#include <arm_acle.h>
    17+#include <arm_neon.h>
    18+
    19+namespace {
    20+static constexpr std::array<uint32_t, 64> K =
    


    sipa commented at 9:33 pm on January 23, 2022:
    May want to put an alignas(uint32x4_t) on this and other constants that are read using vld1q_u32 (better alignment may matter on some platforms).

    prusnak commented at 10:32 pm on January 23, 2022:
    Addressed in fe180d53c9035dfcec4211182ac954bd95ccaa60 + 195eb2fbe3754efe2ba31fbadb0fa77e801ce8bf
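
    A minimal sketch of the alignment suggestion above, written in portable C++ so it compiles without <arm_neon.h>. It assumes a 128-bit NEON vector such as uint32x4_t is 16 bytes wide; on ARM one would write alignas(uint32x4_t) as suggested. K_HEAD, kVectorAlign, IsVectorAligned and Row are hypothetical names, and the table holds only the first eight round constants to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a 128-bit vector type such as uint32x4_t occupies 16 bytes.
constexpr std::size_t kVectorAlign = 16;

// Hypothetical table: the first eight SHA-256 round constants, aligned so
// that each group of four words starts on a vector boundary.
alignas(kVectorAlign) static constexpr std::array<uint32_t, 8> K_HEAD = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
};

// Check an address against the assumed vector alignment at run time.
inline bool IsVectorAligned(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % kVectorAlign == 0;
}

// Hypothetical accessor: pointer to the i-th group of four words, i.e. the
// address a 128-bit vector load would read from.
inline const uint32_t* Row(std::size_t i)
{
    const uint32_t* p = K_HEAD.data() + 4 * i;
    assert(IsVectorAligned(p));
    return p;
}
```

    Because four uint32_t values are exactly 16 bytes, aligning the start of the table is enough to align every subsequent group of four.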
  45. prusnak force-pushed on Jan 23, 2022
  46. hebasto commented at 10:45 am on January 24, 2022: member

    Tested 5202c6dd7695f66325638e15861468999ec834c1 on Mac mini (M1, 2020):

    % time ./src/bitcoind -datadir=/Users/hebasto/SHANI-2 -assumevalid=0 -stopatheight=719700 -prune=550
    2022-01-23T18:25:19Z Bitcoin Core version v22.99.0-5202c6dd7695 (release build)
    2022-01-23T18:25:19Z Validating signatures for all blocks.
    2022-01-23T18:25:19Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
    2022-01-23T18:25:19Z Prune configured to target 550 MiB on disk for block and undo files.
    2022-01-23T18:25:19Z Using the 'arm_shani(1way,2way)' SHA256 implementation
    ...
    2022-01-24T08:16:46Z Shutdown: done
    ./src/bitcoind -datadir=/Users/hebasto/SHANI-2 -assumevalid=0  -prune=550  147477.18s user 11980.05s system 319% cpu 13:51:28.21 total
    

    117 min or 12% faster IBD than the master branch. Also see #24115 (comment).
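
    For readers who want to re-derive such figures from the `time` output, the arithmetic can be sketched as below. Seconds and Saved are hypothetical helper names; the inputs are the wall-clock totals quoted in the logs in this thread.

```cpp
#include <cassert>

// Convert an h:m:s wall-clock reading (as printed by `time`) into seconds.
constexpr double Seconds(int h, int m, double s)
{
    return h * 3600.0 + m * 60.0 + s;
}

// Fraction of the baseline wall-clock time saved by the candidate run.
constexpr double Saved(double baseline, double candidate)
{
    return (baseline - candidate) / baseline;
}
```

    With the master run at 15:47:43.83 and this run at 13:51:28.21, (Seconds(15, 47, 43.83) - Seconds(13, 51, 28.21)) / 60 comes out at roughly 116 minutes and Saved(...) at about 0.12.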

  47. in src/crypto/sha256.cpp:657 in 195eb2fbe3 outdated
    652+        have_arm_shani = true;
    653+    }
    654+#endif
    655+#endif
    656+
    657+#if defined(__APPLE__)
    


    fanquake commented at 7:36 am on January 25, 2022:
    Can you use MAC_OSX here to match our other macOS defines.

    prusnak commented at 8:56 am on January 25, 2022:
    Addressed in 3af2f3f5024a43bacceff4aabfd108a8cd971d1f
  48. prusnak force-pushed on Jan 25, 2022
  49. PastaPastaPasta approved
  50. PastaPastaPasta commented at 2:17 pm on January 25, 2022: contributor
    ACK 4abca94b1e12792e821c2169ff7bee05f35ca284
  51. mutatrum commented at 2:44 pm on January 25, 2022: contributor

    Compiled and benchmarked on two aarch64 SBC:

    Rock Pi 4 (RK3399, Dual Cortex-A72 and Quad Cortex-A53):

    master bd482b3

    ns/byte             byte/s   err%   ins/byte   total   benchmark
       6.64     150,596,444.75   0.0%      30.02    0.07   SHA256
      18.89      52,942,682.86   0.0%      82.25    0.01   SHA256D64_1024
      15.95      62,706,213.16   0.0%      70.72    0.01   SHA256_32b

    4abca94

    ns/byte             byte/s   err%   ins/byte   total   benchmark
       1.56     642,517,847.54   0.9%       1.75    0.02   SHA256
       2.70     370,040,371.53   0.0%       4.38    0.01   SHA256D64_1024
       5.13     194,811,160.90   1.2%      13.59    0.01   SHA256_32b

    Firefly RK3566-ROC-PC (Quad-core Cortex-A55)

    master bd482b3

    ns/byte             byte/s   err%   ins/byte   bra/byte   miss%   total   benchmark
       9.06     110,429,040.01   0.1%      30.02       0.02    0.1%    0.10   SHA256
      24.94      40,094,485.36   0.9%      82.25       0.05    0.3%    0.02   SHA256D64_1024
      22.83      43,792,465.97   0.1%      70.72       1.09    0.0%    0.01   SHA256_32b

    4abca94

    ns/byte             byte/s   err%   ins/byte   bra/byte   miss%   total   benchmark
       1.49     672,648,756.27   0.5%       1.75       0.02    0.1%    0.02   SHA256
       3.10     322,551,749.73   1.1%       4.38       0.02    0.2%    0.01   SHA256D64_1024
       7.53     132,832,412.14   0.0%      13.59       1.09    0.0%    0.01   SHA256_32b

    ACK

  52. mutatrum approved
  53. prusnak commented at 7:17 pm on January 26, 2022: contributor
    May I propose this to be added to 23.0 Milestone?
  54. fanquake added this to the milestone 23.0 on Jan 27, 2022
  55. mutatrum commented at 9:31 am on January 27, 2022: contributor

    Two questions: the machine that builds the aarch64 binaries, does that have these extensions? Or in other words, if this is merged, will the aarch64 binaries have this included?

    And has someone tested the case of running this build on a device without the crypto extensions, e.g. a Raspberry Pi?

  56. sipa commented at 0:04 am on January 28, 2022: member

    ACK 3abdb593618e62d2c0f0bcc29e19bbadbf31e84c. I have not reviewed whether the macOS feature detection works. Also, some of the code is my own.

    I did verify that the x86 SHA-NI keeps working (benchmark + debug.log on my Ryzen 5950X system indicate the SHA instructions are being used). I’ve also verified the instruction sequence, and reviewed the constants used cursorily. I conjecture that it is impossible to create anything resembling the SHA-2 transformation’s flow using these instructions that passes unit tests in a non-trivial test set, but is still wrong in others. Simply because these instructions are so high-level, any modification would greatly affect the outcome of nearly every input.

  57. sipa commented at 0:07 am on January 28, 2022: member
    @mutatrum I believe we use GCC 10 in the current guix build environment, so yes, the binaries would have these extensions. But let’s trigger a build to verify. Good idea to check that things still keep working on RPi.
  58. sipa added the label DrahtBot Guix build requested on Jan 28, 2022
  59. fanquake commented at 6:40 am on January 28, 2022: member
    @prusnak Could you rebase this on master now that support for arm64 Darwin has been added to Guix (#21851)? Then we can get some Guix builds done, test the binaries etc.
  60. Rename SHANI to X86_SHANI to allow future implementation of ARM_SHANI c2b7934250
  61. Add sha256_arm_shani to build system
    Also rename AArch64 intrinsics to ARMv8 intrinsics
    as these are not necessarily limited to 64-bit
    48a72fa81f
  62. Implement sha256_arm_shani::Transform
    Co-Authored-By: Rauli Kumpulainen <rauliweb@gmail.com>
    Co-Authored-By: Pieter Wuille <pieter@wuille.net>
    fe0629852a
  63. Add optimized sha256d64_arm_shani::Transform_2way aaa1d03d3a
  64. prusnak commented at 8:45 am on January 28, 2022: contributor
    @fanquake rebased on top of current master
  65. prusnak force-pushed on Jan 28, 2022
  66. hebasto commented at 10:43 pm on January 28, 2022: member

    Guix builds:

    $ find guix-build-$(git rev-parse --short=12 HEAD)/output/ -type f -print0 | env LC_ALL=C sort -z | xargs -r0 sha256sum
    4a309ef27036065f787330a50659c85323f78f7b5d3c69a79e3eca232f4d3e55  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/SHA256SUMS.part
    cfedbd51f5bf65d57fe9200e22e56f3a38f44308eea565b201483811996d957b  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/bitcoin-aaa1d03d3ace-aarch64-linux-gnu-debug.tar.gz
    85c9195cf594fbbf3ce6c6b76370f746dd4342c793b54b8b5bdb650a53761eaf  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/bitcoin-aaa1d03d3ace-aarch64-linux-gnu.tar.gz
    280234b6a20bbcddb847e0bbaf27f35ac30d292ea5dd096f8c8096220f8ef29a  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/SHA256SUMS.part
    48c0fd0a6a4c1e86e942f6907804ed87527f329954e81e5bdc22beed83cbe6b3  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/bitcoin-aaa1d03d3ace-arm-linux-gnueabihf-debug.tar.gz
    c435b7a53606f33f9d23f28e6c98ce3d55f50f61fd67bf2b4ebe65261fe68aba  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/bitcoin-aaa1d03d3ace-arm-linux-gnueabihf.tar.gz
    e0d61af7b471ba5135a848be6d4d6cf0caa8042dab9d929f6578de17e47e40c2  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/SHA256SUMS.part
    e9eda618bf90d5d1522005bd13bd296ae89aec9c3b0b2528b9c9f67e70178828  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-arm64-apple-darwin.tar.gz
    a6540dcea7c2a2562edd0f2232f511da97339d388cdca686e084d03e381d6fa2  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.dmg
    6d970ae7c94b78c5f97e978718862e85c3375c48d3b8ade6fafce39e7c8e0f0e  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.tar.gz
    cdbc9eb5281b14ecbecc2956a5a41709fda93762115ce3cee9516a68874676b8  guix-build-aaa1d03d3ace/output/dist-archive/bitcoin-aaa1d03d3ace.tar.gz
    d063099449e40036d15ed5b023f602ba824edf303ecbe78bcd3f01feeabb535f  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/SHA256SUMS.part
    3e6da4026d466039cd361e48a492ea780021bc0efb8e67e948097871ce4226cc  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64-linux-gnu-debug.tar.gz
    669291a4767509f053bdd8789f4631ff6d28cd4ad62156105bc98cdb7d3a295a  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64-linux-gnu.tar.gz
    868cbe0f73cd786d91dd6d1990550e6c9c673ef3d874c545150b6e030951a24d  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/SHA256SUMS.part
    90157bdc29262abd421535e4d6921e72aff7f3d43ca5634bc76598d8daf3a1ec  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64le-linux-gnu-debug.tar.gz
    8275665e19f85193de4e7e5ee7b451d6a4f8e414a33586e7066575812e878eda  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64le-linux-gnu.tar.gz
    2723b48d06e8217adb41bcc640417cba3470b234cce7815104e878866d775046  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/SHA256SUMS.part
    f63822813587ec9e4bcc044c4b7918b1330d6b16be09f28bd95fedfd3dcdb147  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/bitcoin-aaa1d03d3ace-riscv64-linux-gnu-debug.tar.gz
    3ec47d6968e2e430e3ff1629f07243f8589bd30406c0de916c2bbc6d5d88e0e8  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/bitcoin-aaa1d03d3ace-riscv64-linux-gnu.tar.gz
    2120e8021edfe8170e318d82e44e405af68bb04c3fe3a3cd0407a17507cd74a0  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/SHA256SUMS.part
    11b173cbeff7b20c717fd880446904eec2d33b30076348c0e9698e876f5be4a4  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.dmg
    af86643730d769ddf6635e9e240b50bd5b94aa576cf7450b4289d8ec1fc883ed  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.tar.gz
    3c3591954aaf6b0557b74baf156f892b9b90ec63c35f3495889431afe0a17f93  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx64.tar.gz
    553d227d7e32c72390e259a46f37e4fd98f5781bd4b6d7ed8d83c5a9ba04f65b  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/SHA256SUMS.part
    479f3597f1577a33dad6634b97515f4ceac0ff163d64e7fa00b8685793d36231  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/bitcoin-aaa1d03d3ace-x86_64-linux-gnu-debug.tar.gz
    c273a93458f2afa04c66d7e23346835dc6201f8e7b878ffbf661ae104c8746e7  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/bitcoin-aaa1d03d3ace-x86_64-linux-gnu.tar.gz
    

    UPDATE: build artifacts are available in https://github.com/hebasto/artefacts/tree/master/pr24115/guix-build-aaa1d03d3ace/output

  67. prusnak commented at 2:18 pm on January 29, 2022: contributor

    Two questions:

    Systems used:

    • RPI4 - Raspberry Pi 4 system (which does not support ARMv8 SHA2 NI), Ubuntu 21.10
    • ALTRA - Ampere Altra system (which does support ARMv8 SHA2 NI), Ubuntu 20.04.3 LTS

    I performed the following tests:

    • build aaa1d03 on RPI4
      • configure output says checking for ARMv8 SHA-NI intrinsics... yes
      • resulting binary contains ARMv8 SHA-NI code
      • bitcoind starts
      • bitcoind output says Using the 'standard' SHA256 implementation
    • build aaa1d03 on ALTRA
      • configure output says checking for ARMv8 SHA-NI intrinsics... yes
      • resulting binary contains ARMv8 SHA-NI code
      • bitcoind starts
      • bitcoind output says Using the 'arm_shani(1way,2way)' SHA256 implementation

    • take binary built on ALTRA and run it on RPI4
      • bitcoind starts
      • bitcoind output says Using the 'standard' SHA256 implementation
    • take binary built on RPI4 and run it on ALTRA
      • bitcoind starts
      • bitcoind output says Using the 'arm_shani(1way,2way)' SHA256 implementation

    I think this proves the build mechanism and the runtime detection works as intended.
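
    The behaviour observed in these tests (accelerated code is always compiled in, but only selected when the CPU reports the feature) can be sketched as a two-condition dispatch. This is an illustration under assumptions, not the PR's actual code: TransformFn, Selection, SelectImplementation and the two stub transforms are hypothetical names, and only the logged strings are taken from the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical signature shared by all SHA-256 block transforms.
using TransformFn = void (*)(uint32_t* state, const unsigned char* data, std::size_t blocks);

// Stand-ins for the real implementations; bodies are omitted in this sketch.
void TransformStandard(uint32_t*, const unsigned char*, std::size_t) {}
void TransformArmShani(uint32_t*, const unsigned char*, std::size_t) {}

struct Selection {
    TransformFn transform;
    std::string name;
};

// compiled_in:  the build produced the accelerated code (configure check passed).
// cpu_has_sha2: the runtime probe (e.g. getauxval on Linux) reported the feature.
Selection SelectImplementation(bool compiled_in, bool cpu_has_sha2)
{
    if (compiled_in && cpu_has_sha2) {
        return {TransformArmShani, "arm_shani(1way,2way)"};
    }
    // Fall back to the portable code whenever either condition fails.
    return {TransformStandard, "standard"};
}
```

    Under this model the RPI4 runs correspond to SelectImplementation(true, false) and the ALTRA runs to SelectImplementation(true, true), regardless of which machine built the binary.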

  68. Sjors commented at 8:04 pm on January 31, 2022: member
    Concept ACK. I find it near-impossible to follow what sha256d64_arm_shani is doing, but that’s mainly because our C++ TransformD64 is undocumented (introduced in #13191). In particular I don’t understand how the algorithm follows from a single sha256. I assume it’s an optimization. Otherwise they seem similar enough, with sha256d64_arm_shani splitting the input to take advantage of the 2-way instructions. And the tests pass :-)
  69. sipa commented at 8:27 pm on January 31, 2022: member

    @Sjors That’s quite possibly worth documenting in general (for all D64 code).

    What these functions do:

    • Take a pointer to an input buffer of N*64 bytes and an output buffer of N*32 bytes (N=1 for 1-way, N=2 for 2-way, etc.).
    • Treat the input as the concatenation of N 64-byte inputs, compute SHA256(SHA256(input)) for each, and concatenate those outputs in the output buffer.

    A bit about SHA256’s structure. SHA256(bytes) is really the following algorithm:

    • Append padding to input (between 9 and 72 bytes); the result is always a multiple of 64 bytes.
    • Initialize the state (a 32-byte value, typically represented as 8 32-bit integers) to the initial state, a constant.
    • Then split the input into blocks of 64 bytes, and for each do state = Transform(state, block), where Transform is the SHA256 transformation function at a high level.
    • The hash is equal to the final state.
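
    The padding rule in the first bullet can be written down directly. The sketch below follows the standard SHA-256 padding (one 0x80 byte, zero bytes, then the 8-byte big-endian bit length); PadLen and Pad are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of padding bytes appended to a message of `len` bytes: at least
// 9 (0x80 plus the 8-byte length), rounded up so that len + padding is a
// multiple of 64. The result is always between 9 and 72.
constexpr std::size_t PadLen(std::size_t len)
{
    return 9 + (64 - (len + 9) % 64) % 64;
}

// Hypothetical helper that materializes the padding bytes themselves.
std::vector<unsigned char> Pad(std::size_t len)
{
    std::vector<unsigned char> pad(PadLen(len), 0);
    pad[0] = 0x80;
    const std::uint64_t bits = static_cast<std::uint64_t>(len) * 8;
    for (int i = 0; i < 8; ++i) {
        // The last 8 bytes hold the message length in bits, big-endian.
        pad[pad.size() - 1 - i] = static_cast<unsigned char>(bits >> (8 * i));
    }
    return pad;
}
```

    For the two cases that matter in this PR, PadLen(64) is 64 and PadLen(32) is 32, so a 64-byte input needs a whole extra block of padding while a 32-byte input fits in one block together with its padding.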

    In case of SHA256(SHA256(64 bytes)), there are 3 Transforms being invoked:

    • The first operates on the 64 bytes of input, starting with initial state.
    • The second continues on the resulting state, processing 64 bytes of padding. That padding is a constant (it’s just a function of the length of the input).
    • The third operates on the 32 bytes of output produced by the second transform, followed by 32 bytes of padding, which is again constant, and starting with a new initial state.

    There are 3 types of optimizations we can do in this case:

    • Start by inlining the 3 transforms into one function body, together with all the initializations. The intermediary conversion to bytes after the second transform and then back to integers for the 3rd transform can be bypassed (serializing & deserializing is a no-op).
    • Observe that lots of intermediary values now actually become known at compile time, in particular many of the values occurring during the 2nd transform (whose input is 100% fixed). I did this by simply writing the code up to this point, adding printf statements on these intermediaries, then turning the printed values into constants in the code and skipping their computation.
    • Take advantage of vectorization and/or instruction-level parallelism. In the case of x86 and ARM SHA instructions, we literally just duplicate every line of code (after doing the operations above), alternating between working on variables relating to a first or a second 64-byte input. This works because these instructions have a long pipeline, and there are sufficient registers available in hardware to store most of the data relating to two instances at once. This improves the throughput.

    The individual commits in https://github.com/sipa/bitcoin/commits/pr24115 show the process.

    Note that I don’t think it’s really required for verifying correctness to see these steps (otherwise I’d have argued for including them in this PR), but it may help understand how it came to be.

  70. Sjors commented at 9:29 am on February 1, 2022: member

    I think this is the step that confuses me:

    The second continues on the resulting state, processing 64 bytes of padding. That padding is a constant (it’s just a function of the length of the input).

    If the first transform is the equivalent of a single sha256(64 bytes) and the third is the equivalent of a second sha256() on the 32 byte result of the first, what is the second transform doing?

    I did this by simply writing the code up to this point, adding printf statements on these intermediaries, then turning the printed values into constants in the code and skipping their computation.

    This is definitely worth documenting (can be another PR). Even nicer if we can generate the values in a Python script (for manual comparison, not code generation).

  71. sipa commented at 1:51 pm on February 1, 2022: member

    @Sjors

    There are two SHA256 invocations:

    • H1 = SHA256(input)
    • H2 = SHA256(H1)

    Input is 64 bytes, which means it gets 64 bytes of padding (because the padding is always between 9 and 72 bytes long, and the result is always a multiple of 64).

    For H2, SHA256(H1) just gets a 32-byte input, so it also only gets a 32-byte padding, and the result just needs one transform.
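
    The padding lengths can be checked with a small helper. This is only a sketch; `PaddingLength` is a hypothetical name, not a function in the codebase, and it assumes a C++11 or later compiler.

```cpp
#include <cstddef>

// Number of padding bytes SHA-256 appends to a message of `len` bytes:
// one 0x80 byte, then zero bytes, then the 8-byte big-endian bit length,
// chosen so that the padded total is a multiple of 64.
constexpr std::size_t PaddingLength(std::size_t len)
{
    return 1 + (64 + 55 - len % 64) % 64 + 8;
}

static_assert(PaddingLength(64) == 64, "64-byte input: one full block of padding");
static_assert(PaddingLength(32) == 32, "32-byte input: padding fills the same block");
static_assert(PaddingLength(55) == 9, "shortest possible padding");
static_assert(PaddingLength(56) == 72, "longest possible padding");
```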

    So we can write it this way:

    • H1 = Transform(Transform(Init(), input), Pad(64))
    • H2 = Transform(Init(), H1 + Pad(32))

    The first transform is the inner one for H1, the second the outer one for H1. The third transform is the H2 one.
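
    For reference, the two formulas can be written out as a plain, unoptimized sketch. This is not the code from this PR: `State`, `Bytes`, `Serialize`, `Sha256` and `DoubleSha256Of64` are hypothetical names chosen for the example, while `Init`, `Transform` and `Pad` follow the notation above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using State = std::array<uint32_t, 8>;
using Bytes = std::vector<uint8_t>;

// SHA-256 round constants (FIPS 180-4, section 4.2.2).
static const uint32_t K[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

static uint32_t Rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// Initial hash state (FIPS 180-4, section 5.3.3).
State Init()
{
    return {{0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
             0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19}};
}

// One SHA-256 compression step: absorb a 64-byte block into the state.
State Transform(State s, const uint8_t* block)
{
    uint32_t w[64];
    for (int i = 0; i < 16; ++i) {
        w[i] = uint32_t(block[4 * i]) << 24 | uint32_t(block[4 * i + 1]) << 16 |
               uint32_t(block[4 * i + 2]) << 8 | uint32_t(block[4 * i + 3]);
    }
    for (int i = 16; i < 64; ++i) {
        const uint32_t s0 = Rotr(w[i - 15], 7) ^ Rotr(w[i - 15], 18) ^ (w[i - 15] >> 3);
        const uint32_t s1 = Rotr(w[i - 2], 17) ^ Rotr(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }
    State v = s;  // working variables a..h
    for (int i = 0; i < 64; ++i) {
        const uint32_t S1 = Rotr(v[4], 6) ^ Rotr(v[4], 11) ^ Rotr(v[4], 25);
        const uint32_t ch = (v[4] & v[5]) ^ (~v[4] & v[6]);
        const uint32_t t1 = v[7] + S1 + ch + K[i] + w[i];
        const uint32_t S0 = Rotr(v[0], 2) ^ Rotr(v[0], 13) ^ Rotr(v[0], 22);
        const uint32_t maj = (v[0] & v[1]) ^ (v[0] & v[2]) ^ (v[1] & v[2]);
        v = State{{t1 + S0 + maj, v[0], v[1], v[2], v[3] + t1, v[4], v[5], v[6]}};
    }
    for (int i = 0; i < 8; ++i) s[i] += v[i];
    return s;
}

// Serialize a state as 32 big-endian bytes.
Bytes Serialize(const State& s)
{
    Bytes out(32);
    for (int i = 0; i < 32; ++i) out[i] = uint8_t(s[i / 4] >> (24 - 8 * (i % 4)));
    return out;
}

// Padding for a message of `len` bytes: 0x80, zero bytes, 64-bit big-endian bit length.
Bytes Pad(uint64_t len)
{
    Bytes pad(1 + (64 + 55 - len % 64) % 64 + 8);
    pad[0] = 0x80;
    for (int i = 0; i < 8; ++i) pad[pad.size() - 1 - i] = uint8_t((len * 8) >> (8 * i));
    return pad;
}

// Generic SHA-256: pad, then feed each 64-byte block through Transform.
Bytes Sha256(Bytes msg)
{
    const Bytes pad = Pad(msg.size());
    msg.insert(msg.end(), pad.begin(), pad.end());
    State s = Init();
    for (std::size_t i = 0; i < msg.size(); i += 64) s = Transform(s, msg.data() + i);
    return Serialize(s);
}

// SHA256(SHA256(input)) for a 64-byte input, written as exactly three Transforms.
Bytes DoubleSha256Of64(const uint8_t* input)
{
    // H1 = Transform(Transform(Init(), input), Pad(64))
    const Bytes pad64 = Pad(64);
    const State h1 = Transform(Transform(Init(), input), pad64.data());
    // H2 = Transform(Init(), H1 + Pad(32))
    Bytes block = Serialize(h1);
    const Bytes pad32 = Pad(32);
    block.insert(block.end(), pad32.begin(), pad32.end());
    return Serialize(Transform(Init(), block.data()));
}
```

    For any 64-byte input, `DoubleSha256Of64(in.data())` should equal `Sha256(Sha256(in))`. The specialized version still serializes H1 to bytes before the third transform; an optimized version can skip that round trip and hard-code everything derived from `Pad(64)` and `Pad(32)`.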

  72. Sjors commented at 2:08 pm on February 1, 2022: member

    Ah that makes sense.

    Input is 64 bytes, which means it gets 64 bytes of padding

    I naively assumed a 64 byte message wasn’t padded, but it is: https://datatracker.ietf.org/doc/html/rfc6234#section-4.1

  73. mutatrum commented at 3:28 pm on February 1, 2022: contributor

    IBD up to block 700000 on a Rock Pi 4a w/ NVMe SSD, assumevalid=0, dbcache=2000:

    master (bd482b3f): 68H52M
    shani (4abca94): 65H29M

    Improvement ~5%

  74. sipa commented at 3:30 pm on February 1, 2022: member

    @Sjors

    I naively assumed a 64 byte message wasn’t padded, but it is: https://datatracker.ietf.org/doc/html/rfc6234#section-4.1

    Yes, it has to be. Otherwise you’d have a trivial 2nd preimage attack between hash(X) and hash(X || padding(len(X))), for non-multiple-of-64-bytes X.
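
    A sketch of that attack, under loud assumptions: the argument does not depend on the internals of Transform, so the code below uses a toy stand-in (`ToyTransform`, an FNV-1a style mixer, NOT SHA-256) and a 64-bit state. `HashSkipPad` is the hypothetical broken scheme that leaves multiple-of-64-byte messages unpadded; all names are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<uint8_t>;

// Toy stand-in for the compression function (NOT SHA-256). Any function of
// (state, 64-byte block) exhibits the same structural issue.
uint64_t ToyTransform(uint64_t state, const uint8_t* block)
{
    for (int i = 0; i < 64; ++i) state = (state ^ block[i]) * 0x100000001b3ULL;
    return state;
}

// Padding for a message of `len` bytes: 0x80, zero bytes, 64-bit big-endian bit length.
Bytes Pad(uint64_t len)
{
    Bytes pad(1 + (64 + 55 - len % 64) % 64 + 8);
    pad[0] = 0x80;
    for (int i = 0; i < 8; ++i) pad[pad.size() - 1 - i] = uint8_t((len * 8) >> (8 * i));
    return pad;
}

Bytes Concat(Bytes a, const Bytes& b)
{
    a.insert(a.end(), b.begin(), b.end());
    return a;
}

// Feed data (whose size must be a multiple of 64) through the compression function.
uint64_t Absorb(const Bytes& data)
{
    uint64_t state = 0xcbf29ce484222325ULL;  // arbitrary fixed initial state
    for (std::size_t i = 0; i < data.size(); i += 64) state = ToyTransform(state, data.data() + i);
    return state;
}

// The real construction: every message is padded, whatever its length.
uint64_t HashAlwaysPad(const Bytes& msg) { return Absorb(Concat(msg, Pad(msg.size()))); }

// Hypothetical broken variant: messages that already fill whole blocks are not padded.
uint64_t HashSkipPad(const Bytes& msg)
{
    return msg.size() % 64 == 0 ? Absorb(msg) : HashAlwaysPad(msg);
}
```

    With X = "abc" and X2 = Concat(X, Pad(3)), both `HashSkipPad(X)` and `HashSkipPad(X2)` absorb the same single block, so they collide by construction. `HashAlwaysPad` absorbs a second block for X2, so the two inputs are no longer processed identically.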

  75. DrahtBot commented at 12:08 pm on February 3, 2022: member

    Guix builds

    File commit 133f73e86bd7c3114263500be2fb5090dd76b4bc(master) commit 19a5c3f8cc5c75da931d491340c1de68867934a1(master and this pull)
    SHA256SUMS.part 4d29ebeb3309d60d... b7844e9c678b97f6...
    *-aarch64-linux-gnu-debug.tar.gz c29ba0d0426063e8... 89ff765de2630eb7...
    *-aarch64-linux-gnu.tar.gz e52c846b1841b3eb... d6ea7915e264d3be...
    *-arm-linux-gnueabihf-debug.tar.gz 5d3f9731cf88da5b... d254cb4dc971d678...
    *-arm-linux-gnueabihf.tar.gz 282b33b13dd1f8a0... ea7a0cf5a93cb32a...
    *-arm64-apple-darwin.tar.gz 3f0dba0a6549c410... 3400772805824d22...
    *-osx-unsigned.dmg 6692809452b6cb62... 25da88b4e7f96778...
    *-osx-unsigned.tar.gz 4366f453672f800d... 3ad587987969546b...
    *-osx64.tar.gz 5387d0cdc36be37a... 198a4d9b96a87a5f...
    *-powerpc64-linux-gnu-debug.tar.gz 1ed57908059941ef... e3a27aa697de6985...
    *-powerpc64-linux-gnu.tar.gz 21079e1bae0f459b... 7c91ce0fa9a40f23...
    *-powerpc64le-linux-gnu-debug.tar.gz 6b441c519f2d3185... 2bfee29cc7203662...
    *-powerpc64le-linux-gnu.tar.gz 132f95573d0fd005... 48a553082d2d03b8...
    *-riscv64-linux-gnu-debug.tar.gz 3c5e1f8e3d9aa92a... 64dc9f50e4993c2e...
    *-riscv64-linux-gnu.tar.gz b17efbca76425fc5... 78bba78a4dd894cb...
    *-x86_64-linux-gnu-debug.tar.gz 9f61cd7fca8d6425... edbfec791406e681...
    *-x86_64-linux-gnu.tar.gz 323dc8d7d0aa8290... b38a112a72acb36a...
    *.tar.gz 5887839fbd29cd1c... 58afa778369cde02...
    guix_build.log 53868781dafe6675... 8cae536cac2f270c...
    guix_build.log.diff a14b825c6610d5c4...
  76. DrahtBot removed the label DrahtBot Guix build requested on Feb 3, 2022
  77. DrahtBot commented at 10:23 pm on February 11, 2022: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #24322 ([kernel 1/n] Introduce initial libbitcoinkernel by dongcarl)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  78. laanwj commented at 7:46 pm on February 14, 2022: member

    Code review and lightly tested ACK aaa1d03d3acebeb44fdd40a302f086aad3d329ce

    I have checked:

    • that the code gets compiled (bitcoind contains the instructions)
    • on an old ARM64 device without the instruction set, that it correctly doesn’t enable the code
    • on a recent ARM64 device with the instruction set, that it enables and uses the code
  79. laanwj merged this on Feb 14, 2022
  80. laanwj closed this on Feb 14, 2022

  81. prusnak deleted the branch on Feb 14, 2022
  82. sidhujag referenced this in commit 42ca85234e on Feb 15, 2022
  83. in src/crypto/sha256_arm_shani.cpp:65 in aaa1d03d3a
    60+        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 16)));
    61+        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 32)));
    62+        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48)));
    63+        chunk += 64;
    64+
    65+        // Original implemenation preloaded message and constant addition which was 1-3% slower.
    


    hebasto commented at 11:22 am on February 16, 2022:
    typo: implemenation ==> implementation
  84. PastaPastaPasta referenced this in commit a7a3a93489 on Mar 29, 2022
  85. PastaPastaPasta referenced this in commit 8278356354 on Mar 29, 2022
  86. laanwj referenced this in commit 51527ec1ec on May 11, 2022
  87. sidhujag referenced this in commit e4cf1bde8f on May 11, 2022
  88. DrahtBot locked this on Feb 16, 2023

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-21 12:12 UTC
