ARMv8 SHA2 Intrinsics

prusnak commented at 6:17 pm on January 20, 2022: contributor

This PR adds support for ARMv8 SHA2 Intrinsics.

Fixes #13401 and #17414

Integration part was done by me.
The original SHA2 NI code comes from https://github.com/noloader/SHA-Intrinsics/blob/master/sha256-arm.c
Minor optimizations from https://github.com/rollmeister/bitcoin-armv8/blob/master/src/crypto/sha256.cpp are applied too.
The 2-way transform added by @sipa

prusnak force-pushed on Jan 20, 2022

prusnak marked this as a draft on Jan 20, 2022

laanwj commented at 6:55 pm on January 20, 2022: member

Concept ACK!

detection when the feature can be used

On Linux (the only system we care about for ARM, i guess), the following would be the way to do detection:

 0#include <sys/auxv.h>
 1#include <asm/hwcap.h>
 2…
 3#ifdef __arm__
 4/* ARM 32 bit */
 5if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
 6    have_arm_shani = true;
 7}
 8#endif
 9#ifdef __aarch64__
10/* ARM 64 bit */
11if (getauxval(AT_HWCAP) & HWCAP_SHA2) {
12    have_arm_shani = true;
13}
14#endif

Note that the capability bit is on a different HWCAP word on 32 and 64 bit (dunno if you even want to support 32 bit here).

laanwj added the label Utils/log/libs on Jan 20, 2022

prusnak force-pushed on Jan 20, 2022

prusnak commented at 7:26 pm on January 20, 2022: contributor

the following would be the way to do detection:

Added in f7dd1ef

prusnak force-pushed on Jan 20, 2022

sipa commented at 8:43 pm on January 20, 2022: member

On commit f7dd1efae715593f5c9ff8186d518d25d1c9023c

On a Linux aarch64 Cortex-A53 system with:

0$ cat /proc/cpuinfo 
1processor       : 0
2BogoMIPS        : 200.00
3Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
4CPU implementer : 0x41
5CPU architecture: 8
6CPU variant     : 0x0
7CPU part        : 0xd03
8CPU revision    : 4

which I presume means it has the necessary SHA2 extensions.

The GCC 9.3.0 compiler used supports the extensions (crypto/libbitcoin_crypto_arm_shani.a is being built):

0checking for x86 SHA-NI intrinsics... no
1checking whether C++ compiler accepts -march=armv8-a+crc+crypto... yes
2checking whether C++ compiler accepts -march=armv8-a+crc+crypto... (cached) yes
3checking for AArch64 CRC32 intrinsics... yes
4checking for AArch64 SHA-NI intrinsics... yes

Still, the extension doesn’t seem to be detected. debug.log says:

02022-01-20T20:37:05Z Using the 'standard' SHA256 implementation

prusnak force-pushed on Jan 20, 2022

prusnak commented at 9:42 pm on January 20, 2022: contributor

@sipa should be fixed in c0849fc4fd9aa7e70973445a1b962a9270fd859c

sipa commented at 10:16 pm on January 20, 2022: member

On c0849fc4fd9aa7e70973445a1b962a9270fd859c:

02022-01-20T22:15:44Z Using the 'arm_shani(1way)' SHA256 implementation

prusnak marked this as ready for review on Jan 20, 2022

sipa commented at 11:01 pm on January 20, 2022: member

This PR (c0849fc4fd9aa7e70973445a1b962a9270fd859c):

ns/byte	byte/s	err%	total	benchmark
2.56	390,886,099.70	0.9%	0.03	`SHA256`
10.27	97,329,584.86	0.0%	0.01	`SHA256D64_1024`
14.60	68,478,670.64	0.4%	0.01	`SHA256_32b`

On master (e3ce019667fba2ec50a59814a26566fb67fa9125):

ns/byte	byte/s	err%	total	benchmark
15.69	63,715,155.54	0.0%	0.17	`SHA256`
43.42	23,029,615.98	0.5%	0.03	`SHA256D64_1024`
41.67	23,995,549.21	0.1%	0.01	`SHA256_32b`

PastaPastaPasta commented at 5:08 am on January 21, 2022: contributor

On Linux (the only system we care about for ARM, i guess)

M1 macs would like a word with you.

PastaPastaPasta commented at 5:31 am on January 21, 2022: contributor

Speaking of m1, I was able to compile this locally on my m1 pro 10 core, ./configure realized that SHA2 intrinsics could be used. See benchmarks below.

on https://github.com/bitcoin/bitcoin/commit/c0849fc4fd9aa7e70973445a1b962a9270fd859c

ns/byte	byte/s	err%	total	benchmark
0.47	2,148,996,364.97	0.3%	0.01	`SHA256`
1.48	676,354,727.08	0.3%	0.01	`SHA256D64_1024`
1.15	873,060,380.08	0.1%	0.01	`SHA256_32b`

on master

ns/byte	byte/s	err%	total	benchmark
3.10	322,550,263.01	1.3%	0.03	`SHA256`
9.26	107,941,088.30	0.7%	0.01	`SHA256D64_1024`
6.22	160,743,377.26	1.6%	0.01	`SHA256_32b`

prusnak commented at 10:22 am on January 21, 2022: contributor

Speaking of m1, I was able to compile this locally on my m1 pro 10 core, ./configure realized that SHA2 intrinsics could be used. See benchmarks below.

Yes, support for Apple Silicon is included in this PR.

in src/crypto/sha256_arm_shani.cpp:50 in d47c5fb566 outdated

45+
46+    // Load state
47+    STATE0 = vld1q_u32(&s[0]);
48+    STATE1 = vld1q_u32(&s[4]);
49+
50+    const uint8x16_t* input32 = reinterpret_cast<const uint8x16_t*>(chunk);

sipa commented at 5:38 pm on January 21, 2022:

I’m a bit concerned this is not well-defined. We can’t guarantee that chunk has an alignment that is compatible with the uint8x16_t type, and even if it is, whether such a reinterpretation is permitted would be highly architecture-specific (of course, this is already architecture-specific code). Do you know of documentation that specifically permits this?

If not, I’d use this patch:

 0diff --git a/src/crypto/sha256_arm_shani.cpp b/src/crypto/sha256_arm_shani.cpp
 1index c051d87042..a783be9068 100644
 2--- a/src/crypto/sha256_arm_shani.cpp
 3+++ b/src/crypto/sha256_arm_shani.cpp
 4@@ -47,8 +47,6 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
 5     STATE0 = vld1q_u32(&s[0]);
 6     STATE1 = vld1q_u32(&s[4]);
 7 
 8-    const uint8x16_t* input32 = reinterpret_cast<const uint8x16_t*>(chunk);
 9-
10     while (blocks--)
11     {
12         // Save state
13@@ -56,10 +54,14 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
14         CDGH_SAVE = STATE1;
15 
16         // Load and convert input chunk to Big Endian
17-        MSG0 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
18-        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
19-        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
20-        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
21+        MSG0 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 0)));
22+        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 16)));
23+        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 32)));
24+        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48)));
25+        chunk += 64;
26 
27         // Original implemenation preloaded message and constant addition which was 1-3% slower.
28         // Now included as first step in quad round code saving one Q Neon register

prusnak commented at 6:08 pm on January 21, 2022:

Better safe than sorry - applied in f06f46c

I do not see any performance hit with this change.

hebasto commented at 10:45 pm on January 21, 2022: member

Tested c0849fc4fd9aa7e70973445a1b962a9270fd859c on Mac mini (M1, 2020):

0% time ./src/bitcoind -datadir=/Users/hebasto/SHANI -assumevalid=0 -stopatheight=719700 -prune=550
12022-01-21T07:44:20Z Bitcoin Core version v22.99.0-c0849fc4fd9a (release build)
22022-01-21T07:44:20Z Validating signatures for all blocks.
32022-01-21T07:44:20Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
42022-01-21T07:44:20Z Prune configured to target 550 MiB on disk for block and undo files.
52022-01-21T07:44:20Z Using the 'arm_shani(1way)' SHA256 implementation
6...
72022-01-21T22:38:17Z Shutdown: done
8./src/bitcoind -datadir=/Users/hebasto/SHANI -assumevalid=0  -prune=550  149587.28s user 11456.17s system 300% cpu 14:53:56.52 total

UPDATE. The same for the master branch (e3ce019667fba2ec50a59814a26566fb67fa9125):

0% time ./src/bitcoind -datadir=/Users/hebasto/MASTER -assumevalid=0 -stopatheight=719700 -prune=550
12022-01-21T22:49:25Z Bitcoin Core version v22.99.0-e3ce019667fb (release build)
22022-01-21T22:49:25Z Validating signatures for all blocks.
32022-01-21T22:49:25Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
42022-01-21T22:49:25Z Prune configured to target 550 MiB on disk for block and undo files.
52022-01-21T22:49:25Z Using the 'standard' SHA256 implementation
6...
72022-01-22T14:37:08Z Shutdown: done
8./src/bitcoind -datadir=/Users/hebasto/MASTER -assumevalid=0  -prune=550  174110.07s user 11526.30s system 326% cpu 15:47:43.83 total

51 min or 6% faster IBD.

sipa commented at 11:50 pm on January 21, 2022: member

See https://github.com/sipa/bitcoin/commits/pr24115, which adds a 2-way 64-byte optimized variant. On my Cortex-A53 It’s roughly a 2x speedup for the SHA256D64_1024 benchmark (relevant for Merkle root computation) compared to this PR. For more modern architectures I could imagine it’s more:

ns/byte	byte/s	err%	total	benchmark
2.60	384,105,263.28	0.3%	0.03	`SHA256`
5.35	187,019,153.94	0.1%	0.01	`SHA256D64_1024`
14.61	68,437,280.69	0.0%	0.01	`SHA256_32b`

For reference, master again:

ns/byte	byte/s	err%	total	benchmark
15.69	63,715,155.54	0.0%	0.17	`SHA256`
43.42	23,029,615.98	0.5%	0.03	`SHA256D64_1024`
41.67	23,995,549.21	0.1%	0.01	`SHA256_32b`

PastaPastaPasta commented at 5:04 am on January 22, 2022: contributor

@sipa’s branch on m1:

ns/byte	byte/s	err%	total	benchmark
0.46	2,174,603,243.64	0.6%	0.01	`SHA256`
0.95	1,053,985,898.82	0.8%	0.01	`SHA256D64_1024`
1.15	871,857,965.44	0.4%	0.01	`SHA256_32b`

previous results on c0849

ns/byte	byte/s	err%	total	benchmark
0.47	2,148,996,364.97	0.3%	0.01	`SHA256`
1.48	676,354,727.08	0.3%	0.01	`SHA256D64_1024`
1.15	873,060,380.08	0.1%	0.01	`SHA256_32b`

in src/crypto/sha256.cpp:654 in f06f46cd5e outdated

639+#ifdef __linux__
640+#ifdef __arm__ // 32-bit
641+    if (getauxval(AT_HWCAP2) & HWCAP2_SHA2) {
642+        have_arm_shani = true;
643+    }
644+#endif

PastaPastaPasta commented at 6:10 am on January 22, 2022:

I’m not sure if bitcoin core has a preference documented somewhere, but I prefer documenting what the endif is associated with such as

0#endif // __arm__

prusnak commented at 8:44 am on January 22, 2022:

I checked all the other files in src/crypto and we use the endif comments only for the include guards.

prusnak commented at 11:55 am on January 22, 2022: contributor

I confirm the numbers on M1:

before @sipa’s improvements (f06f46cd5e2d732669e70fbfeacd7bf9b9ac8815):

ns/byte	byte/s	err%	total	benchmark
0.45	2,198,307,083.71	0.3%	0.01	`SHA256`
1.49	671,303,457.11	0.3%	0.01	`SHA256D64_1024`
1.15	866,670,334.07	0.2%	0.01	`SHA256_32b`

after @sipa’s improvements (0e7299508af17014a906a951ad9b4ecfcc891bb1):

ns/byte	byte/s	err%	total	benchmark
0.46	2,197,198,571.82	0.3%	0.01	`SHA256`
0.94	1,059,405,097.65	0.3%	0.01	`SHA256D64_1024`
1.17	857,291,152.82	0.5%	0.01	`SHA256_32b`

prusnak commented at 11:56 am on January 22, 2022: contributor

I confirm the numbers on M1:

I merged @sipa’s improvements into this branch => 0e7299508af17014a906a951ad9b4ecfcc891bb1 ❤️

PastaPastaPasta commented at 2:11 pm on January 22, 2022: contributor

I’m not able to build this branch on m1 at the moment

 0config.status: creating libbitcoinconsensus.pc
 1config.status: creating Makefile
 2config.status: creating src/Makefile
 3config.status: creating doc/man/Makefile
 4config.status: creating share/setup.nsi
 5config.status: creating share/qt/Info.plist
 6config.status: creating test/config.ini
 7config.status: creating contrib/devtools/split-debug.sh
 8config.status: creating src/config/bitcoin-config.h
 9config.status: src/config/bitcoin-config.h is unchanged
10config.status: executing depfiles commands
11config.status: executing libtool commands
12 cd . && /bin/sh /Users/pasta/workspace/bitcoin/build-aux/missing automake-1.16 --foreign
13/bin/sh: /Users/pasta/workspace/bitcoin/build-aux/missing: No such file or directory
14make: *** [Makefile.in] Error 1

I just checked out sipa’s branch here: https://github.com/sipa/bitcoin/commit/0e7299508af17014a906a951ad9b4ecfcc891bb1 and compilation worked trivially

sipa commented at 2:30 pm on January 22, 2022: member

@prusnak @PastaPastaPasta Perhaps you want to also benchmark with the two last commits removed (so at “Optimization: precompute a few 3rd transform intermediaries”). Whether the last two help may be very architecture-dependent. For me they contribute a ~30% speedup, but maytbe on M1 that is not the case.

prusnak commented at 3:04 pm on January 22, 2022: contributor

@sipa benchmark of 38ed75f8b9fdb60ab45d07f3d4069a9c9cb636d4 Optimization: precompute a few 3rd transform intermediaries on M1:

ns/byte	byte/s	err%	total	benchmark
0.46	2,163,137,327.85	0.0%	0.01	`SHA256`
1.28	780,225,941.01	0.3%	0.01	`SHA256D64_1024`
1.18	850,467,547.93	0.2%	0.01	`SHA256_32b`

The improvement of using 0e72995 is there also for M1.

sipa commented at 3:06 pm on January 22, 2022: member

Looks like the 2-way version is a clear win on M1 as well, thanks!

prusnak force-pushed on Jan 22, 2022

hebasto commented at 9:36 am on January 23, 2022: member

Approach ACK 1cfacecf8203f775f5827fa225a6dccf7a28cdef.

It seems commits need some cleanup (reorder? squash?) for the nice commit history, no?

prusnak commented at 9:38 am on January 23, 2022: contributor

It seems commits need some cleanup (reorder? squash?) for the nice commit history, no?

I will squash when @sipa confirms he’s satisfied with the result and won’t be doing any further changes. @sipa are you OK with me doing the rebase/squash?

sipa commented at 2:17 pm on January 23, 2022: member

Feel free to squash/rebase wherever you feel it is useful.

prusnak force-pushed on Jan 23, 2022

prusnak commented at 2:23 pm on January 23, 2022: contributor

Feel free to squash/rebase wherever you feel it is useful.

Thanks! Squashed and rebased => 1f140783e1fbf818d4a6045464db3903af426bd0

in src/crypto/sha256_arm_shani.cpp:1 in 1f140783e1 outdated

0@@ -0,0 +1,898 @@
1+// Copyright (c) 2018-2020 The Bitcoin Core developers

PastaPastaPasta commented at 4:14 pm on January 23, 2022:

was this moved from somewhere? Otherwise it should probably be updated

prusnak commented at 5:01 pm on January 23, 2022:

Addresssed in the latest rebase

in src/crypto/sha256_arm_shani.cpp:19 in 1f140783e1 outdated

14+#include <cstddef>
15+#include <arm_acle.h>
16+#include <arm_neon.h>
17+
18+namespace {
19+static const uint32_t K[64] =

PastaPastaPasta commented at 4:16 pm on January 23, 2022:

nit: I’d prefer this being a constexpr std::array

prusnak commented at 5:01 pm on January 23, 2022:

Addresssed in the latest rebase

in src/crypto/sha256_arm_shani.cpp:203 in 1f140783e1 outdated

198+
199+namespace sha256d64_arm_shani {
200+void Transform_2way(unsigned char* output, const unsigned char* input)
201+{
202+    /* Initial state. */
203+    static const uint32_t INIT[8] = {

PastaPastaPasta commented at 4:18 pm on January 23, 2022:

nit: same, constexpr std::array, and below

prusnak commented at 5:01 pm on January 23, 2022:

Addresssed in the latest rebase

prusnak force-pushed on Jan 23, 2022

in src/crypto/sha256.cpp:657 in 68dbb6775c outdated

652+        have_arm_shani = true;
653+    }
654+#endif
655+#endif
656+
657+#if defined(__APPLE__)

hebasto commented at 5:57 pm on January 23, 2022:

a6ae9ef1626c73253aa8e6c8e406f3ee18b8017f

style nit: #ifdef __linux__ vs #ifdef __aarch64__ vs #if defined(__APPLE__) looks inconsistent. Mind choosing one style for new code?

prusnak commented at 6:09 pm on January 23, 2022:

Addressed in 9560981b0dda342b79e25067132dcb37e8d9e27b

hebasto commented at 5:59 pm on January 23, 2022: member

ACK a6ae9ef1626c73253aa8e6c8e406f3ee18b8017f (first two commits, won’t ACK code change in /src/crypto/ because of lack of my expertise).

prusnak force-pushed on Jan 23, 2022

hebasto commented at 6:14 pm on January 23, 2022: member

re-ACK 9560981b0dda342b79e25067132dcb37e8d9e27b (first two commits only).

in src/crypto/sha256_arm_shani.cpp:20 in 5202c6dd76 outdated

15+#include <cstddef>
16+#include <arm_acle.h>
17+#include <arm_neon.h>
18+
19+namespace {
20+static constexpr std::array<uint32_t, 64> K =

sipa commented at 9:33 pm on January 23, 2022:

May want to put an alignas(uint32x4_t) on this and other constants that are read using vld1q_u32 (better alignment may matter on some platforms).

prusnak commented at 10:32 pm on January 23, 2022:

Addressed in fe180d53c9035dfcec4211182ac954bd95ccaa60 + 195eb2fbe3754efe2ba31fbadb0fa77e801ce8bf

prusnak force-pushed on Jan 23, 2022

hebasto commented at 10:45 am on January 24, 2022: member

Tested 5202c6dd7695f66325638e15861468999ec834c1 on Mac mini (M1, 2020):

0% time ./src/bitcoind -datadir=/Users/hebasto/SHANI-2 -assumevalid=0 -stopatheight=719700 -prune=550
12022-01-23T18:25:19Z Bitcoin Core version v22.99.0-5202c6dd7695 (release build)
22022-01-23T18:25:19Z Validating signatures for all blocks.
32022-01-23T18:25:19Z Setting nMinimumChainWork=00000000000000000000000000000000000000001fa4663bbbe19f82de910280
42022-01-23T18:25:19Z Prune configured to target 550 MiB on disk for block and undo files.
52022-01-23T18:25:19Z Using the 'arm_shani(1way,2way)' SHA256 implementation
6...
72022-01-24T08:16:46Z Shutdown: done
8./src/bitcoind -datadir=/Users/hebasto/SHANI-2 -assumevalid=0  -prune=550  147477.18s user 11980.05s system 319% cpu 13:51:28.21 total

117 min or 12% faster IBD than the master branch. Also see #24115 (comment).

in src/crypto/sha256.cpp:657 in 195eb2fbe3 outdated

652+        have_arm_shani = true;
653+    }
654+#endif
655+#endif
656+
657+#if defined(__APPLE__)

fanquake commented at 7:36 am on January 25, 2022:

Can you use MAC_OSX here to match our other macOS defines.

prusnak commented at 8:56 am on January 25, 2022:

Addressed in 3af2f3f5024a43bacceff4aabfd108a8cd971d1f

prusnak force-pushed on Jan 25, 2022

PastaPastaPasta approved

PastaPastaPasta commented at 2:17 pm on January 25, 2022: contributor

ACK 4abca94b1e12792e821c2169ff7bee05f35ca284

mutatrum commented at 2:44 pm on January 25, 2022: contributor

Compiled and benchmarked on two aarch64 SBC:

Rock Pi 4 (RK3399, Dual Cortex-A72 and Quad Cortex-A53):

master bd482b3

ns/byte	byte/s	err%	ins/byte	total	benchmark
6.64	150,596,444.75	0.0%	30.02	0.07	`SHA256`
18.89	52,942,682.86	0.0%	82.25	0.01	`SHA256D64_1024`
15.95	62,706,213.16	0.0%	70.72	0.01	`SHA256_32b`

4abca94

ns/byte	byte/s	err%	ins/byte	total	benchmark
1.56	642,517,847.54	0.9%	1.75	0.02	`SHA256`
2.70	370,040,371.53	0.0%	4.38	0.01	`SHA256D64_1024`
5.13	194,811,160.90	1.2%	13.59	0.01	`SHA256_32b`

Firefly RK3566-ROC-PC (Quad-core Cortex-A55)

master bd482b3

ns/byte	byte/s	err%	ins/byte	bra/byte	miss%	total	benchmark
9.06	110,429,040.01	0.1%	30.02	0.02	0.1%	0.10	`SHA256`
24.94	40,094,485.36	0.9%	82.25	0.05	0.3%	0.02	`SHA256D64_1024`
22.83	43,792,465.97	0.1%	70.72	1.09	0.0%	0.01	`SHA256_32b`

4abca94

ns/byte	byte/s	err%	ins/byte	bra/byte	miss%	total	benchmark
1.49	672,648,756.27	0.5%	1.75	0.02	0.1%	0.02	`SHA256`
3.10	322,551,749.73	1.1%	4.38	0.02	0.2%	0.01	`SHA256D64_1024`
7.53	132,832,412.14	0.0%	13.59	1.09	0.0%	0.01	`SHA256_32b`

ACK

mutatrum approved

prusnak commented at 7:17 pm on January 26, 2022: contributor

May I propose this to be added to 23.0 Milestone?

fanquake added this to the milestone 23.0 on Jan 27, 2022

mutatrum commented at 9:31 am on January 27, 2022: contributor

Two questions: the machine that builds the aarch64 binaries, does that have these extensions? Or in other words, if this is merged, will the aarch64 binaries have this included?

And has someone tested the case of running this build on a device without the crypto extensions, f.e. a Raspberry Pi?

sipa commented at 0:04 am on January 28, 2022: member

ACK 3abdb593618e62d2c0f0bcc29e19bbadbf31e84c. I have not reviewed whether the macOS feature detection works. Also, some of the code is my own.

I did verify that the x86 SHA-NI keeps working (benchmark + debug.log on my Ryzen 5950X system indicate the SHA instructions are being used). I’ve also verified the instruction sequence, and reviewed the constants used cursorily. I conjecture that it is impossible to create anything resembling the SHA-2 transformation’s flow using these instructions that passes unit tests in a non-trivial test set, but is still wrong in others. Simply because these instructions are so high-level, any modification would greatly affect the outcome of nearly every input.

sipa commented at 0:07 am on January 28, 2022: member

@mutatrum I believe we use GCC 10 in the current guix build environment, so yes, the binaries would have these extensions. But let’s trigger a build to verify. Good idea to check that things still keep working on RPi.

sipa added the label DrahtBot Guix build requested on Jan 28, 2022

fanquake commented at 6:40 am on January 28, 2022: member

@prusnak Could you rebase this on master now that support for arm64 Darwin has been added to Guix (#21851)? Then we can get some Guix builds done, test the binaries etc.

Rename SHANI to X86_SHANI to allow future implementation of ARM_SHANI c2b7934250

Add sha256_arm_shani to build system

Also rename AArch64 intrinsics to ARMv8 intrinsics
as these are not necessarily limited to 64-bit

48a72fa81f

Implement sha256_arm_shani::Transform

Co-Authored-By: Rauli Kumpulainen <rauliweb@gmail.com>
Co-Authored-By: Pieter Wuille <pieter@wuille.net>

fe0629852a

Add optimized sha256d64_arm_shani::Transform_2way aaa1d03d3a

prusnak commented at 8:45 am on January 28, 2022: contributor

@fanquake rebased on top of current master

prusnak force-pushed on Jan 28, 2022

hebasto commented at 10:43 pm on January 28, 2022: member

Guix builds:

 0$ find guix-build-$(git rev-parse --short=12 HEAD)/output/ -type f -print0 | env LC_ALL=C sort -z | xargs -r0 sha256sum
 14a309ef27036065f787330a50659c85323f78f7b5d3c69a79e3eca232f4d3e55  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/SHA256SUMS.part
 2cfedbd51f5bf65d57fe9200e22e56f3a38f44308eea565b201483811996d957b  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/bitcoin-aaa1d03d3ace-aarch64-linux-gnu-debug.tar.gz
 385c9195cf594fbbf3ce6c6b76370f746dd4342c793b54b8b5bdb650a53761eaf  guix-build-aaa1d03d3ace/output/aarch64-linux-gnu/bitcoin-aaa1d03d3ace-aarch64-linux-gnu.tar.gz
 4280234b6a20bbcddb847e0bbaf27f35ac30d292ea5dd096f8c8096220f8ef29a  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/SHA256SUMS.part
 548c0fd0a6a4c1e86e942f6907804ed87527f329954e81e5bdc22beed83cbe6b3  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/bitcoin-aaa1d03d3ace-arm-linux-gnueabihf-debug.tar.gz
 6c435b7a53606f33f9d23f28e6c98ce3d55f50f61fd67bf2b4ebe65261fe68aba  guix-build-aaa1d03d3ace/output/arm-linux-gnueabihf/bitcoin-aaa1d03d3ace-arm-linux-gnueabihf.tar.gz
 7e0d61af7b471ba5135a848be6d4d6cf0caa8042dab9d929f6578de17e47e40c2  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/SHA256SUMS.part
 8e9eda618bf90d5d1522005bd13bd296ae89aec9c3b0b2528b9c9f67e70178828  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-arm64-apple-darwin.tar.gz
 9a6540dcea7c2a2562edd0f2232f511da97339d388cdca686e084d03e381d6fa2  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.dmg
106d970ae7c94b78c5f97e978718862e85c3375c48d3b8ade6fafce39e7c8e0f0e  guix-build-aaa1d03d3ace/output/arm64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.tar.gz
11cdbc9eb5281b14ecbecc2956a5a41709fda93762115ce3cee9516a68874676b8  guix-build-aaa1d03d3ace/output/dist-archive/bitcoin-aaa1d03d3ace.tar.gz
12d063099449e40036d15ed5b023f602ba824edf303ecbe78bcd3f01feeabb535f  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/SHA256SUMS.part
133e6da4026d466039cd361e48a492ea780021bc0efb8e67e948097871ce4226cc  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64-linux-gnu-debug.tar.gz
14669291a4767509f053bdd8789f4631ff6d28cd4ad62156105bc98cdb7d3a295a  guix-build-aaa1d03d3ace/output/powerpc64-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64-linux-gnu.tar.gz
15868cbe0f73cd786d91dd6d1990550e6c9c673ef3d874c545150b6e030951a24d  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/SHA256SUMS.part
1690157bdc29262abd421535e4d6921e72aff7f3d43ca5634bc76598d8daf3a1ec  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64le-linux-gnu-debug.tar.gz
178275665e19f85193de4e7e5ee7b451d6a4f8e414a33586e7066575812e878eda  guix-build-aaa1d03d3ace/output/powerpc64le-linux-gnu/bitcoin-aaa1d03d3ace-powerpc64le-linux-gnu.tar.gz
182723b48d06e8217adb41bcc640417cba3470b234cce7815104e878866d775046  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/SHA256SUMS.part
19f63822813587ec9e4bcc044c4b7918b1330d6b16be09f28bd95fedfd3dcdb147  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/bitcoin-aaa1d03d3ace-riscv64-linux-gnu-debug.tar.gz
203ec47d6968e2e430e3ff1629f07243f8589bd30406c0de916c2bbc6d5d88e0e8  guix-build-aaa1d03d3ace/output/riscv64-linux-gnu/bitcoin-aaa1d03d3ace-riscv64-linux-gnu.tar.gz
212120e8021edfe8170e318d82e44e405af68bb04c3fe3a3cd0407a17507cd74a0  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/SHA256SUMS.part
2211b173cbeff7b20c717fd880446904eec2d33b30076348c0e9698e876f5be4a4  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.dmg
23af86643730d769ddf6635e9e240b50bd5b94aa576cf7450b4289d8ec1fc883ed  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx-unsigned.tar.gz
243c3591954aaf6b0557b74baf156f892b9b90ec63c35f3495889431afe0a17f93  guix-build-aaa1d03d3ace/output/x86_64-apple-darwin/bitcoin-aaa1d03d3ace-osx64.tar.gz
25553d227d7e32c72390e259a46f37e4fd98f5781bd4b6d7ed8d83c5a9ba04f65b  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/SHA256SUMS.part
26479f3597f1577a33dad6634b97515f4ceac0ff163d64e7fa00b8685793d36231  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/bitcoin-aaa1d03d3ace-x86_64-linux-gnu-debug.tar.gz
27c273a93458f2afa04c66d7e23346835dc6201f8e7b878ffbf661ae104c8746e7  guix-build-aaa1d03d3ace/output/x86_64-linux-gnu/bitcoin-aaa1d03d3ace-x86_64-linux-gnu.tar.gz

UPDATE: build artifacts are available in https://github.com/hebasto/artefacts/tree/master/pr24115/guix-build-aaa1d03d3ace/output

prusnak commented at 2:18 pm on January 29, 2022: contributor

Two questions:

Systems used:

RPI4 - Raspberry Pi 4 system (which does not support ARMv8 SHA2 NI), Ubuntu 21.10
ALTRA - Ampere Altra system (which does support ARMv8 SHA2 NI), Ubuntu 20.4.3 LTS

I performed the following tests:

build aaa1d03 on RPI4
- configure output says checking for ARMv8 SHA-NI intrinsics... yes
- resulting binary contains ARMv8 SHA-NI code
- bitcoind starts
- bitcoind output says Using the 'standard' SHA256 implementation
build aaa1d03 on ALTRA
- configure output says checking for ARMv8 SHA-NI intrinsics... yes
- resulting binary contains ARMv8 SHA-NI code
- bitcoind starts
- bitcoind output says Using the 'arm_shani(1way,2way)' SHA256 implementation

take binary built on ALTRA and run it on RPI4
- bitcoind starts
- bitcoind output says Using the 'standard' SHA256 implementation
take binary built on RPI4 and run it on ALTRA
- bitcoind starts
- bitcoind output says Using the 'arm_shani(1way,2way)' SHA256 implementation

take guix build binary from hebasto and run it on ALTRA:
- bitcoind starts
- bitcoind output says Using the 'arm_shani(1way,2way)' SHA256 implementation
take guix build binary from hebasto and run it on RPI4:
- bitcoind starts
- bitcoind output says Using the 'standard' SHA256 implementation

I think this proves the build mechanism and the runtime detection works as intended.

Sjors commented at 8:04 pm on January 31, 2022: member

Concept ACK. I find it near-impossible to follow what sha256d64_arm_shani is doing, but that’s mainly because our c++ TransformD64 is undocumented (introduced in #13191). In particular I don’t understand how the algorithm follows from a single sha256. I assume it’s an optimization. Otherwise they seem similar enough, with sha256d64_arm_shani splitting the input to take advantage of the 2-way instructions. And the tests pass :-)

sipa commented at 8:27 pm on January 31, 2022: member

@Sjors That’s quite possibly worth documenting in general (for all D64 code).

What these functions do:

Take a pointer to an input N*64 bytes buffer, and an output N*32 bytes buffer (N=1 for 1-way ,N=2 for 2-way, etc).
Treat the input as the concatenation of N 64-byte inputs, compute SHA256(SHA256(input)) for each, and concatenate those outputs in the output buffer.

A bit about SHA256’s structure. SHA256(bytes) is really the following algorithm:

Append padding to input (between 9 and 72 bytes); the result is always a multiple of 64 bytes.
Initialize the state (a 32-byte value, typically represented as 8 32-bit integers) to the initial state, a constant.
Then split the input into blocks of 64 bytes, and for each do state = Transform(state, block), where Transform is the SHA256 transformation function at a high level.
The hash is equal to the final state.

In case of SHA256(SHA256(64 bytes)), there are 3 Transforms being invoked:

The first operates on the 64 bytes of input, starting with initial state.
The second continues on the resulting state, processing 64 bytes of padding. That padding is a constant (it’s just a function of the length of the input).
The third operates on the 32 bytes of output produced by the second transform, followed by 32 bytes of padding, which is again constant, and starting with a new initial state.

There are 3 types of optimizations we can do in this case:

Start by inlining the 3 transforms into one function body, together with all the initializations. The intermediary conversion to bytes after the second transform and then back to integers for the 3rd transform can be bypassed (serializing & deserializing is a no-op).
Observing that lots of intermediary values now actually become known at compile time. In particular, lots of values occurring during the 2nd transform (whose input is 100% fixed). I did this by simply writing the code up to this point, adding printf statements on these intermediaries, then turning the printed values into constants in the code and skipping their computation.
Taking advantage of vectorization and/or instruction level parallellism. In the case of x86 and ARM SHA instructions, we literally just duplicate every line of code (after doing the operations above), alternating between working on variables relating to a first or a second 64-byte input. This works because these instructions have a long pipeline, and there are sufficient registers available in hardware to store (most) of the data relating to two instances at once. This improves the throughput.

The individual commits in https://github.com/sipa/bitcoin/commits/pr24115 show the process.

Note that I don’t think it’s really required for verifying correctness to see these steps (otherwise I’d have argued for including them in this PR), but it may help understand how it came to be.

Sjors commented at 9:29 am on February 1, 2022: member

I think this is the step that confuses me:

The second continues on the resulting state, processing 64 bytes of padding. That padding is a constant (it’s just a function of the length of the input).

If the first transform is the equivalent of a single sha256(64 bytes) and the third is the equivalent of a second sha256() on the 32 byte result of the first, what is the second transform doing?

I did this by simply writing the code up to this point, adding printf statements on these intermediaries, then turning the printed values into constants in the code and skipping their computation.

This is definitely worth documenting (can be another PR). Even nicer if we can generate the values in a Python script (for manual comparison, not code generation).

sipa commented at 1:51 pm on February 1, 2022: member

@Sjors

There are two SHA256 invocations:

H1 = SHA256(input)
H2 = SHA256(H1)

Input is 64 bytes, which means it gets 64 bytes of padding (because the padding is always between 9 and 72 bytes long, and the result is always a multiple of 64).

For H2, SHA256(H1) just gets a 32-byte input, so it also only gets a 32-byte padding, and the result just needs one transform.

So we can write it this way:

H1 = Transform(Transform(Init(), input), Pad(64))
H2 = Transform(Init(), H1 + Pad(32))

The first transform is the inner one for H1, the second the outer one for H1. The third transform is the H2 one.

Sjors commented at 2:08 pm on February 1, 2022: member

Ah that makes sense.

Input is 64 bytes, which means it gets 64 bytes of padding

I naively assumed a 64 byte message wasn’t padded, but it is: https://datatracker.ietf.org/doc/html/rfc6234#section-4.1

mutatrum commented at 3:28 pm on February 1, 2022: contributor

IBD up to block 700000 on a Rock Pi 4a w/ NVMe SSD, assumevalid=0, dbcache=2000:

master (bd482b3f): 68H52M shani (4abca94): 65H29M

Improvement ~5%

sipa commented at 3:30 pm on February 1, 2022: member

@Sjors

I naively assumed a 64 byte message wasn’t padded, but it is: https://datatracker.ietf.org/doc/html/rfc6234#section-4.1

Yes, it has to be. Otherwise you’d have a trivial 2nd preimage attack between hash(X) and hash(X || padding(len(X))), for non-multiple-of-64-bytes X.

DrahtBot commented at 12:08 pm on February 3, 2022: member

Guix builds

File	commit 133f73e86bd7c3114263500be2fb5090dd76b4bc(master)	commit 19a5c3f8cc5c75da931d491340c1de68867934a1(master and this pull)
SHA256SUMS.part	`4d29ebeb3309d60d...`	`b7844e9c678b97f6...`
*-aarch64-linux-gnu-debug.tar.gz	`c29ba0d0426063e8...`	`89ff765de2630eb7...`
*-aarch64-linux-gnu.tar.gz	`e52c846b1841b3eb...`	`d6ea7915e264d3be...`
*-arm-linux-gnueabihf-debug.tar.gz	`5d3f9731cf88da5b...`	`d254cb4dc971d678...`
*-arm-linux-gnueabihf.tar.gz	`282b33b13dd1f8a0...`	`ea7a0cf5a93cb32a...`
*-arm64-apple-darwin.tar.gz	`3f0dba0a6549c410...`	`3400772805824d22...`
*-osx-unsigned.dmg	`6692809452b6cb62...`	`25da88b4e7f96778...`
*-osx-unsigned.tar.gz	`4366f453672f800d...`	`3ad587987969546b...`
*-osx64.tar.gz	`5387d0cdc36be37a...`	`198a4d9b96a87a5f...`
*-powerpc64-linux-gnu-debug.tar.gz	`1ed57908059941ef...`	`e3a27aa697de6985...`
*-powerpc64-linux-gnu.tar.gz	`21079e1bae0f459b...`	`7c91ce0fa9a40f23...`
*-powerpc64le-linux-gnu-debug.tar.gz	`6b441c519f2d3185...`	`2bfee29cc7203662...`
*-powerpc64le-linux-gnu.tar.gz	`132f95573d0fd005...`	`48a553082d2d03b8...`
*-riscv64-linux-gnu-debug.tar.gz	`3c5e1f8e3d9aa92a...`	`64dc9f50e4993c2e...`
*-riscv64-linux-gnu.tar.gz	`b17efbca76425fc5...`	`78bba78a4dd894cb...`
*-x86_64-linux-gnu-debug.tar.gz	`9f61cd7fca8d6425...`	`edbfec791406e681...`
*-x86_64-linux-gnu.tar.gz	`323dc8d7d0aa8290...`	`b38a112a72acb36a...`
*.tar.gz	`5887839fbd29cd1c...`	`58afa778369cde02...`
guix_build.log	`53868781dafe6675...`	`8cae536cac2f270c...`
guix_build.log.diff		`a14b825c6610d5c4...`

DrahtBot removed the label DrahtBot Guix build requested on Feb 3, 2022

DrahtBot commented at 10:23 pm on February 11, 2022: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#24322 ([kernel 1/n] Introduce initial libbitcoinkernel by dongcarl)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

laanwj commented at 7:46 pm on February 14, 2022: member

Code review and lightly tested ACK aaa1d03d3acebeb44fdd40a302f086aad3d329ce I have checked

that the code gets compiled (bitcoind contains the instructions)
on a old ARM64 device without the instruction set that it correctly doesn’t enable the code.
on a recent ARM64 with the instruction set that it uses and enables the code

laanwj merged this on Feb 14, 2022

laanwj closed this on Feb 14, 2022

prusnak deleted the branch on Feb 14, 2022

sidhujag referenced this in commit 42ca85234e on Feb 15, 2022

in src/crypto/sha256_arm_shani.cpp:65 in aaa1d03d3a

60+        MSG1 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 16)));
61+        MSG2 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 32)));
62+        MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48)));
63+        chunk += 64;
64+
65+        // Original implemenation preloaded message and constant addition which was 1-3% slower.

hebasto commented at 11:22 am on February 16, 2022:

typo: implemenation ==> implementation

PastaPastaPasta referenced this in commit a7a3a93489 on Mar 29, 2022

PastaPastaPasta referenced this in commit 8278356354 on Mar 29, 2022

laanwj referenced this in commit 51527ec1ec on May 11, 2022

sidhujag referenced this in commit e4cf1bde8f on May 11, 2022

DrahtBot locked this on Feb 16, 2023

ARMv8 SHA2 Intrinsics #24115

Guix builds:

Guix builds

Conflicts