I'm a bit concerned this is not well-defined. We can't guarantee that chunk has an alignment compatible with the uint8x16_t type, and even if it does, whether dereferencing the reinterpreted pointer is permitted would be highly architecture-specific (granted, this is already architecture-specific code). Do you know of documentation that specifically permits this?
If not, I'd use this patch:
diff --git a/src/crypto/sha256_arm_shani.cpp b/src/crypto/sha256_arm_shani.cpp
index c051d87042..a783be9068 100644
--- a/src/crypto/sha256_arm_shani.cpp
+++ b/src/crypto/sha256_arm_shani.cpp
@@ -47,8 +47,6 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
STATE0 = vld1q_u32(&s[0]);
STATE1 = vld1q_u32(&s[4]);
- const uint8x16_t* input32 = reinterpret_cast<const uint8x16_t*>(chunk);
-
while (blocks--)
{
// Save state
@@ -56,10 +54,14 @@ void Transform(uint32_t* s, const unsigned char* chunk, size_t blocks)
CDGH_SAVE = STATE1;
// Load and convert input chunk to Big Endian
- MSG0 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
- MSG1 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
- MSG2 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
- MSG3 = vreinterpretq_u32_u8(vrev32q_u8(*input32++));
+ MSG0 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 0)));
+ MSG1 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 16)));
+ MSG2 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 32)));
+ MSG3 = vreinterpretq_u32_u8(vrev32q_u8(vld1q_u8(chunk + 48)));
+ chunk += 64;
// Original implemenation preloaded message and constant addition which was 1-3% slower.
// Now included as first step in quad round code saving one Q Neon register
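
To illustrate the alignment concern outside of NEON, here is a minimal portable sketch. It is not part of the patch: Block16, LoadBlock16 and IsAlignedFor16 are hypothetical names, and Block16 is only a stand-in for a 16-byte vector type such as uint8x16_t. The copy-based load is well-defined for any byte pointer, which is, as I understand it, the role vld1q_u8 plays above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 16-byte vector type such as uint8x16_t.
struct alignas(16) Block16 {
    uint8_t bytes[16];
};

// Well-defined for any alignment of p: the bytes are copied into a properly
// aligned local object instead of dereferencing a reinterpreted pointer.
inline Block16 LoadBlock16(const unsigned char* p)
{
    Block16 out;
    std::memcpy(out.bytes, p, sizeof(out.bytes));
    return out;
}

// Reports whether p satisfies the alignment that dereferencing
// reinterpret_cast<const Block16*>(p) would silently assume.
inline bool IsAlignedFor16(const unsigned char* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(Block16) == 0;
}
```

A caller passing a pointer into the middle of a buffer (e.g. buf + 1) would fail the IsAlignedFor16 check but can still use LoadBlock16 safely.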