Exploit modern simd to calculate 2/4/6/8/16 states at a time depending on the size of the input.
This demonstrates a 2x speedup on x86-64 and 3x on arm+neon. Platforms that require runtime detection (avx2/avx512) improve performance even further; support for those will come as a follow-up.
Rather than hand-writing assembly or using arch-specific intrinsics, this is written using compiler built-ins understood by gcc and clang.
In practice (at least on x86_64 and armv8), the compilers are able to produce assembly that's not much worse than hand-written.
This means that every architecture can benefit from its own vectorized instructions without us having to write and maintain an implementation for each one. But because architectures vary in their ability to exploit the parallelism, we allow each to opt out (via ifdefs) of some or all multi-state calculation at compile time.
As a starting point, x86-64 and arm+neon have been enabled here based on local benchmarks.
Local profiling revealed that chacha20 accounts for a substantial amount of the network thread's time. It's not clear to me if speeding up chacha20 will improve network performance/latency, but it will definitely make it more efficient.
This is part 1 of a series of PRs for chacha20. I think it makes sense to review the generic implementation and tune the architecture-specific defines for parallel blocks before adding the platforms that require runtime detection.
My WIP branch which includes avx2/avx512 can be seen here: https://github.com/theuni/bitcoin/commits/chacha20-vectorized/
I've been hacking on this for quite a while, trying every imaginable tweak and comparing lots of resulting asm/ir. I'm happy to answer any questions about any choices made which aren't immediately obvious.
Edit: some more implementation details:
I wrestled with gcc/clang a good bit, tweaking one thing at a time and comparing the generated code. A few things I found, which may explain some of the decisions I made:
- gcc really wanted to inline some of the helpers, which comes at a very substantial performance cost due to register clobbering (and, with avx2, `vzeroupper`). Hence, all helpers are decorated with `ALWAYS_INLINE`.
- gcc/clang do well with the `vec256` loads/stores with minimal fussing. Though loading each element with `[]` is clumsy and verbose, it avoids compiler-specific layout assumptions. Other things I tried (which produced the same asm):
  - casting directly to `using unaligned_vec256 __attribute__((aligned (1))) = vec256`
  - memcpy into ^^
  - clang's `__builtin_masked_load`
- Loop unrolling was hit-or-miss without `#pragma GCC unroll n`, and I tried to avoid macros for loops, hence the awkward recursive inline template loops. In practice, I see those unrolled 100% of the time.
- I used `std::get` in the helpers for some extra compile-time safety (this actually pointed out some off-by-ones that would've been annoying to track down).
- I avoided using any lambdas or classes for fear of compilers missing obvious optimizations.
- All `vec256` are passed by reference to avoid an annoying clang warning about returning a vector changing the ABI (this is specific to x86 when not compiling with `-mavx`). Even though our functions are all inlined, I didn't see any harm in making that adjustment.