Date: Sun, 22 Mar 2026 19:52:56 -0700 (PDT)
From: Vano Chkheidze <payysoon@gmail.com>
To: Bitcoin Development Mailing List <bitcoindev@googlegroups.com>
Message-Id: <35d5cbfb-48ea-4cec-bbb2-6597d40d8795n@googlegroups.com>
Subject: [bitcoindev] BIP324 transport performance: CPU baseline, GPU offload, batching and latency tradeoffs

Hi all,

I’ve been experimenting with BIP324 v2 encrypted transport and wanted to share some measurements of its performance characteristics, focusing on throughput, latency, and batching effects.

The goal was not to propose changes, but to better understand where the actual costs are and how they scale under different execution models.
------------------------------
Setup

- Full BIP324 v2 stack implemented (ChaCha20-Poly1305 AEAD, HKDF-SHA256, ElligatorSwift, session management)
- CPU: x86-64, clang-19, -O3
- GPU: RTX 5060 Ti (CUDA), batch-oriented execution model
- Measurements use the median of multiple runs (RDTSCP timing)

------------------------------
CPU baseline (single-thread)

Mixed traffic profile:

- ~715K packets/sec
- ~221 MB/s goodput
- ~5.5% protocol overhead

Selected primitives:

- ChaCha20: ~780–840 MB/s
- Poly1305: ~1.5–2.2 GB/s
- AEAD encrypt: ~265–580 MB/s
- AEAD decrypt: ~232–587 MB/s

One-time operations:

- HKDF (extract+expand): ~286 ns
- ElligatorSwift create: ~53 µs
- ElligatorSwift XDH: ~30 µs
- Full handshake (both sides): ~172 µs

------------------------------
GPU offload (batch processing)

With batching (128K packets):

- ~12.78M packets/sec
- ~3.9 GB/s goodput
- ~17–18x throughput increase vs CPU

After optimizations (state reuse, instruction-level tuning, memory layout):

- ~21.37M packets/sec
- ~6.6 GB/s goodput
- ~30x throughput vs CPU

Overhead remains roughly the same (~5.5–5.6%).

------------------------------
Latency vs batching

A key observation is the strong dependence on batch size:

- 1 packet: ~17.6 µs (launch + transfer dominated)
- 64 packets: ~0.5 µs/packet
- 1024+ packets: ~63 ns/packet

This suggests the GPU behaves as a throughput engine, not a latency engine: small workloads are dominated by launch and transfer overhead, while large batches amortize these costs effectively.
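The latency numbers above fit a simple two-parameter model: a fixed launch-plus-transfer cost amortized over the batch, plus a constant per-packet cost. As a rough sketch (the model and the fitted constants are mine, derived from the figures above, not part of the measurement code):

```python
# Illustrative two-parameter latency model for batched GPU offload.
# Constants fitted to the measurements above: ~17.6 us total at n=1,
# ~63 ns/packet asymptotically at large batches.

FIXED_OVERHEAD_NS = 17_500   # kernel launch + transfer setup cost
PER_PACKET_NS = 63           # asymptotic per-packet crypto cost

def per_packet_latency_ns(batch_size: int) -> float:
    """Per-packet latency when the fixed cost is amortized over the batch."""
    return FIXED_OVERHEAD_NS / batch_size + PER_PACKET_NS

for n in (1, 64, 1024, 16384):
    print(n, round(per_packet_latency_ns(n), 1))
# n=1    -> ~17.6 us/packet (fixed cost dominates)
# n=1024 -> ~80 ns/packet   (fixed cost nearly amortized away)
```

With these constants the model predicts ~336 ns/packet at n=64, in the same ballpark as the measured ~0.5 µs/packet; the gap suggests the fixed cost itself grows somewhat with batch size rather than being perfectly constant.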
------------------------------
PCIe / data movement effects

End-to-end profiling shows:

- Kernel execution: ~55–58% of total time
- PCIe transfer: ~42–45%

Effective end-to-end throughput stabilizes around ~3.2–3.6 GB/s.

This indicates that once the crypto is sufficiently optimized, *data movement becomes the dominant bottleneck*, not the cryptographic primitives themselves.

------------------------------
Additional observations

- Decoy traffic overhead is relatively small on GPU: a ~20% decoy rate results in only a ~1.4% throughput drop
- Multi-stream execution (overlapping copy + compute): ~1.37x improvement vs a single stream
- The optimal batch size appears to be in the 4K–16K packet range for this setup

------------------------------
Takeaways

1. BIP324 cryptographic overhead on CPU is measurable but not extreme (~5–6%)
2. Throughput can scale significantly with parallel execution (30x in this setup)
3. Latency and throughput behave very differently depending on batching
4. Once the crypto is fast enough, *transport becomes memory/IO bound*
5. Batch size and execution model are critical factors in performance

------------------------------
Open questions

- Are there realistic node-level scenarios where large batch sizes naturally occur?
- Would transport-level batching be compatible with current peer/message handling models?
- How relevant are throughput optimizations vs latency in real-world node deployments?

------------------------------
I’m happy to share more details or run additional measurements if useful.
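As a cross-check on the ~5–6% overhead figure: per the BIP324 spec, each v2 packet adds a 3-byte encrypted length field, a 1-byte header, and a 16-byte Poly1305 tag, i.e. 20 bytes of framing regardless of payload size. A quick back-of-envelope (my arithmetic, not from the measurements) relates that to the average payload size implied by a 5.5% overhead:

```python
# BIP324 v2 per-packet framing: 3-byte encrypted length, 1-byte header,
# 16-byte Poly1305 tag -> 20 bytes of fixed overhead per packet.
LENGTH_BYTES, HEADER_BYTES, TAG_BYTES = 3, 1, 16
OVERHEAD = LENGTH_BYTES + HEADER_BYTES + TAG_BYTES  # 20 bytes

def overhead_fraction(payload_len: int) -> float:
    """Fraction of on-wire bytes that is framing rather than payload."""
    return OVERHEAD / (payload_len + OVERHEAD)

# The average payload size implied by a ~5.5% measured overhead:
avg_payload = OVERHEAD / 0.055 - OVERHEAD
print(round(avg_payload))  # ~344 bytes
```

So the mixed traffic profile above corresponds to an average payload on the order of a few hundred bytes; traffic dominated by larger messages would push the relative overhead well below 5%.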
For reference, the implementation used for these measurements is available here:
https://github.com/shrec/UltrafastSecp256k1/tree/dev

Best,
Ivane Chkheidze

--
You received this message because you are subscribed to the Google Groups "Bitcoin Development Mailing List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bitcoindev+unsubscribe@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/bitcoindev/35d5cbfb-48ea-4cec-bbb2-6597d40d8795n%40googlegroups.com.