
Introducing UltrafastSecp256k1: A Multi-Architecture Exploration of Secp256k1 Optimizations

An archive of delvingbitcoin.org

vano · #1 ·

Introduction

Hello everyone. I’ve been developing a high-throughput implementation called UltrafastSecp256k1. The project, which was open-sourced on February 11th, 2026, started as an exploration of how modern hardware features (SHA-NI, AVX2, ARM64 Assembly) can be leveraged to push the limits of ECC performance across diverse platforms, from high-end x86 servers to resource-constrained IoT devices like ESP32-S3 and RISC-V boards.

The goal is to create a highly portable, constant-time, and branchless library that is accessible through multiple language bindings (12+ languages including Rust, Go, Swift, and Dart). I am reaching out to this community for a technical audit, feedback on the cryptographic primitives, and suggestions on our constant-time implementation.

Architecture & Core Optimizations

The library is built on a “Zero-Allocation” hot-path contract, ensuring no heap overhead during critical operations. Key technical pillars include:

Platform-Specific Implementation & Benchmarks

We have focused on making the library performant where it’s needed most:

Current State (v3.10.x): The library currently passes over 12,000 consistency tests across x86 and ARM64 platforms. The ecosystem includes full bindings for NPM (Node.js/React Native) and NuGet (.NET), making it ready for high-level integration.

Request for Review & Technical Discussion

I am specifically looking for feedback on:

  1. Constant-Time Integrity: Review of our assembly bypasses for potential side-channel leaks.
  2. Algorithm Selection: Evaluation of our H-Product Serial Inversion and SafeGCD implementation details.
  3. Branchless Logic: Suggestions for further removing branches in the point-normalization and signing flows to improve security.
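To make the branchless-logic request concrete, here is a minimal sketch of the mask-based selection pattern that constant-time code typically relies on. It is not taken from UltrafastSecp256k1: the names `ct_select` and `ct_is_zero` are hypothetical, and Python integers are not constant-time, so this only illustrates the arithmetic that a C or assembly version would perform on 64-bit words.

```python
MASK64 = (1 << 64) - 1  # emulate a 64-bit machine word


def ct_select(flag: int, a: int, b: int) -> int:
    """Return a if flag == 1, else b, using only arithmetic and bitwise ops.

    flag must be exactly 0 or 1; a and b must fit in 64 bits.
    """
    mask = (-flag) & MASK64  # flag=1 -> all ones, flag=0 -> all zeros
    return (a & mask) | (b & ~mask & MASK64)


def ct_is_zero(x: int) -> int:
    """Return 1 if the 64-bit word x is zero, else 0, without comparing."""
    # For any nonzero x, either x or its two's-complement negation
    # has the top bit set, so the shifted OR is 1.
    nonzero = ((x | ((-x) & MASK64)) >> 63) & 1
    return nonzero ^ 1
```

In a real library the same pattern would be applied limb by limb to field elements and scalars, and the compiler output would still need to be checked for reintroduced branches.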

The project is fully open-source, and I believe that peer review from the Delving Bitcoin community is vital to ensure this tool remains both fast and secure for the wider ecosystem.

GitHub Repository: https://github.com/shrec/UltrafastSecp256k1

Technical Changelog: https://github.com/shrec/UltrafastSecp256k1/blob/c649f6dfd80b1611b17f606206b156e3c2e6a058/CHANGELOG.md

vano · #2 ·

Just finished the RISC-V optimization sprint for Milk-V Mars (SiFive U74). Using U74-specific in-order scheduling gave us a 34% boost in verification speed. This is part of the v3.11 roadmap to make UltrafastSecp256k1 the go-to library for resource-constrained IoT devices. Cycles don’t lie! :rocket:

Tim Ruffing · #3 ·

I wonder what your expectation is. If it is that someone here will make the effort of reading and reasoning about more than 150 000 lines of cryptographic code, then I deem that the probability that this happens is negligible.

My main piece of feedback is that the license is not a good fit for the Bitcoin ecosystem. Almost everything in the ecosystem uses the MIT license. Picking the AGPL means that essentially no projects will be able to use your code, even if they wanted to.

vano · #4 · · in reply to #3

Thank you for the candid feedback — I appreciate it.

You’re absolutely right regarding the license friction. After reflecting on your comment and the broader ecosystem norms, I’ve decided to switch the project to the MIT license to better align with Bitcoin Core and related projects.

My intention was never to create adoption barriers. The goal is to build a portable, zero-dependency secp256k1 engine that can be evaluated and integrated freely.

I understand that a full manual review of a large cryptographic codebase is unrealistic without structured audit scope. I’m currently working on:

• A clear threat model document
• A minimized audit surface breakdown
• Reproducible apples-to-apples benchmark harness
• Cross-implementation comparison vs libsecp256k1

Any targeted feedback on specific subsystems (e.g., scalar arithmetic, field layer, constant-time strategy) would already be extremely valuable.

Thanks again for taking the time to respond.

vano · #5 ·

Cumulative release: v3.14.0 → v3.21.0

• 120+ commits
• ABI compatible
• No breaking changes: drop-in upgrade from v3.14.x

Highlights:

• Bernstein‑Yang SafeGCD constant‑time scalar inverse
• 6.4× faster ct::scalar_inverse
• ~43% faster constant‑time ECDSA signing
• RISC‑V constant‑time timing leak fixes
• strict BIP‑340 parsing
• expanded audit infrastructure
• reproducible Docker CI
• cross‑platform benchmarks on x86‑64, ARM64, RISC‑V and ESP32
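For readers unfamiliar with what a scalar inverse computes, the sketch below inverts a scalar modulo the secp256k1 group order using Fermat's little theorem. This is deliberately not the Bernstein‑Yang SafeGCD algorithm mentioned in the highlights, and it is not the library's ct::scalar_inverse; it is a simpler reference that a SafeGCD implementation could be cross-checked against. The function name `scalar_inverse_fermat` is hypothetical.

```python
# Order of the secp256k1 group (a prime), as given in SEC 2.
N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def scalar_inverse_fermat(x: int) -> int:
    """Return the inverse of x modulo N, computed as x^(N-2) mod N.

    Because N is prime, Fermat's little theorem gives x^(N-1) = 1 (mod N)
    for every x not divisible by N, so x^(N-2) is the inverse.
    """
    if x % N == 0:
        raise ZeroDivisionError("zero has no inverse modulo N")
    return pow(x, N - 2, N)
```

The exponent N-2 is fixed and public, which is why exponentiation-based inversion is a common constant-time baseline; SafeGCD reaches the same result with a fixed number of division steps instead.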

vano · #6 ·

vano · #7 · · in reply to #6

vano · #8 · · in reply to #7

More benchmarks can be found here:

vano · #9 ·

UltrafastSecp256k1 and BIP352 on an i5 CPU and an Nvidia 5060 Ti

vano · #10 · · in reply to #9

Benchmark repo on GitHub: shrec/bench_bip352 (BIP-352 Standalone Benchmark)

vano · #11 ·

Hi all,

I’ve been experimenting with BIP324 v2 encrypted transport and wanted to share some measurements around its performance characteristics, focusing on throughput, latency, and batching effects.

The goal was not to propose changes, but to better understand where the actual costs are and how they scale under different execution models.


Setup


CPU baseline (single-thread)

Mixed traffic profile:

Selected primitives:

One-time operations:


GPU offload (batch processing)

With batching (128K packets):

After optimizations (state reuse, instruction-level tuning, memory layout):

Overhead remains roughly the same (~5.5–5.6%).


Latency vs batching

A key observation is the strong dependence on batch size:

This suggests:

The GPU behaves as a throughput engine, not a latency engine.

Small workloads are dominated by launch and transfer overhead, while large batches amortize these costs effectively.
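The amortization effect can be sketched with a simple cost model: each batch pays a fixed cost (kernel launch plus transfer setup) once, plus a per-packet cost. The function names and any numbers plugged into them are hypothetical and are not the measurements from this post.

```python
def batch_completion_us(batch_size: int, fixed_overhead_us: float,
                        per_packet_us: float) -> float:
    """Time until the whole batch is done: one fixed cost plus per-packet work."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return fixed_overhead_us + batch_size * per_packet_us


def amortized_per_packet_us(batch_size: int, fixed_overhead_us: float,
                            per_packet_us: float) -> float:
    """Average cost per packet; the fixed overhead is spread over the batch."""
    return batch_completion_us(batch_size, fixed_overhead_us,
                               per_packet_us) / batch_size
```

Under this model the amortized cost approaches `per_packet_us` as the batch grows, while the time any single packet waits for its batch to complete keeps increasing, which is the throughput-versus-latency trade-off described above.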


PCIe / data movement effects

End-to-end profiling shows:

Effective end-to-end throughput stabilizes around:

This indicates that once crypto is sufficiently optimized, data movement becomes the dominant bottleneck, not the cryptographic primitives themselves.
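One way to reason about this is a two-stage model with a crypto stage and a data-movement stage. The sketch below is an assumption-laden illustration, not the profiling method used here: the function names are hypothetical, both rates must be in the same unit, and real pipelines fall somewhere between the two cases.

```python
def serialized_throughput(crypto_rate: float, transfer_rate: float) -> float:
    """Throughput when each byte is transferred, then processed, with no overlap.

    The per-byte times add up, so the combined rate is the harmonic
    combination of the two stage rates.
    """
    return 1.0 / (1.0 / crypto_rate + 1.0 / transfer_rate)


def overlapped_throughput(crypto_rate: float, transfer_rate: float) -> float:
    """Throughput when transfer and compute fully overlap.

    The slower stage sets the pace.
    """
    return min(crypto_rate, transfer_rate)
```

In both cases, making the crypto stage faster yields diminishing returns once its rate is well above the transfer rate, which matches the observation that data movement ends up dominating.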


Additional observations


Takeaways

  1. BIP324 cryptographic overhead on CPU is measurable but not extreme (~5–6%)

  2. Throughput can scale significantly with parallel execution (30x in this setup)

  3. Latency and throughput behave very differently depending on batching

  4. Once crypto is fast enough, transport becomes memory/IO bound

  5. Batch size and execution model are critical factors in performance


Open questions


I’m happy to share more details or run additional measurements if useful.