[WIP, Please benchmark] Use homogeneous coordinates in pippenger #1767

pull real-or-random wants to merge 9 commits into bitcoin-core:master from real-or-random:202511-geh-pippenger changing 7 files +499 −19
  1. real-or-random commented at 3:47 pm on November 6, 2025: contributor

    This adds a new representation geh of group elements, namely in homogeneous (also called projective) coordinates. This is supposed to be faster for unmixed addition (i.e., where the second summand is not a ge) in terms of field operations, namely 12M+0S+25etc vs. 12M+4S+11etc for Jacobian coordinates.

    The addition and doubling formulas are due to Renes, Costello, and Batina 2016, Algorithms 7 and 9. The formulas are complete, i.e., they have no special cases. However, this implementation still keeps track of infinity in a dedicated boolean flag for performance reasons. Since the buckets in Pippenger’s algorithm are initialized with infinity (=zero), we’ll have many additions involving infinity, and going through the entire formula for each of those hurts performance (and the entire point of this PR is performance).
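    For readers who want to see the formulas concretely, here is a toy Python model (not the PR's C code; names like geh_add are my own) of the complete a = 0 addition law from that paper. The compact form below spends more multiplications than strictly needed; the paper's Algorithm 7 schedules the same computation into 12M plus cheaper operations.

```python
# Toy model of the complete addition law for y^2 = x^3 + b with a = 0
# (Renes-Costello-Batina 2016), over the secp256k1 field. Infinity is
# (0 : 1 : 0). Names are illustrative, not the PR's identifiers.
P = 2**256 - 2**32 - 977   # secp256k1 field prime
B3 = 3 * 7                 # 3*b (secp256k1 has b = 7)

def geh_add(p, q):
    """Complete addition: no special cases for p == q, p == -q, or infinity."""
    X1, Y1, Z1 = p
    X2, Y2, Z2 = q
    xx, yy, zz = X1 * X2 % P, Y1 * Y2 % P, Z1 * Z2 % P
    xy = (X1 * Y2 + X2 * Y1) % P
    yz = (Y1 * Z2 + Y2 * Z1) % P
    xz = (X1 * Z2 + X2 * Z1) % P
    X3 = xy * (yy - B3 * zz) - B3 * yz * xz
    Y3 = (yy + B3 * zz) * (yy - B3 * zz) + 3 * B3 * xx * xz
    Z3 = yz * (yy + B3 * zz) + 3 * xx * xy
    return (X3 % P, Y3 % P, Z3 % P)

GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
G = (GX, GY, 1)
INF = (0, 1, 0)
```

    The same function doubles (p == q) and absorbs infinity, which is what "complete" buys; the PR keeps the infinity flag anyway purely for speed.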

    The formulas were implemented by giving GPT-5 mini screenshots of the algorithms in the paper and of field.h. The result was not awesome, but I was able to clean it up manually.

    The new representation is used in Pippenger’s ecmult_multi for accumulating the buckets after every window iteration. Buckets are still constructed as gej (because it has faster mixed addition) and only converted to geh before accumulation. This is still supposed to be faster even if the conversion is accounted for. The conversion costs 2M+1S but we then do two geh additions in a row, saving 8S. This PR has three different variants of how geh could be used:

    1. https://github.com/bitcoin-core/secp256k1/commit/2ccc5e02776b8ae3483adf342b741abd99e48931 Only the inner accumulation loop is done in geh.
    2. https://github.com/bitcoin-core/secp256k1/commit/867fe34dd4f642fb966d5ee652e0d77a9adafa41 All of the accumulation is done in geh.
    3. https://github.com/bitcoin-core/secp256k1/commit/0fc73f397bc6a10b723b89d6f2a0dc52e7621748 Like the previous, but we switch back to gej for rows of doublings.
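    To make the conversion cost concrete, here is a toy sketch (my own naming, not the PR's code): for a Jacobian point (X, Y, Z) with x = X/Z² and y = Y/Z³, the homogeneous triple (X·Z, Y, Z³) represents the same affine point with x = X/Z and y = Y/Z, at a cost of 1S (Z²) plus 2M (X·Z and Z²·Z), i.e. the 2M+1S mentioned above.

```python
# Toy sketch (my naming, not the PR's code): Jacobian (x = X/Z^2, y = Y/Z^3)
# to homogeneous (x = X/Z, y = Y/Z) conversion over the secp256k1 field.
P = 2**256 - 2**32 - 977  # secp256k1 field prime

def gej_to_geh(p):
    X, Y, Z = p
    zz = Z * Z % P                     # 1S: Z^2
    return (X * Z % P, Y, zz * Z % P)  # 2M: X*Z and Z^2 * Z
```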

    Unfortunately, none of these turns out to be really faster in ecmult_bench pippenger_wnaf on my x86_64 system with gcc 15.2.1 or clang 21.1.4. The best variant (2) beats master by just 0.21%; the other variants are slower than master. :/ If I compile in 32-bit mode, all three variants beat master consistently, but only by 1.2%. This latter result gives at least some hope that this PR could pay off on some platform. I’m not even sure how much we care about 32-bit platforms. Maybe we care about hardware wallets in general, but probably not when it comes to ecmult_multi. (Plus this would need real benchmarks; I didn’t even run this on a native 32-bit CPU.)

    But we’d certainly care about ARM64 which I couldn’t test on. Anyone with an ARM Mac willing to benchmark this?

    The exact benchmark command was SECP256K1_BENCH_ITERS=100000 bench_ecmult pippenger_wnaf (or 20000 iters for 32-bit). Don’t forget the pippenger_wnaf argument to make sure you don’t benchmark Strauss’ algorithm instead, at least below the threshold where we switch to Pippenger automatically. I did this on a 12th Gen Intel(R) Core(TM) i7-1260P, pinned to a P-core, and with TurboBoost disabled. See the attached spreadsheet benchmark-gcc.ods for details.

    If you want to benchmark this, I think it makes sense to get four runs per setup: one for the baseline (https://github.com/bitcoin-core/secp256k1/commit/d0f3123c0c2d4643a0191af6e7c5e18032666605, just disabling low point counts in bench_ecmult for quicker benchmarking) and the three “step” commits as mentioned above. You could just extend the spreadsheet with your results.

    Also, if you have any ideas on how to improve this further, I’d be happy to hear them. I tried various micro-optimizations, but none of them turned out to be significant on my machine. In fact, most of them made the code slower in practice. In theory, this PR should make it possible to increase the window size a bit, but playing around with the window size didn’t make a difference either in practice.

    edit: Don’t care about CI. It fails on some platforms because I forgot to mark functions static. This should compile locally without issues.

  2. HACK: Don't run unrelevant stuff in bench_ecmult d0f3123c0c
  3. group/geh: Add homogeneous group representation
    Even though the formulas are complete, infinity is special cased for
    performance.
    8eca7f6524
  4. geh: Add testutil functions (TODO: unused) 75d0e66958
  5. geh: Add some tests a995c0e63f
  6. geh: Add internal benchmarks (TODO naming) 0f210198f4
  7. pippenger/geh step 1: Use geh in accumulation loop 2ccc5e0277
  8. pippenger/geh step 2: Use geh in entire accumulation 867fe34dd4
  9. pippenger/geh step 3: Go back to gej for doublings 0fc73f397b
  10. fixups 9cfba13041
  11. real-or-random added the label performance on Nov 6, 2025
  12. theStack commented at 8:42 pm on November 7, 2025: contributor
    Ran the benchmarks on my arm64 notebook (it’s a Lenovo Thinkpad T14s Gen 6 with a Qualcomm Snapdragon X Elite CPU, using gcc 14.2.0) using a hacked-together build-and-benchmark script and got the following results: https://gist.github.com/theStack/897d7b50b5b8a6f288ed2b817fcca9fc Based only on this, it looks like variant 1 (2ccc5e02776b8ae3483adf342b741abd99e48931) is a bit faster (~1%-ish, looking at the last few lines each), variant 2 is pretty much the same, and variant 3 is worse. I’ll give it a few more runs next week to verify whether these results are consistent.
  13. siv2r commented at 3:39 pm on November 21, 2025: contributor

    I’ve benchmarked this pull request on my Macbook Pro: M4 Pro Chip (ARM64, 12 Cores: 8P + 4E) | macOS 15.6.1 | Plugged in & no background apps.

    I used this python script to benchmark the code. It’s basically an adapted version of @theStack’s bash script, but extends it to parse the .txt benchmark files and output an .xlsx comparison of the performance improvements.

    In my benchmarks, all three variants are slightly slower (~0.5% on average) than master, with v2 performing best, followed by v3 and v1.

    I tried to keep the benchmarks as accurate as possible. From my brief reading, on macOS with Apple Silicon you can’t actually pin processes to specific P-cores the way you would with taskset on Linux. I’ll read more on this and see if there’s a workaround. Hopefully that’d make the benchmark more robust.

  14. real-or-random commented at 4:20 pm on November 27, 2025: contributor

    I’ve benchmarked this pull request on my Macbook Pro: M4 Pro Chip (ARM64, 12 Cores: 8P + 4E) | macOS 15.6.1 | Plugged in & no background apps. […] In my benchmarks, all three variants are slightly slower (~0.5% on average) than master, with v2 performing best, followed by v3 and v1.

    Thanks a lot! Well, if it’s not faster on Macs either, then I guess there’s no reason to keep this PR open. Sad, but also nice that we don’t have the temptation to maintain a third group representation. ;)

  15. real-or-random closed this on Nov 27, 2025

  16. peterdettman commented at 9:55 am on December 8, 2025: contributor

    @real-or-random I built a similar prototype for the pippenger bucket-summing loop, based on Co-Z arithmetic.

    In short, we ensure running_sum always has the same z coordinate as the accumulator (“r”), so that adding running_sum to r can be done in 5M + 2S (with free update of running_sum for the new Co-Z coordinate); this is often called a ZADDU (or DBLU when it devolves to doubling). On the downside, when adding each bucket to the running_sum, we now need to also update r to keep them Co-Z; cost 3M + 1S. So a typical iteration of the summing loop costs 20M + 7S instead of 24M + 8S (there’s no explosion of linear field operations here though).
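    A toy Python sketch of the ZADDU step described above (hypothetical names; not the actual branch): given two Jacobian points sharing a Z coordinate, it computes their sum in 5M+2S and returns, for free, a copy of the first input rescaled to the sum's Z coordinate.

```python
# Toy Co-Z co-addition (ZADDU) on Jacobian coordinates (x = X/Z^2, y = Y/Z^3)
# over the secp256k1 field. Hypothetical sketch; assumes p != +-q and that
# both inputs share the same Z coordinate.
P = 2**256 - 2**32 - 977  # secp256k1 field prime
GX = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
GY = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8

def zaddu(p, q):
    """Return (p + q, p') where p' represents p with the sum's Z coordinate."""
    X1, Y1, Z = p
    X2, Y2, Z2 = q
    assert Z == Z2                         # Co-Z precondition
    C = (X1 - X2) * (X1 - X2) % P          # 1S
    W1 = X1 * C % P                        # 1M
    W2 = X2 * C % P                        # 1M
    D = (Y1 - Y2) * (Y1 - Y2) % P          # 1S
    A1 = Y1 * (W1 - W2) % P                # 1M
    X3 = (D - W1 - W2) % P
    Y3 = ((Y1 - Y2) * (W1 - X3) - A1) % P  # 1M
    Z3 = Z * (X1 - X2) % P                 # 1M
    return (X3, Y3, Z3), (W1, A1, Z3)      # sum, and p rescaled to Z3
```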

    I measure a few % overall improvement for pippenger_wnaf, depending on the number of points, although I would appreciate others sanity-checking that. Unfortunately, as with this _geh branch, it involves rather a lot of new code, not helped by the fact that there are special cases everywhere.

  17. real-or-random commented at 11:03 am on December 8, 2025: contributor

    @real-or-random I built a similar prototype for the pippenger bucket-summing loop, based on Co-Z arithmetic.

    Awesome! I had briefly considered something like this but hadn’t thought it through, because I didn’t want to try yet another approach after spending time on the one in this PR.

    I measure a few % overall improvement for pippenger_wnaf, depending on the number of points, although I would appreciate others sanity-checking that. Unfortunately, as with this _geh branch, it involves rather a lot of new code, not helped by the fact that there are special cases everywhere.

    Would you be willing to open a draft PR (which, of course, is not a commitment to work on it) just with the current branch, only to keep the discussion organized and provide a place where people could add their benchmarks?

  18. peterdettman commented at 12:44 pm on December 9, 2025: contributor
    Done: #1782 .

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-12-12 09:15 UTC
