Low-footprint mode

gmaxwell commented at 5:33 am on January 11, 2015: contributor

The current strategy is optimized for large fast chips with huge cache. Especially signing would be useful on some embedded devices where multiple megabytes of pre-computation is not acceptable. Reasonably low memory can be accomplished just by lowering window usage, but we could also consider a separate compile-time low-footprint option.

gmaxwell added the label enhancement on Jan 11, 2015

sipa commented at 11:45 pm on January 24, 2015: contributor

Ideas:

Smaller G table for verification
Use 16-entry table with only the combinations of b0,b1,b2,b3 for G_(b0_1 + b1_2^64 + b2_2^128 + b3*2^192) (suggested by @gmaxwell as well), needing 64 additions and 64 doublings instead of only additions. (As opposed to the 1024-entry table we have now).
Use more compact storage format for field entries (generally useful).
Use generic / branch-free versions of code also where variable-time code is allowed. In particular, avoid 5 different implementations for normalize.
Selectively only enable compilation of signing or verification.

gmaxwell commented at 11:14 pm on August 31, 2015: contributor

Signing table is now static by default, which delivers a significant fraction of the need here for small devices.

douglasbakkum commented at 10:34 am on October 18, 2015: none

As a side note to PR #337, is there a security risk for using the wNAF algorithm (i.e. secp256k1_ecmult_const()) for signing? This would eliminate the need for the large precomputed context tables. The tradeoff is wNAF runs at half the signing speed:

0Default setup
1./bench_sign
2ecdsa_sign: min 83.5us / avg 89.6us / max 95.3us
3
4Using wNAF (with blinding)
5./bench_sign 
6ecdsa_sign: min 174us / avg 178us / max 182us

The verification precomputed context tables (~1.375MB in code comments) can also be eliminated by adding the result of two wNAF secp256k1_ecmult_const() calls instead of the current wNAF secp256k1_ecmult() implemenation, again running at half the speed. This should make verification small enough to run on embedded devices (not tested).

0Using secp256k1_ecmult() (default)
1./bench_verify 
2ecdsa_verify: min 128us / avg 132us / max 135us
3
4Using 2x secp256k1_ecmult_const()
5./bench_verify 
6ecdsa_verify: min 268us / avg 271us / max 275us

Overall, I see 3 different ecmult implementations: secp256k1_ecmult(), secp256k1_ecmult_gen(), secp256k1_ecmult_const(). Each requires its own precomputation table. Using one ecmult implemenation with the smallest table would be useful for embedded systems. It looks like the table size for wNAF secp256k1_ecmult_const() is small at 672 bytes (secp256k1_ge pre_a[ECMULT_TABLE_SIZE(WINDOW_A)]) with satisfactory speeds.

I am happy to give it a go if it makes sense.

gmaxwell commented at 4:56 pm on October 18, 2015: contributor

For security, the fixed-window NAF precludes one of the blinding techniques we’re using (the nums blinding)– which provides speculative improvements against power side-channel attacks. I’d be sad to lose it, but it’s likely not critical. I’ve also personally scrutinized its side-channel behavior much less carefully than ecmult_gen (because its newish and it’s also considerably more complex) though this can be fixed.

I’m surprised your signing with ecmult_const was only half the speed, esp since ecmult_const precomputes its own table on the fly. Haven’t tried it myself. If so, it suggests to me that making ecmult_gen also use the fixed window NAF but with the current table size (which would take 32kb) would actually be faster than ecmult_gen as is, in spite of the NAF conversion costs and constant time negations (presumably due to needing half the number of constant time conditional moves).

In verify; Changing to 2x secp256k1_ecmult_const() is a 2.36x slowdown on my laptop– does your version pass the tests?

ecdsa_verify: min 57.2us / avg 57.4us / max 57.7us ecdsa_verify: min 136us / avg 136us / max 137us

Probably better would be ecmult_gen + ecmult_const for this– for me thats taking only 106 microseconds, so only a 1.84x slowdown; and is sharing the same table with signing. It’s still needlessly losing a lot of performance due to constant size operation. But I suppose if you’re being very size miserly you wouldn’t want code bloat for non-constant time versions of these functions. Similarly, some further code size reduction by replacing some of the specialized code with more generic versions (e.g. the weaker normalization with the full normalization).

Alternatively, ecmult_const could be made to support performing multiple multiplies concurrently, which would also be much faster than calling it twice… esp if for G it used a pre-computed table. The design of ecmult_const wouldn’t make it easy to use different table sizes for different bases in a multi-exp, so I suspect a multi-exp-ified ecmult_const with precomputed G would be slower than ecmult_gen + ecmult_const_variable_base_fixed_base.

We probably need to do some more IPR analysis of the fixed window NAF used in _const before deciding if we could use it in a non-module role.

It looks like the table size for wNAF secp256k1_ecmult_const() is small at 672 bytes

Take care not to conflate ROM tables with RAM ones, ecmult also only uses a window A sized dynamic table, the rest is constant (and ecmult_gen uses almost no dynamic memory). For many cases this makes a big difference; because ram is usually quite limited while rom usually much less so.

douglasbakkum commented at 7:23 pm on October 18, 2015: none

Thank you for the detailed reply.

In verify; Changing to 2x secp256k1_ecmult_const() is a 2.36x slowdown on my laptop– does your version pass the tests?

Yes, it passes the tests. I just checked again and got the same performance numbers. This is the quick hack I used: https://github.com/dbitbox/secp256k1/blob/smallverify/src/ecmult_impl.h#L269

A 32kB ecmult_gen table would be workable for me. Then an ecmult_gen + ecmult_const sounds attractive for verification.

One other situation to give motivation for making the code size as small as possible is the use case of verifying the authenticity of signed firmware updates. In this case, the verification code would need to share space inside a bootloader, which of course should be small in order to give more room for the firmware.

vhnatyk commented at 8:02 pm on June 16, 2019: none

Hi - any progress on this? I see it’s hanging for soo long. Need it desperately for an embedded device with 256KiB RAM and 1MiB flash and trying to find the right way and balance to reduce RAM footprint of ecmult precompute - as it tries to fill all RAM and kills the app.

real-or-random commented at 8:18 am on June 17, 2019: contributor

@vhnatyk Have you checked the latest revisions? We made a lot of progress for use on embedded devices in the past weeks (PRs #557, #566, #595, #596). In particular you can configure the size of the ecmult table used for verification (#596) and building the table does not require a lot of additional RAM now (PR #557).

real-or-random commented at 8:19 am on June 17, 2019: contributor

This issue is related to #622.

vhnatyk commented at 2:53 pm on June 17, 2019: none

@real-or-random Yes - sorry that was a bit lazy question in hopes someone competent and focused would reply.

Have you checked the latest revisions?

Y, by the time of your reply I got some bits of info from connected PRs about WINDOW_A and WINDOW_G tuning 😄 but, since this page was the only I got after a day or two of googling, hopefully, your reply as a contributor will save somebody else’s good few hours, not to mention a couple more of research through the confusing tree of PR’s.

Btw I came here from rust-secp256k1 noticed you there as well. Hopefully, it will get update sources soon as I see PR is open already. Not sure I’m fit for PR or review, but going to work on that and willing to help - saw there some discussion/PR’s started in rust-secp256k1#115 - will continue there. Thanx 👍

elichai commented at 7:43 pm on June 19, 2019: contributor

Isn’t it possible to verify/sign without any precumputed tables? Just manually doing all the needed multiplications

vhnatyk commented at 8:54 pm on June 19, 2019: none

Isn’t it possible to verify/sign without any precumputed tables? Just manually doing all the needed multiplications

Yes, probably - but what about security of constant time and/or performance. Hmm need to find out with tests, what exactly from secp256k1 I’m using

elichai commented at 9:01 pm on June 19, 2019: contributor

hmm I’m not big on side channel attacks, if you do all the needed multiplications it will take a different time depending on the private key. can you leverage that? and is that not the same even with tables? like an odd vs even private key should take different time to compute the pubkey.

sipa commented at 2:28 am on June 20, 2019: contributor

@elichai There really isn’t any need for that. We have perfectly fine variable time and constant time algorithms. There is also no need to run without tables entirely; just make them small enough so it’s ok (you can very reasonably make them just a few kB or even less). Doing so will still be vastly more efficient than using algorithm with no tables at all.

vhnatyk commented at 6:29 pm on June 20, 2019: none

There is also no need to run without tables entirely; just make them small enough so it’s ok (you can very reasonably make them just a few kB or even less). Doing so will still be vastly more efficient than using algorithm with no tables at all.

Yes, exactly 👍 - the problem I couldn’t make it work with all the defines like ECMULT_WINDOW_SIZE setting to 2 etc. I’ll try to find exactly why it fails on mcu, and if will find out something meaningful will share here, thanx for help 😄

real-or-random removed the label enhancement on Jan 5, 2023

sipa commented at 11:08 am on April 16, 2024: contributor

#1058 will support signing & key generation with a 2 KiB precomputed table.

Low-footprint mode #189