Signed-digit multi-comb for ecmult_gen (by peterdettman) #693

sipa wants to merge 8 commits into bitcoin-core:master from sipa:201911_peterdettman_sdmc, changing 11 files (+391 −116)
  1. sipa commented at 10:41 pm on November 11, 2019: contributor

    This is a rebase of #546. See the original PR for a full description, but in short, this introduces a new constant-time multiplication algorithm with precomputation, with better speed/size tradeoffs. It is more flexible, allowing either better speed with the same table size or a smaller table size at the same speed, and it permits extremely small tables with still reasonable speed.

    In addition to the original PR, this also:

    • Removes the old ecmult algorithm entirely
    • Makes the tunables configurable through configure, and tests a few combinations in Travis.
  2. sipa cross-referenced this on Nov 11, 2019 from issue Signed-digit multi-comb for ecmult_gen by peterdettman
  3. sipa commented at 10:46 pm on November 11, 2019: contributor
    Note that I have not reviewed the multiplication code in detail yet.
  4. gmaxwell commented at 10:49 am on November 13, 2019: contributor

    @sipa do you think it would make more sense to benchmark all the blocks and teeth settings, and then find the efficient frontier for best performance as a function of memory, and then have the setting just offer a couple options along that frontier?

    As this stands, by my count this creates 306 build configurations, which should at least get minimal testing occasionally. Probably just a couple of sizes are sufficient for exposed options. Usage probably comes in “embedded device, as small as possible please”, “embedded device, I care about performance but small please”, “desktop, meh whatever but don’t blow out my cache too bad”, and “signing network service GIVE ME ALL YOUR MEMORY” flavours, and not too many others.

  5. sipa commented at 4:41 pm on November 13, 2019: contributor
    @gmaxwell Yeah, I was thinking of doing something like that. I hoped to reduce the number of combinations by looking at the number of adds/doubles, and removing things that don’t improve on either but have a larger table size. The problem is that the number of cmovs grows exponentially with the teeth settings, and at teeth=8 every point add is preceded by iterating over 8 kB of data to select the right entry, at which point the cmov cost is certainly nontrivial as well.
  6. sipa commented at 4:43 pm on November 13, 2019: contributor

    Anyway, ignoring the costs of scalar operations and cmovs, and counting a point add as 260 ns and a doubling as 159 ns, these are the optimal configurations:

    blocks=1 teeth=2 tblsize=192 adds=127 doubles=127
    blocks=1 teeth=3 tblsize=256 adds=85 doubles=85
    blocks=2 teeth=3 tblsize=512 adds=85 doubles=42
    blocks=1 teeth=4 tblsize=576 adds=63 doubles=63
    blocks=1 teeth=5 tblsize=1024 adds=51 doubles=51
    blocks=2 teeth=4 tblsize=1088 adds=63 doubles=31
    blocks=3 teeth=4 tblsize=1536 adds=65 doubles=21
    blocks=1 teeth=6 tblsize=2048 adds=42 doubles=42
    blocks=2 teeth=5 tblsize=2048 adds=51 doubles=25
    blocks=3 teeth=5 tblsize=3072 adds=53 doubles=17
    blocks=1 teeth=7 tblsize=4096 adds=36 doubles=36
    blocks=2 teeth=6 tblsize=4096 adds=43 doubles=21
    blocks=3 teeth=6 tblsize=6144 adds=44 doubles=14
    blocks=2 teeth=7 tblsize=8192 adds=37 doubles=18
    blocks=3 teeth=7 tblsize=12288 adds=38 doubles=12
    blocks=4 teeth=7 tblsize=16384 adds=39 doubles=9
    blocks=2 teeth=8 tblsize=16448 adds=31 doubles=15
    blocks=3 teeth=8 tblsize=24576 adds=32 doubles=10
    blocks=4 teeth=8 tblsize=32832 adds=31 doubles=7
    blocks=8 teeth=8 tblsize=65600 adds=31 doubles=3
    blocks=16 teeth=8 tblsize=131136 adds=31 doubles=1
    blocks=32 teeth=8 tblsize=262208 adds=31 doubles=0
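
    For reference, counts like these can be reproduced with a small enumeration. This is a sketch, not the script actually used; it assumes spacing = ceil(256/(blocks*teeth)), adds = blocks*spacing - 1, doubles = spacing - 1, cmovs = blocks*spacing*2^(teeth-1), and a table of blocks*2^(teeth-1) 64-byte points (plus one extra point when the comb covers exactly 256 bits), which matches the tables in this thread:

    #include <stdio.h>

    /* Sketch: enumerate comb configurations under the stated cost model
     * (point add = 260 ns, doubling = 159 ns); formulas are assumptions
     * reverse-derived from the numbers posted above. */
    int main(void) {
        for (int teeth = 1; teeth <= 8; teeth++) {
            for (int blocks = 1; blocks <= 256 / teeth; blocks++) {
                int spacing = (256 + blocks * teeth - 1) / (blocks * teeth);
                int adds = blocks * spacing - 1;
                int doubles = spacing - 1;
                long cmovs = (long)blocks * spacing * (1L << (teeth - 1));
                long tblsize = 64L * blocks * (1L << (teeth - 1))
                             + (blocks * teeth * spacing == 256 ? 64 : 0);
                double ns = 260.0 * adds + 159.0 * doubles;
                printf("blocks=%d teeth=%d tblsize=%ld adds=%d doubles=%d cmovs=%ld cost=%.0fns\n",
                       blocks, teeth, tblsize, adds, doubles, cmovs, ns);
            }
        }
        return 0;
    }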

  7. sipa commented at 7:53 pm on November 13, 2019: contributor

    Here is a more extensive list of all potentially-useful combinations, taking number of cmovs into account (and not making any assumptions about the relative costs of adds vs doubles vs cmovs):

    teeth=1 blocks=1 tblsize=128 adds=255 doubles=255 cmovs=256
    teeth=1 blocks=3 tblsize=192 adds=257 doubles=85 cmovs=258
    teeth=1 blocks=5 tblsize=320 adds=259 doubles=51 cmovs=260
    teeth=1 blocks=7 tblsize=448 adds=258 doubles=36 cmovs=259
    teeth=1 blocks=9 tblsize=576 adds=260 doubles=28 cmovs=261
    teeth=1 blocks=11 tblsize=704 adds=263 doubles=23 cmovs=264
    teeth=1 blocks=13 tblsize=832 adds=259 doubles=19 cmovs=260
    teeth=1 blocks=15 tblsize=960 adds=269 doubles=17 cmovs=270
    teeth=1 blocks=19 tblsize=1216 adds=265 doubles=13 cmovs=266
    teeth=1 blocks=29 tblsize=1856 adds=260 doubles=8 cmovs=261
    teeth=1 blocks=37 tblsize=2368 adds=258 doubles=6 cmovs=259
    teeth=1 blocks=43 tblsize=2752 adds=257 doubles=5 cmovs=258

    teeth=2 blocks=1 tblsize=192 adds=127 doubles=127 cmovs=256
    teeth=2 blocks=2 tblsize=320 adds=127 doubles=63 cmovs=256
    teeth=2 blocks=3 tblsize=384 adds=128 doubles=42 cmovs=258
    teeth=2 blocks=4 tblsize=576 adds=127 doubles=31 cmovs=256
    teeth=2 blocks=5 tblsize=640 adds=129 doubles=25 cmovs=260
    teeth=2 blocks=6 tblsize=768 adds=131 doubles=21 cmovs=264
    teeth=2 blocks=7 tblsize=896 adds=132 doubles=18 cmovs=266
    teeth=2 blocks=8 tblsize=1088 adds=127 doubles=15 cmovs=256
    teeth=2 blocks=9 tblsize=1152 adds=134 doubles=14 cmovs=270
    teeth=2 blocks=10 tblsize=1280 adds=129 doubles=12 cmovs=260
    teeth=2 blocks=11 tblsize=1408 adds=131 doubles=11 cmovs=264
    teeth=2 blocks=12 tblsize=1536 adds=131 doubles=10 cmovs=264
    teeth=2 blocks=13 tblsize=1664 adds=129 doubles=9 cmovs=260
    teeth=2 blocks=15 tblsize=1920 adds=134 doubles=8 cmovs=270
    teeth=2 blocks=16 tblsize=2112 adds=127 doubles=7 cmovs=256
    teeth=2 blocks=19 tblsize=2432 adds=132 doubles=6 cmovs=266
    teeth=2 blocks=22 tblsize=2816 adds=131 doubles=5 cmovs=264
    teeth=2 blocks=26 tblsize=3328 adds=129 doubles=4 cmovs=260
    teeth=2 blocks=32 tblsize=4160 adds=127 doubles=3 cmovs=256
    teeth=2 blocks=43 tblsize=5504 adds=128 doubles=2 cmovs=258
    teeth=2 blocks=64 tblsize=8256 adds=127 doubles=1 cmovs=256
    teeth=2 blocks=128 tblsize=16448 adds=127 doubles=0 cmovs=256

    teeth=3 blocks=1 tblsize=256 adds=85 doubles=85 cmovs=344
    teeth=3 blocks=2 tblsize=512 adds=85 doubles=42 cmovs=344
    teeth=3 blocks=3 tblsize=768 adds=86 doubles=28 cmovs=348
    teeth=3 blocks=4 tblsize=1024 adds=87 doubles=21 cmovs=352
    teeth=3 blocks=5 tblsize=1280 adds=89 doubles=17 cmovs=360
    teeth=3 blocks=6 tblsize=1536 adds=89 doubles=14 cmovs=360
    teeth=3 blocks=7 tblsize=1792 adds=90 doubles=12 cmovs=364
    teeth=3 blocks=8 tblsize=2048 adds=87 doubles=10 cmovs=352
    teeth=3 blocks=9 tblsize=2304 adds=89 doubles=9 cmovs=360
    teeth=3 blocks=10 tblsize=2560 adds=89 doubles=8 cmovs=360
    teeth=3 blocks=11 tblsize=2816 adds=87 doubles=7 cmovs=352
    teeth=3 blocks=13 tblsize=3328 adds=90 doubles=6 cmovs=364
    teeth=3 blocks=15 tblsize=3840 adds=89 doubles=5 cmovs=360
    teeth=3 blocks=18 tblsize=4608 adds=89 doubles=4 cmovs=360
    teeth=3 blocks=22 tblsize=5632 adds=87 doubles=3 cmovs=352
    teeth=3 blocks=29 tblsize=7424 adds=86 doubles=2 cmovs=348
    teeth=3 blocks=43 tblsize=11008 adds=85 doubles=1 cmovs=344
    teeth=3 blocks=86 tblsize=22016 adds=85 doubles=0 cmovs=344

    teeth=4 blocks=1 tblsize=576 adds=63 doubles=63 cmovs=512
    teeth=4 blocks=2 tblsize=1088 adds=63 doubles=31 cmovs=512
    teeth=4 blocks=3 tblsize=1536 adds=65 doubles=21 cmovs=528
    teeth=4 blocks=4 tblsize=2112 adds=63 doubles=15 cmovs=512
    teeth=4 blocks=5 tblsize=2560 adds=64 doubles=12 cmovs=520
    teeth=4 blocks=6 tblsize=3072 adds=65 doubles=10 cmovs=528
    teeth=4 blocks=7 tblsize=3584 adds=69 doubles=9 cmovs=560
    teeth=4 blocks=8 tblsize=4160 adds=63 doubles=7 cmovs=512
    teeth=4 blocks=10 tblsize=5120 adds=69 doubles=6 cmovs=560
    teeth=4 blocks=11 tblsize=5632 adds=65 doubles=5 cmovs=528
    teeth=4 blocks=13 tblsize=6656 adds=64 doubles=4 cmovs=520
    teeth=4 blocks=16 tblsize=8256 adds=63 doubles=3 cmovs=512
    teeth=4 blocks=22 tblsize=11264 adds=65 doubles=2 cmovs=528
    teeth=4 blocks=32 tblsize=16448 adds=63 doubles=1 cmovs=512
    teeth=4 blocks=64 tblsize=32832 adds=63 doubles=0 cmovs=512

    teeth=5 blocks=1 tblsize=1024 adds=51 doubles=51 cmovs=832
    teeth=5 blocks=2 tblsize=2048 adds=51 doubles=25 cmovs=832
    teeth=5 blocks=3 tblsize=3072 adds=53 doubles=17 cmovs=864
    teeth=5 blocks=4 tblsize=4096 adds=51 doubles=12 cmovs=832
    teeth=5 blocks=5 tblsize=5120 adds=54 doubles=10 cmovs=880
    teeth=5 blocks=6 tblsize=6144 adds=53 doubles=8 cmovs=864
    teeth=5 blocks=7 tblsize=7168 adds=55 doubles=7 cmovs=896
    teeth=5 blocks=8 tblsize=8192 adds=55 doubles=6 cmovs=896
    teeth=5 blocks=9 tblsize=9216 adds=53 doubles=5 cmovs=864
    teeth=5 blocks=11 tblsize=11264 adds=54 doubles=4 cmovs=880
    teeth=5 blocks=13 tblsize=13312 adds=51 doubles=3 cmovs=832
    teeth=5 blocks=18 tblsize=18432 adds=53 doubles=2 cmovs=864
    teeth=5 blocks=26 tblsize=26624 adds=51 doubles=1 cmovs=832
    teeth=5 blocks=52 tblsize=53248 adds=51 doubles=0 cmovs=832

    teeth=6 blocks=1 tblsize=2048 adds=42 doubles=42 cmovs=1376
    teeth=6 blocks=2 tblsize=4096 adds=43 doubles=21 cmovs=1408
    teeth=6 blocks=3 tblsize=6144 adds=44 doubles=14 cmovs=1440
    teeth=6 blocks=4 tblsize=8192 adds=43 doubles=10 cmovs=1408
    teeth=6 blocks=5 tblsize=10240 adds=44 doubles=8 cmovs=1440
    teeth=6 blocks=6 tblsize=12288 adds=47 doubles=7 cmovs=1536
    teeth=6 blocks=8 tblsize=16384 adds=47 doubles=5 cmovs=1536
    teeth=6 blocks=9 tblsize=18432 adds=44 doubles=4 cmovs=1440
    teeth=6 blocks=11 tblsize=22528 adds=43 doubles=3 cmovs=1408
    teeth=6 blocks=15 tblsize=30720 adds=44 doubles=2 cmovs=1440
    teeth=6 blocks=22 tblsize=45056 adds=43 doubles=1 cmovs=1408
    teeth=6 blocks=43 tblsize=88064 adds=42 doubles=0 cmovs=1376

    teeth=7 blocks=1 tblsize=4096 adds=36 doubles=36 cmovs=2368
    teeth=7 blocks=2 tblsize=8192 adds=37 doubles=18 cmovs=2432
    teeth=7 blocks=3 tblsize=12288 adds=38 doubles=12 cmovs=2496
    teeth=7 blocks=4 tblsize=16384 adds=39 doubles=9 cmovs=2560
    teeth=7 blocks=5 tblsize=20480 adds=39 doubles=7 cmovs=2560
    teeth=7 blocks=8 tblsize=32768 adds=39 doubles=4 cmovs=2560
    teeth=7 blocks=10 tblsize=40960 adds=39 doubles=3 cmovs=2560
    teeth=7 blocks=13 tblsize=53248 adds=38 doubles=2 cmovs=2496
    teeth=7 blocks=19 tblsize=77824 adds=37 doubles=1 cmovs=2432
    teeth=7 blocks=37 tblsize=151552 adds=36 doubles=0 cmovs=2368

    teeth=8 blocks=1 tblsize=8256 adds=31 doubles=31 cmovs=4096
    teeth=8 blocks=2 tblsize=16448 adds=31 doubles=15 cmovs=4096
    teeth=8 blocks=3 tblsize=24576 adds=32 doubles=10 cmovs=4224
    teeth=8 blocks=4 tblsize=32832 adds=31 doubles=7 cmovs=4096
    teeth=8 blocks=5 tblsize=40960 adds=34 doubles=6 cmovs=4480
    teeth=8 blocks=6 tblsize=49152 adds=35 doubles=5 cmovs=4608
    teeth=8 blocks=7 tblsize=57344 adds=34 doubles=4 cmovs=4480
    teeth=8 blocks=8 tblsize=65600 adds=31 doubles=3 cmovs=4096
    teeth=8 blocks=11 tblsize=90112 adds=32 doubles=2 cmovs=4224
    teeth=8 blocks=16 tblsize=131136 adds=31 doubles=1 cmovs=4096
    teeth=8 blocks=32 tblsize=262208 adds=31 doubles=0 cmovs=4096

  8. gmaxwell commented at 1:28 am on November 14, 2019: contributor

    Here is the efficient frontier I found running bench_sign on an AMD 7742 (using sha256+sha-ni as the nonce function).

    Blocks Teeth bytes µs Sigs/core/sec
    1 1 64 157 6369
    1 2 128 87.7 11403
    1 3 256 65.3 15314
    1 4 512 54.0 18519
    1 5 1024 48.6 20576
    3 4 1536 47.5 21053
    2 5 2048 44.1 22676
    3 5 3072 43.6 22936
    4 5 4096 41.7 23981
    3 6 6144 41.1 24331
    4 6 8192 39.9 25063
    9 6 18432 39.3 25445
    11 6 22528 38.7 25840
    22 6 45056 38.4 26042
    43 6 88064 38.1 26247
    37 7 151552 37.9 26385

    Here are all the configurations visualized:

    image

    Most points off the frontier aren’t even close. The current 64kb table is 43.7µs on my test setup.

    I think the logical options are {11,6} (22528b), {2,5} (2048b) and then one of the smallest three options. I would say {1,3} (256b), but I assume that {1,1} (64b) has the benefit that a considerable amount of code could also be omitted, which might understate its space advantage. Presumably only someone who was extremely memory starved would consider using something smaller than {2,5}.

    One possibility would be to just offer the 22kb choice and 2kb choice and leave it to the user to pick their best trade-off if they really want to go smaller, with a copy of these benchmark numbers in some integration documentation.

    You also might want to consider how this changes with endomorphism, since after next year the library can probably drop the non-endomorphism configuration. Under the old algorithm endomorphism would halve the table size for only slightly less performance.

    I might also want to run the figures for schnorr signing, since the constant time modular inversion of K is probably a big fraction of the total time in these figures and it doesn’t exist there. The extra constant time chunk diminishes the magnitude of the speedup. Edit: LOL, yes, it’s a 1.4x speedup to avoid that inversion.

    Maybe someone who wants to use this library to sign on a very memory constrained device might want to comment? The code itself is some tens of kilobytes, so I don’t know how much anyone cares about saving a hundred bytes in exchange for running at a tiny fraction of the speed. I wouldn’t be surprised if the slower configurations were also somewhat easier to attack with differential power analysis.

  9. gmaxwell commented at 2:28 am on November 14, 2019: contributor

    How small does a table need to be before it’s reasonable to include in the source explicitly? Not having to run gen_context at build time would be nice.

    In #614 I considered 1024 bytes obviously OK.

  10. gmaxwell commented at 5:40 am on November 14, 2019: contributor

    Just for fun, I ran the same test using a more optimized signer that doesn’t use any slow inversions.

    The efficient frontier is slightly different, but I think my suggested options would be the same.

    Blocks Teeth bytes µs Sigs/core/sec
    1 1 64 141 7092
    1 2 128 73.3 13643
    1 3 256 50.7 19724
    1 4 512 39.3 25445
    1 5 1024 33.5 29851
    3 4 1536 32.3 30960
    2 5 2048 28.9 34602
    3 5 3072 28.4 35211
    4 5 4096 26.6 37594
    3 6 6144 26.4 37879
    4 6 8192 25.0 40000
    5 6 10240 24.9 40161
    9 6 18432 24.3 41152
    11 6 22528 23.7 42194
    22 6 45056 23.4 42735
    43 6 88064 22.5 44444

    image

  11. gmaxwell commented at 11:34 pm on November 15, 2019: contributor
    @douglasbakkum @jonasschnelli @ddustin @benma you’ve commented before about memory usage in signing.
  12. sipa commented at 8:14 pm on November 18, 2019: contributor

    @gmaxwell I’m not sure what the best strategy is:

    • Just leaving comb/teeth individually configurable by the build system
    • Providing a number of presets (small/medium/large)
    • Letting the configuration control the actual size (e.g. --ecmult-gen-table-size=N, which guarantees a table size of at most (1 << N) bytes).

    It’d be very useful to hear from people on embedded systems how much they care about small table sizes.

  13. gmaxwell commented at 4:46 am on November 19, 2019: contributor

    I dislike leaving comb/teeth configurable in the build system. It would increase the number of build configurations that the project should be testing by hundreds of times for no clear gain…

    Plus if the project can’t figure out how to set them correctly, it’s doubtful that someone just using the build system is going to know what to do.

    I wouldn’t see any problem with having hidden options to override– no promises made that those configurations were in any way tested.

    Obviously I’m a fan of small number of presets but there needs to be feedback on which options people want. I’m hoping to hear something like the size of the library itself means that no one cares too much about 2048 bytes vs 256 bytes vs 64 bytes– if so then choosing a small option will be much easier.

    Taking a size and solving for the best value that fits would be my second choice… but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn’t be monotone with the size… so I don’t think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

  14. ddustin commented at 6:33 pm on November 20, 2019: none

    Excited that this is getting worked on!

    Maybe someone who wants to use this library to sign on a very memory constrained device might want to comment? The code itself is some tens of kilobytes, so I don’t know how much anyone cares about saving a hundred bytes in exchange for running at a tiny fraction of the speed. I wouldn’t be surprised if the slower configurations were also somewhat easier to attack with differential power analysis.

    If the code itself is tens of kilobytes then making it small may be a moot issue. What’s making the code so big – is there any low-hanging fruit for shrinking it down?

    I’ve been using the mbed tls library: https://tls.mbed.org/api/group__encdec__module.html

    The main perk there being it uses probably under 1 kilobyte of ram in total – though it is super slow. It also doesn’t seem to get the attention that secp256k1 gets, so perhaps some vulnerabilities could slip through the cracks.

    I had a hell of a time figuring out that I needed to flip negative values to their positive variants in some places – a task that libsecp256k1 mostly handles for you automatically. So perhaps there’s an argument for making libsecp256k1 more embedded-friendly from an ease-of-use standpoint. I don’t really know what’s up with the project other than I guess ARM bought it? It actually looks fairly active: https://github.com/ARMmbed/mbedtls

    Surprisingly in the last few years some chips have been adding hardware optimized curves. Though support, documentation, etc can be sparse to non-existent.

  15. ddustin commented at 6:37 pm on November 20, 2019: none

    Taking a size and solving for the best value that fits would be my second choice… but only because it results in a testing headache. Things like the peak stack usage and other behavioural differences wouldn’t be monotone with the size… so I don’t think e.g. testing the smallest and largest would be sufficient. Otherwise I would probably prefer this one.

    While adjusting the size for speed with granularity would be nice, I don’t think it’s ever really going to be needed. Steps would be great though – xlarge, large, mediumlarge, medium, mediumsmall, small, xtrasmall, tiny would be plenty.

    Often you end up in this situation where you’re just out of memory and you’re hunting for things to make smaller. Having an option for that would be great – but I can’t imagine a scenario where it needs to be granular.

  16. gmaxwell commented at 4:50 am on November 25, 2019: contributor

    @ddustin The library isn’t optimized for code size: Correctness, testability, performance, safe interfaces…, sure.

    To some extent there are size/performance trade-offs:

    • Things like the constant-time inverse and the sqrt could be implemented using a small state machine, instead of an entirely unrolled function.
    • Instead of 5 different field element normalize functions for different combinations of strength and constant-timeness, we could have just one that was the strongest and constant time.
    • Instead of implementing things that give big speedups like WNAF but take a bit of code … we could … just not do that.
    • Rather than having an overcomplete field with carefully hand-optimized multiplies and squares, we could have just a plain bignum with schoolbook multiplies and squaring accomplished using the multiply instruction…

    On a desktop computer, all these decisions favouring speed at the expense of size are almost certainly good ones– in fact it would probably be well worth it to increase the code size another 50% if it made it a few percent faster. Even on a mobile phone or similar, the same applies. On some tiny embedded device? Probably not– it depends on the application. Of course, these devices are also slow– so a tradeoff that made it 50% slower just to save a few hundred bytes of code probably wouldn’t be a win.

    Some of these tradeoffs like the multi-fold normalizes could easily be made ifdef switchable between existing code and are also not that huge a performance cost. Other tradeoffs would be big performance hits or would require extensive rewrites. But it’s hard to justify expending effort doing that, even where its easy, without a lot of help or at least feedback from someone that really cares.

    Otherwise, it’s probably more interesting to spend time on additional validation, or on other optimizations that are useful on big desktop/server CPUs – things like ADX or AVX2-FMA or GMP-less fast jacobi/inverse – because there the objective is clearer: is the speedup enough to justify the effort of testing and reviewing it? :) Or on implementing additional cryptographic functions, since those at least enable new and interesting applications.

    Some things could probably be made a lot smaller without being any slower too (I think the aforementioned constant time sqrt and inverses are likely candidates), but just no one has tried.

    I think, to the extent that this project is active at all, the contributors to it are interested in working with people to make size optimizations, at least ones that don’t come at a substantial cost to other applications. But without at least people who care about them helping to drive the tradeoffs, I wouldn’t expect any to happen quickly.

  17. ddustin commented at 5:18 pm on December 2, 2019: none

    @gmaxwell That all makes sense. Targeting super small chips may be foolish.

    Medium size chips might be interesting. I’ve been throwing around the idea of an embedded full node that is as cheap and low-power as possible. In that situation power usage is probably the main concern, though turning on flash cells also takes power.

    As kind of an aside to all of this – I wonder if having an easy way to throw in backend hardware optimized curves would be worthwhile. Keep the safe interfaces while leveraging hardware optimization where it’s available. Also hardware optimized curves are just, well, cool!

  18. real-or-random commented at 1:48 pm on February 25, 2020: contributor
    Needs rebase. I think we should push this forward, also because it is the cleanest way to solve #692, and any other work to solve #692 will be wasted if we want to merge this here anyway.
  19. sipa force-pushed on Jul 29, 2020
  20. sipa commented at 2:43 am on July 29, 2020: contributor

    Rebased. I’ve left the configurability for now - we can decide on a set of supported configurations later if all tests still pass.

    I also noticed that COMB_NEGATION wasn’t configurable before; added that.

  21. sipa commented at 6:58 am on July 29, 2020: contributor
    @gmaxwell If you feel inclined, there are now 2x more configurations to choose from (with COMB_NEGATION=0, the table size is doubled but negations are avoided… I’m not sure that’s ever a good tradeoff, but with benchmarks we could tell for sure).
  22. gmaxwell commented at 9:30 am on July 29, 2020: contributor
    Sounds fun. I can’t imagine disabling negation will ever help, but let’s see. :)
  23. real-or-random commented at 9:51 am on July 29, 2020: contributor
    Mentioning @peterdettman here, because I’m not sure he’s aware that people are working on this.
  24. peterdettman commented at 1:02 pm on July 29, 2020: contributor
    Yep, was aware, but it’s in capable hands. Regarding the optional COMB_NEGATION, I recall I was thinking more about power analysis than performance.
  25. sipa force-pushed on Jul 30, 2020
  26. sipa commented at 0:25 am on July 30, 2020: contributor
    Rebased.
  27. sipa force-pushed on Jul 31, 2020
  28. sipa commented at 7:07 am on July 31, 2020: contributor
    @peterdettman In what way is avoiding negations useful for power analysis? If there isn’t a good reason to support it, I’d rather remove it from the codebase so there are fewer combinations to test and maintain.
  29. gmaxwell commented at 7:12 am on July 31, 2020: contributor

    image

    Neg Blocks Teeth bytes Sigs/core/sec
    1 1 1 128 1192
    1 1 2 192 2278
    1 1 3 256 3257
    1 2 3 512 3759
    1 1 4 576 4149
    1 1 5 1024 4762
    1 3 4 1536 4878
    1 2 5 2048 5464
    1 3 5 3072 5587
    1 2 6 4096 5988
    1 3 6 6144 6135
    1 4 6 8192 6410
    1 9 6 18432 6494
    1 11 6 22528 6667
    1 22 6 45056 6711
    1 19 7 77824 6803
    1 43 6 88064 6944
    1 37 7 151552 6993
    0 1 1 192 1183
    0 1 2 320 2268
    0 1 3 512 3215
    0 2 3 1024 3690
    0 1 4 1088 4000
    0 1 5 2048 4587
    0 2 4 2112 4608
    0 3 4 3072 4739
    0 2 5 4096 5236
    0 3 5 6144 5319
    0 4 5 8192 5618
    0 4 6 16384 5848
    0 11 6 45056 5917
    0 22 6 90112 6098
    0 43 6 176128 6250
  30. peterdettman commented at 8:24 am on July 31, 2020: contributor

    @peterdettman In what way is avoiding negations useful for power analysis? If there isn’t a good reason to support it, I’d rather remove it from the codebase so there are fewer combinations to test and maintain.

    See e.g. One&Done: A Single-Decryption EM-Based Attack on OpenSSL’s Constant-Time Blinded RSA, by Alam et al.

    The code for calculating windows in this PR was written to somewhat obscure the bits during window assembly by keeping more bits involved at each step (but it’s not perfect). OTOH, the sign bit is isolated and turned into an all 0s or all 1s value. The paper is about attacking RSA, so negation of lookup values doesn’t arise, but I think it’s reasonable to extrapolate that isolating the sign bit and using it as a mask should be visible in the same way.

    It’s possible to conceive of mitigations. e.g. the precomputed values could be possibly-pre-negated according to a random mask (as part of blinding), then the sign bit could be XORed with that mask before being isolated.
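
    For illustration, here is a toy sketch of that mitigation idea (hypothetical names; plain two's-complement integers stand in for the field elements the real code would negate):

    #include <stdint.h>

    /* Toy sketch: the table entry was stored pre-negated according to the random
     * neg_mask chosen at blinding time, so the secret sign bit is only ever
     * handled XORed with that mask, and the all-0s/all-1s value derived from it
     * is itself blinded. Illustrative only, not code from the PR. */
    static uint64_t conditional_negate(uint64_t table_y, uint32_t sign, uint32_t neg_mask) {
        uint32_t blinded = (sign ^ neg_mask) & 1;  /* looks random to an observer */
        uint64_t m = -(uint64_t)blinded;           /* all 0s or all 1s */
        return (table_y ^ m) - m;                  /* two's-complement negation iff blinded == 1 */
    }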

    EDIT: Removed a section mistakenly based on the scalar blinding cost being related to precomp size; that’s only true with #570.

  31. peterdettman commented at 9:29 am on July 31, 2020: contributor
    @gmaxwell More details on the arm8 processor (32/64 bit?, clock?, L1?) please. Do you have an intuition (or of course, actual data) about how the relative cost of table cmovs and group operations changes with… well, whichever processor features affect it?
  32. gmaxwell commented at 9:47 am on July 31, 2020: contributor

    Odroid C2, this test is using your safe-gcd and a faster nonce function. https://wiki.odroid.com/odroid-c2/odroid-c2

    A53, 32 KB L1, 512 KB L2, 1.5 GHz. The A53 tends to be pretty sensitive to instruction scheduling since it’s in-order but dual-issue, so compiler/optimizations probably add a lot.

    Here is a bench_internal number and the raw numbers from all the configurations I tested: https://0bin.net/paste/A-pyLrge-fMJpJdQ#FH7EPr1-cePlFUTayXw+8j0u++7ApQPvqXImRMCQ24y

    I collected new epyc numbers with the same setup and tomorrow I’ll get 32-bit arm.

  33. gmaxwell commented at 12:14 pm on July 31, 2020: contributor

    image

    Neg Blocks Teeth bytes Sigs/core/sec
    1 1 1 128 6897
    1 1 2 192 12937
    1 1 3 256 18182
    1 2 3 512 21097
    1 1 4 576 22573
    1 1 5 1024 26385
    1 3 4 1536 26882
    1 2 5 2048 29851
    1 3 5 3072 30303
    1 4 5 4096 32051
    1 4 6 8192 32468
    1 9 5 9216 32573
    1 13 5 13312 33898
    1 26 5 26624 34364
    1 52 5 53248 34722
    1 43 6 88064 35336
    0 1 1 192 6803
    0 1 2 320 12887
    0 1 3 512 17986
    0 2 3 1024 20790
    0 1 4 1088 21882
    0 1 5 2048 24570
    0 2 4 2112 24938
    0 3 4 3072 25840
    0 2 5 4096 28090
    0 3 5 6144 28409
    0 4 5 8192 30120
    0 4 6 16384 30211
    0 9 5 18432 30395
    0 13 5 26624 31646
    0 26 5 53248 32051
    0 22 6 90112 32154
    0 52 5 106496 32362
    0 43 6 176128 32787

    Bench internal and raw results: https://0bin.net/paste/alMVeDUrJMbbM3sl#nhPKL6VE7PPrZzUAFFeOYlCXBFN2IUcDyy4xor1lhwu

    (this is faster than my earlier result because it’s using safe-gcd, and slower than the second result on this system because it’s not using sha-ni and a blinded variable time invert)

  34. sipa commented at 9:23 pm on August 4, 2020: contributor

    @peterdettman I’m happy to take side-channel resistance into account, but I feel this is too low level a tweak to be exposed as a compilation option. If we think that with COMB_NEGATION enabled there is a particularly serious risk of information leak, perhaps it should always be off. If it’s not specifically worse than the risk of leaks in the codebase already, I’d rather have it always on.

    Put otherwise: if it’s going to be a compilation option, who do we expect to turn it on/off?

  35. gmaxwell commented at 11:30 pm on August 4, 2020: contributor

    @sipa I think that if the negation has a differential power/EM sidechannel, it will probably need to be solved more generally – the constant-time doubling and addition both have field negates, and the cmovs have the same kind of structure that paper implicates. But that’s also work that’s basically impossible to do without a measurement setup.

    If keeping negation around in the codebase isn’t too much of a maintenance burden then it might be a useful piece of test fodder for someone once they do have a sidechannel measurement setup working. … but that doesn’t mean exposing it.

  36. sipa cross-referenced this on Aug 15, 2020 from issue SECP256K1_BUILD sanity checking and USE_FIELD_INV_BUILTIN/USE_SCALAR_INV_BUILTIN simplification. by gmaxwell
  37. sipa force-pushed on Jan 26, 2021
  38. sipa commented at 0:09 am on January 26, 2021: contributor
    Rebased, removed the optional no-negation option. Made a few changes to make sure all commits compile with and without static precomputation, and also added a mention in README.md.
  39. real-or-random commented at 9:26 am on January 28, 2021: contributor
    I know Travis got stuck but still the ubsan job reported an error here: https://travis-ci.org/github/bitcoin-core/secp256k1/jobs/756179145#L439
  40. sipa force-pushed on Jan 28, 2021
  41. sipa cross-referenced this on Jan 29, 2021 from issue Rescale initial point before every ecmult_gen by real-or-random
  42. sipa force-pushed on Jan 30, 2021
  43. sipa commented at 4:06 am on January 30, 2021: contributor
    Rebased on top of #864 (and then on master).
  44. sipa force-pushed on Jan 30, 2021
  45. sipa commented at 8:04 am on January 30, 2021: contributor
    @peterdettman I’ve changed all the internal variables in secp256k1_ecmult_gen from int to uint32_t, as the old code triggered left-shift outside of the range of signed types. I believe this has no effect.
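    (For context, the kind of hazard being avoided: shifting a bit into or past the sign bit of a signed int is undefined behaviour in C, while the same shift on uint32_t is well defined. A minimal illustration, not code from the PR:)

    #include <stdint.h>

    int bad(void)       { int x = 1;      return x << 31; }   /* undefined behaviour for 32-bit int */
    uint32_t good(void) { uint32_t x = 1; return x << 31; }   /* well defined: 0x80000000 */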
  46. sipa force-pushed on Jan 30, 2021
  47. sipa commented at 10:51 pm on January 30, 2021: contributor

    @peterdettman Any particular reason why in the COMB_BITS==256 case a 257-bit recoding is used rather than a 256-bit one (adding 2^256 inside the mod N, rather than outside)? It’s certainly easier, and maybe the reasoning is that the COMB_BITS==256 case just isn’t really worth optimizing for.

    EDIT: I see, that’s done through the “offset” to tweak the initial value; the top bit of the recoding isn’t actually used during the addition loop.

  48. Signed-digit multi-comb for ecmult_gen
    - see section 3.3 of https://eprint.iacr.org/2012/309
    ad341610af
  49. Support static precomputation with multi-comb e0d7b1229e
  50. Add comments for context offset e81a78eba8
  51. Reduce side-channels from single-bit reads 0f67709e07
  52. Avoid unnecessary doublings in precomputation ad91bd9bd4
  53. Add precompiler guards on comb constants 1b45383449
  54. Remove old ecmult_gen code and always use multi-comb 63d2cf53f0
  55. Make COMB_BLOCKS and COMB_TEETH configurable, and test in Cirrus 937a8c270f
  56. sipa force-pushed on Mar 18, 2021
  57. sipa commented at 0:51 am on March 18, 2021: contributor
    Rebased, but I still intend to write an explanation of the algorithm.
  58. sipa cross-referenced this on Mar 18, 2021 from issue Automatically check that VERIFY_CHECKs are side effect free by sipa
  59. sipa cross-referenced this on Apr 13, 2021 from issue Clean up configuration in gen_context by real-or-random
  60. real-or-random commented at 2:48 pm on August 25, 2021: contributor

    What’s the status here? Concretely, I’m wondering whether it makes sense to work on a PR that sticks with the existing gen_context code but prebuilds the table and adds it to the repo, in the same way as in #956. Then we would get rid of the common cross-compilation issues entirely, and we’d be in a position to reconsider some things about contexts (flags, etc.)

    Pro: I could get this done pretty quickly. It’s mostly changes in the build system and we have the boilerplate in #956 already.

    Con: This would conflict a lot with the changes in this PR.

  61. sipa commented at 2:54 pm on August 25, 2021: contributor
    I’m ok with moving forward with making the ecmult_gen context compile-time/precomputed only, and then rebasing this.
  62. real-or-random cross-referenced this on Dec 23, 2021 from issue Try a non-uniform group law (e.g., for ecmult_gen)? by real-or-random
  63. sipa commented at 5:35 pm on December 26, 2021: contributor

    I’m reworking this, and am considering a few changes. In particular:

    • The recoding operation needs to find (input+2^bits-1)/2. If we store a precomputed scalar offset = 2^bits-1-blind in the context (instead of blind directly), the computation becomes just recode = (input+offset)/2. This avoids the need for COMB_OFFSET (and the associated pre-offset initial point). It also avoids the need for the 288 limit on blocks*teeth*spacing. (A toy numeric check of this recoding identity follows this list.)
    • With the recoding being simply an add+halving, it’s perhaps easier to just implement it directly on a scalar with a new secp256k1_scalar_half, avoiding the need for a specialized 8*32bit representation for the recoded number.
    • We can now just initialize r to INFINITY, and instead add blind*G (also precomputed in the context) at the end. The first table point addition can become a fetch instead, saving a point addition - so in terms of point additions the result is identical.
    • With the point unblinding at the end, we leave the door open for #1051-like approaches, as it lets us reason about potential doublings/cancellations in intermediary additions.
    • With the first table lookup not being an addition, we can use an arbitrary corresponding jacobian coordinate for its fetching, giving a place for performing projective blinding. @peterdettman Thoughts?
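
    For concreteness, a toy, self-contained check of the recoding identity behind the first bullet (illustrative C with an 8-bit "order" and plain integers, not the PR's scalar code): with d = (input + 2^BITS - 1 - blind)/2 (mod n), reading each bit b_i of d as the signed digit 2*b_i - 1 gives sum_i (2*b_i - 1)*2^i == input - blind (mod n), so adding blind*G at the end of the comb recovers input*G.

    #include <stdint.h>
    #include <stdio.h>

    #define BITS 8
    static const int64_t N = 251; /* toy odd "group order" */

    static int64_t md(int64_t x) { return ((x % N) + N) % N; }
    static int64_t half(int64_t x) { return (x % 2 == 0) ? x / 2 : md((x + N) / 2); }

    int main(void) {
        int64_t input = 200, blind = 77;
        int64_t d = half(md(input + (1 << BITS) - 1 - blind));  /* recode = (input+offset)/2 */
        int64_t sum = 0;
        for (int i = 0; i < BITS; i++) {
            sum = md(sum + (((d >> i) & 1) ? 1 : -1) * (1 << i)); /* signed digits of d */
        }
        printf("recovered=%lld expected=%lld\n", (long long)md(sum + blind), (long long)md(input));
        return 0;
    }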
  64. peterdettman commented at 6:15 pm on December 26, 2021: contributor

    @sipa The recoding is into 9*32bit, not 8. This is also why it can’t just be a scalar. It’s also the reason for the 288 limit. It’s true that only 1 bit is ever set in the highest limb of the recoding, but it simplified the core loop to just be able to read into that limb instead of treating it specially.

    COMB_OFFSET is a specific optimization to let the comb cover exactly 256 bits since that has nice factors; there’s still an implied extra bit for 257 total which is why we need the initial value in that case. If it can be merged with the blind somehow, that’s all good.

    I’m a bit unsure of the rest since it’s been a while. Concepts around initial point and where adjustments are done sound fine.

  65. sipa commented at 6:17 pm on December 26, 2021: contributor
    @peterdettman If there is a benefit to the 9*32 representation that could easily be kept, but I think it’s only useful for COMB_GROUPED configurations really, and none of those appear to be among the best candidates.
  66. peterdettman commented at 6:18 am on December 27, 2021: contributor

    @sipa The comb is always larger than 256 bits (although only 1 higher bit is ever set), e.g. the current default config covers 260 bits. The inner loop of _ecmult_gen collects the “teeth” at each position: bit = recoded[bit_pos >> 5] >> (bit_pos & 0x1F);

    e.g. bit_pos is in the range [0, 260) for the default config. So having the 9*32 representation avoids dealing specially with the bit positions >= 256.
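
    For orientation, a rough sketch of that window assembly (not the PR's actual code; it assumes the bit for block b, tooth t at comb offset pos sits at index (b*teeth + t)*spacing + pos):

    #include <stdint.h>

    /* Gather one teeth-bit window from the 9*32-bit recoding. Because recoded[]
     * has 9 limbs (288 bits), a bit_pos slightly above 255 needs no special
     * handling. The window is later split into a table index and a sign, as
     * described in the PR. */
    static uint32_t gather_window(const uint32_t recoded[9], unsigned teeth,
                                  unsigned spacing, unsigned block, unsigned pos) {
        uint32_t bits = 0;
        for (unsigned t = 0; t < teeth; t++) {
            unsigned bit_pos = (block * teeth + t) * spacing + pos;
            bits |= ((recoded[bit_pos >> 5] >> (bit_pos & 0x1F)) & 1u) << t;
        }
        return bits;
    }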

    It may be possible to always use an offset point (secp256k1_ecmult_gen_context.offset) to account for the high bit (currently only done in the case of a comb config that covers exactly 256 bits) and just a scalar for the remaining bits. It would require dealing with those high comb positions somehow - probably just some const-time bit-twiddling can ensure we always get 0 when bit_pos >= 256:

    valid_mask = (bit_pos - 256) >> 31;
    valid_bit_pos = bit_pos & valid_mask;
    bit = secp256k1_scalar_get_bit(&recoded, valid_bit_pos) & valid_mask;
    

    Depending on the comb configuration that is probably only really needed for the highest block and maybe only the highest tooth of that block.

    Just to be clear: unless the offset point is merged with other initialization/finalization there is effectively an extra addition incurred which the comb usually does “for free”. Also if the comb reads the high-bit as ‘0’ it’s subtracting 2^(bits-1) instead of adding, so the offset would then actually be 2^bits instead.

    I think that COMB_GROUPED should be removed. It only makes possible sense for 4 or 8 teeth and the fast options are actually 5,6,7. Even at 4 or 8 (even 2), using spacing of 1 would be extravagant use of memory, since you could halve that usage at the cost of a single doubling per sig. Removing it will simplify dealing with the above “high-tooth” issue.

    Once that is all resolved, then the ideas in #693 (comment) seem reasonable things to investigate.

  67. sipa commented at 10:32 pm on December 27, 2021: contributor

    @peterdettman See https://github.com/sipa/secp256k1/commits/202112_sdmc

    I think the design decisions fall into three questions:

    • Additive blinding of the scalar needs to be compensated by something.
      • The current PR does that using an initial value, ctx->initial.
      • Using an initial value has the downside of not being able to reason about the ranges of DLs of intermediary results. By changing to a ctx->final instead (of the would-be ctx->initial value, but divided by 2^(COMB_SPACING-1)) this becomes possible. This adds a point addition, but since it makes the initial value infinity, the very first table lookup can avoid an addition too, so this change doesn’t affect the number of group operations.
    • We need to find the bits of a value equivalent to (input + 2^COMB_BITS - 1) / 2 (mod order).
      • The current PR achieves that by computing the bits of (input - 1)/2 + 2^(COMB_BITS-1), and if COMB_BITS ends up in a 32-bit limb by itself (the COMB_OFFSET optimization), instead moving its effect to ctx->initial, and using (input - 1)/2 instead.
      • Alternatively, it is possible to operate on the bits of (input + 2^COMB_BITS - 1) / 2 (mod order) directly. I find this slightly easier to explain as it avoids special-casing the COMB_BITS==256 situation, but it also has a few very weak advantages:
        • It enables an optimization where the final addition in a block, if it spans the 256th bit, can use a smaller range for the index cmov lookup (because the actual bits used never exceed 256).
        • It permits a blinding-free version of the code with one fewer point addition (not implemented, and I don’t plan to), even when COMB_BITS==256, because it would result in initial/final both infinity.
    • There is the question of how to organize the computed bits:
      • The current PR uses the uint32_t recoded[9] approach, as it needs more than 256 bits to store the bits. Even when there are only 256 bits this approach could be used, as it avoids the need for dealing with reads beyond 256 bits, but if the reduced-cmov optimization above is used anyway, there is little benefit I think.
  68. peterdettman commented at 9:28 am on December 28, 2021: contributor
    • It enables an optimization where the final addition in a block, if it spans the 256th bit, can use a smaller range for the index cmov lookup (because the actual bits used never exceed 256).

    There’s also a more formal way of restricting the comb, which is to not have all the blocks be uniform. (This option is mentioned in the original paper already, BTW: https://eprint.iacr.org/2012/309).

    One general pattern for this might be distinguishing “lower blocks” and “upper blocks” (even though it may be a common case to have only one upper block).

    If our current default scheme is 11B * 6T * 4S (== 264), then because of the extra 8 bits we’re doing, in a vague sense, 1 useless addition (6 of those bits) and 1 excessive lookup (the other 2); there may be table entries that are unused in the top block (or let’s say, at least, less frequently used). So we can somehow save some combination of an addition, memory, or lookup cmovs.

    e.g.

    • Upper(1B * 4T * 1S), Lower(14B * 6T * 3S) covers 256 bits. 43 additions (1 saved), 3 doubles (1 saved), but uses (3 * 2^5 + 1 * 2^3 == 104) extra table entries vs. original (11 * 2^5 == 352).
    • Upper(1B * 4T * 4S), Lower(10B * 6T * 4S) covers 256 bits. 44 additions, 4 doubles, but saves (1 * 2^5 - 1 * 2^3 == 24) table entries.
    • Upper(1B * 4T * 4S), Lower(8B * 6T * 5S) covers 256 bits. Still 44 additions, 1 extra double, but saves (3 * 2^5 - 1 * 2^3 == 88) table entries.
    • Upper(1B * 4T * 1S), Lower(7B * 6T * 6S) covers 256 bits. 43 additions (1 saved), 6 doubles (2 extra), but saves (4 * 2^5 - 1 * 2^3 = 120) of 352 table entries.

    There are also implied cmov lookup savings in the upper blocks of those options. The single upper block of 4 bits ones are maybe especially attractive since they are easily added with a single addition at the end of the comb and are likely to simplify things for non-uniform addition.

  69. peterdettman commented at 10:02 am on December 28, 2021: contributor
    Is there any hope that dedicated asm could speed up the cmov table lookups? After all we have a very nice data shape and just want to blit it all with some masking. Can explicit prefetch add anything beyond what the compiler and/or hardware are doing automatically (given that we are just scanning forward over the whole precomputed data in each pass)? Is there any possibility that consolidating the lookups (one per block) before performing the additions can be faster?
  70. real-or-random commented at 10:28 am on December 28, 2021: contributor

    No idea if it will help here but I sometimes considered an “arbitrary-data” cmov that works on char arrays.

    Advantages would include not only the possibility to cmov data of arbitrary size. We’d also need to get only one cmov function right, and we could consolidate some improvements/issues (#777, #457). So we could spend our time more efficiently on getting that single function constant-time on all platforms (the compiler could still specialize it for some sizes, but I still think it’s a good idea). We could then write an x86 ASM version, which could be beneficial not only for performance but also because using native CMOVs will guarantee that the compiler can’t make it variable time.
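
    A minimal sketch of such an arbitrary-data cmov (illustrative only; the library's existing cmovs are type-specific, and a production version would need the constant-time scrutiny discussed here):

    #include <stddef.h>

    /* Copy src over dst iff flag is 1, without branching on flag. Sketch only:
     * a real version would need extra care (or asm) to stop the compiler from
     * reintroducing branches. */
    static void memcmov(void *dst, const void *src, size_t len, int flag) {
        unsigned char mask = (unsigned char)-(unsigned char)(flag & 1); /* 0x00 or 0xFF */
        unsigned char *d = (unsigned char *)dst;
        const unsigned char *s = (const unsigned char *)src;
        size_t i;
        for (i = 0; i < len; i++) {
            d[i] = (unsigned char)((d[i] & (unsigned char)~mask) | (s[i] & mask));
        }
    }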

    (In spirit it’s similar to my unfinished #636 on data cleaning. IIRC, this implements all the _clear functions for different types via single one that works on arbitrary memory.)

  71. real-or-random commented at 10:30 am on December 28, 2021: contributor
    • Using an initial value has the downside of not being able to reason about the ranges of DLs of intermediary results. By changing to a ctx->final instead (of the would-be ctx->initial value, but divided by 2^(COMB_SPACING-1)) this becomes possible. This adds a point addition, but since it makes the initial value infinity, the very first table lookup can avoid an addition too, so this change doesn’t affect the number of group operations.

    Am I right that we could still use the rescaling blinding that we currently use (e.g., by applying rescaling the new first point instead of initial)?

  72. peterdettman commented at 10:35 am on December 28, 2021: contributor

    Am I right that we could still use the rescaling blinding that we currently use (e.g., by applying rescaling the new first point instead of initial)?

    Yes, it’s already in the new PR: https://github.com/sipa/secp256k1/blob/c914fcc130a04f7df174aa14b9ba134fcf79efa3/src/ecmult_gen_impl.h#L187 .

  73. sipa commented at 2:09 pm on December 28, 2021: contributor

    @peterdettman Good to point out that non-uniform approaches are possible too, and if we’re going in the direction of actually optimizing the fetch/add loop, it’s probably better to just do nonuniform blocks (as they can avoid post-bit-256 issues entirely, and are probably much easier to emit efficient code for, as the loops are more predictable).

    I think I’ll try that. My preference would be to keep the spacing uniform for simplicity reasons (the outer loop can stay), and then have lower blocks with T teeth and upper blocks with T-1 teeth. The upper blocks can use the same table and fetch loop, but forego conditional negation (or halve the fetch loop but keep conditional negation). This doesn’t save additions, but isn’t too big of a change either, and my intuition is that the optimal performance will be had with solutions that don’t have too much difference between upper and lower blocks.

    EDIT: What I said here isn’t very accurate; the upper blocks of course have separate tables to begin with. So in that case, I think the most reasonable approach to try is halving the upper block tables in size, and keep conditional negation for them.

  74. peterdettman commented at 2:46 pm on December 28, 2021: contributor

    @sipa I think you may come to appreciate the virtues of just a single upper block of 4 (in the default case) bits as I mentioned previously.

    I actually find Upper(1B * 4T * 1S), Lower(7B * 6T * 6S) quite attractive. The extra 2 doubles are (with latest formulae) cheaper than the saved addition, so it will actually be faster than the current one, using 2/3 the table memory (which will also make it faster), and it saves an entire lookup PLUS replaces another 6T lookup (32 entries) with a 4T lookup (8 entries), so let’s say 1.75 lookups saved (faster again). All that and the awkward upper block can just be added at the end outside the loop (basically it joins the final iteration), so it doesn’t complicate that. The lower blocks only cover 252 bits (i.e. an odd scalar in (-2^252, 2^252) and it seems likely that could all be non-uniform additions. It’s even possible to add the final blind/offset value to all 8 of the upper block table entries in precomp and save another (uniform) addition (need context space for that though). We should check for a similar option in a small-memory config too I guess.

    Since maximizing the number of teeth (in the face of the cmov costs) is where most of the speed comes from I realise it’s intuitive to say lower@T then upper@(T-1), but I played around with the available configurations, and (trying to target exactly 256 bits) it’s not all that important in the end and mostly the optimum is a single block at the top. If the spacing is less in the upper blocks, it just means join the loop after a few doubles, but yes that might end up being a separate loop. OTOH if it’s a single block with a spacing of one it can just be a single addition after the loop.

  75. sipa commented at 3:24 pm on December 28, 2021: contributor

    @peterdettman I’m starting to see the appeal of having a single upper block only, and giving it spacing 1 would be extra nice.

    Thinking about how one would specify the parametrization for that?

    • Configuration input is B blocks and max T teeth.
    • If BT divides 256, just have (B,T,256/(BT)) as configuration.
    • Otherwise:
      • (B-1,T,floor(256/((B-1)T))) as lower configuration.
      • (1,256 mod ((B-1)T),1) as upper configuration.

    We wouldn’t want the number of teeth in the upper configuration to exceed T (because going beyond the user’s wish for T has a large impact on table size and cmov cost). So that would mean a requirement that (256 mod ((B-1)T)) <= T. Do you think that could be overly restrictive?
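
    To make that rule concrete, a quick sketch (hypothetical helper, not proposed API): given B blocks and max T teeth, use one uniform comb when B*T divides 256, otherwise B-1 full blocks plus one spacing-1 upper block holding the leftover bits, rejecting the configuration when the leftover exceeds T.

    #include <stdio.h>

    static int parametrize(int B, int T) {
        if (256 % (B * T) == 0) {
            printf("uniform: blocks=%d teeth=%d spacing=%d\n", B, T, 256 / (B * T));
            return 1;
        }
        if (B < 2) return 0;
        {
            int spacing = 256 / ((B - 1) * T);
            int upper = 256 % ((B - 1) * T);
            if (spacing == 0) return 0;
            if (upper == 0) {
                printf("uniform: blocks=%d teeth=%d spacing=%d\n", B - 1, T, spacing);
                return 1;
            }
            if (upper > T) return 0; /* leftover teeth would exceed T: invalid */
            printf("lower: blocks=%d teeth=%d spacing=%d + upper: blocks=1 teeth=%d spacing=1\n",
                   B - 1, T, spacing, upper);
            return 1;
        }
    }

    int main(void) {
        parametrize(8, 6); /* gives Lower(7B * 6T * 6S) + Upper(1B * 4T * 1S), as discussed above */
        return 0;
    }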

    Another possibility may be specifying configurations as (max adds, max table size) pairs, which probably more accurately corresponds to real-world costs (the entire table is read in cmovs, and most of the cost beyond that is adds (and the bit fetching for it)). It’s not obvious to me how to translate these to actual table configurations though.

    To be clear, I do expect we’ll have just 2 or 3 precomputed configurations to choose from for production, which could easily be curated, but it’d be nice to be able to test the code by hand for a wide variety of configurations easily.

  76. sipa commented at 4:39 pm on December 28, 2021: contributor

    Here is a list of configurations (blocks=number of “full” blocks excluding the upper one if any, fteeth=teeth for the final block which has spacing one, tblsize=in bytes, groupops=adds+doubles, bitops=number of get_bits or equivalent operations to construct the table entries). I excluded any configuration that is not better in at least one of the metrics:

    blocks=27 teeth=9: spacing=1 fteeth=13 tblsize=704512 groupops=28 bitops=28
    blocks=28 teeth=9: spacing=1 fteeth=4 tblsize=459264 groupops=29 bitops=29
    blocks=14 teeth=9: spacing=2 fteeth=4 tblsize=229888 groupops=30 bitops=57
    blocks=7 teeth=9: spacing=4 fteeth=4 tblsize=115200 groupops=32 bitops=113
    blocks=31 teeth=8: spacing=1 fteeth=8 tblsize=262144 groupops=32 bitops=32
    blocks=16 teeth=8: spacing=2 fteeth=0 tblsize=131072 groupops=33 bitops=64
    blocks=8 teeth=8: spacing=4 fteeth=0 tblsize=65536 groupops=35 bitops=128
    blocks=35 teeth=7: spacing=1 fteeth=11 tblsize=208896 groupops=36 bitops=36
    blocks=36 teeth=7: spacing=1 fteeth=4 tblsize=147968 groupops=37 bitops=37
    blocks=18 teeth=7: spacing=2 fteeth=4 tblsize=74240 groupops=38 bitops=73
    blocks=4 teeth=8: spacing=8 fteeth=0 tblsize=32768 groupops=39 bitops=256
    blocks=12 teeth=7: spacing=3 fteeth=4 tblsize=49664 groupops=39 bitops=109
    blocks=9 teeth=7: spacing=4 fteeth=4 tblsize=37376 groupops=40 bitops=145
    blocks=6 teeth=7: spacing=6 fteeth=4 tblsize=25088 groupops=42 bitops=217
    blocks=41 teeth=6: spacing=1 fteeth=10 tblsize=116736 groupops=42 bitops=42
    blocks=42 teeth=6: spacing=1 fteeth=4 tblsize=86528 groupops=43 bitops=43
    blocks=21 teeth=6: spacing=2 fteeth=4 tblsize=43520 groupops=44 bitops=85
    blocks=4 teeth=7: spacing=9 fteeth=4 tblsize=16896 groupops=45 bitops=325
    blocks=14 teeth=6: spacing=3 fteeth=4 tblsize=29184 groupops=45 bitops=127
    blocks=2 teeth=8: spacing=16 fteeth=0 tblsize=16384 groupops=47 bitops=512
    blocks=3 teeth=7: spacing=12 fteeth=4 tblsize=12800 groupops=48 bitops=433
    blocks=7 teeth=6: spacing=6 fteeth=4 tblsize=14848 groupops=48 bitops=253
    blocks=6 teeth=6: spacing=7 fteeth=4 tblsize=12800 groupops=49 bitops=295
    blocks=50 teeth=5: spacing=1 fteeth=6 tblsize=53248 groupops=51 bitops=51
    blocks=25 teeth=5: spacing=2 fteeth=6 tblsize=27648 groupops=52 bitops=101
    blocks=51 teeth=5: spacing=1 fteeth=1 tblsize=52288 groupops=52 bitops=52
    blocks=2 teeth=7: spacing=18 fteeth=4 tblsize=8704 groupops=54 bitops=649
    blocks=17 teeth=5: spacing=3 fteeth=1 tblsize=17472 groupops=54 bitops=154
    blocks=10 teeth=5: spacing=5 fteeth=6 tblsize=12288 groupops=55 bitops=251
    blocks=3 teeth=6: spacing=14 fteeth=4 tblsize=6656 groupops=56 bitops=589
    blocks=5 teeth=5: spacing=10 fteeth=6 tblsize=7168 groupops=60 bitops=501
    blocks=2 teeth=6: spacing=21 fteeth=4 tblsize=4608 groupops=63 bitops=883
    blocks=62 teeth=4: spacing=1 fteeth=8 tblsize=39936 groupops=63 bitops=63
    blocks=31 teeth=4: spacing=2 fteeth=8 tblsize=24064 groupops=64 bitops=125
    blocks=63 teeth=4: spacing=1 fteeth=4 tblsize=32768 groupops=64 bitops=64
    blocks=32 teeth=4: spacing=2 fteeth=0 tblsize=16384 groupops=65 bitops=128
    blocks=21 teeth=4: spacing=3 fteeth=4 tblsize=11264 groupops=66 bitops=190
    blocks=16 teeth=4: spacing=4 fteeth=0 tblsize=8192 groupops=67 bitops=256
    blocks=3 teeth=5: spacing=17 fteeth=1 tblsize=3136 groupops=68 bitops=868
    blocks=9 teeth=4: spacing=7 fteeth=4 tblsize=5120 groupops=70 bitops=442
    blocks=8 teeth=4: spacing=8 fteeth=0 tblsize=4096 groupops=71 bitops=512
    blocks=4 teeth=4: spacing=16 fteeth=0 tblsize=2048 groupops=79 bitops=1024
    blocks=83 teeth=3: spacing=1 fteeth=7 tblsize=25344 groupops=84 bitops=84
    blocks=84 teeth=3: spacing=1 fteeth=4 tblsize=22016 groupops=85 bitops=85
    blocks=42 teeth=3: spacing=2 fteeth=4 tblsize=11264 groupops=86 bitops=169
    blocks=85 teeth=3: spacing=1 fteeth=1 tblsize=21824 groupops=86 bitops=86
    blocks=28 teeth=3: spacing=3 fteeth=4 tblsize=7680 groupops=87 bitops=253
    blocks=21 teeth=3: spacing=4 fteeth=4 tblsize=5888 groupops=88 bitops=337
    blocks=14 teeth=3: spacing=6 fteeth=4 tblsize=4096 groupops=90 bitops=505
    blocks=17 teeth=3: spacing=5 fteeth=1 tblsize=4416 groupops=90 bitops=426
    blocks=12 teeth=3: spacing=7 fteeth=4 tblsize=3584 groupops=91 bitops=589
    blocks=2 teeth=4: spacing=32 fteeth=0 tblsize=1024 groupops=95 bitops=2048
    blocks=7 teeth=3: spacing=12 fteeth=4 tblsize=2304 groupops=96 bitops=1009
    blocks=5 teeth=3: spacing=17 fteeth=1 tblsize=1344 groupops=102 bitops=1446
    blocks=125 teeth=2: spacing=1 fteeth=6 tblsize=18048 groupops=126 bitops=126
    blocks=1 teeth=4: spacing=64 fteeth=0 tblsize=512 groupops=127 bitops=4096
    blocks=126 teeth=2: spacing=1 fteeth=4 tblsize=16640 groupops=127 bitops=127
    blocks=21 teeth=2: spacing=6 fteeth=4 tblsize=3200 groupops=132 bitops=757
    blocks=18 teeth=2: spacing=7 fteeth=4 tblsize=2816 groupops=133 bitops=883
    blocks=1 teeth=3: spacing=85 fteeth=1 tblsize=320 groupops=170 bitops=7226
    blocks=2 teeth=2: spacing=64 fteeth=0 tblsize=256 groupops=191 bitops=8192
    blocks=1 teeth=2: spacing=128 fteeth=0 tblsize=128 groupops=255 bitops=16384
    blocks=1 teeth=1: spacing=256 fteeth=0 tblsize=64 groupops=511 bitops=65536
    

    If I also include conditional negations as a metric, 72 configurations remain. If I additionally split doubles/adds into separate metrics, 95.

  77. peterdettman commented at 4:46 pm on December 28, 2021: contributor

    This may just be a rephrasing, but for some choice of B,T, use the highest S such that B*T*S <= 256, then require that the remainder (U) be <= T itself or the config is invalid. Then there are S doubles, B*S + (U > 0) additions, and (2^(T-1)*B + 2^(U-1)) table entries (1 entry is 64 bytes).

    For 7T we need to hit 252 == 7*(2*2*3*3) (but is 7T ever reasonable?)
    9B,7T => 4S,4U => 4 doubles, 37 additions, 584 entries
    6B,7T => 6S,4U => 6 doubles, 37 additions, 392 entries

    For 6T we need to hit 252 == 6*(2*3*7)
    14B,6T => 3S,4U => 3 doubles, 43 additions, 456 entries
    7B,6T => 6S,4U => 6 doubles, 43 additions, 232 entries => GOOD

    For 5T we need to hit 255 == 5*(3*17) (restrictive!)
    17B,5T => 3S,1U => 3 doubles, 52 additions, 273 entries (probably absurd)
    3B,5T => 17S,1U => 17 doubles, 52 additions, 49 entries => NOT TERRIBLE
    2B,5T => 25S,6U => INVALID, which is annoying as (I think) this was on the perf. frontier as (2B,5T,26S) => 26D, 52A, 32E

    For 4T we can hit 256 == 4*(2^6) or 252 == 4*(3*3*7)
    7B,4T => 9S,4U => 9 doubles, 64 additions, 56 entries (probably absurd)
    4B,4T => 16S,0U => 16 doubles, 64 additions, 32 entries => OK, but probably inferior to the (invalid) 2B,5T
    3B,4T => 21S,4U => 21 doubles, 64 additions, 24 entries
    2B,4T => 32S,0U => 32 doubles, 64 additions, 16 entries
    1B,4T => 64S,0U => 64 doubles, 64 additions, 8 entries

    We’re already down at 512 bytes, so I doubt fewer teeth are relevant, but in case 4T is somehow too many:
    5B,3T => 17S,1U => 17 doubles, 85 additions, 20 entries (the only choice at 3T due to factorization of 255)

    So, yeah, overall it feels a little restrictive to me. This is basically why the original PR just let you aim for 256 plus a bit and we just accepted that the price of simplicity was probably going to be an extra addition. I’m not saying that’s better or worse, it’s just that the comb configurations aren’t as flexible as we might like.

    Edit: sorry, of course it’s the number of doubles is 1 less than the spacing, so all the above are 1 too high.

  78. peterdettman commented at 4:54 pm on December 28, 2021: contributor

    If I also include conditional negations as a metric, 72 configurations remain. If I additionally split doubles/adds into separate metrics, 95.

    Just give the additions a relative weight of 2 vs the doubles (even 2.1).

    Edit: Then again, what’s the weight for a non-uniform addition? Edit 2: Then again again, each addition comes with a relatively expensive cmov lookup too.

  79. peterdettman commented at 4:56 pm on December 28, 2021: contributor
    Also, there seems to be no restriction on fteeth; adding one would cut your numbers down.
  80. sipa commented at 5:00 pm on December 28, 2021: contributor

    I realized that there is no need for restricting fteeth: its “badness” is captured perfectly by the increase in tablesize (both memory usage and cmovs are affected by it, but those scale 1:1 with table size).

    I’ll run more analysis to compare with the configurations supported by the current code, but my thinking now is that this design space is getting too big, and we should focus on getting some form of SDMC in first. I’ll probably remove some optimizations from my new PR and perhaps reintroduce the uint32[9] array to simplify some edge cases.

  81. peterdettman commented at 5:06 pm on December 28, 2021: contributor
    Yes I think that it would be wise to rein it in and get a basic version in place.
  82. sipa cross-referenced this on Dec 29, 2021 from issue Signed-digit multi-comb ecmult_gen algorithm by sipa
  83. sipa closed this on Dec 29, 2021

  84. sipa cross-referenced this on Dec 29, 2021 from issue Signed-digit multi-comb ecmult_gen algorithm by sipa
  85. sipa commented at 9:00 pm on December 29, 2021: contributor
    See #1058 for the newest iteration.
