See section 3.3 of https://eprint.iacr.org/2012/309 for a description of the algorithm. Briefly, the scalar is recoded into signed-binary form, then divided into several blocks. A separate precomp. table is prepared for each block, and performing a multiplication is done using one comb per block, interleaved.
This implementation is constant-time, preserves the existing scalar blinding, but the NUMS group element is not yet used, perhaps not really useful (no zeroes in the signed-digit recoding). Static precomputation is not yet implemented. Settings are overridden to let the exhaustive tests work.
You can play with the comb parameters in ecmult_gen.h .
Compared to the existing approach, this gives improved performance/memory tradeoffs, and allows considerable flexibility in the parameters depending on platform details.
The following table gives an idea of the sort of tradeoffs available (bench_sign results - best “min” of 3, asm=no, 64bit field and scalar, -O3, Haswell):
Blocks | Teeth | Spacing | Memory (KiB) | Time (us) |
---|---|---|---|---|
43 | 6 | 1 | 86 | 39.2 |
22 | 6 | 2 | 44 | 39.7 |
11 | 6 | 4 | 22 | 40.1 |
4 | 6 | 11 | 8 | 41.0 |
4 | 5 | 13 | 4 | 42.6 |
2 | 5 | 26 | 2 | 44.6 |
2 | 4 | 32 | 1 | 48.4 |
1 | 4 | 64 | 0.5 | 53.3 |
1 | 3 | 86 | 0.25 | 63.1 |
1 | 2 | 128 | 0.125 | 82.3 |
1 | 1 | 256 | 0.0625 | 140 |
For existing approach: 44.6us (64KiB precomp. data)