Add native num implementation; modular inverse and Jacobi symbol without GMP

apoelstra commented at 5:03 am on August 20, 2015: contributor

Implement num.h API without GMP backend; benchmarks show modular inversion taking about 65% longer than with GMP, but much faster than the constant-time version.

I will update this PR with an implementation of the Jacobi symbol computation; for now I’m pushing it in the hopes that somebody will see some more perf improvements that will let us catch up with GMP. (In fact, since we don’t require arbitrary size numbers or allocation we should in principle be able to beat them..). This has a modular inverse but no Jacobi symbol.

This is required for https://github.com/bitcoin/secp256k1/pull/262 because we do not want to require a GMP dependency.

Edit: I have a paper by the author of the GMP code which describes (mostly) what they are doing. With this I am able to read the GMP code; I think I see how to optimize it for our special case. So we should hold off any detailed review until I’ve written new gcd code.

apoelstra force-pushed on Aug 20, 2015

in configure.ac: in 3833f51374 outdated

116@@ -117,7 +117,7 @@ AC_ARG_ENABLE(module_schnorr,
117 AC_ARG_WITH([field], [AS_HELP_STRING([--with-field=64bit|32bit|auto],
118 [Specify Field Implementation. Default is auto])],[req_field=$withval], [req_field=auto])
119 
120-AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|no|auto],
121+AC_ARG_WITH([bignum], [AS_HELP_STRING([--with-bignum=gmp|64bit|auto],

sipa commented at 7:03 pm on August 21, 2015:

Seems there is a 32bit as well?

apoelstra commented at 3:20 pm on August 23, 2015:

Thanks, fixed.

apoelstra force-pushed on Aug 23, 2015

apoelstra force-pushed on Aug 28, 2015

sipa commented at 6:27 pm on September 4, 2015: contributor

Needs rebase.

apoelstra force-pushed on Sep 6, 2015

sipa commented at 9:30 pm on September 22, 2015: contributor

Need more rebase.

apoelstra cross-referenced this on Oct 2, 2015 from issue Default selection of field/scalar inverse implementation by chfast

sipa cross-referenced this on Oct 11, 2015 from issue Moves all test-only code into tests and re-enables -Wunused-function. by gmaxwell

apoelstra force-pushed on Oct 25, 2015

apoelstra commented at 11:38 pm on October 25, 2015: contributor

Rebased.

peterdettman commented at 3:11 am on October 26, 2015: contributor

How is the inversion performance looking? I ask because I’ve implemented an inversion based on the GCD in http://perso.ens-lyon.fr/damien.stehle/downloads/recbinary.pdf (except non-recursive). Currently it does 1 million inversions in ~2.3s, which is at least %10 faster than GMP according to benchmark_internal. There’s room for improvement because this is currently in Java, using 32 bit multiplies.

If it is proving difficult to improve on GMP’s performance using the current approach, perhaps this would be more fruitful. Due to time constraints, I think I would need help with the C conversion and secp256k1 integration though.

apoelstra commented at 12:54 pm on October 26, 2015: contributor

@peterdettman I’m surprised that you were able to beat my code (let alone GMP’s!!) with binary GCD. Nobody in this space has published benchmarks anywhere but the papers I’ve read have consistently implied that Euclid + Lehmer would be a bit faster because it involved fewer iterations. I did some back-of-envelope calculations that were consistent with this.

And GMP doesn’t use pure Euclid+Lehmer, but a recursive “half GCD” algorithm invented by Niels Möller. It is sorta described here https://www.jstor.org/stable/40234522 but the actual GMP code is heavily optimized beyond that for their data structures.

peterdettman commented at 2:45 pm on October 26, 2015: contributor

@apoelstra I was a bit surprised myself and I suppose it may yet turn out that I’ve mis-measured something, but I don’t believe so. The focus in most GCD papers is on improving the big-O bounds for very large numbers; at small sizes like 256 bits the performance limit of an algorithm often depends on serendipity - whether things work out “cleanly”. Ultimately the proof is in the pudding I guess.

Still, if you’re interested, the best place to start is perhaps to just get you a copy of the Java code and have you confirm the performance comparison?

peterdettman commented at 3:15 pm on October 26, 2015: contributor

BTW, it’s strictly speaking not binary GCD, although similar. In particular it calculates exact “quotients” from the least significant bits only. Best to read the paper!

apoelstra force-pushed on Oct 26, 2015

apoelstra commented at 3:53 pm on October 26, 2015: contributor

Rebased (correctly this time, I think). @peterdettman Sure, I’m interested. You can email me at apoelstra@wpsoftware.net if this is not public code. Note that I’m also time-constrained, plus I’m much less comfortable in Java than in C.

I’ll read your paper as soon as I have have some cycles to work on this again. @sipa had some ideas about using Jacobi symbols for fast batch inversion, so the priority of this patch might be bumped up again..

For historic curiosity:

If I remember correctly, I actually did implement a binary GCD to do some measurements. Because proper bin-GCD requires lots of left-shifting, and I was working with fixed-size buffers, and I didn’t really believe in this approach enough to dedicate lots of time to it, I basically did a modular reduction after every left shift (given the functions I had access to already, and C, this was actually easier than using a variable-sized buffer :)). This of course was hundreds, if not thousands, of times slower than a properly written bin-GCD, so I could not benchmark it directly. Instead I added a bunch of trace code and simply printed out some iteration counts to get a feel for “how much work a proper bin-GCD would do”. If I remember right, The result was 4-600 iterations for a random 256-bit number, vs 20-30 for Euclid-Lehmer. A bingcd iteration is a left-shift-then-subtract (so on 5-word bignums, roughly 10 word operations) while an Euclid-Lehmer iteration is multiplication of a bignum 2-vector by a 2x2 word matrix (roughly 60 operations I think). So my intuition was that bingcd was going to be slower pretty-much no matter what I did.

Obviously these numbers are full of SWAGs and don’t take into account compiler or CPU optimizations. :) And since your paper isn’t really binary GCD maybe they’re totally inapplicable. But there they are.

peterdettman commented at 4:41 pm on October 26, 2015: contributor

Yeah, your intuition was right about that class of algorithms (Laszlo Hars has a good paper rounding up the variations: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.384.8124). About 18 months ago I tried implementing a couple of them and could never get better than very roughly half as fast as GMP at 256 bits, maybe worse.

This algorithm is really more comparable to Euclid-Lehmer. It works on the least significant double-words, calculating exact “quotient” matrices (of the Generalised Binary Division, a sort of dual to plain division, for 2-adic numbers) of a little less than word size without regard to the absolute or relative sizes of the full values. No actual division is needed, and no “fix-up” procedure. Then the matrices are used to update the full-length values.

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9), and a 64-bit oriented one would comfortably finish in 5 iterations (fewer full-length updates, but the amount of LSB-processing doesn’t change much).

The raw output (of the modular inversion) can be larger than 256 bits (there’s some analysis in the paper of how large the quotients can get, but 512 bits is plenty), and an adjustment is needed at the end to account for extra powers of 2 that are collected in the result (total coda cost is around 2 field multiplies on average, using a tiny precomputed table per prime).

Anyway, I’ll tidy things up a little and get a copy to you in the next few days, just for benchmarking in the first instance.

apoelstra commented at 7:45 pm on October 26, 2015: contributor

(Note that the build failures are due to timeouts on the unit tests, which are due to the very bad perf on the current Jacobi symbol implementation vs GMP’s.)

gmaxwell commented at 0:09 am on October 27, 2015: contributor

The current 32-bit implementation almost always handles a 256 bit inverse in 10 iterations (frustratingly close to being 9),

How close? -1/F(-x) = 1/F(x). So if having a smaller magnitude input helps, you can shave off a bit that way.

peterdettman commented at 8:27 am on October 27, 2015: contributor

I collected some better stats on the rounds and it turns out it’s about 0.5% 9, 98% 10 and 1.5% 11. After nine rounds there’s on average about 14 bits left, so each round averages just under 27 bits. Saying its close to 9 was an exaggeration that probably stems from earlier in the process before I had implemented the “stop heuristics” correctly. Still, a few extra bits in the average case would skew things more towards 9 and away from 11. @gmaxwell In this context the smaller magnitude doesn’t help, although a subtraction can get you a low order zero, which could - but the quotient calculation is already built around generating low-order zeroes and it’s cheaper to just let it do its thing instead of manipulating the full values (apart from the single update per round).

in src/num_native_impl.h: in 6ecb0b6881 outdated

311+     * shift only needs to happen once.
312+     */
313+    int shift_w, shift_b;
314+    secp256k1_num_word rem_high;  /* This slides as rem decreases */
315+    secp256k1_num_word div_high;  /* This one stays constant */
316+    secp256k1_num div;

sipa commented at 1:21 am on November 2, 2015:

Declaration of div shadows a global declaration.

sipa commented at 1:44 am on November 2, 2015: contributor

Fails inexplicably, but needs a trivial rebase.

apoelstra commented at 2:22 pm on November 2, 2015: contributor

@sipa the inexplicable failures are that my Jacobi implementation is so slow compared to GMP (like 50x; they do some magic I don’t yet understand to make Jacobi way faster than gcd, though the code is superficially similar) that it times out.

But sure, I’ll rebase this.

sipa commented at 3:29 pm on November 2, 2015: contributor

I’d like to see this in, even if inefficient. Both the Jacobi construct and the ability to have a num interface without GMP.

apoelstra commented at 3:33 pm on November 2, 2015: contributor

Is it OK if I add an #ifdef to the tests binary which drops the number of Jacobi iterations by a factor of 100 when it’s testing my implementation?

Also I’ll need to re-run kcov on this; I think I’m missing coverage for some error cases and I don’t want to hurt @gmaxwell’s numbers :)

sipa commented at 3:47 pm on November 2, 2015: contributor

How about just dropping the iteration count by a factor 100 unconditionally? :)

And yes, not hurting the coverage numbers would be nice!

Move endianness macros into util.h e38eb3ba1c

Remove secp256k1_num_mul from num.h

This function isn't used anywhere and will cause test failures if we
implement the full num.h API for a fixed-width 256-bit numeric type.

We lose the unit test for secp256k1_scalar_mul_shift_var; we compensate
by improving the unit test for secp256k1_scalar_split_lambda (which is
the only user of this function) to test that the algebraic relation
`N = s_lam * lambda + s_1` actually holds for the lambda decomposition.

c6c7c44e3f

Add native num.h implementation with 32- and 64-bit variants

This num.h implementation works using fixed-size arrays large enough
to hold a 256-bit number (plus one word for slop). It includes a
modular inversion. Typical perf numbers on my 64-bit system are:

  scalar_inverse:
    constant time: min 13.4us / avg 13.5us / max 13.8us
     native num.h: min 5.18us / avg 4.55us / max 5.43us
        gmp num.h: min 2.65us / avg 2.68us / max 2.70us

  field_inverse:
    constant time: min 6.02us / avg 6.09us / max 6.15us
     native num.h: min 5.48us / avg 4.94us / max 5.68us
        gmp num.h: min 2.96us / avg 3.02us / max 3.09us

79a59be7ae

apoelstra force-pushed on Nov 7, 2015

apoelstra commented at 6:06 pm on November 7, 2015: contributor

Rebased, and also:

Got full test coverage.
Fixed a couple edge-case bugs revealed by having full test coverage (in particular computing the Jacobi symbol of 0 wrt any modulus would cause an invalid memory access!!)
Dropped the number of Jacobi iterations in bench_internal by a factor of 100 (saving ~70s on my system with the 32-bit num implementation :P)
Renamed helper function secp256k1_num_div_mul_word to secp256k1_num_mul_word_shift because the old name made no sense to me.

Add Jacobi symbol test via GMP

Also add native Jacobi symbol test (Andrew)

Rebased-by: Andrew Poelstra

953894aa48

Simplify divmod handling of "top dividend word exceeds top divisor word" case

Originally we computed q by dividing the top words of the dividend and
divisor, then corrected it by at most 2. However, the ratio of the top
words can only ever be 1, so rather than having three cases (original
correct; need to subtract 1; need to subtract 2), we actually only have
two (1 is correct; 0 is correct).

This simplifies the algorithm a little bit and gets us to full test coverage,
which we didn't have before since the "correct by 2" case was impossible to
hit.

5f7c1c939d

apoelstra force-pushed on Nov 7, 2015

sipa commented at 10:27 pm on November 10, 2015: contributor

Still way too slow for tests.

sipa commented at 0:56 am on March 22, 2017: contributor

Any interest in picking this up in the near future?

spartucus commented at 2:27 am on August 26, 2019: none

Any progress on this?

spartucus cross-referenced this on Aug 26, 2019 from issue Why not use GMP num? by spartucus

real-or-random commented at 7:36 am on August 26, 2019: contributor

Also, is there any progress on @peterdettman’s approach? @peterdettman Are there newer developments? Are you still interested in this? Any code, even in java, would be useful.

apoelstra cross-referenced this on Oct 22, 2019 from issue Is the compiler optimizing out some of the benchmarks? by elichai

elichai commented at 8:06 pm on October 23, 2019: contributor

Just to prevent duplication of work, I want to mention that in light of these results #667 (comment) I’m picking this up and working on implementing a potentially faster subquadratic algorithm for the jacobi symbol.

And potentially a faster algorithm for calculating modulo (i.e. without doing actual division).

If i’ll be able to understand and implement correctly I hope it will result in faster jacobi symbol calculation :) (or at least more comparable to gmp)

Edit: In case anyone wants to comment/is interested I’m basing these on: https://arxiv.org/pdf/1004.2091.pdf and maybe also https://eprint.iacr.org/2014/755.pdf

elichai commented at 8:49 am on October 24, 2019: contributor

@apoelstra did you actually implement Schönhage algorithm in the end? because you mentioned it but I can’t see it in the code. only lehmer’s algorithm. (It’s fine if you don’t remember, it’s 4 years ago lol)

gmaxwell commented at 5:44 am on October 25, 2019: contributor

Related, someone who doesn’t find C++ numeric code unreadable might find this codebase interesting: https://github.com/JeanLucPons/VanitySearch/blob/master/IntMod.cpp#L110

apoelstra commented at 8:29 pm on October 29, 2019: contributor

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works, because I could not make sense of the paper :). I have some more experience now and maybe could get further if I tried again. So no, I did not implement it.

elichai cross-referenced this on Mar 29, 2020 from issue Use the extened euclidian algorithm to compute scalar inverse in variable time by deadalnix

elichai cross-referenced this on Aug 24, 2020 from issue WIP: "safegcd" field and scalar inversion by peterdettman

roconnor-blockstream commented at 1:00 pm on March 24, 2021: contributor

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

real-or-random commented at 9:13 pm on March 24, 2021: contributor

@elichai as I recall, I got as far as searching flights to Sweden to visit Schönhage and ask him how his algorithm works,

I believe this PR is now obsolete and can be closed since https://github.com/bitcoin-core/secp256k1/pull/831/files has now been merged.

I’m all in favor of optimizing implementations but even I didn’t expect #831 to come with such large CO2 savings.

real-or-random closed this on Mar 24, 2021

Add native num implementation; modular inverse and Jacobi symbol without GMP #290