- interleave calculation of the lower and upper partial product ranges, and reduction
- less registers needed, more opportunities for parallel ops
- “bench” performance: +3.5% for 64bit, +7.5% for 32bit (as tested on a 64-bit machine)
Rewrite mul/sqr for 32bit/64bit #73
pull peterdettman wants to merge 1 commits into bitcoin-core:master from peterdettman:opt-mul-sqr changing 2 files +351 −334-
peterdettman commented at 3:15 pm on October 27, 2014: contributor
-
peterdettman force-pushed on Oct 30, 2014
-
peterdettman force-pushed on Oct 31, 2014
-
peterdettman commented at 5:43 pm on October 31, 2014: contributor
Made a couple more changes and rebased.
The performance improvement for 64bit is now measuring more like +15%. Not sure whether I just screwed up the earlier measurement or whether changing from gcc 4.2.1 to 4.9 has made the difference. In any case, the pre/post timings for bench_verify are roughly 78.8s vs 66.9s. Playing around with -march=native (i.e. haswell) seems to show an even larger pre/post disparity there.
-
peterdettman force-pushed on Nov 1, 2014
-
sipa commented at 9:28 am on November 2, 2014: contributorWow, very impressive. With -O3 -march=native this code comes very close (within 3%) of the speed of the assembly versions - which probably means those can be improved further as well. With just -O2 it’s within 14% of the assembly code, but still very significantly faster than before.
-
sipa commented at 12:59 pm on November 2, 2014: contributorI added a commit here: sipa/secp256k1@d7e936b00e742681a9b5407b90c0bbc193f7bef4, which adds range checks during the computation (as consistency check, and as documentation). Interestingly, in the 64 bit version, the ’d’ variable fits in 64bit after every ‘block’. In the 32 bit version, only a few time at the end. Maybe this can be used to optimize even further.
-
peterdettman commented at 3:55 pm on November 2, 2014: contributor
Those range checks are a good idea, I’ll decorate this version with them too. I’m a little unclear on what the compiler does with a __int128 that’s statically only 64 bits… I presume the usual cast then mul 64 bit just gives a 64x64 mul. Timings suggest the compiler is doing that here too (is it inferring the possible range of values?) for c/d (similar question arises for the u? variables in the 32 bit version). I guess checking the generated asm is the way to go from here.
One thing that I can confirm speeds this up a little more is adding ‘restrict’ to the second input to mul, in which case a couple of registers can be used for a0, a1 and t0, t1 can be written out immediately. I found a couple of call sites in ge_double that would need their inputs swapped around, otherwise it seems feasible. I realise restrict is not compiler enforced so we could have at best a debug/verify check; is it safe?
-
sipa commented at 3:56 pm on November 2, 2014: contributorWhat is ‘restrict’ ?
-
peterdettman commented at 4:47 pm on November 2, 2014: contributorI mean this: http://en.wikipedia.org/wiki/Restrict , and I’m referring to declaring the argument ‘b’ in that fashion.
-
sipa commented at 5:32 pm on November 2, 2014: contributor
Interesting, I didn’t know about that. It seems reasonble to mark the b argument that way yes, and add a VERIFY_CHECK for it.
Avoiding C99 would be nice, but not a blocker as we already rely on it. It’s easy enough to add a “#define restrict” later if necessary for c89 compatibility.
-
gmaxwell commented at 6:28 pm on November 2, 2014: contributorYou should use a SECP256k1_RESTRICT macro instead of restrict directly. (Restrict, in particular, fries older MSVC and must be replaced with _Restrict to get its implementation of that) Just take care, it’s easy to get compilers to miscompile code when you get restrict usage wrong. :)
-
sipa commented at 10:41 pm on November 2, 2014: contributor@peterdettman you can just cherry-pick my commit in, if you like.
-
sipa cross-referenced this on Nov 3, 2014 from issue Extra validation and comments on #73 by sipa
-
sipa cross-referenced this on Nov 3, 2014 from issue Extra validation and comments on top of: Rewrite mul/sqr for 32bit/64bit by sipa
-
peterdettman force-pushed on Nov 4, 2014
-
peterdettman commented at 6:25 am on November 4, 2014: contributor@sipa Unfortunately it seems your changes were based on the original pull request, which I had updated and rebased later. I probably shouldn’t have rebased prematurely like that…
-
sipa commented at 6:45 am on November 4, 2014: contributor@peterdettman Oh, too bad - I should have noticed that.
-
sipa closed this on Nov 4, 2014
-
sipa reopened this on Nov 4, 2014
-
sipa commented at 6:47 am on November 4, 2014: contributorClosed the wrong PR, sorry!
-
peterdettman commented at 6:48 am on November 4, 2014: contributor@sipa I think the newer one is faster if it’s any consolation!
-
peterdettman commented at 6:49 am on November 4, 2014: contributorI’m still happy to apply the extra validation and comments myself if you like, but better let me know one way or the other so we don’t both do it.
-
sipa commented at 6:52 am on November 4, 2014: contributorCan you join #bitcoin-dev on freenode IRC, or some other channel? That’s probably easier for discussion than posts here…
-
sipa commented at 7:43 am on November 4, 2014: contributor
Your older version is faster for me, when compiled with -O3 -march=native and –enable-endomorphism (numbers in 1000’s of CPU cycles on i7-4800MQ):
- current head: 421
- your old rewrite: 279
- your new rewrite: 300
- the asm version: 264
However, with -O2 and –enable-endomorphism:
- current head: 454
- your old rewrite: 305
- your new rewrite: 283
- the asm version: 265
-
peterdettman force-pushed on Nov 4, 2014
-
sipa commented at 8:48 am on November 4, 2014: contributorAdded some more benchmarks above. As the new version’s -O2 is nearly as fast as the best possible with -O3, I vote for the new one.
-
sipa commented at 9:42 am on November 4, 2014: contributor
With -O2 -m32 –enable-endomorphism:
- current head: 1251
- your old rewrite: 1308
- your new rewrite: 1328
Not sure it’s an improvement there…
-
gmaxwell commented at 9:43 am on November 4, 2014: contributor
Cool, I was able to use frama-c to mechnically verify overflow freeness for secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner (10x26).
Given the precondition:
0requires (a/b[0] < 1073741824); 1requires (a/b[1] < 1073741824); 2requires (a/b[2] < 1073741824); 3requires (a/b[3] < 1073741824); 4requires (a/b[4] < 1073741824); 5requires (a/b[5] < 1073741824); 6requires (a/b[6] < 1073741824); 7requires (a/b[7] < 1073741824); 8requires (a/b[8] < 1073741824); 9requires (a/b[9] < 67108864);
Both have output ranges of:
S_r[0..8] ∈ [0..67108863] [9] ∈ [0..4194304]
Which is more strongly normalized than the results from PR91.
-
peterdettman commented at 10:53 am on November 5, 2014: contributorI can’t get back to this for a couple of days, but I’ve taken note of the performance reports and will revisit the several variations I have. @sipa I’ll catch you on #bitcoin-dev at some point (ryjz). @gmaxwell frama-c looks useful, thx for the lead. It seems to be indicating tighter bounds than my napkin math.
-
peterdettman force-pushed on Nov 5, 2014
-
peterdettman commented at 10:11 am on November 10, 2014: contributor
After re-measuring everything, I’ve reverted to the earlier version for 32bit, with a few changes to avoid a (syntactic at least) 128x64 mul. I’ve kept the later 64bit version, with some final opts for sqr, and am going to leave the restrict stuff for another PR to keep things simple.
EDIT: I meant the changes to 32bit avoid a 64x32 mul.
-
sipa commented at 10:35 am on November 10, 2014: contributor
New 64-bit version with –enable-endomorphism:
- -O2: 287k cycles
- -O3 -march=native: 296k cycles
-
peterdettman commented at 4:06 am on November 12, 2014: contributor@sipa As long as we’re still comfortably ahead of master on performance, I’d like to go forward with the code as is (with verification to be added).
-
sipa commented at 11:19 am on November 12, 2014: contributor@peterdettman Sounds good. I’d like to do the verification myself, as a way to assure myself it’s correct, but preferable we don’t do double work or use an outdated version.
-
peterdettman commented at 2:23 am on November 13, 2014: contributor@sipa No more changes from my end, so it’s stable for adding verification. Take it over to a new branch/PR if that’s easier for you.
-
Rewrite mul/sqr for 32bit/64bit
- interleave calculation of the lower and upper partial product ranges, and reduction - less registers needed, more opportunities for parallel ops
-
peterdettman force-pushed on Nov 13, 2014
-
gmaxwell closed this on Nov 15, 2014
-
ysangkok referenced this in commit cad7cc8f34 on Mar 19, 2020
This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-01-24 08:15 UTC
More mirrored repositories can be found on mirror.b10c.me