Rewrite mul/sqr for 32bit/64bit

peterdettman commented at 3:15 pm on October 27, 2014: contributor

interleave calculation of the lower and upper partial product ranges, and reduction
less registers needed, more opportunities for parallel ops
“bench” performance: +3.5% for 64bit, +7.5% for 32bit (as tested on a 64-bit machine)

peterdettman force-pushed on Oct 30, 2014

peterdettman force-pushed on Oct 31, 2014

peterdettman commented at 5:43 pm on October 31, 2014: contributor

Made a couple more changes and rebased.

The performance improvement for 64bit is now measuring more like +15%. Not sure whether I just screwed up the earlier measurement or whether changing from gcc 4.2.1 to 4.9 has made the difference. In any case, the pre/post timings for bench_verify are roughly 78.8s vs 66.9s. Playing around with -march=native (i.e. haswell) seems to show an even larger pre/post disparity there.

peterdettman force-pushed on Nov 1, 2014

sipa commented at 9:28 am on November 2, 2014: contributor

Wow, very impressive. With -O3 -march=native this code comes very close (within 3%) of the speed of the assembly versions - which probably means those can be improved further as well. With just -O2 it’s within 14% of the assembly code, but still very significantly faster than before.

sipa commented at 12:59 pm on November 2, 2014: contributor

I added a commit here: sipa/secp256k1@d7e936b00e742681a9b5407b90c0bbc193f7bef4, which adds range checks during the computation (as consistency check, and as documentation). Interestingly, in the 64 bit version, the ’d’ variable fits in 64bit after every ‘block’. In the 32 bit version, only a few time at the end. Maybe this can be used to optimize even further.

peterdettman commented at 3:55 pm on November 2, 2014: contributor

Those range checks are a good idea, I’ll decorate this version with them too. I’m a little unclear on what the compiler does with a __int128 that’s statically only 64 bits… I presume the usual cast then mul 64 bit just gives a 64x64 mul. Timings suggest the compiler is doing that here too (is it inferring the possible range of values?) for c/d (similar question arises for the u? variables in the 32 bit version). I guess checking the generated asm is the way to go from here.

One thing that I can confirm speeds this up a little more is adding ‘restrict’ to the second input to mul, in which case a couple of registers can be used for a0, a1 and t0, t1 can be written out immediately. I found a couple of call sites in ge_double that would need their inputs swapped around, otherwise it seems feasible. I realise restrict is not compiler enforced so we could have at best a debug/verify check; is it safe?

sipa commented at 3:56 pm on November 2, 2014: contributor

What is ‘restrict’ ?

peterdettman commented at 4:47 pm on November 2, 2014: contributor

I mean this: http://en.wikipedia.org/wiki/Restrict , and I’m referring to declaring the argument ‘b’ in that fashion.

sipa commented at 5:32 pm on November 2, 2014: contributor

Interesting, I didn’t know about that. It seems reasonble to mark the b argument that way yes, and add a VERIFY_CHECK for it.

Avoiding C99 would be nice, but not a blocker as we already rely on it. It’s easy enough to add a “#define restrict” later if necessary for c89 compatibility.

gmaxwell commented at 6:28 pm on November 2, 2014: contributor

You should use a SECP256k1_RESTRICT macro instead of restrict directly. (Restrict, in particular, fries older MSVC and must be replaced with _Restrict to get its implementation of that) Just take care, it’s easy to get compilers to miscompile code when you get restrict usage wrong. :)

sipa commented at 10:41 pm on November 2, 2014: contributor

@peterdettman you can just cherry-pick my commit in, if you like.

sipa cross-referenced this on Nov 3, 2014 from issue Extra validation and comments on #73 by sipa

sipa cross-referenced this on Nov 3, 2014 from issue Extra validation and comments on top of: Rewrite mul/sqr for 32bit/64bit by sipa

peterdettman force-pushed on Nov 4, 2014

peterdettman commented at 6:25 am on November 4, 2014: contributor

@sipa Unfortunately it seems your changes were based on the original pull request, which I had updated and rebased later. I probably shouldn’t have rebased prematurely like that…

sipa commented at 6:45 am on November 4, 2014: contributor

@peterdettman Oh, too bad - I should have noticed that.

sipa closed this on Nov 4, 2014

sipa reopened this on Nov 4, 2014

sipa commented at 6:47 am on November 4, 2014: contributor

Closed the wrong PR, sorry!

peterdettman commented at 6:48 am on November 4, 2014: contributor

@sipa I think the newer one is faster if it’s any consolation!

peterdettman commented at 6:49 am on November 4, 2014: contributor

I’m still happy to apply the extra validation and comments myself if you like, but better let me know one way or the other so we don’t both do it.

sipa commented at 6:52 am on November 4, 2014: contributor

Can you join #bitcoin-dev on freenode IRC, or some other channel? That’s probably easier for discussion than posts here…

sipa commented at 7:43 am on November 4, 2014: contributor

Your older version is faster for me, when compiled with -O3 -march=native and –enable-endomorphism (numbers in 1000’s of CPU cycles on i7-4800MQ):

current head: 421
your old rewrite: 279
your new rewrite: 300
the asm version: 264

However, with -O2 and –enable-endomorphism:

current head: 454
your old rewrite: 305
your new rewrite: 283
the asm version: 265

peterdettman force-pushed on Nov 4, 2014

sipa commented at 8:48 am on November 4, 2014: contributor

Added some more benchmarks above. As the new version’s -O2 is nearly as fast as the best possible with -O3, I vote for the new one.

sipa commented at 9:42 am on November 4, 2014: contributor

With -O2 -m32 –enable-endomorphism:

current head: 1251
your old rewrite: 1308
your new rewrite: 1328

Not sure it’s an improvement there…

gmaxwell commented at 9:43 am on November 4, 2014: contributor

Cool, I was able to use frama-c to mechnically verify overflow freeness for secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner (10x26).

Given the precondition:

0requires (a/b[0] < 1073741824);
1requires (a/b[1] < 1073741824); 
2requires (a/b[2] < 1073741824);
3requires (a/b[3] < 1073741824); 
4requires (a/b[4] < 1073741824);
5requires (a/b[5] < 1073741824);  
6requires (a/b[6] < 1073741824);  
7requires (a/b[7] < 1073741824); 
8requires (a/b[8] < 1073741824);  
9requires (a/b[9] < 67108864);

Both have output ranges of:

S_r[0..8] ∈ [0..67108863] [9] ∈ [0..4194304]

Which is more strongly normalized than the results from PR91.

peterdettman commented at 10:53 am on November 5, 2014: contributor

I can’t get back to this for a couple of days, but I’ve taken note of the performance reports and will revisit the several variations I have. @sipa I’ll catch you on #bitcoin-dev at some point (ryjz). @gmaxwell frama-c looks useful, thx for the lead. It seems to be indicating tighter bounds than my napkin math.

peterdettman force-pushed on Nov 5, 2014

peterdettman commented at 10:11 am on November 10, 2014: contributor

After re-measuring everything, I’ve reverted to the earlier version for 32bit, with a few changes to avoid a (syntactic at least) 128x64 mul. I’ve kept the later 64bit version, with some final opts for sqr, and am going to leave the restrict stuff for another PR to keep things simple.

EDIT: I meant the changes to 32bit avoid a 64x32 mul.

sipa commented at 10:35 am on November 10, 2014: contributor

New 64-bit version with –enable-endomorphism:

-O2: 287k cycles
-O3 -march=native: 296k cycles

peterdettman commented at 4:06 am on November 12, 2014: contributor

@sipa As long as we’re still comfortably ahead of master on performance, I’d like to go forward with the code as is (with verification to be added).

sipa commented at 11:19 am on November 12, 2014: contributor

@peterdettman Sounds good. I’d like to do the verification myself, as a way to assure myself it’s correct, but preferable we don’t do double work or use an outdated version.

peterdettman commented at 2:23 am on November 13, 2014: contributor

@sipa No more changes from my end, so it’s stable for adding verification. Take it over to a new branch/PR if that’s easier for you.

Rewrite mul/sqr for 32bit/64bit

- interleave calculation of the lower and upper partial product ranges, and reduction
- less registers needed, more opportunities for parallel ops

6bd0cf6235

peterdettman force-pushed on Nov 13, 2014

gmaxwell commented at 4:23 am on November 15, 2014: contributor

merged via #103

gmaxwell closed this on Nov 15, 2014

ysangkok referenced this in commit cad7cc8f34 on Mar 19, 2020

Rewrite mul/sqr for 32bit/64bit #73