field10x26: change type of uX to uint32_t #170

pull laanwj wants to merge 1 commits into bitcoin-core:master from laanwj:2014_12_field10x26_uint32 changing 1 files +36 −37

laanwj commented at 8:35 AM on December 21, 2014: member

Change the type of uX in secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner to uint32_t, and cast only to uint64_t when necessary.

On x86 the difference was within the noise, but on ARM this reduces code size and required stack size, and increases performance significantly. Apparently, GCC is less smart there and takes this as a hint that uint32_t*uint32_t→uint64_t instructions can be used.
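To illustrate the idea, here is a minimal sketch of the two styles. It is not the actual secp256k1 code: the function names `mul_before`, `mul_after` and `check_equivalence` are hypothetical, and the 26-bit mask simply mirrors the limb size of the 10x26 field representation. Both functions compute the same product; the second declares the temporary as `uint32_t` and widens it only at the multiplication, which is the pattern a compiler can most easily map to a 32x32→64 multiply instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical "before" style: the limb-sized temporary lives in a 64-bit
 * variable, so the multiplication is written as 64x64. The compiler has to
 * prove that u fits in 32 bits before it can use a widening 32-bit multiply. */
static uint64_t mul_before(uint64_t acc, uint32_t a)
{
    uint64_t u = acc & 0x3FFFFFFUL; /* low 26 bits, stored in 64 bits */
    return u * a;
}

/* Hypothetical "after" style: the temporary is a uint32_t and is cast to
 * uint64_t only where the wide product is needed. */
static uint64_t mul_after(uint64_t acc, uint32_t a)
{
    uint32_t u = (uint32_t)(acc & 0x3FFFFFFUL); /* low 26 bits, stored in 32 bits */
    return (uint64_t)u * a;
}

/* Sanity check: both styles must agree on a few sample inputs. */
static void check_equivalence(void)
{
    const uint64_t accs[] = { 0, 5, 0x3FFFFFFULL, 0x0123456789ABCDEFULL, UINT64_MAX };
    const uint32_t as[] = { 0, 7, 0x3FFFFFFUL, UINT32_MAX };
    for (unsigned i = 0; i < sizeof(accs) / sizeof(accs[0]); i++) {
        for (unsigned j = 0; j < sizeof(as) / sizeof(as[0]); j++) {
            assert(mul_before(accs[i], as[j]) == mul_after(accs[i], as[j]));
        }
    }
}
```

Whether the two forms compile to different instructions depends on the compiler and target; the benchmark numbers in this thread are the only evidence offered here.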

Before (asm):

test	min	avg	max
inv	384.321us	384.537us	385.125us
recover	2845.161us	2845.341us	2845.426us
sign	1324.141us	1324.300us	1324.453us
verify	2672.787us	2673.188us	2673.498us

After (asm):

test	min	avg	max
inv	384.400us	384.614us	385.357us
recover	2549.572us	2550.015us	2552.387us
sign	1223.006us	1223.169us	1223.313us
verify	2410.431us	2410.598us	2410.885us

E.g., roughly a 10% speedup for verification (average 2673.188us → 2410.598us).

I still intend to do a NEON implementation, but this picks a bit of low-hanging fruit that I expect to be significant on other embedded 32-bit architectures as well.

field10x26: change type of uX to uint32_t

Cast only to uint64_t when necessary. This reduces code size on ARM.

4902ff6474

peterdettman commented at 2:57 AM on December 22, 2014: contributor

Looks good to me. I did at some point have the code like this but couldn't measure any difference (x86); I thought perhaps the compiler was smart enough to understand the range of uX. Maybe it is... sometimes.
gmaxwell commented at 4:00 AM on December 22, 2014: contributor

ACK. Looks fine, and has survived several hours of aggressive testing for me.

I see an even larger performance gain for bench_verify on a Cortex-A8 (the in-order little cousin of the A9, in this case a BeagleBone w/ GCC 4.9.2):

min 5099.428us / avg 5166.601us / max 5176.198us

becomes

min 3952.994us / avg 3953.490us / max 3954.802us
laanwj commented at 9:09 AM on December 22, 2014: member

Ah yes should have mentioned what I measured on: HummingBoard-i1 w. i.MX6 Solo (Cortex A9). Great to see that there is even more gain on A8.
laanwj commented at 11:53 AM on December 22, 2014: member

On my OLPC XO-4 (Marvell PXA2128/PJ4) the bench_verify goes from 2926.030us to 2655.351us, similar to A9
gmaxwell commented at 10:02 PM on December 22, 2014: contributor

Looks like bench_verify is 1.5% slower on 32-bit PPC w/ defaults and GCC 4.9.2. I don't consider this a concern (relative to the huge gains on arm), just a data point.
sipa commented at 10:51 PM on December 22, 2014: contributor

It's also 1.6% slower (gmp, endo, gcc v4.8.2, -m32) here on an i7 CPU.
gmaxwell commented at 10:57 PM on December 22, 2014: contributor

Interesting. Maybe we should extract a microbenchmark and toss it over the fence at GCC?
laanwj commented at 10:24 AM on December 23, 2014: member

Disappointing. Maybe we should skip this step and just go for ARM-specific assembly then.

I can hardly understand why this would be slower, though. Effective number of casts stays the same. What about code size and stack size on x86/ppc?
laanwj closed this on Dec 24, 2014
