field10x26: change type of uX to uint32_t #170

pull laanwj wants to merge 1 commits into bitcoin-core:master from laanwj:2014_12_field10x26_uint32 changing 1 files +36 −37
  1. laanwj commented at 8:35 AM on December 21, 2014: member

    Change the type of uX in secp256k1_fe_mul_inner and secp256k1_fe_sqr_inner to uint32_t, and cast only to uint64_t when necessary.

    On x86 the difference was within the noise, but on ARM this reduces code size and required stack size, and increases performance significantly. Apparently, GCC is less smart there and sees this as a hint that uint32_t*uint32_t->uint64_t instructions can be used.
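
To illustrate the idea of keeping operands narrow and widening only at the multiply, here is a minimal sketch. The function names are hypothetical and are not taken from the secp256k1 source; it only shows the general pattern of a 32x32->64 multiplication in C, assuming a typical platform where `int` is 32 bits.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of secp256k1. */

/* Operands are declared 64-bit. The result is correct, but the compiler
 * has to prove the upper halves are zero before it can pick a single
 * 32x32->64 multiply instruction. */
static uint64_t mul_wide_operands(uint64_t a, uint64_t b) {
    return a * b;
}

/* Operands are declared 32-bit and one is cast at the point of use.
 * The full 64-bit product is still computed, and the operand range
 * is explicit in the types. */
static uint64_t mul_narrow_operands(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Pitfall: without the cast, the multiplication is done in 32 bits
 * (assuming 32-bit int) and the high half of the product is lost
 * before the result is widened. */
static uint64_t mul_truncating(uint32_t a, uint32_t b) {
    return a * b;
}
```

The first two functions return the same value for any inputs that fit in 32 bits; whether the compiler emits a single widening multiply for either form depends on the target and compiler version.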

    Before (asm):

    test min avg max
    inv 384.321us 384.537us 385.125us
    recover 2845.161us 2845.341us 2845.426us
    sign 1324.141us 1324.300us 1324.453us
    verify 2672.787us 2673.188us 2673.498us

    After (asm):

    test min avg max
    inv 384.400us 384.614us 385.357us
    recover 2549.572us 2550.015us 2552.387us
    sign 1223.006us 1223.169us 1223.313us
    verify 2410.431us 2410.598us 2410.885us

    E.g., a 10% speedup for verification.
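
The percentage can be checked from the averages in the two tables above. The helper below is a hypothetical sketch (the name `percent_faster` is an assumption, not part of the benchmark code) that computes the relative reduction in time.

```c
/* Hypothetical helper: relative reduction in running time, in percent,
 * when going from before_us to after_us (both in microseconds). */
static double percent_faster(double before_us, double after_us) {
    return 100.0 * (before_us - after_us) / before_us;
}
```

With the average columns above, `percent_faster(2673.188, 2410.598)` gives roughly 9.8 for verify, and the same calculation gives roughly 10.4 for recover and 7.6 for sign, while inv is unchanged within noise.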

    I still intend to do a NEON implementation, but this picks a bit of low-hanging fruit that I expect to be significant on other embedded 32-bit architectures as well.

  2. field10x26: change type of uX to uint32_t
    Cast only to uint64_t when necessary. This reduces code size on ARM.
    4902ff6474
  3. peterdettman commented at 2:57 AM on December 22, 2014: contributor

    Looks good to me. I did at some point have the code like this but couldn't measure any difference (x86); I thought perhaps the compiler was smart enough to understand the range of uX. Maybe it is... sometimes.

  4. gmaxwell commented at 4:00 AM on December 22, 2014: contributor

    ACK. Looks fine, and has survived several hours of aggressive testing for me.

    I see an even larger performance gain for bench_verify on a cortex A8 (the in-order little cousin of the A9, in this case a beaglebone w/ GCC 4.9.2):

    min 5099.428us / avg 5166.601us / max 5176.198us

    becomes

    min 3952.994us / avg 3953.490us / max 3954.802us

  5. laanwj commented at 9:09 AM on December 22, 2014: member

    Ah yes should have mentioned what I measured on: HummingBoard-i1 w. i.MX6 Solo (Cortex A9). Great to see that there is even more gain on A8.

  6. laanwj commented at 11:53 AM on December 22, 2014: member

    On my OLPC XO-4 (Marvell PXA2128/PJ4) the bench_verify goes from 2926.030us to 2655.351us, similar to A9

  7. gmaxwell commented at 10:02 PM on December 22, 2014: contributor

    Looks like bench_verify is 1.5% slower on 32-bit PPC w/ defaults and GCC 4.9.2. I don't consider this a concern (relative to the huge gains on arm), just a data point.

  8. sipa commented at 10:51 PM on December 22, 2014: contributor

    It's also 1.6% slower (gmp, endo, gcc v4.8.2, -m32) here on an i7 CPU.

  9. gmaxwell commented at 10:57 PM on December 22, 2014: contributor

    Interesting. Maybe we should extract a microbenchmark and toss it over the fence at GCC?

  10. laanwj commented at 10:24 AM on December 23, 2014: member

    Disappointing. Maybe we should skip this step and just go for ARM-specific assembly then.

    I can hardly understand why this would be slower, though. Effective number of casts stays the same. What about code size and stack size on x86/ppc?

  11. laanwj closed this on Dec 24, 2014


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-14 18:15 UTC
