Improve normalization performance for 32bit #40

pull peterdettman wants to merge 1 commit into bitcoin-core:master from peterdettman:faster-32bit-normalize changing 2 files +47 −72
  1. peterdettman commented at 5:20 AM on June 23, 2014: contributor
    • Uses a similar approach to the latest 64-bit _normalize.
    • Adds one useful optimization back into the 64-bit _normalize too.

    Performance of 'bench' improved by around 0.5% for the 32-bit field (but tested on a 64-bit machine).

  2. Improve normalization performance for 32bit
    - Uses a similar approach to the latest 64bit _normalize.
    - Add one useful optimization back into the 64bit _normalize too.
    
    Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).
    42822baaa8
  3. in src/field_10x26_impl.h in 42822baaa8
      89 | +    // ... except for a possible carry at bit 22 of t9 (i.e. bit 256 of the field element)
      90 | +    assert(t9 >> 23 == 0);
      91 | +
      92 | +    // At most a single final reduction is needed; check if the value is >= the field characteristic
      93 | +    x = (t9 >> 22) | ((t9 == 0x03FFFFFULL) & (m == 0x3FFFFFFULL)
      94 | +        & ((t1 + 0x40UL + ((t0 + 0x3D1UL) >> 26)) > 0x3FFFFFFULL));
    


    sipa commented at 12:19 AM on June 25, 2014:

    The expression:

    (t1 + 0x40 + ((t0 + 0x3D1) >> 26)) > 0x3FFFFFF
    =
    (t1 + ((t0 + 0x3D1) >> 26)) > 0x3FFFFBF
    =
    (t1<<26) + t0 + 0x3D1 > 0xFFFFEFC000000
    =
    t0 + (t1<<26) > 0xFFFFEFBFFFC2F
    

    which is not exactly the lower bits of the characteristic. Sure this is right?


    sipa commented at 12:45 AM on June 25, 2014:

    Never mind; the inequality changes meaning when shifting. With 1 added to the right-hand side and > replaced by >=, it is correct.

  4. sipa commented at 12:49 AM on June 25, 2014: contributor

    ACK

  5. sipa merged this on Jun 25, 2014
  6. sipa closed this on Jun 25, 2014

  7. sipa referenced this in commit 61a203517a on Jun 25, 2014
  8. peterdettman commented at 1:27 AM on June 25, 2014: contributor

    Thanks for checking the math! I originally had something like:

    (t1 > 0x3FFFFBF) | ((t1 == 0x3FFFFBF) & (t0 > 0x3FFFC2F))

    The final form is, I think, clearer, and, surprisingly to me, it performs measurably better, at least on my machine. Are comparisons more limited per clock cycle than other ops? If so, it might be possible to wring a couple more cycles out of the final reduction check.

  9. gmaxwell commented at 2:10 AM on June 25, 2014: contributor

    They are kind of more limited. They tend to produce data dependencies on the flags register that stall the pipeline.

    Modern micro-architectures are complex enough that there isn't really a replacement for testing, however. My general intuition is to prefer bitops to comparisons; sometimes the opposite is true, depending on which execution units are available. It's also best if you can interleave things so that the result of an operation is not needed right away. (If you're interested in some light reading: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf) :)

  10. peterdettman deleted the branch on Jun 25, 2014
  11. peterdettman commented at 5:54 AM on June 25, 2014: contributor

    @gmaxwell OK, that's consistent with the measurements I am seeing. I think I'll stick with the testing-guided intuitive approach and leave the manual on the bedside table in case of insomnia!

  12. real-or-random referenced this in commit 43dd1f4fe7 on Mar 5, 2019

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin-core/secp256k1. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-29 15:15 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me