Improve normalization performance for 32bit #40

pull peterdettman wants to merge 1 commits into bitcoin-core:master from peterdettman:faster-32bit-normalize changing 2 files +47 −72

peterdettman commented at 5:20 AM on June 23, 2014: contributor
- Uses a similar approach to the latest 64bit _normalize.
- Add one useful optimization back into the 64bit _normalize too.
Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).

Improve normalization performance for 32bit

- Uses a similar approach to the latest 64bit _normalize.
- Add one useful optimization back into the 64bit _normalize too.

Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).

42822baaa8

in src/field_10x26_impl.h:None in 42822baaa8

  89 | +    // ... except for a possible carry at bit 22 of t9 (i.e. bit 256 of the field element)
  90 | +    assert(t9 >> 23 == 0);
  91 | +
  92 | +    // At most a single final reduction is needed; check if the value is >= the field characteristic
  93 | +    x = (t9 >> 22) | ((t9 == 0x03FFFFFULL) & (m == 0x3FFFFFFULL)
  94 | +        & ((t1 + 0x40UL + ((t0 + 0x3D1UL) >> 26)) > 0x3FFFFFFULL));

sipa commented at 12:19 AM on June 25, 2014:

The expression:

(t1 + 0x40 + ((t0 + 0x3D1) >> 26)) > 0x0x3FFFFFF
=
(t1 + ((t0 + 0x3D1) >> 26)) > 0x3FFFFBF
=
(t1<<26) + t0 + 3D1 > 0xFFFFEFC000000
=
t0 + (t1<<26) > 0xFFFFEFBFFFC2F

which is not exactly the lower bits of the characteristic. Sure this is right?

sipa commented at 12:45 AM on June 25, 2014:

Nevermind; the inequality changes meaning when shifting; when adding 1 to the right side and using >= it's correct.

sipa commented at 12:49 AM on June 25, 2014: contributor

ACK
sipa merged this on Jun 25, 2014
sipa closed this on Jun 25, 2014
sipa referenced this in commit 61a203517a on Jun 25, 2014
peterdettman commented at 1:27 AM on June 25, 2014: contributor

Thanks for checking the math! I originally had something like: (t1>0x3FFFFBF)|((t1==0x3FFFFBF)&(t0>0x3FFFC2F)) The final form is, I think, clearer, and surprisingly to me, performs measurably better on my machine at least. Are comparisons per clock cycle more limited than other ops? If so it might be possible to wring a couple more cycles out of the final reduction check.
gmaxwell commented at 2:10 AM on June 25, 2014: contributor

They are kind of more limited. They tend to produce data dependencies on the flags register that stall the pipeline.

Modern micro-architectures are complex enough that there isn't really a replacement for testing however. My general intuition is to prefer bitops to comparisons. Sometimes the opposite is true, just depending on which units are available. It's also best if you can interleave things so that the result of an operation is not needed right away. ( If you're interested in some light reading: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf) :)
peterdettman deleted the branch on Jun 25, 2014
peterdettman commented at 5:54 AM on June 25, 2014: contributor

@gmaxwell OK, that's consistent with the measurements I am seeing. I think I'll stick with the testing-guided intuitive approach and leave the manual on the bedside table in case of insomnia!
real-or-random referenced this in commit 43dd1f4fe7 on Mar 5, 2019

Contributors

peterdettman

sipa

gmaxwell