- Uses a similar approach to the latest 64bit _normalize.
- Add one useful optimization back into the 64bit _normalize too.
Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).
Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).
- Uses a similar approach to the latest 64bit _normalize.
- Add one useful optimization back into the 64bit _normalize too.
Performance of 'bench' improved by around 0.5% for the 32bit field (but tested on a 64-bit machine).
89 | + // ... except for a possible carry at bit 22 of t9 (i.e. bit 256 of the field element) 90 | + assert(t9 >> 23 == 0); 91 | + 92 | + // At most a single final reduction is needed; check if the value is >= the field characteristic 93 | + x = (t9 >> 22) | ((t9 == 0x03FFFFFULL) & (m == 0x3FFFFFFULL) 94 | + & ((t1 + 0x40UL + ((t0 + 0x3D1UL) >> 26)) > 0x3FFFFFFULL));
The expression:
(t1 + 0x40 + ((t0 + 0x3D1) >> 26)) > 0x0x3FFFFFF
=
(t1 + ((t0 + 0x3D1) >> 26)) > 0x3FFFFBF
=
(t1<<26) + t0 + 3D1 > 0xFFFFEFC000000
=
t0 + (t1<<26) > 0xFFFFEFBFFFC2F
which is not exactly the lower bits of the characteristic. Sure this is right?
Nevermind; the inequality changes meaning when shifting; when adding 1 to the right side and using >= it's correct.
ACK
Thanks for checking the math! I originally had something like: (t1>0x3FFFFBF)|((t1==0x3FFFFBF)&(t0>0x3FFFC2F)) The final form is, I think, clearer, and surprisingly to me, performs measurably better on my machine at least. Are comparisons per clock cycle more limited than other ops? If so it might be possible to wring a couple more cycles out of the final reduction check.
They are kind of more limited. They tend to produce data dependencies on the flags register that stall the pipeline.
Modern micro-architectures are complex enough that there isn't really a replacement for testing however. My general intuition is to prefer bitops to comparisons. Sometimes the opposite is true, just depending on which units are available. It's also best if you can interleave things so that the result of an operation is not needed right away. ( If you're interested in some light reading: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf) :)
@gmaxwell OK, that's consistent with the measurements I am seeing. I think I'll stick with the testing-guided intuitive approach and leave the manual on the bedside table in case of insomnia!