Need info for NEON implementation of field multiplication #431

laanwj opened this issue on December 2, 2016

laanwj commented at 6:08 AM on December 2, 2016: member

@peterdettman In #173 (comment) you mentioned an alternative approach to the multiplication that would be good for SIMD parallelization.

Back then you said "I have sample C code for 5x52 and could whip up a 10x26 version". Could you send this to me? I'm especially interested in 10x26 as ARM NEON has only 32x32->64 multiplication, but possibly I could figure it out myself with the 5x52 one when I know the approach.
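
For readers following along, here is a minimal sketch of why a 10x26 limb layout suits a 32x32->64 multiplier. It is not code from this thread or from the library: the helper name `mul_columns_10x26` and the macros are hypothetical, and it assumes fully normalized limbs (each below 2^26). It computes only the unreduced schoolbook column sums, with no carry propagation and no modular reduction.

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS 10
#define LIMB_BITS 26
#define LIMB_MASK ((UINT32_C(1) << LIMB_BITS) - 1)

/* Hypothetical helper (illustration only): unreduced schoolbook product of
 * two values stored as 10 limbs of 26 bits each, least significant first.
 * out[k] is the sum of a[i] * b[j] over all i + j == k.
 * No carries are propagated and no reduction modulo the field prime is done. */
static void mul_columns_10x26(uint64_t out[2 * LIMBS - 1],
                              const uint32_t a[LIMBS],
                              const uint32_t b[LIMBS]) {
    int i, j;

    for (i = 0; i < 2 * LIMBS - 1; i++) {
        out[i] = 0;
    }
    for (i = 0; i < LIMBS; i++) {
        /* Assumption: inputs are normalized, so every limb fits in 26 bits. */
        assert(a[i] <= LIMB_MASK && b[i] <= LIMB_MASK);
        for (j = 0; j < LIMBS; j++) {
            /* One 32x32->64 multiply: the product is below 2^52. */
            out[i + j] += (uint64_t)a[i] * b[j];
        }
    }
}
```

Under that assumption each product is below 2^52 and a column accumulates at most 10 of them, so every column sum stays below 2^56 and fits in a 64-bit accumulator. Limbs that are allowed to grow past 26 bits between normalizations reduce that headroom.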
peterdettman commented at 9:49 AM on December 20, 2016: contributor

Sorry for the slow reply. I managed to locate my experimental code from 2 years ago and pushed it to a branch here: https://github.com/peterdettman/secp256k1/tree/alt_mul . It actually has both 5x52 and 10x26 versions. Probably best read in conjunction with the paper: http://eprint.iacr.org/2014/852.pdf .

Both versions appear to pass basic tests, but I have a vague recollection that the 10x26 one in particular might actually have potential overflows as written. These are definitely use-at-your-own-risk. Still, the basic structure should give you an idea whether there's good potential for SIMD there. I'm not entirely optimistic; the 5x52 is currently a few percent slower than master, but the 10x26 one is something like 17% slower for me.
laanwj closed this on Apr 14, 2022