vector processor acceleration #226

voisine commented at 5:12 AM on March 5, 2015: contributor

I'm wondering if it would be possible to get a performance improvement using vector processor acceleration for scalar math, either directly with NEON/SSE instructions, or using math libraries like vecLib/MKL/ACML

I don't know too much about it myself, or what security issues their might be when handling private key data, but presumably the security concerns would be less of an issue when verifying signatures.

gmaxwell commented at 6:18 PM on August 27, 2015: contributor

External libraries are probably not suitable. For verification we have consensus concerns, for signing we have constant time concerns... plus I would be surprised if you'd find anything useful out of them. E.g. All GMP does for us usefully now is modular inversion (and Andrew is working on replacing that) (and the modular inverse has the property that its very easy to verify at runtime).

On 64 bit processors the 64,64->128 widening multiplies are very powerful and look like they'll do better than SIMD. On 32-bit processors the story is different, and SIMD may be a big win.

But deploying it will likely need a different factorization of the inner product that has more parallelism. Also, x86 SIMD is not that interesting because x86 is generally quite fast just on a clock-speed basis and all (? or almost all) x86 being produced today are 64-bit-- I'd rather see SIMD time put in to NEON than SSE.

peterdettman commented at 5:26 AM on September 2, 2015: contributor

I think there is a reasonable prospect that a multiplication algorithm similar to the one described in https://eprint.iacr.org/2014/852 would result in improved performance, needing SIMD for only addition (subtraction). IIRC correctly the number of word multiplies is reduced to the same as for squaring, but there are a considerable number of extra word additions required.

I managed to get the algorithm to work in straight C code (for our 64bit field), but it was around 10-20% slower than our current implementation I think. If SIMD can even halve the cost of the additions, I think this approach would show a net improvement.

I tried implementing the algorithm for the 32bit field also, but never completed it - it looked to be much more awkward as the magnitude-8 inputs have 30 bits used per word, meaning that various intermediate sums/carries end up overflowing, where they do not for 64bit.

msotoodeh commented at 12:50 AM on January 11, 2017: none

I confirm @gmaxwell comment regarding 64x64 multiplication. I have an implementation of curve25519 library that is 5.8 to 18 times faster than google's donna implementation. It is hosted at https://github.com/msotoodeh/curve25519. The math library uses classical approaches but uses an endian-safe struct as accumulator that eliminates shift and mask operations.

gmaxwell commented at 11:21 AM on February 20, 2019: contributor

Closing this because it's not really an actionable thing right now. I think we're absolutely interested in further optimizations and the developers would love to help out anyone seriously working on implementing some of them, but at the moment it isn't being worked on.

gmaxwell closed this on Feb 20, 2019

sysmanalex commented at 5:49 AM on May 23, 2020: none

@gmaxwell did you saw this research https://medium.com/@CoinExChain/acceleration-of-ecdsa-verification-with-endomorphism-mapping-of-secp256k1-126e77a51dba ?

gmaxwell commented at 1:55 AM on May 24, 2020: contributor

I hadn't seen it-- but it's just reviewing an optimization this library already has. (Unless I've misunderstood something!) Please always feel free to share citations to research touching on potentially relevant optimizations!