I'm wondering whether it would be possible to get a performance improvement by using vector processor acceleration for scalar math, either directly with NEON/SSE instructions or through math libraries like vecLib/MKL/ACML.
I don't know much about it myself, or what security issues there might be when handling private key data, but presumably security would be less of a concern when verifying signatures, since verification only involves public data.