(Not for immediate merge)
In #452 I noted that sqr and mul take about the same time in my config (OSX, 64-bit, no-asm, -O3 -march=native), so this is a quick attempt to speed up _scalar_sqr. This initial commit rewrites _scalar_sqr_512, which gives roughly an 8% improvement in _scalar_sqr on that config. Second opinions and measurements would be appreciated.
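For anyone who wants the general idea without reading the diff, here is a minimal sketch of why a dedicated 512-bit square can beat a generic multiply. It is not the code in this commit: the names mul_512_ref, sqr_512_sketch and sqr_matches_mul are hypothetical, the 4x64-bit little-endian limb layout is an assumption, and it relies on the unsigned __int128 compiler extension (GCC/Clang). It computes each off-diagonal limb product once and doubles it, so it needs 10 limb multiplications instead of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: unsigned __int128 is available (GCC/Clang extension). */
typedef unsigned __int128 u128;

/* Hypothetical reference: schoolbook 4x4 limb multiply, 16 limb products. */
static void mul_512_ref(uint64_t r[8], const uint64_t a[4], const uint64_t b[4]) {
    int i, j;
    for (i = 0; i < 8; i++) r[i] = 0;
    for (i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (j = 0; j < 4; j++) {
            /* Cannot overflow: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
            u128 t = (u128)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[i + 4] = carry; /* r[i+4] has not been written by earlier rows */
    }
}

/* Hypothetical sketch: square using 6 off-diagonal + 4 diagonal products. */
static void sqr_512_sketch(uint64_t r[8], const uint64_t a[4]) {
    int i, j;
    uint64_t carry, top;
    for (i = 0; i < 8; i++) r[i] = 0;

    /* Step 1: accumulate a[i]*a[j] for i < j only. */
    for (i = 0; i < 3; i++) {
        carry = 0;
        for (j = i + 1; j < 4; j++) {
            u128 t = (u128)a[i] * a[j] + r[i + j] + carry;
            r[i + j] = (uint64_t)t;
            carry = (uint64_t)(t >> 64);
        }
        r[i + 4] = carry;
    }

    /* Step 2: double the cross terms (shift the 512-bit value left by 1). */
    top = 0;
    for (i = 0; i < 8; i++) {
        uint64_t next = r[i] >> 63;
        r[i] = (r[i] << 1) | top;
        top = next;
    }

    /* Step 3: add the diagonal squares a[i]^2 at limb position 2*i. */
    carry = 0;
    for (i = 0; i < 4; i++) {
        u128 sq = (u128)a[i] * a[i];
        u128 lo = (u128)r[2 * i] + (uint64_t)sq + carry;
        u128 hi = (u128)r[2 * i + 1] + (uint64_t)(sq >> 64) + (uint64_t)(lo >> 64);
        r[2 * i] = (uint64_t)lo;
        r[2 * i + 1] = (uint64_t)hi;
        carry = (uint64_t)(hi >> 64);
    }
    assert(carry == 0); /* a^2 < 2^512, so nothing carries out of the top limb */
}

/* Returns 1 if the sketch agrees with the reference multiply for input a. */
static int sqr_matches_mul(const uint64_t a[4]) {
    uint64_t r1[8], r2[8];
    int i;
    mul_512_ref(r1, a, a);
    sqr_512_sketch(r2, a);
    for (i = 0; i < 8; i++) {
        if (r1[i] != r2[i]) return 0;
    }
    return 1;
}
```

The sketch uses plain loops for readability. A tuned version would presumably be unrolled and would fold the doubling into the accumulation, so the 10-versus-16 multiplication count is only an upper bound on the possible gain.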
The measurements suggest that _scalar_reduce_512 is the real heavyweight here, so I’ll try re-implementing that next.
I can rewrite in terms of macros (the current local code style) prior to any merge.