This is mostly for future reference.
I tried implementing SipHashUint256Extra with SSE2 instructions, but after seeing that it ran at half the speed of the naive C implementation I gave up on optimizing it further.
(Afterwards I also benchmarked the implementations in SUPERCOP, and the SSE ones were slower than the naive one there too.)
diff: https://gist.github.com/elichai/134b95fee25f8170fcdc69535f8f8bd4
Compiled with CXXFLAGS='-g -O3 -march=native', which gave:
# Benchmark, evals, iterations, total, min, max, median
SaltedSipHash, 5, 40000000, 4.27147, 2.09707e-08, 2.22874e-08, 2.10788e-08
SaltedSipHashX86, 5, 40000000, 8.53348, 4.12294e-08, 4.45225e-08, 4.23751e-08
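
One possible contributing factor, not profiled here: SipHash is built from 64-bit add/rotate/xor steps, scalar x86-64 has a native rotate instruction, and SSE2 has no packed 64-bit rotate, so each rotate has to be emulated with two shifts and an OR. Below is a minimal sketch of that difference. It is not the code from the gist; the names `RotL64`, `SipRoundScalar` and `PackedRotateMatchesScalar` are made up for illustration.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Scalar 64-bit rotate-left, valid for 1 <= b <= 63.
// Compilers typically turn this pattern into a single rotate instruction on x86-64.
static inline uint64_t RotL64(uint64_t x, int b)
{
    return (x << b) | (x >> (64 - b));
}

// One SipRound over the four 64-bit state words, as in the SipHash reference design.
static inline void SipRoundScalar(uint64_t& v0, uint64_t& v1, uint64_t& v2, uint64_t& v3)
{
    v0 += v1; v1 = RotL64(v1, 13); v1 ^= v0; v0 = RotL64(v0, 32);
    v2 += v3; v3 = RotL64(v3, 16); v3 ^= v2;
    v0 += v3; v3 = RotL64(v3, 21); v3 ^= v0;
    v2 += v1; v1 = RotL64(v1, 17); v1 ^= v2; v2 = RotL64(v2, 32);
}

// Rotates two 64-bit lanes with SSE2 (shift left, shift right, OR) and checks the
// result against the scalar rotate. Hypothetical helper for illustration only;
// on targets without SSE2 it does nothing and returns true.
static inline bool PackedRotateMatchesScalar(uint64_t lo, uint64_t hi, int b)
{
#if defined(__SSE2__)
    const __m128i x = _mm_set_epi64x(static_cast<long long>(hi), static_cast<long long>(lo));
    // No packed rotate in SSE2: three instructions replace one scalar rotate.
    const __m128i r = _mm_or_si128(_mm_sll_epi64(x, _mm_cvtsi32_si128(b)),
                                   _mm_srl_epi64(x, _mm_cvtsi32_si128(64 - b)));
    uint64_t out[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), r);
    return out[0] == RotL64(lo, b) && out[1] == RotL64(hi, b);
#else
    (void)lo; (void)hi; (void)b;
    return true;
#endif
}
```

A single hash also has only four state words with a serial dependency between the steps of each round, so there is little independent work to fill the vector lanes. Whether either effect explains the 2x gap above would need profiling.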