@peterdettman In #173 (comment) you mentioned an alternative approach to the multiplication that would be well suited to SIMD parallelization.
Back then you said "I have sample C code for 5x52 and could whip up a 10x26 version". Could you send this to me? I'm especially interested in 10x26, since ARM NEON only has 32x32->64 multiplication, though I could probably work out the 10x26 version myself from the 5x52 one once I understand the approach.
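To illustrate why 10x26 is the natural fit here, below is a rough scalar sketch of the arithmetic constraint. It is my own illustration, not the code from #173, and it assumes fully reduced 26-bit limbs. The names `mul_wide` and `worst_case_column_10x26` are hypothetical. A 26-bit limb fits in a 32-bit operand, and ten 52-bit products can be summed in a 64-bit accumulator without overflow, whereas 52-bit limbs would need a 64x64->128 multiply.

```c
#include <stdint.h>

/* Widening 32x32->64 multiply: the scalar analogue of the only
 * widening multiply assumed to be available per SIMD lane. */
static uint64_t mul_wide(uint32_t a, uint32_t b) {
    return (uint64_t)a * b;
}

/* Worst-case column sum in a 10x26 schoolbook multiplication,
 * assuming every limb is fully reduced to 26 bits: ten products
 * of maximal limbs accumulated into one 64-bit word. */
static uint64_t worst_case_column_10x26(void) {
    const uint32_t max_limb = (UINT32_C(1) << 26) - 1;
    uint64_t acc = 0;
    for (int i = 0; i < 10; i++) {
        acc += mul_wide(max_limb, max_limb);
    }
    return acc;
}
```

Each product is below 2^52, so the ten-term sum stays below 2^56 and leaves several bits of headroom for carries in a 64-bit accumulator.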