This avoids the potential speed difference, present in the byte-slicing approach, between reading from the beginning and the end of a cache line. It’s also slightly faster.
I’ve looked at the code generated with -O3, and it looks like it really does iterate over all the data, but it’s hard to be sure. The result is slower than an equivalent that just picks the right value to add directly.