In my benchmark, this rewrite improves runtime 27% (g++) to 46% (clang++) for the benchmark HexStrBench:
g++ 11.2.0
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.94 | 1,061,381,310.36 | 0.7% | 12.00 | 3.01 | 3.990 | 1.00 | 0.0% | 0.01 | HexStrBench master
| 0.68 | 1,465,366,544.25 | 1.7% | 6.00 | 2.16 | 2.778 | 1.00 | 0.0% | 0.01 | HexStrBench branch
clang++ 13.0.1
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.80 | 1,244,713,415.92 | 0.9% | 10.00 | 2.56 | 3.913 | 0.50 | 0.0% | 0.01 | HexStrBench master
| 0.43 | 2,324,188,940.72 | 0.2% | 4.00 | 1.37 | 2.914 | 0.25 | 0.0% | 0.01 | HexStrBench branch
Note that the idea for this change comes from denis2342 in #23364. This is a rewrite so no unaligned accesses occur.
Also, the lookup table is now calculated at compile time, which hopefully makes the code a bit easier to review.