@maflcko, my point was that changing the benchmark slightly seems to change its performance considerably:
```cpp
static void SipHash_32b(benchmark::Bench& bench)
{
    uint256 x;
    uint64_t k1 = 0;
    bench.run([&] {
        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
    });
}
```
works with inputs such as:
<img src="https://github.com/bitcoin/bitcoin/assets/1841944/66d3022e-fa15-4950-bf12-c7db9896d766">
and the benchmark results in:
```bash
make -j10 && ./src/bench/bench_bitcoin --filter='SipHash_32b' --min-time=10000
```

| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 35.11 | 28,479,847.20 | 0.1% | 11.00 | SipHash_32b
Changing the benchmark so that every input starts from a random value, every 64-bit chunk of `x` is modified on each iteration, and the `SipHashUint256` result is consumed via `doNotOptimizeAway`, as follows:
```cpp
static void SipHash_32b_new(benchmark::Bench& bench)
{
    FastRandomContext rng(true);
    auto k0{rng.rand64()}, k1{rng.rand64()};
    auto x{rng.rand256()};
    auto* x_ptr{reinterpret_cast<uint64_t*>(x.data())};
    bench.run([&] {
        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, x));
        ++k0; ++k1; ++x_ptr[0]; ++x_ptr[1]; ++x_ptr[2]; ++x_ptr[3];
    });
}
```
which would work with inputs such as:
<img src="https://github.com/bitcoin/bitcoin/assets/1841944/5470f279-d3a9-444c-8460-b5febe3f1379">
results in the following benchmark:
```bash
make -j10 && ./src/bench/bench_bitcoin --filter='SipHash_32b_new' --min-time=10000
```

| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 21.54 | 46,420,932.76 | 0.1% | 10.98 | SipHash_32b_new
I've added doNotOptimizeAway to every other benchmark in crypto_hash.cpp and their values didn't change considerably.
For the record, the following is also running a lot faster:
```cpp
static void SipHash_32b_new(benchmark::Bench& bench)
{
    uint256 x;
    uint64_t k1 = 0;
    bench.run([&] {
        auto result = SipHashUint256(0, ++k1, x);
        ankerl::nanobench::doNotOptimizeAway(result);
        *((uint64_t*)x.begin()) += 1;
    });
}
```
but adding `result` instead of `1` makes it slow again:
```cpp
*((uint64_t*)x.begin()) += result;
```
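
One possible reading of the last two variants (an assumption on my side, not something I measured separately): when the hash result is fed back into `x`, each call has to wait for the previous one to finish, so the benchmark measures latency. When `x` only advances by a counter, the calls are independent and the CPU may overlap them. A minimal standalone sketch of that effect, using a hypothetical `toy_hash` stand-in rather than `SipHashUint256`, and hypothetical helper names (`run_dependent`, `run_independent`, `ns_per_op`):

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a hash with multi-cycle latency. This is NOT SipHash.
static inline uint64_t toy_hash(uint64_t key, uint64_t data)
{
    uint64_t v = key ^ data;
    for (int round = 0; round < 4; ++round) {
        v ^= v >> 33;
        v *= 0xff51afd7ed558ccdULL;
    }
    return v ^ (v >> 33);
}

// The next input is the previous result, so the iterations form one serial chain.
uint64_t run_dependent(uint64_t iterations)
{
    uint64_t data = 0;
    for (uint64_t k = 1; k <= iterations; ++k) {
        data = toy_hash(k, data);
    }
    return data;
}

// The next input only depends on a counter, so the hash calls are independent.
// Results are folded into a cheap XOR sink so they are not discarded.
uint64_t run_independent(uint64_t iterations)
{
    uint64_t data = 0;
    uint64_t sink = 0;
    for (uint64_t k = 1; k <= iterations; ++k) {
        sink ^= toy_hash(k, data);
        ++data;
    }
    return sink;
}

// Times one call of fn(iterations) and returns the average nanoseconds per iteration.
// The value computed by fn is stored in `out` so the caller can consume it.
template <typename F>
double ns_per_op(F&& fn, uint64_t iterations, uint64_t& out)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    out = fn(iterations);
    const auto stop = std::chrono::steady_clock::now();
    const double total_ns = std::chrono::duration<double, std::nano>(stop - start).count();
    return total_ns / static_cast<double>(iterations);
}
```

Calling `ns_per_op(run_dependent, n, out)` and `ns_per_op(run_independent, n, out)` with the same `n` and printing both values (plus `out`) shows the gap, if there is one. The numbers depend on compiler, flags and CPU, so I'm not quoting any here.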