WIP: Simplify SipHash #30317

l0rinc commented at 9:58 PM on June 20, 2024: contributor

WIP, not finished yet.

Profiling ./bitcoind -reindex-chainstate -checkblocks=10000 revealed that hashing is a bottleneck: <img src="https://github.com/bitcoin/bitcoin/assets/1841944/8fa91d78-a5f0-4f4e-a281-f3ff1548b536">

I tried optimizing it, but the benchmark seemed off, it showed a 10% speedup which didn't replicate on CI - so I unified the bechmark to follow the format of other ones, which seems to have fixed the inconsistent behavior.

I didn't end up optimizing the code (has the exact same speed as before), but if I've already simplified it, I'll push it anyway.

DrahtBot commented at 9:58 PM on June 20, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process. A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#30349 (benchmark: Improve SipHash_32b accuracy to avoid potential optimization issues by paplorinc)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

fanquake commented at 10:02 AM on June 21, 2024: member

The changes make it ~10% faster

I tested this change on an aarch64 machine with GCC 14.1.1, and the benchmark was ~ the same, or slower: Master

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|               27.32 |       36,601,145.02 |    0.2% |          201.00 |           80.58 |  2.494 |           3.00 |    0.0% |     11.01 | `SipHash_32b`

PR

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|               28.66 |       34,891,459.89 |    0.2% |          240.00 |           83.01 |  2.891 |          11.00 |    0.0% |     10.99 | `SipHash_32b`

I tested on the same machine with Clang 18.1.6, and the performance was also ~ the same, or slower: Master

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|               25.89 |       38,623,097.25 |    0.0% |          197.00 |           77.18 |  2.552 |           3.00 |    0.0% |     11.01 | `SipHash_32b`

PR

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|               27.00 |       37,036,922.01 |    0.2% |          197.00 |           80.44 |  2.449 |           3.00 |    0.0% |     10.92 | `SipHash_32b

l0rinc commented at 10:07 AM on June 21, 2024: contributor

Thanks for checking @fanquake, I noticed that the coverage report also reported the same - seems the M1 processor is doing something differently.

l0rinc renamed this:
~~WIP Optimize SipHash~~
WIP Simplify SipHash
on Jun 21, 2024

l0rinc force-pushed on Jun 21, 2024

Add SipHash_32b_Extra benchmark since it's on the critical path

Also removed the `*((uint64_t*)` introduced in 619d5691c20bee5d08be2ce85aafa2cb570dbca4 since it seems to distort measurement (and no other benchmark uses this style)

0b56abc068

Convert SIPROUND macro to inline function with array reference 6c2af44e54

Extract common parts to other inline functions b6dd18153c

in src/bench/crypto_hash.cpp:195 in 2c58c9d9af outdated

 190 | @@ -190,9 +191,19 @@ static void SHA512(benchmark::Bench& bench)
 191 |  static void SipHash_32b(benchmark::Bench& bench)
 192 |  {
 193 |      uint256 x;
 194 | -    uint64_t k1 = 0;
 195 | +    uint64_t k0 = 0, k1 = 0;
 196 |      bench.run([&] {
 197 | -        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);

l0rinc commented at 4:30 PM on June 21, 2024:

@fanquake, is it possible that this way of benchmarking isn't actually representative? I tried comparing code that was obviously faster according to the benchmark by running the before and after code explicitly many times in a test and measuring total time - and the new version seems to reflect it more accurately. I tried it with ankerl::nanobench::doNotOptimizeAway as well, but the results seem to be exactly the same with the after code (but not with the before one).

maflcko commented at 1:52 PM on June 25, 2024:

One reason to assign the result is to give the compiler a hint that the result is uses (and should not be discarded). Otherwise, the compiler may optimize away the calculation (and the whole benchmark).

Another reason is to modify the value that is being hashed. Otherwise, the compiler may detect that the same value is hashed in every loop iteration and will only calculate it once, effectively optimizing away the calculation (and the whole benchmark). (While I don't think it matters here, in theory calling the function on a single special value may show a different benchmark result than calling it on a variety of values).

l0rinc commented at 1:58 PM on June 25, 2024:

Yes, I understand the reason, that's why doNotOptimizeAway is needed in some cases. But here it seems to produce the wrong result. I have validated it with a simple timed test, calculating a million different hashes with the two versions of the algorithm, which were giving different benchmarking speeds - and this new version resembles it more closely. Also, we're not using this format in any other benchmarks I've seen.

maflcko commented at 2:26 PM on June 25, 2024:

But here it seems to produce the wrong result. I have validated it with a simple timed test

I am not sure what you are trying to say. If you want to claim that something is wrong, you will have to provide an explanation or at least steps to reproduce exactly. Otherwise there is nothing that can be done.

maflcko commented at 2:28 PM on June 25, 2024:

Also, mixing "fixes" with "refactor" is wrong anyway. A commit should either be a fix or a refactor or a feature change. This is explained in the contrib guide in this repo.

l0rinc commented at 2:39 PM on June 25, 2024:

You're right, that's why it's still a draft. Thanks for the early input.

l0rinc commented at 4:10 PM on June 26, 2024:

@maflcko, so my point was that it seems to me that changing the benchmark slightly changes it's performance considerably:

static void SipHash_32b(benchmark::Bench& bench)
{
    uint256 x;
    uint64_t k1 = 0;
    bench.run([&] {
        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
    });
}

works with inputs such as: <img src="https://github.com/bitcoin/bitcoin/assets/1841944/66d3022e-fa15-4950-bf12-c7db9896d766"> and the benchmark results in:

make -j10 && ./src/bench/bench_bitcoin --filter='SipHash_32b' --min-time=10000

	ns/op	op/s	err%	total	benchmark
	35.11	28,479,847.20	0.1%	11.00	`SipHash_32b`

Changing the benchmark by adding starting values for each input and modifying every 64 bit chunk of x, and consuming the SipHashUint256 result via doNotOptimizeAway, as follows:

static void SipHash_32b_new(benchmark::Bench& bench)
{
    FastRandomContext rng(true);
    auto k0{rng.rand64()}, k1{rng.rand64()};
    auto x{rng.rand256()};
    auto* x_ptr{reinterpret_cast<uint64_t*>(x.data())};
    bench.run([&] {
        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, x));
        ++k0; ++k1; ++x_ptr[0]; ++x_ptr[1]; ++x_ptr[2]; ++x_ptr[3];
    });
}

which would work with inputs such as: <img src="https://github.com/bitcoin/bitcoin/assets/1841944/5470f279-d3a9-444c-8460-b5febe3f1379"> results in the following benchmark:

make -j10 && ./src/bench/bench_bitcoin --filter='SipHash_32b_new' --min-time=10000

	ns/op	op/s	err%	total	benchmark
	21.54	46,420,932.76	0.1%	10.98	`SipHash_32b_new`

I've added doNotOptimizeAway to every other benchmark in crypto_hash.cpp and their values didn't change considerably.

For the record, the following is also running a lot faster:

static void SipHash_32b_new(benchmark::Bench& bench)
{
    uint256 x;
    uint64_t k1 = 0;
    bench.run([&] {
        auto result = SipHashUint256(0, ++k1, x);
        ankerl::nanobench::doNotOptimizeAway(result);
        *((uint64_t*)x.begin()) += 1;
    });
}

but adding result instead of the 1 makes it slow again:

        *((uint64_t*)x.begin()) += result;

maflcko commented at 4:15 PM on June 26, 2024:

Yes, this is expected. If you change the code that is benchmarked, the compiler will compile it to something else, which will perform differently, because it calculates something else.

l0rinc commented at 4:29 PM on June 26, 2024:

Can you please be more specific that this? My intention is to measure SipHash more realistically, since it didn't behave as I expected before.

sipa commented at 4:31 PM on June 26, 2024:

For very simple functions, variations in performance that are just due to e.g. how code is aligned in memory cannot be avoided.

l0rinc commented at 4:37 PM on June 26, 2024:

Thanks @sipa, do you agree that

static void SipHash_32b(benchmark::Bench& bench)
{
    FastRandomContext rng(true);
    auto k0{rng.rand64()}, k1{rng.rand64()};
    auto x{rng.rand256()};
    auto* x_ptr{reinterpret_cast<uint64_t*>(x.data())};
    bench.run([&] {
        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, x));
        ++k0; ++k1; ++x_ptr[0]; ++x_ptr[1]; ++x_ptr[2]; ++x_ptr[3];
    });
}

likely produces more realistic results (see above examples)? The current one gave me a few false optimization hopes - which this one didn't.

maflcko commented at 10:23 AM on June 27, 2024:

Yeah, seems fine to remove the recursion, given that real code probably doesn't evaluate the hash recursively.

Unrelated, I don't think uint256 is generally guaranteed to be uint64_t-aligned. While probably not a problem here, the cast won't work in the general case:

bench/crypto_hash.cpp: runtime error: store to misaligned address 0x7ffe64f6caa1 for type 'uint64_t' (aka 'unsigned long'), which requires 8 byte alignment
0x7ffe64f6caa1: note: pointer points here
 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
              ^

l0rinc commented at 2:05 PM on June 27, 2024:

Thank you, I've pushed it in https://github.com/bitcoin/bitcoin/pull/30349

l0rinc force-pushed on Jun 25, 2024

l0rinc renamed this:
~~WIP Simplify SipHash~~
refactor: Simplify SipHash
on Jun 25, 2024

DrahtBot added the label Refactoring on Jun 25, 2024

l0rinc renamed this:
~~refactor: Simplify SipHash~~
refactor: Simplify SipHash and fix benchmarks
on Jun 25, 2024

l0rinc renamed this:
~~refactor: Simplify SipHash and fix benchmarks~~
refactor: Simplify SipHash and fix related benchmarks
on Jun 25, 2024

l0rinc renamed this:
~~refactor: Simplify SipHash and fix related benchmarks~~
WIP: Simplify SipHash
on Jun 25, 2024

l0rinc closed this on Jul 2, 2024

l0rinc deleted the branch on Jul 2, 2024

achow101 referenced this in commit 380e1f44e8 on Nov 14, 2024

bitcoin locked this on Jul 2, 2025