WIP: Simplify SipHash #30317

pull paplorinc wants to merge 3 commits into bitcoin:master from paplorinc:paplorinc/siphash changing 2 files +78 −111
  1. paplorinc commented at 9:58 pm on June 20, 2024: contributor

    WIP, not finished yet.

    Profiling ./bitcoind -reindex-chainstate -checkblocks=10000 revealed that hashing is a bottleneck:

    I tried optimizing it, but the benchmark seemed off, it showed a 10% speedup which didn’t replicate on CI - so I unified the bechmark to follow the format of other ones, which seems to have fixed the inconsistent behavior.

    I didn’t end up optimizing the code (has the exact same speed as before), but if I’ve already simplified it, I’ll push it anyway.

  2. DrahtBot commented at 9:58 pm on June 20, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30349 (benchmark: Improve SipHash_32b accuracy to avoid potential optimization issues by paplorinc)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. fanquake commented at 10:02 am on June 21, 2024: member

    The changes make it ~10% faster

    I tested this change on an aarch64 machine with GCC 14.1.1, and the benchmark was ~ the same, or slower: Master

    0|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    2|               27.32 |       36,601,145.02 |    0.2% |          201.00 |           80.58 |  2.494 |           3.00 |    0.0% |     11.01 | `SipHash_32b`
    

    PR

    0|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    2|               28.66 |       34,891,459.89 |    0.2% |          240.00 |           83.01 |  2.891 |          11.00 |    0.0% |     10.99 | `SipHash_32b`
    

    I tested on the same machine with Clang 18.1.6, and the performance was also ~ the same, or slower: Master

    0|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    2|               25.89 |       38,623,097.25 |    0.0% |          197.00 |           77.18 |  2.552 |           3.00 |    0.0% |     11.01 | `SipHash_32b`
    

    PR

    0|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
    1|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    2|               27.00 |       37,036,922.01 |    0.2% |          197.00 |           80.44 |  2.449 |           3.00 |    0.0% |     10.92 | `SipHash_32b
    
  4. paplorinc commented at 10:07 am on June 21, 2024: contributor
    Thanks for checking @fanquake, I noticed that the coverage report also reported the same - seems the M1 processor is doing something differently.
  5. paplorinc renamed this:
    WIP Optimize SipHash
    WIP Simplify SipHash
    on Jun 21, 2024
  6. paplorinc force-pushed on Jun 21, 2024
  7. Add SipHash_32b_Extra benchmark since it's on the critical path
    Also removed the `*((uint64_t*)` introduced in 619d5691c20bee5d08be2ce85aafa2cb570dbca4 since it seems to distort measurement (and no other benchmark uses this style)
    0b56abc068
  8. Convert SIPROUND macro to inline function with array reference 6c2af44e54
  9. Extract common parts to other inline functions b6dd18153c
  10. in src/bench/crypto_hash.cpp:195 in 2c58c9d9af outdated
    190@@ -190,9 +191,19 @@ static void SHA512(benchmark::Bench& bench)
    191 static void SipHash_32b(benchmark::Bench& bench)
    192 {
    193     uint256 x;
    194-    uint64_t k1 = 0;
    195+    uint64_t k0 = 0, k1 = 0;
    196     bench.run([&] {
    197-        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
    


    paplorinc commented at 4:30 pm on June 21, 2024:
    @fanquake, is it possible that this way of benchmarking isn’t actually representative? I tried comparing code that was obviously faster according to the benchmark by running the before and after code explicitly many times in a test and measuring total time - and the new version seems to reflect it more accurately. I tried it with ankerl::nanobench::doNotOptimizeAway as well, but the results seem to be exactly the same with the after code (but not with the before one).

    maflcko commented at 1:52 pm on June 25, 2024:

    One reason to assign the result is to give the compiler a hint that the result is uses (and should not be discarded). Otherwise, the compiler may optimize away the calculation (and the whole benchmark).

    Another reason is to modify the value that is being hashed. Otherwise, the compiler may detect that the same value is hashed in every loop iteration and will only calculate it once, effectively optimizing away the calculation (and the whole benchmark). (While I don’t think it matters here, in theory calling the function on a single special value may show a different benchmark result than calling it on a variety of values).


    paplorinc commented at 1:58 pm on June 25, 2024:
    Yes, I understand the reason, that’s why doNotOptimizeAway is needed in some cases. But here it seems to produce the wrong result. I have validated it with a simple timed test, calculating a million different hashes with the two versions of the algorithm, which were giving different benchmarking speeds - and this new version resembles it more closely. Also, we’re not using this format in any other benchmarks I’ve seen.

    maflcko commented at 2:26 pm on June 25, 2024:

    But here it seems to produce the wrong result. I have validated it with a simple timed test

    I am not sure what you are trying to say. If you want to claim that something is wrong, you will have to provide an explanation or at least steps to reproduce exactly. Otherwise there is nothing that can be done.


    maflcko commented at 2:28 pm on June 25, 2024:
    Also, mixing “fixes” with “refactor” is wrong anyway. A commit should either be a fix or a refactor or a feature change. This is explained in the contrib guide in this repo.

    paplorinc commented at 2:39 pm on June 25, 2024:
    You’re right, that’s why it’s still a draft. Thanks for the early input.

    paplorinc commented at 4:10 pm on June 26, 2024:

    @maflcko, so my point was that it seems to me that changing the benchmark slightly changes it’s performance considerably:

    0static void SipHash_32b(benchmark::Bench& bench)
    1{
    2    uint256 x;
    3    uint64_t k1 = 0;
    4    bench.run([&] {
    5        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
    6    });
    7}
    

    works with inputs such as: and the benchmark results in:

    make -j10 && ./src/bench/bench_bitcoin –filter=‘SipHash_32b’ –min-time=10000

    ns/op op/s err% total benchmark
    35.11 28,479,847.20 0.1% 11.00 SipHash_32b

    Changing the benchmark by adding starting values for each input and modifying every 64 bit chunk of x, and consuming the SipHashUint256 result via doNotOptimizeAway, as follows:

     0static void SipHash_32b_new(benchmark::Bench& bench)
     1{
     2    FastRandomContext rng(true);
     3    auto k0{rng.rand64()}, k1{rng.rand64()};
     4    auto x{rng.rand256()};
     5    auto* x_ptr{reinterpret_cast<uint64_t*>(x.data())};
     6    bench.run([&] {
     7        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, x));
     8        ++k0; ++k1; ++x_ptr[0]; ++x_ptr[1]; ++x_ptr[2]; ++x_ptr[3];
     9    });
    10}
    

    which would work with inputs such as: results in the following benchmark:

    make -j10 && ./src/bench/bench_bitcoin –filter=‘SipHash_32b_new’ –min-time=10000

    ns/op op/s err% total benchmark
    21.54 46,420,932.76 0.1% 10.98 SipHash_32b_new

    I’ve added doNotOptimizeAway to every other benchmark in crypto_hash.cpp and their values didn’t change considerably.

    For the record, the following is also running a lot faster:

    0static void SipHash_32b_new(benchmark::Bench& bench)
    1{
    2    uint256 x;
    3    uint64_t k1 = 0;
    4    bench.run([&] {
    5        auto result = SipHashUint256(0, ++k1, x);
    6        ankerl::nanobench::doNotOptimizeAway(result);
    7        *((uint64_t*)x.begin()) += 1;
    8    });
    9}
    

    but adding result instead of the 1 makes it slow again:

    0        *((uint64_t*)x.begin()) += result;
    

    maflcko commented at 4:15 pm on June 26, 2024:
    Yes, this is expected. If you change the code that is benchmarked, the compiler will compile it to something else, which will perform differently, because it calculates something else.

    paplorinc commented at 4:29 pm on June 26, 2024:
    Can you please be more specific that this? My intention is to measure SipHash more realistically, since it didn’t behave as I expected before.

    sipa commented at 4:31 pm on June 26, 2024:
    For very simple functions, variations in performance that are just due to e.g. how code is aligned in memory cannot be avoided.

    paplorinc commented at 4:37 pm on June 26, 2024:

    Thanks @sipa, do you agree that

     0static void SipHash_32b(benchmark::Bench& bench)
     1{
     2    FastRandomContext rng(true);
     3    auto k0{rng.rand64()}, k1{rng.rand64()};
     4    auto x{rng.rand256()};
     5    auto* x_ptr{reinterpret_cast<uint64_t*>(x.data())};
     6    bench.run([&] {
     7        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, x));
     8        ++k0; ++k1; ++x_ptr[0]; ++x_ptr[1]; ++x_ptr[2]; ++x_ptr[3];
     9    });
    10}
    

    likely produces more realistic results (see above examples)? The current one gave me a few false optimization hopes - which this one didn’t.


    maflcko commented at 10:23 am on June 27, 2024:

    Yeah, seems fine to remove the recursion, given that real code probably doesn’t evaluate the hash recursively.

    Unrelated, I don’t think uint256 is generally guaranteed to be uint64_t-aligned. While probably not a problem here, the cast won’t work in the general case:

    0bench/crypto_hash.cpp: runtime error: store to misaligned address 0x7ffe64f6caa1 for type 'uint64_t' (aka 'unsigned long'), which requires 8 byte alignment
    10x7ffe64f6caa1: note: pointer points here
    2 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
    3              ^ 
    

    paplorinc commented at 2:05 pm on June 27, 2024:
    Thank you, I’ve pushed it in https://github.com/bitcoin/bitcoin/pull/30349
  11. paplorinc force-pushed on Jun 25, 2024
  12. paplorinc renamed this:
    WIP Simplify SipHash
    refactor: Simplify SipHash
    on Jun 25, 2024
  13. DrahtBot added the label Refactoring on Jun 25, 2024
  14. paplorinc renamed this:
    refactor: Simplify SipHash
    refactor: Simplify SipHash and fix benchmarks
    on Jun 25, 2024
  15. paplorinc renamed this:
    refactor: Simplify SipHash and fix benchmarks
    refactor: Simplify SipHash and fix related benchmarks
    on Jun 25, 2024
  16. paplorinc renamed this:
    refactor: Simplify SipHash and fix related benchmarks
    WIP: Simplify SipHash
    on Jun 25, 2024
  17. paplorinc closed this on Jul 2, 2024

  18. paplorinc deleted the branch on Jul 2, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-11-21 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me