benchmark: Improve SipHash_32b accuracy to avoid potential optimization issues

l0rinc commented at 2:05 pm on June 27, 2024: contributor

This PR stems from the discussions in #30317 (review)

The previous benchmark for SipHash was slightly less accurate in representing real-world usage and allowed for potential compiler optimizations that could invalidate the benchmark. This change aims to ensure the benchmark produces more realistic results.

By modifying the initial values and only incrementing the bytes of val, the benchmark should reflects a more typical usage patterns - and prevent the compiler from optimizing away the calculations.

On my M1 processor the benchmark’s speed changed significantly (but the CI seems to produce the same result as before):

cmake -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCH=ON && cmake –build build -j10 && ./build/src/bench/bench_bitcoin –filter=SipHash_32b –min-time=1000

Before:

ns/op	op/s	err%	total	benchmark
35.15	28,445,856.66	0.2%	1.10	`SipHash_32b`

After (note that only the benchmark changed):

ns/op	op/s	err%	total	benchmark
22.05	45,350,886.64	0.3%	1.10	`SipHash_32b`

DrahtBot commented at 2:05 pm on June 27, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30349.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	maflcko, hodlinator, achow101
Stale ACK	laanwj

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

l0rinc renamed this:
~~benchmark: Refactor SipHash_32b to improve accuracy and potential avoid optimization issues~~
refactor: Refactor SipHash_32b to improve accuracy and potential avoid optimization issues
on Jun 27, 2024

DrahtBot added the label Refactoring on Jun 27, 2024

l0rinc renamed this:
~~refactor: Refactor SipHash_32b to improve accuracy and potential avoid optimization issues~~
refactor: Improve SipHash_32b accuracy and potential avoid optimization issues
on Jun 27, 2024

l0rinc renamed this:
~~refactor: Improve SipHash_32b accuracy and potential avoid optimization issues~~
refactor: Improve SipHash_32b accuracy to avoid potential optimization issues
on Jun 27, 2024

maflcko commented at 2:10 pm on June 27, 2024: member

Changing the behavior of something is not a refactor.

l0rinc renamed this:
~~refactor: Improve SipHash_32b accuracy to avoid potential optimization issues~~
benchmark: Improve SipHash_32b accuracy to avoid potential optimization issues
on Jun 27, 2024

maflcko removed the label Refactoring on Jun 27, 2024

DrahtBot added the label CI failed on Jul 22, 2024

DrahtBot commented at 10:17 am on July 22, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/26763788481

Make sure to run all tests locally, according to the documentation.

The failure may happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

l0rinc force-pushed on Jul 22, 2024

DrahtBot removed the label CI failed on Jul 22, 2024

achow101 requested review from laanwj on Oct 15, 2024

achow101 requested review from josibake on Oct 15, 2024

laanwj added the label Tests on Oct 16, 2024

in src/bench/crypto_hash.cpp:200 in 8c4e14deda outdated

195+    auto k0 = rng.rand64(), k1 = rng.rand64();
196+    auto val = rng.rand256();
197     bench.run([&] {
198-        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
199+        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, val));
200+        ++k0; ++k1; ++(*val.begin());

laanwj commented at 9:21 am on October 16, 2024:

nit: please put multiple statements on multiple lines, this is pretty hard to read code as-is :sweat_smile:

l0rinc commented at 1:11 pm on October 16, 2024:

Thanks, done

laanwj commented at 9:22 am on October 16, 2024: member

ACK on overall change, small nit

l0rinc force-pushed on Oct 16, 2024

laanwj approved

laanwj commented at 8:05 am on October 17, 2024: member

ACK c49cc30b8185a50b4aa00cf705d313c8aa9482a3

in src/bench/crypto_hash.cpp:198 in c49cc30b81 outdated

196-    uint64_t k1 = 0;
197+    FastRandomContext rng(true);
198+    auto k0 = rng.rand64(), k1 = rng.rand64();
199+    auto val = rng.rand256();
200     bench.run([&] {
201-        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);

maflcko commented at 10:01 am on October 17, 2024:

Maybe I am wrong, but in theory I think that a byte array has weaker memory alignment requirements than uint64_t. So this is (in a strict sense) undefined behavior anyway. I presume it could be tested by adding a padding byte to uint256 and compile with ubsan, and then run this bench to observe the failure.

laanwj commented at 10:24 am on October 17, 2024:

This is a great point. Instead of the type-pun, we’d likely want to make this safer by assigning the result to a uint64_t then memcpy it in instead. That likely isn’t going to be significantly slower.

Edit: Wait, this code is being removed. Sorry.

maflcko commented at 10:30 am on October 17, 2024:

Edit: Wait, this code is being removed. Sorry.

Yeah, sorry, I was just trying to come up with another reason to change the code here :sweat_smile:

l0rinc commented at 10:48 am on October 17, 2024:

Do you have any concrete recommendation that I could apply here?

maflcko commented at 11:31 am on October 17, 2024:

Sorry again, I am just trying to add useful context that hasn’t been mentioned before. I just have an allergy against undefined behavior, so I wanted to mention that the previous code may have had one.

As a fun fact, I tried it myself and in theory the UB should look like:

 0src/bench/crypto_hash.cpp:198:9: runtime error: store to misaligned address 0x7fffdb31cb81 for type 'uint64_t' (aka 'unsigned long'), which requires 8 byte alignment
 10x7fffdb31cb81: note: pointer points here
 2 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00
 3              ^ 
 4    [#0](/bitcoin-bitcoin/0/) 0x556ee2095df7 in SipHash_32b(ankerl::nanobench::Bench&)::$_0::operator()() const bld-cmake/src/bench/./src/bench/crypto_hash.cpp:198:33
 5    [#1](/bitcoin-bitcoin/1/) 0x556ee2095df7 in ankerl::nanobench::Bench& ankerl::nanobench::Bench::run<SipHash_32b(ankerl::nanobench::Bench&)::$_0>(SipHash_32b(ankerl::nanobench::Bench&)::$_0&&) bld-cmake/src/bench/./src/bench/nanobench.h:1221:13
 6    [#2](/bitcoin-bitcoin/2/) 0x556ee2092fc6 in SipHash_32b(ankerl::nanobench::Bench&) bld-cmake/src/bench/./src/bench/crypto_hash.cpp:197:11
 7
 8...
 9
10SUMMARY: UndefinedBehaviorSanitizer: misaligned-pointer-use ./src/bench/crypto_hash.cpp:198:9

l0rinc commented at 11:39 am on October 17, 2024:

Thank you for checking!

maflcko approved

maflcko commented at 10:02 am on October 17, 2024: member

lgtm ACK c49cc30b8185a50b4aa00cf705d313c8aa9482a3

maflcko added this to the milestone 29.0 on Oct 17, 2024

in src/bench/crypto_hash.cpp:202 in c49cc30b81 outdated

200     bench.run([&] {
201-        *((uint64_t*)x.begin()) = SipHashUint256(0, ++k1, x);
202+        ankerl::nanobench::doNotOptimizeAway(SipHashUint256(k0, k1, val));
203+        ++k0;
204+        ++k1;
205+        ++(*val.begin());

maflcko commented at 4:25 pm on October 17, 2024:

This fails ubsan:

0src/bench/crypto_hash.cpp:202:9: runtime error: implicit conversion from type 'int' of value 256 (32-bit, signed) to type 'unsigned char' changed the value to 0 (8-bit, unsigned)
1    [#0](/bitcoin-bitcoin/0/) 0x55b763589f3f in SipHash_32b(ankerl::nanobench::Bench&)::$_0::operator()() const bld-cmake/src/bench/./src/bench/crypto_hash.cpp:202:9
2
3
4...
5
6SUMMARY: UndefinedBehaviorSanitizer: implicit-signed-integer-truncation /home/marco/workspace/b-c/src/bench/crypto_hash.cpp:202:9

To reproduce you’ll have to run the bench under ubsan/isan without -sanity-check.

l0rinc commented at 12:26 pm on October 18, 2024:

I couldn’t reproduce the warning locally, but changed it to modify every value in val.data() instead to make sure none of them can be optimized away - would you be so kind and check if this solves the undefined behavior?

laanwj commented at 10:09 am on October 20, 2024:

i’m very confused about this one; begin() returns a unsigned char*, and unsigned values are definitely allowed to overflow within defined behavior. Also, m_data is an array (of fixed size) so by definition [0] will not be out of bounds.

laanwj commented at 11:43 am on October 20, 2024:

a workaround may be to do val.data()[0] = (val.data()[0] + 1) & 0xff; to make the wrap-around explicit, the C++ compiler will almost certainly just optimize it away. It’s also slightly more readable i think.

sipa commented at 11:47 am on October 20, 2024:

This is not UB, but ubsan is explicitly told to warn for certain error-prone but implementation-defined or well-defined behaviors.

If the effect is deliberate, it is possible to add the function name to the ubsan suppressions.

l0rinc commented at 11:48 am on October 20, 2024:

Would you prefer I ‘& 0xFF’ the current implementation?

laanwj commented at 12:56 pm on October 20, 2024:

dunno, i would prefer &0xff to adding a supression (if that works to make the error go away, ofc), it more explicitly defines what is expected

maflcko commented at 2:05 pm on October 23, 2024:

Yeah, this is a bit confusing. UndefinedBehaviorSanitizer is just the name of the tool, not an implication that the issue it found is UB.

If you want to reproduce, you’ll have to compile with the integer sanitizer (yes, to add to the confusion, the tool is called different here). Also, it only works with clang++/clang, so you’ll have to select that as well during cmake -B. Maybe this can be documented in doc/developer-notes.md?

Edit: (reply to &0xff): An alternative to &0xff would be to cast to the underlying type uint8_t(bla), but both are fine and identical.

maflcko commented at 7:31 pm on October 24, 2024:

Maybe this can be documented in doc/developer-notes.md?

(An alternative would be to add the CI configs to the cmake presets file, because the ISan is part of one. )

l0rinc commented at 11:35 am on October 31, 2024:

Thanks for the hints, I’ve replaced the increment of the bytes with a xor, this should keep the values in check. The overflow of i should be well-defined in this case, please let me know if you think any other change is needed since I couldn’t reproduce the sanitizer warnings. I have checked to make sure this still helps with the benchmarking of #30442. Done in https://github.com/bitcoin/bitcoin/compare/c88b1bde7500ace70a694f36a82521f92221936c..100cded580cfe9d69cc30866c29697d9658b4ce3 (also changed to brace inits since the spirit of @hodlinator was haunting me)

maflcko commented at 11:40 am on October 31, 2024:

The overflow of i should be well-defined in this case

Unsigned integer overflow is well-defined (and relied upon) either way. This is just about documenting where one is expected and where one is unexpected.

l0rinc commented at 11:43 am on October 31, 2024:

So are you ok with the current version?

maflcko commented at 11:54 am on October 31, 2024:

I just think that it would be nice to not print the warning and to document that the unsigned overflow is expected here.

Steps to reproduce on a fresh Ubuntu 24.04:

 0export DEBIAN_FRONTEND=noninteractive && apt update && apt install curl wget htop git vim ccache -y && git clone https://github.com/bitcoin/bitcoin.git  --depth=1 ./b-c && cd b-c && apt install build-essential cmake pkg-config  python3-zmq libzmq3-dev libevent-dev libboost-dev libsqlite3-dev  libdb++-dev clang llvm  -y
 1
 2git fetch origin --depth=1 100cded580cfe9d69cc30866c29697d9658b4ce3 && git checkout 100cded580cfe9d69cc30866c29697d9658b4ce3 && cmake -B ./bld-cmake -DSANITIZERS=integer -DCMAKE_C_COMPILER='clang' -DCMAKE_CXX_COMPILER='clang++'  -DBUILD_BENCH=ON                              && cmake --build ./bld-cmake --parallel  $(nproc)
 3
 4UBSAN_OPTIONS="suppressions=$(pwd)/test/sanitizer_suppressions/ubsan:print_stacktrace=1:halt_on_error=1:report_error_type=1" ./bld-cmake/src/bench/bench_bitcoin --filter=SipHash_32b
 5
 6...
 7
 8/b-c/src/bench/crypto_hash.cpp:203:9: runtime error: implicit conversion from type 'int' of value 256 (32-bit, signed) to type 'uint8_t' (aka 'unsigned char') changed the value to 0 (8-bit, unsigned)
 9    [#0](/bitcoin-bitcoin/0/) 0x5583348ec575 in SipHash_32b(ankerl::nanobench::Bench&)::$_0::operator()() const bld-cmake/src/bench/./src/bench/crypto_hash.cpp:203:9
10    [#1](/bitcoin-bitcoin/1/) 0x5583348ec575 in ankerl::nanobench::Bench& ankerl::nanobench::Bench::run<SipHash_32b(ankerl::nanobench::Bench&)::$_0>(SipHash_32b(ankerl::nanobench::Bench&)::$_0&&) bld-cmake/src/bench/./src/bench/nanobench.h:1221:13
11    [#2](/bitcoin-bitcoin/2/) 0x5583348ea256 in SipHash_32b(ankerl::nanobench::Bench&) bld-cmake/src/bench/./src/bench/crypto_hash.cpp:199:11
12    [#3](/bitcoin-bitcoin/3/) 0x558334873009 in std::function<void (ankerl::nanobench::Bench&)>::operator()(ankerl::nanobench::Bench&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/bits/std_function.h:591:9
13    [#4](/bitcoin-bitcoin/4/) 0x558334873009 in benchmark::BenchRunner::RunAll(benchmark::Args const&) bld-cmake/src/bench/./src/bench/bench.cpp:127:13
14    [#5](/bitcoin-bitcoin/5/) 0x55833486d35b in main bld-cmake/src/bench/./src/bench/bench_bitcoin.cpp:135:9
15    [#6](/bitcoin-bitcoin/6/) 0x7f9aad6d11c9  (/lib/x86_64-linux-gnu/libc.so.6+0x2a1c9) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
16    [#7](/bitcoin-bitcoin/7/) 0x7f9aad6d128a in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2a28a) (BuildId: 6d64b17fbac799e68da7ebd9985ddf9b5cb375e6)
17    [#8](/bitcoin-bitcoin/8/) 0x558334840bc4 in _start (/b-c/bld-cmake/src/bench/bench_bitcoin+0x1b6bc4) (BuildId: 6155e3af28ae68265b04bfa46b763f94bded03e5)
18
19SUMMARY: UndefinedBehaviorSanitizer: implicit-signed-integer-truncation /b-c/src/bench/crypto_hash.cpp:203:9

l0rinc commented at 11:40 am on November 3, 2024:

Thanks a lot @maflcko, that reproducer was super useful! Added a & 0xFF (or do we prefer 0xff? Found about the same number of occurrences in the code) and bumped the uint8_t to get rid of the overflow warning, see: https://github.com/bitcoin/bitcoin/compare/100cded580cfe9d69cc30866c29697d9658b4ce3..42066f45ff5d48e78a317eda63c035809bd657c6

l0rinc force-pushed on Oct 18, 2024

maflcko removed this from the milestone 29.0 on Oct 24, 2024

maflcko commented at 1:37 pm on October 30, 2024: member

Are you still working on this? If not, maybe turn into a draft for now?

l0rinc force-pushed on Oct 31, 2024

DrahtBot commented at 11:34 am on October 31, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32331774653

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot added the label CI failed on Oct 31, 2024

DrahtBot removed the label CI failed on Oct 31, 2024

Refactor SipHash_32b benchmark to improve accuracy and avoid optimization issues

- Modify `SipHash_32b` benchmark to use `FastRandomContext` for generating initial values.
- Cycle through and modify each byte of the `uint256` value to ensure no part of it can be optimized away.

The lack of "recursion" (where the method call overwrites the used inputs partially) and the systematic modification of each input byte makes the benchmark usage more reliable and thorough.

42066f45ff

l0rinc force-pushed on Nov 3, 2024

maflcko commented at 7:58 am on November 4, 2024: member

review ACK 42066f45ff5d48e78a317eda63c035809bd657c6

Didn’t re-check ubsan

DrahtBot requested review from laanwj on Nov 4, 2024

hodlinator commented at 10:03 am on November 4, 2024: contributor

Concept ACK 42066f45ff5d48e78a317eda63c035809bd657c6

More realistic to also change k0 and val. (The modification of the uint256 val only happens for one byte at a time, so benchmark time shouldn’t be that affected by some unrelated expensive instructions / memory access :+1: ).

UBSAN Repro

Managed to repro maflcko’s concern on older push c88b1bde7500ace70a694f36a82521f92221936c:

 0₿ UBSAN_OPTIONS="suppressions=$(pwd)/test/sanitizer_suppressions/ubsan:print_stacktrace=1:halt_on_error=1:report_error_type=1" ./build/src/bench/bench_bitcoin --filter=SipHash_32b
 1/home/hodlinator/bitcoin/src/bench/crypto_hash.cpp:203:9: runtime error: implicit conversion from type 'int' of value 256 (32-bit, signed) to type 'unsigned char' changed the value to 0 (8-bit, unsigned)
 2    [#0](/bitcoin-bitcoin/0/) 0x5f6de628f68b  (/home/hodlinator/bitcoin/build/src/bench/bench_bitcoin+0x24d68b)
 3    [#1](/bitcoin-bitcoin/1/) 0x5f6de628d4ba  (/home/hodlinator/bitcoin/build/src/bench/bench_bitcoin+0x24b4ba)
 4    [#2](/bitcoin-bitcoin/2/) 0x5f6de6217da2  (/home/hodlinator/bitcoin/build/src/bench/bench_bitcoin+0x1d5da2)
 5    [#3](/bitcoin-bitcoin/3/) 0x5f6de6212d2a  (/home/hodlinator/bitcoin/build/src/bench/bench_bitcoin+0x1d0d2a)
 6    [#4](/bitcoin-bitcoin/4/) 0x75b2ee70127d  (/nix/store/sl141d1g77wvhr050ah87lcyz2czdxa3-glibc-2.40-36/lib/libc.so.6+0x2a27d) (BuildId: 6a788357de379aee7162f608d25a7692f7162b15)
 7    [#5](/bitcoin-bitcoin/5/) 0x75b2ee701338  (/nix/store/sl141d1g77wvhr050ah87lcyz2czdxa3-glibc-2.40-36/lib/libc.so.6+0x2a338) (BuildId: 6a788357de379aee7162f608d25a7692f7162b15)
 8    [#6](/bitcoin-bitcoin/6/) 0x5f6de61de094  (/home/hodlinator/bitcoin/build/src/bench/bench_bitcoin+0x19c094)
 9
10SUMMARY: UndefinedBehaviorSanitizer: implicit-signed-integer-truncation /home/hodlinator/bitcoin/src/bench/crypto_hash.cpp:203:9 in

Does not reproduce on current version (42066f45ff5d48e78a317eda63c035809bd657c6).

Benchmark results

Running on Intel workstation.

Before:

0₿ build/src/bench/bench_bitcoin -filter=SipHash_32b -min-time=5000
1
2|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
3|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
4|               24.63 |       40,601,125.32 |    0.0% |          256.00 |           90.84 |  2.818 |           3.00 |    0.0% |      5.50 | `SipHash_32b`

After:

0₿ build/src/bench/bench_bitcoin -filter=SipHash_32b -min-time=5000
1
2|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% |     total | benchmark
3|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
4|               26.30 |       38,022,455.52 |    0.0% |          266.00 |           97.02 |  2.742 |           3.00 |    0.0% |      5.50 | `SipHash_32b`

A decrease in op/s seems understandable.

Concern

I don’t understand why M1 Macs would have more ops after in the PR summary…? (Also - might need to refresh the results anyway after latest push).

l0rinc commented at 2:49 pm on November 4, 2024: contributor

I don’t understand why M1 Macs would have more ops after

My understand is that when we’ve reused the hash as input, it could only run a single instance since each one depended on the previous one. The new way is more predictable and clang often optimizes more aggressively, I guess it unrolls the instances. I can investigate if you think it’s important.

refresh the results anyway after latest push

Update the description, thanks.

hodlinator approved

hodlinator commented at 4:14 pm on November 4, 2024: contributor

ACK 42066f45ff5d48e78a317eda63c035809bd657c6

My understand is that when we’ve reused the hash as input, it could only run a single instance since each one depended on the previous one. The new way is more predictable and clang often optimizes more aggressively, I guess it unrolls the instances. I can investigate if you think it’s important.

Ah, that makes sense that the CPU/compiler wasn’t able to do as much out-of-order execution previously because of reusing the output from the last run as input for the next. Feels more accurate to change the benchmark in the way you’ve done. Finally was able to see the main tree in the forest of this little PR. :) No need to investigate further IMO.

achow101 commented at 9:24 pm on November 14, 2024: member

ACK 42066f45ff5d48e78a317eda63c035809bd657c6

achow101 merged this on Nov 14, 2024

achow101 closed this on Nov 14, 2024