bench: update nanobench add `-min

martinus commented at 10:44 am on September 18, 2021: contributor

This PR updates the nanobench with the latest release from upstream, v4.3.6. It fixes the missing performance counters.

Due to discussions on #22999 I have done some work that should make the benchmark results more reliable. It introduces a new flag -min_time that allows to run a benchmark for much longer then the default. When results are unreliable, choosing a large timeframe here should usually get repeatable results even when frequency scaling cannot be disabled. The default is now 10ms. For this to work I have changed the AddrManGood and EvictionProtection benchmarks so they work with any number of iterations.

Also, this adds more usage documentation to bench_bitcoin -h and I’ve cherry-picked two changes from #22999 authored by Jon Atack

fanquake added the label Upstream on Sep 18, 2021

in src/bench/peer_eviction.cpp:31 in 354fbc0022 outdated

32-    size_t i{0};
33+
34     bench.run([&] {
35-        ProtectEvictionCandidatesByRatio(copies.at(i));
36-        ++i;
37+        // creating a copy has an overhead of about 3%, so it does not influcence the benchmark results much.

jonatack commented at 1:12 pm on September 18, 2021:

3154897f nit, s/influcence/influence/

martinus commented at 10:52 am on September 19, 2021:

fixed in eb399ab0577dffe7e1080bb497ff642de1c8c359

in src/bench/addrman.cpp:123 in 354fbc0022 outdated

121+
122     bench.run([&] {
123-        markSomeAsGood(*addrmans.at(i));
124-        ++i;
125+        // To make the benchmark independent of the number of evaluations, we always prepare a new addrman.
126+        // This is necessary because CAddrMan::Good() method modifies theobject, affecting the timing of subsequent calls

jonatack commented at 1:19 pm on September 18, 2021:

c67f365 nit s/method modifies theobject/modifies the object/

martinus commented at 10:53 am on September 19, 2021:

fixed in 652faa16a1410b51a5ad78a3121cf6ab9329f640

in src/bench/bench_bitcoin.cpp:15 in 354fbc0022 outdated

10 #include <util/system.h>
11 
12 #include <memory>
13 
14 static const char* DEFAULT_BENCH_FILTER = ".*";
15+static const int64_t DEFAULT_MIN_DURATION_MS = 10;

jonatack commented at 1:25 pm on September 18, 2021:

f167647

0static constexpr int64_t DEFAULT_MIN_DURATION_MS{10};

martinus commented at 10:54 am on September 19, 2021:

fixed in b8f16c323e02c3d06e5ad57b6c0f96196d9b12e0

in src/bench/bench_bitcoin.cpp:64 in 354fbc0022 outdated

60+                     "  highly depend on the stability of the machine. It can sometimes be very\n"
61+                     "  difficult to get repeatable results, so here are a few tips to get stable\n"
62+                     "  results:\n"
63+                     "\n"
64+                     "  * Use pyperf [1] to disable frequency scaling, turbo boost etc. For best\n"
65+                     "    use CPU pinning and CPU isolation (see [2]).\n"

jonatack commented at 1:31 pm on September 18, 2021:

fc2d44dd missing word

0                     "    results, use CPU pinning and CPU isolation (see [2]).\n"

in src/bench/bench_bitcoin.cpp:71 in 354fbc0022 outdated

67+                     "  * Each call of run() should do exactly the same work. E.g. inserting into\n"
68+                     "    a std::vector doesn't do that as it will reallocate on certain calls. Make\n"
69+                     "    sure each run has exactly the same preconditions.\n"
70+                     "\n"
71+                     "  * If results are still not reliable increase runtime with e.g.\n"
72+                     "    -min_duration_ms=5000 to let a benchmark run for at least 5 seconds\n"

jonatack commented at 1:32 pm on September 18, 2021:

fc2d44d s/reliable increase/reliable, increase/ and s/seconds/seconds./

martinus commented at 10:55 am on September 19, 2021:

wordings fixed in 57a7a020a87f13c231b44ad1861b0bb6e6f92e70

in src/bench/bench.h:49 in 354fbc0022 outdated

45     bool is_list_only;
46     std::vector<double> asymptote;
47     std::string output_csv;
48     std::string output_json;
49+    std::string regex_filter;
50+    std::chrono::milliseconds min_duration;

jonatack commented at 1:37 pm on September 18, 2021:

f167647 or 3e1a8ca maybe place this line after bool is_list_only, as it is a very lightweight data type in terms of packing

martinus commented at 10:55 am on September 19, 2021:

done!

in src/bench/rollingbloom.cpp:25 in 354fbc0022 outdated

28-        data[2] = count >> 8;
29-        data[3] = count;
30+        data[0] = static_cast<unsigned char>(count >> 24);
31+        data[1] = static_cast<unsigned char>(count >> 16);
32+        data[2] = static_cast<unsigned char>(count >> 8);
33+        data[3] = static_cast<unsigned char>(count);

jonatack commented at 1:43 pm on September 18, 2021:

354fbc0 Would it be a good idea to instead just make data a vector of uint8? There has been ongoing refactoring in the codebase to move from unsigned char to uint8.

0-    std::vector<unsigned char> data(32);
1-    uint32_t count = 0;
2+    std::vector<uint8_t> data(32);
3+    uint8_t count = 0;

martinus commented at 10:56 am on September 19, 2021:

I can’t easily change that because the RollingBloomHash’s API uses that and I don’t want the change to be too large. I’ve changed this though to use & 0xFF like it’s done in many other places in the code. Done in b901174bfb6895c8e332c2b5e280348836398e68

in src/bench/bench_bitcoin.cpp:60 in 354fbc0022 outdated

56+                     "\n"
57+                     "Description:\n"
58+                     "\n"
59+                     "  bench_bitcoin executes microbenchmarks. The quality of the benchmark results\n"
60+                     "  highly depend on the stability of the machine. It can sometimes be very\n"
61+                     "  difficult to get repeatable results, so here are a few tips to get stable\n"

jonatack commented at 1:53 pm on September 18, 2021:

fc2d44dd suggestion

“It can sometimes be difficult to get stable, repeatable results, so here are a few tips:”

in src/bench/bench_bitcoin.cpp:88 in 354fbc0022 outdated

84+                     "  In rare cases it can be useful to suppress stability warnings. This can be\n"
85+                     "  done with the environment variable NANOBENCH_SUPPRESS_WARNINGS, e.g:\n"
86+                     "\n"
87+                     "    NANOBENCH_SUPPRESS_WARNINGS=1 ./bench_bitcoin\n"
88+                     "\n"
89+                  << argsman.GetHelpMessage()

jonatack commented at 1:55 pm on September 18, 2021:

fc2d44d I think it would be most helpful to begin with the Options, followed by the Description and Notes, as they go together.

jonatack commented at 6:41 pm on September 18, 2021:

Another alternative would be to add this Description and Notes documentation to doc/benchmarking.md.

martinus commented at 10:57 am on September 19, 2021:

changed in 57a7a020a87f13c231b44ad1861b0bb6e6f92e70

in src/bench/bench_bitcoin.cpp:24 in 354fbc0022 outdated

21-    argsman.AddArg("-asymptote=n1,n2,n3,...", "Test asymptotic growth of the runtime of an algorithm, if supported by the benchmark", ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
22+    argsman.AddArg("-asymptote=<n1,n2,n3,...>", "Test asymptotic growth of the runtime of an algorithm, if supported by the benchmark", ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
23     argsman.AddArg("-filter=<regex>", strprintf("Regular expression filter to select benchmark by name (default: %s)", DEFAULT_BENCH_FILTER), ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
24-    argsman.AddArg("-list", "List benchmarks without executing them", ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
25+    argsman.AddArg("-list", "List benchmarks without executing them", ArgsManager::ALLOW_BOOL, OptionsCategory::OPTIONS);
26+    argsman.AddArg("-min_duration_ms=<milliseconds>", strprintf("Minimum runtime per benchmark, in milliseconds (default: %d)", DEFAULT_MIN_DURATION_MS), ArgsManager::ALLOW_INT, OptionsCategory::OPTIONS);

jonatack commented at 1:58 pm on September 18, 2021:

f167647c consider shortening the argument name to -min_time (faster and easier to type!)

martinus commented at 10:56 am on September 19, 2021:

done in b8f16c323e02c3d06e5ad57b6c0f96196d9b12e0

jonatack commented at 2:07 pm on September 18, 2021: member

Nice–thank you for working on this! Using the min duration arg does improve stability for me (lower err%) and is convenient as a runtime option.

Tested Concept/Approach ACK with various suggestions below.

in src/bench/bench.cpp:64 in 354fbc0022 outdated

58@@ -61,6 +59,11 @@ void benchmark::BenchRunner::RunAll(const Args& args)
59 
60         Bench bench;
61         bench.name(p.first);
62+        if (args.min_duration > std::chrono::milliseconds::zero()) {
63+            // convert to nanos before dividing to reduce rounding errors
64+            bench.minEpochTime(std::chrono::nanoseconds(args.min_duration) / bench.epochs());

jonatack commented at 2:32 pm on September 18, 2021:

Looking at https://en.cppreference.com/w/cpp/chrono/duration/duration_cast, would something like this be a good idea?

0// converting integral duration to integral duration of shorter divisible time unit:
1    // no duration_cast needed
2    std::chrono::duration<long, std::micro> int_usec = int_ms;

also line 62, can shorten if preferred

0-        if (args.min_duration > std::chrono::milliseconds::zero()) {
1+        if (args.min_duration > 0ms) {

also, where std::chrono is added, look at include headers for <chrono> where needed

martinus commented at 10:59 am on September 19, 2021:

hm, I think it’s fine to use std::chrono::nanoseconds, it’s basically the same but with int64_t. It should be large so there’s no easy chance for overflow. I generally try to not use duration_cast, because that’s a sign that at that place there will be some loss of information due to the casting.

I’ve changed line 64 to 0ms in b8f16c323e02c3d06e5ad57b6c0f96196d9b12e0

DrahtBot commented at 2:50 am on September 19, 2021: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#22950 ([p2p] Pimpl AddrMan to abstract implementation details by amitiuttarwar)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

martinus force-pushed on Sep 19, 2021

martinus commented at 11:00 am on September 19, 2021: contributor

Thanks for the reviews! I’ve incorporated all the suggestions into the right commits and rebased everything.

martinus renamed this:
~~bench: update nanobench add `-min_duration_ms`~~
bench: update nanobench add `-min_time`
on Sep 19, 2021

in src/bench/bench_bitcoin.cpp:71 in b901174bfb outdated

67+                     "  * Each call of run() should do exactly the same work. E.g. inserting into\n"
68+                     "    a std::vector doesn't do that as it will reallocate on certain calls. Make\n"
69+                     "    sure each run has exactly the same preconditions.\n"
70+                     "\n"
71+                     "  * If results are still not reliable, increase runtime with e.g.\n"
72+                     "    -min_duration_ms=5000 to let a benchmark run for at least 5 seconds.\n"

jonatack commented at 3:26 pm on September 19, 2021:

57a7a02

0                     "    -min_time=5000 to let a benchmark run for at least 5 seconds.\n"

martinus commented at 5:31 am on September 20, 2021:

damn, forgot that one! Fixed in the latest rebase.

in src/bench/bench.cpp:73 in b901174bfb outdated

58@@ -61,6 +59,12 @@ void benchmark::BenchRunner::RunAll(const Args& args)
59 
60         Bench bench;
61         bench.name(p.first);
62+        if (args.min_time > 0ms) {
63+            // convert to nanos before dividing to reduce rounding errors
64+            std::chrono::nanoseconds min_time_ns = args.min_time;

jonatack commented at 3:40 pm on September 19, 2021:

b8f16c3

per doc/developer-notes.md, it looks like #include <chrono> should be added to bench.cpp and bench_bitcoin.cpp:

0- Every `.cpp` and `.h` file should `#include` every header file it directly uses classes, functions or other
1  definitions from, even if those headers are already included indirectly through other headers.
2
3  - *Rationale*: Excluding headers because they are already indirectly included results in compilation
4    failures when those indirect dependencies change. Furthermore, it obscures what the real code
5    dependencies are.

pico-nit, feel free to ignore

0            const std::chrono::nanoseconds min_time_ns{args.min_time};

martinus commented at 5:37 pm on September 19, 2021:

hm, that include guideline is a bit hard to follow. There are lots of other missing includes, e.g. bench_bitcoin.cpp would need at least chrono, cstdint, vector, sstream, string, iostream. Should I add these as well?

About the constructor, I think that is one of the cases where assignment is better: chrono’s assignment is more strict than the constructor, you can’t do e.g. std::chrono::nanoseconds x = 123 but you can do std::chrono::nanoseconds x{123}. So with the assignment it’s certain that args.min_time is actually another chrono type that convertible without loss of information, and not just any number that could represent anything.

martinus commented at 5:32 am on September 20, 2021:

I’ve now added the headers in clean up includes commit

jonatack commented at 3:52 pm on September 19, 2021: member

Almost-ACK b901174bfb6

martinus force-pushed on Sep 20, 2021

martinus commented at 5:32 am on September 20, 2021: contributor

Rebased with header cleanups, and signed all commits.

jonatack commented at 8:53 am on September 20, 2021: member

ACK e7e890079e92e346997cbdf491b5b0717cd4bcab

in src/bench/addrman.cpp:123 in cf84fc6f4a outdated

121-        ++i;
122+        // To make the benchmark independent of the number of evaluations, we always prepare a new addrman.
123+        // This is necessary because CAddrMan::Good() method modifies the object, affecting the timing of subsequent calls
124+        // to the same method and we want to do the same amount of work in every loop iteration.
125+        //
126+        // This has some overhead (about 8% on my machine), but that overhead is constant so improvements in

MarcoFalke commented at 9:08 am on September 20, 2021:

Likely going to change with #22974

martinus commented at 11:12 am on September 20, 2021:

hm, I don’t know much about what addrman does, but currently markSomeAsGood in the benchmark only uses every 32nd element. Can be bumped up to call Good() on every entry, maybe in random order?

martinus commented at 12:49 pm on September 21, 2021:

I’ve rebased because #22974 is already merged, and updated the benchmark in 153e686 so it makes more sense with the much faster Good(). Nothing else has changed

MarcoFalke commented at 1:28 pm on September 21, 2021:

If the overhead is now 99%, then changes in the order of -50% or +100% are not noticed. Just asking to make sure the overhead isn’t too much.

martinus commented at 1:46 pm on September 21, 2021:

This is what I get when I run ./src/bench/bench_bitcoin -min_time=10000 -filter="AddrManAdd.*" on my machine:

ns/op	op/s	err%	ins/op	cyc/op	IPC	bra/op	miss%	total	benchmark
44,370,397.17	22.54	0.6%	448,854,352.19	141,515,648.83	3.172	33,913,128.43	0.1%	11.05	`AddrManAdd`
114,389,015.11	8.74	0.2%	1,201,264,095.33	364,837,620.00	3.293	78,812,510.33	0.1%	10.87	`AddrManAddThenGood`

So with the change it’s ~44ms for AddrManAdd, ~114ms for AddrManAddThenGood, leaving ~70ms for just the Good() part of the benchmark

bench: update nanobench from 4.3.4 to 4.3.6

Most importantly, this update fixes a bug in nanobench that always
disabled performance counters on linux.

It also adds another sanitizer suppression that is caught in clang++ 12.

eed99cf272

bench: remove unnecessary & incorrect multiplication in MuHashDiv

Introduced in #19055, MuHashDiv benchmark used to multiply with a loop
based on epochIterations. That does not do what it is supposed to do,
because epochIterations() is determined automatically from nanobench.

Also, multiplication is not needed for the algorithm (as pointed out by
a comment in #19055), so it's better to remove this loop.

468b232f71

laanwj added the label Tests on Sep 21, 2021

bench: change AddrManGood to AddrManAddThenGood

Moves some of the setup into the benchmark loop so it is possible to run
the loop for an arbitrary number of times. Due to recent optimizations
in #22974 the benchmark now runs much faster, so the inner loop now calls
Good() 32 times as often to get better numbers.

Renamed the benchmark to AddrManAddThenGood because that's now what is
actually tested. To get the the number of just Good(), one needs to
subtract the benchmark result of AddrManAdd.

153e6860e8

bench: make EvictionProtection.* work with any number of iterations

Moves copying of the setup into the benchmark loop so it is possible
to run the loop for an arbitrary number of times.

The overhead due to copying the candidates inside the loop is about 3%.

9fef832932

bench: introduce -min_time argument

When it is not easily possible to stabilize benchmark machine and code
the argument -min_time can be used to specify a minimum duration
that a benchmark should take. E.g. choose -min_time=1000 if you
are willing to wait about 1 second for each benchmark result.

The default is now set to 10ms instead of 0, which should make runs on
fast machines more stable with negligible slowdown.

d3c6f8bfa1

bench: add usage description and documentation

This adds some usage description with tips to `bench_bitcoin -h`.

1f10f1663e

bench: clean up includes

Drops unneeded and adds missing includes

d312fd94a1

bench: various args improvements

- use ALLOW_BOOL for -list arg instead of ALLOW_ANY
- touch up `-asymptote=<n1,n2,n3...>` help
- pack Args struct a bit more efficiently
- handle args in alphabetical order

da4e2f1da0

bench: fixed ubsan implicit conversion

The benchmarks can now run much longer due to the minimum of 10ms or
directly with -min_time. With -min_time=20000 I could trigger two ubsan
errors in the benchmarks, which are fixed in this commit by using
unsigned type and adding "& 0xFF".

e148a52332

martinus force-pushed on Sep 21, 2021

jonatack commented at 1:27 pm on September 21, 2021: member

re-ACK e148a5233292d156cda76cb20afb6641fc20f25e

The Win64 CI failure looks unrelated.

0$ NANOBENCH_SUPPRESS_WARNINGS=1 ./bench_bitcoin -filter=AddrMan*.* -min_time=20000
1
2|               ns/op |                op/s |    err% |     total | benchmark
3|--------------------:|--------------------:|--------:|----------:|:----------
4|      762,448,657.67 |                1.31 |    6.8% |     23.20 | `AddrManAdd`
5|    1,794,860,430.00 |                0.56 |    2.5% |     19.71 | `AddrManAddThenGood`
6|        9,101,993.99 |              109.87 |    7.8% |     20.56 | `AddrManGetAddr`
7|           11,379.90 |           87,874.25 |    2.1% |     21.93 | `AddrManSelect`

If you retouch, maybe order the addrman benchmark functions alphabetically, which would be the same order as the bench result output and place the two similar functions, AddrManAdd and AddrManAddThenGood, next to each other.

martinus commented at 1:51 pm on September 21, 2021: contributor

@jonatack Interesting that your benchmark results are so different. Did you compile for release? I can’t believe my computer is ~20 times faster than yours… I compile with ./autogen.sh && ./configure --with-incompatible-bdb CC=clang CXX=clang++

jonatack commented at 1:58 pm on September 21, 2021: member

@martinus no, it’s a debug build and the computer is not optimized. Re-building a non-debug build would take a long time (I have a slow 2-core laptop) and I’d already tested it properly above. Sorry for the weird example.

laanwj commented at 4:38 pm on September 24, 2021: member

Code review ACK e148a5233292d156cda76cb20afb6641fc20f25e

laanwj merged this on Sep 24, 2021

laanwj closed this on Sep 24, 2021

sidhujag referenced this in commit 1f347afcd1 on Sep 24, 2021

sidhujag referenced this in commit fbf3524358 on Sep 24, 2021

martinus deleted the branch on Sep 26, 2021

DrahtBot locked this on Oct 30, 2022

bench: update nanobench add -min_time #23025

Conflicts

bench: update nanobench add `-min_time` #23025