Replace current benchmarking framework with nanobench

martinus commented at 9:53 pm on January 27, 2020: contributor

This replaces the current benchmarking framework with nanobench [1], an MIT licensed single-header benchmarking library, of which I am the author. This has in my opinion several advantages, especially on Linux:

fast: Running all benchmarks takes ~6 seconds instead of 4m13s on an Intel i7-8700 CPU @ 3.20GHz.
accurate: I ran e.g. the benchmark for SipHash_32b 10 times and calculated standard deviation / mean = coefficient of variation:
- 0.57% CV for old benchmarking framework
- 0.20% CV for nanobench
So the benchmark results with nanobench seem to vary less than with the old framework.
It automatically determines runtime based on clock precision, no need to specify number of evaluations.
measure instructions, cycles, branches, instructions per cycle, branch misses (only Linux, when performance counters are available)
output in markdown table format.
Warn about unstable environment (frequency scaling, turbo, …)
For better profiling, it is possible to set the environment variable NANOBENCH_ENDLESS to force endless running of a particular benchmark without the need to recompile. This makes it possible to e.g. run “perf top” and look at hotspots.
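The coefficient of variation quoted in the list above is the standard deviation of repeated results divided by their mean. A minimal sketch of that calculation follows; the helper name `CoefficientOfVariation` and the use of the sample (n-1) standard deviation are assumptions, since the description does not say which variant was used:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: coefficient of variation = sample stddev / mean.
// Returns 0.0 for fewer than two samples or for a zero mean.
double CoefficientOfVariation(const std::vector<double>& samples)
{
    const std::size_t n = samples.size();
    if (n < 2) return 0.0;

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(n);
    if (mean == 0.0) return 0.0;

    double sq_diff = 0.0;
    for (double s : samples) sq_diff += (s - mean) * (s - mean);
    // Assumption: sample standard deviation (divide by n - 1).
    const double stddev = std::sqrt(sq_diff / static_cast<double>(n - 1));

    return stddev / mean;
}
```

Feeding it the ten per-run results of one benchmark and multiplying by 100 would give a percentage comparable to the CV figures above.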

Here is an example copy & pasted from the terminal output:

ns/byte	byte/s	err%	ins/byte	cyc/byte	IPC	bra/byte	miss%	total	benchmark
2.52	396,529,415.94	0.6%	25.42	8.02	3.169	0.06	0.0%	0.03	`bench/crypto_hash.cpp RIPEMD160`
1.87	535,161,444.83	0.3%	21.36	5.95	3.589	0.06	0.0%	0.02	`bench/crypto_hash.cpp SHA1`
3.22	310,344,174.79	1.1%	36.80	10.22	3.601	0.09	0.0%	0.04	`bench/crypto_hash.cpp SHA256`
2.01	496,375,796.23	0.0%	18.72	6.43	2.911	0.01	1.0%	0.00	`bench/crypto_hash.cpp SHA256D64_1024`
7.23	138,263,519.35	0.1%	82.66	23.11	3.577	1.63	0.1%	0.00	`bench/crypto_hash.cpp SHA256_32b`
3.04	328,780,166.40	0.3%	35.82	9.69	3.696	0.03	0.0%	0.03	`bench/crypto_hash.cpp SHA512`

[1] https://github.com/martinus/nanobench

DrahtBot added the label Build system on Jan 27, 2020

DrahtBot added the label Docs on Jan 27, 2020

DrahtBot added the label Scripts and tools on Jan 27, 2020

DrahtBot added the label Tests on Jan 27, 2020

MarcoFalke removed the label Build system on Jan 27, 2020

MarcoFalke removed the label Docs on Jan 27, 2020

MarcoFalke removed the label Scripts and tools on Jan 27, 2020

JeremyRubin commented at 10:09 pm on January 27, 2020: contributor

Strong concept ACK! Seems like a big improvement.

Can you comment more on the 6 seconds claim? AFAIK each bench was supposed to target running for 1 second? Is this no longer required to reduce variance?

Secondly – and separately – can you comment on how this might impact the need for something like #17375? Can we add better support for benchmarks where we want to run with different scaling params and output each trial to get a sense of the complexity?

martinus commented at 10:21 pm on January 27, 2020: contributor

I calculate a good number of iterations based on the clock accuracy, then perform these iterations a few times and use the median to get rid of outliers. I found it actually to be more reliable with shorter runs, because there is less chance for random fluctuations to interfere. It is necessary though to disable frequency scaling etc. (but this should be done with the old framework too anyways). This can be easily done with e.g. pyperf.

Concerning #17375, nanobench can estimate complexity, but it requires a bit of code change: https://github.com/martinus/nanobench/blob/master/docs/reference.md#asymptotic-complexity

fanquake added the label Needs Conceptual Review on Jan 27, 2020

DrahtBot commented at 1:49 am on January 28, 2020: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#19377 (bench: Add OrphanTxPool benchmark by hebasto)
#19326 (Simplify hash.h interface using Spans by sipa)
#19280 (Verify the block filter hash when reading from disk. by pstratem)
#19181 (Add ASM optimizations for MuHash3072 by fjahr)
#19145 (Add hash_type MUHASH for gettxoutsetinfo by fjahr)
#19055 (Add MuHash3072 implementation by fjahr)
#18815 (bench: Add logging benchmark by MarcoFalke)
#18731 (refactor: Make CCheckQueue RAII-styled by hebasto)
#18710 (Add local thread pool to CCheckQueue by hebasto)
#18354 (Use shared pointers only in validation interface by bvbfan)
#18261 (Erlay: bandwidth-efficient transaction relay protocol by naumenkogs)
#18014 (lib: Optimizing siphash implementation by elichai)
#17526 (Use Single Random Draw In addition to knapsack as coin selection fallback by achow101)
#17331 (Use effective values throughout coin selection by achow101)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

martinus commented at 7:00 am on January 28, 2020: contributor

The lint check currently fails with this error:

fatal: bad revision ‘8b138526b5dc…488d538cbf6f’

I believe the reason is some key verification check at the end of ci/lint/06_script.sh, but I can’t really say why this is failing

MarcoFalke commented at 3:05 pm on January 28, 2020: member

Concept ACK

The travis failure is

The locale dependent function std::to_string(...) appears to be used:
src/bench/nanobench.h:            auto sysCpu = "/sys/devices/system/cpu/cpu" + std::to_string(id);
src/bench/nanobench.h:                warnings.emplace_back("CPU frequency scaling enabled: CPU " + std::to_string(id) + " between " +

Unnecessary locale dependence can cause bugs that are very
tricky to isolate and fix. Please avoid using locale dependent
functions if possible.

Advice not applicable in this specific case? Add an exception
by updating the ignore list in test/lint/lint-locale-dependence.sh
^---- failure generated from test/lint/lint-locale-dependence.sh
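One common way to satisfy a lint like this is to format numbers through a stream that is explicitly imbued with the classic "C" locale instead of calling std::to_string. The sketch below illustrates the idea only; the helper name `ToStringClassic` is hypothetical, and the thread does not say how the call was actually replaced in nanobench.h:

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: convert a value to a string without depending on the
// global locale, by imbuing the stream with the classic "C" locale.
template <typename T>
std::string ToStringClassic(const T& value)
{
    std::ostringstream oss;
    oss.imbue(std::locale::classic());
    oss << value;
    return oss.str();
}
```

The flagged expression could then be written as `"/sys/devices/system/cpu/cpu" + ToStringClassic(id)`.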

jamesob commented at 3:24 pm on January 28, 2020: member

Would it be easy to hack in csv output that is somewhat similar to the existing output? The markdown table looks a little tricky to parse programmatically (though it could be done). For example, bitcoinperf (https://github.com/chaincodelabs/bitcoinperf) currently relies on this format.

martinus commented at 3:53 pm on January 28, 2020: contributor

The locale dependent function std::to_string(…) appears to be used:

Ah, ok I’ll fix this

Would it be easy to hack in csv output that is somewhat similar to the existing output?

I think it should be easy, in nanobench I already have CSV & JSON output format using mustache-like templates, so it’s possible to create a custom format. I have not exposed this feature yet though in this PR.

practicalswift commented at 4:49 pm on January 28, 2020: contributor

Strong concept ACK @martinus, thanks for your great contributions! Please keep them coming :)

elichai commented at 5:02 pm on January 28, 2020: contributor

Does this library also do memory clobbers and barriers? (like google’s DoNotOptimize[0], ClobberMemory[1], or Rust’s black_box[2])

[0] https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L307
[1] https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L326
[2] https://doc.rust-lang.org/std/hint/fn.black_box.html

martinus commented at 5:11 pm on January 28, 2020: contributor

Does this library also do memory clobbers and barriers? (like google’s DoNotOptimize[0], ClobberMemory[1], or Rust’s black_box[2])

I currently have doNotOptimizeAway, which is based on folly’s benchmark. I think folly’s version is based on google benchmark. I have not added doNotOptimizeAway calls in the PR because I did not want to modify each benchmark too much.

I don’t have clobberMemory yet, because I’ve never used it… What I’m also doing is I force the run(...) to be noinline to prevent some optimizations.
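For readers unfamiliar with the technique being discussed, here is a minimal sketch of an optimization barrier in the style of the helpers mentioned above. It is not nanobench's actual code: the name `DoNotOptimizeAwaySketch` is hypothetical, and the inline-assembly path assumes GCC or Clang.

```cpp
#include <cstdint>

// Hypothetical barrier. On GCC/Clang the empty asm statement takes `value` as
// an input operand, so the compiler has to materialize it, and the "memory"
// clobber tells the compiler the statement may read or write any memory.
template <typename T>
inline void DoNotOptimizeAwaySketch(const T& value)
{
#if defined(__GNUC__) || defined(__clang__)
    asm volatile("" : : "r,m"(value) : "memory");
#else
    // Assumption: weaker fallback for other compilers, a volatile read of
    // the first byte of the object.
    static_cast<void>(*reinterpret_cast<const volatile char*>(&value));
#endif
}

// Example workload whose result would otherwise be unused in a benchmark loop.
inline std::uint64_t SumTo(std::uint64_t n)
{
    std::uint64_t sum = 0;
    for (std::uint64_t i = 1; i <= n; ++i) sum += i;
    return sum;
}
```

In a benchmark body one would call `DoNotOptimizeAwaySketch(SumTo(n))` so that the compiler cannot drop the computation just because its result is never read.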

martinus force-pushed on Jan 28, 2020

JeremyRubin commented at 8:11 pm on January 28, 2020: contributor

Is it possible to make the nanobench include a submodule (like secp256k1 or univalue) so that it’s easier for us to pull in updates from upstream? If you plan on adding new features to nanobench, that should help streamline review potentially. If you think that there will be Bitcoin specific changes made to the header that you wouldn’t want to upstream, then I would leave it as you’ve done.

JeremyRubin commented at 8:21 pm on January 28, 2020: contributor

Closed #17375 in favor of nanobench. When you have time would love assistance in making the asymptotic test introduced there nanobench compatible.

Being able to run asymptotic tests on the code is going to be a huge help with advocating for mempool policy changes in the future (e.g., loosening descendants limit) once the epoch mempool work is completed. This also impacts other parts of the code (e.g., wallet) where right now we don’t have good insight into if we introduce a regression. I think the curve fitting is a bit less useful because we do care about the constant factors too (e.g., if we’re a O(n log n) v.s. O(n) but c > log n for max n), but it’s quite nifty nonetheless.

JeremyRubin added this to the "General Testing" column in a project

martinus commented at 9:46 am on January 29, 2020: contributor

Is it possible to make the nanobench include a submodule (like secp256k1 or univalue)

I think it should be possible, I need to read up how git-subtree works… I prefer if nanobench stays generic, and try to implement bitcoin’s requirement in a generic way so it’s usable by others too. So no separate repository if possible.

When you have time would love assistance in making the asymptotic test introduced there nanobench compatible.

Sure, I’ll have a look at #17375 when I have time! I can’t predict how soon this is though.

elichai commented at 9:52 am on January 29, 2020: contributor

Haven’t looked at the library itself yet, but Concept ACK on replacing the current framework. (I really dislike it) Personally I also would’ve been fine with dynamically linking against google’s benchmarking library (https://github.com/google/benchmark)

martinus commented at 11:46 am on January 29, 2020: contributor

Personally I also would’ve been fine with dynamically linking against google’s benchmarking library (https://github.com/google/benchmark)

I don’t think google benchmark is viable here. It’s a large dependency, and you also need to use the gtest framework for this. It would be quite a big change.

Empact commented at 3:16 am on January 30, 2020: member

Concept ACK

JeremyRubin commented at 8:15 pm on January 30, 2020: contributor

We discussed nanobench today in the IRC Meeting. There seems to be general agreement that this is a nice idea, and that the current bench framework isn’t perfect.

Our current bench framework is actually based on Google’s, and I think most people are opposed to linking google’s whole thing.

With respect to the question of if to subtree or not: let’s ignore that for now, copied in is fine, and we can deal with that in the future if we require changes to nanobench or if there are new features in nanobench we want to incorporate.

There’s some concern about maintaining compatibility with existing tools. I think given that the output is fundamentally different from before (no longer reporting iterations and things like that) we can’t have perfect parity. But perhaps we could:

1. “backport” on top of the last few releases (no new release) so that we have a bit more history to compare with
2. Add a compatibility mode which emits something similar to the previous output with NaN substituted where nanobench has no equivalent value.
3. Ignore compatibility, but do support output into a nice machine-readable format.

I think most people would be satisfied with 3, as 2 can be done by a script on 3’s output and 1 can be done if someone has the itch for it.

martinus commented at 9:50 pm on January 30, 2020: contributor

Thanks for the summary! I just read through the log here and think I can add a few clarifications:

I think if we can do a cursory check it’s not actually malware

It’s not malware, not sure how I can help here :) I’ve created nanobench because I was annoyed at how difficult other benchmarking frameworks were to integrate into existing codebase because I don’t like google test.

For example, a lot of tools rely on the output format of the current bench framework

I have templating support in nanobench, and I can relatively easily add another output format that resembles the current output format closely. I need to do some improvements to the available data in nanobench, then I can use a template like this to produce practically the same output as before:

# Benchmark, evals, iterations, total, min, max, median
{{#benchmarks}} {{name}}, {{num_measurements}}, {{total_iters}}, {{total_runtime}}, {{min}}, {{max}}, {{median}}
{{/benchmarks}}

Then I can e.g. print the markdown tables to stdout, and create e.g. benchmarkresults.csv along with it based on the template format.

I beleive nanobench autodetects variance or something

Google benchmark is quite simple: it has a fixed runtime that it wants to achieve, then finds out the number of iterations it needs to do to get there, then it measures the time for that.

In nanobench I try to be a bit smarter: I find out the clock’s accuracy first, and base the target runtime on that. Since clocks are nowadays very accurate (18ns or so on my machine), I can easily perform the measurement multiple times and use the median to get rid of outliers.

The fast runtimes give very repeatable measurements for stuff that’s deterministic (e.g. SHA hashing). There I believe nanobench has a clear advantage over all other benchmarking frameworks that I’ve tried.

When the code under test has fluctuations (e.g. because it allocates stuff, or has some randomness or lots of cache misses / branch mispredictions in it), nanobench’s runtime measurement probably isn’t better than google benchmark. In that case it helps to also show the numbers for branch misses and retired instruction count to get a better feeling.
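The two building blocks described above, estimating the clock's resolution and taking the median over several measurements, can be sketched as follows. This is an illustration under assumptions, not nanobench's implementation: the function names are hypothetical, std::chrono::steady_clock is an arbitrary choice, and how the resolution is turned into a target runtime is not spelled out in the thread, so that step is left out.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

using Clock = std::chrono::steady_clock;

// Estimate the clock's resolution as the smallest non-zero step observed
// between consecutive now() calls over a number of attempts.
Clock::duration EstimateClockResolution(int attempts = 20)
{
    Clock::duration best = Clock::duration::max();
    for (int i = 0; i < attempts; ++i) {
        const Clock::time_point start = Clock::now();
        Clock::time_point end = Clock::now();
        while (end == start) end = Clock::now(); // spin until the clock ticks
        best = std::min(best, end - start);
    }
    return best;
}

// Median of a set of per-epoch measurements; robust against outliers.
// Takes the vector by value because nth_element reorders it.
double Median(std::vector<double> values)
{
    if (values.empty()) return 0.0;
    const std::size_t mid = values.size() / 2;
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    double median = values[mid];
    if (values.size() % 2 == 0) {
        // Even count: average the two middle elements.
        const double lower = *std::max_element(values.begin(), values.begin() + mid);
        median = (lower + median) / 2.0;
    }
    return median;
}
```

A single slow epoch, such as the 100.0 in `Median({5.0, 1.0, 100.0})`, does not move the median, which is why repeating short measurements and taking the median tolerates occasional disturbances better than averaging does.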

DrahtBot added the label Needs rebase on Feb 10, 2020

martinus force-pushed on Feb 20, 2020

DrahtBot removed the label Needs rebase on Feb 20, 2020

martinus force-pushed on Feb 20, 2020

martinus commented at 4:17 pm on February 20, 2020: contributor

I’ve rebased & pushed a big update to the code. In addition to the markdown output, I also generate a file benchmarkresults.csv which has practically the same content as the output had previously. This can be used by any tools that rely on the benchmark output. On my computer, the file has this output:

# Benchmark, evals, iterations, total, min, max, median
AssembleBlock, 11, 1, 0.006106585, 0.000538949, 0.000643127, 0.000545681
Base58CheckEncode, 11, 9.81818181818182, 0.000240226, 2.06e-06, 3.02745454545455e-06, 2.07709090909091e-06
Base58Decode, 11, 27.2727272727273, 0.000245003, 8.11689655172414e-07, 8.1968e-07, 8.166e-07
Base58Encode, 11, 18.5454545454545, 0.000244574, 1.19615e-06, 1.2115e-06, 1.19763157894737e-06
Bech32Decode, 11, 42.3636363636364, 0.000250479, 4.27466666666667e-07, 1.32740740740741e-06, 4.35666666666667e-07
Bech32Encode, 11, 34.6363636363636, 0.000248104, 6.29258064516129e-07, 8.0896875e-07, 6.36314285714286e-07
BenchLockedPool, 11, 167.363636363636, 0.000264967, 1.08922651933702e-07, 1.73769662921348e-07, 1.46138888888889e-07
BenchTimeDeprecated, 11, 4587.81818181818, 0.00024425, 4.38138264341248e-09, 5.01115472009915e-09, 5.0051077059738e-09
BenchTimeMillis, 11, 237.363636363636, 0.00024483, 9.26184738955823e-08, 9.43307086614173e-08, 9.3963963963964e-08
BenchTimeMillisSys, 11, 243.272727272727, 0.000244799, 9.1088122605364e-08, 9.19113924050633e-08, 9.15038461538462e-08
BenchTimeMock, 11, 10562.9090909091, 0.000246826, 2.08633681343622e-09, 2.29533626901521e-09, 2.08746447742343e-09
BlockToJsonVerbose, 11, 1, 0.812951704, 0.072167457, 0.085656596, 0.072707134
CCheckQueueSpeedPrevectorJob, 11, 11.1818181818182, 0.199323679, 0.00153634016666667, 0.00172734963636364, 0.0016258129
CCoinsCaching, 11, 36.5454545454545, 0.000227408, 5.422e-07, 7.69351351351351e-07, 5.44605263157895e-07
CHACHA20_1MB, 11, 1, 0.023085774, 0.00201953, 0.002240195, 0.002090533
CHACHA20_256BYTES, 11, 44.5454545454545, 0.000245398, 4.99659574468085e-07, 5.02047619047619e-07, 5.00717391304348e-07
CHACHA20_64BYTES, 11, 168.363636363636, 0.000245097, 1.32019736842105e-07, 1.32545454545455e-07, 1.32360759493671e-07
CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT, 11, 1, 0.063083902, 0.005614908, 0.006009997, 0.005679771
CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT, 11, 1, 0.031591654, 0.002799809, 0.003083797, 0.002855794
CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT, 11, 11.2727272727273, 0.000236947, 1.89941666666667e-06, 1.92533333333333e-06, 1.90981818181818e-06
CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT, 11, 21.5454545454545, 0.000233245, 9.55045454545455e-07, 1.23718181818182e-06, 9.5747619047619e-07
CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT, 11, 25, 0.000247072, 8.81333333333333e-07, 1.015125e-06, 8.87961538461539e-07
CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT, 11, 48.9090909090909, 0.00024602, 4.42068181818182e-07, 5.63777777777778e-07, 4.45867924528302e-07
ComplexMemPool, 11, 1, 3.457592325, 0.313054487, 0.316239363, 0.313548617
ConstructGCSFilter, 11, 1, 0.018953542, 0.001667658, 0.001879675, 0.001676587
DeserializeAndCheckBlockTest, 11, 1, 0.068399757, 0.006003952, 0.006412799, 0.00618906
DeserializeBlockTest, 11, 1, 0.057322626, 0.005100356, 0.00547949, 0.005164525
DuplicateInputs, 11, 1, 0.082094071, 0.007329521, 0.007526674, 0.007472722
FastRandom_1bit, 11, 14773.0909090909, 0.000239545, 1.46252213259886e-09, 1.47889590295829e-09, 1.47590446579989e-09
FastRandom_32bit, 11, 2285.45454545455, 0.000242346, 9.26413255360624e-09, 1.2139653815893e-08, 9.36415362731152e-09
HASH_1MB, 11, 1, 0.037448639, 0.00333361, 0.003531469, 0.003387306
HASH_256BYTES, 11, 17.2727272727273, 0.000244883, 1.28611764705882e-06, 1.291e-06, 1.28872222222222e-06
HASH_64BYTES, 11, 32.8181818181818, 0.000244595, 6.75885714285714e-07, 6.79625e-07, 6.77514285714286e-07
MatchGCSFilter, 11, 1, 0.000303108, 2.6787e-05, 3.0751e-05, 2.7047e-05
MempoolEviction, 11, 1, 0.00035951, 2.6216e-05, 4.4312e-05, 3.0276e-05
MerkleRoot, 11, 1, 0.013989533, 0.001220381, 0.001462655, 0.001242791
POLY1305_1MB, 11, 1, 0.008822394, 0.000778289, 0.000909028, 0.000782461
POLY1305_256BYTES, 11, 106.272727272727, 0.000243963, 2.07696428571429e-07, 2.09357142857143e-07, 2.088e-07
POLY1305_64BYTES, 11, 328.636363636364, 0.000248855, 6.70127795527157e-08, 8.63612040133779e-08, 6.72832369942197e-08
PrevectorClearNontrivial, 11, 791.454545454545, 0.000372666, 2.59770491803279e-08, 1.94881395348837e-07, 2.60037735849057e-08
PrevectorClearTrivial, 11, 2630.54545454545, 0.000244958, 8.46212395795578e-09, 8.47009966777409e-09, 8.46505271378368e-09
PrevectorDeserializeNontrivial, 11, 1, 0.001192481, 0.000105425, 0.000123912, 0.00010568
PrevectorDeserializeTrivial, 11, 2, 0.000255826, 1.1333e-05, 1.2048e-05, 1.1635e-05
PrevectorDestructorNontrivial, 11, 415.272727272727, 0.000358251, 5.15516483516484e-08, 3.43633971291866e-07, 5.15764966740576e-08
PrevectorDestructorTrivial, 11, 1590.36363636364, 0.000237761, 1.31619407687461e-08, 1.39810235767683e-08, 1.3519e-08
PrevectorResizeNontrivial, 11, 811.272727272727, 0.000273009, 2.46290155440415e-08, 8.05017709563164e-08, 2.46548295454545e-08
PrevectorResizeTrivial, 11, 2536.54545454545, 0.000244927, 8.77334809892949e-09, 8.78255578093306e-09, 8.77719528178244e-09
RIPEMD160, 11, 1, 0.028423992, 0.002516573, 0.002757947, 0.00255517
RollingBloom, 11, 44.3636363636364, 0.000245064, 4.99872340425532e-07, 5.05348837209302e-07, 5.02531914893617e-07
RollingBloomReset, 11, 1, 0.000696645, 6.2348e-05, 6.9429e-05, 6.2588e-05
RpcMempool, 11, 1, 0.121633853, 0.010798163, 0.011604045, 0.010929157
SHA1, 11, 1, 0.021057291, 0.001862186, 0.002103509, 0.001878588
SHA256, 11, 1, 0.035631576, 0.00317911, 0.003383292, 0.003233048
SHA256D64_1024, 11, 1, 0.001464103, 0.000132135, 0.000138764, 0.000132205
SHA256_32b, 11, 95.6363636363636, 0.000244411, 2.31903225806452e-07, 2.32772727272727e-07, 2.32415841584158e-07
SHA512, 11, 1, 0.034134625, 0.00303295, 0.003282536, 0.003086899
SipHash_32b, 11, 776.181818181818, 0.000244754, 2.85349500713267e-08, 2.88179581795818e-08, 2.86630872483221e-08
Trig, 11, 4313.81818181818, 0.000238146, 5.01395348837209e-09, 5.03433333333333e-09, 5.01748251748252e-09
VerifyScriptBench, 11, 1, 0.001860283, 0.000155127, 0.000299556, 0.000155384

Note that “number of iterations” is now a double value, because in nanobench I automatically determine the number of iterations, and the value is the average number of iterations over the 11 evaluations. (So e.g. Base58CheckEncode has 11 evaluations and 9.81818181818182 iterations, so 11*9.818 = 108 iterations in total.)

martinus force-pushed on Feb 20, 2020

JeremyRubin commented at 6:55 pm on February 20, 2020: contributor

utACK 83a7839

Verified that only benchmarks are affected, checked that the high level design seems reasonable & an improvement over what we do presently.

martinus commented at 9:47 pm on February 20, 2020: contributor

In bf5ae5e I’ve added some support for asymptotes. I hope that’s somewhat similar to what you did in #17375, @JeremyRubin?

Usage is e.g. like this:

./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800

This runs the benchmark ComplexMemPool several times but with different complexityN settings. The benchmark can extract that number and use it accordingly. Here, it’s used for childTxs. The output is this:

complexityN	ns/op	op/s	err%	ins/op	cyc/op	IPC	total	benchmark
25	1,064,241.00	939.64	1.4%	3,960,279.00	2,829,708.00	1.400	0.01	`ComplexMemPool`
50	1,579,530.00	633.10	1.0%	6,231,810.00	4,412,674.00	1.412	0.02	`ComplexMemPool`
100	4,022,774.00	248.58	0.6%	16,544,406.00	11,889,535.00	1.392	0.04	`ComplexMemPool`
200	15,390,986.00	64.97	0.2%	63,904,254.00	47,731,705.00	1.339	0.17	`ComplexMemPool`
400	69,394,711.00	14.41	0.1%	272,602,461.00	219,014,691.00	1.245	0.76	`ComplexMemPool`
600	168,977,165.00	5.92	0.1%	639,108,082.00	535,316,887.00	1.194	1.86	`ComplexMemPool`
800	310,109,077.00	3.22	0.1%	1,149,134,246.00	984,620,812.00	1.167	3.41	`ComplexMemPool`

coefficient	err%	complexity
4.78486e-07	4.5%	O(n^2)
6.38557e-10	21.7%	O(n^3)
3.42338e-05	38.0%	O(n log n)
0.000313914	46.9%	O(n)
0.0129823	114.4%	O(log n)
0.0815055	133.8%	O(1)

The best fitting curve is O(n^2), so the algorithm seems to scale quadratically with childTxs in the range 25 to 800.
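To make the curve-fitting step more concrete, here is a sketch that fits t ≈ c·f(n) by least squares through the origin for a few candidate complexity classes and picks the one with the smallest normalized RMS error. It is an illustration, not nanobench's code: all names are hypothetical, the choice of log base 2 is an assumption, and the error metric (RMS error divided by the mean measurement) is likewise an assumption. The input data is the ComplexMemPool table above, with ns/op converted to seconds.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Fit {
    std::string name;
    double coefficient;
    double normalized_rms; // RMS error divided by the mean measurement
};

// Least-squares fit of t ~= c * f(n) through the origin:
// c = sum(t_i * f(n_i)) / sum(f(n_i)^2).
Fit FitComplexity(const std::string& name, const std::function<double(double)>& f,
                  const std::vector<double>& n, const std::vector<double>& t)
{
    double sum_tf = 0.0, sum_ff = 0.0, sum_t = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        const double fi = f(n[i]);
        sum_tf += t[i] * fi;
        sum_ff += fi * fi;
        sum_t += t[i];
    }
    const double c = sum_tf / sum_ff;

    double sq_err = 0.0;
    for (std::size_t i = 0; i < n.size(); ++i) {
        const double diff = c * f(n[i]) - t[i];
        sq_err += diff * diff;
    }
    const double count = static_cast<double>(n.size());
    return {name, c, std::sqrt(sq_err / count) / (sum_t / count)};
}

// Try the candidate classes and return the fit with the smallest error.
Fit BestFit(const std::vector<double>& n, const std::vector<double>& t)
{
    const std::vector<Fit> fits = {
        FitComplexity("O(1)", [](double) { return 1.0; }, n, t),
        FitComplexity("O(log n)", [](double x) { return std::log2(x); }, n, t),
        FitComplexity("O(n)", [](double x) { return x; }, n, t),
        FitComplexity("O(n log n)", [](double x) { return x * std::log2(x); }, n, t),
        FitComplexity("O(n^2)", [](double x) { return x * x; }, n, t),
        FitComplexity("O(n^3)", [](double x) { return x * x * x; }, n, t),
    };
    Fit best = fits.front();
    for (const Fit& fit : fits) {
        if (fit.normalized_rms < best.normalized_rms) best = fit;
    }
    return best;
}

// complexityN and ns/op from the table above, with ns/op converted to seconds.
const std::vector<double> kComplexityN = {25, 50, 100, 200, 400, 600, 800};
const std::vector<double> kSecondsPerOp = {0.001064241, 0.00157953,  0.004022774, 0.015390986,
                                           0.069394711, 0.168977165, 0.310109077};
```

Calling `BestFit(kComplexityN, kSecondsPerOp)` on this data selects O(n^2) with a coefficient of about 4.78e-07, in line with the first row of the coefficient table.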

JeremyRubin commented at 10:51 pm on February 20, 2020: contributor

utACK bf5ae5e

bravo!

in src/bench/bench.h:37 in bf5ae5ed0f outdated

67-    const uint64_t m_num_evals;
68-    std::vector<double> m_elapsed_results;
69-    time_point m_start_time;
70 
71-    bool UpdateTimer(time_point finish_time);
72+using namespace ankerl::nanobench;

Empact commented at 6:11 pm on February 28, 2020:

nit: How about:

namespace nanobench { using namespace ankerl::nanobench; }

So that all the external nanobench members are more explicitly identified?

martinus commented at 12:11 pm on February 29, 2020:

Sure I can do that. Would you do that inside the benchmark namespace? then all benchmark arguments would become e.g.

static void Base58Encode(benchmark::nanobench::Bench& bench)

Which is a bit long.

Empact commented at 7:58 pm on February 29, 2020:

How about doing individual assignments for the classes in use, e.g.:

namespace benchmark { using ankerl::nanobench::Bench; }

martinus commented at 6:45 am on March 1, 2020:

I think that’s better, Bench is the only thing that’s needed in the benchmarks anyway. I’ve committed c2e924f which does that.

Empact commented at 9:44 am on March 1, 2020: member

ACK https://github.com/bitcoin/bitcoin/pull/18011/commits/c2e924fc046110eb7ac5ab7bf19cfaf6daf1c44b

in src/bench/examples.cpp:6 in c2e924fc04 outdated

4@@ -5,29 +5,18 @@
5 #include <bench/bench.h>
6 #include <util/time.h>

jonatack commented at 5:08 pm on March 1, 2020:

can remove #include <util/time.h>

martinus commented at 3:24 pm on March 8, 2020:

Done in https://github.com/bitcoin/bitcoin/pull/18011/commits/5bd582bbadb90970f631a7c4c4d793689584ca4e

in src/bench/nanobench.h:663 in c2e924fc04 outdated

660+// declarations ///////////////////////////////////////////////////////////////////////////////////
661+
662+namespace ankerl {
663+namespace nanobench {
664+
665+// helper stuff that only intended to be used internally

jonatack commented at 5:24 pm on March 1, 2020:

nit here and L::1056: s/that/that is/

in src/bench/nanobench.h:1191 in c2e924fc04 outdated

1188+#        pragma clang diagnostic pop
1189+#    endif
1190+    return pc;
1191+}
1192+
1193+// Windows version of do not optimize away

jonatack commented at 5:28 pm on March 1, 2020:

nit: of?

martinus commented at 2:41 pm on March 2, 2020:

it should say Windows version of doNotOptimizeAway

martinus commented at 3:25 pm on March 8, 2020:

I’ve reworded it a bit in rebase with new version of nanobench.h in https://github.com/bitcoin/bitcoin/pull/18011/commits/79fd93ae7da64aa6d7532aa46734623d0824098d

in src/bench/bench_bitcoin.cpp:18 in c2e924fc04 outdated

19 
20 static void SetupBenchArgs()
21 {
22     SetupHelpOptions(gArgs);
23 
24     gArgs.AddArg("-list", "List benchmarks without executing them. Can be combined with -scaling and -filter", ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);

jonatack commented at 6:08 pm on March 1, 2020:

remove “-scaling and”… as the option is now removed

in src/bench/bench.cpp:70 in c2e924fc04 outdated

89-
90-        if (m_elapsed_results.size() == m_num_evals) {
91-            return false;
92+    if (!benchmarkResults.empty()) {
93+        // Generate legacy CSV data to "benchmarkresults.csv"
94+        std::ofstream fout("benchmarkresults.csv");

jonatack commented at 6:18 pm on March 1, 2020:

thought: would benchmark_results.csv be more consistent with the project file naming (I’m not sure and won’t bikeshed further)

martinus commented at 2:46 pm on March 2, 2020:

maybe I should add an option like -csv=<filename> that enables writing the .csv to the given file, if the option is present. Currently I always write a “benchmarkresults.csv”, which can be a bit annoying when not wanted

jonatack commented at 6:36 pm on March 1, 2020: member

Tested and light code review ACK c2e924fc04 – built/tests/ran benches/tested the options (-?/-help, -asymptote=, -filter=, -list)… modulo a few minor comments below.

Runs very quickly.

Various bench runs for info and comparison:

bench output of this PR e.g. ./src/bench/bench_bitcoin
benchmarkresults.csv (current bench output format)
Bench output on master to compare with the csv
Bench with asymptotes e.g. ./src/bench/bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800

New, simplified options help:

bitcoin ((HEAD detached at origin/pr/18011))$ ./src/bench/bench_bitcoin -?
Options:

  -?
       Print this help message and exit

  -asymptote=n1,n2,n3,...
       Test asymptotic growth of the runtime of an algorithm, if supported by
       the benchmark

  -filter=<regex>
       Regular expression filter to select benchmark by name (default: .*)

  -list
       List benchmarks without executing them. Can be combined with -scaling
       and -filter

One potential concern is how this will affect the usefulness of long-term benchmarking projects like https://github.com/chaincodelabs/bitcoinperf, e.g. https://bitcoinperf.com (which seems to be down?)

DrahtBot added the label Needs rebase on Mar 6, 2020

martinus force-pushed on Mar 7, 2020

DrahtBot removed the label Needs rebase on Mar 7, 2020

martinus force-pushed on Mar 8, 2020

martinus commented at 7:22 am on March 8, 2020: contributor

I’ve pushed a few updates:

updated nanobench.h with a (bit) faster RNG, and explicit constructors so extended-lint-all.h doesn’t complain any more
add command line options -output_csv to enable creation of legacy CSV file
add command line option -output_json to create a big JSON with all data

Not sure if I should squash all the changes into a single commit or leave them separate?

Here is console output, .CSV file, and .json file of one run (click RAW):

https://gist.github.com/martinus/d5e596b7802199737ef38399ef749d51

jonatack commented at 9:55 am on March 8, 2020: member

Perhaps the gArgs.AddArg("-list") fix and the “comment nits” changes ought to be in 79fd93a where they are first changed rather than 51b83e1 and 9d6eb72. Alternatively, with 2 ACKs it may have been good to rebase with no changes to preserve existing review and do the rest in a follow-up PR (I’m not sure).

in src/bench/examples.cpp:10 in 79fd93ae7d outdated

 4@@ -5,29 +5,18 @@
 5 #include <bench/bench.h>
 6 #include <util/time.h>
 7 
 8-// Sanity test: this should loop ten times, and
 9-// min/max/average should be close to 100ms.
10-static void Sleep100ms(benchmark::State& state)

elichai commented at 10:33 am on March 8, 2020:

Why did you remove this test? (should probably move to tests though)

elichai commented at 10:37 am on March 8, 2020:

It seems to have been added here: https://github.com/bitcoin/bitcoin/commit/535ed9223dcb32bf90ead5b2c95052838b780620#diff-5f8387aba8e5e6c0c871e093c9145085R9 So I guess it is somewhat a useless test

martinus commented at 10:55 am on March 8, 2020:

I removed this test because it’s rather useless, it would uselessly slow down the whole benchmark run quite a bit compared to the others, and since it’s in examples.cpp I assumed that it’s just an example anyways

jonatack commented at 12:52 pm on March 8, 2020:

Unless I was missing something, if you remove this test then the #include <util/time.h> can be removed with it.

martinus commented at 3:19 pm on March 8, 2020:

Ah right, I totally forgot to remove the include

martinus commented at 3:25 pm on March 8, 2020:

Done in https://github.com/bitcoin/bitcoin/pull/18011/commits/5bd582bbadb90970f631a7c4c4d793689584ca4e

elichai approved

elichai commented at 2:54 pm on March 8, 2020: contributor

tACK 9d6eb7207a9baa78ce9c6231e517717a598c8d33

Went over the code changes in core, very briefly looked over nanobench.h, compared the benchmark results with the results before, and it looks ok (there are some differences but in benchmarks with high variance)

And I really like that it’s human readable now :)

DrahtBot added the label Needs rebase on Mar 14, 2020

martinus force-pushed on Mar 28, 2020

martinus commented at 7:32 am on March 28, 2020: contributor

rebased

DrahtBot removed the label Needs rebase on Mar 28, 2020

DrahtBot added the label Needs rebase on Apr 9, 2020

MarcoFalke commented at 2:55 pm on April 9, 2020: member

I tried running this locally, and it seems I am getting varying results:

mac-mini:bitcoin-core marco$ git log -1 && make -j 9 && ./src/bench/bench_bitcoin --filter=VerifyNestedIfScript
commit a841d1e25b1b26b6381f36e14307c2549a79edb4 (HEAD)
Author: Martin Ankerl <martin.ankerl@gmail.com>
Date:   Sun Mar 8 16:20:55 2020 +0100

    remove unnecessary include util/time.h
Making all in src
Making all in doc/man
make[1]: Nothing to be done for `all'.
make[1]: Nothing to be done for `all-am'.
Warning, results might be unstable:
* NDEBUG not defined, assert() macros are evaluated

Recommendations
* Make sure you compile for Release

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          303,061.00 |            3,299.67 |    2.6% |      0.00 | `VerifyNestedIfScript`
mac-mini:bitcoin-core marco$ git log -1 && make -j 9 && ./src/bench/bench_bitcoin --filter=VerifyNestedIfScript
commit a841d1e25b1b26b6381f36e14307c2549a79edb4 (HEAD)
Author: Martin Ankerl <martin.ankerl@gmail.com>
Date:   Sun Mar 8 16:20:55 2020 +0100

    remove unnecessary include util/time.h
Making all in src
Making all in doc/man
make[1]: Nothing to be done for `all'.
make[1]: Nothing to be done for `all-am'.
Warning, results might be unstable:
* NDEBUG not defined, assert() macros are evaluated

Recommendations
* Make sure you compile for Release

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          155,083.00 |            6,448.16 |    9.4% |      0.00 | :wavy_dash: `VerifyNestedIfScript` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)

MarcoFalke commented at 2:57 pm on April 9, 2020: member

Notably:

* Consecutive runs are off by a factor of 2.
* There is a warning `Warning, results might be unstable: NDEBUG not defined, assert() macros are evaluated`, but I think it is impossible to compile Bitcoin Core with assert disabled.
* There is a warning from the framework itself: :wavy_dash: `VerifyNestedIfScript` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)

Replace current benchmarking framework with nanobench

This replaces the current benchmarking framework with nanobench [1], an
MIT licensed single-header benchmarking library, of which I am the
author. This has in my opinion several advantages, especially on Linux:

* fast: Running all benchmarks takes ~6 seconds instead of 4m13s on
  an Intel i7-8700 CPU @ 3.20GHz.

* accurate: I ran e.g. the benchmark for SipHash_32b 10 times and
  calculated standard deviation / mean = coefficient of variation:

  * 0.57% CV for old benchmarking framework
  * 0.20% CV for nanobench

  So the benchmark results with nanobench seem to vary less than with
  the old framework.

* It automatically determines runtime based on clock precision, no need
  to specify number of evaluations.

* measure instructions, cycles, branches, instructions per cycle,
  branch misses (only Linux, when performance counters are available)

* output in markdown table format.

* Warn about unstable environment (frequency scaling, turbo, ...)

* For better profiling, it is possible to set the environment variable
  NANOBENCH_ENDLESS to force endless running of a particular benchmark
  without the need to recompile. This makes it easy to e.g. run
  "perf top" and look at hotspots.

Here is an example copy & pasted from the terminal output:

|             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
|                2.52 |      396,529,415.94 |    0.6% |           25.42 |            8.02 |  3.169 |           0.06 |    0.0% |      0.03 | `bench/crypto_hash.cpp RIPEMD160`
|                1.87 |      535,161,444.83 |    0.3% |           21.36 |            5.95 |  3.589 |           0.06 |    0.0% |      0.02 | `bench/crypto_hash.cpp SHA1`
|                3.22 |      310,344,174.79 |    1.1% |           36.80 |           10.22 |  3.601 |           0.09 |    0.0% |      0.04 | `bench/crypto_hash.cpp SHA256`
|                2.01 |      496,375,796.23 |    0.0% |           18.72 |            6.43 |  2.911 |           0.01 |    1.0% |      0.00 | `bench/crypto_hash.cpp SHA256D64_1024`
|                7.23 |      138,263,519.35 |    0.1% |           82.66 |           23.11 |  3.577 |           1.63 |    0.1% |      0.00 | `bench/crypto_hash.cpp SHA256_32b`
|                3.04 |      328,780,166.40 |    0.3% |           35.82 |            9.69 |  3.696 |           0.03 |    0.0% |      0.03 | `bench/crypto_hash.cpp SHA512`

[1] https://github.com/martinus/nanobench

* Adds support for asymptotes

  This adds support to calculate asymptotic complexity of a benchmark.
  This is similar to #17375, but currently only one asymptote is
  supported, and I have added support in the benchmark `ComplexMemPool`
  as an example.

  Usage is e.g. like this:

  ```
  ./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800
  ```

  This runs the benchmark `ComplexMemPool` several times but with
  different complexityN settings. The benchmark can extract that number
  and use it accordingly. Here, it's used for `childTxs`. The output is
  this:

  | complexityN |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |     total | benchmark
  |------------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|----------:|:----------
  |          25 |        1,064,241.00 |              939.64 |    1.4% |    3,960,279.00 |    2,829,708.00 |  1.400 |      0.01 | `ComplexMemPool`
  |          50 |        1,579,530.00 |              633.10 |    1.0% |    6,231,810.00 |    4,412,674.00 |  1.412 |      0.02 | `ComplexMemPool`
  |         100 |        4,022,774.00 |              248.58 |    0.6% |   16,544,406.00 |   11,889,535.00 |  1.392 |      0.04 | `ComplexMemPool`
  |         200 |       15,390,986.00 |               64.97 |    0.2% |   63,904,254.00 |   47,731,705.00 |  1.339 |      0.17 | `ComplexMemPool`
  |         400 |       69,394,711.00 |               14.41 |    0.1% |  272,602,461.00 |  219,014,691.00 |  1.245 |      0.76 | `ComplexMemPool`
  |         600 |      168,977,165.00 |                5.92 |    0.1% |  639,108,082.00 |  535,316,887.00 |  1.194 |      1.86 | `ComplexMemPool`
  |         800 |      310,109,077.00 |                3.22 |    0.1% |1,149,134,246.00 |  984,620,812.00 |  1.167 |      3.41 | `ComplexMemPool`

  |   coefficient |   err% | complexity
  |--------------:|-------:|------------
  |   4.78486e-07 |   4.5% | O(n^2)
  |   6.38557e-10 |  21.7% | O(n^3)
  |   3.42338e-05 |  38.0% | O(n log n)
  |   0.000313914 |  46.9% | O(n)
  |     0.0129823 | 114.4% | O(log n)
  |     0.0815055 | 133.8% | O(1)

  The best fitting curve is O(n^2), so the algorithm seems to scale
  quadratically with `childTxs` in the range 25 to 800.
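The complexity table above can be approximated by fitting each candidate curve to the measurements. The following is a minimal sketch, assuming a least-squares fit of `seconds ≈ coefficient * f(n)` through the origin and an error measure defined as the RMS residual divided by the mean runtime. This is an assumption about how such a fit could be done, not a copy of nanobench's code; the names `Sample`, `Fit`, `FitCurve` and `ComplexMemPoolSamples` are hypothetical. With the `ns/op` values from the table converted to seconds, the fit gives coefficients close to the ones listed for O(n^2) and O(n).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One measurement: input size n and runtime in seconds.
struct Sample {
    double n;
    double seconds;
};

// Result of fitting seconds ~= coefficient * f(n).
struct Fit {
    double coefficient;
    double normalized_rms; // RMS residual divided by the mean runtime
};

// Least-squares fit through the origin for one candidate curve f(n).
template <typename F>
Fit FitCurve(const std::vector<Sample>& samples, F f)
{
    assert(!samples.empty());
    double sum_tf = 0.0, sum_ff = 0.0, sum_t = 0.0;
    for (const Sample& s : samples) {
        const double fn = f(s.n);
        sum_tf += s.seconds * fn;
        sum_ff += fn * fn;
        sum_t += s.seconds;
    }
    // Minimizing sum((c * f(n) - t)^2) over c gives c = sum(t * f) / sum(f * f).
    const double c = sum_tf / sum_ff;
    double sq_err = 0.0;
    for (const Sample& s : samples) {
        const double diff = c * f(s.n) - s.seconds;
        sq_err += diff * diff;
    }
    const double count = static_cast<double>(samples.size());
    return Fit{c, std::sqrt(sq_err / count) / (sum_t / count)};
}

// ns/op values from the ComplexMemPool table, converted to seconds.
std::vector<Sample> ComplexMemPoolSamples()
{
    return {{25.0, 1064241e-9},  {50.0, 1579530e-9},   {100.0, 4022774e-9},
            {200.0, 15390986e-9}, {400.0, 69394711e-9}, {600.0, 168977165e-9},
            {800.0, 310109077e-9}};
}

double Quadratic(double n) { return n * n; }
double Linear(double n) { return n; }
```

Calling `FitCurve(ComplexMemPoolSamples(), Quadratic)` and `FitCurve(ComplexMemPoolSamples(), Linear)` and comparing `normalized_rms` shows the quadratic curve fitting much better than the linear one, in line with the ranking in the table.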

78c312c983

martinus force-pushed on Jun 13, 2020

martinus commented at 11:03 am on June 13, 2020: contributor

I’ve finally found the time to rebase my branch with a major update of nanobench, and updated the new benchmarks too. @MarcoFalke, I am pretty sure your instability comes from CPU frequency scaling / turbo mode. For nanobench to produce accurate runtime results, it needs a CPU locked to a fixed frequency. Under Linux I print warnings & suggestions if I detect this, but I don’t have such logic in place for other operating systems (yet).

When I run the benchmark VerifyNestedIfScript a few times I get these results:

|               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |         bra/op |   miss% | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|:----------
|           63,447.00 |           15,761.19 |    0.9% |      677,018.00 |      201,628.00 |  3.358 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,094.00 |           15,849.37 |    0.8% |      677,018.00 |      201,495.00 |  3.360 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,548.00 |           15,736.14 |    0.7% |      677,018.00 |      201,894.00 |  3.353 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,340.00 |           15,787.81 |    0.9% |      677,018.00 |      201,894.00 |  3.353 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,288.00 |           15,800.78 |    1.0% |      677,018.00 |      201,096.00 |  3.367 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,261.00 |           15,807.53 |    0.9% |      677,018.00 |      201,894.00 |  3.353 |     143,829.00 |    0.2% | `VerifyNestedIfScript`
|           63,309.00 |           15,795.54 |    0.7% |      677,018.00 |      202,027.00 |  3.351 |     143,829.00 |    0.2% | `VerifyNestedIfScript`

So it is a very stable benchmark for me, with locked frequency scaling. When I don’t lock the CPU, my results fluctuate too, and the benchmark prints these warnings (I’ve removed the NDEBUG warning):

Warning, results might be unstable:
* CPU frequency scaling enabled: CPU 0 between 800.0 and 4,600.0 MHz
* CPU governor is 'powersave' but should be 'performance'
* Turbo is enabled, CPU frequency will fluctuate

Recommendations
* Use 'pyperf system tune' before benchmarking. See https://github.com/vstinner/pyperf
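The stability of the seven `VerifyNestedIfScript` runs listed above can be quantified with the coefficient of variation (standard deviation divided by mean), the same measure used in the PR description. The following is a minimal sketch, assuming the sample standard deviation (division by n - 1); the thread does not say which variant was used, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Coefficient of variation: sample standard deviation divided by the mean.
// Assumption: the sample (n - 1) variant of the standard deviation is used.
double CoefficientOfVariation(const std::vector<double>& values)
{
    assert(values.size() >= 2);
    const double count = static_cast<double>(values.size());
    double sum = 0.0;
    for (double v : values) sum += v;
    const double mean = sum / count;
    double sq = 0.0;
    for (double v : values) sq += (v - mean) * (v - mean);
    const double stddev = std::sqrt(sq / (count - 1.0));
    return stddev / mean;
}

// ns/op of the seven VerifyNestedIfScript runs from the table above.
std::vector<double> VerifyNestedIfScriptRuns()
{
    return {63447.0, 63094.0, 63548.0, 63340.0, 63288.0, 63261.0, 63309.0};
}
```

For these seven runs the result stays well below 1%, whereas the two earlier macOS runs (303,061 and 155,083 ns/op) differ by roughly a factor of two.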

DrahtBot removed the label Needs rebase on Jun 13, 2020

JeremyRubin commented at 6:33 pm on June 13, 2020: contributor

utACK 78c312c. @MarcoFalke can you repeat your benchmark after running pyperf system tune?

dongcarl commented at 7:37 pm on July 2, 2020: member

@martinus While going thru this in July 2nd, 2020’s meeting, I believe people were wondering what the support is like for non x86 architectures. Would it fail to compile? Have limited functionality? Or fail at runtime?

laanwj added this to the "Blockers" column in a project

laanwj removed the label Needs Conceptual Review on Jul 2, 2020

laanwj commented at 9:37 pm on July 2, 2020: member

FWIW, I could compile and run this PR (as merged on master) on RV64. It doesn’t seem there is any compatibility issue as was implied by the travis run.

Concept ACK.

martinus commented at 7:10 am on July 3, 2020: contributor

> @martinus While going thru this in July 2nd, 2020’s meeting, I believe people were wondering what the support is like for non x86 architectures. Would it fail to compile? Have limited functionality? Or fail at runtime?

The CPU statistics like instructions, cycles, and branch mispredictions are only available on Linux through perf events. But it should compile on any platform with C++11 support; there I’m only relying on the std::chrono timers.
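As an illustration of what a timer-only fallback can look like, here is a minimal C++11 sketch that measures the average time per call using nothing but `std::chrono`. It is not nanobench's implementation: the iteration count is passed in by hand rather than derived from the clock precision, and the name `NanosPerOp` is hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Average nanoseconds per call of `op`, measured with std::chrono only.
// Assumption: the caller chooses `iterations`; no automatic tuning is done.
template <typename Op>
double NanosPerOp(Op op, std::uint64_t iterations)
{
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        op();
    }
    const Clock::time_point end = Clock::now();
    // Convert the elapsed time to floating-point nanoseconds.
    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call such as `NanosPerOp([&] { /* work */ }, 100000)` returns the mean cost per call. Hardware counters (instructions, cycles, branches) have no portable equivalent in the standard library, which is why those columns are limited to Linux.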

I now have relatively comprehensive documentation for nanobench available here: https://nanobench.ankerl.com/

fjahr commented at 8:51 pm on July 10, 2020: member

Concept ACK

So far I have built the PR and run some tests without any problems.

laanwj commented at 1:33 pm on July 30, 2020: member

ACK 78c312c983255e15fc274de2368a2ec13ce81cbf

fanquake removed this from the "Blockers" column in a project

laanwj merged this on Jul 30, 2020

laanwj closed this on Jul 30, 2020

laanwj commented at 1:45 pm on July 30, 2020: member

I’ve merged this because there was unanimous agreement that we want the new benchmark framework and it works as expected here (as well as for @fjahr and @JeremyRubin). If @MarcoFalke’s issue is still a problem we should look into it; please open a GitHub issue for it.

jonatack commented at 2:17 pm on July 30, 2020: member

👍 per my ACK a few months ago #18011#pullrequestreview-366872789

martinus commented at 5:43 pm on July 30, 2020: contributor

Thanks for merging! If anyone has any questions or issues with nanobench, please notify me.

sidhujag referenced this in commit f60da833be on Jul 31, 2020

hebasto commented at 6:27 pm on July 31, 2020: member

@martinus Mind looking into #18710 if it has any performance regression on supported platforms?

hebasto commented at 11:04 am on August 14, 2020: member

@martinus How to conditionally skip a benchmark in nanobench framework (wrt #19710 (comment) and #19710 (comment))?

martinus commented at 11:07 am on August 14, 2020: contributor

You can use -filter to specify a regular expression for which tests to run
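Selecting benchmarks by regular expression can be sketched as follows. This is a minimal illustration, assuming the filter must match the whole benchmark name (`std::regex_match` rather than `std::regex_search`); the function `FilterBenchmarks` is hypothetical and not taken from `bench.cpp`.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the benchmark names that the filter expression matches.
// Assumption: the expression has to match the entire name.
std::vector<std::string> FilterBenchmarks(const std::vector<std::string>& names,
                                          const std::string& filter)
{
    const std::regex re(filter);
    std::vector<std::string> selected;
    for (const std::string& name : names) {
        if (std::regex_match(name, re)) {
            selected.push_back(name);
        }
    }
    return selected;
}
```

Under this assumption a filter of `SHA.*` selects `SHA256` and `SHA512`, while a bare `SHA` selects nothing because it does not cover the full name.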

hebasto commented at 11:08 am on August 14, 2020: member

> You can use -filter to specify a regular expression for which tests to run

I mean in the code, e.g., skip CCheckQueueSpeedPrevectorJob if GetNumCores() < 2

martinus commented at 11:20 am on August 14, 2020: contributor

> > You can use -filter to specify a regular expression for which tests to run
>
> I mean in the code, e.g., skip CCheckQueueSpeedPrevectorJob if GetNumCores() < 2

Ah, of course

Before the benchmark is run you can do a check and then simply return, e.g. like so:

static void CCheckQueueSpeedPrevectorJob(benchmark::Bench& bench)
{
    // Skip the benchmark when fewer than two cores are available.
    if (GetNumCores() < 2) {
        return;
    }
    // ... rest of the benchmark ...
}

But that needs a little update in bench.cpp, because then the benchmark doesn’t have any results:

        if (!bench.results().empty()) {
            benchmarkResults.push_back(bench.results().back());
        }
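The two snippets above work together: a benchmark that returns early records nothing, so the runner must not call `back()` on an empty result list. The following self-contained sketch shows this interplay with a hypothetical stand-in class `MiniBench`; it is not the nanobench API, the timing itself is left out, and `num_cores` is passed in as a parameter in place of `GetNumCores()`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a benchmark result; not the nanobench type.
struct Result {
    std::string name;
    double ns_per_op;
};

// Hypothetical stand-in for the benchmark object; not the nanobench API.
class MiniBench
{
public:
    template <typename Op>
    void run(const std::string& name, Op op)
    {
        op();
        m_results.push_back(Result{name, 0.0}); // timing omitted in this sketch
    }
    const std::vector<Result>& results() const { return m_results; }

private:
    std::vector<Result> m_results;
};

// Benchmark that skips itself when fewer than two cores are available.
void QueueBenchmark(MiniBench& bench, int num_cores)
{
    if (num_cores < 2) {
        return; // nothing is recorded for a skipped benchmark
    }
    bench.run("QueueBenchmark", [] {});
}

// Runner: collect a result only if the benchmark actually produced one.
std::vector<Result> RunAll(int num_cores)
{
    std::vector<Result> collected;
    MiniBench bench;
    QueueBenchmark(bench, num_cores);
    if (!bench.results().empty()) {
        collected.push_back(bench.results().back());
    }
    return collected;
}
```

Without the `empty()` check, `back()` on an empty `std::vector` would be undefined behavior, which is why the runner needs the small update.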

hebasto commented at 12:26 pm on August 14, 2020: member

@martinus Thanks! I’ve submitted a commit (ce3e6a7cb21d1aa455513970846e1f70c01472a4) in #19710.

MarcoFalke commented at 9:25 am on August 24, 2020: member

Not sure if this is caused by this pull, but some benchmarks changed performance quite significantly:

* BenchLockedPool went down almost 100% to approx 0 compared to before: https://codespeed.bitcoinperf.com/timeline/#/?exe=3,4,2,1,5&base=1+23&ben=micro.clang.BenchLockedPool&env=1&revs=200&equid=off&quarts=on&extr=on
* All the prevector ones as well (except the serialization one): https://codespeed.bitcoinperf.com/timeline/#/?exe=3,4,2,1,5&base=1+23&ben=micro.clang.PrevectorClearNontrivial&env=1&revs=200&equid=off&quarts=on&extr=on

Is this due to compiler optimizations or something else?

Funnily the trig dummy bench went up by ~100%: https://codespeed.bitcoinperf.com/timeline/#/?exe=3,4,2,1,5&base=1+23&ben=micro.clang.Trig&env=1&revs=200&equid=off&quarts=on&extr=on

Fabcien referenced this in commit f95373f234 on Feb 15, 2021

PastaPastaPasta referenced this in commit c0fe0715eb on May 1, 2021

kittywhiskers referenced this in commit cd539c56de on Jun 4, 2021

kittywhiskers referenced this in commit c7f24ef868 on Jun 4, 2021

kittywhiskers referenced this in commit 7690d29f56 on Jun 5, 2021

kittywhiskers referenced this in commit 9cdcc24331 on Jun 16, 2021

kittywhiskers referenced this in commit 089e4714d2 on Jun 24, 2021

kittywhiskers referenced this in commit c3b767074d on Jun 25, 2021

kittywhiskers referenced this in commit 3ea50bce72 on Jun 25, 2021

kittywhiskers referenced this in commit 4dc595d234 on Jun 26, 2021

kittywhiskers referenced this in commit c6edf6654f on Jun 27, 2021

kittywhiskers referenced this in commit 5849582e77 on Jun 28, 2021

kittywhiskers referenced this in commit 3c15853960 on Jul 2, 2021

kittywhiskers referenced this in commit 92e05078ae on Jul 4, 2021

kittywhiskers referenced this in commit b5db3c3d65 on Jul 5, 2021

PastaPastaPasta referenced this in commit 6e4099ea67 on Jul 6, 2021

DrahtBot locked this on Feb 15, 2022

Replace current benchmarking framework with nanobench #18011

Conflicts