Replace current benchmarking framework with nanobench #18011

pull martinus wants to merge 1 commit into bitcoin:master from martinus:2019-10-nanobench changing 38 files +3644 −573
  1. martinus commented at 9:53 pm on January 27, 2020: contributor

    Replace current benchmarking framework with nanobench

    This replaces the current benchmarking framework with nanobench [1], an MIT-licensed single-header benchmarking library of which I am the author. In my opinion this has several advantages, especially on Linux:

    • fast: Running all benchmarks takes ~6 seconds instead of 4m13s on an Intel i7-8700 CPU @ 3.20GHz.

    • accurate: I ran e.g. the benchmark for SipHash_32b 10 times and calculated standard deviation / mean = coefficient of variation (CV):

      • 0.57% CV for old benchmarking framework
      • 0.20% CV for nanobench

      So the benchmark results with nanobench seem to vary less than with the old framework.

    • It automatically determines runtime based on clock precision, no need to specify number of evaluations.

    • measures instructions, cycles, branches, instructions per cycle, and branch misses (Linux only, when performance counters are available)

    • outputs results in markdown table format.

    • warns about an unstable environment (frequency scaling, turbo, …)

    • For better profiling, it is possible to set the environment variable NANOBENCH_ENDLESS to force endless running of a particular benchmark without the need to recompile. This makes it easy to e.g. run “perf top” and look at hotspots.

    Here is an example copy & pasted from the terminal output:

    ns/byte byte/s err% ins/byte cyc/byte IPC bra/byte miss% total benchmark
    2.52 396,529,415.94 0.6% 25.42 8.02 3.169 0.06 0.0% 0.03 bench/crypto_hash.cpp RIPEMD160
    1.87 535,161,444.83 0.3% 21.36 5.95 3.589 0.06 0.0% 0.02 bench/crypto_hash.cpp SHA1
    3.22 310,344,174.79 1.1% 36.80 10.22 3.601 0.09 0.0% 0.04 bench/crypto_hash.cpp SHA256
    2.01 496,375,796.23 0.0% 18.72 6.43 2.911 0.01 1.0% 0.00 bench/crypto_hash.cpp SHA256D64_1024
    7.23 138,263,519.35 0.1% 82.66 23.11 3.577 1.63 0.1% 0.00 bench/crypto_hash.cpp SHA256_32b
    3.04 328,780,166.40 0.3% 35.82 9.69 3.696 0.03 0.0% 0.03 bench/crypto_hash.cpp SHA512

    [1] https://github.com/martinus/nanobench

  2. DrahtBot added the label Build system on Jan 27, 2020
  3. DrahtBot added the label Docs on Jan 27, 2020
  4. DrahtBot added the label Scripts and tools on Jan 27, 2020
  5. DrahtBot added the label Tests on Jan 27, 2020
  6. MarcoFalke removed the label Build system on Jan 27, 2020
  7. MarcoFalke removed the label Docs on Jan 27, 2020
  8. MarcoFalke removed the label Scripts and tools on Jan 27, 2020
  9. JeremyRubin commented at 10:09 pm on January 27, 2020: contributor

    Strong concept ACK! Seems like a big improvement.

    Can you comment more on the 6 seconds claim? AFAIK each bench was supposed to target running for 1 second? Is this no longer required to reduce variance?

    Secondly – and separately – can you comment on how this might impact the need for something like #17375? Can we add better support for benchmarks where we want to run with different scaling params and output each trial to get a sense of the complexity?

  10. martinus commented at 10:21 pm on January 27, 2020: contributor

    I calculate a good number of iterations based on the clock accuracy, then perform these iterations a few times and use the median to get rid of outliers. I actually found it to be more reliable with shorter runs, because there is less chance for random fluctuations to interfere. It is necessary though to disable frequency scaling etc. (but this should be done with the old framework too anyway). This can easily be done with e.g. pyperf.

    Concerning #17375, nanobench can estimate complexity, but it requires a bit of code change: https://github.com/martinus/nanobench/blob/master/docs/reference.md#asymptotic-complexity

  11. fanquake added the label Needs Conceptual Review on Jan 27, 2020
  12. DrahtBot commented at 1:49 am on January 28, 2020: member

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #19377 (bench: Add OrphanTxPool benchmark by hebasto)
    • #19326 (Simplify hash.h interface using Spans by sipa)
    • #19280 (Verify the block filter hash when reading from disk. by pstratem)
    • #19181 (Add ASM optimizations for MuHash3072 by fjahr)
    • #19145 (Add hash_type MUHASH for gettxoutsetinfo by fjahr)
    • #19055 (Add MuHash3072 implementation by fjahr)
    • #18815 (bench: Add logging benchmark by MarcoFalke)
    • #18731 (refactor: Make CCheckQueue RAII-styled by hebasto)
    • #18710 (Add local thread pool to CCheckQueue by hebasto)
    • #18354 (Use shared pointers only in validation interface by bvbfan)
    • #18261 (Erlay: bandwidth-efficient transaction relay protocol by naumenkogs)
    • #18014 (lib: Optimizing siphash implementation by elichai)
    • #17526 (Use Single Random Draw In addition to knapsack as coin selection fallback by achow101)
    • #17331 (Use effective values throughout coin selection by achow101)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  13. martinus commented at 7:00 am on January 28, 2020: contributor

    The lint check currently fails with this error:

    fatal: bad revision ‘8b138526b5dc…488d538cbf6f’

    I believe the reason is some key verification check at the end of ci/lint/06_script.sh, but I can’t really say why this is failing

  14. MarcoFalke commented at 3:05 pm on January 28, 2020: member

    Concept ACK

    The travis failure is

     The locale dependent function std::to_string(...) appears to be used:

     src/bench/nanobench.h:            auto sysCpu = "/sys/devices/system/cpu/cpu" + std::to_string(id);
     src/bench/nanobench.h:                warnings.emplace_back("CPU frequency scaling enabled: CPU " + std::to_string(id) + " between " +

     Unnecessary locale dependence can cause bugs that are very
     tricky to isolate and fix. Please avoid using locale dependent
     functions if possible.

     Advice not applicable in this specific case? Add an exception
     by updating the ignore list in test/lint/lint-locale-dependence.sh

     ^---- failure generated from test/lint/lint-locale-dependence.sh
    
  15. jamesob commented at 3:24 pm on January 28, 2020: member
    Would it be easy to hack in csv output that is somewhat similar to the existing output? The markdown table looks a little tricky to parse programmatically (though it could be done). For example, bitcoinperf (https://github.com/chaincodelabs/bitcoinperf) currently relies on this format.
  16. martinus commented at 3:53 pm on January 28, 2020: contributor

    The locale dependent function std::to_string(…) appears to be used:

    Ah, ok I’ll fix this

    Would it be easy to hack in csv output that is somewhat similar to the existing output?

    I think it should be easy, in nanobench I already have CSV & JSON output format using mustache-like templates, so it’s possible to create a custom format. I have not exposed this feature yet though in this PR.

  17. practicalswift commented at 4:49 pm on January 28, 2020: contributor
    Strong concept ACK @martinus, thanks for your great contributions! Please keep them coming :)
  18. elichai commented at 5:02 pm on January 28, 2020: contributor

    Does this library also do memory clobbers and barriers? (like google’s DoNotOptimize[0], ClobberMemory[1], or Rust’s black_box[2])

    [0] https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L307 [1] https://github.com/google/benchmark/blob/master/include/benchmark/benchmark.h#L326 [2] https://doc.rust-lang.org/std/hint/fn.black_box.html

  19. martinus commented at 5:11 pm on January 28, 2020: contributor

    Does this library also do memory clobbers and barriers? (like google’s DoNotOptimize[0], ClobberMemory[1], or Rust’s black_box[2])

    I currently have doNotOptimizeAway, which is based on folly’s benchmark. I think folly’s version is based on google benchmark. I have not added doNotOptimizeAway calls in the PR because I did not want to modify each benchmark too much

    I don’t have clobberMemory yet, because I’ve never used it… What I’m also doing is I force the run(...) to be noinline to prevent some optimizations.

  20. martinus force-pushed on Jan 28, 2020
  21. JeremyRubin commented at 8:11 pm on January 28, 2020: contributor
    Is it possible to make the nanobench include a submodule (like secp256k1 or univalue) so that it’s easier for us to pull in updates from upstream? If you plan on adding new features to nanobench, that should help streamline review potentially. If you think that there will be Bitcoin specific changes made to the header that you wouldn’t want to upstream, then I would leave it as you’ve done.
  22. JeremyRubin commented at 8:21 pm on January 28, 2020: contributor

    Closed #17375 in favor of nanobench. When you have time would love assistance in making the asymptotic test introduced there nanobench compatible.

    Being able to run asymptotic tests on the code is going to be a huge help with advocating for mempool policy changes in the future (e.g., loosening descendants limit) once the epoch mempool work is completed. This also impacts other parts of the code (e.g., wallet) where right now we don’t have good insight into whether we introduce a regression. I think the curve fitting is a bit less useful because we do care about the constant factors too (e.g., if we’re O(n log n) vs. O(n) but c > log n for max n), but it’s quite nifty nonetheless.

  23. JeremyRubin added this to the "General Testing" column in a project

  24. martinus commented at 9:46 am on January 29, 2020: contributor

    Is it possible to make the nanobench include a submodule (like secp256k1 or univalue)

    I think it should be possible; I need to read up on how git-subtree works… I prefer that nanobench stays generic, and will try to implement bitcoin’s requirements in a generic way so it’s usable by others too. So no separate repository if possible.

    When you have time would love assistance in making the asymptotic test introduced there nanobench compatible.

    Sure, I’ll have a look at #17375 when I have time! I can’t predict how soon this is though.

  25. elichai commented at 9:52 am on January 29, 2020: contributor
    Haven’t looked at the library itself yet, but Concept ACK on replacing the current framework. (I really dislike it) Personally I also would’ve been fine with dynamically linking against google’s benchmarking library (https://github.com/google/benchmark)
  26. martinus commented at 11:46 am on January 29, 2020: contributor

    Personally I also would’ve been fine with dynamically linking against google’s benchmarking library (https://github.com/google/benchmark)

    I don’t think google benchmark is viable here. It’s a large dependency, and you also need to use the gtest framework for this. It would be quite a big change.

  27. Empact commented at 3:16 am on January 30, 2020: member
    Concept ACK
  28. JeremyRubin commented at 8:15 pm on January 30, 2020: contributor

    We discussed nanobench today in the IRC meeting. There seems to be general agreement that this is a nice idea, and that the current bench framework isn’t perfect.

    Our current bench framework is actually based on Google’s, and I think most people are opposed to linking google’s whole thing.

    With respect to the question of whether to subtree or not: let’s ignore that for now, copied in is fine, and we can deal with that in the future if we require changes to nanobench or if there are new features in nanobench we want to incorporate.

    There’s some concern about maintaining compatibility with existing tools. I think given that the output is fundamentally different from before (no longer reporting iterations and things like that) we can’t have perfect parity. But perhaps we could:

    1. “backport” on top of the last few releases (no new release) so that we have a bit more history to compare with
    2. Add a compatibility mode which emits something similar to the previous output, with NaN substituted where nanobench has no equivalent value.
    3. Ignore compatibility, but do support output into a nice machine-readable format.

    I think most people would be satisfied with 3, as 2 can be done with a script on 3’s output and 1 can be done if someone has the itch for it.

  29. martinus commented at 9:50 pm on January 30, 2020: contributor

    Thanks for the summary! I just read through the log here and think I can add a few clarifications:

    I think if we can do a cursory check it’s not actually malware

    It’s not malware, not sure how I can help here :) I created nanobench because I was annoyed at how difficult other benchmarking frameworks were to integrate into an existing codebase, because I don’t like google test.

    For example, a lot of tools rely on the output format of the current bench framework

    I have templating support in nanobench, and I can relatively easily add another output format that closely resembles the current one. I need to make some improvements to the available data in nanobench, then I can use a template like this to produce practically the same output as before:

    # Benchmark, evals, iterations, total, min, max, median
    {{#benchmarks}} {{name}}, {{num_measurements}}, {{total_iters}}, {{total_runtime}}, {{min}}, {{max}}, {{median}}
    {{/benchmarks}}
    

    Then I can e.g. print the markdown tables to stdout, and create e.g. benchmarkresults.csv along with it based on the template format.

    I believe nanobench autodetects variance or something

    Google benchmark is quite simple: it has a fixed runtime that it wants to achieve, then finds out the number of iterations it needs to do to get there, then it measures the time for that.

    In nanobench I try to be a bit smarter: I find out the clock’s accuracy first, and base the target runtime on that. Since clocks are nowadays very accurate (18ns or so on my machine), I can easily perform the measurement multiple times and use the median to get rid of outliers.

    The fast runtimes give very repeatable measurements for code that’s deterministic (e.g. SHA hashing). There I believe nanobench has a clear advantage over all other benchmarking frameworks that I’ve tried.

    When the code under test has fluctuations (e.g. because it allocates memory, or has some randomness or lots of cache misses / branch mispredictions in it), nanobench’s runtime measurement probably isn’t better than google benchmark’s. In that case it helps to also show the numbers for branch misses and retired instruction count to get a better feeling.

  30. DrahtBot added the label Needs rebase on Feb 10, 2020
  31. martinus force-pushed on Feb 20, 2020
  32. DrahtBot removed the label Needs rebase on Feb 20, 2020
  33. martinus force-pushed on Feb 20, 2020
  34. martinus commented at 4:17 pm on February 20, 2020: contributor

    I’ve rebased & pushed a big update to the code. In addition to the markdown output, I also generate a file benchmarkresults.csv which has practically the same content as the output had previously. This can be used by any tools that rely on the benchmark output. On my computer, the file has this content:

     # Benchmark, evals, iterations, total, min, max, median
     AssembleBlock, 11, 1, 0.006106585, 0.000538949, 0.000643127, 0.000545681
     Base58CheckEncode, 11, 9.81818181818182, 0.000240226, 2.06e-06, 3.02745454545455e-06, 2.07709090909091e-06
     Base58Decode, 11, 27.2727272727273, 0.000245003, 8.11689655172414e-07, 8.1968e-07, 8.166e-07
     Base58Encode, 11, 18.5454545454545, 0.000244574, 1.19615e-06, 1.2115e-06, 1.19763157894737e-06
     Bech32Decode, 11, 42.3636363636364, 0.000250479, 4.27466666666667e-07, 1.32740740740741e-06, 4.35666666666667e-07
     Bech32Encode, 11, 34.6363636363636, 0.000248104, 6.29258064516129e-07, 8.0896875e-07, 6.36314285714286e-07
     BenchLockedPool, 11, 167.363636363636, 0.000264967, 1.08922651933702e-07, 1.73769662921348e-07, 1.46138888888889e-07
     BenchTimeDeprecated, 11, 4587.81818181818, 0.00024425, 4.38138264341248e-09, 5.01115472009915e-09, 5.0051077059738e-09
     BenchTimeMillis, 11, 237.363636363636, 0.00024483, 9.26184738955823e-08, 9.43307086614173e-08, 9.3963963963964e-08
     BenchTimeMillisSys, 11, 243.272727272727, 0.000244799, 9.1088122605364e-08, 9.19113924050633e-08, 9.15038461538462e-08
     BenchTimeMock, 11, 10562.9090909091, 0.000246826, 2.08633681343622e-09, 2.29533626901521e-09, 2.08746447742343e-09
     BlockToJsonVerbose, 11, 1, 0.812951704, 0.072167457, 0.085656596, 0.072707134
     CCheckQueueSpeedPrevectorJob, 11, 11.1818181818182, 0.199323679, 0.00153634016666667, 0.00172734963636364, 0.0016258129
     CCoinsCaching, 11, 36.5454545454545, 0.000227408, 5.422e-07, 7.69351351351351e-07, 5.44605263157895e-07
     CHACHA20_1MB, 11, 1, 0.023085774, 0.00201953, 0.002240195, 0.002090533
     CHACHA20_256BYTES, 11, 44.5454545454545, 0.000245398, 4.99659574468085e-07, 5.02047619047619e-07, 5.00717391304348e-07
     CHACHA20_64BYTES, 11, 168.363636363636, 0.000245097, 1.32019736842105e-07, 1.32545454545455e-07, 1.32360759493671e-07
     CHACHA20_POLY1305_AEAD_1MB_ENCRYPT_DECRYPT, 11, 1, 0.063083902, 0.005614908, 0.006009997, 0.005679771
     CHACHA20_POLY1305_AEAD_1MB_ONLY_ENCRYPT, 11, 1, 0.031591654, 0.002799809, 0.003083797, 0.002855794
     CHACHA20_POLY1305_AEAD_256BYTES_ENCRYPT_DECRYPT, 11, 11.2727272727273, 0.000236947, 1.89941666666667e-06, 1.92533333333333e-06, 1.90981818181818e-06
     CHACHA20_POLY1305_AEAD_256BYTES_ONLY_ENCRYPT, 11, 21.5454545454545, 0.000233245, 9.55045454545455e-07, 1.23718181818182e-06, 9.5747619047619e-07
     CHACHA20_POLY1305_AEAD_64BYTES_ENCRYPT_DECRYPT, 11, 25, 0.000247072, 8.81333333333333e-07, 1.015125e-06, 8.87961538461539e-07
     CHACHA20_POLY1305_AEAD_64BYTES_ONLY_ENCRYPT, 11, 48.9090909090909, 0.00024602, 4.42068181818182e-07, 5.63777777777778e-07, 4.45867924528302e-07
     ComplexMemPool, 11, 1, 3.457592325, 0.313054487, 0.316239363, 0.313548617
     ConstructGCSFilter, 11, 1, 0.018953542, 0.001667658, 0.001879675, 0.001676587
     DeserializeAndCheckBlockTest, 11, 1, 0.068399757, 0.006003952, 0.006412799, 0.00618906
     DeserializeBlockTest, 11, 1, 0.057322626, 0.005100356, 0.00547949, 0.005164525
     DuplicateInputs, 11, 1, 0.082094071, 0.007329521, 0.007526674, 0.007472722
     FastRandom_1bit, 11, 14773.0909090909, 0.000239545, 1.46252213259886e-09, 1.47889590295829e-09, 1.47590446579989e-09
     FastRandom_32bit, 11, 2285.45454545455, 0.000242346, 9.26413255360624e-09, 1.2139653815893e-08, 9.36415362731152e-09
     HASH_1MB, 11, 1, 0.037448639, 0.00333361, 0.003531469, 0.003387306
     HASH_256BYTES, 11, 17.2727272727273, 0.000244883, 1.28611764705882e-06, 1.291e-06, 1.28872222222222e-06
     HASH_64BYTES, 11, 32.8181818181818, 0.000244595, 6.75885714285714e-07, 6.79625e-07, 6.77514285714286e-07
     MatchGCSFilter, 11, 1, 0.000303108, 2.6787e-05, 3.0751e-05, 2.7047e-05
     MempoolEviction, 11, 1, 0.00035951, 2.6216e-05, 4.4312e-05, 3.0276e-05
     MerkleRoot, 11, 1, 0.013989533, 0.001220381, 0.001462655, 0.001242791
     POLY1305_1MB, 11, 1, 0.008822394, 0.000778289, 0.000909028, 0.000782461
     POLY1305_256BYTES, 11, 106.272727272727, 0.000243963, 2.07696428571429e-07, 2.09357142857143e-07, 2.088e-07
     POLY1305_64BYTES, 11, 328.636363636364, 0.000248855, 6.70127795527157e-08, 8.63612040133779e-08, 6.72832369942197e-08
     PrevectorClearNontrivial, 11, 791.454545454545, 0.000372666, 2.59770491803279e-08, 1.94881395348837e-07, 2.60037735849057e-08
     PrevectorClearTrivial, 11, 2630.54545454545, 0.000244958, 8.46212395795578e-09, 8.47009966777409e-09, 8.46505271378368e-09
     PrevectorDeserializeNontrivial, 11, 1, 0.001192481, 0.000105425, 0.000123912, 0.00010568
     PrevectorDeserializeTrivial, 11, 2, 0.000255826, 1.1333e-05, 1.2048e-05, 1.1635e-05
     PrevectorDestructorNontrivial, 11, 415.272727272727, 0.000358251, 5.15516483516484e-08, 3.43633971291866e-07, 5.15764966740576e-08
     PrevectorDestructorTrivial, 11, 1590.36363636364, 0.000237761, 1.31619407687461e-08, 1.39810235767683e-08, 1.3519e-08
     PrevectorResizeNontrivial, 11, 811.272727272727, 0.000273009, 2.46290155440415e-08, 8.05017709563164e-08, 2.46548295454545e-08
     PrevectorResizeTrivial, 11, 2536.54545454545, 0.000244927, 8.77334809892949e-09, 8.78255578093306e-09, 8.77719528178244e-09
     RIPEMD160, 11, 1, 0.028423992, 0.002516573, 0.002757947, 0.00255517
     RollingBloom, 11, 44.3636363636364, 0.000245064, 4.99872340425532e-07, 5.05348837209302e-07, 5.02531914893617e-07
     RollingBloomReset, 11, 1, 0.000696645, 6.2348e-05, 6.9429e-05, 6.2588e-05
     RpcMempool, 11, 1, 0.121633853, 0.010798163, 0.011604045, 0.010929157
     SHA1, 11, 1, 0.021057291, 0.001862186, 0.002103509, 0.001878588
     SHA256, 11, 1, 0.035631576, 0.00317911, 0.003383292, 0.003233048
     SHA256D64_1024, 11, 1, 0.001464103, 0.000132135, 0.000138764, 0.000132205
     SHA256_32b, 11, 95.6363636363636, 0.000244411, 2.31903225806452e-07, 2.32772727272727e-07, 2.32415841584158e-07
     SHA512, 11, 1, 0.034134625, 0.00303295, 0.003282536, 0.003086899
     SipHash_32b, 11, 776.181818181818, 0.000244754, 2.85349500713267e-08, 2.88179581795818e-08, 2.86630872483221e-08
     Trig, 11, 4313.81818181818, 0.000238146, 5.01395348837209e-09, 5.03433333333333e-09, 5.01748251748252e-09
     VerifyScriptBench, 11, 1, 0.001860283, 0.000155127, 0.000299556, 0.000155384
    

    Note that “number of iterations” is now a double value, because nanobench automatically determines the number of iterations, and the value is the average number of iterations over the 11 evaluations. (So e.g. Base58CheckEncode has 11 evaluations and 9.81818181818182 iterations, i.e. 11 * 9.818 ≈ 108 iterations in total.)

  35. martinus force-pushed on Feb 20, 2020
  36. martinus force-pushed on Feb 20, 2020
  37. JeremyRubin commented at 6:55 pm on February 20, 2020: contributor

    utACK 83a7839

    Verified that only benchmarks are affected; checked that the high-level design seems reasonable & an improvement over what we do presently.

  38. martinus commented at 9:47 pm on February 20, 2020: contributor

    In bf5ae5e I’ve added some support for asymptotes. I hope that’s somewhat similar to what you did in #17375, @JeremyRubin?

    Usage is e.g. like this:

    ./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800
    

    This runs the benchmark ComplexMemPool several times but with different complexityN settings. The benchmark can extract that number and use it accordingly. Here, it’s used for childTxs. The output is this:

    complexityN ns/op op/s err% ins/op cyc/op IPC total benchmark
    25 1,064,241.00 939.64 1.4% 3,960,279.00 2,829,708.00 1.400 0.01 ComplexMemPool
    50 1,579,530.00 633.10 1.0% 6,231,810.00 4,412,674.00 1.412 0.02 ComplexMemPool
    100 4,022,774.00 248.58 0.6% 16,544,406.00 11,889,535.00 1.392 0.04 ComplexMemPool
    200 15,390,986.00 64.97 0.2% 63,904,254.00 47,731,705.00 1.339 0.17 ComplexMemPool
    400 69,394,711.00 14.41 0.1% 272,602,461.00 219,014,691.00 1.245 0.76 ComplexMemPool
    600 168,977,165.00 5.92 0.1% 639,108,082.00 535,316,887.00 1.194 1.86 ComplexMemPool
    800 310,109,077.00 3.22 0.1% 1,149,134,246.00 984,620,812.00 1.167 3.41 ComplexMemPool
    coefficient err% complexity
    4.78486e-07 4.5% O(n^2)
    6.38557e-10 21.7% O(n^3)
    3.42338e-05 38.0% O(n log n)
    0.000313914 46.9% O(n)
    0.0129823 114.4% O(log n)
    0.0815055 133.8% O(1)

    The best-fitting curve is O(n^2), so the algorithm seems to scale quadratically with childTxs in the range 25 to 800.

  39. JeremyRubin commented at 10:51 pm on February 20, 2020: contributor

    utACK bf5ae5e

    bravo!

  40. in src/bench/bench.h:37 in bf5ae5ed0f outdated
    67-    const uint64_t m_num_evals;
    68-    std::vector<double> m_elapsed_results;
    69-    time_point m_start_time;
    70 
    71-    bool UpdateTimer(time_point finish_time);
    72+using namespace ankerl::nanobench;
    


    Empact commented at 6:11 pm on February 28, 2020:

    nit: How about:

    namespace nanobench { using namespace ankerl::nanobench; }
    

    So that all the external nanobench members are more explicitly identified?


    martinus commented at 12:11 pm on February 29, 2020:

    Sure I can do that. Would you do that inside the benchmark namespace? then all benchmark arguments would become e.g.

    static void Base58Encode(benchmark::nanobench::Bench& bench)
    

    Which is a bit long.


    Empact commented at 7:58 pm on February 29, 2020:

    How about doing individual assignments for the classes in use, e.g.:

    namespace benchmark { using ankerl::nanobench::Bench; }
    

    martinus commented at 6:45 am on March 1, 2020:
    I think that’s better; Bench is the only thing that’s needed in the benchmarks anyway. I’ve committed c2e924f which does that
  41. in src/bench/examples.cpp:6 in c2e924fc04 outdated
    4@@ -5,29 +5,18 @@
    5 #include <bench/bench.h>
    6 #include <util/time.h>
    


    jonatack commented at 5:08 pm on March 1, 2020:
    can remove #include <util/time.h>

  42. in src/bench/nanobench.h:663 in c2e924fc04 outdated
    660+// declarations ///////////////////////////////////////////////////////////////////////////////////
    661+
    662+namespace ankerl {
    663+namespace nanobench {
    664+
    665+// helper stuff that only intended to be used internally
    


    jonatack commented at 5:24 pm on March 1, 2020:
    nit here and L::1056: s/that/that is/
  43. in src/bench/nanobench.h:1191 in c2e924fc04 outdated
    1188+#        pragma clang diagnostic pop
    1189+#    endif
    1190+    return pc;
    1191+}
    1192+
    1193+// Windows version of do not optimize away
    


    jonatack commented at 5:28 pm on March 1, 2020:
    nit: of?

    martinus commented at 2:41 pm on March 2, 2020:
    it should say Windows version of doNotOptimizeAway

    martinus commented at 3:25 pm on March 8, 2020:
    I’ve reworded it a bit in rebase with new version of nanobench.h in https://github.com/bitcoin/bitcoin/pull/18011/commits/79fd93ae7da64aa6d7532aa46734623d0824098d
  44. in src/bench/bench_bitcoin.cpp:18 in c2e924fc04 outdated
    19 
    20 static void SetupBenchArgs()
    21 {
    22     SetupHelpOptions(gArgs);
    23 
    24     gArgs.AddArg("-list", "List benchmarks without executing them. Can be combined with -scaling and -filter", ArgsManager::ALLOW_ANY, OptionsCategory::OPTIONS);
    


    jonatack commented at 6:08 pm on March 1, 2020:
    remove “-scaling and”… as the option is now removed
  45. in src/bench/bench.cpp:70 in c2e924fc04 outdated
    89-
    90-        if (m_elapsed_results.size() == m_num_evals) {
    91-            return false;
    92+    if (!benchmarkResults.empty()) {
    93+        // Generate legacy CSV data to "benchmarkresults.csv"
    94+        std::ofstream fout("benchmarkresults.csv");
    


    jonatack commented at 6:18 pm on March 1, 2020:
    thought: would benchmark_results.csv be more consistent with the project file naming (I’m not sure and won’t bikeshed further)

    martinus commented at 2:46 pm on March 2, 2020:
    maybe I should add an option like -csv=<filename> that enables writing the .csv to the given file, if the option is present. Currently I always write a “benchmarkresults.csv”, which can be a bit annoying when not wanted
  46. jonatack commented at 6:36 pm on March 1, 2020: member

    Tested and light code review ACK c2e924fc04 – built/tests/ran benches/tested the options (-?/-help, -asymptote=, -filter=, -list)… modulo a few minor comments below.

    Runs very quickly.

    Various bench runs for info and comparison:

    New, simplified options help:

     bitcoin ((HEAD detached at origin/pr/18011))$ ./src/bench/bench_bitcoin -?
     Options:

       -?
            Print this help message and exit

       -asymptote=n1,n2,n3,...
            Test asymptotic growth of the runtime of an algorithm, if supported by
            the benchmark

       -filter=<regex>
            Regular expression filter to select benchmark by name (default: .*)

       -list
            List benchmarks without executing them. Can be combined with -scaling
            and -filter
    

    One potential concern is how this will affect the usefulness of long-term benchmarking projects like https://github.com/chaincodelabs/bitcoinperf, e.g. https://bitcoinperf.com (which seems to be down?)

  47. DrahtBot added the label Needs rebase on Mar 6, 2020
  48. martinus force-pushed on Mar 7, 2020
  49. DrahtBot removed the label Needs rebase on Mar 7, 2020
  50. martinus force-pushed on Mar 8, 2020
  51. martinus commented at 7:22 am on March 8, 2020: contributor

    I’ve pushed a few updates:

    • updated nanobench.h with a (bit) faster RNG, and explicit constructors so extended-lint-all.h doesn’t complain any more
    • add command line option -output_csv to enable creation of the legacy CSV file

    • add command line option -output_json to create a big JSON file with all data

    Not sure if I should squash all the changes into a single commit or leave them separate?

    Here is the console output, .csv file, and .json file of one run (click RAW):

    https://gist.github.com/martinus/d5e596b7802199737ef38399ef749d51

  52. jonatack commented at 9:55 am on March 8, 2020: member
    Perhaps the gArgs.AddArg("-list") fix and the “comment nits” changes ought to be in 79fd93a where they are first changed rather than 51b83e1 and 9d6eb72. Alternatively, with 2 ACKs it may have been good to rebase with no changes to preserve existing review and do the rest in a follow-up PR (I’m not sure).
  53. in src/bench/examples.cpp:10 in 79fd93ae7d outdated
     4@@ -5,29 +5,18 @@
     5 #include <bench/bench.h>
     6 #include <util/time.h>
     7 
     8-// Sanity test: this should loop ten times, and
     9-// min/max/average should be close to 100ms.
    10-static void Sleep100ms(benchmark::State& state)
    


    elichai commented at 10:33 am on March 8, 2020:
    Why did you remove this test? (should probably move to tests though)

    elichai commented at 10:37 am on March 8, 2020:

    martinus commented at 10:55 am on March 8, 2020:
    I removed this test because it’s rather useless; it would uselessly slow down the whole benchmark run quite a bit compared to the others, and since it’s in examples.cpp I assumed it’s just an example anyway

    jonatack commented at 12:52 pm on March 8, 2020:
    Unless I was missing something, if you remove this test then the #include <util/time.h> can be removed with it.

    martinus commented at 3:19 pm on March 8, 2020:
    Ah right, I totally forgot to remove the include

  54. elichai approved
  55. elichai commented at 2:54 pm on March 8, 2020: contributor

    tACK 9d6eb7207a9baa78ce9c6231e517717a598c8d33

    Went over the code changes in core, very briefly looked over nanobench.h, and compared the benchmark results with the results from before; it looks OK (there are some differences, but only in benchmarks with high variance)

    And I really like that it’s human readable now :)

  56. DrahtBot added the label Needs rebase on Mar 14, 2020
  57. martinus force-pushed on Mar 28, 2020
  58. martinus commented at 7:32 am on March 28, 2020: contributor
    rebased
  59. DrahtBot removed the label Needs rebase on Mar 28, 2020
  60. DrahtBot added the label Needs rebase on Apr 9, 2020
  61. MarcoFalke commented at 2:55 pm on April 9, 2020: member

    I tried running this locally, and it seems I am getting varying results:

     mac-mini:bitcoin-core marco$ git log -1 && make -j 9 && ./src/bench/bench_bitcoin --filter=VerifyNestedIfScript
     commit a841d1e25b1b26b6381f36e14307c2549a79edb4 (HEAD)
     Author: Martin Ankerl <martin.ankerl@gmail.com>
     Date:   Sun Mar 8 16:20:55 2020 +0100
     
         remove unnecessary include util/time.h
     Making all in src
     Making all in doc/man
     make[1]: Nothing to be done for `all'.
     make[1]: Nothing to be done for `all-am'.
     Warning, results might be unstable:
     * NDEBUG not defined, assert() macros are evaluated
     
     Recommendations
     * Make sure you compile for Release
     
     |               ns/op |                op/s |    err% |     total | benchmark
     |--------------------:|--------------------:|--------:|----------:|:----------
     |          303,061.00 |            3,299.67 |    2.6% |      0.00 | `VerifyNestedIfScript`
     mac-mini:bitcoin-core marco$ git log -1 && make -j 9 && ./src/bench/bench_bitcoin --filter=VerifyNestedIfScript
     commit a841d1e25b1b26b6381f36e14307c2549a79edb4 (HEAD)
     Author: Martin Ankerl <martin.ankerl@gmail.com>
     Date:   Sun Mar 8 16:20:55 2020 +0100
     
         remove unnecessary include util/time.h
     Making all in src
     Making all in doc/man
     make[1]: Nothing to be done for `all'.
     make[1]: Nothing to be done for `all-am'.
     Warning, results might be unstable:
     * NDEBUG not defined, assert() macros are evaluated
     
     Recommendations
     * Make sure you compile for Release
     
     |               ns/op |                op/s |    err% |     total | benchmark
     |--------------------:|--------------------:|--------:|----------:|:----------
     |          155,083.00 |            6,448.16 |    9.4% |      0.00 | :wavy_dash: `VerifyNestedIfScript` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
    
  62. MarcoFalke commented at 2:57 pm on April 9, 2020: member

    Notably:

    • Consecutive runs are off by a factor of 2
    • There is a warning `Warning, results might be unstable: NDEBUG not defined, assert() macros are evaluated`, but I think it is impossible to compile Bitcoin Core with assert disabled
    • There is a warning from the framework itself: :wavy_dash: `VerifyNestedIfScript` (Unstable with ~1.0 iters. Increase `minEpochIterations` to e.g. 10)
  63. Replace current benchmarking framework with nanobench
    This replaces the current benchmarking framework with nanobench [1], an
    MIT licensed single-header benchmarking library, of which I am the
    author. This has, in my opinion, several advantages, especially on Linux:
    
    * fast: Running all benchmarks takes ~6 seconds instead of 4m13s on
      an Intel i7-8700 CPU @ 3.20GHz.
    
    * accurate: I ran e.g. the benchmark for SipHash_32b 10 times and
      calculated standard deviation / mean = coefficient of variation:
    
      * 0.57% CV for old benchmarking framework
      * 0.20% CV for nanobench
    
      So the benchmark results with nanobench seem to vary less than with
      the old framework.
    
    * It automatically determines the runtime based on clock precision; there is
      no need to specify the number of evaluations.
    
    * measure instructions, cycles, branches, instructions per cycle,
      branch misses (only Linux, when performance counters are available)
    
    * output in markdown table format.
    
    * Warn about unstable environment (frequency scaling, turbo, ...)
    
    * For better profiling, it is possible to set the environment variable
      NANOBENCH_ENDLESS to force endless running of a particular benchmark
      without the need to recompile. This makes it possible to e.g. run "perf top"
      and look at hotspots.
    
    Here is an example copy & pasted from the terminal output:
    
    |             ns/byte |              byte/s |    err% |        ins/byte |        cyc/byte |    IPC |       bra/byte |   miss% |     total | benchmark
    |--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
    |                2.52 |      396,529,415.94 |    0.6% |           25.42 |            8.02 |  3.169 |           0.06 |    0.0% |      0.03 | `bench/crypto_hash.cpp RIPEMD160`
    |                1.87 |      535,161,444.83 |    0.3% |           21.36 |            5.95 |  3.589 |           0.06 |    0.0% |      0.02 | `bench/crypto_hash.cpp SHA1`
    |                3.22 |      310,344,174.79 |    1.1% |           36.80 |           10.22 |  3.601 |           0.09 |    0.0% |      0.04 | `bench/crypto_hash.cpp SHA256`
    |                2.01 |      496,375,796.23 |    0.0% |           18.72 |            6.43 |  2.911 |           0.01 |    1.0% |      0.00 | `bench/crypto_hash.cpp SHA256D64_1024`
    |                7.23 |      138,263,519.35 |    0.1% |           82.66 |           23.11 |  3.577 |           1.63 |    0.1% |      0.00 | `bench/crypto_hash.cpp SHA256_32b`
    |                3.04 |      328,780,166.40 |    0.3% |           35.82 |            9.69 |  3.696 |           0.03 |    0.0% |      0.03 | `bench/crypto_hash.cpp SHA512`
    
    [1] https://github.com/martinus/nanobench
    
    * Adds support for asymptotes
    
      This adds support to calculate asymptotic complexity of a benchmark.
      This is similar to #17375, but currently only one asymptote is
      supported, and I have added support in the benchmark `ComplexMemPool`
      as an example.
    
      Usage is e.g. like this:
    
      ```
      ./bench_bitcoin -filter=ComplexMemPool -asymptote=25,50,100,200,400,600,800
      ```
    
      This runs the benchmark `ComplexMemPool` several times but with
      different complexityN settings. The benchmark can extract that number
      and use it accordingly. Here, it's used for `childTxs`. The output is
      this:
    
      | complexityN |               ns/op |                op/s |    err% |          ins/op |          cyc/op |    IPC |     total | benchmark
      |------------:|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|----------:|:----------
      |          25 |        1,064,241.00 |              939.64 |    1.4% |    3,960,279.00 |    2,829,708.00 |  1.400 |      0.01 | `ComplexMemPool`
      |          50 |        1,579,530.00 |              633.10 |    1.0% |    6,231,810.00 |    4,412,674.00 |  1.412 |      0.02 | `ComplexMemPool`
      |         100 |        4,022,774.00 |              248.58 |    0.6% |   16,544,406.00 |   11,889,535.00 |  1.392 |      0.04 | `ComplexMemPool`
      |         200 |       15,390,986.00 |               64.97 |    0.2% |   63,904,254.00 |   47,731,705.00 |  1.339 |      0.17 | `ComplexMemPool`
      |         400 |       69,394,711.00 |               14.41 |    0.1% |  272,602,461.00 |  219,014,691.00 |  1.245 |      0.76 | `ComplexMemPool`
      |         600 |      168,977,165.00 |                5.92 |    0.1% |  639,108,082.00 |  535,316,887.00 |  1.194 |      1.86 | `ComplexMemPool`
      |         800 |      310,109,077.00 |                3.22 |    0.1% |1,149,134,246.00 |  984,620,812.00 |  1.167 |      3.41 | `ComplexMemPool`
    
      |   coefficient |   err% | complexity
      |--------------:|-------:|------------
      |   4.78486e-07 |   4.5% | O(n^2)
      |   6.38557e-10 |  21.7% | O(n^3)
      |   3.42338e-05 |  38.0% | O(n log n)
      |   0.000313914 |  46.9% | O(n)
      |     0.0129823 | 114.4% | O(log n)
      |     0.0815055 | 133.8% | O(1)
    
      The best fitting curve is O(n^2), so the algorithm seems to scale
      quadratic with `childTxs` in the range 25 to 800.
    78c312c983
  64. martinus force-pushed on Jun 13, 2020
  65. martinus commented at 11:03 am on June 13, 2020: contributor

    I’ve finally found the time to rebase my branch with a major update of nanobench, and updated the new benchmarks too. @MarcoFalke, I am pretty sure your instability comes from CPU frequency scaling / turbo mode. For nanobench to produce accurate runtime results it needs a CPU locked to a fixed frequency. Under Linux I print warnings & suggestions if I detect this, but I don’t have such logic in place for other operating systems (yet).

    When I run the benchmark VerifyNestedIfScript a few times I get these results:

    |       ns/op |      op/s | err% |     ins/op |     cyc/op |   IPC |     bra/op | miss% | total | benchmark
    |------------:|----------:|-----:|-----------:|-----------:|------:|-----------:|------:|------:|:----------
    |   63,447.00 | 15,761.19 | 0.9% | 677,018.00 | 201,628.00 | 3.358 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,094.00 | 15,849.37 | 0.8% | 677,018.00 | 201,495.00 | 3.360 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,548.00 | 15,736.14 | 0.7% | 677,018.00 | 201,894.00 | 3.353 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,340.00 | 15,787.81 | 0.9% | 677,018.00 | 201,894.00 | 3.353 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,288.00 | 15,800.78 | 1.0% | 677,018.00 | 201,096.00 | 3.367 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,261.00 | 15,807.53 | 0.9% | 677,018.00 | 201,894.00 | 3.353 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`
    |   63,309.00 | 15,795.54 | 0.7% | 677,018.00 | 202,027.00 | 3.351 | 143,829.00 |  0.2% |  0.00 | `VerifyNestedIfScript`

    So it is a very stable benchmark for me, with locked frequency scaling. When I don’t lock the CPU frequency my results fluctuate too, and the benchmark prints these warnings (I’ve removed the NDEBUG warning):

    Warning, results might be unstable:
    * CPU frequency scaling enabled: CPU 0 between 800.0 and 4,600.0 MHz
    * CPU governor is 'powersave' but should be 'performance'
    * Turbo is enabled, CPU frequency will fluctuate
    
    Recommendations
    * Use 'pyperf system tune' before benchmarking. See https://github.com/vstinner/pyperf
    
  66. DrahtBot removed the label Needs rebase on Jun 13, 2020
  67. JeremyRubin commented at 6:33 pm on June 13, 2020: contributor
    utack 78c312c. @MarcoFalke can you repeat your benchmark after running `pyperf system tune`?
  68. dongcarl commented at 7:37 pm on July 2, 2020: member
    @martinus While going through this in the July 2nd, 2020 meeting, I believe people were wondering what the support is like for non-x86 architectures. Would it fail to compile? Have limited functionality? Or fail at runtime?
  69. laanwj added this to the "Blockers" column in a project

  70. laanwj removed the label Needs Conceptual Review on Jul 2, 2020
  71. laanwj commented at 9:37 pm on July 2, 2020: member

    FWIW, I could compile and run this PR (as merged on master) on RV64. It doesn’t seem there is any compatibility issue as was implied by the travis run.

    Concept ACK.

  72. martinus commented at 7:10 am on July 3, 2020: contributor

    @martinus While going through this in the July 2nd, 2020 meeting, I believe people were wondering what the support is like for non-x86 architectures. Would it fail to compile? Have limited functionality? Or fail at runtime?

    The CPU statistics like instructions, cycles, and branch mispredictions are only available on Linux through perf events. But it should compile on any platform with C++11 support, since then I’m only relying on the std::chrono timers.

    I now have relatively comprehensive documentation of nanobench available here: https://nanobench.ankerl.com/

  73. fjahr commented at 8:51 pm on July 10, 2020: member

    Concept ACK

    So far I have built the PR and run some tests without any problems.

  74. laanwj commented at 1:33 pm on July 30, 2020: member
    ACK 78c312c983255e15fc274de2368a2ec13ce81cbf
  75. fanquake removed this from the "Blockers" column in a project

  76. laanwj merged this on Jul 30, 2020
  77. laanwj closed this on Jul 30, 2020

  78. laanwj commented at 1:45 pm on July 30, 2020: member
    I’ve merged this because there was unanimous agreement that we want the new benchmark framework, and it works as expected here (as well as for @fjahr and @JeremyRubin). If @MarcoFalke’s issue is still a problem we should look into it; please open a GitHub issue for it.
  79. jonatack commented at 2:17 pm on July 30, 2020: member
    👍 per my ACK a few months ago #18011#pullrequestreview-366872789
  80. martinus commented at 5:43 pm on July 30, 2020: contributor
    Thanks for merging! If anyone has any questions or issues with nanobench, please let me know
  81. sidhujag referenced this in commit f60da833be on Jul 31, 2020
  82. hebasto commented at 6:27 pm on July 31, 2020: member
    @martinus Mind looking into #18710 if it has any performance regression on supported platforms?
  83. hebasto commented at 11:04 am on August 14, 2020: member
    @martinus How to conditionally skip a benchmark in nanobench framework (wrt #19710 (comment) and #19710 (comment))?
  84. martinus commented at 11:07 am on August 14, 2020: contributor
    You can use -filter to specify a regular expression for which tests to run
  85. hebasto commented at 11:08 am on August 14, 2020: member

    You can use -filter to specify a regular expression for which tests to run

    I mean in the code, e.g., skip CCheckQueueSpeedPrevectorJob if GetNumCores() < 2

  86. martinus commented at 11:20 am on August 14, 2020: contributor

    You can use -filter to specify a regular expression for which tests to run

    I mean in the code, e.g., skip CCheckQueueSpeedPrevectorJob if GetNumCores() < 2

    Ah, of course

    Before the benchmark is run you can do a check and then simply return, e.g. like so:

    static void CCheckQueueSpeedPrevectorJob(benchmark::Bench& bench)
    {
        if (GetNumCores() < 2) {
            return;
        }
    

    But that needs a little update in bench.cpp, because then the benchmark doesn’t have any results:

            if (!bench.results().empty()) {
                benchmarkResults.push_back(bench.results().back());
            }
    
  87. hebasto commented at 12:26 pm on August 14, 2020: member
    @martinus Thanks! I’ve submitted a commit (ce3e6a7cb21d1aa455513970846e1f70c01472a4) in #19710.
  88. MarcoFalke commented at 9:25 am on August 24, 2020: member

    Not sure if this is caused by this pull, but some benchmarks changed performance quite significantly:

    Is this due to compiler optimizations or something else?

    Funnily the trig dummy bench went up by ~100%: https://codespeed.bitcoinperf.com/timeline/#/?exe=3,4,2,1,5&base=1+23&ben=micro.clang.Trig&env=1&revs=200&equid=off&quarts=on&extr=on

  89. Fabcien referenced this in commit f95373f234 on Feb 15, 2021
  90. PastaPastaPasta referenced this in commit c0fe0715eb on May 1, 2021
  91. kittywhiskers referenced this in commit cd539c56de on Jun 4, 2021
  92. kittywhiskers referenced this in commit c7f24ef868 on Jun 4, 2021
  93. kittywhiskers referenced this in commit 7690d29f56 on Jun 5, 2021
  94. kittywhiskers referenced this in commit 9cdcc24331 on Jun 16, 2021
  95. kittywhiskers referenced this in commit 089e4714d2 on Jun 24, 2021
  96. kittywhiskers referenced this in commit c3b767074d on Jun 25, 2021
  97. kittywhiskers referenced this in commit 3ea50bce72 on Jun 25, 2021
  98. kittywhiskers referenced this in commit 4dc595d234 on Jun 26, 2021
  99. kittywhiskers referenced this in commit c6edf6654f on Jun 27, 2021
  100. kittywhiskers referenced this in commit 5849582e77 on Jun 28, 2021
  101. kittywhiskers referenced this in commit 3c15853960 on Jul 2, 2021
  102. kittywhiskers referenced this in commit 92e05078ae on Jul 4, 2021
  103. kittywhiskers referenced this in commit b5db3c3d65 on Jul 5, 2021
  104. PastaPastaPasta referenced this in commit 6e4099ea67 on Jul 6, 2021
  105. DrahtBot locked this on Feb 15, 2022

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-18 21:12 UTC

This site is hosted by @0xB10C