bench: Remove various extraneous benchmarks #31153

pull dergoegge wants to merge 2 commits into bitcoin:master from dergoegge:2024-10-nuke-benchmarks changing 11 files +35 −654
  1. dergoegge commented at 2:27 pm on October 25, 2024: member

    Measuring performance of components that do not meaningfully impact the performance of critical aspects of the system (e.g. block validation or mining) seems extraneous.

    Maintaining extraneous benchmark tests incentivizes non-impactful improvements. This frequently comes up with new contributors and they can’t be blamed for thinking that e.g. hex parsing should be made faster if a benchmark test exists.

  2. DrahtBot commented at 2:27 pm on October 25, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/31153.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    Concept ACK stickies-v, fanquake, brunoerg, mzumsande
    Approach NACK davidgumberg

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30987 (Don’t zero-after-free DataStream: Faster IBD on some configurations by davidgumberg)
    • #30116 (p2p: Fill reconciliation sets (Erlay) attempt 2 by sr-gi)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Tests on Oct 25, 2024
  4. stickies-v commented at 3:04 pm on October 25, 2024: contributor
    Concept ACK. Perhaps also worth updating doc/benchmarking.md with a brief section on which benchmarks are typically useful to add, and which benchmarks should rather be ephemeral (e.g. provided on a separate branch, supplementary to a PR)?
  5. in src/bench/logging.cpp:1 in cda9d1c9a6 outdated
    -1@@ -1,64 +0,0 @@
    0-// Copyright (c) 2020-2022 The Bitcoin Core developers
    


    jonatack commented at 3:24 pm on October 25, 2024:
    Not sure about removing the logging benchmarks. Logging can be very high frequency and is also used for logging lock contention. Even if that is more likely to occur for developers, it may be useful or valuable to be able to track this.
  6. jonatack commented at 3:34 pm on October 25, 2024: member

    Not sure about some of these. People can be asked to add benchmarks for performance improvements or to verify that a change to existing code doesn’t degrade performance, and having existing benchmarks to refer to may make that easier. Seems odd to throw out work like #22284.

    Are any of the benchmarks called by outside sites that track their results over time?

  7. fanquake commented at 3:35 pm on October 25, 2024: member
    Concept ACK - agree with stickies-v re updating the doc.
  8. brunoerg commented at 3:59 pm on October 25, 2024: contributor
    Concept ACK
  9. instagibbs commented at 4:26 pm on October 25, 2024: member

    Perhaps also worth updating doc/benchmarking.md with a brief section on which benchmarks are typically useful to add, and which benchmarks should rather be ephemeral (e.g. provided on a separate branch, supplementary to a PR)?

    I think this documentation update would be helpful to get in even without this PR.

  10. dergoegge commented at 4:34 pm on October 25, 2024: member

    People can be asked to add benchmarks for performance improvements or to verify that a change to existing code doesn’t degrade performance

    Imo, performance improvements (and accompanying benchmarks) should only be made to code that makes critical things slower than they need to be. Otherwise we’re just improving performance for the sake of doing it, which is not something we have the resources for. We should aim our resources with intention, and the benchmarks can act as the scope.

    For example, should we spend our time making hex parsing 500x faster or maybe shave 10% from IBD? I’m arguing for scoping our tests to the latter (and other critical things e.g. mining) so that we don’t get distracted.

    Are any of the benchmarks called by outside sites that track their results over time?

    https://corecheck.dev sort of does but it’s quite noisy.

  11. dergoegge commented at 4:35 pm on October 25, 2024: member
    Will add the doc changes next week
  12. maflcko commented at 8:56 am on October 28, 2024: member

    Hex parsing is used in mining via DecodeHexBlk, no?

    Also, some of the benchmarks do not exist to be optimized, but merely to give a feeling and to avoid a regression where an algorithm is swapped to go from O(n) to O(n^2). So in a sense, they are a bit of a unit test.

    In any case, updating the docs should be uncontroversial and useful going forward.

  13. jarolrod commented at 9:11 am on October 29, 2024: member

    but merely to give a feeling and to avoid a regression where an algorithm is swapped to go from O(n) to O(n^2). So in a sense, they are a bit of a unit test.

    This is a good reason to keep.

    I’m arguing for scoping our tests to the latter (and other critical things e.g. mining) so that we don’t get distracted.

    And this is an important statement.

    We don’t necessarily need to throw away work that has previously been a useful metric in issues (https://github.com/bitcoin/bitcoin/issues/17501) to achieve the goal of new benchmark contributions being focused on specific areas.

  14. dergoegge commented at 10:02 am on October 29, 2024: member

    throw away work that has previously been a useful metric in issues (https://github.com/bitcoin/bitcoin/issues/17501)

    The benchmarks weren’t mentioned in the issue you linked or the PR that fixed the issue?

  15. jarolrod commented at 10:20 am on October 29, 2024: member

    throw away work that has previously been a useful metric in issues (#17501)

    The benchmarks weren’t mentioned in the issue you linked or the PR that fixed the issue?

    I said it’s a useful metric, as in given that context, would it not be nice to have the benchmark?

  16. dergoegge commented at 10:29 am on October 29, 2024: member

    as in given that context

    I see. No I don’t think they would have been useful in that context, because all the base58 benchmarks run with inputs of the same size, so they couldn’t have found that particular issue and therefore were also not useful in verifying the fix.

    practicalswift mentioned a fuzz test in the PR, which is likely how he found the issue (speculating).

  17. jarolrod commented at 10:33 am on October 29, 2024: member

    as in given that context

    I see. No I don’t think they would have been useful in that context, because all the base58 benchmarks run with inputs of the same size, so they couldn’t have found that particular issue and therefore were also not useful in verifying the fix.

    practicalswift mentioned a fuzz test in the PR, which is likely how he found the issue (speculating).

    I’m not intending to argue, and as I said, your main point in the PR is valid.

    If a bench didn’t exist at the time of the issue, I imagine a review comment would have mentioned it would be nice to have, as a metric. I don’t think that’s controversial.

  18. dergoegge commented at 3:34 pm on October 30, 2024: member
    Added the doc commit, ready for further review
  19. in src/bench/parse_hex.cpp:30 in fcfb1d862a outdated
    24-static void HexParse(benchmark::Bench& bench)
    25-{
    26-    auto data = generateHexString(130); // Generates 678B0EDA0A1FD30904D5A65E3568DB82DB2D918B0AD8DEA18A63FECCB877D07CAD1495C7157584D877420EF38B8DA473A6348B4F51811AC13C786B962BEE5668F9 by default
    27-
    28-    bench.batch(data.size()).unit("base16").run([&] {
    29-        auto result = TryParseHex(data);
    


    maflcko commented at 3:40 pm on October 30, 2024:
    Hex parsing is used in mining via DecodeHexBlk, no?

    dergoegge commented at 4:10 pm on October 30, 2024:
    Does that meaningfully impact block submission? If so, it seems like DecodeHexBlk should be the target for the benchmark (or submitblock as a whole)

    maflcko commented at 12:03 pm on October 31, 2024:

    I see your point, but it seems a bit arbitrary to remove this one and not ExpandDescriptor, which should also be an end-to-end target, no?

    Maybe it would help review if you created a list of all benchmarks, and then added a reason why you kept it, or why you removed it for each one. Otherwise, it may be harder for reviewers to follow and see how the change is self-consistent?


    dergoegge commented at 2:58 pm on November 4, 2024:
    Thanks, removed the ExpandDescriptor bench as well. Will comment on how I picked the benches soon.
  20. davidgumberg commented at 6:17 pm on October 30, 2024: contributor

    Concept ACK, Approach NACK

    I think there should be a burden on PRs that claim to improve performance to demonstrate a user-impacting scenario where the improvements are relevant, and I agree that having benchmarks that test workloads that will never impact users leads to PRs being opened and review being spent on things that don’t deserve attention.

    I think a similar burden exists for this PR to demonstrate or at least describe a rationale for why these removed benchmarks don’t test behavior that would impact users.

  21. in doc/benchmarking.md:66 in fcfb1d862a outdated
    67+- Block propagation (Cost: slow block propagation may increase the rate of
    68+  orphaned blocks)
    69+
    70+Pull requests that improve the performance of components covered by a benchmark
    71+are valuable. Adding new benchmarks is valuable if the component being tested
    72+affects critical functions mentioned above.
    


    maflcko commented at 10:36 am on October 31, 2024:

    Maybe add a note to say that performance optimizations of micro-benchmarks are not valuable, if they do not show a visible end-user end-to-end performance improvement? Also could mention that the contribution of benchmarks that do not cover an end-user end-to-end scenario may not be valuable?

    Even if they do, I think some improvements may still be rejected, if the code bloat or review/maintenance is too much to justify the improvement.


    dergoegge commented at 2:57 pm on November 4, 2024:
    Done. Let me know if you’re not happy with the wording.
  22. maflcko commented at 10:39 am on October 31, 2024: member
    No strong opinion on the removals. I think the doc update is valuable and could be extended. Simply declaring the removed benchmarks as potentially not valuable, when looked at too narrowly, could be less controversial and still achieve the essence of the goal of this pull request.
  23. kevkevinpal commented at 0:50 am on November 1, 2024: contributor

    Not yet sure on this PR; we already have priority_level for benchmarks.

    All the benchmarks being removed seem to be benchmark::PriorityLevel::HIGH; would it make sense to move them down a level instead?

    This frequently comes up with new contributors and they can’t be blamed for thinking that e.g. hex parsing should be made faster if a benchmark test exists.

    I understand this is one of the motivations for this change, but does this happen often for benchmarks that are marked benchmark::PriorityLevel::LOW?

  24. in doc/benchmarking.md:54 in fcfb1d862a outdated
    55-- P2P throughput
    56+
    57+Benchmarks help with monitoring for performance regressions and act as a scope
    58+for future performance improvements. They should cover components that impact
    59+performance critical functions of the system. Functions are performance
    60+critical, if the cost associated with a degradation in perforamnce is high. A
    


    mzumsande commented at 5:43 pm on November 1, 2024:
    typo: perforamnce
  25. in doc/benchmarking.md:72 in fcfb1d862a outdated
    71+are valuable. Adding new benchmarks is valuable if the component being tested
    72+affects critical functions mentioned above.
    73+
    74+Benchmarks are ill-suited for testing of denial of service issues as they are
    75+restricted to the same input set (introducing bias). [Fuzz
    76+tests](/doc/fuzzing.md) are better suited for this purpose, as they are much
    


    mzumsande commented at 5:55 pm on November 1, 2024:
    Not sure I follow this point: under “denial of service” I would understand attacks that overwhelm the target with repeated requests (which are often, but not necessarily, identical) and as a result may just slow it down or make it unresponsive instead of outright crashing it. While I agree that benchmarks don’t really find these, I wouldn’t see fuzz tests designed to find crashes as the best way to test these issues (as opposed to integration tests with some sort of stress)?

    dergoegge commented at 2:12 pm on November 4, 2024:

    Fuzz testing does not only find crashes, it finds slow running inputs as well. Fuzz engines usually terminate an input’s execution if it takes longer than N seconds to run, usually referred to as a “timeout”. See #28812 for examples.

    Depending on what the harness is testing, a slow input might indicate a DoS issue (e.g. infinite loop, quadratic behavior, …).

  26. mzumsande commented at 6:02 pm on November 1, 2024: contributor

    Concept ACK

    I wouldn’t oppose deleting some benchmarks, but I also like the idea of using priorities. That, combined with the doc, should be sufficient to deter micro-optimizations in non-critical code.

  27. dergoegge force-pushed on Nov 4, 2024
  28. dergoegge commented at 3:16 pm on November 4, 2024: member
    I’m not sure if using the priority levels is a good alternative to deleting the benches. Afaict, the priority level was introduced to avoid running slow benchmarks in the CI, as opposed to indicating how “important” they are. As a result, we’re not catching issues with these tests until someone executes them (e.g. #29061 (comment)) and I’d prefer not to expand on that.
  29. DrahtBot added the label Needs rebase on Nov 13, 2024
  30. [bench] Remove various extraneous benchmarks 810c2dc769
  31. [doc] Ammend notes on benchmarking 8b19137a68
  32. dergoegge force-pushed on Nov 13, 2024
  33. DrahtBot removed the label Needs rebase on Nov 13, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-12-03 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me