test: re-bucket long-running tests

willcl-ark commented at 2:02 pm on September 12, 2024: member

Re-bucket:

p2p_node_network_limited -v*transport
feature_assume_utxo

On CI runners these tests are taking longer than their current bucket suggests, often being among the last to finish.

Re-bucket them to improve CI efficiency.

DrahtBot commented at 2:02 pm on September 12, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage

For detailed information about the code coverage, see the test coverage report.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	maflcko

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

DrahtBot added the label Tests on Sep 12, 2024

DrahtBot added the label CI failed on Sep 13, 2024

theStack commented at 11:58 am on September 13, 2024: contributor

The re-bucketed test in the commit doesn’t match the one in the commit/PR title (p2p_timeouts vs. p2p_node_network_limited), so I suppose one of the two has to be adapted.

maflcko commented at 12:30 pm on September 13, 2024: member

Locally these tests take about 20 seconds and appear in the correct bucket.

I think the sorting isn’t really based on the “correct bucket”, but rather the inverse overall duration, the buckets are just a guideline/reminder. That is, as long as the longest running test is started first, everything is fine.

Did you measure if this actually makes a meaningful difference?

maflcko commented at 12:33 pm on September 13, 2024: member

Also, could rebase for a fresh CI? At least for me the CI timeouts went away earlier this night when trying to reproduce, but I don’t know if they are really gone, and whether the issue was inside this repo, or on the CI runners, or on the CI backend software, or on the Cirrus/GHA integration.

willcl-ark force-pushed on Sep 16, 2024

willcl-ark commented at 8:05 am on September 16, 2024: member

The re-bucketed test in the commit doesn’t match the one in the commit/PR title (p2p_timeouts vs. p2p_node_network_limited), so I suppose one of the two has to be adapted.

Ugh, I cherry-picked the wrong tests from my working branch. The title is correct, tests in the commit here updated.

I think the sorting isn’t really based on the “correct bucket”, but rather the inverse overall duration, the buckets are just a guideline/reminder. That is, as long as the longest running test is started first, everything is fine.

Did you measure if this actually makes a meaningful difference?

Not sure how to measure this, but I have been frequently seeing these tests taking the longest to run out of the suite, see e.g. https://github.com/bitcoin/bitcoin/actions/runs/10833854937/job/30163805745?pr=30866#step:12:64 and (same PR another job): https://github.com/bitcoin/bitcoin/actions/runs/10833854937/job/30163805709?pr=30866#step:7:5343

It’s only going to make a difference of some number of seconds, but possibly up to 2 minutes, even in the best case. Whether that can be considered meaningful I will leave for reviewers to determine. If we launch this test right at the end, and other long-running ones finish, it could be adding ~240 secs onto the total test suite time each run.

maflcko commented at 8:08 am on September 16, 2024: member

Thanks, the explanation is correct and the two links with data are useful to support this change.

review ACK 45663cd02cfb99c0bf91828b33a0611bb223497e

maflcko commented at 8:14 am on September 16, 2024: member

Nit: May be good to reword the pull request description to say that this better reflects the reality in CI runs, and may even speed them up. The mention of “CI timeouts” could be removed, because adding or removing 200 seconds from a run shouldn’t be a fix (or cause) of a timeout. Any run that takes longer than half of the overall timeout value is a warning that should be investigated and fixed, because the CI timeout is really only there to catch cases where there is a true timeout (zombie process, infinite loop, infinite sleep, …).

But just a doc-nit, up to you. The change looks good either way.

willcl-ark commented at 8:20 am on September 16, 2024: member

Nit: May be good to reword the pull request description to say that this better reflects the reality in CI runs, and may even speed them up. The mention of “CI timeouts” could be removed, because adding or removing 200 seconds from a run shouldn’t be a fix (or cause) of a timeout. Any run that takes longer than half of the overall timeout value is a warning that should be investigated and fixed, because the CI timeout is really only there to catch cases where there is a true timeout (zombie process, infinite loop, infinite sleep, …).

But just a doc-nit, up to you. The change looks good either way.

Thanks, re-worded the description :)

DrahtBot removed the label CI failed on Sep 16, 2024

maflcko commented at 9:30 am on September 16, 2024: member

Should feature_assumeutxo also be moved up a bit? According to https://corecheck.dev/tests it is one of the top-20 slowest tests. Also, according to the CI runs in this pull (Asan + macOS) it is in the tail.

test: re-bucket long-running tests

- p2p_node_network_limited -v*transport
- feature_assume_utxo

On CI runners these tests are taking longer than their current bucket
suggests, often being among the last to finish.

Re-bucket them to improve CI efficiency.

f5a2000579

willcl-ark force-pushed on Sep 23, 2024

willcl-ark renamed this:
~~test: re-bucket p2p_node_network_limited~~
test: re-bucket long-running tests
on Sep 23, 2024

willcl-ark commented at 2:25 pm on September 23, 2024: member

Should feature_assumeutxo also be moved up a bit? According to corecheck.dev/tests it is one of the top-20 slowest tests. Also, according to the CI runs in this pull (Asan + macOS) it is in the tail.

Added feature_assumeutxo to the < 2m bucket

1p1c is the other one I have noticed cropping up as long-running, but might leave it for now.

maflcko commented at 9:55 am on September 27, 2024: member

feature_config_args.py could be moved as well, but at some point it doesn’t matter too much. I wonder if this even has an effect at all.

maflcko commented at 9:56 am on September 27, 2024: member

review ACK f5a2000579b140a1f51fc433706c775ca560c62c

Haven’t checked what effect this has, but it sounds plausible and harmless.

fanquake merged this on Sep 27, 2024

fanquake closed this on Sep 27, 2024

test: re-bucket long-running tests #30879

Code Coverage

Reviews