Increase timeout or remove valgrind CI job? #113

issue: dergoegge opened this issue on April 12, 2023
  1. dergoegge commented at 9:19 am on April 12, 2023: member

    I think there are only ~5 occurrences of the valgrind job passing out of all the merged/closed PRs. They always (afaict) fail due to timing out.

    So maybe we should:

    • Increase the timeout? (unless the job would take way too long anyway)
    • Remove the job?
    • Only run on corpora that were changed in a PR? (a rough sketch of this follows below)
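
    For the third option, a rough sketch of how such a filter might look as a Cirrus task. It assumes corpora live in one directory per fuzz target and that Cirrus exposes the PR base via CIRRUS_BASE_SHA / CIRRUS_BASE_BRANCH; the task name, image, and layout are placeholders, not the repository's current setup:

    ```yml
    # Hypothetical task: only exercise the corpus directories touched by the PR.
    changed_corpora_task:
      container:
        image: ubuntu:22.04
      diff_script:
        # List the top-level directories (assumed: one per fuzz target) changed by the PR.
        - git fetch origin "$CIRRUS_BASE_BRANCH"
        - git diff --name-only "$CIRRUS_BASE_SHA" HEAD | cut -d/ -f1 | sort -u > changed_targets.txt
        - cat changed_targets.txt
      fuzz_script:
        # Placeholder: run the valgrind fuzz job only for the targets listed above.
        - echo "run valgrind fuzzing for the targets in changed_targets.txt"
    ```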
  2. fanquake commented at 9:22 am on April 12, 2023: member
    We are currently set at the limit of 120m for non-credit tasks, see https://cirrus-ci.org/faq/#instance-timed-out. Maybe we can use some compute credits.
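
    For reference, the two relevant knobs in a `.cirrus.yml` task; both `timeout_in` and `use_compute_credits` are documented Cirrus fields, while the task name and the exact numbers here are only placeholders:

    ```yml
    valgrind_fuzz_task:
      # 60m is the default; 120m is the cap for tasks without compute credits.
      timeout_in: 240m
      # Opt this task into compute credits so a longer timeout is allowed.
      use_compute_credits: true
    ```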
  3. maflcko commented at 9:24 am on April 12, 2023: contributor

    I run it locally, and it takes 12 CPU days, so I am not sure if this is good to spend credits on.

    Is there any value in running it remotely? If yes, we could spin up a remote (dedicated) runner for it.

  4. maflcko commented at 9:25 am on April 12, 2023: contributor
    (12 CPU days on a slow CPU running valgrind in qemu :sweat_smile: )
  5. fanquake commented at 12:52 pm on April 19, 2023: member
    Relevant CI job now that credits PR has been merged: https://cirrus-ci.com/task/6079733960540160
  6. fanquake commented at 1:04 pm on April 19, 2023: member
    Looks like it is working: credits
  7. fanquake commented at 9:27 am on April 20, 2023: member

    Relevant CI job now that credits PR has been merged: https://cirrus-ci.com/task/6079733960540160

    Ultimately timed out after 8 hours. However, looking at the job, it also didn’t use any compute credits?

  8. maflcko commented at 9:30 am on April 20, 2023: contributor

    I calculated the cost here: #114 (comment)

    If you want to see it in Cirrus CI, you may have to refresh the page.

  9. fanquake commented at 9:33 am on April 20, 2023: member
    Don’t know. I have refreshed multiple times, and am logged in. Nothing showing up.
  10. maflcko commented at 9:01 am on May 2, 2023: contributor
    I think this can be closed. Not sure what else can be done here?
  11. dergoegge commented at 9:37 am on May 2, 2023: member
    Hm, the job still times out even with the credits, so maybe we should remove it from CI?
  12. maflcko commented at 10:17 am on May 2, 2023: contributor
    All fuzz targets are run, except for a few slow inputs for one remaining target, so it seems better than removing it completely?
  13. fanquake commented at 10:19 am on May 2, 2023: member
    Yea I’m happy for this to remain, even if it never quite completes, if we’re running 95+% of targets.
  14. dergoegge commented at 10:21 am on May 2, 2023: member
    Ok! Should we exclude the really slow targets from running then?
  15. maflcko commented at 10:22 am on May 2, 2023: contributor
    It seems better to run them than not to run them at all?
  16. dergoegge commented at 10:27 am on May 2, 2023: member

    I find the red CI super annoying because it has to be looked at every time, and the slow targets still run in the other jobs (with MSan, for example).

    We could also set up our own remote runner (like you suggested earlier), but is that worth it?

  17. maflcko commented at 10:34 am on May 2, 2023: contributor

    is that worth it?

    At least one person or CI should run all targets in valgrind on a regular basis. I do it, but it would be good to have a backup.

  18. dergoegge commented at 3:17 pm on May 5, 2023: member

    Closing.

    I will set up a machine for the valgrind job sometime soon and then open a PR to no longer run it in CI here.

  19. dergoegge closed this on May 5, 2023

  20. maflcko commented at 3:25 pm on May 5, 2023: contributor
    I still think it is good to have a result that is trivially and publicly visible for anyone with internet access. Maybe the name of the task can be changed to indicate that a timeout can be ignored?
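
    One hedged way to express that in the config: rename the task and mark it as allowed to fail. `allow_failures` is a documented Cirrus field; treating a timeout as an acceptable failure is the assumption being made here:

    ```yml
    valgrind_fuzz_task:
      # Make the expected timeout obvious from the task name in the CI overview.
      name: "valgrind fuzz (may time out, a timeout can be ignored)"
      # Do not mark the whole build red when only this task fails.
      allow_failures: true
    ```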
  21. maflcko commented at 3:11 pm on May 16, 2023: contributor
    I am also thinking about adding a script or CI task to check that a pull only adds files and does not remove any. Such a task would then fail (red) if non-reduced and stale inputs are deleted. So avoiding a red CI at all costs may not be a worthy goal?
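
    A minimal sketch of that additions-only check as a Cirrus task. The use of CIRRUS_BASE_SHA / CIRRUS_BASE_BRANCH and the container image are assumptions; the intent is just to fail when the diff against the PR base contains any deletions:

    ```yml
    additions_only_task:
      container:
        image: ubuntu:22.04
      check_script:
        - git fetch origin "$CIRRUS_BASE_BRANCH"
        # Collect files deleted relative to the base of the PR.
        - git diff --diff-filter=D --name-only "$CIRRUS_BASE_SHA" HEAD | tee deleted_files.txt
        # Fail the task if the list of deleted files is non-empty.
        - test ! -s deleted_files.txt
    ```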
  22. dergoegge commented at 3:27 pm on May 16, 2023: member

    So avoiding a red CI at all costs may not be a worthy goal?

    My main concern with the red CI is that it has to be looked at, since the valgrind job either fails because of some bug or it times out, and we definitely want to catch it when there’s a bug. I know it’s easy to check, but I just find it very annoying. I couldn’t find anything in the Cirrus docs, but maybe there is an option to ignore timeouts?

    Otherwise, I think we can just have one (or more) remote runners that publish the result to GitHub somehow, so we still get the green CI (I think I saw something about Cirrus allowing remote runners). I am happy to set that up at some point, and until then we keep it the way it is now.

    adding a script or CI task to check that a pull only adds files, not removes them. Such a task would then fail (red) if non-reduced and stale inputs are deleted.

    Sounds good and this should only fail on that ~one pruning PR every 6 months.

  23. maflcko commented at 5:50 pm on May 16, 2023: contributor

    Otherwise I think

    Yeah, it is called a persistent worker. See https://github.com/bitcoin/bitcoin/blob/904631e0fc00ac9c8a03d1ce226d071bf88c00db/.cirrus.yml#L17
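
    For reference, the shape of such a task; `persistent_worker` is the documented Cirrus mechanism for self-hosted runners, while the label used here is an arbitrary example:

    ```yml
    valgrind_fuzz_task:
      # Run on a self-hosted machine registered with Cirrus as a persistent
      # worker carrying this label, instead of a hosted container.
      persistent_worker:
        labels:
          type: valgrind
    ```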

    Sounds good and this should only fail on that ~one pruning PR every 6 months.

    https://github.com/bitcoin-core/qa-assets/pull/125

