Tracking and reporting failures and repeated flakiness of functional tests

michaelfolkson commented at 2:10 pm on May 12, 2022: contributor

I’m sure anyone who runs the functional tests regularly will experience occasional failures. Often you run those failed tests again and they pass. Occasionally a functional test will fail repeatedly, which is more concerning. It is difficult to assess whether others are experiencing the same flakiness and/or repeated failures and whether it is worth spending time trying to understand and fix the issue.

In #25030 @MarcoFalke raised that there are some functional tests that have been flaky for a long time now.

This issue is a first attempt to track failures and repeated flakiness contributors experience with particular functional tests. I’m not sure how to organize this: whether there should be a table of the functional tests that can be regularly updated and edited, or whether individual contributors should just add comments below on which functional tests they are experiencing failures/flakiness with.

For now, please comment below if you experience failures/flakiness with a particular functional test, and if you spend any time trying to understand why it failed, feel free to add thoughts on what you think is causing the problem.

Ideally this issue would help identify which functional tests to prioritize to fix but we’ll see if this is useful or not. If it isn’t useful feel free to close.

fanquake commented at 8:03 pm on May 12, 2022: member

> This issue is a first attempt to track failures and repeated flakiness contributors experience with particular functional tests.

I don’t think some sort of meta issue that needs to be continually updated / maintained is going to be an improvement here (I think there is also at least one other effort to track this outside of Core). If people spot test failures they should open issues. If other people see the same thing / can provide more info, they can comment on them. If eventually they can’t be reproduced / disappear, the issue can be closed. Consolidating all of that into a single thread doesn’t seem like it’s going to improve things, and is likely to be quite confusing / noisy.

mzumsande commented at 9:47 am on May 13, 2022: member

I don’t think that tracking is the problem. The current system of having an issue per failed test works, and makes it easy to find flaky tests waiting to be fixed.

The problem is that it’s hard to find the root cause of test failures that you can’t reproduce. And even if you think that you understand it, some of the failures (e.g. issues with the fee estimation algorithm such as #23165 / #21161) are very nontrivial to fix properly. It’s tempting to go for the easy way of just softening the test assumptions a bit to remove the flakiness and be able to forget about it, but the proper approach of understanding why the test failed (and whether the root cause is in the test code or the code under test) can be hard, and I think that’s the bottleneck, not the tracking.

michaelfolkson commented at 2:57 pm on May 13, 2022: contributor

Thanks for the comments. @fanquake:

> I think there is also at least one other effort to track this outside of Core

Yeah Marco referred to it here.

> I keep track of them myself, but there is also a “dashboard” (not public) where some people have access.
>
> One could also look at the red stuff in https://cirrus-ci.com/github/bitcoin/bitcoin/master
>
> I think the best way to keep track of them is by creating an issue here and updating it with an exponential backoff if it still occurs. (After 2 weeks, 4 weeks, 2 months, 4 months …)

Is there a particular reason why the dashboard isn’t public? Is it a security concern or just a haven’t got round to making it public type thing? I can only speak for myself but when I experience failures/flakiness I have no idea if it is just me or a longstanding issue that everyone else is experiencing. It seems to me that having “high-frequent ones (that) remain unfixed for months/years” doesn’t appear to be ideal. But I don’t know which ones those are.
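The exponential backoff schedule quoted above (2 weeks, 4 weeks, 2 months, 4 months …) could be sketched roughly as follows. This is only an illustration, not anything used by the project: the function name `followup_dates` and its parameters are hypothetical, and it assumes the intervals are measured from the date the issue was opened and double each time (2, 4, 8, 16 weeks), treating a month as roughly four weeks.

```python
from datetime import date, timedelta
from typing import List


def followup_dates(opened: date, first_weeks: int = 2, count: int = 4) -> List[date]:
    """Return the dates on which to re-check a flaky-test issue.

    Assumption: offsets are counted from the day the issue was opened and
    double at every step (2, 4, 8, 16 ... weeks).
    """
    return [opened + timedelta(weeks=first_weeks * 2 ** i) for i in range(count)]


if __name__ == "__main__":
    # Example: an issue opened on May 12, 2022.
    for d in followup_dates(date(2022, 5, 12)):
        print(d.isoformat())
```

For an issue opened on May 12, 2022 this prints 2022-05-26, 2022-06-09, 2022-07-07 and 2022-09-01, so an intermittent failure that keeps recurring gets revisited less and less often without being forgotten.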

MarcoFalke commented at 3:14 pm on May 13, 2022: member

If someone wants to maintain a list of failures, I’d find that great. Surely people that fix the issues are needed more, but if a list attracts those, then why not.

You’ll have to ask @adamjonas for the dashboard, but doing the task outside the dashboard should be trivial to do as well.

Basically anything in https://cirrus-ci.com/github/bitcoin/bitcoin/master should not be red (the master branch should not fail tests). There may be a remote network outage when downloading packages, which is unavoidable and the task can be reset (permission might be needed for that). Anything else is likely some kind of bug. For example from today: #25124

The same is true for https://github.com/bitcoin/bitcoin/pulls (anything red is a remote network outage or a bug). Though, obviously the bug might have been introduced in the pull. In that case, pinging the author about the bug is also helpful.

If the bug is (presumed) in master, reporting it is recommended. A small analysis would be ideal. See for example #25128

A reproducer would be even more ideal. See for example https://github.com/bitcoin/bitcoin/issues/25129

fanquake commented at 3:15 pm on May 13, 2022: member

> Is there a particular reason why the dashboard isn’t public? Is it a security concern or just a haven’t got round to making it public type thing?

Pretty sure it’s just using some 3rd party service that requires logging in. There’s no security concern as far as I’m aware. Anyone else could be tracking the same failures using whichever method they prefer.

> when I experience failures/flakiness I have no idea if it is just me or a longstanding issue that everyone else is experiencing.

The simplest solution to that would be searching the open / closed issues. If your issue hasn’t been reported, open a new issue. If it has, you could leave a comment.

jonatack commented at 3:19 pm on May 13, 2022: member

> The simplest solution to that would be searching the open / closed issues. If your issue hasn’t been reported, open a new issue. If it has, you could leave a comment.

Agree, this is what I do when I see a CI error, and then look at https://cirrus-ci.com/github/bitcoin/bitcoin/master and https://github.com/bitcoin/bitcoin/pulls as @MarcoFalke mentioned for any new red ones.

michaelfolkson commented at 4:26 pm on May 13, 2022: contributor

> Surely people that fix the issues are needed more, but if a list attracts those, then why not.

Indeed. If no one is going to attempt to fix the issues, any list or additional access to a dashboard etc. is pointless. Maybe I’m optimistic but there have been a number of new(er) contributors completing MiniWallet tasks on the functional tests. I’d have thought fixing other issues with functional tests would be a natural next step for new contributors if they wanted to continue to contribute and it was obvious which issues needed fixing.

> The simplest solution to that would be searching the open / closed issues. If your issue hasn’t been reported, open a new issue. If it has, you could leave a comment.

Knowing that this (opening issues for test failures) is encouraged is useful too. Until now it wasn’t clear to me that this was encouraged for flakiness.

MarcoFalke commented at 2:08 pm on May 18, 2022: member

> I’d have thought fixing other issues with functional tests would be a natural next step for new contributors if they wanted to continue to contribute and it was obvious which issues needed fixing.

I am happy to tag test races as good first issue (and/or create an issue template for them), but as explained before, finding the root cause and fixing it might be non-trivial sometimes: #25116 (comment). Though, observing the issue and filing it (after checking for duplicates) should be trivial to the extent where a bot could do it.

MarcoFalke added the label Brainstorming on May 18, 2022

MarcoFalke added the label Tests on May 18, 2022

MarcoFalke commented at 7:55 am on May 21, 2022: member

Closing for now, as this issue is open-ended, but feel free to continue discussion. Also, happy to re-open if someone thinks this is useful.

MarcoFalke closed this on May 21, 2022

DrahtBot locked this on May 21, 2023

Tracking and reporting failures and repeated flakiness of functional tests #25116