Just opening this issue as a brainstorming topic.
I don’t believe we have any systematic way to track the false positive rate of our various CI tools (whether broken down by platform or by test), but developer time wasted on false positives is something we should generally want to minimize.
There are a couple of different sources of false positives in our tests. The main one that I think most of us are keenly aware of is flakiness in the platforms we use (slow VMs or hardware, transient networking issues, etc.). If we noticed that one platform or another was doing worse over time, we could use that to help prioritize shifting how we use those platforms.
The other source is the tests themselves. In the abstract, I think many of us would agree that if a particular test has a high false positive rate (say, due to sensitivity to random events or fragility in the way it’s written), it should be fixed. But I don’t know that we do a good enough job of keeping track of which tests fail spuriously most often to know when a test needs to be rewritten. Alternatively, if we saw that spurious failures were evenly distributed across our test suite, that could be motivation to rethink the fundamental structure of these tests, and how they interact with bitcoind, to make them more robust.
How important it is (if at all!) to do work like this would be easier to judge if we had some data. I have no idea how one might best go about collecting such data, but if anyone could come up with a way to track CI failures, classify them as true or false positives, and break the results down by platform and test, I think that would be fascinating to look at every few months to see how things are trending.
(If anyone is already collecting something like this, please share your findings!)
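To make this slightly more concrete, here is a minimal sketch of what gathering the per-platform half of that data might look like, assuming the CI jobs in question report their results through the GitHub Actions API (other CI providers would need their own queries). The repo name and the treatment of each job name as a "platform" are just illustrative assumptions on my part, and classifying each failure as a true or false positive would still be a separate, likely manual, step:

```python
#!/usr/bin/env python3
"""Rough sketch: tally failed CI jobs per job name (used here as a stand-in for
"platform") across recent completed workflow runs, via the GitHub Actions API.
This only counts failures; it does not decide which ones were spurious."""
import json
import urllib.request
from collections import Counter

REPO = "bitcoin/bitcoin"  # illustrative; point at whichever repo/CI is of interest
API = f"https://api.github.com/repos/{REPO}/actions"


def get_json(url):
    # Unauthenticated requests are heavily rate-limited; add an
    # "Authorization: Bearer <token>" header here for anything beyond a quick look.
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    failures = Counter()
    runs = get_json(f"{API}/runs?status=completed&per_page=50")["workflow_runs"]
    for run in runs:
        if run["conclusion"] != "failure":
            continue
        jobs = get_json(f"{API}/runs/{run['id']}/jobs?per_page=100")["jobs"]
        for job in jobs:
            if job["conclusion"] == "failure":
                # Job name roughly corresponds to a platform/configuration.
                failures[job["name"]] += 1
    for name, count in failures.most_common():
        print(f"{count:4d}  {name}")


if __name__ == "__main__":
    main()
```

The per-test breakdown would presumably be messier, since it likely means pulling test names out of job logs rather than relying on job-level metadata, and distinguishing true from false positives probably still requires a human looking at each failure.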