Just opening this issue as a brainstorming topic.
I don’t believe we have any systematic way to track the false positive rate of our various CI tools (whether broken down by platform or by test), but developer time wasted on false positives is something we should generally want to minimize.
There are a couple of different sources of false positives in our tests. The main one that I think most of us are keenly aware of is flakiness in the platforms we use (slow VMs or hardware, transient networking issues, etc.). If we noticed that one platform or another was doing worse over time, we could use that to help prioritize shifting how we use those platforms.
The other source is the tests themselves. In the abstract, I think many of us would agree that if a particular test has a high false positive rate (say, due to sensitivity to random events or fragility in the way it’s written), it should be fixed. But I don’t know that we do a good enough job of keeping track of which tests fail spuriously most often to know when a test needs to be rewritten. Alternatively, if we saw that spurious failures were evenly distributed across our test suite, that could be motivation to rethink the fundamental structure of these tests, and how they interact with bitcoind, to make them more robust.
How important it is (if at all!) to do work like this would be easier to judge if we had some data. I have no idea how one might best go about collecting such data, but if anyone could come up with a way to track CI failures, classify them as true or false positives, and break the results down by platform and test, I think that would be fascinating to look at every few months to see how things are trending.
(If anyone is already collecting something like this, please share your findings!)
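To make this slightly more concrete, here is a minimal sketch of what gathering the per-platform half of that data might look like, assuming the CI jobs in question report their results through the GitHub Actions API (other CI providers would need their own queries). The repo name and the treatment of each job name as a "platform" are just illustrative assumptions on my part, and classifying each failure as a true or false positive would still be a separate, likely manual, step:

```python
#!/usr/bin/env python3
"""Rough sketch: tally failed CI jobs per job name (used here as a stand-in for
"platform") across recent completed workflow runs, via the GitHub Actions API.
This only counts failures; it does not decide which ones were spurious."""
import json
import urllib.request
from collections import Counter

REPO = "bitcoin/bitcoin"  # illustrative; point at whichever repo/CI is of interest
API = f"https://api.github.com/repos/{REPO}/actions"


def get_json(url):
    # Unauthenticated requests are heavily rate-limited; add an
    # "Authorization: Bearer <token>" header here for anything beyond a quick look.
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    failures = Counter()
    runs = get_json(f"{API}/runs?status=completed&per_page=50")["workflow_runs"]
    for run in runs:
        if run["conclusion"] != "failure":
            continue
        jobs = get_json(f"{API}/runs/{run['id']}/jobs?per_page=100")["jobs"]
        for job in jobs:
            if job["conclusion"] == "failure":
                # Job name roughly corresponds to a platform/configuration.
                failures[job["name"]] += 1
    for name, count in failures.most_common():
        print(f"{count:4d}  {name}")


if __name__ == "__main__":
    main()
```

The per-test breakdown would presumably be messier, since it likely means pulling test names out of job logs rather than relying on job-level metadata, and distinguishing true from false positives probably still requires a human looking at each failure.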