ci: Add Travis check to make sure unit test coverage reports stay deterministic

practicalswift commented at 8:56 am on July 1, 2019: contributor

Add Travis check to make sure unit test coverage reports stay deterministic.

Rationale:

A necessary condition for meaningful line coverage measuring is that the test suite is deterministic in the sense that the set of lines executed at least once is identical between test suite runs.

This PR addresses issue #14343 (MarcoFalke): “coverage reports non-deterministic”:

Our unit tests and functional tests are non-deterministic in the overall execution, but the coverage should not be affected by that. I.e. some functions might be executed in a different order or sometimes skipped, but every line, function and branch should be executed at least once.

This is currently not true, even for serialization errors that should be hit exactly once.

Beside the obvious issue of missing test coverage on some runs, this also makes it impossible to see how test coverage changes between two commits.

Example output in case of line coverage deterministic unit tests:

0[2019-06-30 08:32:59] Measuring coverage, run [#1](/bitcoin-bitcoin/1/) of 3
1[2019-06-30 08:36:38] Measuring coverage, run [#2](/bitcoin-bitcoin/2/) of 3
2[2019-06-30 08:40:15] Measuring coverage, run [#3](/bitcoin-bitcoin/3/) of 3
3
4Coverage test passed: Deterministic coverage across 3 runs.

Example output in case of line coverage non-deterministic unit tests:

 0[2019-06-30 08:32:59] Measuring coverage, run [#1](/bitcoin-bitcoin/1/) of 3
 1[2019-06-30 08:36:38] Measuring coverage, run [#2](/bitcoin-bitcoin/2/) of 3
 2
 3The line coverage is non-deterministic between runs.
 4
 5The test suite must be deterministic in the sense that the set of lines executed at least
 6once must be identical between runs. This is a necessary condition for meaningful coverage
 7measuring.
 8
 9--- gcovr.run-1.txt   2019-01-30 23:14:07.419418694 +0100
10+++ gcovr.run-2.txt   2019-01-30 23:15:57.998811282 +0100
11@@ -471,7 +471,7 @@
12 test/crypto_tests.cpp                        270     270   100%
13 test/cuckoocache_tests.cpp                   142     142   100%
14 test/dbwrapper_tests.cpp                     148     148   100%
15-test/denialofservice_tests.cpp               225     225   100%
16+test/denialofservice_tests.cpp               225     224    99%   363
17 test/descriptor_tests.cpp                    116     116   100%
18 test/fs_tests.cpp                             24       3    12%   14,16-17,19-20,23,25-26,29,31-32,35-36,39,41-42,45-46,49,51-52
19 test/getarg_tests.cpp                        111     111   100%
20@@ -585,5 +585,5 @@
21 zmq/zmqpublishnotifier.h                       5       0     0%   12,31,37,43,49
22 zmq/zmqrpc.cpp                                21       0     0%   16,18,20,22,33-35,38-45,49,52,56,60,62-63
23 ------------------------------------------------------------------------------
24-TOTAL                                      61561   27606    44%
25+TOTAL                                      61561   27605    44%
26 ------------------------------------------------------------------------------

This Travis check uses test_deterministic_coverage.sh which was introduced in #15296.

To make sure test_deterministic_coverage.sh won’t trigger any line coverage non-determinism alarms with the current test suite I’ve performed 8 000 test runs (against 98958c81f5065a5de13699d46995d278ecb6709e) which all gave identical line coverage results.

Note to reviewers: Which would be the most appropriate Travis job to put this on? I’m currently using x86_64 Linux [GOAL: install] [bionic] [uses qt5 dev package instead of depends Qt to speed up build and avoid timeout], but I’m afraid the total run-time of that job is a bit on the high end with this check included. Would it be preferable to add a new job instead of adding to an existing job? Please advice :-)

ci: Test for deterministic coverage in Travis d825f0c472

fanquake added the label Tests on Jul 1, 2019

DrahtBot commented at 9:17 am on July 1, 2019: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#15134 (tests: Switch one of the Travis jobs to an unsigned char environment (-funsigned-char) by practicalswift)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

MarcoFalke commented at 1:35 pm on July 1, 2019: member

I think there is a value to non-deterministic tests, as long as their seed is known (e.g printed to stdout at the beginning of the test, or written to a file on failure, …)

sdaftuar commented at 6:42 pm on July 1, 2019: member

I also think that non-deterministic tests can be valuable; as an example, a non-deterministic functional test (feature_dbcrash.py) found a bug in the non-atomic utxo writes PR (#10148). I doubt I would have found that bug without it.

So if this PR’s intention is to move towards eliminating non-determinism as a tool in our testing, then I tend to concept NACK.

practicalswift renamed this:
~~ci: Add Travis check to catch accidental introduction of non-determinism in unit tests~~
ci: Add Travis check to catch accidental introduction of line coverage non-determinism in unit tests
on Jul 2, 2019

practicalswift renamed this:
~~ci: Add Travis check to catch accidental introduction of line coverage non-determinism in unit tests~~
ci: Add Travis check to make sure unit test coverage reports stay deterministic
on Jul 2, 2019

practicalswift commented at 7:58 am on July 2, 2019: contributor

@MarcoFalke @sdaftuar

It seems like I didn’t properly communicate the goal of this PR :-) Sorry about that.

The goal of this PR is to make sure line coverage-reports are deterministic.

This PR addresses #14343: “coverage reports non-deterministic” (opened by you @MarcoFalke :-))

The problem is described in MarcoFalke’s issue:

Our unit tests and functional tests are non-deterministic in the overall execution, but the coverage should not be affected by that. I.e. some functions might be executed in a different order or sometimes skipped, but every line, function and branch should be executed at least once.

This is currently not true, even for serialization errors that should be hit exactly once.

Beside the obvious issue of missing test coverage on some runs, this also makes it impossible to see how test coverage changes between two commits.

More specifically this PR makes sure line coverage is stable between unit test runs - from the OP of this PR:

A necessary condition for meaningful line coverage measuring is that the test suite is deterministic in the sense that the set of lines executed at least once is identical between test suite runs.

I’ve now updated the PR title to better reflect the goal of this PR. @MarcoFalke @sdaftuar Just to make sure we share the same goal: we all want line coverage unit tests reports to be deterministic, right? :-)

sdaftuar commented at 6:34 pm on July 3, 2019: member

I’m not familiar with the line coverage reports at all and don’t really have a view there. If we merged this PR, then in the future if someone wanted to write a unit test that was non-deterministic, what should they do if this linter fails?

practicalswift commented at 9:06 pm on July 3, 2019: contributor

@sdaftuar If line coverage fluctuation in a test is intentional then one would simply add a suppression to test_deterministic_coverage.sh :-)

This PR is only meant to guard against accidental introduction of new tests which turns the line coverage number into a random process.

More concretely the situation I want to guard against is the following: line coverage is reported as 80,5% on first run, 80,6% on second run, 80,4% on third run, etc :-)

Some context:

#15296 (“tests: Add script checking for deterministic line coverage in unit tests”)
#15324 (“test: Make bloom tests deterministic”)
#15327 (“tests: Make test updatecoins_simulation_test deterministic”)
#14343 (“coverage reports non-deterministic”)

sdaftuar commented at 3:33 pm on July 4, 2019: member

@practicalswift Thanks for clarifying that.

I tend to think that test writers face too many hurdles and knowledge requirements about the repo (and figuring out how to deal with linters that provide unclear benefits when applied as a hard-and-fast rule, like this one would). It’s demoralizing to include a test for something and then have some unrelated script break for spurious reasons that requires learning and investigating something new in order to make travis pass.

So I think I’m still a Concept NACK. I think our concerns about line-coverage, which reflect a goal about the whole project’s test suite, should not provide too much inconvenience for contributors working on a smaller piece.

practicalswift commented at 3:49 pm on July 4, 2019: contributor

@MarcoFalke As the person who filed issue #14343 (“coverage reports non-deterministic”) which this PR tries to address: can you clarify your position?

Personally I kind of took it for granted that we want to know how suggested PR:s changes unit test line coverage. Assumptions are dangerous :-)

If the consensus opinion is that we are okay with the unit test line coverage sometimes being a random process (80,5% on first run, 80,6% on second run, 80,4% on third run, etc.) then this PR is clearly not needed and I’ll close :-)

fanquake commented at 2:30 am on July 5, 2019: member

Assumptions are dangerous :-)

I’m more concerned about the potential for reduced input from high-quality, long term contributors due to the overhead of new scripts, lints, tools etc than I am about occasionally seeing a small variation in coverage output. So I’m in agreement with @sdaftuar here.

MarcoFalke commented at 9:13 pm on July 8, 2019: member

I think it is nice to know whether a new test has stable line coverage, but it non-stable coverage shouldn’t be blocking a merge. Especially, since it might not be trivial to make them stable (c.f. #14343 (comment))

I guess I wouldn’t mind if some of the linter were put behind the allow_failures travis flag.

practicalswift commented at 4:50 pm on July 9, 2019: contributor

@MarcoFalke Using allow_failures sounds like a very good idea. I’ll investigate! :-)

practicalswift closed this on Aug 10, 2019

practicalswift deleted the branch on Apr 10, 2021

DrahtBot locked this on Aug 18, 2022

ci: Add Travis check to make sure unit test coverage reports stay deterministic #16320

Conflicts