test: Implicitly sync after generate*, unless opted out

MarcoFalke commented at 11:27 am on October 18, 2021: member

The most frequent failure in functional tests are intermittent races. Fixing such bugs is cumbersome because it involves:

Noticing the failure
Fetching and reading the log to determine the test case that failed
Adding a self.sync_all() where it was forgotten
Spamming out a pr and waiting for review, which is already sparse

Also, writing a linter to catch those is not possible, nor is review effective in finding these bugs prior to merge.

Fix all future intermittent races caused by a missing sync_block call by calling sync_all implicitly after each generate*, unless opted out. This ensures that the code is race-free (with regards to blocks) when the tests pass once, instead of our current approach where the code can never be guaranteed to be race-free.

There are some scripted-diff cleanups (see #22567), but they will be submitted in a follow-up to reduce the conflicts in this pull.

practicalswift commented at 12:28 pm on October 18, 2021: contributor

Concept ACK

DrahtBot added the label Tests on Oct 18, 2021

DrahtBot commented at 8:48 pm on October 18, 2021: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

#23371 (test: MiniWallet: add P2TR support and use it per default by theStack)
#23075 (refactoring: Fee estimation functional test cleanups by darosior)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

MarcoFalke force-pushed on Oct 19, 2021

test: Implicitly sync after generate*, unless opted out facc352648

MarcoFalke force-pushed on Oct 29, 2021

in test/functional/test_framework/test_framework.py:633 in facc352648

626@@ -627,20 +627,27 @@ def join_network(self):
627         self.connect_nodes(1, 2)
628         self.sync_all()
629 
630-    def generate(self, generator, *args, **kwargs):
631+    def no_op(self):
632+        pass
633+
634+    def generate(self, generator, *args, sync_fun=None, **kwargs):

luke-jr commented at 3:46 pm on October 29, 2021:

Suggest making it a required parameter, otherwise there’s a high risk of backports breaking.

MarcoFalke commented at 3:59 pm on October 29, 2021:

I don’t think we’ll backport any major test changes to 22.x (or earlier) at this point, and hopefully this pull is merged before the next major release, so backports to 23.x aren’t an issue.

If reviewers insist that this should be done, I suggest doing it in a follow up as a scripted-diff to make it easy to review and keep the conflicts in this pull low.

hebasto commented at 11:24 am on November 8, 2021: member

Concept ACK.

lsilva01 approved

lsilva01 commented at 3:27 pm on November 8, 2021: contributor

tACK facc352 on Ubuntu 20.04

brunoerg approved

brunoerg commented at 7:48 pm on November 8, 2021: member

tACK facc352648e4fe1ed9b406400b6e4a9d51f30349 on MacOS 11.6

Nice. It solves the failure I was getting when running test_runner. I think there are some other tests you could apply it like rpc_packages.py.

MarcoFalke merged this on Nov 9, 2021

MarcoFalke closed this on Nov 9, 2021

MarcoFalke deleted the branch on Nov 9, 2021

MarcoFalke commented at 9:33 am on November 9, 2021: member

I think there are some other tests you could apply it like rpc_packages.py.

Can you explain this a bit more? Or maybe create a follow-up yourself?

hebasto commented at 9:38 am on November 9, 2021: member

FWIW, was testing this PR, and with the following patch:

 0diff --git a/.cirrus.yml b/.cirrus.yml
 1index f7962e951..93866de06 100644
 2--- a/.cirrus.yml
 3+++ b/.cirrus.yml
 4@@ -164,7 +164,7 @@ task:
 5     - netsh int ipv4 set dynamicport tcp start=1025 num=64511
 6     - netsh int ipv6 set dynamicport tcp start=1025 num=64511
 7     # Exclude feature_dbcrash for now due to timeout
 8-    - python test\functional\test_runner.py --nocleanup --ci --quiet --combinedlogslen=4000 --jobs=4 --timeout-factor=8 --failfast --extended --exclude feature_dbcrash
 9+    - python test\functional\test_runner.py --nocleanup --ci --quiet --combinedlogslen=4000 --jobs=4 --failfast --extended --exclude feature_dbcrash
10 
11 task:
12   name: 'ARM [unit tests, no functional tests] [bullseye]'
13diff --git a/ci/test/00_setup_env.sh b/ci/test/00_setup_env.sh
14index 8a9d808f5..0685d1f0c 100755
15--- a/ci/test/00_setup_env.sh
16+++ b/ci/test/00_setup_env.sh
17@@ -41,7 +41,7 @@ export RUN_SECURITY_TESTS=${RUN_SECURITY_TESTS:-false}
18 # By how much to scale the test_runner timeouts (option --timeout-factor).
19 # This is needed because some ci machines have slow CPU or disk, so sanitizers
20 # might be slow or a reindex might be waiting on disk IO.
21-export TEST_RUNNER_TIMEOUT_FACTOR=${TEST_RUNNER_TIMEOUT_FACTOR:-40}
22+export TEST_RUNNER_TIMEOUT_FACTOR=${TEST_RUNNER_TIMEOUT_FACTOR:-1}
23 export TEST_RUNNER_ENV=${TEST_RUNNER_ENV:-}
24 export RUN_FUZZ_TESTS=${RUN_FUZZ_TESTS:-false}
25 export EXPECTED_TESTS_DURATION_IN_SECONDS=${EXPECTED_TESTS_DURATION_IN_SECONDS:-1000}

all tasks, except for the native Windows one, pass.

In the Windows task two tests fail due timeouts:

0feature_pruning.py                                 | ✖ Failed  | 1381 s
1rpc_misc.py                                        | ✖ Failed  | 63 s

MarcoFalke commented at 9:49 am on November 9, 2021: member

Thanks for testing. Do you have logs for the two timeouts?

hebasto commented at 10:09 am on November 9, 2021: member

Thanks for testing. Do you have logs for the two timeouts?

https://api.cirrus-ci.com/v1/task/4512194625798144/logs/functional_tests.log

MarcoFalke commented at 10:23 am on November 9, 2021: member

The pruning one has a reorg and a call to verifychain, which may take a long time, so this is expected:

0                                   test_framework.authproxy.JSONRPCException: 'verifychain' RPC took longer than 60.000000 seconds. Consider using larger timeout for calls that take longer to return. (-344)

The rpc_misc one looks like an unrelated issue.

hebasto commented at 10:25 am on November 9, 2021: member

The rpc_misc one looks like an unrelated issue.

Could be easy tuned:

0@@ -84,7 +84,7 @@ class RpcMiscTest(BitcoinTestFramework):
1 
2         # Restart the node with indices and wait for them to sync
3         self.restart_node(0, ["-txindex", "-blockfilterindex", "-coinstatsindex"])
4-        self.wait_until(lambda: all(i["synced"] for i in node.getindexinfo().values()))
5+        self.wait_until(lambda: all(i["synced"] for i in node.getindexinfo().values()), timeout=120)
6 
7         # Returns a list of all running indices by default
8         values = {"synced": True, "best_block_height": 200}

MarcoFalke commented at 10:27 am on November 9, 2021: member

The sync threads all exit within a second, so I don’t understand why raising the timeout would fix this issue.

hebasto commented at 11:31 am on November 9, 2021: member

@MarcoFalke

The sync threads all exit within a second…

That is not the case in CI environment. See my demo branch, and its CI log:

02021-11-09T11:28:10.405000Z TestFramework (INFO): Total time taken by `wait_until` call: 52.64929151535034 seconds.

MarcoFalke commented at 11:39 am on November 9, 2021: member

Oh you are right.

Any idea why coinstatsindex takes more than a minute on windows, but finishes in less than a second on linux?

hebasto commented at 12:25 pm on November 9, 2021: member

Any idea why coinstatsindex takes more than a minute on windows, but finishes in less than a second on linux?

Have no idea.

Locally, on Windows 10 Pro 21H1, it takes about 30 seconds.

sidhujag referenced this in commit 26504dee4b on Nov 9, 2021

sidhujag referenced this in commit 6cca5e7f0b on Nov 9, 2021

MarcoFalke referenced this in commit 15d109802a on Nov 15, 2021

DrahtBot locked this on Nov 9, 2022

test: Implicitly sync after generate*, unless opted out #23300

Conflicts