test: fix zmq test flakiness, improve speed

theStack commented at 2:05 am on January 26, 2021: member

Fixes #20934 by using the “sync up” method described in #20538 (comment).

After improving robustness with this approach (commits 1-3), there were still some failures, but those were unrelated to zmq: in 3 out of 500 runs, sync_mempool() or sync_blocks() timed out. This can happen because the trickle relay time has no upper bound, so in rare cases it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side effect also gives us a rough 2x speedup for the test.

For further details, also see the explanations in the commit messages.

There is no guarantee that the test is no longer flaky, but it would help if potential reviewers ran the following script locally and reported how many runs failed (feel free to do fewer than 1000 runs, as this takes quite long when run with --valgrind):

#!/bin/sh
OUTPUT_FILE=./zmq_results
echo ===== repeated zmq test ===== > $OUTPUT_FILE

for i in `seq 1000`; do
    echo ------------------------
    echo ----- test run $i -----
    echo ------------------------
    echo --- $i --- >> $OUTPUT_FILE
    ./test/functional/interface_zmq.py --valgrind
    if [ $? -ne 0 ]; then
        echo "FAILED. /o\\" >> $OUTPUT_FILE
    else
        echo "PASSED. \\o/" >> $OUTPUT_FILE
    fi
done

echo Failed test runs:
grep FAILED $OUTPUT_FILE | wc -l

fanquake added the label Tests on Jan 26, 2021

DrahtBot commented at 9:33 am on January 26, 2021: member

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

ajtowns marked this as a draft on Jan 26, 2021

theStack commented at 11:09 am on January 26, 2021: member

Asked to mark this as draft since the test now fails on two cirrus instances. Have to investigate deeper what the problem is and how to properly reproduce it… locally 1000 test runs passed successfully.

MarcoFalke commented at 11:33 am on January 26, 2021: member

You might want to try with --valgrind, which makes races more likely to happen locally

theStack force-pushed on Jan 27, 2021

theStack commented at 7:54 pm on January 27, 2021: member

@MarcoFalke: Thanks, that helped a lot! On master, running via --valgrind leads to a failed test run quite quickly on my machine.

The PR is ready for review now. The problem with my original approach was that most tests need nodes 0 and 1 to have the same tip. So, after the robust “sync up” setup approach of repeatedly generating blocks, node 1 has to catch up. I added a “sync_blocks” parameter for that purpose, so that the block sync can be skipped for the last test, where the chains are already different and synchronization is not possible in a trivial way. Locally, I have now done a few hundred test runs with --valgrind, and all of them passed. I also adapted the PR description to include a script that reviewers can use to test the robustness.

theStack marked this as ready for review on Jan 27, 2021

jonatack commented at 6:10 pm on January 28, 2021: member

Concept ACK, will be great to robustify this test.

So far your script has run the test 25 times without errors for me. The only problem is an unrelated issue that the test runner's --valgrind option raises for me in general (not just on this test); running valgrind test/functional/interface_zmq.py directly works fine.

------------------------
----- test run 25 -----
------------------------
==42527== Memcheck, a memory error detector
==42527== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==42527== Using Valgrind-3.16.0.GIT and LibVEX; rerun with -h for copyright info
==42527== Command: test/functional/interface_zmq.py
==42527==
2021-01-28T18:10:51.208000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_i65ciqkh
2021-01-28T18:10:54.197000Z TestFramework (INFO): Generate 5 blocks (and 5 coinbase txes)
2021-01-28T18:10:55.327000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:56.502000Z TestFramework (INFO): Test the getzmqnotifications RPC
2021-01-28T18:10:56.505000Z TestFramework (INFO): Testing 'sequence' publisher
2021-01-28T18:10:57.900000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:59.026000Z TestFramework (INFO): Testing sequence notifications with mempool sequence values
2021-01-28T18:10:59.027000Z TestFramework (INFO): Testing RBF notification
2021-01-28T18:11:26.168000Z TestFramework (INFO): Testing reorg notifications
2021-01-28T18:11:29.256000Z TestFramework (INFO): Evict mempool transaction by block conflict
2021-01-28T18:11:30.558000Z TestFramework (INFO): Testing 'mempool sync' usage of sequence notifier
2021-01-28T18:11:53.212000Z TestFramework (INFO): Stopping nodes
2021-01-28T18:11:53.469000Z TestFramework (INFO): Cleaning up /tmp/bitcoin_func_test_i65ciqkh on exit
2021-01-28T18:11:53.470000Z TestFramework (INFO): Tests successful
------------------------
----- test run 26 -----
------------------------

------------------------
----- test run 1 -----
------------------------
2021-01-28T17:22:07.128000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_awm_44bl
2021-01-28T17:22:34.997000Z TestFramework (ERROR): Unexpected exception caught during testing
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 108, in _request
    return self._get_response()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 168, in _get_response
    http_response = self.__conn.getresponse()
  File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.9/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.9/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 125, in main
    self.setup()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 262, in setup
    self.setup_network()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 356, in setup_network
    self.setup_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 383, in setup_nodes
    self.import_deterministic_coinbase_privkeys()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 400, in import_deterministic_coinbase_privkeys
    self.init_wallet(i)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 407, in init_wallet
    n.createwallet(wallet_name=wallet_name, descriptors=self.options.descriptors, load_on_startup=True)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 676, in createwallet
    return self.__getattr__('createwallet')(wallet_name, disable_private_keys, blank, passphrase, avoid_reuse, descriptors, load_on_startup)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 113, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "/usr/lib/python3.9/http/client.py", line 921, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/usr/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
2021-01-28T17:22:35.049000Z TestFramework (INFO): Stopping nodes
2021-01-28T17:22:35.050000Z TestFramework.node0 (ERROR): Unable to stop node.
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 320, in stop_node
    self.stop(wait=wait)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 107, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1266, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.9/http/client.py", line 1092, in putrequest
    raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/./test/functional/interface_zmq.py", line 529, in <module>
    ZMQTest().main()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 149, in main
    exit_code = self.shutdown()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 278, in shutdown
    self.stop_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 525, in stop_nodes
    node.stop_node(wait=wait, wait_until_stopped=False)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 334, in stop_node
    raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr ==37264== Thread 10 b-httpworker.3:
==37264== Conditional jump or move depends on uninitialised value(s)
==37264==    at 0xB0DEC5: __log_putr.isra.2 (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB0F090: __log_put (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB99C1D: __crdel_metasub_log (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAD526A: __db_log_page (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB4068A: __bam_new_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE7A6D: __db_init_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB044ED: __fop_subdb_setup (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE6EB0: __db_open (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE1719: __db_open_pp (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAC1D3A: Db::open(DbTxn*, char const*, char const*, DBTYPE, unsigned int, int) (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0x873618: BerkeleyDatabase::Open() (bdb.cpp:345)
==37264==    by 0x873184: BerkeleyBatch::BerkeleyBatch(BerkeleyDatabase&, bool, bool) (bdb.cpp:308)
==37264==
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:__log_putr.isra.2
   fun:__log_put
   fun:__crdel_metasub_log
   fun:__db_log_page
   fun:__bam_new_subdb
   fun:__db_init_subdb
   fun:__fop_subdb_setup
   fun:__db_open
   fun:__db_open_pp
   fun:_ZN2Db4openEP5DbTxnPKcS3_6DBTYPEji
   fun:_ZN16BerkeleyDatabase4OpenEv
   fun:_ZN13BerkeleyBatchC1ER16BerkeleyDatabasebb
}
==37264==
==37264== Exit program on first error (--exit-on-first-error=yes) !=
[node 1] Cleaning up leftover process
[node 0] Cleaning up leftover process

in test/functional/interface_zmq.py:69 in c02f9a1882 outdated

@@ -62,6 +65,8 @@ def receive_sequence(self):
 class ZMQTest (BitcoinTestFramework):
     def set_test_params(self):
         self.num_nodes = 2
+        # immediate tx relay (node1 -> node0)
+        self.extra_args = [[], ["-whitelist=noban@127.0.0.1"]]

jonatack commented at 6:53 pm on January 28, 2021:

c02f9a188 maybe general-case it with the existing explanation in wallet_avoidreuse.py

        # This test isn't testing txn relay/timing, so set whitelist on the
        # peers for instant txn relay. This speeds up the test run time 2-3x.
        self.extra_args = [["-whitelist=noban@127.0.0.1"]] * self.num_nodes

theStack commented at 1:29 am on January 29, 2021:

Thanks, I like the clear explanation and used it. My initial reason for not using the whitelist parameter on all nodes was that for node0 they would be overwritten in the setup routine. But simply adding self.extra_args[0] to the extra_args parameter for the restart_node call also tackles this.
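
For illustration, the argument handling described above could look roughly like the sketch below. This is not the code from the PR: build_restart_args is a hypothetical helper and the -zmqpubhashblock address is only an example value. It shows the same whitelist setting being assigned to every node and then carried over into the argument list for a node restart, so that the restart does not drop it.

```python
# Sketch only: build_restart_args is a hypothetical helper, not part of
# the functional test framework.
WHITELIST_ARG = "-whitelist=noban@127.0.0.1"


def build_restart_args(base_args, zmq_args):
    """Return the node's base arguments followed by the zmq arguments.

    Concatenation with + creates a new list, so base_args is not modified.
    """
    return base_args + zmq_args


num_nodes = 2
# Same whitelist setting for every node. Both entries refer to the same
# inner list, which is fine as long as it is never mutated in place.
extra_args = [[WHITELIST_ARG]] * num_nodes

# Arguments for restarting node 0 with one example zmq publisher enabled.
restart_args = build_restart_args(
    extra_args[0], ["-zmqpubhashblock=tcp://127.0.0.1:28332"])
print(restart_args)
```

Because the helper concatenates instead of appending, the shared inner list in extra_args stays untouched across repeated restarts.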

in test/functional/interface_zmq.py:129 in c02f9a1882 outdated

+            sub.socket.set(zmq.RCVTIMEO, recv_timeout*1000)
+
+        if sync_blocks:
+            self.connect_nodes(0, 1)
+            self.sync_blocks()
+            self.disconnect_nodes(0, 1)

jonatack commented at 7:01 pm on January 28, 2021:

a8ddb26150 I didn’t understand why we disconnect here only to connect again on returning to the two callers for which the default sync_blocks=true.

theStack commented at 1:21 am on January 29, 2021:

Agree that this was unnecessarily complicated. I changed the setup routine to always connect nodes 0 and 1 (which is needed anyway for the block sync that is used in every subtest except the last one) and to disconnect in the subtest if necessary.

jonatack commented at 7:49 pm on January 28, 2021: member

Almost-ACK. I was unable to make the test fail with valgrind both here and on master. The test does run ~2x faster on this branch, which is great.

theStack force-pushed on Jan 29, 2021

theStack commented at 1:32 am on January 29, 2021: member

Force-pushed with changes suggested by @jonatack (https://github.com/bitcoin/bitcoin/pull/21008#discussion_r566331714 and #21008 (review)).

in test/functional/interface_zmq.py:91 in 9a9453d35b outdated

@@ -82,23 +88,46 @@ def run_test(self):
 
     # Restart node with the specified zmq notifications enabled, subscribe to
     # all of them and return the corresponding ZMQSubscriber objects.
-    def setup_zmq_test(self, services, recv_timeout=60, connect_nodes=False):
+    def setup_zmq_test(self, services, recv_timeout=60, sync_blocks=True):

jonatack commented at 3:38 pm on January 29, 2021:

nit, can enforce named args with

    def setup_zmq_test(self, services, *, recv_timeout=60, sync_blocks=True):

or alternatively for all args

@@ -88,7 +88,7 @@ class ZMQTest (BitcoinTestFramework):
 
     # Restart node with the specified zmq notifications enabled, subscribe to
     # all of them and return the corresponding ZMQSubscriber objects.
-    def setup_zmq_test(self, services, recv_timeout=60, sync_blocks=True):
+    def setup_zmq_test(self, *, services, recv_timeout=60, sync_blocks=True):
         subscribers = []
         for topic, address in services:
             socket = self.ctx.socket(zmq.SUB)
@@ -137,7 +137,7 @@ class ZMQTest (BitcoinTestFramework):
         self.restart_node(0, ["-zmqpubrawtx=foo", "-zmqpubhashtx=bar"])
 
         address = 'tcp://127.0.0.1:28332'
-        subs = self.setup_zmq_test([(topic, address) for topic in ["hashblock", "hashtx", "rawblock", "rawtx"]])
+        subs = self.setup_zmq_test(services=[(topic, address) for topic in ["hashblock", "hashtx", "rawblock", "rawtx"]])
 
         hashblock = subs[0]
         hashtx = subs[1]
@@ -212,7 +212,7 @@ class ZMQTest (BitcoinTestFramework):
 
         # Should only notify the tip if a reorg occurs
         hashblock, hashtx = self.setup_zmq_test(
-            [(topic, address) for topic in ["hashblock", "hashtx"]],
+            services=[(topic, address) for topic in ["hashblock", "hashtx"]],
             recv_timeout=2)  # 2 second timeout to check end of notifications
         self.disconnect_nodes(0, 1)
 
@@ -262,7 +262,7 @@ class ZMQTest (BitcoinTestFramework):
         <32-byte hash>A<8-byte LE uint> : Transactionhash added mempool
         """
         self.log.info("Testing 'sequence' publisher")
-        [seq] = self.setup_zmq_test([("sequence", "tcp://127.0.0.1:28333")])
+        [seq] = self.setup_zmq_test(services=[("sequence", "tcp://127.0.0.1:28333")])
         self.disconnect_nodes(0, 1)
 
         # Mempool sequence number starts at 1
@@ -414,7 +414,7 @@ class ZMQTest (BitcoinTestFramework):
             return
 
         self.log.info("Testing 'mempool sync' usage of sequence notifier")
-        [seq] = self.setup_zmq_test([("sequence", "tcp://127.0.0.1:28333")])
+        [seq] = self.setup_zmq_test(services=[("sequence", "tcp://127.0.0.1:28333")])
 
         # In-memory counter, should always start at 1
         next_mempool_seq = self.nodes[0].getrawmempool(mempool_sequence=True)["mempool_sequence"]
@@ -514,7 +514,7 @@ class ZMQTest (BitcoinTestFramework):
 
     def test_multiple_interfaces(self):
         # Set up two subscribers with different addresses
-        subscribers = self.setup_zmq_test([
+        subscribers = self.setup_zmq_test(services=[
             ("hashblock", "tcp://127.0.0.1:28334"),
             ("hashblock", "tcp://127.0.0.1:28335"),
         ], sync_blocks=False)

theStack commented at 11:00 pm on February 9, 2021:

Thanks, I went with the first variant, i.e. enforcing named args for recv_timeout and sync_blocks.
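
As a small illustration of what the bare * in the signature does, here is a stand-alone sketch. The function reuses the parameter names from the diff, but it is a free function with a placeholder body (an assumption for the example), not the method from interface_zmq.py.

```python
def setup_zmq_test(services, *, recv_timeout=60, sync_blocks=True):
    # Placeholder body (assumption): simply echo the arguments back.
    return services, recv_timeout, sync_blocks


services = [("hashblock", "tcp://127.0.0.1:28332")]

# Parameters after the bare * can only be passed by name.
print(setup_zmq_test(services, recv_timeout=2))

# Passing them positionally is rejected at call time.
try:
    setup_zmq_test(services, 2)
except TypeError as err:
    print("rejected:", err)
```

With this variant, services can still be passed positionally, which keeps the existing call sites short while making the two optional flags self-describing.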

in test/functional/interface_zmq.py:525 in 9a9453d35b outdated

@@ -488,7 +517,7 @@ def test_multiple_interfaces(self):
         subscribers = self.setup_zmq_test([
             ("hashblock", "tcp://127.0.0.1:28334"),
             ("hashblock", "tcp://127.0.0.1:28335"),
-        ])
+        ], sync_blocks=False)

jonatack commented at 3:39 pm on January 29, 2021:

nit, maybe add a comment clarifying why sync_blocks must be false (the test hangs without it)

theStack commented at 11:01 pm on February 9, 2021:

Thanks, done.

jonatack commented at 3:41 pm on January 29, 2021: member

ACK 9a9453d35be5e6d24e14f75c911428a5dbbd2b30

Might be good if @instagibbs had a look.

(IDK if he prefers instagibbs or Gregory Sanders for the commit credit)

DrahtBot added the label Needs rebase on Feb 5, 2021

zmq test: dedup message reception handling in ZMQSubscriber 6014d6e1b5

zmq test: accept arbitrary sequence start number in ZMQSubscriber

The ZMQSubscriber reception methods currently assert that the first
received publisher message has a sequence number of zero. In order to
fix the current test flakiness via "syncing up" to nodes in the setup
phase, we have to cope with the situation that messages get lost and the
first actual received message has a sequence number larger than zero.

8666033630
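
A minimal sketch of the idea in this commit, assuming the sequence number arrives as the 4-byte little-endian counter that forms the last part of each zmq notification. SequenceTracker is a hypothetical name for illustration and not the ZMQSubscriber class from the test: the first number seen is accepted as-is, and only the later ones must increase by one.

```python
import struct


class SequenceTracker:
    """Check that sequence numbers increase by one per message, without
    assuming that the first received message has sequence number zero."""

    def __init__(self):
        self.expected = None  # unknown until the first message arrives

    def check(self, seq_bytes):
        # Decode a 4-byte little-endian unsigned integer.
        (seq,) = struct.unpack("<I", seq_bytes)
        if self.expected is not None and seq != self.expected:
            raise ValueError(
                "expected sequence %d, got %d" % (self.expected, seq))
        # Remember the next expected value (wrapping at 32 bits).
        self.expected = (seq + 1) & 0xFFFFFFFF
        return seq


tracker = SequenceTracker()
print(tracker.check(struct.pack("<I", 5)))  # first message may start anywhere
print(tracker.check(struct.pack("<I", 6)))  # must follow directly
```

A gap after the first message still raises an error, so lost notifications during the actual test are detected as before.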

zmq test: fix flakiness by using more robust sync method

After connecting the subscriber sockets to the node, there is no
guarantee that the node's zmq publisher interfaces are ready yet, which
means that potentially the first expected notification messages could
get lost and the test fails. Currently this is handled by just waiting
for a short period of time (200ms), which works most of the time but is
still problematic, as in some rare cases the setup time takes much
longer, even in the range of multiple seconds.

The solution in this commit approaches the problem by using a more
robust method of syncing up, originally proposed by instagibbs:
    1. Generate a block on the node
    2. Try to receive a notification on all subscribers
    3. If all subscribers get a message within the timeout (1 second),
       we are done, otherwise repeat starting from step 1

5c6546362d
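
The three steps above can be sketched as follows. This is a simplified model under stated assumptions, not the code from the commit: queue.Queue objects stand in for zmq SUB sockets with a receive timeout, generate_block is a hypothetical callable standing in for block generation on the node, and max_rounds is an added safety limit that the commit message does not mention.

```python
import queue


def sync_up(generate_block, subscribers, timeout=1.0, max_rounds=30):
    """Repeat 'generate a block, poll all subscribers' until every
    subscriber receives a message in the same round.

    Returns the number of rounds that were needed.
    """
    for round_no in range(1, max_rounds + 1):
        generate_block()                      # step 1
        recv_failed = False
        for sub in subscribers:               # step 2
            try:
                sub.get(timeout=timeout)
            except queue.Empty:
                recv_failed = True
        if not recv_failed:                   # step 3
            return round_no
    raise TimeoutError("subscribers did not become ready")


# Demo: a fake publisher that only starts delivering on the third block.
subs = [queue.Queue(), queue.Queue()]
blocks = {"count": 0}


def fake_generate_block():
    blocks["count"] += 1
    if blocks["count"] >= 3:
        for sub in subs:
            sub.put(b"hashblock")


rounds_needed = sync_up(fake_generate_block, subs, timeout=0.01)
print(rounds_needed)
```

Since notifications sent before a publisher is ready are simply lost, the subscribers may end up seeing a first sequence number larger than zero, which is why the previous commit relaxes that assertion.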

zmq test: speedup test by whitelisting peers (immediate tx relay)

Speeds up the zmq test roughly by a factor of 2x (~20 sec. instead of
~40 sec.) and also avoids timeouts on the synchronization methods
(sync_mempool() / sync_blocks()) that happened with a slight chance.
This is due to the fact that there is no upper bound on the trickle
relay time, so even the default of 60s is sometimes too low. Fixed by
enabling immediate tx relay on node1.

ef21fb7313

theStack force-pushed on Feb 9, 2021

theStack commented at 11:02 pm on February 9, 2021: member

Force-pushed with a rebase on master and suggestions by jonatack (https://github.com/bitcoin/bitcoin/pull/21008#discussion_r566907134 and #21008 (review)).

DrahtBot removed the label Needs rebase on Feb 9, 2021

jonatack commented at 4:18 pm on February 10, 2021: member

Light ACK ef21fb7313005a8a2d4f03fb4056f1f66c1b04f0 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can’t vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.

Thanks for adding the comment. It would be good for the tests to not be order-dependent.

MarcoFalke merged this on Feb 16, 2021

MarcoFalke closed this on Feb 16, 2021

sidhujag referenced this in commit 31ef542332 on Feb 16, 2021

MarcoFalke referenced this in commit cfce346508 on Mar 2, 2021

Fabcien referenced this in commit 30b874af38 on Nov 30, 2021

Fabcien referenced this in commit 320e98a7e4 on Nov 30, 2021

Fabcien referenced this in commit b98d91efd3 on Nov 30, 2021

DrahtBot locked this on Aug 16, 2022

test: fix zmq test flakiness, improve speed #21008
