test: fix zmq test flakiness, improve speed #21008

theStack commented at 2:05 AM on January 26, 2021: contributor

Fixes #20934 by using the "sync up" method described in #20538 (comment).

After improving robustness with this approach (commits 1-3), it turned out that there were still some failures, but those were unrelated to zmq: out of 500 runs, sync_mempool() or sync_blocks() timed out 3 times, which can happen because the trickle relay time has no upper bound, so in rare cases it takes longer than 60s. This is fixed by enabling immediate tx relay on node1 (commit 4), which as a nice side effect also gives us a rough 2x speedup for the test.

For further details, also see the explanations in the commit messages.

There is no guarantee that the test is no longer flaky, so it would help if potential reviewers ran the following script locally and reported how many runs failed (feel free to do fewer than 1000 runs, as this takes quite long when run with --valgrind):

#!/bin/sh
OUTPUT_FILE=./zmq_results
echo ===== repeated zmq test ===== > $OUTPUT_FILE

for i in `seq 1000`; do
    echo ------------------------
    echo ----- test run $i -----
    echo ------------------------
    echo --- $i --- >> $OUTPUT_FILE
    ./test/functional/interface_zmq.py --valgrind
    if [ $? -ne 0 ]; then
        echo "FAILED. /o\\" >> $OUTPUT_FILE
    else
        echo "PASSED. \\o/" >> $OUTPUT_FILE
    fi
done

echo Failed test runs:
grep FAILED $OUTPUT_FILE | wc -l

fanquake added the label Tests on Jan 26, 2021

DrahtBot commented at 9:33 AM on January 26, 2021: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

No conflicts as of last run.

DrahtBot cross-referenced this on Jan 26, 2021 from issue tests: Run both descriptor and legacy tests within a single test invocation by achow101

ajtowns marked this as a draft on Jan 26, 2021

theStack commented at 11:09 AM on January 26, 2021: contributor

Asked to mark this as draft since the test now fails on two cirrus instances. Have to investigate deeper what the problem is and how to properly reproduce it... locally 1000 test runs passed successfully.

DrahtBot cross-referenced this on Jan 26, 2021 from issue Disable and fix tests for when BDB is not compiled by achow101

MarcoFalke commented at 11:33 AM on January 26, 2021: member

You might want to try with --valgrind, which makes races more likely to happen locally

theStack force-pushed on Jan 27, 2021

theStack commented at 7:54 PM on January 27, 2021: contributor

@MarcoFalke: Thanks, that helped a lot! On master, running via --valgrind leads to a failed test run quite quickly on my machine.

The PR is ready for review now. The problem with my original approach was that most tests need nodes 0 and 1 to have the same tip. So, after the robust "sync up" setup approach of repeatedly generating blocks, node 1 has to catch up. Added a parameter "sync_blocks" for that purpose, since for the last test the chains are already different and synchronization is not possible in a trivial way. I have now done a few hundred local test runs with --valgrind and all of them passed. Also adapted the PR description to include a script that reviewers can use to test the robustness.

theStack marked this as ready for review on Jan 27, 2021

jonatack commented at 6:10 PM on January 28, 2021: contributor

Concept ACK, will be great to robustify this test.

So far your script has run the test 25 times without errors for me (other than an unrelated issue that the test runner --valgrind option raises for me in general (not just on this test), but valgrind test/functional/interface_zmq.py works fine).

------------------------
----- test run 25 -----
------------------------
==42527== Memcheck, a memory error detector
==42527== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==42527== Using Valgrind-3.16.0.GIT and LibVEX; rerun with -h for copyright info
==42527== Command: test/functional/interface_zmq.py
==42527== 
2021-01-28T18:10:51.208000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_i65ciqkh
2021-01-28T18:10:54.197000Z TestFramework (INFO): Generate 5 blocks (and 5 coinbase txes)
2021-01-28T18:10:55.327000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:56.502000Z TestFramework (INFO): Test the getzmqnotifications RPC
2021-01-28T18:10:56.505000Z TestFramework (INFO): Testing 'sequence' publisher
2021-01-28T18:10:57.900000Z TestFramework (INFO): Wait for tx from second node
2021-01-28T18:10:59.026000Z TestFramework (INFO): Testing sequence notifications with mempool sequence values
2021-01-28T18:10:59.027000Z TestFramework (INFO): Testing RBF notification
2021-01-28T18:11:26.168000Z TestFramework (INFO): Testing reorg notifications
2021-01-28T18:11:29.256000Z TestFramework (INFO): Evict mempool transaction by block conflict
2021-01-28T18:11:30.558000Z TestFramework (INFO): Testing 'mempool sync' usage of sequence notifier
2021-01-28T18:11:53.212000Z TestFramework (INFO): Stopping nodes
2021-01-28T18:11:53.469000Z TestFramework (INFO): Cleaning up /tmp/bitcoin_func_test_i65ciqkh on exit
2021-01-28T18:11:53.470000Z TestFramework (INFO): Tests successful
------------------------
----- test run 26 -----
------------------------

<details><summary>unrelated error with --valgrind test runner option</summary><p>

------------------------
----- test run 1 -----
------------------------
2021-01-28T17:22:07.128000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_awm_44bl
2021-01-28T17:22:34.997000Z TestFramework (ERROR): Unexpected exception caught during testing
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 108, in _request
    return self._get_response()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 168, in _get_response
    http_response = self.__conn.getresponse()
  File "/usr/lib/python3.9/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.9/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.9/http/client.py", line 276, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 125, in main
    self.setup()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 262, in setup
    self.setup_network()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 356, in setup_network
    self.setup_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 383, in setup_nodes
    self.import_deterministic_coinbase_privkeys()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 400, in import_deterministic_coinbase_privkeys
    self.init_wallet(i)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 407, in init_wallet
    n.createwallet(wallet_name=wallet_name, descriptors=self.options.descriptors, load_on_startup=True)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 676, in createwallet
    return self.__getattr__('createwallet')(wallet_name, disable_private_keys, blank, passphrase, avoid_reuse, descriptors, load_on_startup)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 113, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.9/http/client.py", line 950, in send
    self.connect()
  File "/usr/lib/python3.9/http/client.py", line 921, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/usr/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused
2021-01-28T17:22:35.049000Z TestFramework (INFO): Stopping nodes
2021-01-28T17:22:35.050000Z TestFramework.node0 (ERROR): Unable to stop node.
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 320, in stop_node
    self.stop(wait=wait)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/coverage.py", line 47, in __call__
    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 144, in __call__
    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/authproxy.py", line 107, in _request
    self.__conn.request(method, path, postdata, headers)
  File "/usr/lib/python3.9/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.9/http/client.py", line 1266, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.9/http/client.py", line 1092, in putrequest
    raise CannotSendRequest(self.__state)
http.client.CannotSendRequest: Request-sent
Traceback (most recent call last):
  File "/home/jon/projects/bitcoin/bitcoin/./test/functional/interface_zmq.py", line 529, in <module>
    ZMQTest().main()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 149, in main
    exit_code = self.shutdown()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 278, in shutdown
    self.stop_nodes()
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_framework.py", line 525, in stop_nodes
    node.stop_node(wait=wait, wait_until_stopped=False)
  File "/home/jon/projects/bitcoin/bitcoin/test/functional/test_framework/test_node.py", line 334, in stop_node
    raise AssertionError("Unexpected stderr {} != {}".format(stderr, expected_stderr))
AssertionError: Unexpected stderr ==37264== Thread 10 b-httpworker.3:
==37264== Conditional jump or move depends on uninitialised value(s)
==37264==    at 0xB0DEC5: __log_putr.isra.2 (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB0F090: __log_put (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB99C1D: __crdel_metasub_log (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAD526A: __db_log_page (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB4068A: __bam_new_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE7A6D: __db_init_subdb (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xB044ED: __fop_subdb_setup (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE6EB0: __db_open (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAE1719: __db_open_pp (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0xAC1D3A: Db::open(DbTxn*, char const*, char const*, DBTYPE, unsigned int, int) (in /home/jon/projects/bitcoin/bitcoin/src/bitcoind)
==37264==    by 0x873618: BerkeleyDatabase::Open() (bdb.cpp:345)
==37264==    by 0x873184: BerkeleyBatch::BerkeleyBatch(BerkeleyDatabase&, bool, bool) (bdb.cpp:308)
==37264== 
{
   <insert_a_suppression_name_here>
   Memcheck:Cond
   fun:__log_putr.isra.2
   fun:__log_put
   fun:__crdel_metasub_log
   fun:__db_log_page
   fun:__bam_new_subdb
   fun:__db_init_subdb
   fun:__fop_subdb_setup
   fun:__db_open
   fun:__db_open_pp
   fun:_ZN2Db4openEP5DbTxnPKcS3_6DBTYPEji
   fun:_ZN16BerkeleyDatabase4OpenEv
   fun:_ZN13BerkeleyBatchC1ER16BerkeleyDatabasebb
}
==37264== 
==37264== Exit program on first error (--exit-on-first-error=yes) != 
[node 1] Cleaning up leftover process
[node 0] Cleaning up leftover process

</p></details>

in test/functional/interface_zmq.py:69 in c02f9a1882 outdated

  64 | @@ -62,6 +65,8 @@ def receive_sequence(self):
  65 |  class ZMQTest (BitcoinTestFramework):
  66 |      def set_test_params(self):
  67 |          self.num_nodes = 2
  68 | +        # immediate tx relay (node1 -> node0)
  69 | +        self.extra_args = [[], ["-whitelist=noban@127.0.0.1"]]

jonatack commented at 6:53 PM on January 28, 2021:

c02f9a188 maybe general-case it with the existing explanation in wallet_avoidreuse.py

        # This test isn't testing txn relay/timing, so set whitelist on the
        # peers for instant txn relay. This speeds up the test run time 2-3x.
        self.extra_args = [["-whitelist=noban@127.0.0.1"]] * self.num_nodes

theStack commented at 1:29 AM on January 29, 2021:

Thanks, I like the clear explanation and used it. My initial reason for not using the whitelist parameter on all nodes was that for node0 they would be overwritten in the setup routine. But simply adding self.extra_args[0] to the extra_args parameter for the restart_node call also tackles this.
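
A minimal sketch of the point made in this reply, assuming a hypothetical helper named zmq_restart_args (it is not part of the test framework). It only shows how the zmq flags and node0's configured extra args could be combined into one list, so that the whitelist setting is not lost on restart; the actual restart_node call is not reproduced here.

```python
# Hypothetical helper, not part of the test framework: builds the argument
# list for restarting node0 so that its configured extra_args survive.
def zmq_restart_args(node_extra_args, services):
    # One "-zmqpub<topic>=<address>" flag per (topic, address) pair.
    zmq_flags = [f"-zmqpub{topic}={address}" for topic, address in services]
    # Append the node's configured extra args (e.g. the whitelist flag).
    return zmq_flags + list(node_extra_args)


# Same shape as the extra_args suggested above, for two nodes.
extra_args = [["-whitelist=noban@127.0.0.1"]] * 2
args = zmq_restart_args(extra_args[0], [("hashblock", "tcp://127.0.0.1:28332")])
print(args)
```

In the test itself, a list built this way would be what gets passed to the restart_node call for node 0.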

in test/functional/interface_zmq.py:129 in c02f9a1882 outdated

 131 | +            sub.socket.set(zmq.RCVTIMEO, recv_timeout*1000)
 132 | +
 133 | +        if sync_blocks:
 134 | +            self.connect_nodes(0, 1)
 135 | +            self.sync_blocks()
 136 | +            self.disconnect_nodes(0, 1)

jonatack commented at 7:01 PM on January 28, 2021:

a8ddb26150 I didn't understand why we disconnect here only to connect again on returning to the two callers that use the default sync_blocks=True.

theStack commented at 1:21 AM on January 29, 2021:

Agree that this was unnecessarily complicated. Changed the setup routine to always connect nodes 0 and 1 (which is needed anyway for the block sync that is used in every subtest except the last one) and disconnect in the subtest if necessary.

jonatack commented at 7:49 PM on January 28, 2021: contributor

Almost-ACK. I was unable to make the test fail with valgrind both here and on master. The test does run ~2x faster on this branch, which is great.

theStack force-pushed on Jan 29, 2021

theStack commented at 1:32 AM on January 29, 2021: contributor

Force-pushed with changes suggested by @jonatack (https://github.com/bitcoin/bitcoin/pull/21008#discussion_r566331714 and #21008 (review)).

in test/functional/interface_zmq.py:91 in 9a9453d35b outdated

  87 | @@ -82,23 +88,46 @@ def run_test(self):
  88 |  
  89 |      # Restart node with the specified zmq notifications enabled, subscribe to
  90 |      # all of them and return the corresponding ZMQSubscriber objects.
  91 | -    def setup_zmq_test(self, services, recv_timeout=60, connect_nodes=False):
  92 | +    def setup_zmq_test(self, services, recv_timeout=60, sync_blocks=True):

jonatack commented at 3:38 PM on January 29, 2021:

nit, can enforce named args with

    def setup_zmq_test(self, services, *, recv_timeout=60, sync_blocks=True):

or alternatively for all args

@@ -88,7 +88,7 @@ class ZMQTest (BitcoinTestFramework):
 
     # Restart node with the specified zmq notifications enabled, subscribe to
     # all of them and return the corresponding ZMQSubscriber objects.
-    def setup_zmq_test(self, services, recv_timeout=60, sync_blocks=True):
+    def setup_zmq_test(self, *, services, recv_timeout=60, sync_blocks=True):
         subscribers = []
         for topic, address in services:
             socket = self.ctx.socket(zmq.SUB)
@@ -137,7 +137,7 @@ class ZMQTest (BitcoinTestFramework):
         self.restart_node(0, ["-zmqpubrawtx=foo", "-zmqpubhashtx=bar"])
 
         address = 'tcp://127.0.0.1:28332'
-        subs = self.setup_zmq_test([(topic, address) for topic in ["hashblock", "hashtx", "rawblock", "rawtx"]])
+        subs = self.setup_zmq_test(services=[(topic, address) for topic in ["hashblock", "hashtx", "rawblock", "rawtx"]])
 
         hashblock = subs[0]
         hashtx = subs[1]
@@ -212,7 +212,7 @@ class ZMQTest (BitcoinTestFramework):
 
         # Should only notify the tip if a reorg occurs
         hashblock, hashtx = self.setup_zmq_test(
-            [(topic, address) for topic in ["hashblock", "hashtx"]],
+            services=[(topic, address) for topic in ["hashblock", "hashtx"]],
             recv_timeout=2)  # 2 second timeout to check end of notifications
         self.disconnect_nodes(0, 1)
 
@@ -262,7 +262,7 @@ class ZMQTest (BitcoinTestFramework):
         <32-byte hash>A<8-byte LE uint> : Transactionhash added mempool
         """
         self.log.info("Testing 'sequence' publisher")
-        [seq] = self.setup_zmq_test([("sequence", "tcp://127.0.0.1:28333")])
+        [seq] = self.setup_zmq_test(services=[("sequence", "tcp://127.0.0.1:28333")])
         self.disconnect_nodes(0, 1)
 
         # Mempool sequence number starts at 1
@@ -414,7 +414,7 @@ class ZMQTest (BitcoinTestFramework):
             return
 
         self.log.info("Testing 'mempool sync' usage of sequence notifier")
-        [seq] = self.setup_zmq_test([("sequence", "tcp://127.0.0.1:28333")])
+        [seq] = self.setup_zmq_test(services=[("sequence", "tcp://127.0.0.1:28333")])
 
         # In-memory counter, should always start at 1
         next_mempool_seq = self.nodes[0].getrawmempool(mempool_sequence=True)["mempool_sequence"]
@@ -514,7 +514,7 @@ class ZMQTest (BitcoinTestFramework):
 
     def test_multiple_interfaces(self):
         # Set up two subscribers with different addresses
-        subscribers = self.setup_zmq_test([
+        subscribers = self.setup_zmq_test(services=[
             ("hashblock", "tcp://127.0.0.1:28334"),
             ("hashblock", "tcp://127.0.0.1:28335"),
         ], sync_blocks=False)

theStack commented at 11:00 PM on February 9, 2021:

Thanks, I went with the first variant, i.e. enforcing named args for recv_timeout and sync_blocks.
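
For illustration, a small standalone sketch of what the bare `*` in the first variant does. The function below is a stand-in with the same signature shape but without `self` and with a dummy body; it is not the real setup routine.

```python
# Stand-in with the same signature shape as the first variant; the body is
# a placeholder that just echoes its arguments back.
def setup_zmq_test(services, *, recv_timeout=60, sync_blocks=True):
    return services, recv_timeout, sync_blocks


# Keyword use works as before.
ok = setup_zmq_test([("hashblock", "tcp://127.0.0.1:28332")], recv_timeout=2)

# Passing recv_timeout positionally is rejected by Python itself.
try:
    setup_zmq_test([], 2)
    rejected = False
except TypeError:
    rejected = True

print(ok, rejected)
```

The benefit is that a call site like `setup_zmq_test(services, 2, False)` can no longer silently mix up the two optional parameters.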

in test/functional/interface_zmq.py:525 in 9a9453d35b outdated

 516 | @@ -488,7 +517,7 @@ def test_multiple_interfaces(self):
 517 |          subscribers = self.setup_zmq_test([
 518 |              ("hashblock", "tcp://127.0.0.1:28334"),
 519 |              ("hashblock", "tcp://127.0.0.1:28335"),
 520 | -        ])
 521 | +        ], sync_blocks=False)

jonatack commented at 3:39 PM on January 29, 2021:

nit, maybe add a comment clarifying why sync_blocks must be false (the test hangs without it)

theStack commented at 11:01 PM on February 9, 2021:

Thanks, done.

jonatack commented at 3:41 PM on January 29, 2021: contributor

ACK 9a9453d35be5e6d24e14f75c911428a5dbbd2b30

Might be good if @instagibbs had a look.

(IDK if he prefers instagibbs or Gregory Sanders for the commit credit)

DrahtBot added the label Needs rebase on Feb 5, 2021

zmq test: dedup message reception handling in ZMQSubscriber 6014d6e1b5

zmq test: accept arbitrary sequence start number in ZMQSubscriber

The ZMQSubscriber reception methods currently assert that the first
received publisher message has a sequence number of zero. In order to
fix the current test flakiness via "syncing up" to nodes in the setup
phase, we have to cope with the situation that messages get lost and the
first actual received message has a sequence number larger than zero.

8666033630
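
A rough sketch of the idea in the commit message above, under the assumption that the bookkeeping can be reduced to a single counter. SequenceChecker is a hypothetical name used only for this illustration and is not the real ZMQSubscriber: instead of asserting that the first sequence number is zero, it adopts whatever number arrives first and only checks that later numbers increase by one.

```python
class SequenceChecker:
    """Illustrative stand-in for the sequence bookkeeping (hypothetical)."""

    def __init__(self):
        # Unknown until the first message arrives.
        self.expected = None

    def check(self, seq):
        if self.expected is None:
            # Earlier messages may have been lost, so accept any start value.
            self.expected = seq
        assert seq == self.expected, f"expected {self.expected}, got {seq}"
        self.expected += 1
        return seq


checker = SequenceChecker()
checker.check(3)  # first message seen happens to carry sequence number 3
checker.check(4)  # the next one must follow directly

# A gap after the first message is still treated as an error.
try:
    checker.check(6)
    gap_detected = False
except AssertionError:
    gap_detected = True
```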

zmq test: fix flakiness by using more robust sync method

After connecting the subscriber sockets to the node, there is no
guarantee that the node's zmq publisher interfaces are ready yet, which
means that potentially the first expected notification messages could
get lost and the test fails. Currently this is handled by just waiting
for a short period of time (200ms), which works most of the time but is
still problematic, as in some rare cases the setup time takes much
longer, even in the range of multiple seconds.

The solution in this commit approaches the problem by using a more
robust method of syncing up, originally proposed by instagibbs:
    1. Generate a block on the node
    2. Try to receive a notification on all subscribers
    3. If all subscribers get a message within the timeout (1 second),
       we are done, otherwise repeat starting from step 1

5c6546362d
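
The three steps in the commit message above can be sketched without a node or zmq at all. Everything named here (FakeSubscriber, sync_up, max_rounds, the chain dict) is a hypothetical stand-in for illustration; the real test generates blocks via RPC and receives on zmq sockets with a receive timeout.

```python
import queue


class FakeSubscriber:
    """Hypothetical stand-in for a subscriber socket wrapper."""

    def __init__(self, ready_after):
        self.ready_after = ready_after  # number of blocks missed before "connected"
        self.inbox = queue.Queue()

    def deliver(self, height):
        # Notifications published before the subscriber is ready are lost.
        if height > self.ready_after:
            self.inbox.put(height)

    def receive(self, timeout):
        # Raises queue.Empty if nothing arrives within the timeout.
        return self.inbox.get(timeout=timeout)


def sync_up(generate_block, subscribers, timeout=1.0, max_rounds=20):
    for round_number in range(1, max_rounds + 1):
        generate_block()                      # step 1
        received = 0
        for sub in subscribers:               # step 2
            try:
                sub.receive(timeout)
                received += 1
            except queue.Empty:
                pass
        if received == len(subscribers):      # step 3
            return round_number
    raise RuntimeError("subscribers never synced up")


subs = [FakeSubscriber(ready_after=0), FakeSubscriber(ready_after=2)]
chain = {"height": 0}


def generate_block():
    chain["height"] += 1
    for sub in subs:
        sub.deliver(chain["height"])


rounds = sync_up(generate_block, subs, timeout=0.01)
print(rounds, chain["height"])
```

In this toy setup the second subscriber misses the first two blocks, so the loop needs three rounds. The max_rounds cap is an addition for the sketch so that a broken setup fails instead of looping forever.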

zmq test: speedup test by whitelisting peers (immediate tx relay)

Speeds up the zmq test roughly by a factor of 2x (~20 sec. instead of
~40 sec.) and also avoids timeouts on the synchronization methods
(sync_mempool() / sync_blocks()) that happened with a slight chance.
This is due to the fact that there is no upper bound on the trickle
relay time, so even the default of 60s is sometimes too low. Fixed by
enabling immediate tx relay on node1.

ef21fb7313

theStack force-pushed on Feb 9, 2021

theStack commented at 11:02 PM on February 9, 2021: contributor

Force-pushed with a rebase on master and suggestions by jonatack (https://github.com/bitcoin/bitcoin/pull/21008#discussion_r566907134 and #21008 (review)).

DrahtBot removed the label Needs rebase on Feb 9, 2021

jonatack commented at 4:18 PM on February 10, 2021: contributor

Light ACK ef21fb7313005a8a2d4f03fb4056f1f66c1b04f0 with the caveat that I was unable to make the test fail with valgrind both here and on master, so I can't vouch that it actually fixes the CI flakiness. The test does run ~2x faster with this.

Thanks for adding the comment. It would be good for the tests to not be order-dependent.

MarcoFalke merged this on Feb 16, 2021

MarcoFalke closed this on Feb 16, 2021

sidhujag referenced this in commit 31ef542332 on Feb 16, 2021

jonatack cross-referenced this on Feb 17, 2021 from issue Failure in test/functional/interface_zmq.py by sdaftuar

jnewbery cross-referenced this on Feb 22, 2021 from issue net: Address outstanding review comments from PR20721 by jnewbery

jnewbery cross-referenced this on Feb 22, 2021 from issue net processing: Extract `addr` send functionality into MaybeSendAddr() by jnewbery

jnewbery cross-referenced this on Feb 22, 2021 from issue net/net processing: Move tx inventory into net_processing by jnewbery

theStack cross-referenced this on Feb 27, 2021 from issue zmq test: fix sync-up by matching notification to generated block by theStack

adamjonas cross-referenced this on Mar 1, 2021 from issue Fix zmq test flakiness by MarcoFalke

MarcoFalke referenced this in commit cfce346508 on Mar 2, 2021

Fabcien referenced this in commit 30b874af38 on Nov 30, 2021

Fabcien referenced this in commit 320e98a7e4 on Nov 30, 2021

Fabcien referenced this in commit b98d91efd3 on Nov 30, 2021

bitcoin locked this on Aug 16, 2022