Fix zmq test flakiness

MarcoFalke commented at 3:26 pm on January 14, 2021: member

There are many reports of the test being flaky: #20672 (comment)

Thus, it should be made more robust, as described in #20538 (comment)

Useful skills:

Background in our functional test suite (python3)
Background in zmq

Want to work on this issue?

For guidance on contributing, please read CONTRIBUTING.md before opening your pull request.

MarcoFalke added the label Tests on Jan 14, 2021

MarcoFalke added the label good first issue on Jan 14, 2021

adamjonas commented at 8:15 pm on January 14, 2021: member

Of the last 571 failures, 22 are from the interface_zmq.py functional tests (3.8%). According to the numbers, it’s the flakiest functional test we have. @domob1812 @theStack @mruddy @n-thumann are any of you willing to give this a shot?

theStack commented at 5:49 pm on January 17, 2021: member

I took some time to look at the problem; it seems quite tricky to solve robustly. I tried the suggested method of “syncing up” by repeatedly generating a block and waiting for the expected message (until it no longer times out), but generating a block seems to interfere with some of the sub-tests. It also already generates notification messages for our subscribers that are received later (even if we are not connected yet). Maybe something like this would work:

restart node with additional pubhashtx test publisher (on a port not used by any of the test subs)
repeatedly generate block and wait for expected messages from test publisher, until it doesn’t time out anymore
invalidate generated blocks
clear mempool (needed?)
read from our subscriber sockets until there is no data (a “reverse flush”, so to speak)
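
The “sync up” and “reverse flush” steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the test suite's actual code: `FakeSubscriber`, `sync_up` and `drain` are hypothetical names, and the fake subscriber replaces a real ZMQ socket so the example runs with the standard library alone. In a real test, `receive` would wrap a socket read with a receive timeout, and the exception caught would be `zmq.Again` rather than `TimeoutError`. The block invalidation and mempool clearing steps are not modelled here.

```python
from collections import deque


class FakeSubscriber:
    """Hypothetical in-process stand-in for a ZMQ SUB socket.

    The first `lost` published messages are dropped, imitating a
    subscriber whose connection is not fully established yet.
    """

    def __init__(self, lost):
        self.lost = lost
        self.pending = deque()

    def publish(self, msg):
        # Drop messages until the simulated connection is "up".
        if self.lost > 0:
            self.lost -= 1
            return
        self.pending.append(msg)

    def receive(self):
        # An empty queue plays the role of a receive timeout.
        if not self.pending:
            raise TimeoutError("no message within timeout")
        return self.pending.popleft()


def sync_up(trigger, receive, max_attempts=20):
    """Call trigger() then receive() until receive() stops timing out.

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        trigger()
        try:
            receive()
        except TimeoutError:
            continue  # notification was lost; try again
        return attempt
    raise RuntimeError(f"subscriber not ready after {max_attempts} attempts")


def drain(receive):
    """Read until a timeout occurs and return how many messages were discarded."""
    drained = 0
    while True:
        try:
            receive()
        except TimeoutError:
            return drained
        drained += 1


sub = FakeSubscriber(lost=3)
attempts = sync_up(lambda: sub.publish(b"hashtx"), sub.receive)

# Simulate stale notifications that arrive after syncing, then flush them.
sub.publish(b"stale-1")
sub.publish(b"stale-2")
leftover = drain(sub.receive)
print(attempts, leftover)
```

Passing `trigger` as a callable keeps the loop independent of which event produces the notification, so the same shape would apply whether the event is a generated block or something lighter.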

Maybe I’m overcomplicating this, though. Whatever the solution turns out to be, having a common test setup method should at least serve as a better basis for solving this issue: #20953

instagibbs commented at 2:58 am on January 18, 2021: member

but generating a block seems to interfere with some of the sub-tests

Yes, I think it would require making all the subtests more robust.

alternative setup

Seems pretty complicated, and with intentional block rollbacks things can get weird.

fanquake referenced this in commit 3734adba39 on Jan 21, 2021

sidhujag referenced this in commit 4dceb42b8b on Jan 21, 2021

MarcoFalke commented at 8:06 am on January 22, 2021: member

Could a mempool tx be used to sync up instead of a block?

practicalswift commented at 11:29 am on January 26, 2021: contributor

What about temporarily disabling interface_zmq.py in CI until this is fixed?

It seems to me that interface_zmq.py as it is currently working is a net negative from a CI testing perspective due to its extreme flakiness :)

instagibbs commented at 11:53 am on January 26, 2021: member

How often is it failing?

MarcoFalke commented at 11:57 am on January 26, 2021: member

Of the last 571 failures, 22 are from the interface_zmq.py functional tests (3.8%). According to the numbers, it’s the flakiest functional test we have.

(quote from @adamjonas)

MarcoFalke closed this on Feb 16, 2021

sidhujag referenced this in commit 31ef542332 on Feb 16, 2021

adamjonas reopened this on Mar 1, 2021

adamjonas commented at 4:03 pm on March 1, 2021: member

interface_zmq.py flakiness is back and I think #21008 is hurting more than helping.

Before merge of #21008 on 2/16 (Feb 12-15): Failed 1 time on 1 PR (1,274 builds)

Same Friday to Monday time period after merge (Feb 19-22): Failed 11 times across 9 different PRs (1,470 total builds)
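
To put the two periods on the same scale, the counts above can be converted to per-build failure rates. The snippet below is a small sketch using only the figures quoted in this comment. It assumes that each reported failure corresponds to one build, which the comment does not state explicitly, and `failure_rate` is a hypothetical helper.

```python
def failure_rate(failures, builds):
    """Fraction of builds in which interface_zmq.py failed (assumes one failure per build)."""
    return failures / builds


before = failure_rate(1, 1274)   # Feb 12-15, before #21008 was merged
after = failure_rate(11, 1470)   # Feb 19-22, after the merge

print(f"before: {before:.3%}")
print(f"after:  {after:.3%}")
print(f"ratio:  {after / before:.1f}x")
```

Under that assumption the rate goes from roughly 0.08% to roughly 0.75% of builds, about 9.5 times higher. With only one failure in the earlier window, the ratio is a rough indication rather than a precise measurement.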

MarcoFalke closed this on Mar 2, 2021

MarcoFalke commented at 10:31 am on March 2, 2021: member

Fixed in #21216?

adamjonas commented at 10:59 pm on March 2, 2021: member

ref #21310

Fabcien referenced this in commit 30b874af38 on Nov 30, 2021

DrahtBot locked this on Aug 18, 2022

Fix zmq test flakiness #20934
