qa: Another intermittent failure in `p2p_compactblocks.py` #21535

issue hebasto opened this issue on March 27, 2021
  1. hebasto commented at 9:28 PM on March 27, 2021: member

    https://bitcoinbuilds.org/?job=471847ae-c57f-4c38-8407-ec08c553463b (redirects to an ad/tracker as of Aug 20th 2022)

    <details><summary>log excerpt</summary> <p>

    2021-03-27T18:33:03.120000Z TestFramework (INFO): Testing end-to-end block relay...
    2021-03-27T18:37:03.427000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ''''
            def test_function():
                if check_connected:
                    assert self.is_connected
                return test_function_in()
    '''
    2021-03-27T18:37:03.428000Z TestFramework (ERROR): Assertion failed
    Traceback (most recent call last):
      File "/home/ubuntu/src/test/functional/test_framework/test_framework.py", line 128, in main
        self.run_test()
      File "/home/ubuntu/src/test/functional/p2p_compactblocks.py", line 844, in run_test
        self.test_end_to_end_block_relay([self.segwit_node, self.old_node])
      File "/home/ubuntu/src/test/functional/p2p_compactblocks.py", line 675, in test_end_to_end_block_relay
        l.wait_until(lambda: "cmpctblock" in l.last_message, timeout=30)
      File "/home/ubuntu/src/test/functional/test_framework/p2p.py", line 450, in wait_until
        wait_until_helper(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor)
      File "/home/ubuntu/src/test/functional/test_framework/util.py", line 261, in wait_until_helper
        raise AssertionError("Predicate {} not true after {} seconds".format(predicate_source, timeout))
    AssertionError: Predicate ''''
            def test_function():
                if check_connected:
                    assert self.is_connected
                return test_function_in()
    ''' not true after 240.0 seconds
    2021-03-27T18:37:03.479000Z TestFramework (INFO): Stopping nodes
    

    </p> </details>

  2. hebasto added the label Bug on Mar 27, 2021
  3. practicalswift commented at 8:28 AM on May 17, 2021: contributor
  4. mzumsande commented at 12:09 AM on June 7, 2021: contributor

    Looks like something strange is going on with validationinterface (referring to https://cirrus-ci.com/task/4959275735842816):

    The block 697bad... is enqueued in test_compactblock_reconstruction_multiple_peers but unlike in my local runs, the scheduler doesn't seem to pick this up until later, after the next block 59e460... has been created in test_end_to_end_block_relay and only then the scheduler picks them up back to back.

    Not sure if this is any way related to the old issue #12978 , but perhaps @sdaftuar @jnewbery this is of interest to you?

    Here are the relevant messages with timestamps:

     node0 2021-05-15T12:41:09.058353Z [msghand] [validationinterface.cpp:226] [BlockConnected] Enqueuing BlockConnected: block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d block height=349 
     node0 2021-05-15T12:41:09.058507Z [msghand] [validationinterface.cpp:199] [UpdatedBlockTip] Enqueuing UpdatedBlockTip: new block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d fork block hash=5ed562ea74e4344776d8f3bebd57d5945239a85b7e21f4e940482934eefdd308 (in IBD=false) 
    
     node0 2021-05-15T12:41:09.281551Z [httpworker.0] [validationinterface.cpp:226] [BlockConnected] Enqueuing BlockConnected: block hash=59e46023b0263c5ce8ba268680984b214627d1c7c818c28b9c2d0a8f4981c21e block height=350 
     node0 2021-05-15T12:41:09.281624Z [httpworker.0] [validationinterface.cpp:199] [UpdatedBlockTip] Enqueuing UpdatedBlockTip: new block hash=59e46023b0263c5ce8ba268680984b214627d1c7c818c28b9c2d0a8f4981c21e fork block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d (in IBD=false) 
    
     node0 2021-05-15T12:41:09.348408Z [scheduler] [validationinterface.cpp:226] [operator()] BlockConnected: block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d block height=349 
     node0 2021-05-15T12:41:09.349191Z [scheduler] [validationinterface.cpp:199] [operator()] UpdatedBlockTip: new block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d fork block hash=5ed562ea74e4344776d8f3bebd57d5945239a85b7e21f4e940482934eefdd308 (in IBD=false) 
    
     node0 2021-05-15T12:41:09.349506Z [scheduler] [validationinterface.cpp:226] [operator()] BlockConnected: block hash=59e46023b0263c5ce8ba268680984b214627d1c7c818c28b9c2d0a8f4981c21e block height=350 
     node0 2021-05-15T12:41:09.350540Z [scheduler] [validationinterface.cpp:199] [operator()] UpdatedBlockTip: new block hash=59e46023b0263c5ce8ba268680984b214627d1c7c818c28b9c2d0a8f4981c21e fork block hash=697bad32cfe9bcc2568a879275d58eab09768e16302a58e09274e9054d7eea9d (in IBD=false) 
    
  5. jnewbery commented at 10:20 AM on June 7, 2021: contributor

    Good investigation @mzumsande! You're right that this is caused by asynchronicity of the validation interface (as well as incorrect assumptions/bugs in the test). Here's what's happening:

    Successful case

    In test_compactblock_reconstruction_multiple_peers(), block 349 is reconstructed from the final blocktxn message. Net processing is able to reconstruct the block and submit it to ProcessBlock(). Block validation succeeds, and the BlockConnected/UpdatedBlockTip callbacks are enqueued:

     node0 2021-06-07T09:07:28.255393Z [msghand] [validationinterface.cpp:224] [BlockConnected] Enqueuing BlockConnected: block hash=4c429e49ce86fb14ce83fe6c2c0b9c6dbdf5d819884e534238916c1ae6dc5577 block height=349 
     node0 2021-06-07T09:07:28.255473Z [msghand] [validationinterface.cpp:196] [UpdatedBlockTip] Enqueuing UpdatedBlockTip: new block hash=4c429e49ce86fb14ce83fe6c2c0b9c6dbdf5d819884e534238916c1ae6dc5577 fork block hash=396aebbbbc2ab6ab5ff4c808586702c5ca0041b8de36d90a9d154b18
    25721efd (in IBD=false) 
    

    The test framework verifies that the node has advanced to block 349:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/test/functional/p2p_compactblocks.py#L770

    The validation interface callbacks are fired quickly:

     node0 2021-06-07T09:07:28.256318Z [scheduler] [validationinterface.cpp:196] [operator()] UpdatedBlockTip: new block hash=4c429e49ce86fb14ce83fe6c2c0b9c6dbdf5d819884e534238916c1ae6dc5577 fork block hash=396aebbbbc2ab6ab5ff4c808586702c5ca0041b8de36d90a9d154b1825721efd (in 
    IBD=false) 
    

    This causes the block hash to be added to m_blocks_for_headers_relay for peer 1:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L1508

    The next time that SendMessages() is called for peer1, the block is announced via cmpctblock:

     node0 2021-06-07T09:07:28.256384Z [msghand] [net_processing.cpp:4375] [SendMessages] SendMessages sending header-and-ids 4c429e49ce86fb14ce83fe6c2c0b9c6dbdf5d819884e534238916c1ae6dc5577 to peer=1 
     node0 2021-06-07T09:07:28.256420Z [msghand] [net.cpp:2969] [PushMessage] sending cmpctblock (185 bytes) peer=1 
    

    (note that peer1 ~has not requested high bandwidth compact block relay from the node and it's also~ is a non-witness peer. ~Both of those things prevent~ That prevents it from being send a cmpctblock in the NewValidPOWBlock() callback and mean that the cmpctblock is only announced in the next SendMessages() call).

    EDIT: peer 1 has requested to be a hb peer in the earlier test_sendcmpct() subtest.

    This all happens quickly (before the "Testing end-to-end block relay...")

    Both peer0 and peer1 then send a sendcmpct(announce=True) to request hb compact block relay.

    In test_end_to_end_block_relay() a new block is submitted to the node (over RPC), causing the synchronous NewValidPOWBlock() callback to be fired:

     node0 2021-06-07T09:07:28.422098Z [httpworker.0] [validationinterface.cpp:254] [NewPoWValidBlock] NewPoWValidBlock: block hash=35ce74cae5d020e25c53f29b119ea62cabb7bdfb1c9a391f0ea4af3306dd76e5 
    

    It looks to me like the intent of test_end_to_end_block_relay() is to check the high-bandwidth cmpctblock relay, which is what happens for peer0 (segwit-enabled):

     node0 2021-06-07T09:07:28.422163Z [httpworker.0] [net_processing.cpp:1394] [operator()] PeerManager::NewPoWValidBlock sending header-and-ids 35ce74cae5d020e25c53f29b119ea62cabb7bdfb1c9a391f0ea4af3306dd76e5 to peer=0 
     node0 2021-06-07T09:07:28.422190Z [httpworker.0] [net.cpp:2969] [PushMessage] sending cmpctblock (215 bytes) peer=0 
    

    Notice the NewPoWValidBlock sending header-and-ids - that shows that this is hb cmpctblock relay, as opposed to SendMessages sending header-and-ids which is compact block relay in our normal message handling loop:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L4479

    Now for peer1, we don't do hb compact block relay, since we won't relay a witness compact block to a non-witness peer:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L1466-L1467

    But next time we go through the SendMessages() loop for peer1, we'll send a compact block:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L4475-L4480

    and the test passes.

    Failure case

    As above, block 349 is reconstructed in test_compactblock_reconstruction_multiple_peers() and UpdatedBlockTip() is enqueued. However, the scheduler thread doesn't fire the callbacks until after submitblock has been called in test_end_to_end_block_relay() and UpdatedBlockTip() for block 350 has been enqueued, so the UpdatedBlockTip() callbacks for block 349 and 350 are called in quick succession, and the block hashes for block 349 and 350 are both added to peer 1's m_blocks_for_headers_relay.

    In SendMessages() for peer 1, the node doesn't announce via cmpctblock, since there are two blocks to announce:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L4476-L4478

    Instead, the node falls back to announcing just the tip (block 350) via inv (since sendheaders has not been sent by peer1):

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/src/net_processing.cpp#L4520-L4544

    So peer1 never receives a cmpctblock and the test times out here:

    https://github.com/bitcoin/bitcoin/blob/912cb594901b1a5dd99aa4c00be34ae36512bc57/test/functional/p2p_compactblocks.py#L680

    Incorrect assumption in test

    The "Testing end-to-end block relay..." part of the test assumes that if peer 1 sends a sendcmpct(announce=True) to the node, then it will receive hb compact block announcements immediately. It doesn't because peer 1 is a non-witness peer. Normally, this incorrect assumption is hidden by the fact that the node will send a cmpctblock announcement in its SendMessages() loop. However, if the scheduler thread is held up, then there will be two UpdatedBlockTip() callbacks in quick succession, causing the node to fall back to announcing blocks via inv.

    Fix

    The quick and dirty way to fix this flakiness would be to add some kind of synchronization to the end of test_compactblock_reconstruction_multiple_peers() to ensure that the UpdatedBlockTip() callback had been fired. However, we should really fix the test to remove the incorrect assumption that non-witness peers will receive hb cmpctblock announcements.

    I've had a PR open for several months that removes support for version 1 compact blocks: #20799. Part of that PR removes this invalid test assumption: https://github.com/bitcoin/bitcoin/pull/20799/files#diff-3d0d1b84c608319965e0c2675eefadc6820103793bf5a9d671562bc24dcbc3eaR791-R792.

    That PR depends on #21090, which I'd encourage people to go and review.

    More generally, there's a lot of cruft in the p2p logic and tests. I'd be very happy if people spent more effort removing old, redundant code rather than trying to add yet more logic and complexity. #21562, #22141, #21527 are a few PRs that could use some review.

  6. fanquake commented at 2:04 AM on January 21, 2022: member

    Recent p2p_compactblocks.py failure:

     node0 2022-01-20T19:56:53.230675Z [C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\src\net_processing.cpp:2533] [ProcessMessage] received: pong (8 bytes) peer=2 
     test  2022-01-20T19:57:07.492000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: '''' 
                                               def test_function():
                                                   if check_connected:
                                                       assert self.is_connected
                                                   return test_function_in()
                                       '''
     test  2022-01-20T19:57:07.492000Z TestFramework (ERROR): Assertion failed 
                                       Traceback (most recent call last):
                                         File "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\test\functional\test_framework\test_framework.py", line 132, in main
                                           self.run_test()
                                         File "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\test\functional\p2p_compactblocks.py", line 890, in run_test
                                           self.test_end_to_end_block_relay([self.segwit_node, self.old_node])
                                         File "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\test\functional\p2p_compactblocks.py", line 722, in test_end_to_end_block_relay
                                           l.wait_until(lambda: "cmpctblock" in l.last_message, timeout=30)
                                         File "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\test\functional\test_framework\p2p.py", line 453, in wait_until
                                           wait_until_helper(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor)
                                         File "C:\Users\ContainerAdministrator\AppData\Local\Temp\cirrus-ci-build\test\functional\test_framework\util.py", line 269, in wait_until_helper
                                           raise AssertionError("Predicate {} not true after {} seconds".format(predicate_source, timeout))
                                       AssertionError: Predicate ''''
                                               def test_function():
                                                   if check_connected:
                                                       assert self.is_connected
                                                   return test_function_in()
                                       ''' not true after 240.0 seconds
     test  2022-01-20T19:57:07.507000Z TestFramework (DEBUG): Closing down network thread 
     test  2022-01-20T19:57:07.683000Z TestFramework (INFO): Stopping nodes 
     test  2022-01-20T19:57:07.699000Z TestFramework.node0 (DEBUG): Stopping node 
    
  7. hebasto renamed this:
    qa: Another intermittent failure in p2p_compactblocks.py
    qa: Another intermittent failure in `p2p_compactblocks.py`
    on Jun 27, 2022
  8. fanquake commented at 2:01 PM on February 7, 2023: member

    Is this still an issue given that #20799 has been merged?

  9. maflcko added the label Tests on Feb 7, 2023
  10. maflcko commented at 3:19 PM on February 7, 2023: member

    A new issue can be filed if this happens again

  11. maflcko closed this on Feb 7, 2023

  12. bitcoin locked this on Feb 7, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2026-04-16 18:14 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me