test: fix intermittent failure in p2p_orphan_handling.py #32063

pull mzumsande wants to merge 1 commits into bitcoin:master from mzumsande:202503_test_fix_p2p_orphan changing 1 files +1 −0
  1. mzumsande commented at 10:07 pm on March 13, 2025: contributor

    If the mocktime is bumped before the node has successfully disconnected the peer, the requests for both parents could be spread over two GETDATAS: The first time GetRequestsToSend is invoked it would only request one tx from peer2, because the other one would only be available after peer1 was disconnected and its outstanding txrequest cleared. So two GETDATAs would be sent, which would make the test fail.

    Fixes #31700

  2. test: fix intermittent failure in p2p_orphan_handling.py
    If we bump the mocktime before the node has successfully disconnected
    the peer, the requests for both parents could be spread over
    two GETDATAS, which would make the test fail.
    02942056fd
  3. DrahtBot commented at 10:07 pm on March 13, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32063.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK maflcko, instagibbs

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

  4. DrahtBot added the label Tests on Mar 13, 2025
  5. fanquake added this to the milestone 29.0 on Mar 13, 2025
  6. fanquake added the label Needs backport (29.x) on Mar 14, 2025
  7. fanquake requested review from instagibbs on Mar 16, 2025
  8. maflcko commented at 11:25 am on March 16, 2025: member
    lgtm ACK 02942056fd861581503a8a35a06dcf22d4ba1473
  9. fanquake merged this on Mar 16, 2025
  10. fanquake closed this on Mar 16, 2025

  11. glozow referenced this in commit 5ebcb59fdb on Mar 17, 2025
  12. glozow commented at 2:12 am on March 17, 2025: member
    Backported in #32062
  13. glozow removed the label Needs backport (29.x) on Mar 17, 2025
  14. mzumsande deleted the branch on Mar 17, 2025
  15. tnndbtc commented at 7:48 pm on March 17, 2025: none

    Hi @mzumsande , Followed your hint that the node needs to successfully disconnect the peer before bumping mocktime, I added one line of “debugging” code to reproduce the issue 100%: in src/net.cpp:CConnman::DisconnectNodes(), add a one second sleep before calling DeleteNode(pnode);

     0    {
     1        // Delete disconnected nodes
     2        std::list<CNode*> nodes_disconnected_copy = m_nodes_disconnected;
     3        for (CNode* pnode : nodes_disconnected_copy)
     4        {
     5            // Destroy the object only after other threads have stopped using it.
     6            if (pnode->GetRefCount() <= 0) {
     7                m_nodes_disconnected.remove(pnode);
     8                std::this_thread::sleep_for(std::chrono::seconds(1)); // SLEEP for 1 second to 100% reproduce the issue
     9                DeleteNode(pnode);
    10            }
    11        }
    12    }
    

    So, even I added your fix of waiting until num_test_p2p_connections reduces to 1, the issue is still 100% reproducible on Mac with Apple M1 chipset with OS Sequoia 15.1.1:

    0        peer1.peer_disconnect()
    1        self.wait_until(lambda: node.num_test_p2p_connections() == 1)
    2        node.bumpmocktime(TXREQUEST_TIME_SKIP)
    

    Then, I have moved the sleep 1 second AFTER DeleteNode(pnode); the issue is not reproducible.

    I agree with the idea that we should wait until the CConnman::DisconnectNodes is fully completed. However, it seems like this PR needs more work to fully fix the timing issue.

    To test with –timeout-factor=40.0: build/test/functional/test_runner.py test/functional/p2p_orphan_handling.py --tmpdir /tmp --nocleanup --timeout-factor=40.0 --quiet --failfast --v2transport

    1/1 - p2p_orphan_handling.py failed, Duration: 406 s

    stdout: 2025-03-17T21:44:15.055000Z TestFramework (INFO): PRNG seed is: 836743966629972745 2025-03-17T21:44:15.055000Z TestFramework (INFO): Initializing test directory /tmp/test_runner_₿🏃20250317_174414/p2p_orphan_handling_0 2025-03-17T21:44:18.271000Z TestFramework (INFO): Test that, if a parent goes missing during orphan reso, it is requested 2025-03-17T21:50:59.940000Z TestFramework.utils (ERROR): wait_until() failed. Predicate: ’’’’ def test_function(): if check_connected: assert self.is_connected return test_function_in() ’’’ 2025-03-17T21:51:01.048000Z TestFramework (ERROR): Assertion failed Traceback (most recent call last): File “/Users/masked/data/code/bitcoin/test/functional/test_framework/test_framework.py”, line 135, in main self.run_test() ~~~~~~~~~~~~~^^ File “/Users/masked/data/code/bitcoin/build/test/functional/p2p_orphan_handling.py”, line 826, in run_test self.test_parents_change() ~~~~~~~~~~~~~~~~~~~~~~~~^^ File “/Users/masked/data/code/bitcoin/build/test/functional/p2p_orphan_handling.py”, line 55, in wrapper func(self) ~~~~^^^^^^ File “/Users/masked/data/code/bitcoin/build/test/functional/p2p_orphan_handling.py”, line 797, in test_parents_change peer2.wait_for_parent_requests([int(parent_peekaboo_AB[“txid”], 16), int(parent_missing[“txid”], 16)]) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “/Users/masked/data/code/bitcoin/build/test/functional/p2p_orphan_handling.py”, line 99, in wait_for_parent_requests self.wait_until(test_function, timeout=10) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “/Users/masked/data/code/bitcoin/test/functional/test_framework/p2p.py”, line 594, in wait_until wait_until_helper_internal(test_function, timeout=timeout, lock=p2p_lock, timeout_factor=self.timeout_factor, check_interval=check_interval) ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “/Users/masked/data/code/bitcoin/test/functional/test_framework/util.py”, line 317, in wait_until_helper_internal raise AssertionError(“Predicate {} not true after {} seconds”.format(predicate_source, timeout)) AssertionError: Predicate ’’’’ def test_function(): if check_connected: assert self.is_connected return test_function_in() ’’’ not true after 400.0 seconds 2025-03-17T21:51:01.106000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-17T21:51:01.106000Z TestFramework (WARNING): Not cleaning up dir /tmp/test_runner🏃20250317_174414/p2p_orphan_handling_0 2025-03-17T21:51:01.106000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/test_runner🏃20250317_174414/p2p_orphan_handling_0/test_framework.log 2025-03-17T21:51:01.106000Z TestFramework (ERROR): 2025-03-17T21:51:01.106000Z TestFramework (ERROR): Hint: Call /Users/masked/data/code/bitcoin/test/functional/combine_logs.py ‘/tmp/test_runner🏃_20250317_174414/p2p_orphan_handling_0’ to consolidate all logs 2025-03-17T21:51:01.106000Z TestFramework (ERROR): 2025-03-17T21:51:01.106000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-17T21:51:01.106000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-17T21:51:01.106000Z TestFramework (ERROR):

    stderr: [node 0] Cleaning up leftover process

    TEST | STATUS | DURATION

    p2p_orphan_handling.py | ✖ Failed | 406 s

    ALL | ✖ Failed | 406 s (accumulated) Runtime: 406 s

    CC: @glozow

  16. mzumsande commented at 8:52 pm on March 17, 2025: contributor
    Interesting - wouldn’t that mean that disconnect_p2ps (which is used in a lot of places) has the same issue?
  17. tnndbtc commented at 10:33 pm on March 17, 2025: none

    @mzumsande I have another finding, if I move the calling of self.wait_until(lambda: len(node.getorphantxs(verbosity=2)[0]["from"]) == 1) before node.bumpmocktime(TXREQUEST_TIME_SKIP), then this issue is not reproducible even I set a longer sleep before calling DeleteNode(pnode);.

    0        # Disconnect peer1. peer2 should become the new candidate for orphan resolution.
    1        peer1.peer_disconnect()
    2        self.wait_until(lambda: len(node.getorphantxs(verbosity=2)[0]["from"]) == 1) # MOVE this check before bumpmocktime will greatly reduce the chance that DeleteNode(pnode); was called AFTER bumpmocktime.
    3        node.bumpmocktime(TXREQUEST_TIME_SKIP)
    4
    5        # self.wait_until(lambda: len(node.getorphantxs(verbosity=2)[0]["from"]) == 1)
    6        # Both parents should be requested, now that they are both missing.
    7        peer2.wait_for_parent_requests([int(parent_peekaboo_AB["txid"], 16), int(parent_missing["txid"], 16)])
    8        peer2.send_and_ping(msg_tx(parent_missing["tx"]))
    9        peer2.send_and_ping(msg_tx(parent_peekaboo_AB["tx"]))
    

    This code runs fine consistently.

    Reason behind it is: DeleteNode(pnode); will actually disconnect the peer for the orphan transaction by calling TxOrphanage::EraseForPeer. When node.getorphantxs(verbosity=2)[0]["from"] shows 1 connection, the node at least is half way through DeleteNode(pnode);, if not completed. Call stack is:

    0    frame [#0](/bitcoin-bitcoin/0/): 0x000000010360fb04 bitcoind`TxOrphanage::EraseForPeer(this=0x00006000014fe4a8, peer=0) at txorphanage.cpp:121:5
    1    frame [#1](/bitcoin-bitcoin/1/): 0x00000001031f9fa8 bitcoind`node::TxDownloadManagerImpl::DisconnectedPeer(this=0x00006000014fe490, nodeid=0) at txdownloadman_impl.cpp:165:17
    2    frame [#2](/bitcoin-bitcoin/2/): 0x00000001031f9f34 bitcoind`node::TxDownloadManager::DisconnectedPeer(this=0x0000000133055168, nodeid=0) at txdownloadman_impl.cpp:40:13
    3    frame [#3](/bitcoin-bitcoin/3/): 0x0000000102ebb638 bitcoind`(anonymous namespace)::PeerManagerImpl::FinalizeNode(this=0x0000000133055000, node=0x000000013260b6c0) at net_processing.cpp:1601:25
    4    frame [#4](/bitcoin-bitcoin/4/): 0x0000000102dec0bc bitcoind`CConnman::DeleteNode(this=0x0000000133009200, pnode=0x000000013260b6c0) at net.cpp:3494:16
    5    frame [#5](/bitcoin-bitcoin/5/): 0x0000000102deb060 bitcoind`CConnman::DisconnectNodes(this=0x0000000133009200) at net.cpp:1969:17
    

    Would you consider this as a potential solution?

    CC: @hodlinator @maflcko @glozow

  18. fanquake referenced this in commit d6db87165c on Mar 18, 2025
  19. tnndbtc commented at 2:01 am on March 18, 2025: none

    @mzumsande Regarding to your concerns on disconnect_p2ps, I took a look at the implementation of it in test/functional/test_framework/test_node.py:

    0    def disconnect_p2ps(self):
    1        """Close all p2p connections to the node.
    2        Use only after each p2p has sent a version message to ensure the wait works."""
    3        for p in self.p2ps:
    4            p.peer_disconnect()
    5        del self.p2ps[:]
    6
    7        self.wait_until(lambda: self.num_test_p2p_connections() == 0)
    

    After wait until self.num_test_p2p_connections() equals to 0, the CConnman::m_nodes becomes empty. However, it doesn’t guarantee all underlying nodes have released its resources, or completed DeleteNode(pnode);

    I don’t know whether this is acceptable. But, I tried to move the erase of pnode from m_nodes AFTER DeleteNode(pnode); , it passes this test: test/functional/p2p_orphan_handling.py, but ran into other test failures. feature_index_prune.py | ✖ Failed | 107 s feature_pruning.py | ✖ Failed | 407 s feature_signet.py | ✖ Failed | 3 s p2p_ibd_stalling.py –v1transport | ✖ Failed | 86 s p2p_invalid_messages.py | ✖ Failed | 2 s rpc_net.py –v1transport | ✖ Failed | 2 s rpc_net.py –v2transport | ✖ Failed | 4 s

     0    {           
     1        LOCK(m_nodes_mutex);
     2        ...
     3        ...
     4        ...
     5        // Disconnect unused nodes
     6        std::vector<CNode*> nodes_copy = m_nodes;
     7        for (CNode* pnode : nodes_copy)
     8        {
     9            if (pnode->fDisconnect)
    10            {
    11                // remove from m_nodes
    12                // m_nodes.erase(remove(m_nodes.begin(), m_nodes.end(), pnode), m_nodes.end());
    13        ...
    14        ...
    15        ...
    16    } 
    17    {
    18        // Delete disconnected nodes
    19        std::list<CNode*> nodes_disconnected_copy = m_nodes_disconnected;
    20        for (CNode* pnode : nodes_disconnected_copy)
    21        {
    22            // Destroy the object only after other threads have stopped using it.
    23            if (pnode->GetRefCount() <= 0) {
    24                // m_nodes_disconnected.remove(pnode);
    25                // std::this_thread::sleep_for(std::chrono::seconds(5));
    26                DeleteNode(pnode);
    27            }    
    28        }    
    29    } 
    30    {    
    31        // MOVE the erase of m_nodes AFTER DeleteNode(pnode);
    32        LOCK(m_nodes_mutex);
    33
    34        // Delete disconnected nodes
    35        std::list<CNode*> nodes_disconnected_copy = m_nodes_disconnected;
    36        for (CNode* pnode : nodes_disconnected_copy)
    37        {    
    38            // Destroy the object only after other threads have stopped using it.
    39            m_nodes.erase(remove(m_nodes.begin(), m_nodes.end(), pnode), m_nodes.end());
    40            m_nodes_disconnected.remove(pnode);
    41        }    
    42    } 
    

    So, I currently don’t see a clean way to check that the p2p connections are fully dropped.


github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-03-28 15:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me