test: Shut down framework cleanly on RPC connection failure #30660

pull hodlinator wants to merge 5 commits into bitcoin:master from hodlinator:2024-08/test_node_rpc_connection_failure changing 3 files +87 −22
  1. hodlinator commented at 11:53 am on August 15, 2024: contributor

    Without this PR, BitcoinTestFramework.start_nodes(), upon failing to establish RPC connections and calling stop_nodes(), would end up attempting to call the stop()-RPC, raising additional exceptions which made the shutdown less clean.

    Since we now avoiding the additional exceptions which in some cases provided useful clues, we instead capture information about the original exceptions being triggered and ignored within TestNode.wait_for_rpc_connection(). This means we now have the more informative: AssertionError: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 1, 'OSError.ECONNREFUSED': 239}, latest error: ConnectionRefusedError(111, 'Connection refused'))

    As long as the increased complexity and maintenance burden is low, exposing previously silenced exceptions in combination with having a cleaner shutdown clears the muddy waters for devs trying to figure out root causes.

    Related: #30390

  2. DrahtBot commented at 11:53 am on August 15, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage

    For detailed information about the code coverage, see the test coverage report.

    Reviews

    See the guideline for information on the review process. A summary of reviews will appear here.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #30956 (DO NOT MERGE: Windows bitcoind stall debugging by hodlinator)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  3. DrahtBot added the label Tests on Aug 15, 2024
  4. hodlinator commented at 11:54 am on August 15, 2024: contributor

    Encountered a case of #30390 here: https://github.com/bitcoin/bitcoin/actions/runs/10380115902/job/28739417666?pr=30377

    Output with only the test (no fix commit):

     0$ test/functional/feature_framework_rpc_failure.py
     12024-08-15T07:30:20.427000Z TestFramework (INFO): PRNG seed is: 635212259632049987
     22024-08-15T07:30:20.427000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_ujsvbdnm
     32024-08-15T07:30:20.428000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
     42024-08-15T07:30:20.429000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 329, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 192, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 0] Unable to connect to bitcoind after 0s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 21, in setup_network
    18    BitcoinTestFramework.setup_network(self)
    19  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    20    self.setup_nodes()
    21  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    22    self.start_nodes()
    23  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    24    self.stop_nodes()
    25  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    26    node.stop_node(wait=wait, wait_until_stopped=False)
    27  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    28    self.stop(wait=wait)
    29    ^^^^^^^^^
    30  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    31    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    32           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    33AssertionError: [node 0] Error: no RPC connection
    34
    35During handling of the above exception, another exception occurred:
    36
    37Traceback (most recent call last):
    38  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    39    self.setup()
    40  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    41    self.setup_network()
    42  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 23, in setup_network
    43    assert_equal(str(e), "[node 0] Unable to connect to bitcoind after 0s")
    44  File "/home/hodlinator/bitcoin/test/functional/test_framework/util.py", line 75, in assert_equal
    45    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
    46AssertionError: not([node 0] Error: no RPC connection == [node 0] Unable to connect to bitcoind after 0s)
    472024-08-15T07:30:20.480000Z TestFramework (INFO): Stopping nodes
    48Traceback (most recent call last):
    49  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
    50    FeatureFrameworkRPCFailure(__file__).main()
    51  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    52    exit_code = self.shutdown()
    53                ^^^^^^^^^^^^^^^
    54  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    55    self.stop_nodes()
    56  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    57    node.stop_node(wait=wait, wait_until_stopped=False)
    58  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    59    self.stop(wait=wait)
    60    ^^^^^^^^^
    61  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    62    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    63           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    64AssertionError: [node 0] Error: no RPC connection
    65[node 0] Cleaning up leftover process
    

    With fix:

    0$ test/functional/feature_framework_rpc_failure.py
    12024-08-15T07:30:05.092000Z TestFramework (INFO): PRNG seed is: 1672695534155872365
    22024-08-15T07:30:05.092000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_cow4vrf7
    32024-08-15T07:30:05.093000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
    42024-08-15T07:30:05.094000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 326500.
    52024-08-15T07:30:05.144000Z TestFramework (INFO): Caught AssertionError with expected message
    62024-08-15T07:30:05.194000Z TestFramework (INFO): Stopping nodes
    72024-08-15T07:30:05.194000Z TestFramework (INFO): Cleaning up /run/user/1001/bitcoin_func_test_cow4vrf7 on exit
    82024-08-15T07:30:05.194000Z TestFramework (INFO): Tests successful
    
  5. maflcko commented at 12:32 pm on August 15, 2024: member
    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed. However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.
  6. hodlinator force-pushed on Aug 18, 2024
  7. hodlinator commented at 8:23 pm on August 18, 2024: contributor

    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed.

    As long as the additional complexity and maintenance burden is low, I think it is worth having a cleaner shutdown to avoid muddying the waters for people trying to figure out root causes.

    However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.

    It does clean up dangling bitcoind processes without the fix, as can be seen in the log:

    0[node 0] Cleaning up leftover process
    

    The latest push (~bcc87514e780cd35d1f71d5c4c4069a8fac95324~ 0ea00cddc3d4a3f92c6bfae13a8b844e97ebc8c3) now explains in the log message that it is terminating the process to avoid wait_until_stopped timing out (which raises an additional exception).

    + Broke out a commit adding the self.expected_ret_code functionality, so that the fix commit in the middle is clearer. + Attempted to make the comment at the top of the test file more explanatory. + ~Switched from process.kill() to the somewhat less aggressive process.terminate() (although on second thoughts, if the RPC interface is not responding in a real-world situation outside of the added test, maybe .kill() is closer to what we want).~

  8. hodlinator force-pushed on Aug 18, 2024
  9. hodlinator commented at 8:27 pm on August 26, 2024: contributor

    Last week I was hit by this again. Probably line ending differences on Windows makes any additional Python exception reporting extra annoying to read due to every second line being blank (separate issue, but slightly improved by this PR producing less output).

    Git-sleuthing to help motivate this PR

    Was this ever working?

    Unfortunately for my case, it seems the answer is no: 7897338918dac072e788b8ab2919d4559f311bef (Aug, 2017) broke out test_node.py from test_framework.py added the self.rpc_connected flag. Would have been nice if it had protected use of the stop() RPC using this new flag (like in this PR) but unfortunately it did not.

    Fixes in the same area

    fafb33cdefd2d8ce065263978075d26a1672b630 (May, 2016) MarcoFalke wrapped stop() attempts in try/except so at least other nodes would have a chance of shutting down properly. (6 days later in fab5233fe6c99d446a2cebc48cf479c3f026c2cc a timeout for waiting on the process to shutdown was added too, probably so the tests wouldn’t hang waiting for a non-crashed bitcoind that we failed to send stop() to).

    #14670 (Dec, 2018) is tangential and adds test/functional/feature_shutdown.py in 28479f926f21f2a91bec5a06671c60e5b0c55532, to ensure that HTTP responses for stop() make it all the way back to the client if I’m understanding it correctly. It fixes #11777 (Nov, 2017) which is arguably about shutdowns not being clean enough (sometimes showing additional errors depending on timing). The current PR is certainly more nit-level, but still continuing in the same general direction.

  10. hodlinator commented at 10:07 pm on August 26, 2024: contributor

    Another argument for this PR is that even the current title of #30390 is “ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it”. This in itself when looking at logs like https://github.com/bitcoin/bitcoin/actions/runs/9677128783/job/26698120754#step:27:225 is actually just a knock-on exception happening after the RPC connection attempt times out and stop()-RPC is called in vain:

     0313/313 - wallet_listsinceblock.py --legacy-wallet failed, Duration: 2415 s
     1stdout:
     22024-06-26T09:53:17.196000Z TestFramework (INFO): PRNG seed is: 3062882430636553844
     32024-06-26T09:53:17.227000Z TestFramework (INFO): Initializing test directory D:\a\_temp\test_runner__🏃_20240626_093314\wallet_listsinceblock_88
     42024-06-26T10:33:29.783000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 3] Unable to connect to bitcoind after 2400s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 131, in main
    18    self.setup()
    19  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 302, in setup
    20    self.setup_network()
    21  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 396, in setup_network
    22    self.setup_nodes()
    23  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 418, in setup_nodes
    24    self.start_nodes()
    25  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 572, in start_nodes
    26    self.stop_nodes()
    27  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    28    node.stop_node(wait=wait, wait_until_stopped=False)
    29  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    30    self.stop(wait=wait)
    31    ^^^^^^^^^
    32  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 205, in __getattr__
    33    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    34           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    35AssertionError: [node 3] Error: no RPC connection
    362024-06-26T10:33:29.842000Z TestFramework (INFO): Stopping nodes
    37[node 3] Cleaning up leftover process
    38[node 2] Cleaning up leftover process
    39[node 1] Cleaning up leftover process
    40[node 0] Cleaning up leftover process
    41
    42stderr:
    43Traceback (most recent call last):
    44  File "D:\a\bitcoin\bitcoin/test/functional/wallet_listsinceblock.py", line 508, in <module>
    45    ListSinceBlockTest().main()
    46  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 155, in main
    47    exit_code = self.shutdown()
    48                ^^^^^^^^^^^^^^^
    49  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 318, in shutdown
    50    self.stop_nodes()
    51  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    52    node.stop_node(wait=wait, wait_until_stopped=False)
    53  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    54    self.stop(wait=wait)
    55  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\coverage.py", line 50, in __call__
    56    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
    57                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    58  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 127, in __call__
    59    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
    60                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    61  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 105, in _request
    62    self.__conn.request(method, path, postdata, headers)
    63  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1336, in request
    64    self._send_request(method, url, body, headers, encode_chunked)
    65  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1382, in _send_request
    66    self.endheaders(body, encode_chunked=encode_chunked)
    67  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1331, in endheaders
    68    self._send_output(message_body, encode_chunked=encode_chunked)
    69  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1091, in _send_output
    70    self.send(msg)
    71  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1035, in send
    72    self.connect()
    73  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1001, in connect
    74    self.sock = self._create_connection(
    75                ^^^^^^^^^^^^^^^^^^^^^^^^
    76  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 853, in create_connection
    77    raise exceptions[0]
    78  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 838, in create_connection
    79    sock.connect(sa)
    80ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
    

    A better title might have been “Unable to connect to bitcoind after 2400s errors on Windows”, which this PR would have made it more likely to be in the first place (had it predated the issue).

  11. test refactor: wait_for_rpc_connection - Treat OSErrors the same
    ConnectionResetError is an OSError as well, no reason to have a separate except-block for it.
    b72488d9e1
  12. test: Include ignored errors in RPC connection timeout
    When an RPC connection attempt with bitcoind times out, include which ignored errors occurred in the exception message.
    
    May provide clues of what has gone wrong.
    928228a66d
  13. test: Allow changing expected internal return code
    To be used in next commit.
    
    Also adds process IDs in relevant log messages.
    458c00c462
  14. test: kill process if stop-RPC is not available
    `BitcoinTestFramework.start_nodes()`, upon failing to establish RPC connections, calls `stop_nodes()` which would raise an exception immediately through attempting to use the non-existent RPC interface, preventing a clean shutdown.
    
    We now avoid calling the RPC interface and instead kill the bitcoind process, so that subsequent wait_until_stopped() calls will succeed instantly instead of waiting for the full timeout only to raise another exception.
    
    Since explicit killing means the bitcoind process will not exit with return code 0, we adjust the expected value.
    bf36b06de6
  15. test: Add feature_framework_rpc_failure.py
    Verifies correct behavior when failing to connect to bitcoind RPC interface during setup. Parent commit fixes issues where a raised exception raised additional knock-on exceptions.
    3d97816fff
  16. hodlinator force-pushed on Aug 27, 2024
  17. hodlinator commented at 9:54 pm on August 27, 2024: contributor

    After more investigation into #30390, my impression is that since TestNode.wait_for_rpc_connection() ignores a bunch of exceptions, the knock-on exceptions when calling stop() and failing prior to this PR actually may sometimes provide a clue for why the RPC connection wasn’t established in the first place.

    That realization made me add 2 more commits here, which provide more information on the original exceptions being triggered and ignored inside wait_for_rpc_connection. Only fair since we are avoiding the triggering of knock-on exceptions later.

    The timeout exception’s message thereby now includes counters of encountered (and ignored) exceptions + latest exception type & message: Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 5}, latest error: ValueError('No RPC credentials'))

  18. DrahtBot added the label CI failed on Aug 27, 2024

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2024-09-29 01:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me