qa: Verify clean shutdown on startup failure #30660

pull hodlinator wants to merge 5 commits into bitcoin:master from hodlinator:2024-08/test_node_rpc_connection_failure changing 4 files +147 −31
  1. hodlinator commented at 11:53 am on August 15, 2024: contributor

    Improves handling of startup errors in functional tests and puts tests in place to ensure knock-on errors don’t creep in.

    • wait_for_rpc_connection() now appends specific failures leading up to the Unable to connect to bitcoind error to that error message: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 1, 'OSError.ECONNREFUSED': 239}, latest error: ConnectionRefusedError(111, 'Connection refused'))
    • Fixes Windows Python issue where socket.timeout exceptions end up with unset errno-fields.
    • Also adds comments, refactors code, improves logging.

    The underlying purpose is to ensure developer efficiency in finding root causes of test failures.

    Prior iterations of the PR partially focused on fixing the same issue as #31620. Originally inspired by #30390.

    Testing

    Can be tested by reverting either faf2f2c654d9aa18b2f49a157956f9ab0fce302a or fae3bf6b870eb0f9cddd1adac82ba72890806ae3 from #31620, or the “qa: Avoid calling stop-RPC if not connected” from this PR, and running feature_framework_startup_failures.py. Checking out the commit which adds the tests before applying the reverts avoids conflicts with the last logging-change-only commit.

  2. DrahtBot commented at 11:53 am on August 15, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30660.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK ryanofsky
    Stale ACK tdb3, m3dwards, l0rinc

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label Tests on Aug 15, 2024
  4. hodlinator commented at 11:54 am on August 15, 2024: contributor

    Encountered a case of #30390 here: https://github.com/bitcoin/bitcoin/actions/runs/10380115902/job/28739417666?pr=30377

    Output with only the test (no fix commit):

     0$ test/functional/feature_framework_rpc_failure.py
     12024-08-15T07:30:20.427000Z TestFramework (INFO): PRNG seed is: 635212259632049987
     22024-08-15T07:30:20.427000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_ujsvbdnm
     32024-08-15T07:30:20.428000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
     42024-08-15T07:30:20.429000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 329, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 192, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 0] Unable to connect to bitcoind after 0s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 21, in setup_network
    18    BitcoinTestFramework.setup_network(self)
    19  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    20    self.setup_nodes()
    21  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    22    self.start_nodes()
    23  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    24    self.stop_nodes()
    25  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    26    node.stop_node(wait=wait, wait_until_stopped=False)
    27  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    28    self.stop(wait=wait)
    29    ^^^^^^^^^
    30  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    31    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    32           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    33AssertionError: [node 0] Error: no RPC connection
    34
    35During handling of the above exception, another exception occurred:
    36
    37Traceback (most recent call last):
    38  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    39    self.setup()
    40  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    41    self.setup_network()
    42  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 23, in setup_network
    43    assert_equal(str(e), "[node 0] Unable to connect to bitcoind after 0s")
    44  File "/home/hodlinator/bitcoin/test/functional/test_framework/util.py", line 75, in assert_equal
    45    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
    46AssertionError: not([node 0] Error: no RPC connection == [node 0] Unable to connect to bitcoind after 0s)
    472024-08-15T07:30:20.480000Z TestFramework (INFO): Stopping nodes
    48Traceback (most recent call last):
    49  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
    50    FeatureFrameworkRPCFailure(__file__).main()
    51  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    52    exit_code = self.shutdown()
    53                ^^^^^^^^^^^^^^^
    54  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    55    self.stop_nodes()
    56  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    57    node.stop_node(wait=wait, wait_until_stopped=False)
    58  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    59    self.stop(wait=wait)
    60    ^^^^^^^^^
    61  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    62    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    63           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    64AssertionError: [node 0] Error: no RPC connection
    65[node 0] Cleaning up leftover process
    

    With fix:

    0$ test/functional/feature_framework_rpc_failure.py
    12024-08-15T07:30:05.092000Z TestFramework (INFO): PRNG seed is: 1672695534155872365
    22024-08-15T07:30:05.092000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_cow4vrf7
    32024-08-15T07:30:05.093000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
    42024-08-15T07:30:05.094000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 326500.
    52024-08-15T07:30:05.144000Z TestFramework (INFO): Caught AssertionError with expected message
    62024-08-15T07:30:05.194000Z TestFramework (INFO): Stopping nodes
    72024-08-15T07:30:05.194000Z TestFramework (INFO): Cleaning up /run/user/1001/bitcoin_func_test_cow4vrf7 on exit
    82024-08-15T07:30:05.194000Z TestFramework (INFO): Tests successful
    
  5. maflcko commented at 12:32 pm on August 15, 2024: member
    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed. However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.
  6. hodlinator force-pushed on Aug 18, 2024
  7. hodlinator commented at 8:23 pm on August 18, 2024: contributor

    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed.

    As long as the additional complexity and maintenance burden is low, I think it is worth having a cleaner shutdown to avoid muddying the waters for people trying to figure out root causes.

    However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.

    It does clean up dangling bitcoind processes without the fix, as can be seen in the log:

    0[node 0] Cleaning up leftover process
    

    The latest push (bcc87514e780cd35d1f71d5c4c4069a8fac95324 0ea00cddc3d4a3f92c6bfae13a8b844e97ebc8c3) now explains in the log message that it is terminating the process to avoid wait_until_stopped timing out (which raises an additional exception).

    + Broke out a commit adding the self.expected_ret_code functionality, so that the fix commit in the middle is clearer. + Attempted to make the comment at the top of the test file more explanatory. + Switched from process.kill() to the somewhat less aggressive process.terminate() (although on second thoughts, if the RPC interface is not responding in a real-world situation outside of the added test, maybe .kill() is closer to what we want).

  8. hodlinator force-pushed on Aug 18, 2024
  9. hodlinator commented at 8:27 pm on August 26, 2024: contributor

    Last week I was hit by this again. Probably line ending differences on Windows makes any additional Python exception reporting extra annoying to read due to every second line being blank (separate issue, but slightly improved by this PR producing less output).

    Git-sleuthing to help motivate this PR

    Was this ever working?

    Unfortunately for my case, it seems the answer is no: 7897338918dac072e788b8ab2919d4559f311bef (Aug, 2017) broke out test_node.py from test_framework.py added the self.rpc_connected flag. Would have been nice if it had protected use of the stop() RPC using this new flag (like in this PR) but unfortunately it did not.

    Fixes in the same area

    fafb33cdefd2d8ce065263978075d26a1672b630 (May, 2016) MarcoFalke wrapped stop() attempts in try/except so at least other nodes would have a chance of shutting down properly. (6 days later in fab5233fe6c99d446a2cebc48cf479c3f026c2cc a timeout for waiting on the process to shutdown was added too, probably so the tests wouldn’t hang waiting for a non-crashed bitcoind that we failed to send stop() to).

    #14670 (Dec, 2018) is tangential and adds test/functional/feature_shutdown.py in 28479f926f21f2a91bec5a06671c60e5b0c55532, to ensure that HTTP responses for stop() make it all the way back to the client if I’m understanding it correctly. It fixes #11777 (Nov, 2017) which is arguably about shutdowns not being clean enough (sometimes showing additional errors depending on timing). The current PR is certainly more nit-level, but still continuing in the same general direction.

  10. hodlinator commented at 10:07 pm on August 26, 2024: contributor

    Another argument for this PR is that even the current title of #30390 is “ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it”. This in itself when looking at logs like https://github.com/bitcoin/bitcoin/actions/runs/9677128783/job/26698120754#step:27:225 is actually just a knock-on exception happening after the RPC connection attempt times out and stop()-RPC is called in vain:

     0313/313 - wallet_listsinceblock.py --legacy-wallet failed, Duration: 2415 s
     1stdout:
     22024-06-26T09:53:17.196000Z TestFramework (INFO): PRNG seed is: 3062882430636553844
     32024-06-26T09:53:17.227000Z TestFramework (INFO): Initializing test directory D:\a\_temp\test_runner__🏃_20240626_093314\wallet_listsinceblock_88
     42024-06-26T10:33:29.783000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 3] Unable to connect to bitcoind after 2400s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 131, in main
    18    self.setup()
    19  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 302, in setup
    20    self.setup_network()
    21  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 396, in setup_network
    22    self.setup_nodes()
    23  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 418, in setup_nodes
    24    self.start_nodes()
    25  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 572, in start_nodes
    26    self.stop_nodes()
    27  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    28    node.stop_node(wait=wait, wait_until_stopped=False)
    29  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    30    self.stop(wait=wait)
    31    ^^^^^^^^^
    32  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 205, in __getattr__
    33    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    34           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    35AssertionError: [node 3] Error: no RPC connection
    362024-06-26T10:33:29.842000Z TestFramework (INFO): Stopping nodes
    37[node 3] Cleaning up leftover process
    38[node 2] Cleaning up leftover process
    39[node 1] Cleaning up leftover process
    40[node 0] Cleaning up leftover process
    41
    42stderr:
    43Traceback (most recent call last):
    44  File "D:\a\bitcoin\bitcoin/test/functional/wallet_listsinceblock.py", line 508, in <module>
    45    ListSinceBlockTest().main()
    46  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 155, in main
    47    exit_code = self.shutdown()
    48                ^^^^^^^^^^^^^^^
    49  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 318, in shutdown
    50    self.stop_nodes()
    51  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    52    node.stop_node(wait=wait, wait_until_stopped=False)
    53  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    54    self.stop(wait=wait)
    55  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\coverage.py", line 50, in __call__
    56    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
    57                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    58  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 127, in __call__
    59    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
    60                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    61  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 105, in _request
    62    self.__conn.request(method, path, postdata, headers)
    63  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1336, in request
    64    self._send_request(method, url, body, headers, encode_chunked)
    65  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1382, in _send_request
    66    self.endheaders(body, encode_chunked=encode_chunked)
    67  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1331, in endheaders
    68    self._send_output(message_body, encode_chunked=encode_chunked)
    69  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1091, in _send_output
    70    self.send(msg)
    71  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1035, in send
    72    self.connect()
    73  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1001, in connect
    74    self.sock = self._create_connection(
    75                ^^^^^^^^^^^^^^^^^^^^^^^^
    76  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 853, in create_connection
    77    raise exceptions[0]
    78  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 838, in create_connection
    79    sock.connect(sa)
    80ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
    

    A better title might have been “Unable to connect to bitcoind after 2400s errors on Windows”, which this PR would have made it more likely to be in the first place (had it predated the issue).

  11. hodlinator force-pushed on Aug 27, 2024
  12. hodlinator commented at 9:54 pm on August 27, 2024: contributor

    After more investigation into #30390, my impression is that since TestNode.wait_for_rpc_connection() ignores a bunch of exceptions, the knock-on exceptions when calling stop() and failing prior to this PR actually may sometimes provide a clue for why the RPC connection wasn’t established in the first place.

    That realization made me add 2 more commits here, which provide more information on the original exceptions being triggered and ignored inside wait_for_rpc_connection. Only fair since we are avoiding the triggering of knock-on exceptions later.

    The timeout exception’s message thereby now includes counters of encountered (and ignored) exceptions + latest exception type & message: Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 5}, latest error: ValueError('No RPC credentials'))

  13. DrahtBot added the label CI failed on Aug 27, 2024
  14. DrahtBot removed the label CI failed on Sep 29, 2024
  15. DrahtBot added the label CI failed on Oct 13, 2024
  16. achow101 requested review from josibake on Oct 15, 2024
  17. achow101 requested review from willcl-ark on Oct 15, 2024
  18. achow101 commented at 3:30 pm on October 15, 2024: member
    cc @tdb3
  19. DrahtBot removed the label CI failed on Oct 18, 2024
  20. hodlinator force-pushed on Oct 21, 2024
  21. hodlinator commented at 10:14 pm on October 21, 2024: contributor
    (Rebased onto latest master).
  22. in test/functional/test_framework/test_node.py:324 in 5c95843281 outdated
    320+                    ignored_errors["JSONRPCException -28"] += 1
    321                 # -342 Service unavailable, RPC server started but is shutting down due to error
    322-                if e.error['code'] != -28 and e.error['code'] != -342:
    323+                # - Ignore the exception to properly raise the FailedToStartError.
    324+                elif e.error['code'] == -342:
    325+                    ignored_errors["JSONRPCException -342"] += 1
    


    tdb3 commented at 6:19 pm on October 31, 2024:

    non-blocking nit: might be a bit more Pythonic to use Counter.update() rather than += 1. For example:

    0ignored_errors.update([f"JSONRPCException {e.error['code']}"])
    

    hodlinator commented at 8:23 pm on November 2, 2024:

    Tried using update(), but it resulted in things like:

    0AssertionError: [node 0] Unable to connect to bitcoind after 10s (ignored errors: {'m': 40, 'i': 120, 's': 120, 'n': 80, 'g': 40, '_': 40, 'c': 40, 'r': 40, 'e': 80, 'd': 40, 't': 40, 'a': 40, 'l': 40}, latest error: ValueError('No RPC credentials'))
    

    i.e. it was iterating over the string and incrementing an entry for each unique character, which seems to match the docs even though not super clear.~


    Edit: @tdb3 pointed out out of band than I’d probably missed out the []-brackets around the strings to make update() iterate through whole strings instead of chars.

    0ignored_errors.update([f"JSONRPCException {e.error['code']}"])
    

    feels slightly more contrived and verbose to me than

    0ignored_errors[f"JSONRPCException {e.error['code']}"] += 1
    

    but neither is too elegant.

    Might switch away from Counter if something more appropriate is found.

  23. in test/functional/test_framework/test_node.py:336 in 5c95843281 outdated
    335                 elif e.errno == errno.ETIMEDOUT:
    336-                    pass # Treat identical to ECONNRESET
    337+                    ignored_errors["OSError.ETIMEDOUT"] += 1 # Treat identical to ECONNRESET
    338                 elif e.errno == errno.ECONNREFUSED:
    339-                    pass  # Port not yet open?
    340+                    ignored_errors["OSError.ECONNREFUSED"] += 1 # Port not yet open?
    


    tdb3 commented at 6:27 pm on October 31, 2024:

    non-blocking nit: same with these. e.g.

    0ignored_errors.update([errno.errorcode[e.errno]])
    

    hodlinator commented at 8:21 pm on November 2, 2024:
    (Same issue with update(), see #30660 (review)).
  24. in test/functional/feature_framework_rpc_failure.py:35 in 9173b05600 outdated
    30+            BitcoinTestFramework.setup_network(self)
    31+        except AssertionError as e:
    32+            assert_equal(str(e)[:47], "[node 0] Unable to connect to bitcoind after 0s")
    33+            self.log.info(f"{test_name}: Caught AssertionError with expected RPC connection failure message")
    34+        else:
    35+            raise AssertionError("Didn't raise expected error")
    


    tdb3 commented at 6:47 pm on October 31, 2024:
    Maybe we can use util.assert_raises() or util.assert_raises_message() here to keep things more concise?

    hodlinator commented at 8:22 pm on November 2, 2024:
    Had to add a commit (2a0c0e410e9f32dbf7af7229457889b09553c438) that refurbishes assert_raises_message(), but test is much cleaner now!
  25. tdb3 approved
  26. tdb3 commented at 6:49 pm on October 31, 2024: contributor

    Code review ACK 9173b056006fd3940675108aa3fe0d0d855af768

    This seems quite useful when troubleshooting errors. Left a few nits, but nothing I feel very strongly about.

  27. hodlinator force-pushed on Nov 2, 2024
  28. hodlinator commented at 8:26 pm on November 2, 2024: contributor

    Thanks for the review @tdb3!

    Implemented feedback from that review, made the code a bit more Pythonic and improved comments slightly.

  29. in test/functional/test_framework/test_node.py:341 in b6b30c2d9c outdated
    337                     raise
    338+                ignored_errors["missing_credentials"] += 1
    339+                latest_error = repr(e)
    340             time.sleep(1.0 / poll_per_s)
    341-        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    342+        self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored errors: {}, latest error: {})".format(self.rpc_timeout, str(dict(ignored_errors)), latest_error))
    


    tdb3 commented at 10:50 pm on November 2, 2024:
    non-blocking pico-nit: If this file gets touched again, could use f string instead of format. https://github.com/bitcoin/bitcoin/blob/master/test/functional/README.md#style-guidelines

    hodlinator commented at 11:11 pm on November 2, 2024:
    Good point! Resolved in latest push.
  30. tdb3 approved
  31. tdb3 commented at 10:59 pm on November 2, 2024: contributor

    Code Review re ACK 1148c4a53295e7635a813aadfde53ba5d973395c

    Nice work cleaning up the existing code around the core changes. Increases readability.

  32. hodlinator force-pushed on Nov 2, 2024
  33. tdb3 approved
  34. tdb3 commented at 11:15 pm on November 2, 2024: contributor
    re ACK 042cee7073b338d79b5176d4157cf08bbd079b9f
  35. in test/functional/feature_framework_rpc_failure.py:26 in 042cee7073 outdated
    21+    def set_test_params(self):
    22+        self.num_nodes = 1
    23+
    24+    def setup_network(self):
    25+        test_name = Path(__file__).name
    26+        self.log.info(f"{test_name}: Setting RPC timeout to 0 to simulate an unresponsive bitcoind")
    


    maflcko commented at 10:15 am on November 4, 2024:

    Any reason to log the file name here? The containing folder of the log should already have the name of the test/file. If not, the user should know which test they called to know it.

    If not, and this was helpful, it would be better to log it for all tests, instead of just for this one test specifically.


    hodlinator commented at 10:44 am on November 8, 2024:
    Wanted to make clear it wasn’t something internal to BitcoinTestFramework/TestNode deciding to set the RPC timeout. Replaced that with instead appending “, expect one warning” to the log message. (Edit: now pushed).
  36. in test/functional/test_framework/test_node.py:315 in 5ca92b17c5 outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.
    


    maflcko commented at 11:24 am on November 4, 2024:

    nit in 5ca92b17c5d2d5efa7b2246da1905bb8fd186230: It is obvious, but could mention the success case as well?

    Retry after these, until a permanent success or failure state (such as FailedToStartError, or timeout) is reached.?


    hodlinator commented at 10:47 am on November 8, 2024:
    Good point, changed to Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).
  37. in test/functional/test_framework/test_node.py:317 in 5ca92b17c5 outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.
    317                 # -28 RPC in warmup
    318                 # -342 Service unavailable, RPC server started but is shutting down due to error
    


    maflcko commented at 11:36 am on November 4, 2024:
    I don’t think this is true, it is just one example. I think 342 can be raised by authproxy.py for various reasons.

    hodlinator commented at 10:48 am on November 8, 2024:
    Made comment less definitive.
  38. in test/functional/test_framework/test_node.py:331 in 5ca92b17c5 outdated
    330-                    pass  # Port not yet open?
    331-                else:
    332+                # ECONNRESET   - This might happen when the RPC server is in warmup, but shut down before the call to
    333+                #                getblockcount succeeds. Try again to properly raise the FailedToStartError
    334+                # ETIMEDOUT    - Treat identical to ECONNRESET
    335+                # ECONNREFUSED - Port not yet open?
    


    maflcko commented at 11:42 am on November 4, 2024:
    nit: Could just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?

    hodlinator commented at 10:48 am on November 8, 2024:
    Added: Suppress similarly to the above JSONRPCException errors:

    maflcko commented at 3:06 pm on November 8, 2024:

    I still don’t know if the current docs are correct and useful.

    Why would ETIMEDOUT only happen on warmup? Why is it even relevant to mention anything here?

    I think it would be clearer to just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?


    maflcko commented at 11:27 am on November 18, 2024:

    re #30660 (review)

    Are you still working on this?


    hodlinator commented at 10:19 pm on November 18, 2024:
    I am still working on this PR. Not thrilled about removing the 3 comment strings predating this PR.

    hodlinator commented at 10:35 am on November 26, 2024:
    I am satisfied with the PR in it’s latest revision, but if you don’t consider this a nit, but rather a blocker, let me know.

    ryanofsky commented at 7:57 pm on January 6, 2025:

    In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

    I think these documentation changes look ok, but looking at old and new versions side by side I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

    E.g. the previous “# Initialization phase” comment next to the exception being caught told me that the exception was expected during initialization. The current comment “Suppress these in favor of a later outcome” tells me the code does not care about this exception as much as other exceptions, but not why.

    Am fine with both versions though.


    hodlinator commented at 4:43 pm on January 9, 2025:

    I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

    Hopefully this is from the latest push can be considered an improvement:

    0if e.errno not in [ errno.ECONNRESET,     # This might happen when the RPC server is in warmup,
    1                                          # but shut down before the call to getblockcount succeeds.
    2                    errno.ETIMEDOUT,      # Treat identical to ECONNRESET
    3                    errno.ECONNREFUSED ]: # Port not yet open?
    
  39. in test/functional/test_framework/test_node.py:328 in 5ca92b17c5 outdated
    335+                # ECONNREFUSED - Port not yet open?
    336+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
    337                     raise  # unknown OS error
    338-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    339+            except ValueError as e:
    340+                # If cookie file is missing and no rpcuser or rpcpassword; bitcoind is still starting, otherwise raise
    


    maflcko commented at 11:47 am on November 4, 2024:

    nit: I don’t think “bitcoind is still starting” is true either. It is just one example. Another example would be that the cookie file is (accidentally) disabled in the config.

    What about just `Retry if cookie file is not created and no rpcuser or rpcpassword."?


    hodlinator commented at 10:49 am on November 8, 2024:
    Agree, made comment less definitive, but still think it’s good to include the common cause: Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting
  40. maflcko commented at 6:18 pm on November 6, 2024: member
    left some nits on the first commit. Will take a look at the others later.
  41. hodlinator force-pushed on Nov 8, 2024
  42. hodlinator commented at 10:56 am on November 8, 2024: contributor
    Thanks for the reviews! Was trying to keep comments largely intact, but agree with @maflcko that they were somewhat incorrect, so altered them. Also renamed ignored_errors (introduced by this PR) to suppressed_errors since it felt more fitting and also matches the updated comments.
  43. DrahtBot added the label CI failed on Nov 8, 2024
  44. DrahtBot commented at 11:32 am on November 8, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32707688603

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  45. hodlinator force-pushed on Nov 8, 2024
  46. DrahtBot removed the label CI failed on Nov 8, 2024
  47. tdb3 approved
  48. tdb3 commented at 5:32 pm on November 9, 2024: contributor
    Code Review re ACK 68b0530d5fe867dbee5e551ecf6d491fecbc77f1
  49. m3dwards commented at 10:40 am on November 18, 2024: contributor

    Concept ACK

    I like the motivation to cleanup the logs and remove the cascade of exceptions.

    To test, I modified the included test to see what errors a user would have seen without the assertion:

     0--- a/test/functional/feature_framework_rpc_failure.py
     1+++ b/test/functional/feature_framework_rpc_failure.py
     2@@ -23,7 +23,8 @@ class FeatureFrameworkRPCFailure(BitcoinTestFramework):
     3         self.log.info(f"Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
     4         self.rpc_timeout = 0
     5
     6-        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)
     7+        BitcoinTestFramework.setup_network(self)
     8
     9     def run_test(self):
    10         pass
    
     02024-11-18T09:59:33.936000Z TestFramework (INFO): PRNG seed is: 6704525597374825472
     12024-11-18T09:59:33.937000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_kccj2gaf
     22024-11-18T09:59:33.939000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
     32024-11-18T09:59:33.944000Z TestFramework (ERROR): Assertion failed
     4Traceback (most recent call last):
     5  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     6    node.wait_for_rpc_connection()
     7  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 335, in wait_for_rpc_connection
     8    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
     9  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 198, in _raise_assertion_error
    10    raise AssertionError(self._node_msg(msg))
    11AssertionError: [node 0] Unable to connect to bitcoind after 0s
    12
    13During handling of the above exception, another exception occurred:
    14
    15Traceback (most recent call last):
    16  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 131, in main
    17    self.setup()
    18  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 302, in setup
    19    self.setup_network()
    20  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 24, in setup_network
    21    BitcoinTestFramework.setup_network(self)
    22  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 396, in setup_network
    23    self.setup_nodes()
    24  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    25    self.start_nodes()
    26  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    27    self.stop_nodes()
    28  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    29    node.stop_node(wait=wait, wait_until_stopped=False)
    30  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
    31    self.stop(wait=wait)
    32    ^^^^^^^^^
    33  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
    34    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    35           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    36AssertionError: [node 0] Error: no RPC connection
    372024-11-18T09:59:33.997000Z TestFramework (INFO): Stopping nodes
    38Traceback (most recent call last):
    39  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
    40    FeatureFrameworkRPCFailure(__file__).main()
    41  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 155, in main
    42    exit_code = self.shutdown()
    43                ^^^^^^^^^^^^^^^
    44  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 318, in shutdown
    45    self.stop_nodes()
    46  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    47    node.stop_node(wait=wait, wait_until_stopped=False)
    48  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
    49    self.stop(wait=wait)
    50    ^^^^^^^^^
    51  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
    52    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    53           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    54AssertionError: [node 0] Error: no RPC connection
    55[node 0] Cleaning up leftover process
    
     02024-11-18T10:09:52.135000Z TestFramework (INFO): PRNG seed is: 7321683449385937851
     12024-11-18T10:09:52.135000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
     22024-11-18T10:09:52.137000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
     32024-11-18T10:09:52.141000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 47850 so that wait_until_stopped will not time out.
     42024-11-18T10:09:52.194000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
     7    self.setup()
     8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
     9    self.setup_network()
    10  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 26, in setup_network
    11    BitcoinTestFramework.setup_network(self)
    12  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    13    self.setup_nodes()
    14  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    15    self.start_nodes()
    16  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
    17    node.wait_for_rpc_connection()
    18  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 344, in wait_for_rpc_connection
    19    self._raise_assertion_error(f"Unable to connect to bitcoind after {self.rpc_timeout}s (ignored errors: {str(dict(suppressed_errors))}, latest error: {latest_error})")
    20  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 199, in _raise_assertion_error
    21    raise AssertionError(self._node_msg(msg))
    22AssertionError: [node 0] Unable to connect to bitcoind after 0s (ignored errors: {}, latest error: )
    232024-11-18T10:09:52.250000Z TestFramework (INFO): Stopping nodes
    242024-11-18T10:09:52.250000Z TestFramework (WARNING): Not cleaning up dir /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
    252024-11-18T10:09:52.250000Z TestFramework (ERROR): Test failed. Test logging available at /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a/test_framework.log
    262024-11-18T10:09:52.250000Z TestFramework (ERROR):
    272024-11-18T10:09:52.251000Z TestFramework (ERROR): Hint: Call /Users/max/source/bitcoin/test/functional/combine_logs.py '/var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a' to consolidate all logs
    282024-11-18T10:09:52.251000Z TestFramework (ERROR):
    292024-11-18T10:09:52.251000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
    302024-11-18T10:09:52.251000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
    312024-11-18T10:09:52.251000Z TestFramework (ERROR):
    

    I prefer the logs with this change!

    Next I modified bitcoind so that it would fail to start and ran a random functional test (p2p_blockfilters.py)

     02024-11-18T09:39:40.861000Z TestFramework (INFO): PRNG seed is: 8730495587434508112
     12024-11-18T09:39:40.862000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_0gw_zhhl
     22024-11-18T09:39:41.372000Z TestFramework (ERROR): Assertion failed
     3Traceback (most recent call last):
     4  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     5    node.wait_for_rpc_connection()
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 276, in wait_for_rpc_connection
     7    raise FailedToStartError(self._node_msg(
     8test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
     9
    10During handling of the above exception, another exception occurred:
    11
    12Traceback (most recent call last):
    13  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    14    self.setup()
    15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    16    self.setup_network()
    17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    18    self.setup_nodes()
    19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    20    self.start_nodes()
    21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    22    self.stop_nodes()
    23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    24    node.stop_node(wait=wait, wait_until_stopped=False)
    25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
    26    self.stop(wait=wait)
    27    ^^^^^^^^^
    28  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
    29    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    30           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    31AssertionError: [node 0] Error: no RPC connection
    322024-11-18T09:39:41.431000Z TestFramework (INFO): Stopping nodes
    33Traceback (most recent call last):
    34  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
    35    CompactFiltersTest(__file__).main()
    36  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    37    exit_code = self.shutdown()
    38                ^^^^^^^^^^^^^^^
    39  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    40    self.stop_nodes()
    41  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    42    node.stop_node(wait=wait, wait_until_stopped=False)
    43  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
    44    self.stop(wait=wait)
    45    ^^^^^^^^^
    46  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
    47    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    48           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    49AssertionError: [node 0] Error: no RPC connection
    50[node 1] Cleaning up leftover process
    51[node 0] Cleaning up leftover process
    
     02024-11-15T13:07:28.308000Z TestFramework (INFO): PRNG seed is: 2784953090753807617
     12024-11-15T13:07:28.308000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_udgky3fr
     22024-11-15T13:07:28.821000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
     32024-11-15T13:07:28.821000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
     42024-11-15T13:07:28.821000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
     9    raise FailedToStartError(self._node_msg(
    10test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
    11
    12During handling of the above exception, another exception occurred:
    13
    14Traceback (most recent call last):
    15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    16    self.setup()
    17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    18    self.setup_network()
    19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    20    self.setup_nodes()
    21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    22    self.start_nodes()
    23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    24    self.stop_nodes()
    25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
    26    node.wait_until_stopped()
    27  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
    28    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    29  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
    30    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
    31           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    32  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
    33    if predicate():
    34       ^^^^^^^^^^^
    35  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
    36    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    37                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    38  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
    39    assert return_code == expected_ret_code, self._node_msg(
    40           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    41AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
    422024-11-15T13:07:28.880000Z TestFramework (INFO): Stopping nodes
    432024-11-15T13:07:28.881000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
    442024-11-15T13:07:28.881000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
    45Traceback (most recent call last):
    46  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
    47    CompactFiltersTest(__file__).main()
    48  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    49    exit_code = self.shutdown()
    50                ^^^^^^^^^^^^^^^
    51  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    52    self.stop_nodes()
    53  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
    54    node.wait_until_stopped()
    55  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
    56    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    57  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
    58    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
    59           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    60  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
    61    if predicate():
    62       ^^^^^^^^^^^
    63  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
    64    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    65                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    66  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
    67    assert return_code == expected_ret_code, self._node_msg(
    68           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    69AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
    70[node 1] Cleaning up leftover process 26868
    71[node 0] Cleaning up leftover process 26867
    

    I’m not sure if it’s the intention of this PR to improve things in that scenario but the log doesn’t seem much better and if anything it was slightly easier to understand before.

    Before this change the logs showed AssertionError: [node 0] Error: no RPC connection and after this change that is now AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping. I feel like no RPC connection is the better error here?

  50. hodlinator force-pushed on Nov 18, 2024
  51. hodlinator commented at 11:13 pm on November 18, 2024: contributor

    Thanks @m3dwards for having a look!

    I had not tested the case of bitcoind exiting early together with my suggested change here (focused more on stalling as in one of the issues inspiring it).

    Made some changes to this part of TestNode.stop_node() to handle an already dead process:

    https://github.com/bitcoin/bitcoin/blob/ba28593147ccc880c3a4d40db0d4ef57f4766254/test/functional/test_framework/test_node.py#L412-L420

    With those changes, the output for such a case is now:

     0 build/test/functional/p2p_blockfilters.py 
     12024-11-18T23:01:17.204000Z TestFramework (INFO): PRNG seed is: 5667811122664278008
     22024-11-18T23:01:17.205000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_mzr3jg96
     32024-11-18T23:01:17.458000Z TestFramework.node0 (WARNING): Process 175180 already died with return code 1.
     42024-11-18T23:01:17.458000Z TestFramework.node1 (WARNING): Process 175181 already died with return code 1.
     52024-11-18T23:01:17.459000Z TestFramework (ERROR): Unexpected exception caught during testing
     6Traceback (most recent call last):
     7  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
     8    self.setup()
     9  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    10    self.setup_network()
    11  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    12    self.setup_nodes()
    13  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    14    self.start_nodes()
    15  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
    16    node.wait_for_rpc_connection()
    17  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
    18    raise FailedToStartError(self._node_msg(
    19test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization. 
    202024-11-18T23:01:17.513000Z TestFramework (INFO): Stopping nodes
    212024-11-18T23:01:17.513000Z TestFramework (WARNING): Not cleaning up dir /tmp/bitcoin_func_test_mzr3jg96
    222024-11-18T23:01:17.513000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/bitcoin_func_test_mzr3jg96/test_framework.log
    232024-11-18T23:01:17.513000Z TestFramework (ERROR): 
    242024-11-18T23:01:17.513000Z TestFramework (ERROR): Hint: Call /home/hodlinator/bitcoin/test/functional/combine_logs.py '/tmp/bitcoin_func_test_mzr3jg96' to consolidate all logs
    252024-11-18T23:01:17.513000Z TestFramework (ERROR): 
    262024-11-18T23:01:17.514000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
    272024-11-18T23:01:17.514000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
    282024-11-18T23:01:17.514000Z TestFramework (ERROR):
    
  52. m3dwards commented at 1:13 pm on November 21, 2024: contributor

    Made some changes to this part of TestNode.stop_node() to handle an already dead process

    Looks cleaner.

    ACK ba28593147ccc880c3a4d40db0d4ef57f4766254

  53. DrahtBot requested review from tdb3 on Nov 21, 2024
  54. in test/functional/test_framework/test_node.py:8 in ba28593147 outdated
    4@@ -5,6 +5,7 @@
    5 """Class for bitcoind node under test"""
    6 
    7 import contextlib
    8+from collections import Counter
    


    hodlinator commented at 8:31 pm on December 3, 2024:
    (Could possibly use collections.defaultdict(int) instead of Counter iff I really need to retouch).
  55. DrahtBot added the label CI failed on Dec 11, 2024
  56. DrahtBot commented at 1:30 am on December 11, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33168290506

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  57. hodlinator force-pushed on Dec 11, 2024
  58. hodlinator commented at 8:29 pm on December 11, 2024: contributor
    • Fixed ruff linter error reported by CI due to #31417 being merged.
    • Added feature_framework_stop_node.py-test to verify clean shutdown of TestNode when the bitcoind-process already exited/crashed, issue found earlier by m3dwards. (Also added comment justifying setting of self.expected_ret_code in test_node.py).
    • Switched from Counter to simpler collections.defaultdict(int)-type.
    • Updated copyright headers to include “-present”.
  59. hodlinator force-pushed on Dec 11, 2024
  60. DrahtBot removed the label CI failed on Dec 11, 2024
  61. in test/functional/test_framework/test_node.py:329 in 5321eb089c outdated
    337+                # ECONNREFUSED - Port not yet open?
    338+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
    339                     raise  # unknown OS error
    340-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    341+            except ValueError as e:
    342+                # Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting
    


    ryanofsky commented at 7:59 pm on January 6, 2025:

    In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

    Found both this comment “cookie file is missing” and previous comment “cookie file not found” confusing because it is not clear why the cookie file would not be present. IMO saying “node has not generated the cookie file yet” would be clearer.


    hodlinator commented at 4:41 pm on January 9, 2025:
    New version: # Suppress if cookie file isn't generated yet and no rpcuser or rpcpassword; bitcoind may be starting.
  62. in test/functional/test_framework/test_node.py:453 in db0e1580a5 outdated
    447@@ -447,7 +448,9 @@ def is_node_stopped(self, *, expected_stderr="", expected_ret_code=0):
    448 
    449     def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
    450         if "expected_ret_code" not in kwargs:
    451-            kwargs["expected_ret_code"] = 1 if expect_error else 0  # Whether node shutdown return EXIT_FAILURE or EXIT_SUCCESS
    452+            # Whether node shutdown is expected to return EXIT_FAILURE (1) or
    453+            # EXIT_SUCCESS (self.expected_ret_code).
    454+            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code
    


    ryanofsky commented at 8:30 pm on January 6, 2025:

    In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

    Now that this logic is getting more complicated with self.expected_ret_code I think the expect_error argument should be dropped so the code can be more understandable. It looks like there is only a single expect_error usage in feature_abortnode.py and it can be easily replaced by using expected_ret_code. Would suggest:

     0--- a/test/functional/feature_abortnode.py
     1+++ b/test/functional/feature_abortnode.py
     2@@ -36,7 +36,7 @@ class AbortNodeTest(BitcoinTestFramework):
     3 
     4             # Check that node0 aborted
     5             self.log.info("Waiting for crash")
     6-            self.nodes[0].wait_until_stopped(timeout=5, expect_error=True, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
     7+            self.nodes[0].wait_until_stopped(timeout=5, expected_ret_code=1, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
     8         self.log.info("Node crashed - now verifying restart fails")
     9         self.nodes[0].assert_start_raises_init_error()
    10 
    11--- a/test/functional/test_framework/test_node.py
    12+++ b/test/functional/test_framework/test_node.py
    13@@ -460,11 +460,11 @@ class TestNode():
    14         self.log.debug("Node stopped")
    15         return True
    16 
    17-    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
    18+    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, **kwargs):
    19         if "expected_ret_code" not in kwargs:
    20             # Whether node shutdown is expected to return EXIT_FAILURE (1) or
    21             # EXIT_SUCCESS (self.expected_ret_code).
    22-            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code
    23+            kwargs["expected_ret_code"] = self.expected_ret_code
    24         self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    25 
    26     def replace_in_config(self, replacements):
    

    hodlinator commented at 4:45 pm on January 9, 2025:
    Taken, although I moved some of the handling of expected_ret_code directly into is_node_stopped().
  63. in test/functional/test_framework/test_node.py:431 in 4ab7879ec6 outdated
    421             else:
    422-                self.stop()
    423-        except http.client.CannotSendRequest:
    424-            self.log.exception("Unable to stop node.")
    425+                self.log.warning(f"Process {self.process.pid} already died with return code {return_code}.")
    426+                # If we no longer have a running process once stop_node() is called, assume we've already triggered an
    


    ryanofsky commented at 8:47 pm on January 6, 2025:

    In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

    This change doesn’t seem safe. The stop_node method can be called by arbitrary test code at arbitrary times, including when self.rpc_connected is False, and I don’t think it should be suppressing errors unless the caller has explicitly requested that errors should not be passed on.

    Would suggest narrowing the error suppression behavior, so errors and bugs will be less likely to go unnoticed because of this change:

     0--- a/test/functional/feature_framework_stop_node.py
     1+++ b/test/functional/feature_framework_stop_node.py
     2@@ -42,7 +42,7 @@ class FeatureFrameworkStopNode(BitcoinTestFramework):
     3         assert self.nodes[0].running, \
     4             "The process should still be flagged as running before calling stop_node()"
     5         self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
     6-        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)
     7+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error, force=True)
     8 
     9 if __name__ == '__main__':
    10     FeatureFrameworkStopNode(__file__).main()
    11--- a/test/functional/test_framework/test_framework.py
    12+++ b/test/functional/test_framework/test_framework.py
    13@@ -315,7 +315,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    14         if not self.options.noshutdown:
    15             self.log.info("Stopping nodes")
    16             if self.nodes:
    17-                self.stop_nodes()
    18+                self.stop_nodes(force=self.success != TestStatus.PASSED)
    19         else:
    20             for node in self.nodes:
    21                 node.cleanup_on_exit = False
    22@@ -569,7 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    23                 node.wait_for_rpc_connection()
    24         except Exception:
    25             # If one node failed to start, stop the others
    26-            self.stop_nodes()
    27+            self.stop_nodes(force=True)
    28             raise
    29 
    30         if self.options.coveragedir is not None:
    31@@ -580,11 +580,11 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    32         """Stop a bitcoind test node"""
    33         self.nodes[i].stop_node(expected_stderr, wait=wait)
    34 
    35-    def stop_nodes(self, wait=0):
    36+    def stop_nodes(self, wait=0, force=False):
    37         """Stop multiple bitcoind test nodes"""
    38         for node in self.nodes:
    39             # Issue RPC to stop nodes
    40-            node.stop_node(wait=wait, wait_until_stopped=False)
    41+            node.stop_node(wait=wait, wait_until_stopped=False, force=force)
    42 
    43         for node in self.nodes:
    44             # Wait for nodes to stop
    45--- a/test/functional/test_framework/test_node.py
    46+++ b/test/functional/test_framework/test_node.py
    47@@ -393,12 +393,30 @@ class TestNode():
    48     def version_is_at_least(self, ver):
    49         return self.version is None or self.version >= ver
    50 
    51-    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
    52-        """Stop the node."""
    53+    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True, force=False):
    54+        """Stop the node.
    55+
    56+        Args:
    57+            force (bool): Stop the node forcefully if necessary, and change
    58+                errors that would otherwise be exceptions into logged warnings.
    59+
    60+        Notes:
    61+            - The force argument should only be set to true when a test has
    62+            already failed, and the caller is already throwing an exception of
    63+            its own, and wants its own exception to take precedence over errors
    64+            from this function. Since the error that happened was presumably
    65+            unknown and unexpected, the node may have already crashed or may
    66+            need to be shut down forcefully.
    67+        """
    68         if not self.running:
    69             return
    70         self.log.debug("Stopping node")
    71-        if self.rpc_connected:
    72+
    73+        # If self.running is true but self.rpc_connected is not true, it means
    74+        # that the process has been started, but wait_for_rpc_connection() was
    75+        # not able to make a successful connection. In this case, if force is
    76+        # true, avoid trying to make another RPC call and just stop the process.
    77+        if not force or self.rpc_connected:
    78             try:
    79                 # Do not use wait argument when testing older nodes, e.g. in wallet_backwards_compatibility.py
    80                 if self.version_is_at_least(180000):
    

    Note: The diff above is still calling the stop RPC instead of killing the node when self.rpc_connected is true. But it might actually be more desirable to change if not force or self.rpc_connected: there to if not force: and prefer to kill node processes instead of stopping them after test failures. Would not suggest doing this here, since it would change other behavior beyond the scope of the PR, but I could imagine cases where changing this might reduce noise and help debugging errors in CI.


    hodlinator commented at 10:31 am on January 9, 2025:

    It seems wrong to me to attempt using the RPC connection in stop_node() when we known that self.rpc_connected == False. New version has:

    0if self.rpc_connected:
    1    # Call stop-RPC...
    2elif force:
    3    # Kill etc...
    4else:
    5    raise RuntimeError(f"Cannot call stop-RPC as we don't have an RPC connection to process {self.process.pid}.")
    

    Edit: (Ended up calling it avoid_exceptions which I think is slightly better than force).

  64. in test/functional/test_framework/test_node.py:428 in 4ab7879ec6 outdated
    415+            return_code = self.process.poll()
    416+            if return_code is None:
    417+                self.log.warning("Cannot call stop-RPC as we are not connected. "
    418+                    f"Killing process {self.process.pid} so that wait_until_stopped will not time out.")
    419+                self.expected_ret_code = 1 if platform.system() == "Windows" else -9
    420+                self.process.kill()
    


    ryanofsky commented at 4:05 pm on January 7, 2025:

    In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

    I think this is probably the simplest way to write this code and we should stick with it for now, but just want to point out there is a minor race condition here because the process could end between the time that poll and kill are called and then the kill call could throw an exception.

    Another separate idea is that it might be safer to write self.expected_ret_code = self.process.wait() instead of hardcoding 1/-9 return codes. Again though would be inclined to stick with the working code unless there is a more of a reason to change.


    hodlinator commented at 9:52 am on January 9, 2025:

    Thanks for pointing out the race condition! It is not so likely if we get here because of RPC connection timeout, but still possible.

    I tried adding a second .kill() after the first one and it does not raise an exception, and no exception is documented.

  65. in test/functional/feature_framework_stop_node.py:45 in 608b9ff097 outdated
    40+            self=self.nodes[0]
    41+        )
    42+        assert self.nodes[0].running, \
    43+            "The process should still be flagged as running before calling stop_node()"
    44+        self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
    45+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)
    


    ryanofsky commented at 5:02 pm on January 7, 2025:

    In commit “test: Add feature_framework_stop_node.py” (608b9ff097214bff3527cc240b7be313c62131d5)

    One problem with this test as currently written is that it prints a real warning to the output (like TestFramework.node0 (WARNING): Process 579831 already died with return code 1). This is potentially confusing because this warning is expected and should not be a cause for concern, but you can’t know that from looking at test output.

    I can think of a few possible solutions:

    • Redirecting log output during the stop_node call and checking for the expected warning with an assertion here.
    • Letting the stop_node function throw exceptions instead of converting errors into warnings, then catching and handling the exceptions in the places (just 2 places currently) where they are expected.
    • Not changing anything but just add a warning before the warning instructing reader to ignore the next warning.

    hodlinator commented at 4:36 pm on January 9, 2025:
    Went for option 3, making clear in the log message just above that a WARNING is expected.
  66. in test/functional/feature_framework_rpc_failure.py:26 in f67d896357 outdated
    21+
    22+    def setup_network(self):
    23+        self.log.info("Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
    24+        self.rpc_timeout = 0
    25+
    26+        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)
    


    ryanofsky commented at 5:20 pm on January 7, 2025:

    In commit “test: Add feature_framework_rpc_failure.py” (f67d8963575feecdd4914c1a56bab4e0bebc000d)

    Note: there is no test coverage for the “ignored errors” and “latest error” parts of the message. I think it would be good to add some coverage for this, even if it just checking for 0 and None, so there is a at least a hint that more coverage could be added here, especially since it’s possible to imagine this functionality becoming broken in the future and more coverage helping prevent this.


    hodlinator commented at 4:37 pm on January 9, 2025:
    Added test/functional/feature_framework_rpc_failure_details.py.
  67. in test/functional/test_framework/test_node.py:160 in db0e1580a5 outdated
    156@@ -157,6 +157,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
    157         self.rpc = None
    158         self.url = None
    159         self.log = logging.getLogger('TestFramework.node%d' % i)
    160+        self.expected_ret_code = 0 # EXIT_SUCCESS
    


    ryanofsky commented at 6:13 pm on January 7, 2025:

    In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

    Note: I like the changes in this commit, and think the new TestNode.expected_ret_code member could be a useful way to signal expected statuses when shutdown starts. But just want to point out that I don’t think adding a new expected_ret_code member is strictly necessary for this PR to work. The only place it is actually used later in the PR is in the stop_node function when it is killing a failed process, and it seems to me that stop_node could just as easily set self.running = False instead and call process.wait() itself to reap the pid without blocking.


    hodlinator commented at 4:39 pm on January 9, 2025:
    Interesting, I do agree there is a spaghetti/complexity factor to expected_ret_code, so might revisit this.
  68. ryanofsky commented at 6:54 pm on January 7, 2025: contributor

    Code review 608b9ff097214bff3527cc240b7be313c62131d5

    Nice changes in this PR. I’ve been confused by the error output from wait_for_rpc_connection() before, so think this will be helpful.

    One thing that wasn’t obvious to me from the PR description is that this improving error output in two separate ways:

    • One change here is adding error output to wait_for_rpc_connection. Previously errors that happened during polling before the final “Unable to connect to bitcoind” error were silently ignored, so less information would be visible about what was causing the problem. Now counts of the different errors and the most recent message are shown.

    • Another change here is suppressing error output after the wait_for_rpc_connection error, specifically suppressing exceptions from in stop_node when trying to call stop method fails.

    I didn’t ACK the PR because I think the changes to the stop_node method are overbroad, and I suggested a way to limit them. Other than that I think this all looks very good.

  69. hodlinator renamed this:
    test: Shut down framework cleanly on RPC connection failure
    qa: Shut down framework cleanly on RPC connection failure
    on Jan 9, 2025
  70. hodlinator force-pushed on Jan 9, 2025
  71. hodlinator commented at 4:55 pm on January 9, 2025: contributor

    Updated based off ryanofsky’s review.

    Hopefully latest PR summary is slightly clearer.

    Replaced “test” -> “qa” since PR is about functional tests and not unit tests, should conform better with CONTRIBUTING.md.

  72. hodlinator force-pushed on Jan 9, 2025
  73. DrahtBot added the label CI failed on Jan 9, 2025
  74. DrahtBot commented at 6:21 pm on January 9, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35384673501

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  75. hodlinator force-pushed on Jan 9, 2025
  76. DrahtBot removed the label CI failed on Jan 9, 2025
  77. in test/functional/test_framework/test_framework.py:572 in 8148e4da2c outdated
    568@@ -569,7 +569,7 @@ def start_nodes(self, extra_args=None, *args, **kwargs):
    569                 node.wait_for_rpc_connection()
    570         except Exception:
    571             # If one node failed to start, stop the others
    572-            self.stop_nodes()
    573+            self.stop_nodes(avoid_exceptions=True)
    


    maflcko commented at 9:10 am on January 10, 2025:

    coming from the conflicting pull, I don’t think this call should exist in the first place for two reasons:

    • It doesn’t work in most cases, because a failure to start any node likely means the RPC isn’t up for the other nodes either, so the call will just fail in most cases. Effectively just using the kill fallback.
    • It doesn’t make sense to special case startup. Nodes can at any time fail to start (not only in start_nodes), or crash, and adding this exception handling to all places isn’t useful nor trivially possible.

    So I removed it in #31620.

    (Just leaving a comment here to document the merge conflict between the two changes. Should be trivial to solve either way, so nothing to be done here, I think.)

  78. hodlinator commented at 12:53 pm on January 17, 2025: contributor

    Excuse me from being brief earlier.

    Taking the time do document changes in latest push - git range-diff master 608b9ff 8148e4d

    • Inlined comments about different errno-codes instead of duplicating the identifiers in the comment.
    • Made missing cookie file comment slightly more informative.
    • Removed expect_error in favor of expected_ret_code after ryanofsky’s suggestion (grepping for uses of the former). This helps decrease the amount of added complexity to test_node.py.
    • Add avoid_exceptions parameter to stop_node() to make the behavior change more surgical, thanks to ryanofsky’s suggestion. We now only kill the node under more narrow conditions.
    • Clarified in log output of tests when a WARNING is expected (ryanofsky suggestion).
    • ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 - Added feature_framework_rpc_failure_details.py to verify that suppressed exception information is included in RPC connection timeout message (ryanofsky suggestion).
    • Switched feature_framework_stop_node.py from using TestNode.start() + TestNode.wait_for_rpc_connection() to BitcoinTestFramework.start_node() which combines the two.
  79. DrahtBot added the label Needs rebase on Jan 28, 2025
  80. hodlinator renamed this:
    qa: Shut down framework cleanly on RPC connection failure
    qa: Verify clean shutdown on startup failure
    on Jan 28, 2025
  81. hodlinator force-pushed on Jan 28, 2025
  82. hodlinator commented at 10:45 pm on January 28, 2025: contributor

    Resolved conflict with @maflcko’s #31620 which solved the core issue more elegantly.

    I think there is still some use in the tests from this PR, as we want to avoid knock-on exceptions being reintroduced. I’ve reworked the tests from 3 to 1 Python file. They now launch a child process of which keeps the success-state of the parent BitcoinTestFramework separate from the expected failures of the child processes.

    The final commit (f2e3721e018c3038d5d648ec7ae08e96418f3c29) is what remains of the core of what this PR used to be, but I haven’t been able to construe a test that exercises it. So if reviewers prefer, I can drop it.

    Test instructions in updated PR summary.

  83. hodlinator force-pushed on Jan 28, 2025
  84. DrahtBot added the label CI failed on Jan 28, 2025
  85. DrahtBot commented at 11:12 pm on January 28, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/36321182366

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  86. DrahtBot removed the label CI failed on Jan 29, 2025
  87. DrahtBot removed the label Needs rebase on Jan 29, 2025
  88. hodlinator force-pushed on Jan 29, 2025
  89. hodlinator commented at 4:37 pm on January 29, 2025: contributor

    Latest push adds 2347ce8f018b1b9b8d6ef1933a778023fd228b7f which fixes a Windows Python issue found by CI. It’s still a mystery to me why the old test from a previous version of this PR in ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 didn’t provoke the same issue on Windows CI (successful run including that commit). CI output states they’re running the same Python version.

    Test was also updated to re-launch child processes using the same Python interpreter as the parent, as I was ending up with mixed 64/32-bit interpeters on Windows. Also switches from using subprocesses.getoutput() which is marked as legacy, to subprocess.run() with a timeout.

  90. hodlinator force-pushed on Jan 29, 2025
  91. hodlinator commented at 8:50 pm on January 29, 2025: contributor

    Sorry about the churn here. Went back to ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 and was able to reproduce the TimeoutError.errno==None-issue there too, which is a relief. It seemed to be a bit timing-sensitive, so in the latest push I increased the still low rpc_timeout-override and applied self.options.timeout_factor to it in hopes of making it more robust.

    Edit: The above was for a18f7c8a252bdffd181fe696ede8a91e285e1cce. Fixed trivial type-error in next push, 8102f3d8e0b6715433eeda78b9b57a83109fc45e.

  92. DrahtBot added the label CI failed on Jan 29, 2025
  93. hodlinator force-pushed on Jan 29, 2025
  94. DrahtBot removed the label CI failed on Jan 30, 2025
  95. in test/functional/test_framework/test_node.py:315 in 8fa04cb13a outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).
    


    ryanofsky commented at 4:38 pm on February 12, 2025:

    In commit “qa refactor: wait_for_rpc_connection - Treat OSErrors the same” (8fa04cb13a4bf62a42b18e7c0c79151d13cb494b)

    Various thoughts:

    • I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.
    • I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.
    • I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

    Given all this, I’d suggest just keeping the comments in this except block unchanged.

    But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.


    hodlinator commented at 10:33 pm on February 12, 2025:

    I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.

    As the commit message specifies, we are changing wait_for_rpc_connection(), so the only value the “Initialization phase” comment had came from adding some context to a ~80 line long function which is part of initialization IMO. (The mention of FailedToStartError also gives a hint at what phase we are in).

    I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.

    I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

    I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

    I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (review)

    But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.

    Would rather improve the comment at the top of the function if you think it doesn’t provide enough context? https://github.com/bitcoin/bitcoin/blob/8fa04cb13a4bf62a42b18e7c0c79151d13cb494b/test/functional/test_framework/test_node.py#L264-L265

    (Edit: This comment was pending when maflcko left his below).


    maflcko commented at 9:07 am on February 13, 2025:
    • I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned

    I’d guess that it is coming from authproxy and is pretty general:

     0$ git grep '\-342\>' ./test/
     1test/functional/feature_taproot.py:# Test Taproot softfork (BIPs 340-342)
     2test/functional/rpc_bind.py:        assert_raises_rpc_error(-342, "non-JSON HTTP response with '403 Forbidden' from server", self.run_allowip_test, ['1.1.1.1'], self.non_loopback_ip, self.defaultport)
     3test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code but no JSON-RPC error'}, status)
     4test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code'}, status)
     5test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'non-200 HTTP status code'}, status)
     6test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'missing HTTP response from server'})
     7test/functional/test_framework/authproxy.py:                raise JSONRPCException({'code': -342, 'message': 'Content received with NO CONTENT status code'})
     8test/functional/test_framework/authproxy.py:                {'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)},
     9test/functional/test_framework/authproxy.py:                'code': -342, 'message': f'Cannot decode response in utf8 format, content: {data}, exception: {e}'})
    10test/functional/test_framework/test_node.py:                # -342 Service unavailable, RPC server started but is shutting down due to error
    11test/functional/test_framework/test_node.py:                if e.error['code'] != -28 and e.error['code'] != -342:
    

    ryanofsky commented at 7:30 pm on March 13, 2025:

    re: #30660 (review)

    I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

    It’s funny, because to me replacing “Initialization phase” with “Suppress these in favor of a later outcome” is literally removing the comment that explains why it is ok to suppress these errors. The previous comment made it clear these errors were being suppressed because they are expected during initialization. The new comment tells me they are being suppressed and that various other outcomes are possible later, which tells me nothing about why it is safe or good to suppress errors here.

    Maybe the problem is that I’ve wasted too much time debugging issues and being traumatized by code that suppresses errors and fails to report useful information, but I found the new comment here hard to parse, and could not figure out what it was trying to tell me, and it made me not trust this code for seeming to suppress errors without a clear reason.

    I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (comment)

    Thanks that helps. But the phrase “Service unavailable” would seem to connote that server is running but it is telling us it can’t process requests, when actually this is a low-level error coming from the JSON-RPC client. Would suggest replacing “-342 Service unavailable” with something like “-342 HTTP error” to be clearer about the source of this error.


    hodlinator commented at 2:33 pm on March 14, 2025:

    Thanks for taking the time to elaborate.

    “Initialization phase” was too terse for me - of course we’re in the initialization phase… duh. :)

    Changed to “Suppress these as they are expected during initialization.”

  96. in test/functional/feature_framework_startup_failures.py:36 in 2226e14891 outdated
    31+            self.rpc_timeout = self.options.rpc_timeout
    32+        if self.options.extra_args:
    33+            self.extra_args = [[self.options.extra_args]]
    34+
    35+    def add_options(self, parser):
    36+        parser.add_argument("--rpc_timeout", dest="rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    


    ryanofsky commented at 4:57 pm on February 12, 2025:

    In commit “qa: Add feature_framework_startup_failures.py” (2226e1489101f363114ce474d7f5e56ec50c9c6a)

    Maybe give these arguments a prefix like --internal- so it obvious they are internal, and to avoid risk of silent conflicts if real test arguments with the same names are added later.

  97. in test/functional/test_framework/test_framework.py:326 in 9d0e6a1b2a outdated
    327-            self.log.info("Not stopping nodes as test failed. The dangling processes will be cleaned up later.")
    328-        else:
    329-            self.log.info("Stopping nodes")
    330-            if self.nodes:
    331-                self.stop_nodes()
    332+        if self.nodes:
    


    ryanofsky commented at 5:10 pm on February 12, 2025:

    In commit “qa: Only log node stopping info if we have any” (9d0e6a1b2aa5be186550658ff85e5ad643554338)

    Would be helpful if commit description stated motivation for this change. After looking at the stop_nodes() method implementation it seem like this is a logging-only change and point is just to remove stopping log statements when there are no nodes. But it would be good if actual intent of change were explicit.


    hodlinator commented at 8:33 am on February 13, 2025:
    • Saw the code still had an inner if self.nodes-check - removed in new push.
    • Slightly clarified commit message, describing the change (even though it adds some negation) rather than describing then end state of the change.
  98. in test/functional/test_framework/test_node.py:403 in 8102f3d8e0 outdated
    399@@ -400,14 +400,14 @@ def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
    400         if not self.running:
    401             return
    402         self.log.debug("Stopping node")
    403-        try:
    404+        if self.rpc_connected:
    


    ryanofsky commented at 5:12 pm on February 12, 2025:

    In commit “qa: Avoid calling stop-RPC if not connected” (8102f3d8e0b6715433eeda78b9b57a83109fc45e)

    This change seems good, but unclear what motivation is. Is it supposed to be a refactoring that makes code more robust, or were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.


    hodlinator commented at 8:40 am on February 13, 2025:
    Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

    ryanofsky commented at 6:58 pm on March 13, 2025:

    re: #30660 (review)

    Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

    I’m still not clear when this case would ever be expected to happen.

    IIUC, there is no runtime error that could trigger this, and nothing non-test code could could do to trigger it, and the only thing that could trigger it is incorrectly written test code?

    If so, it would seem clearer to replace the exception with an assertion like assert self.rpc_connected, self._node_msg(...) to be clear this is a bug in the test code, not some other failure.


    hodlinator commented at 2:28 pm on March 14, 2025:

    Correct, especially now after maflcko’s change in fae3bf6b870eb0f9cddd1adac82ba72890806ae3 to not call stop() on test failure, I agree it makes sense to assert instead, good suggestion! Fixed.

    were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.

    It was a common knock-on issue from attempting to call stop-RPC on a non-functional RPC interface, see for example “CannotSendRequest” in the log of #21367. Should no longer happen due to maflcko’s change and us guarding against that kind of thing with the assert.

    Clarified commit description.

  99. ryanofsky approved
  100. ryanofsky commented at 5:26 pm on February 12, 2025: contributor

    Code review ACK 8102f3d8e0b6715433eeda78b9b57a83109fc45e. Looks like #31620 made some overlapping changes, but this PR still seems very useful.

    From PR description:

    Before #31620, failure to establish RPC connections would result in an attempt to stop the node via RPC, triggering knock-on exceptions. This would muddy the waters for developers looking for root causes of test failures. This PR used to solve that issue in a more roundabout way than aforementioned PR, but now merely adds tests to ensure knock-on errors/exceptions are not reintroduced.

    Even after reviewing this PR twice and rereading this paragraph several times, I find this pretty difficult to understand. Would suggest giving a more straightforward description like:

    • This PR improves handling of startup errors in functional tests. Specifically it improves wait_for_rpc_connection so instead of returning a vague Unable to connect to bitcoind error, it returns details about specific failures leading up to it. This PR also adds comments, refactors code, improves logging, detects timeout errors on windows, adds a test to make sure error handling behavior is not broken in the future.

    I think it is ok to mention #31620 and mention the earlier version of this PR in a postscript to provide context for early review comments, but the main goal of the PR description should be to describe the PR as it currently is.

  101. DrahtBot requested review from m3dwards on Feb 12, 2025
  102. hodlinator force-pushed on Feb 13, 2025
  103. hodlinator commented at 10:23 am on February 13, 2025: contributor

    Thanks for bearing with me @ryanofsky. :)

    Updated PR description based on feedback and addressed inline comments.

  104. hodlinator force-pushed on Feb 13, 2025
  105. DrahtBot added the label CI failed on Feb 13, 2025
  106. DrahtBot commented at 10:27 am on February 13, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37155473613

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  107. DrahtBot removed the label CI failed on Feb 13, 2025
  108. ryanofsky approved
  109. ryanofsky commented at 8:02 pm on March 13, 2025: contributor
    Code review ACK c107aa8c873cb3c031b341baef487638133823f1. Changes since last review were just updating some messages and expanding the test a little. Thanks for the updates. This still looks good to me and I left some more comments, but they are not important.
  110. qa refactor: wait_for_rpc_connection - Treat OSErrors the same
    ConnectionResetError is an OSError as well (ECONNRESET), no reason to have a separate except-block for it.
    
    Also improves comments for other exceptions and make condition above more Pythonic.
    8ff738a6e9
  111. qa: Include ignored errors in RPC connection timeout
    When an RPC connection attempt with bitcoind times out, include which ignored errors occurred in the exception message.
    
    May provide clues of what has gone wrong.
    f5254ed51b
  112. hodlinator force-pushed on Mar 14, 2025
  113. hodlinator force-pushed on Mar 14, 2025
  114. hodlinator force-pushed on Mar 14, 2025
  115. hodlinator force-pushed on Mar 14, 2025
  116. hodlinator force-pushed on Mar 14, 2025
  117. in test/functional/test_framework/test_node.py:328 in fa2178e67f outdated
    321@@ -322,13 +322,18 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
    323                 latest_error = repr(e)
    324             except OSError as e:
    325+                error_num = e.errno
    326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    327+                if error_num is None:
    328+                    assert isinstance(e, TimeoutError)
    329+                    error_num = errno.ETIMEDOUT
    


    m3dwards commented at 5:35 pm on March 17, 2025:

    The inclusion of an assert suggests to me that we expect that all OSErrors without errno are TimeoutErrors? Is this actually what we expect?

    If not should the if statement be changed to if error_num is None and isinstance(e, TimeoutError):


    hodlinator commented at 9:20 pm on March 17, 2025:

    Had a feeling OSErrors must always have that field set (except for the socket TimeoutError bug), but the docs indicate it’s a bit more loose:

    The second form of the constructor sets the corresponding attributes, described below. The attributes default to None if not specified.

    Could change the condition as you suggestion, but it would be prudent to assert that error_num isn’t None for other cases:

    0                if error_num is None and isinstance(e, TimeoutError):
    1                    error_num = errno.ETIMEDOUT
    2                else:
    3                    assert error_num is not None
    

    So not sure about that change as it would repeat one of the conditions. Maybe just adding an error message to what we have now would be okay?

    0                if error_num is None:
    1                    assert isinstance(e, TimeoutError), "We only currently support " \
    2                        f"inferring errno from TimeoutErrors, encountered unhandled: {e!r}"
    3                    error_num = errno.ETIMEDOUT
    

    m3dwards commented at 12:29 pm on March 19, 2025:

    This is only a NIT so feel free to leave it as it is but in my mind an assert is used to signal an expectation of predicted state. In Python asserts are not even executed if run in optimised mode. It seems like we are using it more like an exception here?

    0                if error_num is None:
    1                    if isinstance(e, TimeoutError):
    2                        error_num = errno.ETIMEDOUT
    3                    else:
    4                        raise Exception("errno is required")
    

    hodlinator commented at 2:02 pm on March 19, 2025:
    Not so concerned about people running in optimised node since 5+ functional tests fail in that way, but changed in latest push to: https://github.com/bitcoin/bitcoin/blob/cac218dcbdd2b5682a7ac8659aac0306e68a65f5/test/functional/test_framework/test_node.py#L325-L331

    ryanofsky commented at 1:51 pm on March 26, 2025:

    re: #30660 (review)

    In commit “qa: Workaround Windows issue with socket timeouts” (c92a8fcd79d76b23c366a07be1c8778db7bea891)

    FWIW, I think the best way to write this would be simply:

    0# Work around issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    1if error_num is None and isinstance(e, TimeoutError):
    2    error_num = errno.ETIMEDOUT
    3
    4if error_num not in [ ... ]:
    5   raise
    

    Raising a new “Cannot infer errno when unset” exception just adds complexity and could make it harder to debug since the original exception contains more information than the new exception.


    hodlinator commented at 2:26 pm on March 26, 2025:

    Thanks for re-reviewing this again!

    Yes, in that way we would make error_num == None be one of the cases for which we raise. It’s obviously better to raise for something unexpected than to suppress, which should lead to it being tracked down and intentionally categorized for suppression or not.

    Updated in latest push.

  118. hodlinator force-pushed on Mar 19, 2025
  119. ryanofsky approved
  120. ryanofsky commented at 2:03 pm on March 26, 2025: contributor

    Code review ACK cac218dcbdd2b5682a7ac8659aac0306e68a65f5. Just changed exception to assert and assert to exception since last review. Left a minor suggestion but it’s not important so feel free to ignore. Thanks for being so responsive with previous feedback.

    I think this PR should be pretty helpful to developers writing python tests since it cleans up logging, returns more complete error information, provides better feedback if stop_node is used incorrectly, and adds test coverage for the test framework to make sure it works reliably.

  121. DrahtBot requested review from m3dwards on Mar 26, 2025
  122. hodlinator force-pushed on Mar 26, 2025
  123. ryanofsky approved
  124. ryanofsky commented at 3:13 pm on March 26, 2025: contributor
    Code review ACK 6539598b42005f453b358a0076287db99d1b02eb. Since last review just simplified errno code a little as suggested
  125. in test/functional/test_framework/test_node.py:326 in 52746d0c58 outdated
    321@@ -322,13 +322,17 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
    323                 latest_error = repr(e)
    324             except OSError as e:
    325+                error_num = e.errno
    326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    


    l0rinc commented at 10:37 am on March 27, 2025:
    Is this the issue? https://github.com/python/cpython/issues/109601 If so, it doesn’t seem to be Windows related.

    maflcko commented at 11:34 am on March 27, 2025:
    Also, the commit message seems to imply this is a bugfix for an issue observed on Windows. However, I don’t think anyone has observed this on current master. The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

    hodlinator commented at 1:07 pm on March 27, 2025:

    Thanks for finding that issue @l0rinc!

    I’ve only observed it on Windows, maybe the socket subsystem or timeouts are different enough there to make it more of an issue.

    However, I don’t think anyone has observed this on current master.

    Maybe I was running under some less common setup like non-WSL MSVC Debug, or maybe my Windows network stack was having a bad day. (There’s an odd chance that I might have been using Linux->Windows cross-built binaries for some reason).

    The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

    8ff738a6e96c19aaf14ebea52c8ae13751cf443b, the refactor-commit does indeed restructure how OSErrors are caught, at first I thought maybe it was filtering TimeoutError by type, but no, the approach on master also relies on errno.ETIMEDOUT (ConnectionResetError is partially filtered by type there, they are not direct descendants of each other).


    hodlinator commented at 10:31 pm on March 27, 2025:
    Pushed new version of this commit - 51b2115a348f681e12aabf514cb156da4420da2e. Now links to the Python issue in the comment, and only mentions that it was encountered on Windows in the commit message.

    maflcko commented at 7:13 am on March 28, 2025:
    Do you still know how often it happened and what the traceback looked like?

    hodlinator commented at 1:27 pm on March 28, 2025:

    Was able to easily reproduce the issue.

    C:\Users\hodlinator\bitcoin>py build\test\functional\feature_framework_startup_failures.py 2025-03-28T13:09:19.459000Z TestFramework (INFO): PRNG seed is: 4595347679033165165 2025-03-28T13:09:19.519000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:19.519000Z TestFramework (INFO): Verifying timeout in connecting to bitcoind’s RPC interface results in only one exception. 2025-03-28T13:09:19.788000Z TestFramework (INFO): Verifying inability to connect to bitcoind’s RPC interface due to wrong port results in one exception containing at least one OSError. 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Found 0/1 occurrences of the specific exception: AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+)) 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Assertion failed Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 135, in main self.run_test() ~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 111, in run_test self.test_wrong_rpc_port() ~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 85, in test_wrong_rpc_port self._run_test_internal( ~~~~~~~~~~~~~~~~~~~~~~~^ # Lower the timeout so we don’t wait that long. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …<4 lines>… r"AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+))" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 74, in _run_test_internal raise AssertionError(f"Child test didn’t contain (only) expected errors.\n:\n{result.stdout}\n\n") AssertionError: Child test didn’t contain (only) expected errors. : 2025-03-28T13:09:19.929000Z TestFramework (INFO): PRNG seed is: 1755806780689552109 2025-03-28T13:09:19.952000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.234000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 131, in main self.setup() ~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 313, in setup self.setup_network() ~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 47, in setup_network BitcoinTestFramework.setup_network(self) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 404, in setup_network self.setup_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 426, in setup_nodes self.start_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 585, in start_nodes node.wait_for_rpc_connection() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_node.py”, line 286, in wait_for_rpc_connection rpc.getblockcount() ~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\coverage.py”, line 50, in call return_val = self.auth_service_proxy_instance.call(*args, **kwargs) File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 127, in call response, status = self._request(‘POST’, self.__url.path, postdata.encode(‘utf-8’)) ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 105, in _request self.__conn.request(method, path, postdata, headers) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1336, in request self._send_request(method, url, body, headers, encode_chunked) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1382, in _send_request self.endheaders(body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1331, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1091, in _send_output self.send(msg) ~~~~~~~~~^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1035, in send self.connect() ~~~~~~~~~~~~^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1001, in connect self.sock = self._create_connection( ~~~~~~~~~~~~~~~~~~~~~~~^ (self.host,self.port), self.timeout, self.source_address) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 864, in create_connection raise exceptions[0] File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 849, in create_connection sock.connect(sa) ~~~~~~~~~~~~^^^^ TimeoutError: timed out 2025-03-28T13:09:21.344000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.345000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto/test_framework.log 2025-03-28T13:09:21.345000Z TestFramework (ERROR): 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_zahuvbto’ to consolidate all logs 2025-03-28T13:09:21.346000Z TestFramework (ERROR): 2025-03-28T13:09:21.346000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.346000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.346000Z TestFramework (ERROR): [node 0] Cleaning up leftover process

    2025-03-28T13:09:21.431000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.431000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:21.431000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2/test_framework.log 2025-03-28T13:09:21.431000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2’ to consolidate all logs 2025-03-28T13:09:21.432000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.432000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.432000Z TestFramework (ERROR):

    My hypothesis is that we trigger the issue through the first test setting a 0 rpc_timeout:

    https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L76-L81

    Attempting to use the network stack with a 0 timeout seems to have different meanings on Linux and Windows. It’s not really a realistic test. I think the next test covers a superset of the one above, so could probably remove the first one.

    https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L83-L92


    hodlinator commented at 1:55 pm on March 28, 2025:
    Dropped the test and workaround commit in latest push.
  126. in test/functional/feature_framework_startup_failures.py:98 in 75134fef62 outdated
    93+
    94+    def test_init_error(self):
    95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
    96+        self._run_test_internal(
    97+            ["--internal-extra_args=-nonexistentarg"],
    98+            "FailedToStartError: \\[node 0\\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    


    l0rinc commented at 11:12 am on March 27, 2025:
    nit: for consistency I’d make this either a r"...\[node 0\] as well or do a re.escape if we want to be explicit. Though I think we’re probably matching way too much here, it doesn’t hint at which part of the error is important and which part is irrelevant and subject to change

    hodlinator commented at 1:15 pm on March 27, 2025:
    r""-string seems more consistent. Thanks for pointing it out! Will fix if I re-touch.

    hodlinator commented at 10:31 pm on March 27, 2025:
    Fixed.
  127. l0rinc commented at 11:13 am on March 27, 2025: contributor
    lightweight, code review only ACK 6539598b42005f453b358a0076287db99d1b02eb
  128. hodlinator force-pushed on Mar 27, 2025
  129. hodlinator commented at 10:33 pm on March 27, 2025: contributor
  130. in test/functional/feature_framework_startup_failures.py:98 in 6bbfb3eeef outdated
    93+
    94+    def test_init_error(self):
    95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
    96+        self._run_test_internal(
    97+            ["--internal-extra_args=-nonexistentarg"],
    98+            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    


    l0rinc commented at 9:24 am on March 28, 2025:

    super-nit:

    0            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization\. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    

    hodlinator commented at 10:14 am on March 28, 2025:
    Will take if anything else forces me to re-touch.

    hodlinator commented at 1:53 pm on March 28, 2025:
  131. l0rinc approved
  132. l0rinc commented at 9:28 am on March 28, 2025: contributor
    lightweight re-review-ack 6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5
  133. qa: Only allow calling TestNode.stop() after connecting
    (Still tolerate calling it on a no longer (self.)running node, as in a node that has been queried for is_node_stopped() and modified state before returning True).
    
    Tests should not attempt to use the non-functioning RPC interface to call stop() unless wait_for_connections() has succeeded.
    
    No longer log and suppress http.client.CannotSendRequest as a consequence of stop()-RPC, as error conditions causing this knock-on issue are now guarded against before the call.
    3362c4c868
  134. qa: Add feature_framework_startup_failures.py 3ee7062ee3
  135. qa: Avoid logging node stopping info when a test has 0 nodes
    This applies to the test in the parent commit which does not run any nodes itself, only indirectly through child test processes.
    fc88c2ec29
  136. hodlinator force-pushed on Mar 28, 2025
  137. hodlinator commented at 2:06 pm on March 28, 2025: contributor

    Latest push:

    • Removes test_instant_rpc_timeout which was artificially setting rpc_timeout = 0, which is not a supported case and which caused different behavior on Windows network stacks compared to Linux. That test was a rough subset of test_wrong_rpc_port. Removing test_instant_rpc_timeout allowed dropping the commit with the workaround for socket timeout with unset errno entirely. Special thanks to @maflcko for insisting that something was not right.
    • Fixed nit concerning escaping in regexp string.
    • Bumped subprocess.run-timeout since it was flaky under Windows.
  138. ryanofsky approved
  139. ryanofsky commented at 2:56 pm on March 28, 2025: contributor

    Code review ACK fc88c2ec297dc93ba06008bd5ae10798e9f6aeac. Just dropped 0 timeout workaround and instant timeout test since the last review.

    Ideally, I would want setting a 0 timeout to result in a comprehensible error message lke “Unable to connect to bitcoind after 0s” and not some harder to debug behavior, but it wouldn’t be worth restructuring the code or adding a special case for.

  140. DrahtBot requested review from l0rinc on Mar 28, 2025

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-04-03 06:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me