qa: Verify clean shutdown on startup failure #30660

pull hodlinator wants to merge 6 commits into bitcoin:master from hodlinator:2024-08/test_node_rpc_connection_failure changing 4 files +216 −30
  1. hodlinator commented at 11:53 am on August 15, 2024: contributor

    Improves handling of startup errors in functional tests and puts tests in place to ensure knock-on errors don’t creep in.

    • wait_for_rpc_connection() now appends specific failures leading up to the Unable to connect to bitcoind error to that error message: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 1, 'OSError.ECONNREFUSED': 239}, latest error: ConnectionRefusedError(111, 'Connection refused'))
    • Fixes Windows Python issue where socket.timeout exceptions end up with unset errno-fields.
    • Also adds comments, refactors code, improves logging.

    The underlying purpose is to ensure developer efficiency in finding root causes of test failures.

    Prior iterations of the PR partially focused on fixing the same issue as #31620. Originally inspired by #30390.

    Testing

    Can be tested by reverting either faf2f2c654d9aa18b2f49a157956f9ab0fce302a or fae3bf6b870eb0f9cddd1adac82ba72890806ae3 from #31620, or the “qa: Avoid calling stop-RPC if not connected” from this PR, and running feature_framework_startup_failures.py.

  2. DrahtBot commented at 11:53 am on August 15, 2024: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30660.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK l0rinc
    Stale ACK tdb3, m3dwards, ryanofsky

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    No conflicts as of last run.

  3. DrahtBot added the label Tests on Aug 15, 2024
  4. hodlinator commented at 11:54 am on August 15, 2024: contributor

    Encountered a case of #30390 here: https://github.com/bitcoin/bitcoin/actions/runs/10380115902/job/28739417666?pr=30377

    Output with only the test (no fix commit):

     0$ test/functional/feature_framework_rpc_failure.py
     12024-08-15T07:30:20.427000Z TestFramework (INFO): PRNG seed is: 635212259632049987
     22024-08-15T07:30:20.427000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_ujsvbdnm
     32024-08-15T07:30:20.428000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
     42024-08-15T07:30:20.429000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 329, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 192, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 0] Unable to connect to bitcoind after 0s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 21, in setup_network
    18    BitcoinTestFramework.setup_network(self)
    19  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    20    self.setup_nodes()
    21  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    22    self.start_nodes()
    23  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    24    self.stop_nodes()
    25  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    26    node.stop_node(wait=wait, wait_until_stopped=False)
    27  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    28    self.stop(wait=wait)
    29    ^^^^^^^^^
    30  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    31    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    32           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    33AssertionError: [node 0] Error: no RPC connection
    34
    35During handling of the above exception, another exception occurred:
    36
    37Traceback (most recent call last):
    38  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    39    self.setup()
    40  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    41    self.setup_network()
    42  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 23, in setup_network
    43    assert_equal(str(e), "[node 0] Unable to connect to bitcoind after 0s")
    44  File "/home/hodlinator/bitcoin/test/functional/test_framework/util.py", line 75, in assert_equal
    45    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
    46AssertionError: not([node 0] Error: no RPC connection == [node 0] Unable to connect to bitcoind after 0s)
    472024-08-15T07:30:20.480000Z TestFramework (INFO): Stopping nodes
    48Traceback (most recent call last):
    49  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
    50    FeatureFrameworkRPCFailure(__file__).main()
    51  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    52    exit_code = self.shutdown()
    53                ^^^^^^^^^^^^^^^
    54  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    55    self.stop_nodes()
    56  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    57    node.stop_node(wait=wait, wait_until_stopped=False)
    58  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
    59    self.stop(wait=wait)
    60    ^^^^^^^^^
    61  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
    62    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    63           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    64AssertionError: [node 0] Error: no RPC connection
    65[node 0] Cleaning up leftover process
    

    With fix:

    0$ test/functional/feature_framework_rpc_failure.py
    12024-08-15T07:30:05.092000Z TestFramework (INFO): PRNG seed is: 1672695534155872365
    22024-08-15T07:30:05.092000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_cow4vrf7
    32024-08-15T07:30:05.093000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
    42024-08-15T07:30:05.094000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 326500.
    52024-08-15T07:30:05.144000Z TestFramework (INFO): Caught AssertionError with expected message
    62024-08-15T07:30:05.194000Z TestFramework (INFO): Stopping nodes
    72024-08-15T07:30:05.194000Z TestFramework (INFO): Cleaning up /run/user/1001/bitcoin_func_test_cow4vrf7 on exit
    82024-08-15T07:30:05.194000Z TestFramework (INFO): Tests successful
    
  5. maflcko commented at 12:32 pm on August 15, 2024: member
    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed. However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.
  6. hodlinator force-pushed on Aug 18, 2024
  7. hodlinator commented at 8:23 pm on August 18, 2024: contributor

    Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed.

    As long as the additional complexity and maintenance burden is low, I think it is worth having a cleaner shutdown to avoid muddying the waters for people trying to figure out root causes.

    However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.

    It does clean up dangling bitcoind processes without the fix, as can be seen in the log:

    0[node 0] Cleaning up leftover process
    

    The latest push (bcc87514e780cd35d1f71d5c4c4069a8fac95324 0ea00cddc3d4a3f92c6bfae13a8b844e97ebc8c3) now explains in the log message that it is terminating the process to avoid wait_until_stopped timing out (which raises an additional exception).

    + Broke out a commit adding the self.expected_ret_code functionality, so that the fix commit in the middle is clearer. + Attempted to make the comment at the top of the test file more explanatory. + Switched from process.kill() to the somewhat less aggressive process.terminate() (although on second thoughts, if the RPC interface is not responding in a real-world situation outside of the added test, maybe .kill() is closer to what we want).

  8. hodlinator force-pushed on Aug 18, 2024
  9. hodlinator commented at 8:27 pm on August 26, 2024: contributor

    Last week I was hit by this again. Probably line ending differences on Windows makes any additional Python exception reporting extra annoying to read due to every second line being blank (separate issue, but slightly improved by this PR producing less output).

    Git-sleuthing to help motivate this PR

    Was this ever working?

    Unfortunately for my case, it seems the answer is no: 7897338918dac072e788b8ab2919d4559f311bef (Aug, 2017) broke out test_node.py from test_framework.py added the self.rpc_connected flag. Would have been nice if it had protected use of the stop() RPC using this new flag (like in this PR) but unfortunately it did not.

    Fixes in the same area

    fafb33cdefd2d8ce065263978075d26a1672b630 (May, 2016) MarcoFalke wrapped stop() attempts in try/except so at least other nodes would have a chance of shutting down properly. (6 days later in fab5233fe6c99d446a2cebc48cf479c3f026c2cc a timeout for waiting on the process to shutdown was added too, probably so the tests wouldn’t hang waiting for a non-crashed bitcoind that we failed to send stop() to).

    #14670 (Dec, 2018) is tangential and adds test/functional/feature_shutdown.py in 28479f926f21f2a91bec5a06671c60e5b0c55532, to ensure that HTTP responses for stop() make it all the way back to the client if I’m understanding it correctly. It fixes #11777 (Nov, 2017) which is arguably about shutdowns not being clean enough (sometimes showing additional errors depending on timing). The current PR is certainly more nit-level, but still continuing in the same general direction.

  10. hodlinator commented at 10:07 pm on August 26, 2024: contributor

    Another argument for this PR is that even the current title of #30390 is “ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it”. This in itself when looking at logs like https://github.com/bitcoin/bitcoin/actions/runs/9677128783/job/26698120754#step:27:225 is actually just a knock-on exception happening after the RPC connection attempt times out and stop()-RPC is called in vain:

     0313/313 - wallet_listsinceblock.py --legacy-wallet failed, Duration: 2415 s
     1stdout:
     22024-06-26T09:53:17.196000Z TestFramework (INFO): PRNG seed is: 3062882430636553844
     32024-06-26T09:53:17.227000Z TestFramework (INFO): Initializing test directory D:\a\_temp\test_runner__🏃_20240626_093314\wallet_listsinceblock_88
     42024-06-26T10:33:29.783000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
     9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    10  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
    11    raise AssertionError(self._node_msg(msg))
    12AssertionError: [node 3] Unable to connect to bitcoind after 2400s
    13
    14During handling of the above exception, another exception occurred:
    15
    16Traceback (most recent call last):
    17  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 131, in main
    18    self.setup()
    19  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 302, in setup
    20    self.setup_network()
    21  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 396, in setup_network
    22    self.setup_nodes()
    23  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 418, in setup_nodes
    24    self.start_nodes()
    25  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 572, in start_nodes
    26    self.stop_nodes()
    27  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    28    node.stop_node(wait=wait, wait_until_stopped=False)
    29  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    30    self.stop(wait=wait)
    31    ^^^^^^^^^
    32  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 205, in __getattr__
    33    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    34           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    35AssertionError: [node 3] Error: no RPC connection
    362024-06-26T10:33:29.842000Z TestFramework (INFO): Stopping nodes
    37[node 3] Cleaning up leftover process
    38[node 2] Cleaning up leftover process
    39[node 1] Cleaning up leftover process
    40[node 0] Cleaning up leftover process
    41
    42stderr:
    43Traceback (most recent call last):
    44  File "D:\a\bitcoin\bitcoin/test/functional/wallet_listsinceblock.py", line 508, in <module>
    45    ListSinceBlockTest().main()
    46  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 155, in main
    47    exit_code = self.shutdown()
    48                ^^^^^^^^^^^^^^^
    49  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 318, in shutdown
    50    self.stop_nodes()
    51  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
    52    node.stop_node(wait=wait, wait_until_stopped=False)
    53  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
    54    self.stop(wait=wait)
    55  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\coverage.py", line 50, in __call__
    56    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
    57                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    58  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 127, in __call__
    59    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
    60                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    61  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 105, in _request
    62    self.__conn.request(method, path, postdata, headers)
    63  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1336, in request
    64    self._send_request(method, url, body, headers, encode_chunked)
    65  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1382, in _send_request
    66    self.endheaders(body, encode_chunked=encode_chunked)
    67  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1331, in endheaders
    68    self._send_output(message_body, encode_chunked=encode_chunked)
    69  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1091, in _send_output
    70    self.send(msg)
    71  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1035, in send
    72    self.connect()
    73  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1001, in connect
    74    self.sock = self._create_connection(
    75                ^^^^^^^^^^^^^^^^^^^^^^^^
    76  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 853, in create_connection
    77    raise exceptions[0]
    78  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 838, in create_connection
    79    sock.connect(sa)
    80ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
    

    A better title might have been “Unable to connect to bitcoind after 2400s errors on Windows”, which this PR would have made it more likely to be in the first place (had it predated the issue).

  11. hodlinator force-pushed on Aug 27, 2024
  12. hodlinator commented at 9:54 pm on August 27, 2024: contributor

    After more investigation into #30390, my impression is that since TestNode.wait_for_rpc_connection() ignores a bunch of exceptions, the knock-on exceptions when calling stop() and failing prior to this PR actually may sometimes provide a clue for why the RPC connection wasn’t established in the first place.

    That realization made me add 2 more commits here, which provide more information on the original exceptions being triggered and ignored inside wait_for_rpc_connection. Only fair since we are avoiding the triggering of knock-on exceptions later.

    The timeout exception’s message thereby now includes counters of encountered (and ignored) exceptions + latest exception type & message: Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 5}, latest error: ValueError('No RPC credentials'))

  13. DrahtBot added the label CI failed on Aug 27, 2024
  14. DrahtBot removed the label CI failed on Sep 29, 2024
  15. DrahtBot added the label CI failed on Oct 13, 2024
  16. achow101 requested review from josibake on Oct 15, 2024
  17. achow101 requested review from willcl-ark on Oct 15, 2024
  18. achow101 commented at 3:30 pm on October 15, 2024: member
    cc @tdb3
  19. DrahtBot removed the label CI failed on Oct 18, 2024
  20. hodlinator force-pushed on Oct 21, 2024
  21. hodlinator commented at 10:14 pm on October 21, 2024: contributor
    (Rebased onto latest master).
  22. in test/functional/test_framework/test_node.py:324 in 5c95843281 outdated
    320+                    ignored_errors["JSONRPCException -28"] += 1
    321                 # -342 Service unavailable, RPC server started but is shutting down due to error
    322-                if e.error['code'] != -28 and e.error['code'] != -342:
    323+                # - Ignore the exception to properly raise the FailedToStartError.
    324+                elif e.error['code'] == -342:
    325+                    ignored_errors["JSONRPCException -342"] += 1
    


    tdb3 commented at 6:19 pm on October 31, 2024:

    non-blocking nit: might be a bit more Pythonic to use Counter.update() rather than += 1. For example:

    0ignored_errors.update([f"JSONRPCException {e.error['code']}"])
    

    hodlinator commented at 8:23 pm on November 2, 2024:

    Tried using update(), but it resulted in things like:

    0AssertionError: [node 0] Unable to connect to bitcoind after 10s (ignored errors: {'m': 40, 'i': 120, 's': 120, 'n': 80, 'g': 40, '_': 40, 'c': 40, 'r': 40, 'e': 80, 'd': 40, 't': 40, 'a': 40, 'l': 40}, latest error: ValueError('No RPC credentials'))
    

    i.e. it was iterating over the string and incrementing an entry for each unique character, which seems to match the docs even though not super clear.~


    Edit: @tdb3 pointed out out of band than I’d probably missed out the []-brackets around the strings to make update() iterate through whole strings instead of chars.

    0ignored_errors.update([f"JSONRPCException {e.error['code']}"])
    

    feels slightly more contrived and verbose to me than

    0ignored_errors[f"JSONRPCException {e.error['code']}"] += 1
    

    but neither is too elegant.

    Might switch away from Counter if something more appropriate is found.

  23. in test/functional/test_framework/test_node.py:336 in 5c95843281 outdated
    335                 elif e.errno == errno.ETIMEDOUT:
    336-                    pass # Treat identical to ECONNRESET
    337+                    ignored_errors["OSError.ETIMEDOUT"] += 1 # Treat identical to ECONNRESET
    338                 elif e.errno == errno.ECONNREFUSED:
    339-                    pass  # Port not yet open?
    340+                    ignored_errors["OSError.ECONNREFUSED"] += 1 # Port not yet open?
    


    tdb3 commented at 6:27 pm on October 31, 2024:

    non-blocking nit: same with these. e.g.

    0ignored_errors.update([errno.errorcode[e.errno]])
    

    hodlinator commented at 8:21 pm on November 2, 2024:
    (Same issue with update(), see #30660 (review)).
  24. in test/functional/feature_framework_rpc_failure.py:35 in 9173b05600 outdated
    30+            BitcoinTestFramework.setup_network(self)
    31+        except AssertionError as e:
    32+            assert_equal(str(e)[:47], "[node 0] Unable to connect to bitcoind after 0s")
    33+            self.log.info(f"{test_name}: Caught AssertionError with expected RPC connection failure message")
    34+        else:
    35+            raise AssertionError("Didn't raise expected error")
    


    tdb3 commented at 6:47 pm on October 31, 2024:
    Maybe we can use util.assert_raises() or util.assert_raises_message() here to keep things more concise?

    hodlinator commented at 8:22 pm on November 2, 2024:
    Had to add a commit (2a0c0e410e9f32dbf7af7229457889b09553c438) that refurbishes assert_raises_message(), but test is much cleaner now!
  25. tdb3 approved
  26. tdb3 commented at 6:49 pm on October 31, 2024: contributor

    Code review ACK 9173b056006fd3940675108aa3fe0d0d855af768

    This seems quite useful when troubleshooting errors. Left a few nits, but nothing I feel very strongly about.

  27. hodlinator force-pushed on Nov 2, 2024
  28. hodlinator commented at 8:26 pm on November 2, 2024: contributor

    Thanks for the review @tdb3!

    Implemented feedback from that review, made the code a bit more Pythonic and improved comments slightly.

  29. in test/functional/test_framework/test_node.py:341 in b6b30c2d9c outdated
    337                     raise
    338+                ignored_errors["missing_credentials"] += 1
    339+                latest_error = repr(e)
    340             time.sleep(1.0 / poll_per_s)
    341-        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    342+        self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored errors: {}, latest error: {})".format(self.rpc_timeout, str(dict(ignored_errors)), latest_error))
    


    tdb3 commented at 10:50 pm on November 2, 2024:
    non-blocking pico-nit: If this file gets touched again, could use f string instead of format. https://github.com/bitcoin/bitcoin/blob/master/test/functional/README.md#style-guidelines

    hodlinator commented at 11:11 pm on November 2, 2024:
    Good point! Resolved in latest push.
  30. tdb3 approved
  31. tdb3 commented at 10:59 pm on November 2, 2024: contributor

    Code Review re ACK 1148c4a53295e7635a813aadfde53ba5d973395c

    Nice work cleaning up the existing code around the core changes. Increases readability.

  32. hodlinator force-pushed on Nov 2, 2024
  33. tdb3 approved
  34. tdb3 commented at 11:15 pm on November 2, 2024: contributor
    re ACK 042cee7073b338d79b5176d4157cf08bbd079b9f
  35. in test/functional/feature_framework_rpc_failure.py:26 in 042cee7073 outdated
    21+    def set_test_params(self):
    22+        self.num_nodes = 1
    23+
    24+    def setup_network(self):
    25+        test_name = Path(__file__).name
    26+        self.log.info(f"{test_name}: Setting RPC timeout to 0 to simulate an unresponsive bitcoind")
    


    maflcko commented at 10:15 am on November 4, 2024:

    Any reason to log the file name here? The containing folder of the log should already have the name of the test/file. If not, the user should know which test they called to know it.

    If not, and this was helpful, it would be better to log it for all tests, instead of just for this one test specifically.


    hodlinator commented at 10:44 am on November 8, 2024:
    Wanted to make clear it wasn’t something internal to BitcoinTestFramework/TestNode deciding to set the RPC timeout. Replaced that with instead appending “, expect one warning” to the log message. (Edit: now pushed).
  36. in test/functional/test_framework/test_node.py:315 in 5ca92b17c5 outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.
    


    maflcko commented at 11:24 am on November 4, 2024:

    nit in 5ca92b17c5d2d5efa7b2246da1905bb8fd186230: It is obvious, but could mention the success case as well?

    Retry after these, until a permanent success or failure state (such as FailedToStartError, or timeout) is reached.?


    hodlinator commented at 10:47 am on November 8, 2024:
    Good point, changed to Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).
  37. in test/functional/test_framework/test_node.py:317 in 5ca92b17c5 outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.
    317                 # -28 RPC in warmup
    318                 # -342 Service unavailable, RPC server started but is shutting down due to error
    


    maflcko commented at 11:36 am on November 4, 2024:
    I don’t think this is true, it is just one example. I think 342 can be raised by authproxy.py for various reasons.

    hodlinator commented at 10:48 am on November 8, 2024:
    Made comment less definitive.
  38. in test/functional/test_framework/test_node.py:331 in 5ca92b17c5 outdated
    330-                    pass  # Port not yet open?
    331-                else:
    332+                # ECONNRESET   - This might happen when the RPC server is in warmup, but shut down before the call to
    333+                #                getblockcount succeeds. Try again to properly raise the FailedToStartError
    334+                # ETIMEDOUT    - Treat identical to ECONNRESET
    335+                # ECONNREFUSED - Port not yet open?
    


    maflcko commented at 11:42 am on November 4, 2024:
    nit: Could just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?

    hodlinator commented at 10:48 am on November 8, 2024:
    Added: Suppress similarly to the above JSONRPCException errors:

    maflcko commented at 3:06 pm on November 8, 2024:

    I still don’t know if the current docs are correct and useful.

    Why would ETIMEDOUT only happen on warmup? Why is it even relevant to mention anything here?

    I think it would be clearer to just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?


    maflcko commented at 11:27 am on November 18, 2024:

    re #30660 (review)

    Are you still working on this?


    hodlinator commented at 10:19 pm on November 18, 2024:
    I am still working on this PR. Not thrilled about removing the 3 comment strings predating this PR.

    hodlinator commented at 10:35 am on November 26, 2024:
    I am satisfied with the PR in it’s latest revision, but if you don’t consider this a nit, but rather a blocker, let me know.

    ryanofsky commented at 7:57 pm on January 6, 2025:

    In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

    I think these documentation changes look ok, but looking at old and new versions side by side I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

    E.g. the previous “# Initialization phase” comment next to the exception being caught told me that the exception was expected during initialization. The current comment “Suppress these in favor of a later outcome” tells me the code does not care about this exception as much as other exceptions, but not why.

    Am fine with both versions though.


    hodlinator commented at 4:43 pm on January 9, 2025:

    I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

    Hopefully this is from the latest push can be considered an improvement:

    0if e.errno not in [ errno.ECONNRESET,     # This might happen when the RPC server is in warmup,
    1                                          # but shut down before the call to getblockcount succeeds.
    2                    errno.ETIMEDOUT,      # Treat identical to ECONNRESET
    3                    errno.ECONNREFUSED ]: # Port not yet open?
    
  39. in test/functional/test_framework/test_node.py:328 in 5ca92b17c5 outdated
    335+                # ECONNREFUSED - Port not yet open?
    336+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
    337                     raise  # unknown OS error
    338-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    339+            except ValueError as e:
    340+                # If cookie file is missing and no rpcuser or rpcpassword; bitcoind is still starting, otherwise raise
    


    maflcko commented at 11:47 am on November 4, 2024:

    nit: I don’t think “bitcoind is still starting” is true either. It is just one example. Another example would be that the cookie file is (accidentally) disabled in the config.

    What about just `Retry if cookie file is not created and no rpcuser or rpcpassword."?


    hodlinator commented at 10:49 am on November 8, 2024:
    Agree, made comment less definitive, but still think it’s good to include the common cause: Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting
  40. maflcko commented at 6:18 pm on November 6, 2024: member
    left some nits on the first commit. Will take a look at the others later.
  41. hodlinator force-pushed on Nov 8, 2024
  42. hodlinator commented at 10:56 am on November 8, 2024: contributor
    Thanks for the reviews! Was trying to keep comments largely intact, but agree with @maflcko that they were somewhat incorrect, so altered them. Also renamed ignored_errors (introduced by this PR) to suppressed_errors since it felt more fitting and also matches the updated comments.
  43. DrahtBot added the label CI failed on Nov 8, 2024
  44. DrahtBot commented at 11:32 am on November 8, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32707688603

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  45. hodlinator force-pushed on Nov 8, 2024
  46. DrahtBot removed the label CI failed on Nov 8, 2024
  47. tdb3 approved
  48. tdb3 commented at 5:32 pm on November 9, 2024: contributor
    Code Review re ACK 68b0530d5fe867dbee5e551ecf6d491fecbc77f1
  49. m3dwards commented at 10:40 am on November 18, 2024: contributor

    Concept ACK

    I like the motivation to cleanup the logs and remove the cascade of exceptions.

    To test, I modified the included test to see what errors a user would have seen without the assertion:

     0--- a/test/functional/feature_framework_rpc_failure.py
     1+++ b/test/functional/feature_framework_rpc_failure.py
     2@@ -23,7 +23,8 @@ class FeatureFrameworkRPCFailure(BitcoinTestFramework):
     3         self.log.info(f"Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
     4         self.rpc_timeout = 0
     5
     6-        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)
     7+        BitcoinTestFramework.setup_network(self)
     8
     9     def run_test(self):
    10         pass
    
     02024-11-18T09:59:33.936000Z TestFramework (INFO): PRNG seed is: 6704525597374825472
     12024-11-18T09:59:33.937000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_kccj2gaf
     22024-11-18T09:59:33.939000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
     32024-11-18T09:59:33.944000Z TestFramework (ERROR): Assertion failed
     4Traceback (most recent call last):
     5  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     6    node.wait_for_rpc_connection()
     7  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 335, in wait_for_rpc_connection
     8    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
     9  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 198, in _raise_assertion_error
    10    raise AssertionError(self._node_msg(msg))
    11AssertionError: [node 0] Unable to connect to bitcoind after 0s
    12
    13During handling of the above exception, another exception occurred:
    14
    15Traceback (most recent call last):
    16  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 131, in main
    17    self.setup()
    18  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 302, in setup
    19    self.setup_network()
    20  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 24, in setup_network
    21    BitcoinTestFramework.setup_network(self)
    22  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 396, in setup_network
    23    self.setup_nodes()
    24  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    25    self.start_nodes()
    26  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    27    self.stop_nodes()
    28  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    29    node.stop_node(wait=wait, wait_until_stopped=False)
    30  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
    31    self.stop(wait=wait)
    32    ^^^^^^^^^
    33  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
    34    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    35           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    36AssertionError: [node 0] Error: no RPC connection
    372024-11-18T09:59:33.997000Z TestFramework (INFO): Stopping nodes
    38Traceback (most recent call last):
    39  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
    40    FeatureFrameworkRPCFailure(__file__).main()
    41  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 155, in main
    42    exit_code = self.shutdown()
    43                ^^^^^^^^^^^^^^^
    44  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 318, in shutdown
    45    self.stop_nodes()
    46  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    47    node.stop_node(wait=wait, wait_until_stopped=False)
    48  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
    49    self.stop(wait=wait)
    50    ^^^^^^^^^
    51  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
    52    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    53           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    54AssertionError: [node 0] Error: no RPC connection
    55[node 0] Cleaning up leftover process
    
     02024-11-18T10:09:52.135000Z TestFramework (INFO): PRNG seed is: 7321683449385937851
     12024-11-18T10:09:52.135000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
     22024-11-18T10:09:52.137000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
     32024-11-18T10:09:52.141000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 47850 so that wait_until_stopped will not time out.
     42024-11-18T10:09:52.194000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
     7    self.setup()
     8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
     9    self.setup_network()
    10  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 26, in setup_network
    11    BitcoinTestFramework.setup_network(self)
    12  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    13    self.setup_nodes()
    14  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    15    self.start_nodes()
    16  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
    17    node.wait_for_rpc_connection()
    18  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 344, in wait_for_rpc_connection
    19    self._raise_assertion_error(f"Unable to connect to bitcoind after {self.rpc_timeout}s (ignored errors: {str(dict(suppressed_errors))}, latest error: {latest_error})")
    20  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 199, in _raise_assertion_error
    21    raise AssertionError(self._node_msg(msg))
    22AssertionError: [node 0] Unable to connect to bitcoind after 0s (ignored errors: {}, latest error: )
    232024-11-18T10:09:52.250000Z TestFramework (INFO): Stopping nodes
    242024-11-18T10:09:52.250000Z TestFramework (WARNING): Not cleaning up dir /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
    252024-11-18T10:09:52.250000Z TestFramework (ERROR): Test failed. Test logging available at /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a/test_framework.log
    262024-11-18T10:09:52.250000Z TestFramework (ERROR):
    272024-11-18T10:09:52.251000Z TestFramework (ERROR): Hint: Call /Users/max/source/bitcoin/test/functional/combine_logs.py '/var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a' to consolidate all logs
    282024-11-18T10:09:52.251000Z TestFramework (ERROR):
    292024-11-18T10:09:52.251000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
    302024-11-18T10:09:52.251000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
    312024-11-18T10:09:52.251000Z TestFramework (ERROR):
    

    I prefer the logs with this change!

    Next I modified bitcoind so that it would fail to start and ran a random functional test (p2p_blockfilters.py)

     02024-11-18T09:39:40.861000Z TestFramework (INFO): PRNG seed is: 8730495587434508112
     12024-11-18T09:39:40.862000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_0gw_zhhl
     22024-11-18T09:39:41.372000Z TestFramework (ERROR): Assertion failed
     3Traceback (most recent call last):
     4  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     5    node.wait_for_rpc_connection()
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 276, in wait_for_rpc_connection
     7    raise FailedToStartError(self._node_msg(
     8test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
     9
    10During handling of the above exception, another exception occurred:
    11
    12Traceback (most recent call last):
    13  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    14    self.setup()
    15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    16    self.setup_network()
    17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    18    self.setup_nodes()
    19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    20    self.start_nodes()
    21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    22    self.stop_nodes()
    23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    24    node.stop_node(wait=wait, wait_until_stopped=False)
    25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
    26    self.stop(wait=wait)
    27    ^^^^^^^^^
    28  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
    29    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    30           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    31AssertionError: [node 0] Error: no RPC connection
    322024-11-18T09:39:41.431000Z TestFramework (INFO): Stopping nodes
    33Traceback (most recent call last):
    34  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
    35    CompactFiltersTest(__file__).main()
    36  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    37    exit_code = self.shutdown()
    38                ^^^^^^^^^^^^^^^
    39  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    40    self.stop_nodes()
    41  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
    42    node.stop_node(wait=wait, wait_until_stopped=False)
    43  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
    44    self.stop(wait=wait)
    45    ^^^^^^^^^
    46  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
    47    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
    48           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    49AssertionError: [node 0] Error: no RPC connection
    50[node 1] Cleaning up leftover process
    51[node 0] Cleaning up leftover process
    
     02024-11-15T13:07:28.308000Z TestFramework (INFO): PRNG seed is: 2784953090753807617
     12024-11-15T13:07:28.308000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_udgky3fr
     22024-11-15T13:07:28.821000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
     32024-11-15T13:07:28.821000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
     42024-11-15T13:07:28.821000Z TestFramework (ERROR): Assertion failed
     5Traceback (most recent call last):
     6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
     7    node.wait_for_rpc_connection()
     8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
     9    raise FailedToStartError(self._node_msg(
    10test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
    11
    12During handling of the above exception, another exception occurred:
    13
    14Traceback (most recent call last):
    15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
    16    self.setup()
    17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    18    self.setup_network()
    19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    20    self.setup_nodes()
    21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    22    self.start_nodes()
    23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
    24    self.stop_nodes()
    25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
    26    node.wait_until_stopped()
    27  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
    28    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    29  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
    30    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
    31           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    32  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
    33    if predicate():
    34       ^^^^^^^^^^^
    35  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
    36    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    37                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    38  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
    39    assert return_code == expected_ret_code, self._node_msg(
    40           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    41AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
    422024-11-15T13:07:28.880000Z TestFramework (INFO): Stopping nodes
    432024-11-15T13:07:28.881000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
    442024-11-15T13:07:28.881000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
    45Traceback (most recent call last):
    46  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
    47    CompactFiltersTest(__file__).main()
    48  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
    49    exit_code = self.shutdown()
    50                ^^^^^^^^^^^^^^^
    51  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
    52    self.stop_nodes()
    53  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
    54    node.wait_until_stopped()
    55  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
    56    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    57  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
    58    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
    59           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    60  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
    61    if predicate():
    62       ^^^^^^^^^^^
    63  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
    64    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    65                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    66  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
    67    assert return_code == expected_ret_code, self._node_msg(
    68           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    69AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
    70[node 1] Cleaning up leftover process 26868
    71[node 0] Cleaning up leftover process 26867
    

    I’m not sure if it’s the intention of this PR to improve things in that scenario but the log doesn’t seem much better and if anything it was slightly easier to understand before.

    Before this change the logs showed AssertionError: [node 0] Error: no RPC connection and after this change that is now AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping. I feel like no RPC connection is the better error here?

  50. hodlinator force-pushed on Nov 18, 2024
  51. hodlinator commented at 11:13 pm on November 18, 2024: contributor

    Thanks @m3dwards for having a look!

    I had not tested the case of bitcoind exiting early together with my suggested change here (focused more on stalling as in one of the issues inspiring it).

    Made some changes to this part of TestNode.stop_node() to handle an already dead process:

    https://github.com/bitcoin/bitcoin/blob/ba28593147ccc880c3a4d40db0d4ef57f4766254/test/functional/test_framework/test_node.py#L412-L420

    With those changes, the output for such a case is now:

     0 build/test/functional/p2p_blockfilters.py 
     12024-11-18T23:01:17.204000Z TestFramework (INFO): PRNG seed is: 5667811122664278008
     22024-11-18T23:01:17.205000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_mzr3jg96
     32024-11-18T23:01:17.458000Z TestFramework.node0 (WARNING): Process 175180 already died with return code 1.
     42024-11-18T23:01:17.458000Z TestFramework.node1 (WARNING): Process 175181 already died with return code 1.
     52024-11-18T23:01:17.459000Z TestFramework (ERROR): Unexpected exception caught during testing
     6Traceback (most recent call last):
     7  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
     8    self.setup()
     9  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
    10    self.setup_network()
    11  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
    12    self.setup_nodes()
    13  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
    14    self.start_nodes()
    15  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
    16    node.wait_for_rpc_connection()
    17  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
    18    raise FailedToStartError(self._node_msg(
    19test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization. 
    202024-11-18T23:01:17.513000Z TestFramework (INFO): Stopping nodes
    212024-11-18T23:01:17.513000Z TestFramework (WARNING): Not cleaning up dir /tmp/bitcoin_func_test_mzr3jg96
    222024-11-18T23:01:17.513000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/bitcoin_func_test_mzr3jg96/test_framework.log
    232024-11-18T23:01:17.513000Z TestFramework (ERROR): 
    242024-11-18T23:01:17.513000Z TestFramework (ERROR): Hint: Call /home/hodlinator/bitcoin/test/functional/combine_logs.py '/tmp/bitcoin_func_test_mzr3jg96' to consolidate all logs
    252024-11-18T23:01:17.513000Z TestFramework (ERROR): 
    262024-11-18T23:01:17.514000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
    272024-11-18T23:01:17.514000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
    282024-11-18T23:01:17.514000Z TestFramework (ERROR):
    
  52. m3dwards commented at 1:13 pm on November 21, 2024: contributor

    Made some changes to this part of TestNode.stop_node() to handle an already dead process

    Looks cleaner.

    ACK ba28593147ccc880c3a4d40db0d4ef57f4766254

  53. DrahtBot requested review from tdb3 on Nov 21, 2024
  54. in test/functional/test_framework/test_node.py:8 in ba28593147 outdated
    4@@ -5,6 +5,7 @@
    5 """Class for bitcoind node under test"""
    6 
    7 import contextlib
    8+from collections import Counter
    


    hodlinator commented at 8:31 pm on December 3, 2024:
    (Could possibly use collections.defaultdict(int) instead of Counter iff I really need to retouch).
  55. DrahtBot added the label CI failed on Dec 11, 2024
  56. DrahtBot commented at 1:30 am on December 11, 2024: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33168290506

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  57. hodlinator force-pushed on Dec 11, 2024
  58. hodlinator commented at 8:29 pm on December 11, 2024: contributor
    • Fixed ruff linter error reported by CI due to #31417 being merged.
    • Added feature_framework_stop_node.py-test to verify clean shutdown of TestNode when the bitcoind-process already exited/crashed, issue found earlier by m3dwards. (Also added comment justifying setting of self.expected_ret_code in test_node.py).
    • Switched from Counter to simpler collections.defaultdict(int)-type.
    • Updated copyright headers to include “-present”.
  59. hodlinator force-pushed on Dec 11, 2024
  60. DrahtBot removed the label CI failed on Dec 11, 2024
  61. in test/functional/test_framework/test_node.py:329 in 5321eb089c outdated
    337+                # ECONNREFUSED - Port not yet open?
    338+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
    339                     raise  # unknown OS error
    340-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    341+            except ValueError as e:
    342+                # Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting
    


    ryanofsky commented at 7:59 pm on January 6, 2025:

    In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

    Found both this comment “cookie file is missing” and previous comment “cookie file not found” confusing because it is not clear why the cookie file would not be present. IMO saying “node has not generated the cookie file yet” would be clearer.


    hodlinator commented at 4:41 pm on January 9, 2025:
    New version: # Suppress if cookie file isn't generated yet and no rpcuser or rpcpassword; bitcoind may be starting.
  62. in test/functional/test_framework/test_node.py:453 in db0e1580a5 outdated
    447@@ -447,7 +448,9 @@ def is_node_stopped(self, *, expected_stderr="", expected_ret_code=0):
    448 
    449     def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
    450         if "expected_ret_code" not in kwargs:
    451-            kwargs["expected_ret_code"] = 1 if expect_error else 0  # Whether node shutdown return EXIT_FAILURE or EXIT_SUCCESS
    452+            # Whether node shutdown is expected to return EXIT_FAILURE (1) or
    453+            # EXIT_SUCCESS (self.expected_ret_code).
    454+            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code
    


    ryanofsky commented at 8:30 pm on January 6, 2025:

    In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

    Now that this logic is getting more complicated with self.expected_ret_code I think the expect_error argument should be dropped so the code can be more understandable. It looks like there is only a single expect_error usage in feature_abortnode.py and it can be easily replaced by using expected_ret_code. Would suggest:

     0--- a/test/functional/feature_abortnode.py
     1+++ b/test/functional/feature_abortnode.py
     2@@ -36,7 +36,7 @@ class AbortNodeTest(BitcoinTestFramework):
     3 
     4             # Check that node0 aborted
     5             self.log.info("Waiting for crash")
     6-            self.nodes[0].wait_until_stopped(timeout=5, expect_error=True, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
     7+            self.nodes[0].wait_until_stopped(timeout=5, expected_ret_code=1, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
     8         self.log.info("Node crashed - now verifying restart fails")
     9         self.nodes[0].assert_start_raises_init_error()
    10 
    11--- a/test/functional/test_framework/test_node.py
    12+++ b/test/functional/test_framework/test_node.py
    13@@ -460,11 +460,11 @@ class TestNode():
    14         self.log.debug("Node stopped")
    15         return True
    16 
    17-    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
    18+    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, **kwargs):
    19         if "expected_ret_code" not in kwargs:
    20             # Whether node shutdown is expected to return EXIT_FAILURE (1) or
    21             # EXIT_SUCCESS (self.expected_ret_code).
    22-            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code
    23+            kwargs["expected_ret_code"] = self.expected_ret_code
    24         self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
    25 
    26     def replace_in_config(self, replacements):
    

    hodlinator commented at 4:45 pm on January 9, 2025:
    Taken, although I moved some of the handling of expected_ret_code directly into is_node_stopped().
  63. in test/functional/test_framework/test_node.py:431 in 4ab7879ec6 outdated
    421             else:
    422-                self.stop()
    423-        except http.client.CannotSendRequest:
    424-            self.log.exception("Unable to stop node.")
    425+                self.log.warning(f"Process {self.process.pid} already died with return code {return_code}.")
    426+                # If we no longer have a running process once stop_node() is called, assume we've already triggered an
    


    ryanofsky commented at 8:47 pm on January 6, 2025:

    In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

    This change doesn’t seem safe. The stop_node method can be called by arbitrary test code at arbitrary times, including when self.rpc_connected is False, and I don’t think it should be suppressing errors unless the caller has explicitly requested that errors should not be passed on.

    Would suggest narrowing the error suppression behavior, so errors and bugs will be less likely to go unnoticed because of this change:

     0--- a/test/functional/feature_framework_stop_node.py
     1+++ b/test/functional/feature_framework_stop_node.py
     2@@ -42,7 +42,7 @@ class FeatureFrameworkStopNode(BitcoinTestFramework):
     3         assert self.nodes[0].running, \
     4             "The process should still be flagged as running before calling stop_node()"
     5         self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
     6-        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)
     7+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error, force=True)
     8 
     9 if __name__ == '__main__':
    10     FeatureFrameworkStopNode(__file__).main()
    11--- a/test/functional/test_framework/test_framework.py
    12+++ b/test/functional/test_framework/test_framework.py
    13@@ -315,7 +315,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    14         if not self.options.noshutdown:
    15             self.log.info("Stopping nodes")
    16             if self.nodes:
    17-                self.stop_nodes()
    18+                self.stop_nodes(force=self.success != TestStatus.PASSED)
    19         else:
    20             for node in self.nodes:
    21                 node.cleanup_on_exit = False
    22@@ -569,7 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    23                 node.wait_for_rpc_connection()
    24         except Exception:
    25             # If one node failed to start, stop the others
    26-            self.stop_nodes()
    27+            self.stop_nodes(force=True)
    28             raise
    29 
    30         if self.options.coveragedir is not None:
    31@@ -580,11 +580,11 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
    32         """Stop a bitcoind test node"""
    33         self.nodes[i].stop_node(expected_stderr, wait=wait)
    34 
    35-    def stop_nodes(self, wait=0):
    36+    def stop_nodes(self, wait=0, force=False):
    37         """Stop multiple bitcoind test nodes"""
    38         for node in self.nodes:
    39             # Issue RPC to stop nodes
    40-            node.stop_node(wait=wait, wait_until_stopped=False)
    41+            node.stop_node(wait=wait, wait_until_stopped=False, force=force)
    42 
    43         for node in self.nodes:
    44             # Wait for nodes to stop
    45--- a/test/functional/test_framework/test_node.py
    46+++ b/test/functional/test_framework/test_node.py
    47@@ -393,12 +393,30 @@ class TestNode():
    48     def version_is_at_least(self, ver):
    49         return self.version is None or self.version >= ver
    50 
    51-    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
    52-        """Stop the node."""
    53+    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True, force=False):
    54+        """Stop the node.
    55+
    56+        Args:
    57+            force (bool): Stop the node forcefully if necessary, and change
    58+                errors that would otherwise be exceptions into logged warnings.
    59+
    60+        Notes:
    61+            - The force argument should only be set to true when a test has
    62+            already failed, and the caller is already throwing an exception of
    63+            its own, and wants its own exception to take precedence over errors
    64+            from this function. Since the error that happened was presumably
    65+            unknown and unexpected, the node may have already crashed or may
    66+            need to be shut down forcefully.
    67+        """
    68         if not self.running:
    69             return
    70         self.log.debug("Stopping node")
    71-        if self.rpc_connected:
    72+
    73+        # If self.running is true but self.rpc_connected is not true, it means
    74+        # that the process has been started, but wait_for_rpc_connection() was
    75+        # not able to make a successful connection. In this case, if force is
    76+        # true, avoid trying to make another RPC call and just stop the process.
    77+        if not force or self.rpc_connected:
    78             try:
    79                 # Do not use wait argument when testing older nodes, e.g. in wallet_backwards_compatibility.py
    80                 if self.version_is_at_least(180000):
    

    Note: The diff above is still calling the stop RPC instead of killing the node when self.rpc_connected is true. But it might actually be more desirable to change if not force or self.rpc_connected: there to if not force: and prefer to kill node processes instead of stopping them after test failures. Would not suggest doing this here, since it would change other behavior beyond the scope of the PR, but I could imagine cases where changing this might reduce noise and help debugging errors in CI.


    hodlinator commented at 10:31 am on January 9, 2025:

    It seems wrong to me to attempt using the RPC connection in stop_node() when we known that self.rpc_connected == False. New version has:

    0if self.rpc_connected:
    1    # Call stop-RPC...
    2elif force:
    3    # Kill etc...
    4else:
    5    raise RuntimeError(f"Cannot call stop-RPC as we don't have an RPC connection to process {self.process.pid}.")
    

    Edit: (Ended up calling it avoid_exceptions which I think is slightly better than force).

  64. in test/functional/test_framework/test_node.py:428 in 4ab7879ec6 outdated
    415+            return_code = self.process.poll()
    416+            if return_code is None:
    417+                self.log.warning("Cannot call stop-RPC as we are not connected. "
    418+                    f"Killing process {self.process.pid} so that wait_until_stopped will not time out.")
    419+                self.expected_ret_code = 1 if platform.system() == "Windows" else -9
    420+                self.process.kill()
    


    ryanofsky commented at 4:05 pm on January 7, 2025:

    In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

    I think this is probably the simplest way to write this code and we should stick with it for now, but just want to point out there is a minor race condition here because the process could end between the time that poll and kill are called and then the kill call could throw an exception.

    Another separate idea is that it might be safer to write self.expected_ret_code = self.process.wait() instead of hardcoding 1/-9 return codes. Again though would be inclined to stick with the working code unless there is a more of a reason to change.


    hodlinator commented at 9:52 am on January 9, 2025:

    Thanks for pointing out the race condition! It is not so likely if we get here because of RPC connection timeout, but still possible.

    I tried adding a second .kill() after the first one and it does not raise an exception, and no exception is documented.

  65. in test/functional/feature_framework_stop_node.py:45 in 608b9ff097 outdated
    40+            self=self.nodes[0]
    41+        )
    42+        assert self.nodes[0].running, \
    43+            "The process should still be flagged as running before calling stop_node()"
    44+        self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
    45+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)
    


    ryanofsky commented at 5:02 pm on January 7, 2025:

    In commit “test: Add feature_framework_stop_node.py” (608b9ff097214bff3527cc240b7be313c62131d5)

    One problem with this test as currently written is that it prints a real warning to the output (like TestFramework.node0 (WARNING): Process 579831 already died with return code 1). This is potentially confusing because this warning is expected and should not be a cause for concern, but you can’t know that from looking at test output.

    I can think of a few possible solutions:

    • Redirecting log output during the stop_node call and checking for the expected warning with an assertion here.
    • Letting the stop_node function throw exceptions instead of converting errors into warnings, then catching and handling the exceptions in the places (just 2 places currently) where they are expected.
    • Not changing anything but just add a warning before the warning instructing reader to ignore the next warning.

    hodlinator commented at 4:36 pm on January 9, 2025:
    Went for option 3, making clear in the log message just above that a WARNING is expected.
  66. in test/functional/feature_framework_rpc_failure.py:26 in f67d896357 outdated
    21+
    22+    def setup_network(self):
    23+        self.log.info("Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
    24+        self.rpc_timeout = 0
    25+
    26+        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)
    


    ryanofsky commented at 5:20 pm on January 7, 2025:

    In commit “test: Add feature_framework_rpc_failure.py” (f67d8963575feecdd4914c1a56bab4e0bebc000d)

    Note: there is no test coverage for the “ignored errors” and “latest error” parts of the message. I think it would be good to add some coverage for this, even if it just checking for 0 and None, so there is a at least a hint that more coverage could be added here, especially since it’s possible to imagine this functionality becoming broken in the future and more coverage helping prevent this.


    hodlinator commented at 4:37 pm on January 9, 2025:
    Added test/functional/feature_framework_rpc_failure_details.py.
  67. in test/functional/test_framework/test_node.py:160 in db0e1580a5 outdated
    156@@ -157,6 +157,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
    157         self.rpc = None
    158         self.url = None
    159         self.log = logging.getLogger('TestFramework.node%d' % i)
    160+        self.expected_ret_code = 0 # EXIT_SUCCESS
    


    ryanofsky commented at 6:13 pm on January 7, 2025:

    In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

    Note: I like the changes in this commit, and think the new TestNode.expected_ret_code member could be a useful way to signal expected statuses when shutdown starts. But just want to point out that I don’t think adding a new expected_ret_code member is strictly necessary for this PR to work. The only place it is actually used later in the PR is in the stop_node function when it is killing a failed process, and it seems to me that stop_node could just as easily set self.running = False instead and call process.wait() itself to reap the pid without blocking.


    hodlinator commented at 4:39 pm on January 9, 2025:
    Interesting, I do agree there is a spaghetti/complexity factor to expected_ret_code, so might revisit this.
  68. ryanofsky commented at 6:54 pm on January 7, 2025: contributor

    Code review 608b9ff097214bff3527cc240b7be313c62131d5

    Nice changes in this PR. I’ve been confused by the error output from wait_for_rpc_connection() before, so think this will be helpful.

    One thing that wasn’t obvious to me from the PR description is that this improving error output in two separate ways:

    • One change here is adding error output to wait_for_rpc_connection. Previously errors that happened during polling before the final “Unable to connect to bitcoind” error were silently ignored, so less information would be visible about what was causing the problem. Now counts of the different errors and the most recent message are shown.

    • Another change here is suppressing error output after the wait_for_rpc_connection error, specifically suppressing exceptions from in stop_node when trying to call stop method fails.

    I didn’t ACK the PR because I think the changes to the stop_node method are overbroad, and I suggested a way to limit them. Other than that I think this all looks very good.

  69. hodlinator renamed this:
    test: Shut down framework cleanly on RPC connection failure
    qa: Shut down framework cleanly on RPC connection failure
    on Jan 9, 2025
  70. hodlinator force-pushed on Jan 9, 2025
  71. hodlinator commented at 4:55 pm on January 9, 2025: contributor

    Updated based off ryanofsky’s review.

    Hopefully latest PR summary is slightly clearer.

    Replaced “test” -> “qa” since PR is about functional tests and not unit tests, should conform better with CONTRIBUTING.md.

  72. hodlinator force-pushed on Jan 9, 2025
  73. DrahtBot added the label CI failed on Jan 9, 2025
  74. DrahtBot commented at 6:21 pm on January 9, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35384673501

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  75. hodlinator force-pushed on Jan 9, 2025
  76. DrahtBot removed the label CI failed on Jan 9, 2025
  77. in test/functional/test_framework/test_framework.py:572 in 8148e4da2c outdated
    568@@ -569,7 +569,7 @@ def start_nodes(self, extra_args=None, *args, **kwargs):
    569                 node.wait_for_rpc_connection()
    570         except Exception:
    571             # If one node failed to start, stop the others
    572-            self.stop_nodes()
    573+            self.stop_nodes(avoid_exceptions=True)
    


    maflcko commented at 9:10 am on January 10, 2025:

    coming from the conflicting pull, I don’t think this call should exist in the first place for two reasons:

    • It doesn’t work in most cases, because a failure to start any node likely means the RPC isn’t up for the other nodes either, so the call will just fail in most cases. Effectively just using the kill fallback.
    • It doesn’t make sense to special case startup. Nodes can at any time fail to start (not only in start_nodes), or crash, and adding this exception handling to all places isn’t useful nor trivially possible.

    So I removed it in #31620.

    (Just leaving a comment here to document the merge conflict between the two changes. Should be trivial to solve either way, so nothing to be done here, I think.)

  78. hodlinator commented at 12:53 pm on January 17, 2025: contributor

    Excuse me from being brief earlier.

    Taking the time do document changes in latest push - git range-diff master 608b9ff 8148e4d

    • Inlined comments about different errno-codes instead of duplicating the identifiers in the comment.
    • Made missing cookie file comment slightly more informative.
    • Removed expect_error in favor of expected_ret_code after ryanofsky’s suggestion (grepping for uses of the former). This helps decrease the amount of added complexity to test_node.py.
    • Add avoid_exceptions parameter to stop_node() to make the behavior change more surgical, thanks to ryanofsky’s suggestion. We now only kill the node under more narrow conditions.
    • Clarified in log output of tests when a WARNING is expected (ryanofsky suggestion).
    • ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 - Added feature_framework_rpc_failure_details.py to verify that suppressed exception information is included in RPC connection timeout message (ryanofsky suggestion).
    • Switched feature_framework_stop_node.py from using TestNode.start() + TestNode.wait_for_rpc_connection() to BitcoinTestFramework.start_node() which combines the two.
  79. DrahtBot added the label Needs rebase on Jan 28, 2025
  80. hodlinator renamed this:
    qa: Shut down framework cleanly on RPC connection failure
    qa: Verify clean shutdown on startup failure
    on Jan 28, 2025
  81. hodlinator force-pushed on Jan 28, 2025
  82. hodlinator commented at 10:45 pm on January 28, 2025: contributor

    Resolved conflict with @maflcko’s #31620 which solved the core issue more elegantly.

    I think there is still some use in the tests from this PR, as we want to avoid knock-on exceptions being reintroduced. I’ve reworked the tests from 3 to 1 Python file. They now launch a child process of which keeps the success-state of the parent BitcoinTestFramework separate from the expected failures of the child processes.

    The final commit (f2e3721e018c3038d5d648ec7ae08e96418f3c29) is what remains of the core of what this PR used to be, but I haven’t been able to construe a test that exercises it. So if reviewers prefer, I can drop it.

    Test instructions in updated PR summary.

  83. hodlinator force-pushed on Jan 28, 2025
  84. DrahtBot added the label CI failed on Jan 28, 2025
  85. DrahtBot commented at 11:12 pm on January 28, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/36321182366

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  86. DrahtBot removed the label CI failed on Jan 29, 2025
  87. DrahtBot removed the label Needs rebase on Jan 29, 2025
  88. hodlinator force-pushed on Jan 29, 2025
  89. hodlinator commented at 4:37 pm on January 29, 2025: contributor

    Latest push adds 2347ce8f018b1b9b8d6ef1933a778023fd228b7f which fixes a Windows Python issue found by CI. It’s still a mystery to me why the old test from a previous version of this PR in ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 didn’t provoke the same issue on Windows CI (successful run including that commit). CI output states they’re running the same Python version.

    Test was also updated to re-launch child processes using the same Python interpreter as the parent, as I was ending up with mixed 64/32-bit interpeters on Windows. Also switches from using subprocesses.getoutput() which is marked as legacy, to subprocess.run() with a timeout.

  90. hodlinator force-pushed on Jan 29, 2025
  91. hodlinator commented at 8:50 pm on January 29, 2025: contributor

    Sorry about the churn here. Went back to ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 and was able to reproduce the TimeoutError.errno==None-issue there too, which is a relief. It seemed to be a bit timing-sensitive, so in the latest push I increased the still low rpc_timeout-override and applied self.options.timeout_factor to it in hopes of making it more robust.

    Edit: The above was for a18f7c8a252bdffd181fe696ede8a91e285e1cce. Fixed trivial type-error in next push, 8102f3d8e0b6715433eeda78b9b57a83109fc45e.

  92. DrahtBot added the label CI failed on Jan 29, 2025
  93. hodlinator force-pushed on Jan 29, 2025
  94. DrahtBot removed the label CI failed on Jan 30, 2025
  95. in test/functional/test_framework/test_node.py:315 in 8fa04cb13a outdated
    310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    311                 self.rpc_connected = True
    312                 self.url = self.rpc.rpc_url
    313                 return
    314-            except JSONRPCException as e:  # Initialization phase
    315+            except JSONRPCException as e:
    316+                # Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).
    


    ryanofsky commented at 4:38 pm on February 12, 2025:

    In commit “qa refactor: wait_for_rpc_connection - Treat OSErrors the same” (8fa04cb13a4bf62a42b18e7c0c79151d13cb494b)

    Various thoughts:

    • I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.
    • I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.
    • I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

    Given all this, I’d suggest just keeping the comments in this except block unchanged.

    But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.


    hodlinator commented at 10:33 pm on February 12, 2025:

    I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.

    As the commit message specifies, we are changing wait_for_rpc_connection(), so the only value the “Initialization phase” comment had came from adding some context to a ~80 line long function which is part of initialization IMO. (The mention of FailedToStartError also gives a hint at what phase we are in).

    I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.

    I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

    I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

    I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (review)

    But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.

    Would rather improve the comment at the top of the function if you think it doesn’t provide enough context? https://github.com/bitcoin/bitcoin/blob/8fa04cb13a4bf62a42b18e7c0c79151d13cb494b/test/functional/test_framework/test_node.py#L264-L265

    (Edit: This comment was pending when maflcko left his below).


    maflcko commented at 9:07 am on February 13, 2025:
    • I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned

    I’d guess that it is coming from authproxy and is pretty general:

     0$ git grep '\-342\>' ./test/
     1test/functional/feature_taproot.py:# Test Taproot softfork (BIPs 340-342)
     2test/functional/rpc_bind.py:        assert_raises_rpc_error(-342, "non-JSON HTTP response with '403 Forbidden' from server", self.run_allowip_test, ['1.1.1.1'], self.non_loopback_ip, self.defaultport)
     3test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code but no JSON-RPC error'}, status)
     4test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code'}, status)
     5test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'non-200 HTTP status code'}, status)
     6test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'missing HTTP response from server'})
     7test/functional/test_framework/authproxy.py:                raise JSONRPCException({'code': -342, 'message': 'Content received with NO CONTENT status code'})
     8test/functional/test_framework/authproxy.py:                {'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)},
     9test/functional/test_framework/authproxy.py:                'code': -342, 'message': f'Cannot decode response in utf8 format, content: {data}, exception: {e}'})
    10test/functional/test_framework/test_node.py:                # -342 Service unavailable, RPC server started but is shutting down due to error
    11test/functional/test_framework/test_node.py:                if e.error['code'] != -28 and e.error['code'] != -342:
    

    ryanofsky commented at 7:30 pm on March 13, 2025:

    re: #30660 (review)

    I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

    It’s funny, because to me replacing “Initialization phase” with “Suppress these in favor of a later outcome” is literally removing the comment that explains why it is ok to suppress these errors. The previous comment made it clear these errors were being suppressed because they are expected during initialization. The new comment tells me they are being suppressed and that various other outcomes are possible later, which tells me nothing about why it is safe or good to suppress errors here.

    Maybe the problem is that I’ve wasted too much time debugging issues and being traumatized by code that suppresses errors and fails to report useful information, but I found the new comment here hard to parse, and could not figure out what it was trying to tell me, and it made me not trust this code for seeming to suppress errors without a clear reason.

    I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (comment)

    Thanks that helps. But the phrase “Service unavailable” would seem to connote that server is running but it is telling us it can’t process requests, when actually this is a low-level error coming from the JSON-RPC client. Would suggest replacing “-342 Service unavailable” with something like “-342 HTTP error” to be clearer about the source of this error.


    hodlinator commented at 2:33 pm on March 14, 2025:

    Thanks for taking the time to elaborate.

    “Initialization phase” was too terse for me - of course we’re in the initialization phase… duh. :)

    Changed to “Suppress these as they are expected during initialization.”

  96. in test/functional/feature_framework_startup_failures.py:36 in 2226e14891 outdated
    31+            self.rpc_timeout = self.options.rpc_timeout
    32+        if self.options.extra_args:
    33+            self.extra_args = [[self.options.extra_args]]
    34+
    35+    def add_options(self, parser):
    36+        parser.add_argument("--rpc_timeout", dest="rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    


    ryanofsky commented at 4:57 pm on February 12, 2025:

    In commit “qa: Add feature_framework_startup_failures.py” (2226e1489101f363114ce474d7f5e56ec50c9c6a)

    Maybe give these arguments a prefix like --internal- so it obvious they are internal, and to avoid risk of silent conflicts if real test arguments with the same names are added later.

  97. in test/functional/test_framework/test_framework.py:374 in 9d0e6a1b2a outdated
    327-            self.log.info("Not stopping nodes as test failed. The dangling processes will be cleaned up later.")
    328-        else:
    329-            self.log.info("Stopping nodes")
    330-            if self.nodes:
    331-                self.stop_nodes()
    332+        if self.nodes:
    


    ryanofsky commented at 5:10 pm on February 12, 2025:

    In commit “qa: Only log node stopping info if we have any” (9d0e6a1b2aa5be186550658ff85e5ad643554338)

    Would be helpful if commit description stated motivation for this change. After looking at the stop_nodes() method implementation it seem like this is a logging-only change and point is just to remove stopping log statements when there are no nodes. But it would be good if actual intent of change were explicit.


    hodlinator commented at 8:33 am on February 13, 2025:
    • Saw the code still had an inner if self.nodes-check - removed in new push.
    • Slightly clarified commit message, describing the change (even though it adds some negation) rather than describing then end state of the change.
  98. in test/functional/test_framework/test_node.py:403 in 8102f3d8e0 outdated
    399@@ -400,14 +400,14 @@ def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
    400         if not self.running:
    401             return
    402         self.log.debug("Stopping node")
    403-        try:
    404+        if self.rpc_connected:
    


    ryanofsky commented at 5:12 pm on February 12, 2025:

    In commit “qa: Avoid calling stop-RPC if not connected” (8102f3d8e0b6715433eeda78b9b57a83109fc45e)

    This change seems good, but unclear what motivation is. Is it supposed to be a refactoring that makes code more robust, or were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.


    hodlinator commented at 8:40 am on February 13, 2025:
    Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

    ryanofsky commented at 6:58 pm on March 13, 2025:

    re: #30660 (review)

    Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

    I’m still not clear when this case would ever be expected to happen.

    IIUC, there is no runtime error that could trigger this, and nothing non-test code could could do to trigger it, and the only thing that could trigger it is incorrectly written test code?

    If so, it would seem clearer to replace the exception with an assertion like assert self.rpc_connected, self._node_msg(...) to be clear this is a bug in the test code, not some other failure.


    hodlinator commented at 2:28 pm on March 14, 2025:

    Correct, especially now after maflcko’s change in fae3bf6b870eb0f9cddd1adac82ba72890806ae3 to not call stop() on test failure, I agree it makes sense to assert instead, good suggestion! Fixed.

    were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.

    It was a common knock-on issue from attempting to call stop-RPC on a non-functional RPC interface, see for example “CannotSendRequest” in the log of #21367. Should no longer happen due to maflcko’s change and us guarding against that kind of thing with the assert.

    Clarified commit description.

  99. ryanofsky approved
  100. ryanofsky commented at 5:26 pm on February 12, 2025: contributor

    Code review ACK 8102f3d8e0b6715433eeda78b9b57a83109fc45e. Looks like #31620 made some overlapping changes, but this PR still seems very useful.

    From PR description:

    Before #31620, failure to establish RPC connections would result in an attempt to stop the node via RPC, triggering knock-on exceptions. This would muddy the waters for developers looking for root causes of test failures. This PR used to solve that issue in a more roundabout way than aforementioned PR, but now merely adds tests to ensure knock-on errors/exceptions are not reintroduced.

    Even after reviewing this PR twice and rereading this paragraph several times, I find this pretty difficult to understand. Would suggest giving a more straightforward description like:

    • This PR improves handling of startup errors in functional tests. Specifically it improves wait_for_rpc_connection so instead of returning a vague Unable to connect to bitcoind error, it returns details about specific failures leading up to it. This PR also adds comments, refactors code, improves logging, detects timeout errors on windows, adds a test to make sure error handling behavior is not broken in the future.

    I think it is ok to mention #31620 and mention the earlier version of this PR in a postscript to provide context for early review comments, but the main goal of the PR description should be to describe the PR as it currently is.

  101. DrahtBot requested review from m3dwards on Feb 12, 2025
  102. hodlinator force-pushed on Feb 13, 2025
  103. hodlinator commented at 10:23 am on February 13, 2025: contributor

    Thanks for bearing with me @ryanofsky. :)

    Updated PR description based on feedback and addressed inline comments.

  104. hodlinator force-pushed on Feb 13, 2025
  105. DrahtBot added the label CI failed on Feb 13, 2025
  106. DrahtBot commented at 10:27 am on February 13, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37155473613

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  107. DrahtBot removed the label CI failed on Feb 13, 2025
  108. ryanofsky approved
  109. ryanofsky commented at 8:02 pm on March 13, 2025: contributor
    Code review ACK c107aa8c873cb3c031b341baef487638133823f1. Changes since last review were just updating some messages and expanding the test a little. Thanks for the updates. This still looks good to me and I left some more comments, but they are not important.
  110. hodlinator force-pushed on Mar 14, 2025
  111. hodlinator force-pushed on Mar 14, 2025
  112. hodlinator force-pushed on Mar 14, 2025
  113. hodlinator force-pushed on Mar 14, 2025
  114. hodlinator force-pushed on Mar 14, 2025
  115. in test/functional/test_framework/test_node.py:329 in fa2178e67f outdated
    321@@ -322,13 +322,18 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
    323                 latest_error = repr(e)
    324             except OSError as e:
    325+                error_num = e.errno
    326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    327+                if error_num is None:
    328+                    assert isinstance(e, TimeoutError)
    329+                    error_num = errno.ETIMEDOUT
    


    m3dwards commented at 5:35 pm on March 17, 2025:

    The inclusion of an assert suggests to me that we expect that all OSErrors without errno are TimeoutErrors? Is this actually what we expect?

    If not should the if statement be changed to if error_num is None and isinstance(e, TimeoutError):


    hodlinator commented at 9:20 pm on March 17, 2025:

    Had a feeling OSErrors must always have that field set (except for the socket TimeoutError bug), but the docs indicate it’s a bit more loose:

    The second form of the constructor sets the corresponding attributes, described below. The attributes default to None if not specified.

    Could change the condition as you suggestion, but it would be prudent to assert that error_num isn’t None for other cases:

    0                if error_num is None and isinstance(e, TimeoutError):
    1                    error_num = errno.ETIMEDOUT
    2                else:
    3                    assert error_num is not None
    

    So not sure about that change as it would repeat one of the conditions. Maybe just adding an error message to what we have now would be okay?

    0                if error_num is None:
    1                    assert isinstance(e, TimeoutError), "We only currently support " \
    2                        f"inferring errno from TimeoutErrors, encountered unhandled: {e!r}"
    3                    error_num = errno.ETIMEDOUT
    

    m3dwards commented at 12:29 pm on March 19, 2025:

    This is only a NIT so feel free to leave it as it is but in my mind an assert is used to signal an expectation of predicted state. In Python asserts are not even executed if run in optimised mode. It seems like we are using it more like an exception here?

    0                if error_num is None:
    1                    if isinstance(e, TimeoutError):
    2                        error_num = errno.ETIMEDOUT
    3                    else:
    4                        raise Exception("errno is required")
    

    hodlinator commented at 2:02 pm on March 19, 2025:
    Not so concerned about people running in optimised node since 5+ functional tests fail in that way, but changed in latest push to: https://github.com/bitcoin/bitcoin/blob/cac218dcbdd2b5682a7ac8659aac0306e68a65f5/test/functional/test_framework/test_node.py#L325-L331

    ryanofsky commented at 1:51 pm on March 26, 2025:

    re: #30660 (review)

    In commit “qa: Workaround Windows issue with socket timeouts” (c92a8fcd79d76b23c366a07be1c8778db7bea891)

    FWIW, I think the best way to write this would be simply:

    0# Work around issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    1if error_num is None and isinstance(e, TimeoutError):
    2    error_num = errno.ETIMEDOUT
    3
    4if error_num not in [ ... ]:
    5   raise
    

    Raising a new “Cannot infer errno when unset” exception just adds complexity and could make it harder to debug since the original exception contains more information than the new exception.


    hodlinator commented at 2:26 pm on March 26, 2025:

    Thanks for re-reviewing this again!

    Yes, in that way we would make error_num == None be one of the cases for which we raise. It’s obviously better to raise for something unexpected than to suppress, which should lead to it being tracked down and intentionally categorized for suppression or not.

    Updated in latest push.

  116. hodlinator force-pushed on Mar 19, 2025
  117. ryanofsky approved
  118. ryanofsky commented at 2:03 pm on March 26, 2025: contributor

    Code review ACK cac218dcbdd2b5682a7ac8659aac0306e68a65f5. Just changed exception to assert and assert to exception since last review. Left a minor suggestion but it’s not important so feel free to ignore. Thanks for being so responsive with previous feedback.

    I think this PR should be pretty helpful to developers writing python tests since it cleans up logging, returns more complete error information, provides better feedback if stop_node is used incorrectly, and adds test coverage for the test framework to make sure it works reliably.

  119. DrahtBot requested review from m3dwards on Mar 26, 2025
  120. hodlinator force-pushed on Mar 26, 2025
  121. ryanofsky approved
  122. ryanofsky commented at 3:13 pm on March 26, 2025: contributor
    Code review ACK 6539598b42005f453b358a0076287db99d1b02eb. Since last review just simplified errno code a little as suggested
  123. in test/functional/test_framework/test_node.py:326 in 52746d0c58 outdated
    321@@ -322,13 +322,17 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
    322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
    323                 latest_error = repr(e)
    324             except OSError as e:
    325+                error_num = e.errno
    326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
    


    l0rinc commented at 10:37 am on March 27, 2025:
    Is this the issue? https://github.com/python/cpython/issues/109601 If so, it doesn’t seem to be Windows related.

    maflcko commented at 11:34 am on March 27, 2025:
    Also, the commit message seems to imply this is a bugfix for an issue observed on Windows. However, I don’t think anyone has observed this on current master. The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

    hodlinator commented at 1:07 pm on March 27, 2025:

    Thanks for finding that issue @l0rinc!

    I’ve only observed it on Windows, maybe the socket subsystem or timeouts are different enough there to make it more of an issue.

    However, I don’t think anyone has observed this on current master.

    Maybe I was running under some less common setup like non-WSL MSVC Debug, or maybe my Windows network stack was having a bad day. (There’s an odd chance that I might have been using Linux->Windows cross-built binaries for some reason).

    The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

    8ff738a6e96c19aaf14ebea52c8ae13751cf443b, the refactor-commit does indeed restructure how OSErrors are caught, at first I thought maybe it was filtering TimeoutError by type, but no, the approach on master also relies on errno.ETIMEDOUT (ConnectionResetError is partially filtered by type there, they are not direct descendants of each other).


    hodlinator commented at 10:31 pm on March 27, 2025:
    Pushed new version of this commit - 51b2115a348f681e12aabf514cb156da4420da2e. Now links to the Python issue in the comment, and only mentions that it was encountered on Windows in the commit message.

    maflcko commented at 7:13 am on March 28, 2025:
    Do you still know how often it happened and what the traceback looked like?

    hodlinator commented at 1:27 pm on March 28, 2025:

    Was able to easily reproduce the issue.

    C:\Users\hodlinator\bitcoin>py build\test\functional\feature_framework_startup_failures.py 2025-03-28T13:09:19.459000Z TestFramework (INFO): PRNG seed is: 4595347679033165165 2025-03-28T13:09:19.519000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:19.519000Z TestFramework (INFO): Verifying timeout in connecting to bitcoind’s RPC interface results in only one exception. 2025-03-28T13:09:19.788000Z TestFramework (INFO): Verifying inability to connect to bitcoind’s RPC interface due to wrong port results in one exception containing at least one OSError. 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Found 0/1 occurrences of the specific exception: AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+)) 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Assertion failed Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 135, in main self.run_test() ~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 111, in run_test self.test_wrong_rpc_port() ~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 85, in test_wrong_rpc_port self._run_test_internal( ~~~~~~~~~~~~~~~~~~~~~~~^ # Lower the timeout so we don’t wait that long. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …<4 lines>… r"AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+))" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 74, in _run_test_internal raise AssertionError(f"Child test didn’t contain (only) expected errors.\n:\n{result.stdout}\n\n") AssertionError: Child test didn’t contain (only) expected errors. : 2025-03-28T13:09:19.929000Z TestFramework (INFO): PRNG seed is: 1755806780689552109 2025-03-28T13:09:19.952000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.234000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 131, in main self.setup() ~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 313, in setup self.setup_network() ~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 47, in setup_network BitcoinTestFramework.setup_network(self) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 404, in setup_network self.setup_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 426, in setup_nodes self.start_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 585, in start_nodes node.wait_for_rpc_connection() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_node.py”, line 286, in wait_for_rpc_connection rpc.getblockcount() ~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\coverage.py”, line 50, in call return_val = self.auth_service_proxy_instance.call(*args, **kwargs) File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 127, in call response, status = self._request(‘POST’, self.__url.path, postdata.encode(‘utf-8’)) ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 105, in _request self.__conn.request(method, path, postdata, headers) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1336, in request self._send_request(method, url, body, headers, encode_chunked) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1382, in _send_request self.endheaders(body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1331, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1091, in _send_output self.send(msg) ~~~~~~~~~^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1035, in send self.connect() ~~~~~~~~~~~~^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1001, in connect self.sock = self._create_connection( ~~~~~~~~~~~~~~~~~~~~~~~^ (self.host,self.port), self.timeout, self.source_address) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 864, in create_connection raise exceptions[0] File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 849, in create_connection sock.connect(sa) ~~~~~~~~~~~~^^^^ TimeoutError: timed out 2025-03-28T13:09:21.344000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.345000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto/test_framework.log 2025-03-28T13:09:21.345000Z TestFramework (ERROR): 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_zahuvbto’ to consolidate all logs 2025-03-28T13:09:21.346000Z TestFramework (ERROR): 2025-03-28T13:09:21.346000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.346000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.346000Z TestFramework (ERROR): [node 0] Cleaning up leftover process

    2025-03-28T13:09:21.431000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.431000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:21.431000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2/test_framework.log 2025-03-28T13:09:21.431000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2’ to consolidate all logs 2025-03-28T13:09:21.432000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.432000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.432000Z TestFramework (ERROR):

    My hypothesis is that we trigger the issue through the first test setting a 0 rpc_timeout:

    https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L76-L81

    Attempting to use the network stack with a 0 timeout seems to have different meanings on Linux and Windows. It’s not really a realistic test. I think the next test covers a superset of the one above, so could probably remove the first one.

    https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L83-L92


    hodlinator commented at 1:55 pm on March 28, 2025:
    Dropped the test and workaround commit in latest push.

    hodlinator commented at 10:10 pm on April 21, 2025:
    This was re-added in later pushes as I was able to reproduce it on Windows with 5- second timeouts, not just 0 timeouts as hypothesized above (https://github.com/bitcoin/bitcoin/pull/30660#discussion_r2018664612). See 1f639efca5e71a0ff208415d94e408a74778d4db.
  124. in test/functional/feature_framework_startup_failures.py:98 in 75134fef62 outdated
    93+
    94+    def test_init_error(self):
    95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
    96+        self._run_test_internal(
    97+            ["--internal-extra_args=-nonexistentarg"],
    98+            "FailedToStartError: \\[node 0\\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    


    l0rinc commented at 11:12 am on March 27, 2025:
    nit: for consistency I’d make this either a r"...\[node 0\] as well or do a re.escape if we want to be explicit. Though I think we’re probably matching way too much here, it doesn’t hint at which part of the error is important and which part is irrelevant and subject to change

    hodlinator commented at 1:15 pm on March 27, 2025:
    r""-string seems more consistent. Thanks for pointing it out! Will fix if I re-touch.

    hodlinator commented at 10:31 pm on March 27, 2025:
    Fixed.
  125. l0rinc commented at 11:13 am on March 27, 2025: contributor
    lightweight, code review only ACK 6539598b42005f453b358a0076287db99d1b02eb
  126. hodlinator force-pushed on Mar 27, 2025
  127. hodlinator commented at 10:33 pm on March 27, 2025: contributor
  128. in test/functional/feature_framework_startup_failures.py:98 in 6bbfb3eeef outdated
    93+
    94+    def test_init_error(self):
    95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
    96+        self._run_test_internal(
    97+            ["--internal-extra_args=-nonexistentarg"],
    98+            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    


    l0rinc commented at 9:24 am on March 28, 2025:

    super-nit:

    0            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization\. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    

    hodlinator commented at 10:14 am on March 28, 2025:
    Will take if anything else forces me to re-touch.

    hodlinator commented at 1:53 pm on March 28, 2025:
  129. l0rinc approved
  130. l0rinc commented at 9:28 am on March 28, 2025: contributor
    lightweight re-review-ack 6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5
  131. hodlinator force-pushed on Mar 28, 2025
  132. hodlinator commented at 2:06 pm on March 28, 2025: contributor

    Latest push:

    • Removes test_instant_rpc_timeout which was artificially setting rpc_timeout = 0, which is not a supported case and which caused different behavior on Windows network stacks compared to Linux. That test was a rough subset of test_wrong_rpc_port. Removing test_instant_rpc_timeout allowed dropping the commit with the workaround for socket timeout with unset errno entirely. Special thanks to @maflcko for insisting that something was not right.
    • Fixed nit concerning escaping in regexp string.
    • Bumped subprocess.run-timeout since it was flaky under Windows.
  133. ryanofsky approved
  134. ryanofsky commented at 2:56 pm on March 28, 2025: contributor

    Code review ACK fc88c2ec297dc93ba06008bd5ae10798e9f6aeac. Just dropped 0 timeout workaround and instant timeout test since the last review.

    Ideally, I would want setting a 0 timeout to result in a comprehensible error message lke “Unable to connect to bitcoind after 0s” and not some harder to debug behavior, but it wouldn’t be worth restructuring the code or adding a special case for.

  135. DrahtBot requested review from l0rinc on Mar 28, 2025
  136. in test/functional/feature_framework_startup_failures.py:102 in 3ee7062ee3 outdated
     97+            ["--internal-start_stop=1"],
     98+            r"AssertionError: \[node 0\] Should only call stop_node\(\) on a running node after wait_for_rpc_connection\(\) succeeded\. Did you forget to call the latter after start\(\)\? Not connected to process: \d+"
     99+        )
    100+
    101+    def run_test(self):
    102+        if self.options.internal_rpc_timeout is None and self.options.internal_extra_args is None:
    


    l0rinc commented at 12:48 pm on April 18, 2025:
    this is a quite confusing way to check whether we’re in a child process or not (not obvious to me why internal_start_stop is not even here)

    ryanofsky commented at 1:26 pm on April 18, 2025:

    this is a quite confusing way to check whether we’re in a child process or not (not obvious to me why internal_start_stop is not even here)

    Do you have a suggestion to make it less confusing? I feel like checking for an internal argument is a mostly straightforward way of checking whether this was called internally.


    hodlinator commented at 1:30 pm on April 18, 2025:
    I’ve realized that all child test processes fail before reaching run_test(), so I’m leaning toward changing this into an assert.
  137. in test/functional/feature_framework_startup_failures.py:105 in 3ee7062ee3 outdated
    100+
    101+    def run_test(self):
    102+        if self.options.internal_rpc_timeout is None and self.options.internal_extra_args is None:
    103+            self.test_wrong_rpc_port()
    104+            self.test_init_error()
    105+            self.test_start_stop()
    


    l0rinc commented at 12:48 pm on April 18, 2025:
    do we need to batch all 3 tests into a single test?

    hodlinator commented at 8:36 pm on April 18, 2025:
  138. in test/functional/test_runner.py:413 in 3ee7062ee3 outdated
    407@@ -408,6 +408,7 @@
    408     'p2p_handshake.py --v2transport',
    409     'feature_dirsymlinks.py',
    410     'feature_help.py',
    411+    'feature_framework_startup_failures.py',
    


    l0rinc commented at 12:49 pm on April 18, 2025:
    I know this is meant to be mostly random, but the existing feature ones seem to be in alphabetic order

  139. in test/functional/feature_framework_startup_failures.py:95 in 3ee7062ee3 outdated
    90+            ["--internal-extra_args=-nonexistentarg"],
    91+            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization\. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
    92+        )
    93+
    94+    def test_start_stop(self):
    95+        self.log.info("Verify start() then stop_node() on a node without wait_for_rpc_connection() in between triggers an assert.")
    


    l0rinc commented at 12:50 pm on April 18, 2025:
    could these be separate sub-tests instead of just test logs?

    hodlinator commented at 9:03 pm on April 18, 2025:
    Let me know if you think the final commit in the latest push is an improvement.

    l0rinc commented at 7:24 am on April 19, 2025:
    It is, thanks
  140. in test/functional/feature_framework_startup_failures.py:58 in 3ee7062ee3 outdated
    53+            print(f"Unexpected timeout, subprocess output:\n{e.output}\nSubprocess output end", file=sys.stderr)
    54+            raise
    55+
    56+        success = True
    57+
    58+        traceback_count = len(re.findall("Traceback", result.stdout))
    


    l0rinc commented at 12:50 pm on April 18, 2025:
    could we add these into the try so that I don’t have to understand where result is coming from?

    hodlinator commented at 8:40 pm on April 18, 2025:
    Tried it out, but then one runs into the issue that it looks like subprocess.TimeoutExpired could be raised anywhere in that whole function, which I prefer avoiding.

    l0rinc commented at 7:24 am on April 19, 2025:

    looks like subprocess.TimeoutExpired could be raised anywhere in that whole function

    Isn’t that the whole point of a try/catch, that the happy path isn’t disrupted, that you don’t have to know where it’s coming from? Otherwise we’re treating exceptions as return codes.

    Quoting again from the Python docs https://docs.python.org/3/tutorial/errors.html

    0try:
    1    f = open('myfile.txt')
    2    s = f.readline()
    3    i = int(s.strip())
    4except [...]
    

    and not

    0try:
    1    f = open('myfile.txt')
    2except [...]
    3s = f.readline()
    4i = int(s.strip())
    

    hodlinator commented at 9:31 am on April 23, 2025:

    The full example from the python docs is:

     0try:
     1    f = open('myfile.txt')
     2    s = f.readline()
     3    i = int(s.strip())
     4except OSError as err:
     5    print("OS error:", err)
     6except ValueError:
     7    print("Could not convert data to an integer.")
     8except Exception as err:
     9    print(f"Unexpected {err=}, {type(err)=}")
    10    raise
    

    So the first and last lines inside the try are expected to sometimes raise exceptions (maybe the readline() too), which does not map well to the code in _verify_startup_failure().


    l0rinc commented at 9:38 am on April 23, 2025:
    The point is that the happy path is separated from the error handling, we’re not using exceptions as if they were returned error codes

    hodlinator commented at 9:51 am on April 23, 2025:
    My intent is to keep as much of the code separate from error handling as possible. By narrowing the scope of the try it makes clear to the reader that there aren’t exceptions hiding in every line of the happy path.

    l0rinc commented at 10:03 am on April 23, 2025:
    I think the opposite is true, this way we’re forcing the reader to consider the error case when trying to understand the happy path (i.e. separation of concerns). But I’ll leave it, it’s not that important.

    hodlinator commented at 11:54 am on April 23, 2025:

    Another objection to your approach as a general rule: If one of the lines in the happy path is modified to call a function that raises exceptions under some circumstances, we run the risk of accidentally catching it by related (parent)type through a pre-existing except/catch-block, and treating it differently than we had if we intentionally designed the code.

    Side note: I’m in favor of newer languages like Rust/Zig not supporting exceptions.

  141. in test/functional/feature_framework_startup_failures.py:46 in 3ee7062ee3 outdated
    54+            raise
    55+
    56+        success = True
    57+
    58+        traceback_count = len(re.findall("Traceback", result.stdout))
    59+        if traceback_count != 1:
    


    l0rinc commented at 12:51 pm on April 18, 2025:
    this line might need explanation for why no Traceback is a success=False

    hodlinator commented at 8:45 pm on April 18, 2025:
    Changed the function name from _run_test_internal to _verify_startup_failure and added comment, hopefully that clarifies this aspect. I also find the error message quite explanatory: https://github.com/bitcoin/bitcoin/blob/fc88c2ec297dc93ba06008bd5ae10798e9f6aeac/test/functional/feature_framework_startup_failures.py#L59-L60
  142. in test/functional/feature_framework_startup_failures.py:56 in 3ee7062ee3 outdated
    51+            result = subprocess.run([sys.executable, __file__] + args, encoding="utf-8", stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60 * self.options.timeout_factor)
    52+        except subprocess.TimeoutExpired as e:
    53+            print(f"Unexpected timeout, subprocess output:\n{e.output}\nSubprocess output end", file=sys.stderr)
    54+            raise
    55+
    56+        success = True
    


    l0rinc commented at 12:51 pm on April 18, 2025:
    Could we collect the errors instead of printing and counting?

    hodlinator commented at 9:03 pm on April 18, 2025:
    Taken!
  143. in test/functional/feature_framework_startup_failures.py:37 in 3ee7062ee3 outdated
    32+            self.extra_args = [[self.options.internal_extra_args]]
    33+
    34+    def add_options(self, parser):
    35+        parser.add_argument("--internal-rpc_timeout", dest="internal_rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    36+        parser.add_argument("--internal-extra_args", dest="internal_extra_args", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    37+        parser.add_argument("--internal-start_stop", dest="internal_start_stop", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    


    l0rinc commented at 12:53 pm on April 18, 2025:
    Do we really need 3 separate internal dummy args? Seems like a framework with a custom arg and 3 separate tests with different behaviors checking only a single thing would untangle this slightly better

    hodlinator commented at 9:05 pm on April 18, 2025:
    I think it leaves flexibility for adding more tests which use a subset of them.
  144. in test/functional/feature_framework_startup_failures.py:6 in 3ee7062ee3 outdated
    0@@ -0,0 +1,109 @@
    1+#!/usr/bin/env python3
    2+# Copyright (c) 2025-present The Bitcoin Core developers
    3+# Distributed under the MIT software license, see the accompanying
    4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
    5+"""
    6+Verify various startup failures only raise one exception since multiple
    


    l0rinc commented at 12:55 pm on April 18, 2025:
    “various startup failures” doesn’t sound like a single test to me, could we maybe unwrap them somehow?

    hodlinator commented at 9:08 pm on April 18, 2025:
    It is rare for functional tests to only test one thing. But I agree “various” is a bit too loose, changed to “framework”.
  145. in test/functional/feature_framework_startup_failures.py:11 in 3ee7062ee3 outdated
     6+Verify various startup failures only raise one exception since multiple
     7+exceptions being raised muddies the waters of what actually went wrong.
     8+We should maintain this bar of only raising one exception as long as
     9+additional maintenance and complexity is low.
    10+
    11+Test relaunches itself into a child processes in order to trigger failure
    


    l0rinc commented at 12:55 pm on April 18, 2025:
    I find the relaunches itself part quite confusing here

    hodlinator commented at 8:47 pm on April 18, 2025:

    I’d rather keep the overall test in one file. We need child process as we want the parent to always succeed, and the child processes to always fail.

    Corrected grammar: “Test relaunches itself into child processes in order to trigger failures without the parent process’ BitcoinTestFramework also failing.”

  146. in test/functional/test_framework/test_framework.py:1 in fc88c2ec29


    l0rinc commented at 12:58 pm on April 18, 2025:
    nit: I’d move this commit before the previous one (it’s more like the others, the previous one is different)

    hodlinator commented at 9:14 pm on April 18, 2025:
    Done. Also makes it easier to squash/drop the new final commit of introducing multiple BitcoinTestFramework-subclasses.
  147. in test/functional/test_framework/test_node.py:338 in 8ff738a6e9 outdated
    333+                # Suppress similarly to the above JSONRPCException errors.
    334+                if e.errno not in [ errno.ECONNRESET,     # This might happen when the RPC server is in warmup,
    335+                                                          # but shut down before the call to getblockcount succeeds.
    336+                                    errno.ETIMEDOUT,      # Treat identical to ECONNRESET
    337+                                    errno.ECONNREFUSED ]: # Port not yet open?
    338                     raise  # unknown OS error
    


    l0rinc commented at 1:02 pm on April 18, 2025:

    Nit: this formatting pushes everything out of my screen:

    0                # Suppress similarly to the above JSONRPCException errors.
    1                if e.errno not in [
    2                    errno.ECONNRESET,  # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount succeeds.
    3                    errno.ETIMEDOUT,   # Treat identical to ECONNRESET
    4                    errno.ECONNREFUSED # Port not yet open?
    5                ]: 
    6                    raise  # unknown OS error
    

    hodlinator commented at 9:21 pm on April 18, 2025:
    Fair. Taken.
  148. in test/functional/feature_framework_startup_failures.py:40 in 3ee7062ee3 outdated
    35+        parser.add_argument("--internal-rpc_timeout", dest="internal_rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    36+        parser.add_argument("--internal-extra_args", dest="internal_extra_args", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    37+        parser.add_argument("--internal-start_stop", dest="internal_start_stop", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
    38+
    39+    def setup_network(self):
    40+        # Avoid doing anything if num_nodes == 0, otherwise we fail.
    


    l0rinc commented at 1:03 pm on April 18, 2025:
    why?

    hodlinator commented at 8:52 pm on April 18, 2025:

    Improved comment: “Only proceed if num_nodes > 0 (child test processes), as 0 (parent) is not supported by BitcoinTestFramework.setup_network().”

    Could explore making BitcoinTestFramework more permissive in this aspect but would rather defer that.

  149. in test/functional/feature_framework_startup_failures.py:51 in 3ee7062ee3 outdated
    46+            else:
    47+                BitcoinTestFramework.setup_network(self)
    48+
    49+    def _run_test_internal(self, args, expected_exception):
    50+        try:
    51+            result = subprocess.run([sys.executable, __file__] + args, encoding="utf-8", stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60 * self.options.timeout_factor)
    


    l0rinc commented at 1:05 pm on April 18, 2025:
    can we just add .stdout at the end here since it’s all we’re using?

    l0rinc commented at 1:06 pm on April 18, 2025:
    do we care about stdout and encoding here? Or can we do stdout=subprocess.DEVNULL maybe?

    hodlinator commented at 8:55 pm on April 18, 2025:
    Good point, taken!

    hodlinator commented at 8:56 pm on April 18, 2025:

    Requiring the expected output to be in stderr and not stdout seems petty/brittle to me, even if it is more correct.

    encoding is required to avoid:

    0    traceback_count = len(re.findall("Traceback", output))
    1                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    2  File "/nix/store/26yi95240650jxp5dj78xzch70i1kzlz-python3-3.12.9/lib/python3.12/re/__init__.py", line 217, in findall
    3    return _compile(pattern, flags).findall(string)
    4           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    5TypeError: cannot use a string pattern on a bytes-like object
    

    l0rinc commented at 7:23 am on April 19, 2025:
    Thanks for checking
  150. in test/functional/feature_framework_startup_failures.py:66 in 3ee7062ee3 outdated
    61+            success = False
    62+
    63+        matching_exception_count = len(re.findall(expected_exception, result.stdout))
    64+        if matching_exception_count != 1:
    65+            self.log.error(f"Found {matching_exception_count}/1 occurrences of the specific exception: {expected_exception}")
    66+            success = False
    


    l0rinc commented at 1:10 pm on April 18, 2025:

    we could likely use the walrus operator here to narrow the scope and name:

    0		if (matches := len(re.findall(expected_exception, result.stdout))) != 1:
    1		    self.log.error(f"Found {matches}/1 occurrences of the specific exception: {expected_exception}")
    2		    success = False
    

    hodlinator commented at 9:01 pm on April 18, 2025:
    Prefer using the walrus for truthiness tests which don’t require the additional () != 1.

    l0rinc commented at 7:24 am on April 19, 2025:

    The point is to narrow the scope of variables so that we can give them more focused names. The example in the doc https://docs.python.org/3/whatsnew/3.8.html is basically this exact escenario

    0if (n := len(a)) > 10:
    1    print(f"List is too long ({n} elements, expected <= 10)")
    

    l0rinc commented at 7:40 am on April 19, 2025:

    Or we could avoid repetition as well by a more functional approach of separating data from algorithm:

    0patterns = {
    1    "Traceback": "tracebacks - expecting exactly one with no knock-on exceptions",
    2    expected_exception: f"occurrences of the specific exception: {expected_exception}",
    3    "Test failed. Test logging available at": "test failure output messages"
    4}
    5errors = [f"Found {count}/1 {message}."
    6          for pattern, message in patterns.items()
    7          if (count := len(re.findall(pattern, output))) != 1]
    
  151. l0rinc changes_requested
  152. l0rinc commented at 1:10 pm on April 18, 2025: contributor
    As discussed in private, I’ve summarized my findings here shortly.
  153. DrahtBot requested review from l0rinc on Apr 18, 2025
  154. hodlinator force-pushed on Apr 18, 2025
  155. hodlinator commented at 9:57 pm on April 18, 2025: contributor

    Hi folks!

    Asked @l0rinc what it would take for him to give this more than a lightweight A-C-K, and we went through the commits together, especially feature_framework_startup_failures.py. It made me agree some minor improvements are justified in order to make the code more readable & correct.

    Added final commit - drop or squash?

    We also agreed that it would be worth experimenting with breaking up the one test class into multiple to clarify things further. Let me know what you think about the result. @ryanofsky: Hope you tolerate the invalidation of your latest ACK. If you deem the next-to-last commit re-ACK:able, but would rather see the last commit dropped, I hereby grant you veto rights for time served within this PR.


    Edit: Subsequent pushes unbreak the test when running in the context of test_runner.py with --cachedir being sent in, solve minor linter failure, and rebased on master in hopes of avoiding Windows CI file contention issue #32291 (which I only now realize hasn’t been solved yet).

  156. hodlinator force-pushed on Apr 18, 2025
  157. DrahtBot added the label CI failed on Apr 18, 2025
  158. DrahtBot commented at 10:43 pm on April 18, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40802607410

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  159. hodlinator force-pushed on Apr 18, 2025
  160. DrahtBot removed the label CI failed on Apr 19, 2025
  161. l0rinc commented at 7:26 am on April 19, 2025: contributor
    You could rebase your change before the fix reverted in #32302 to make it green
  162. in test/functional/feature_framework_startup_failures.py:122 in 9e0fede6a6 outdated
    117+if __name__ == '__main__':
    118+    if not any('--internal_test=' in arg for arg in sys.argv):
    119+        FeatureFrameworkStartupFailures(__file__).main()
    120+    else:
    121+        for arg in sys.argv:
    122+            if arg[:16] == '--internal_test=':
    


    l0rinc commented at 7:44 am on April 19, 2025:

    nit: can probably be if arg.startswith('--internal_test='): or walrus to the rescue again:

    0if arg.startswith(prefix := '--internal_test='):
    1    class_name = arg[len(prefix):]
    

    ryanofsky commented at 5:48 pm on April 21, 2025:

    In commit “Drop or Squash? - Break out startup failures into distinct classes” (9e0fede6a6d608ff307b48b9a326c6d467b410cf)

    I think could also simplify the sys.modules code and write globals()[class_name](__file__).main()

  163. l0rinc approved
  164. l0rinc commented at 7:46 am on April 19, 2025: contributor
    Concept ACK, I like the new FeatureFrameworkStartupFailures, it’s untangled a lot better now in my opinion. I assume from the last commit message that it’s not final, right?
  165. hodlinator marked this as a draft on Apr 19, 2025
  166. qa refactor: wait_for_rpc_connection - Treat OSErrors the same
    ConnectionResetError is an OSError as well (ECONNRESET), no reason to have a separate except-block for it.
    
    Also improves comments for other exceptions and make condition above more Pythonic.
    879243e81f
  167. qa: Include ignored errors in RPC connection timeout
    When an RPC connection attempt with bitcoind times out, include which ignored errors occurred in the exception message.
    
    May provide clues of what has gone wrong.
    6ad21b4c01
  168. qa: Only allow calling TestNode.stop() after connecting
    (Still tolerate calling it on a no longer (self.)running node, as in a node that has been queried for is_node_stopped() and modified state before returning True).
    
    Tests should not attempt to use the non-functioning RPC interface to call stop() unless wait_for_connections() has succeeded.
    
    No longer log and suppress http.client.CannotSendRequest as a consequence of stop()-RPC, as error conditions causing this knock-on issue are now guarded against before the call.
    9b24a403fa
  169. ryanofsky approved
  170. ryanofsky commented at 5:53 pm on April 21, 2025: contributor
    Code review ACK 9e0fede6a6d608ff307b48b9a326c6d467b410cf. Main change since last review were a lot of feature_framework_startup_failures.py cleanups. I do think the drop or squash commit 9e0fede6a6d608ff307b48b9a326c6d467b410cf is an improvement so would vote for squash, but no strong opinion
  171. DrahtBot requested review from l0rinc on Apr 21, 2025
  172. qa: Work around Python socket timeout issue
    Observed on local machine running Windows / Python v3.13.1 when overriding rpc_timeout to small values (5- seconds). Next commit performs such overrides.
    1f639efca5
  173. hodlinator force-pushed on Apr 21, 2025
  174. hodlinator force-pushed on Apr 21, 2025
  175. in test/functional/feature_framework_startup_failures.py:58 in 3ae99bf8d7 outdated
    53+                                    stdout=subprocess.PIPE,
    54+                                    stderr=subprocess.STDOUT,
    55+                                    timeout=60 * self.options.timeout_factor).stdout
    56+        except subprocess.TimeoutExpired as e:
    57+            # e.output is returned as bytes on Linux and str on Windows.
    58+            child_output = e.output if isinstance(e.output, str) else e.output.decode("utf-8")
    


    l0rinc commented at 7:15 pm on April 21, 2025:

    str on Windows

    https://docs.python.org/3/library/subprocess.html#subprocess.TimeoutExpired.output claims:

    This is always bytes when any output was captured

    Can we simplify to:

    0            child_output = e.output.decode("utf-8") if e.output else ""
    

    ?


    hodlinator commented at 9:43 pm on April 21, 2025:

    In this case real-world experience proved the documentation to be wrong. We do get a string on Windows, at least under some conditions.

    It does make sense to invert the condition similar to what you did though! To avoid trying None.decode("utf-8"). None serializes okay (to "None") so we can return e.output for both the str and None cases.

    0            # e.output is returned as bytes on Linux and str on Windows.
    1            child_output = e.output.decode("utf-8") if isinstance(e.output, bytes) else e.output
    

    ryanofsky commented at 5:57 pm on April 22, 2025:

    re: #30660 (review)

    I think you could drop all unnecessary encoding and decoding by just dropping encoding="utf-8" parameter to subprocess.run and printing the bytes directly with something like sys.stderr.buffer.write(b"<CHILD OUTPUT BEGIN>\n%s\n<CHILD OUTPUT END>" % e.output)


    hodlinator commented at 9:04 am on April 23, 2025:
    Pushed an alternative, still using print. When I tried sys.stderr.buffer.write that data ended up at the end of the failure output, instead of when the exception was caught.
  176. in test/functional/feature_framework_startup_failures.py:132 in 3ae99bf8d7 outdated
    115+        # nothing is on that port. Divide by timeout_factor to counter
    116+        # multiplication in base, 2 * node_start_duration should be enough.
    117+        self.rpc_timeout = max(3, 2 * self.options.node_start_duration) / self.options.timeout_factor
    118+
    119+    def run_test(self):
    120+        assert False, "Should have failed earlier during startup."
    


    l0rinc commented at 7:23 pm on April 21, 2025:

    nit: assert False is a bit hacky in my opinion - can we use the same style that we did on line 60?

    0        print("Should have failed earlier during startup.")
    1        raise
    

    hodlinator commented at 8:38 pm on April 21, 2025:

    raise re-raises the currently caught exception. Think it’s roughly equivalent to throw; in C++. One can do:

    0raise AssertionError("...")
    

    But I prefer using assert for new code. Considered raising RuntimeError, but this really is a code invariant, mostly for documenting, but also for catching changed behavior.


    l0rinc commented at 11:27 am on April 22, 2025:
    yes, I meant either print + raise or just raise Exception("Should have failed earlier during startup"), as you’ve mentioned. assert False feels like a hack…

    hodlinator commented at 9:07 am on April 23, 2025:
    If there was some kind of UnreachableError I would use it. Looked into using our utility assert_raises(...stop_node...), but that swallows the exception, so the test fails.
  177. in test/functional/feature_framework_startup_failures.py:146 in 3ae99bf8d7 outdated
    141+        assert False, "Should have failed earlier during startup."
    142+
    143+
    144+if __name__ == '__main__':
    145+    internal_test = None
    146+    for arg in sys.argv:
    


    l0rinc commented at 7:30 pm on April 21, 2025:

    I’m not a fan of a incomplete iteration, can we separate the generator expression to the first line, something like:

    0    class_name = next((arg[16:] for arg in sys.argv if arg.startswith('--internal_test=')), None)
    

    or use a regexp to deduplicate the match and substring (untested):

    0        class_name = next((m[1] for arg in sys.argv if (m := re.match(r'--internal_test=(.*)', arg))), None)
    

    hodlinator commented at 9:30 pm on April 21, 2025:
    Thanks! Took regexp form.
  178. in test/functional/test_framework/test_node.py:339 in 3ae99bf8d7 outdated
    348+                    errno.ETIMEDOUT,    # Treat identical to ECONNRESET
    349+                    errno.ECONNREFUSED  # Port not yet open?
    350+                ]:
    351                     raise  # unknown OS error
    352-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    353+                suppressed_errors[f"OSError {errno.errorcode[error_num]}"] += 1
    


    l0rinc commented at 7:36 pm on April 21, 2025:
    Based on the checks we’re doing on e at the top, error_num can potentially be None here - is that expected?

    hodlinator commented at 8:42 pm on April 21, 2025:
    We will re-raise above for None, see: #30660 (review)

    l0rinc commented at 11:25 am on April 22, 2025:
    Indeed, thanks
  179. in test/functional/feature_framework_startup_failures.py:46 in 3ae99bf8d7 outdated
    41+        if self.options.nocleanup:
    42+            args.append("--nocleanup")
    43+        if self.options.tmpdir:
    44+            args.append(f"--tmpdir={self.options.tmpdir}/{hash(expected_exception)}"),
    45+        if self.options.randomseed is not None:
    46+            args.append(f"--randomseed={self.options.randomseed + hash(expected_exception)}")
    


    l0rinc commented at 7:43 pm on April 21, 2025:

    If I understand this correctly, we may want to preserve the seed properties - and xor preserves the distribution properties better than addition:

    0            args.append(f"--randomseed={self.options.randomseed ^ hash(expected_exception)}")
    

    Q: is hash expected to be stable between runs - or is that not important here?


    hodlinator commented at 9:29 pm on April 21, 2025:

    Q: is hash expected to be stable between runs - or is that not important here?

    Oof no! Switched to insecure & fast MD5.

  180. in test/functional/feature_framework_startup_failures.py:63 in 3ae99bf8d7 outdated
    58+            child_output = e.output if isinstance(e.output, str) else e.output.decode("utf-8")
    59+            print(f"Unexpected child timeout:\n<CHILD OUTPUT BEGIN>\n{child_output}\n<CHILD OUTPUT END>", file=sys.stderr)
    60+            raise
    61+
    62+        errors = []
    63+        if (n := len(re.findall("Traceback", output))) != 1:
    


    l0rinc commented at 7:46 pm on April 21, 2025:

    do we need regular expression matching here or can we use output.count("Traceback") as well?

    0        if (n := output.count("Traceback")) != 1:
    

    hodlinator commented at 9:45 pm on April 21, 2025:
    Taken.

    l0rinc commented at 11:33 am on April 22, 2025:
    I think it applies to the other findall instances after this
  181. l0rinc commented at 7:52 pm on April 21, 2025: contributor
    I also vote for squashing the last commit.
  182. hodlinator force-pushed on Apr 21, 2025
  183. hodlinator marked this as ready for review on Apr 21, 2025
  184. hodlinator commented at 10:21 pm on April 21, 2025: contributor

    Notable changes since previous update (git range-diff master 77776c5 ea3cc83).

    • Changed parent test process to run a node, and use it to measure startup-times. That measurement is then used to to cap how long we wait before timing out due to wrong port.
    • Forward test framework arguments like --tmpdir and --timeout-factor to child processes (now with stable hashing).
    • Dropped the commit to avoid logging node stop info as parent process now also runs a node.
    • Re-added the commit to handle socket timeout exception with missing errno (https://github.com/bitcoin/bitcoin/pull/30660#discussion_r2053051864), since I started experiencing the issue again for non-zero rpc_timeouts as well.
    • Added warning about dangling bitcoind processes upon subprocess timeout.
    • Edit: Also squashed commit breaking apart class in feature_framework_startup_failures.py based off #30660#pullrequestreview-2781869977 and #30660#pullrequestreview-2782051429.
  185. in test/functional/feature_framework_startup_failures.py:33 in ea3cc8388b outdated
    28+    def setup_network(self):
    29+        # Don't start the node yet, as we want to measure during run_test.
    30+        self.add_nodes(self.num_nodes, self.extra_args)
    31+
    32+    # Launches a child test process running this same file, but instantiating
    33+    # a child test, and verifies that we (only) raise the expected exception.
    


    l0rinc commented at 11:39 am on April 22, 2025:

    Maybe we could make the tenses more uniform here:

    0    # Launches a child test process that runs this same file, instantiates
    1    # a child test, and verifies that it raises only the expected exception.
    

    hodlinator commented at 12:50 pm on April 22, 2025:
    Adjusted in similar direction.
  186. in test/functional/feature_framework_startup_failures.py:48 in ea3cc8388b outdated
    43+            args.append("--nocleanup")
    44+        weak_hash = md5(expected_exception.encode("utf-8"))
    45+        if self.options.tmpdir:
    46+            args.append(f"--tmpdir={self.options.tmpdir}/{weak_hash.hexdigest()[:16]}")
    47+        if self.options.randomseed is not None:
    48+            args.append(f"--randomseed={self.options.randomseed ^ int.from_bytes(weak_hash.digest()[:8], 'little')}")
    


    l0rinc commented at 11:44 am on April 22, 2025:

    I’m fine with this as well, but to avoid complexity on declaration site and usage site (and to skip the “weak” part which seems out-of-place), we could try:

    0        hash_id = int(sha256(expected_exception.encode("utf-8")).hexdigest()[:8], 16)
    1        if self.options.tmpdir:
    2            args.append(f"--tmpdir={self.options.tmpdir}/{hash_id}")
    3        if self.options.randomseed is not None:
    4            args.append(f"--randomseed={self.options.randomseed ^ hash_id}")
    

    hodlinator commented at 12:51 pm on April 22, 2025:
    Took part of the simplification.
  187. in test/functional/feature_framework_startup_failures.py:74 in ea3cc8388b outdated
    70+        if (n := len(re.findall(expected_exception, output))) != 1:
    71+            errors.append(f"Found {n}/1 occurrences of the specific exception: {expected_exception}")
    72+        if (n := len(re.findall("Test failed. Test logging available at", output))) != 1:
    73+            errors.append(f"Found {n}/1 test failure output messages.")
    74+
    75+        assert not errors, f"Child test didn't contain (only) expected errors:\n{chr(10).join(errors)}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"
    


    l0rinc commented at 11:45 am on April 22, 2025:

    nit: ‘\n’ seems less cryptic:

    0        assert not errors, f"Child test didn't contain (only) expected errors:\n{'\n'.join(errors)}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"
    

    hodlinator commented at 12:52 pm on April 22, 2025:

    This was the result of an unfortunate battle with the linter. Seems to not even have worked on older versions:

    https://stackoverflow.com/questions/51775950/why-isnt-it-possible-to-use-backslashes-inside-the-braces-of-f-strings-how-can


    l0rinc commented at 4:14 pm on April 26, 2025:

    The above worked for me - seems it was fixed in Python 3.12, but we have to support earlier ones as well. But I haven’t fully given up on this :D. It’s not very important, just had to find a solution, feel free to ignore. Since empty strings are also falsy, the 🦭 operator can come to the rescue again.

    0        assert not (msg := '\n'.join(errors)), f"Child test didn't contain (only) expected errors:\n{msg}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"
    

    hodlinator commented at 9:16 pm on April 26, 2025:

    Switched to use os.linesep which I think is more idiomatic… But I’m not entirely happy with how I ended up having to escape: https://github.com/bitcoin/bitcoin/blob/b19439b1eee57041fc37dcf509d5ee9f248263bc/test/functional/feature_framework_startup_failures.py#L76-L79

    It’s a side-effect of using repr(e) in 28e282ef9ae94ede4aace6b97ff18c66cb72a001 which gives the escaped form.

  188. in test/functional/feature_framework_startup_failures.py:150 in ea3cc8388b outdated
    145+    def run_test(self):
    146+        assert False, "Should have failed earlier during startup."
    147+
    148+
    149+if __name__ == '__main__':
    150+    if class_name := next((m[1] for arg in sys.argv if (m := re.match(r'--internal_test=(.*)', arg))), None):
    


    l0rinc commented at 11:47 am on April 22, 2025:
    nit: I know I wrote .* but we might want .+ instead, since we likely don’t support empty ones
  189. l0rinc approved
  190. l0rinc commented at 11:56 am on April 22, 2025: contributor

    ACK ea3cc8388b4aefcaee61f75e8bf1bc03281a0c91

    LGTM, left a few nits, but none are blockers. Before merging someone with more experience should also review to be sure.

  191. DrahtBot requested review from ryanofsky on Apr 22, 2025
  192. hodlinator force-pushed on Apr 22, 2025
  193. ryanofsky approved
  194. ryanofsky commented at 6:02 pm on April 22, 2025: contributor
    Code review ACK ece168ecef6e57e3a994ec5f7bd341f9c7525721. Changes since last review were squashing and rebasing, doing a better job of passing seed, timeouts, and tmpdirs to child processes. Previously dropped workaround for None errno is also back, this time with link to upstream python bug.
  195. DrahtBot requested review from l0rinc on Apr 22, 2025
  196. hodlinator force-pushed on Apr 23, 2025
  197. hodlinator force-pushed on Apr 23, 2025
  198. DrahtBot added the label CI failed on Apr 23, 2025
  199. DrahtBot commented at 9:19 am on April 23, 2025: contributor

    🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40998561125

    Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

    • Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.

    • A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.

    • An intermittent issue.

    Leave a comment here, if you need help tracking down a confusing failure.

  200. in test/functional/feature_framework_startup_failures.py:43 in ece5442f05 outdated
    38+            f"--cachedir={self.options.cachedir}",
    39+            f"--portseed={self.options.port_seed}",
    40+            f"--internal_test={test.__name__}",
    41+        ] + internal_args
    42+        if self.options.nocleanup:
    43+            args.append("--nocleanup")
    


    l0rinc commented at 10:28 am on April 23, 2025:

    I must be missing something fundamental here, I don’t understand why these lines are never triggered while running the test. Are these triggered in some special circumstance that has happened before but isn’t happening anymore - and since it’s already testing other tests, there’s no point in covering it with actual tests?

    I find it a bit confusing, wanted to understand why e.g. the nocleanup branch was added here - but disabling it still passes, so that doesn’t help…

    0if self.options.nocleanup:
    1    raise Exception("")
    2    args.append("--nocleanup")
    

    and

    0if self.options.randomseed is not None:
    1    raise Exception("")
    2    args.append(f"--randomseed={self.options.randomseed ^ weak_hash}")
    

    and

    0if (n := output.count("Traceback")) != 1:
    1    raise Exception("")
    2    errors.append(f"Found {n}/1 tracebacks - expecting exactly one with no knock-on exceptions.")
    

    hodlinator commented at 12:01 pm on April 23, 2025:
    The forwarding of self.options to the child process is purely to obey the expectations of someone running feature_framework_startup_failures.py or test_runner.py with those arguments on the CLI/CI.

    l0rinc commented at 12:06 pm on April 23, 2025:
    Not sure I understand, so is it dead code? Would the above exceptions be triggered on CI? In either case I find it confusing.

    hodlinator commented at 12:17 pm on April 23, 2025:
    It is sometimes useful to debug tests by looking at what is left behind on the filesystem. In those cases one could invoke the parent with “–nocleanup”. Not forwarding it to child processes means they would not leave files behind.

    l0rinc commented at 12:21 pm on April 23, 2025:
    Sure, but why isn’t it used currently? The dead code part is what bothers me

    hodlinator commented at 12:43 pm on April 23, 2025:
    It is functionality that can be activated by the user/CI. Not having the code would leave such cases in more of a broken state.

    l0rinc commented at 12:45 pm on April 23, 2025:
    Can we cover the paths in the tests as well?

    hodlinator commented at 12:51 pm on April 23, 2025:
    We could definitely test more parts of the framework functionality in another PR. This PR is focused on keeping the error reporting clear for startup failures, to aid troubleshooting.

    l0rinc commented at 12:55 pm on April 23, 2025:
    But this PR introduced the dead code

    hodlinator commented at 1:02 pm on April 23, 2025:

    hodlinator commented at 6:57 am on April 25, 2025:

    I felt we were going in circles here, that’s why I illustrated it by linking back up the thread. The code is not dead, it is there to avoid the behavior not being broken when those args are activated by a user or CI.

    This change is about testing the behavior of the test framework. We could also add tests to test the testing of the test framework, but at some point, this recursion has to stop.

    Maybe we’re talking past each other in some way I don’t perceive?


    hodlinator commented at 10:05 am on April 25, 2025:
    (We had a call and I will experiment with passing down the full sys.argv to the child processes (with some tweaking for tmpdir). Will also see if I can add a test-class which intentionally exercises the failure case of ...count("Traceback")) != 1 and other checks in _verify_startup_failure without making things too convoluted, as it will ensure the test tester is robust going forward).

    hodlinator commented at 9:27 pm on April 25, 2025:
    (Done in latest push from ~8 hours ago).

    l0rinc commented at 4:16 pm on April 26, 2025:
    Thanks for fixing this!
  201. DrahtBot removed the label CI failed on Apr 23, 2025
  202. hodlinator commented at 6:59 am on April 25, 2025: contributor
    Drahtbot’s LLM is complaining about an assert string: #30660 (comment) The Python interpreter is fine with mixing regular ""-strings and f""-strings, so it’s a false positive.
  203. maflcko commented at 7:13 am on April 25, 2025: member

    Drahtbot’s LLM is complaining … #30660 (comment) … it’s a false positive.

    Yes, there are some rare false-positives. However, there are many more true positives, including one logic bug that was already found in a test in another pull, so I think overall it is worth it. Happy to discuss further in https://github.com/maflcko/DrahtBot/issues, or anywhere else.

  204. qa: assert_raises_message() - Stop assuming certain structure for exceptions 28e282ef9a
  205. hodlinator force-pushed on Apr 25, 2025
  206. in test/functional/feature_framework_startup_failures.py:50 in 0302eefd9a outdated
    45+            if (m := re.match(r'--tm(p?(d?(i?(r?))))=(.+)', arg)):
    46+                parent_args[i] = f"--tmpdir={m[5]}/{subdir}"
    47+                tmpdir_arg = True
    48+                break
    49+        if not tmpdir_arg:
    50+            parent_args.append(f"--tmpdir={self.options.tmpdir}/{subdir}")
    


    l0rinc commented at 3:49 pm on April 26, 2025:

    What’s the exact purpose of this, why would someone set --tmr=x or --tmdi=y? And if not tmpdir_arg is basically the same as the previous loop which inserts to existing location, it just appends and uses a different root folder, right? We could use those as defaults I think, and slice it in, something like:

    0        i, path = next(((i, m[1]) for i, arg in enumerate(parent_args) if (m := re.match(r'--tmpdir=(.+)', arg))),
    1                       (len(parent_args), self.options.tmpdir))
    2        subdir = md5(expected_exception.encode('utf-8')).hexdigest()[:8]
    3        parent_args[i:i + 1] = [f"--tmpdir={path}/{subdir}"]
    

    ((-1, self.options.tmpdir)) would also work, but the slice would look weird in that case)

    Not a blocker from my part, will leave it up to you.


    hodlinator commented at 7:53 pm on April 26, 2025:
    Thanks! Took a variant of that, while still maintaining support for shortened forms of --tmpdir. The Python ArgumentParser supports doing that and I’ve personally used --time=2 in place of --timeout-factor=2 since it’s shorter and I don’t remember if it’s --timeout_factor.

    l0rinc commented at 9:04 pm on April 26, 2025:

    But this doesn’t just match prefixes, it makes every parameter optional, so the middle one can be missing - which would be weird.

    Maybe you meant --tm(?:p(?:d(?:ir?)?)?)?= which would only match the prefixes

    • --tmpdir=
    • --tmpdi=
    • --tmpd=
    • --tmp=
    • --tm=

    but the intermediary letters aren’t optional, so

    • --tmr=

    won’t be matched anymore (and the groups are non-capturing so you can still use m[1] for the value)


    hodlinator commented at 9:33 pm on April 26, 2025:
    Thanks for catching that. I would have leaned towards --tm(p(d(i(r?)?)?)?)?= but I like the non-capturing groups, don’t recall coming across them before. Taken.
  207. l0rinc commented at 4:20 pm on April 26, 2025: contributor

    ACK 0302eefd9a826e1d8e81b3684ff8ebc6de9d4217

    Latest push fixed dead code concerns, uses counts instead of find for non-regex cases consistently and uses less encoding rountrips.

  208. DrahtBot requested review from ryanofsky on Apr 26, 2025
  209. l0rinc approved
  210. hodlinator force-pushed on Apr 26, 2025
  211. hodlinator force-pushed on Apr 26, 2025
  212. l0rinc commented at 9:06 pm on April 26, 2025: contributor
    ACK b19439b1eee57041fc37dcf509d5ee9f248263bc
  213. qa: Add feature_framework_startup_failures.py 10845cd7cc
  214. hodlinator force-pushed on Apr 26, 2025
  215. l0rinc commented at 9:54 pm on April 26, 2025: contributor
    ACK 10845cd7cc89fc83b4ee844dc577b391720bba9c

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-05-06 09:13 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me