qa: Verify clean shutdown on startup failure

hodlinator commented at 11:53 am on August 15, 2024: contributor

Improves handling of startup errors in functional tests and puts tests in place to ensure knock-on errors don’t creep in.

wait_for_rpc_connection() now appends specific failures leading up to the Unable to connect to bitcoind error to that error message: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 1, 'OSError.ECONNREFUSED': 239}, latest error: ConnectionRefusedError(111, 'Connection refused'))
Fixes Windows Python issue where socket.timeout exceptions end up with unset errno-fields.
Also adds comments, refactors code, improves logging.

The underlying purpose is to ensure developer efficiency in finding root causes of test failures.

Prior iterations of the PR partially focused on fixing the same issue as #31620. Originally inspired by #30390.

Testing

Can be tested by reverting either faf2f2c654d9aa18b2f49a157956f9ab0fce302a or fae3bf6b870eb0f9cddd1adac82ba72890806ae3 from #31620, or the “qa: Avoid calling stop-RPC if not connected” from this PR, and running feature_framework_startup_failures.py.

DrahtBot commented at 11:53 am on August 15, 2024: contributor

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/30660.

Reviews

See the guideline for information on the review process.

Type	Reviewers
ACK	l0rinc, ryanofsky
Stale ACK	tdb3, m3dwards

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

No conflicts as of last run.

DrahtBot added the label Tests on Aug 15, 2024

hodlinator commented at 11:54 am on August 15, 2024: contributor

Encountered a case of #30390 here: https://github.com/bitcoin/bitcoin/actions/runs/10380115902/job/28739417666?pr=30377

Output with only the test (no fix commit):

 0$ test/functional/feature_framework_rpc_failure.py
 12024-08-15T07:30:20.427000Z TestFramework (INFO): PRNG seed is: 635212259632049987
 22024-08-15T07:30:20.427000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_ujsvbdnm
 32024-08-15T07:30:20.428000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
 42024-08-15T07:30:20.429000Z TestFramework (ERROR): Assertion failed
 5Traceback (most recent call last):
 6  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
 7    node.wait_for_rpc_connection()
 8  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 329, in wait_for_rpc_connection
 9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
10  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 192, in _raise_assertion_error
11    raise AssertionError(self._node_msg(msg))
12AssertionError: [node 0] Unable to connect to bitcoind after 0s
13
14During handling of the above exception, another exception occurred:
15
16Traceback (most recent call last):
17  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 21, in setup_network
18    BitcoinTestFramework.setup_network(self)
19  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
20    self.setup_nodes()
21  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
22    self.start_nodes()
23  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
24    self.stop_nodes()
25  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
26    node.stop_node(wait=wait, wait_until_stopped=False)
27  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
28    self.stop(wait=wait)
29    ^^^^^^^^^
30  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
31    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
32           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
33AssertionError: [node 0] Error: no RPC connection
34
35During handling of the above exception, another exception occurred:
36
37Traceback (most recent call last):
38  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
39    self.setup()
40  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
41    self.setup_network()
42  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 23, in setup_network
43    assert_equal(str(e), "[node 0] Unable to connect to bitcoind after 0s")
44  File "/home/hodlinator/bitcoin/test/functional/test_framework/util.py", line 75, in assert_equal
45    raise AssertionError("not(%s)" % " == ".join(str(arg) for arg in (thing1, thing2) + args))
46AssertionError: not([node 0] Error: no RPC connection == [node 0] Unable to connect to bitcoind after 0s)
472024-08-15T07:30:20.480000Z TestFramework (INFO): Stopping nodes
48Traceback (most recent call last):
49  File "/home/hodlinator/bitcoin/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
50    FeatureFrameworkRPCFailure(__file__).main()
51  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
52    exit_code = self.shutdown()
53                ^^^^^^^^^^^^^^^
54  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
55    self.stop_nodes()
56  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
57    node.stop_node(wait=wait, wait_until_stopped=False)
58  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 390, in stop_node
59    self.stop(wait=wait)
60    ^^^^^^^^^
61  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 209, in __getattr__
62    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
63           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
64AssertionError: [node 0] Error: no RPC connection
65[node 0] Cleaning up leftover process

With fix:

0$ test/functional/feature_framework_rpc_failure.py
12024-08-15T07:30:05.092000Z TestFramework (INFO): PRNG seed is: 1672695534155872365
22024-08-15T07:30:05.092000Z TestFramework (INFO): Initializing test directory /run/user/1001/bitcoin_func_test_cow4vrf7
32024-08-15T07:30:05.093000Z TestFramework (INFO): Forcing us to timeout instantly to simulate an unresponsive bitcoind
42024-08-15T07:30:05.094000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 326500.
52024-08-15T07:30:05.144000Z TestFramework (INFO): Caught AssertionError with expected message
62024-08-15T07:30:05.194000Z TestFramework (INFO): Stopping nodes
72024-08-15T07:30:05.194000Z TestFramework (INFO): Cleaning up /run/user/1001/bitcoin_func_test_cow4vrf7 on exit
82024-08-15T07:30:05.194000Z TestFramework (INFO): Tests successful

maflcko commented at 12:32 pm on August 15, 2024: member

Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed. However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.

hodlinator force-pushed on Aug 18, 2024

hodlinator commented at 8:23 pm on August 18, 2024: contributor

Can you explain this a bit more? I think optimizing the test framework to cleanly shut down on a failure is a bit premature, and may not be needed.

As long as the additional complexity and maintenance burden is low, I think it is worth having a cleaner shutdown to avoid muddying the waters for people trying to figure out root causes.

However, if a failing test may leave behind a dangling bitcoind process, this seems like something to fix, because it will otherwise lead to test warnings and failures down the road.

It does clean up dangling bitcoind processes without the fix, as can be seen in the log:

0[node 0] Cleaning up leftover process

The latest push (~~bcc87514e780cd35d1f71d5c4c4069a8fac95324~~ 0ea00cddc3d4a3f92c6bfae13a8b844e97ebc8c3) now explains in the log message that it is terminating the process to avoid wait_until_stopped timing out (which raises an additional exception).

+ Broke out a commit adding the self.expected_ret_code functionality, so that the fix commit in the middle is clearer. + Attempted to make the comment at the top of the test file more explanatory. + Switched from process.kill() to the somewhat less aggressive process.terminate() (although on second thoughts, if the RPC interface is not responding in a real-world situation outside of the added test, maybe .kill() is closer to what we want).

hodlinator force-pushed on Aug 18, 2024

hodlinator commented at 8:27 pm on August 26, 2024: contributor

Last week I was hit by this again. Probably line ending differences on Windows makes any additional Python exception reporting extra annoying to read due to every second line being blank (separate issue, but slightly improved by this PR producing less output).

Git-sleuthing to help motivate this PR

Was this ever working?

Unfortunately for my case, it seems the answer is no: 7897338918dac072e788b8ab2919d4559f311bef (Aug, 2017) broke out test_node.py from test_framework.py added the self.rpc_connected flag. Would have been nice if it had protected use of the stop() RPC using this new flag (like in this PR) but unfortunately it did not.

Fixes in the same area

fafb33cdefd2d8ce065263978075d26a1672b630 (May, 2016) MarcoFalke wrapped stop() attempts in try/except so at least other nodes would have a chance of shutting down properly. (6 days later in fab5233fe6c99d446a2cebc48cf479c3f026c2cc a timeout for waiting on the process to shutdown was added too, probably so the tests wouldn’t hang waiting for a non-crashed bitcoind that we failed to send stop() to).

#14670 (Dec, 2018) is tangential and adds test/functional/feature_shutdown.py in 28479f926f21f2a91bec5a06671c60e5b0c55532, to ensure that HTTP responses for stop() make it all the way back to the client if I’m understanding it correctly. It fixes #11777 (Nov, 2017) which is arguably about shutdowns not being clean enough (sometimes showing additional errors depending on timing). The current PR is certainly more nit-level, but still continuing in the same general direction.

hodlinator commented at 10:07 pm on August 26, 2024: contributor

Another argument for this PR is that even the current title of #30390 is “ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it”. This in itself when looking at logs like https://github.com/bitcoin/bitcoin/actions/runs/9677128783/job/26698120754#step:27:225 is actually just a knock-on exception happening after the RPC connection attempt times out and stop()-RPC is called in vain:

 0313/313 - wallet_listsinceblock.py --legacy-wallet failed, Duration: 2415 s
 1stdout:
 22024-06-26T09:53:17.196000Z TestFramework (INFO): PRNG seed is: 3062882430636553844
 32024-06-26T09:53:17.227000Z TestFramework (INFO): Initializing test directory D:\a\_temp\test_runner_₿_🏃_20240626_093314\wallet_listsinceblock_88
 42024-06-26T10:33:29.783000Z TestFramework (ERROR): Assertion failed
 5Traceback (most recent call last):
 6  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 569, in start_nodes
 7    node.wait_for_rpc_connection()
 8  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
 9    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
10  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
11    raise AssertionError(self._node_msg(msg))
12AssertionError: [node 3] Unable to connect to bitcoind after 2400s
13
14During handling of the above exception, another exception occurred:
15
16Traceback (most recent call last):
17  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 131, in main
18    self.setup()
19  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 302, in setup
20    self.setup_network()
21  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 396, in setup_network
22    self.setup_nodes()
23  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 418, in setup_nodes
24    self.start_nodes()
25  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 572, in start_nodes
26    self.stop_nodes()
27  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
28    node.stop_node(wait=wait, wait_until_stopped=False)
29  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
30    self.stop(wait=wait)
31    ^^^^^^^^^
32  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 205, in __getattr__
33    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
34           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
35AssertionError: [node 3] Error: no RPC connection
362024-06-26T10:33:29.842000Z TestFramework (INFO): Stopping nodes
37[node 3] Cleaning up leftover process
38[node 2] Cleaning up leftover process
39[node 1] Cleaning up leftover process
40[node 0] Cleaning up leftover process
41
42stderr:
43Traceback (most recent call last):
44  File "D:\a\bitcoin\bitcoin/test/functional/wallet_listsinceblock.py", line 508, in <module>
45    ListSinceBlockTest().main()
46  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 155, in main
47    exit_code = self.shutdown()
48                ^^^^^^^^^^^^^^^
49  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 318, in shutdown
50    self.stop_nodes()
51  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 587, in stop_nodes
52    node.stop_node(wait=wait, wait_until_stopped=False)
53  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 375, in stop_node
54    self.stop(wait=wait)
55  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\coverage.py", line 50, in __call__
56    return_val = self.auth_service_proxy_instance.__call__(*args, **kwargs)
57                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
58  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 127, in __call__
59    response, status = self._request('POST', self.__url.path, postdata.encode('utf-8'))
60                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
61  File "D:\a\bitcoin\bitcoin\test\functional\test_framework\authproxy.py", line 105, in _request
62    self.__conn.request(method, path, postdata, headers)
63  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1336, in request
64    self._send_request(method, url, body, headers, encode_chunked)
65  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1382, in _send_request
66    self.endheaders(body, encode_chunked=encode_chunked)
67  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1331, in endheaders
68    self._send_output(message_body, encode_chunked=encode_chunked)
69  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1091, in _send_output
70    self.send(msg)
71  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1035, in send
72    self.connect()
73  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\http\client.py", line 1001, in connect
74    self.sock = self._create_connection(
75                ^^^^^^^^^^^^^^^^^^^^^^^^
76  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 853, in create_connection
77    raise exceptions[0]
78  File "C:\hostedtoolcache\windows\Python\3.12.4\x64\Lib\socket.py", line 838, in create_connection
79    sock.connect(sa)
80ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

A better title might have been “Unable to connect to bitcoind after 2400s errors on Windows”, which this PR would have made it more likely to be in the first place (had it predated the issue).

hodlinator force-pushed on Aug 27, 2024

hodlinator commented at 9:54 pm on August 27, 2024: contributor

After more investigation into #30390, my impression is that since TestNode.wait_for_rpc_connection() ignores a bunch of exceptions, the knock-on exceptions when calling stop() and failing prior to this PR actually may sometimes provide a clue for why the RPC connection wasn’t established in the first place.

That realization made me add 2 more commits here, which provide more information on the original exceptions being triggered and ignored inside wait_for_rpc_connection. Only fair since we are avoiding the triggering of knock-on exceptions later.

The timeout exception’s message thereby now includes counters of encountered (and ignored) exceptions + latest exception type & message: Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 5}, latest error: ValueError('No RPC credentials'))

DrahtBot added the label CI failed on Aug 27, 2024

DrahtBot removed the label CI failed on Sep 29, 2024

DrahtBot added the label CI failed on Oct 13, 2024

achow101 requested review from josibake on Oct 15, 2024

achow101 requested review from willcl-ark on Oct 15, 2024

achow101 commented at 3:30 pm on October 15, 2024: member

cc @tdb3

DrahtBot removed the label CI failed on Oct 18, 2024

hodlinator force-pushed on Oct 21, 2024

hodlinator commented at 10:14 pm on October 21, 2024: contributor

(Rebased onto latest master).

in test/functional/test_framework/test_node.py:324 in 5c95843281 outdated

320+                    ignored_errors["JSONRPCException -28"] += 1
321                 # -342 Service unavailable, RPC server started but is shutting down due to error
322-                if e.error['code'] != -28 and e.error['code'] != -342:
323+                # - Ignore the exception to properly raise the FailedToStartError.
324+                elif e.error['code'] == -342:
325+                    ignored_errors["JSONRPCException -342"] += 1

tdb3 commented at 6:19 pm on October 31, 2024:

non-blocking nit: might be a bit more Pythonic to use Counter.update() rather than += 1. For example:

0ignored_errors.update([f"JSONRPCException {e.error['code']}"])

hodlinator commented at 8:23 pm on November 2, 2024:

Tried using update(), but it resulted in things like:

0AssertionError: [node 0] Unable to connect to bitcoind after 10s (ignored errors: {'m': 40, 'i': 120, 's': 120, 'n': 80, 'g': 40, '_': 40, 'c': 40, 'r': 40, 'e': 80, 'd': 40, 't': 40, 'a': 40, 'l': 40}, latest error: ValueError('No RPC credentials'))

i.e. it was iterating over the string and incrementing an entry for each unique character, which seems to match the docs even though not super clear.~

Edit: @tdb3 pointed out out of band than I’d probably missed out the []-brackets around the strings to make update() iterate through whole strings instead of chars.

0ignored_errors.update([f"JSONRPCException {e.error['code']}"])

feels slightly more contrived and verbose to me than

0ignored_errors[f"JSONRPCException {e.error['code']}"] += 1

but neither is too elegant.

Might switch away from Counter if something more appropriate is found.

in test/functional/test_framework/test_node.py:336 in 5c95843281 outdated

335                 elif e.errno == errno.ETIMEDOUT:
336-                    pass # Treat identical to ECONNRESET
337+                    ignored_errors["OSError.ETIMEDOUT"] += 1 # Treat identical to ECONNRESET
338                 elif e.errno == errno.ECONNREFUSED:
339-                    pass  # Port not yet open?
340+                    ignored_errors["OSError.ECONNREFUSED"] += 1 # Port not yet open?

tdb3 commented at 6:27 pm on October 31, 2024:

non-blocking nit: same with these. e.g.

0ignored_errors.update([errno.errorcode[e.errno]])

hodlinator commented at 8:21 pm on November 2, 2024:

(Same issue with update(), see #30660 (review)).

in test/functional/feature_framework_rpc_failure.py:35 in 9173b05600 outdated

30+            BitcoinTestFramework.setup_network(self)
31+        except AssertionError as e:
32+            assert_equal(str(e)[:47], "[node 0] Unable to connect to bitcoind after 0s")
33+            self.log.info(f"{test_name}: Caught AssertionError with expected RPC connection failure message")
34+        else:
35+            raise AssertionError("Didn't raise expected error")

tdb3 commented at 6:47 pm on October 31, 2024:

Maybe we can use util.assert_raises() or util.assert_raises_message() here to keep things more concise?

hodlinator commented at 8:22 pm on November 2, 2024:

Had to add a commit (2a0c0e410e9f32dbf7af7229457889b09553c438) that refurbishes assert_raises_message(), but test is much cleaner now!

tdb3 approved

tdb3 commented at 6:49 pm on October 31, 2024: contributor

Code review ACK 9173b056006fd3940675108aa3fe0d0d855af768

This seems quite useful when troubleshooting errors. Left a few nits, but nothing I feel very strongly about.

hodlinator force-pushed on Nov 2, 2024

hodlinator commented at 8:26 pm on November 2, 2024: contributor

Thanks for the review @tdb3!

Implemented feedback from that review, made the code a bit more Pythonic and improved comments slightly.

in test/functional/test_framework/test_node.py:341 in b6b30c2d9c outdated

337                     raise
338+                ignored_errors["missing_credentials"] += 1
339+                latest_error = repr(e)
340             time.sleep(1.0 / poll_per_s)
341-        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
342+        self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored errors: {}, latest error: {})".format(self.rpc_timeout, str(dict(ignored_errors)), latest_error))

tdb3 commented at 10:50 pm on November 2, 2024:

non-blocking pico-nit: If this file gets touched again, could use f string instead of format. https://github.com/bitcoin/bitcoin/blob/master/test/functional/README.md#style-guidelines

hodlinator commented at 11:11 pm on November 2, 2024:

Good point! Resolved in latest push.

tdb3 approved

tdb3 commented at 10:59 pm on November 2, 2024: contributor

Code Review re ACK 1148c4a53295e7635a813aadfde53ba5d973395c

Nice work cleaning up the existing code around the core changes. Increases readability.

hodlinator force-pushed on Nov 2, 2024

tdb3 approved

tdb3 commented at 11:15 pm on November 2, 2024: contributor

re ACK 042cee7073b338d79b5176d4157cf08bbd079b9f

in test/functional/feature_framework_rpc_failure.py:26 in 042cee7073 outdated

21+    def set_test_params(self):
22+        self.num_nodes = 1
23+
24+    def setup_network(self):
25+        test_name = Path(__file__).name
26+        self.log.info(f"{test_name}: Setting RPC timeout to 0 to simulate an unresponsive bitcoind")

maflcko commented at 10:15 am on November 4, 2024:

Any reason to log the file name here? The containing folder of the log should already have the name of the test/file. If not, the user should know which test they called to know it.

If not, and this was helpful, it would be better to log it for all tests, instead of just for this one test specifically.

hodlinator commented at 10:44 am on November 8, 2024:

Wanted to make clear it wasn’t something internal to BitcoinTestFramework/TestNode deciding to set the RPC timeout. Replaced that with instead appending “, expect one warning” to the log message. (Edit: now pushed).

in test/functional/test_framework/test_node.py:315 in 5ca92b17c5 outdated

310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
311                 self.rpc_connected = True
312                 self.url = self.rpc.rpc_url
313                 return
314-            except JSONRPCException as e:  # Initialization phase
315+            except JSONRPCException as e:
316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.

maflcko commented at 11:24 am on November 4, 2024:

nit in 5ca92b17c5d2d5efa7b2246da1905bb8fd186230: It is obvious, but could mention the success case as well?

Retry after these, until a permanent success or failure state (such as FailedToStartError, or timeout) is reached.?

hodlinator commented at 10:47 am on November 8, 2024:

Good point, changed to Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).

in test/functional/test_framework/test_node.py:317 in 5ca92b17c5 outdated

310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
311                 self.rpc_connected = True
312                 self.url = self.rpc.rpc_url
313                 return
314-            except JSONRPCException as e:  # Initialization phase
315+            except JSONRPCException as e:
316+                # Avoid raising these, to possibly raise the FailedToStartError on next iteration.
317                 # -28 RPC in warmup
318                 # -342 Service unavailable, RPC server started but is shutting down due to error

maflcko commented at 11:36 am on November 4, 2024:

I don’t think this is true, it is just one example. I think 342 can be raised by authproxy.py for various reasons.

hodlinator commented at 10:48 am on November 8, 2024:

Made comment less definitive.

in test/functional/test_framework/test_node.py:331 in 5ca92b17c5 outdated

330-                    pass  # Port not yet open?
331-                else:
332+                # ECONNRESET   - This might happen when the RPC server is in warmup, but shut down before the call to
333+                #                getblockcount succeeds. Try again to properly raise the FailedToStartError
334+                # ETIMEDOUT    - Treat identical to ECONNRESET
335+                # ECONNREFUSED - Port not yet open?

maflcko commented at 11:42 am on November 4, 2024:

nit: Could just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?

hodlinator commented at 10:48 am on November 8, 2024:

Added: Suppress similarly to the above JSONRPCException errors:

maflcko commented at 3:06 pm on November 8, 2024:

I still don’t know if the current docs are correct and useful.

Why would ETIMEDOUT only happen on warmup? Why is it even relevant to mention anything here?

I think it would be clearer to just mention that the three are treated equal to “-342 Service unavailable”, which is already explained above?

maflcko commented at 11:27 am on November 18, 2024:

re #30660 (review)

Are you still working on this?

hodlinator commented at 10:19 pm on November 18, 2024:

I am still working on this PR. Not thrilled about removing the 3 comment strings predating this PR.

hodlinator commented at 10:35 am on November 26, 2024:

I am satisfied with the PR in it’s latest revision, but if you don’t consider this a nit, but rather a blocker, let me know.

ryanofsky commented at 7:57 pm on January 6, 2025:

In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

I think these documentation changes look ok, but looking at old and new versions side by side I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

E.g. the previous “# Initialization phase” comment next to the exception being caught told me that the exception was expected during initialization. The current comment “Suppress these in favor of a later outcome” tells me the code does not care about this exception as much as other exceptions, but not why.

Am fine with both versions though.

hodlinator commented at 4:43 pm on January 9, 2025:

I mostly found the previous comments easier to follow because they were more direct and interspersed with relevant code.

Hopefully this is from the latest push can be considered an improvement:

0if e.errno not in [ errno.ECONNRESET,     # This might happen when the RPC server is in warmup,
1                                          # but shut down before the call to getblockcount succeeds.
2                    errno.ETIMEDOUT,      # Treat identical to ECONNRESET
3                    errno.ECONNREFUSED ]: # Port not yet open?

in test/functional/test_framework/test_node.py:328 in 5ca92b17c5 outdated

335+                # ECONNREFUSED - Port not yet open?
336+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
337                     raise  # unknown OS error
338-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
339+            except ValueError as e:
340+                # If cookie file is missing and no rpcuser or rpcpassword; bitcoind is still starting, otherwise raise

maflcko commented at 11:47 am on November 4, 2024:

nit: I don’t think “bitcoind is still starting” is true either. It is just one example. Another example would be that the cookie file is (accidentally) disabled in the config.

What about just `Retry if cookie file is not created and no rpcuser or rpcpassword."?

hodlinator commented at 10:49 am on November 8, 2024:

Agree, made comment less definitive, but still think it’s good to include the common cause: Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting

maflcko commented at 6:18 pm on November 6, 2024: member

left some nits on the first commit. Will take a look at the others later.

hodlinator force-pushed on Nov 8, 2024

hodlinator commented at 10:56 am on November 8, 2024: contributor

Thanks for the reviews! Was trying to keep comments largely intact, but agree with @maflcko that they were somewhat incorrect, so altered them. Also renamed ignored_errors (introduced by this PR) to suppressed_errors since it felt more fitting and also matches the updated comments.

DrahtBot added the label CI failed on Nov 8, 2024

DrahtBot commented at 11:32 am on November 8, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/32707688603

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

hodlinator force-pushed on Nov 8, 2024

DrahtBot removed the label CI failed on Nov 8, 2024

tdb3 approved

tdb3 commented at 5:32 pm on November 9, 2024: contributor

Code Review re ACK 68b0530d5fe867dbee5e551ecf6d491fecbc77f1

m3dwards commented at 10:40 am on November 18, 2024: contributor

Concept ACK

I like the motivation to cleanup the logs and remove the cascade of exceptions.

To test, I modified the included test to see what errors a user would have seen without the assertion:

 0--- a/test/functional/feature_framework_rpc_failure.py
 1+++ b/test/functional/feature_framework_rpc_failure.py
 2@@ -23,7 +23,8 @@ class FeatureFrameworkRPCFailure(BitcoinTestFramework):
 3         self.log.info(f"Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
 4         self.rpc_timeout = 0
 5
 6-        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)
 7+        BitcoinTestFramework.setup_network(self)
 8
 9     def run_test(self):
10         pass

 02024-11-18T09:59:33.936000Z TestFramework (INFO): PRNG seed is: 6704525597374825472
 12024-11-18T09:59:33.937000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_kccj2gaf
 22024-11-18T09:59:33.939000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
 32024-11-18T09:59:33.944000Z TestFramework (ERROR): Assertion failed
 4Traceback (most recent call last):
 5  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 569, in start_nodes
 6    node.wait_for_rpc_connection()
 7  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 335, in wait_for_rpc_connection
 8    self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
 9  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 198, in _raise_assertion_error
10    raise AssertionError(self._node_msg(msg))
11AssertionError: [node 0] Unable to connect to bitcoind after 0s
12
13During handling of the above exception, another exception occurred:
14
15Traceback (most recent call last):
16  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 131, in main
17    self.setup()
18  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 302, in setup
19    self.setup_network()
20  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 24, in setup_network
21    BitcoinTestFramework.setup_network(self)
22  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 396, in setup_network
23    self.setup_nodes()
24  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
25    self.start_nodes()
26  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 572, in start_nodes
27    self.stop_nodes()
28  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
29    node.stop_node(wait=wait, wait_until_stopped=False)
30  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
31    self.stop(wait=wait)
32    ^^^^^^^^^
33  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
34    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
35           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
36AssertionError: [node 0] Error: no RPC connection
372024-11-18T09:59:33.997000Z TestFramework (INFO): Stopping nodes
38Traceback (most recent call last):
39  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 32, in <module>
40    FeatureFrameworkRPCFailure(__file__).main()
41  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 155, in main
42    exit_code = self.shutdown()
43                ^^^^^^^^^^^^^^^
44  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 318, in shutdown
45    self.stop_nodes()
46  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
47    node.stop_node(wait=wait, wait_until_stopped=False)
48  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 396, in stop_node
49    self.stop(wait=wait)
50    ^^^^^^^^^
51  File "/Users/max/source/bitcoin/build/test/functional/test_framework/test_node.py", line 215, in __getattr__
52    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
53           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
54AssertionError: [node 0] Error: no RPC connection
55[node 0] Cleaning up leftover process

 02024-11-18T10:09:52.135000Z TestFramework (INFO): PRNG seed is: 7321683449385937851
 12024-11-18T10:09:52.135000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
 22024-11-18T10:09:52.137000Z TestFramework (INFO): Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.
 32024-11-18T10:09:52.141000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 47850 so that wait_until_stopped will not time out.
 42024-11-18T10:09:52.194000Z TestFramework (ERROR): Assertion failed
 5Traceback (most recent call last):
 6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
 7    self.setup()
 8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
 9    self.setup_network()
10  File "/Users/max/source/bitcoin/./build/test/functional/feature_framework_rpc_failure.py", line 26, in setup_network
11    BitcoinTestFramework.setup_network(self)
12  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
13    self.setup_nodes()
14  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
15    self.start_nodes()
16  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
17    node.wait_for_rpc_connection()
18  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 344, in wait_for_rpc_connection
19    self._raise_assertion_error(f"Unable to connect to bitcoind after {self.rpc_timeout}s (ignored errors: {str(dict(suppressed_errors))}, latest error: {latest_error})")
20  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 199, in _raise_assertion_error
21    raise AssertionError(self._node_msg(msg))
22AssertionError: [node 0] Unable to connect to bitcoind after 0s (ignored errors: {}, latest error: )
232024-11-18T10:09:52.250000Z TestFramework (INFO): Stopping nodes
242024-11-18T10:09:52.250000Z TestFramework (WARNING): Not cleaning up dir /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a
252024-11-18T10:09:52.250000Z TestFramework (ERROR): Test failed. Test logging available at /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a/test_framework.log
262024-11-18T10:09:52.250000Z TestFramework (ERROR):
272024-11-18T10:09:52.251000Z TestFramework (ERROR): Hint: Call /Users/max/source/bitcoin/test/functional/combine_logs.py '/var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_izqaxw8a' to consolidate all logs
282024-11-18T10:09:52.251000Z TestFramework (ERROR):
292024-11-18T10:09:52.251000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
302024-11-18T10:09:52.251000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
312024-11-18T10:09:52.251000Z TestFramework (ERROR):

I prefer the logs with this change!

Next I modified bitcoind so that it would fail to start and ran a random functional test (p2p_blockfilters.py)

 02024-11-18T09:39:40.861000Z TestFramework (INFO): PRNG seed is: 8730495587434508112
 12024-11-18T09:39:40.862000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_0gw_zhhl
 22024-11-18T09:39:41.372000Z TestFramework (ERROR): Assertion failed
 3Traceback (most recent call last):
 4  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
 5    node.wait_for_rpc_connection()
 6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 276, in wait_for_rpc_connection
 7    raise FailedToStartError(self._node_msg(
 8test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
 9
10During handling of the above exception, another exception occurred:
11
12Traceback (most recent call last):
13  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
14    self.setup()
15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
16    self.setup_network()
17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
18    self.setup_nodes()
19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
20    self.start_nodes()
21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
22    self.stop_nodes()
23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
24    node.stop_node(wait=wait, wait_until_stopped=False)
25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
26    self.stop(wait=wait)
27    ^^^^^^^^^
28  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
29    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
30           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31AssertionError: [node 0] Error: no RPC connection
322024-11-18T09:39:41.431000Z TestFramework (INFO): Stopping nodes
33Traceback (most recent call last):
34  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
35    CompactFiltersTest(__file__).main()
36  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
37    exit_code = self.shutdown()
38                ^^^^^^^^^^^^^^^
39  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
40    self.stop_nodes()
41  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 587, in stop_nodes
42    node.stop_node(wait=wait, wait_until_stopped=False)
43  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 396, in stop_node
44    self.stop(wait=wait)
45    ^^^^^^^^^
46  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 215, in __getattr__
47    assert self.rpc_connected and self.rpc is not None, self._node_msg("Error: no RPC connection")
48           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
49AssertionError: [node 0] Error: no RPC connection
50[node 1] Cleaning up leftover process
51[node 0] Cleaning up leftover process

 02024-11-15T13:07:28.308000Z TestFramework (INFO): PRNG seed is: 2784953090753807617
 12024-11-15T13:07:28.308000Z TestFramework (INFO): Initializing test directory /var/folders/q9/65c9dvyj3lbdsv5zrhp_8qk80000gn/T/bitcoin_func_test_udgky3fr
 22024-11-15T13:07:28.821000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
 32024-11-15T13:07:28.821000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
 42024-11-15T13:07:28.821000Z TestFramework (ERROR): Assertion failed
 5Traceback (most recent call last):
 6  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
 7    node.wait_for_rpc_connection()
 8  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
 9    raise FailedToStartError(self._node_msg(
10test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization.
11
12During handling of the above exception, another exception occurred:
13
14Traceback (most recent call last):
15  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
16    self.setup()
17  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
18    self.setup_network()
19  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
20    self.setup_nodes()
21  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
22    self.start_nodes()
23  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 572, in start_nodes
24    self.stop_nodes()
25  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
26    node.wait_until_stopped()
27  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
28    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
29  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
30    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
31           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
32  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
33    if predicate():
34       ^^^^^^^^^^^
35  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
36    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
37                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
39    assert return_code == expected_ret_code, self._node_msg(
40           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
41AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
422024-11-15T13:07:28.880000Z TestFramework (INFO): Stopping nodes
432024-11-15T13:07:28.881000Z TestFramework.node0 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26867 so that wait_until_stopped will not time out.
442024-11-15T13:07:28.881000Z TestFramework.node1 (WARNING): Cannot call stop-RPC as we are not connected. Killing process 26868 so that wait_until_stopped will not time out.
45Traceback (most recent call last):
46  File "/Users/max/source/bitcoin/build/test/functional/p2p_blockfilters.py", line 285, in <module>
47    CompactFiltersTest(__file__).main()
48  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 155, in main
49    exit_code = self.shutdown()
50                ^^^^^^^^^^^^^^^
51  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 318, in shutdown
52    self.stop_nodes()
53  File "/Users/max/source/bitcoin/test/functional/test_framework/test_framework.py", line 591, in stop_nodes
54    node.wait_until_stopped()
55  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in wait_until_stopped
56    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
57  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 860, in wait_until
58    return wait_until_helper_internal(test_function, timeout=timeout, timeout_factor=self.timeout_factor)
59           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
60  File "/Users/max/source/bitcoin/test/functional/test_framework/util.py", line 291, in wait_until_helper_internal
61    if predicate():
62       ^^^^^^^^^^^
63  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 462, in <lambda>
64    self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
65                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
66  File "/Users/max/source/bitcoin/test/functional/test_framework/test_node.py", line 439, in is_node_stopped
67    assert return_code == expected_ret_code, self._node_msg(
68           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
69AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping
70[node 1] Cleaning up leftover process 26868
71[node 0] Cleaning up leftover process 26867

I’m not sure if it’s the intention of this PR to improve things in that scenario but the log doesn’t seem much better and if anything it was slightly easier to understand before.

Before this change the logs showed AssertionError: [node 0] Error: no RPC connection and after this change that is now AssertionError: [node 0] Node returned unexpected exit code (1) vs (-9) when stopping. I feel like no RPC connection is the better error here?

hodlinator force-pushed on Nov 18, 2024

hodlinator commented at 11:13 pm on November 18, 2024: contributor

Thanks @m3dwards for having a look!

I had not tested the case of bitcoind exiting early together with my suggested change here (focused more on stalling as in one of the issues inspiring it).

Made some changes to this part of TestNode.stop_node() to handle an already dead process:

https://github.com/bitcoin/bitcoin/blob/ba28593147ccc880c3a4d40db0d4ef57f4766254/test/functional/test_framework/test_node.py#L412-L420

With those changes, the output for such a case is now:

 0₿ build/test/functional/p2p_blockfilters.py 
 12024-11-18T23:01:17.204000Z TestFramework (INFO): PRNG seed is: 5667811122664278008
 22024-11-18T23:01:17.205000Z TestFramework (INFO): Initializing test directory /tmp/bitcoin_func_test_mzr3jg96
 32024-11-18T23:01:17.458000Z TestFramework.node0 (WARNING): Process 175180 already died with return code 1.
 42024-11-18T23:01:17.458000Z TestFramework.node1 (WARNING): Process 175181 already died with return code 1.
 52024-11-18T23:01:17.459000Z TestFramework (ERROR): Unexpected exception caught during testing
 6Traceback (most recent call last):
 7  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 131, in main
 8    self.setup()
 9  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 302, in setup
10    self.setup_network()
11  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 396, in setup_network
12    self.setup_nodes()
13  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 418, in setup_nodes
14    self.start_nodes()
15  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_framework.py", line 569, in start_nodes
16    node.wait_for_rpc_connection()
17  File "/home/hodlinator/bitcoin/test/functional/test_framework/test_node.py", line 279, in wait_for_rpc_connection
18    raise FailedToStartError(self._node_msg(
19test_framework.test_node.FailedToStartError: [node 0] bitcoind exited with status 1 during initialization. 
202024-11-18T23:01:17.513000Z TestFramework (INFO): Stopping nodes
212024-11-18T23:01:17.513000Z TestFramework (WARNING): Not cleaning up dir /tmp/bitcoin_func_test_mzr3jg96
222024-11-18T23:01:17.513000Z TestFramework (ERROR): Test failed. Test logging available at /tmp/bitcoin_func_test_mzr3jg96/test_framework.log
232024-11-18T23:01:17.513000Z TestFramework (ERROR): 
242024-11-18T23:01:17.513000Z TestFramework (ERROR): Hint: Call /home/hodlinator/bitcoin/test/functional/combine_logs.py '/tmp/bitcoin_func_test_mzr3jg96' to consolidate all logs
252024-11-18T23:01:17.513000Z TestFramework (ERROR): 
262024-11-18T23:01:17.514000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log.
272024-11-18T23:01:17.514000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues
282024-11-18T23:01:17.514000Z TestFramework (ERROR):

m3dwards commented at 1:13 pm on November 21, 2024: contributor

Made some changes to this part of TestNode.stop_node() to handle an already dead process

Looks cleaner.

ACK ba28593147ccc880c3a4d40db0d4ef57f4766254

DrahtBot requested review from tdb3 on Nov 21, 2024

in test/functional/test_framework/test_node.py:8 in ba28593147 outdated

4@@ -5,6 +5,7 @@
5 """Class for bitcoind node under test"""
6 
7 import contextlib
8+from collections import Counter

hodlinator commented at 8:31 pm on December 3, 2024:

(Could possibly use collections.defaultdict(int) instead of Counter iff I really need to retouch).

DrahtBot added the label CI failed on Dec 11, 2024

DrahtBot commented at 1:30 am on December 11, 2024: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/33168290506

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

hodlinator force-pushed on Dec 11, 2024

hodlinator commented at 8:29 pm on December 11, 2024: contributor

Fixed ruff linter error reported by CI due to #31417 being merged.
Added feature_framework_stop_node.py-test to verify clean shutdown of TestNode when the bitcoind-process already exited/crashed, issue found earlier by m3dwards. (Also added comment justifying setting of self.expected_ret_code in test_node.py).
Switched from Counter to simpler collections.defaultdict(int)-type.
Updated copyright headers to include “-present”.

hodlinator force-pushed on Dec 11, 2024

DrahtBot removed the label CI failed on Dec 11, 2024

in test/functional/test_framework/test_node.py:329 in 5321eb089c outdated

337+                # ECONNREFUSED - Port not yet open?
338+                if e.errno not in [errno.ECONNRESET, errno.ETIMEDOUT, errno.ECONNREFUSED]:
339                     raise  # unknown OS error
340-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
341+            except ValueError as e:
342+                # Suppress if cookie file is missing and no rpcuser or rpcpassword; bitcoind may be starting

ryanofsky commented at 7:59 pm on January 6, 2025:

In commit “test refactor: wait_for_rpc_connection - Treat OSErrors the same” (5321eb089cd1b39160038050a6ef6f27fd6f4ef9)

Found both this comment “cookie file is missing” and previous comment “cookie file not found” confusing because it is not clear why the cookie file would not be present. IMO saying “node has not generated the cookie file yet” would be clearer.

hodlinator commented at 4:41 pm on January 9, 2025:

New version: # Suppress if cookie file isn't generated yet and no rpcuser or rpcpassword; bitcoind may be starting.

in test/functional/test_framework/test_node.py:453 in db0e1580a5 outdated

447@@ -447,7 +448,9 @@ def is_node_stopped(self, *, expected_stderr="", expected_ret_code=0):
448 
449     def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
450         if "expected_ret_code" not in kwargs:
451-            kwargs["expected_ret_code"] = 1 if expect_error else 0  # Whether node shutdown return EXIT_FAILURE or EXIT_SUCCESS
452+            # Whether node shutdown is expected to return EXIT_FAILURE (1) or
453+            # EXIT_SUCCESS (self.expected_ret_code).
454+            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code

ryanofsky commented at 8:30 pm on January 6, 2025:

In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

Now that this logic is getting more complicated with self.expected_ret_code I think the expect_error argument should be dropped so the code can be more understandable. It looks like there is only a single expect_error usage in feature_abortnode.py and it can be easily replaced by using expected_ret_code. Would suggest:

 0--- a/test/functional/feature_abortnode.py
 1+++ b/test/functional/feature_abortnode.py
 2@@ -36,7 +36,7 @@ class AbortNodeTest(BitcoinTestFramework):
 3 
 4             # Check that node0 aborted
 5             self.log.info("Waiting for crash")
 6-            self.nodes[0].wait_until_stopped(timeout=5, expect_error=True, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
 7+            self.nodes[0].wait_until_stopped(timeout=5, expected_ret_code=1, expected_stderr="Error: A fatal internal error occurred, see debug.log for details: Failed to disconnect block.")
 8         self.log.info("Node crashed - now verifying restart fails")
 9         self.nodes[0].assert_start_raises_init_error()
10 
11--- a/test/functional/test_framework/test_node.py
12+++ b/test/functional/test_framework/test_node.py
13@@ -460,11 +460,11 @@ class TestNode():
14         self.log.debug("Node stopped")
15         return True
16 
17-    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, expect_error=False, **kwargs):
18+    def wait_until_stopped(self, *, timeout=BITCOIND_PROC_WAIT_TIMEOUT, **kwargs):
19         if "expected_ret_code" not in kwargs:
20             # Whether node shutdown is expected to return EXIT_FAILURE (1) or
21             # EXIT_SUCCESS (self.expected_ret_code).
22-            kwargs["expected_ret_code"] = 1 if expect_error else self.expected_ret_code
23+            kwargs["expected_ret_code"] = self.expected_ret_code
24         self.wait_until(lambda: self.is_node_stopped(**kwargs), timeout=timeout)
25 
26     def replace_in_config(self, replacements):

hodlinator commented at 4:45 pm on January 9, 2025:

Taken, although I moved some of the handling of expected_ret_code directly into is_node_stopped().

in test/functional/test_framework/test_node.py:431 in 4ab7879ec6 outdated

421             else:
422-                self.stop()
423-        except http.client.CannotSendRequest:
424-            self.log.exception("Unable to stop node.")
425+                self.log.warning(f"Process {self.process.pid} already died with return code {return_code}.")
426+                # If we no longer have a running process once stop_node() is called, assume we've already triggered an

ryanofsky commented at 8:47 pm on January 6, 2025:

In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

This change doesn’t seem safe. The stop_node method can be called by arbitrary test code at arbitrary times, including when self.rpc_connected is False, and I don’t think it should be suppressing errors unless the caller has explicitly requested that errors should not be passed on.

Would suggest narrowing the error suppression behavior, so errors and bugs will be less likely to go unnoticed because of this change:

 0--- a/test/functional/feature_framework_stop_node.py
 1+++ b/test/functional/feature_framework_stop_node.py
 2@@ -42,7 +42,7 @@ class FeatureFrameworkStopNode(BitcoinTestFramework):
 3         assert self.nodes[0].running, \
 4             "The process should still be flagged as running before calling stop_node()"
 5         self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
 6-        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)
 7+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error, force=True)
 8 
 9 if __name__ == '__main__':
10     FeatureFrameworkStopNode(__file__).main()
11--- a/test/functional/test_framework/test_framework.py
12+++ b/test/functional/test_framework/test_framework.py
13@@ -315,7 +315,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
14         if not self.options.noshutdown:
15             self.log.info("Stopping nodes")
16             if self.nodes:
17-                self.stop_nodes()
18+                self.stop_nodes(force=self.success != TestStatus.PASSED)
19         else:
20             for node in self.nodes:
21                 node.cleanup_on_exit = False
22@@ -569,7 +569,7 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
23                 node.wait_for_rpc_connection()
24         except Exception:
25             # If one node failed to start, stop the others
26-            self.stop_nodes()
27+            self.stop_nodes(force=True)
28             raise
29 
30         if self.options.coveragedir is not None:
31@@ -580,11 +580,11 @@ class BitcoinTestFramework(metaclass=BitcoinTestMetaClass):
32         """Stop a bitcoind test node"""
33         self.nodes[i].stop_node(expected_stderr, wait=wait)
34 
35-    def stop_nodes(self, wait=0):
36+    def stop_nodes(self, wait=0, force=False):
37         """Stop multiple bitcoind test nodes"""
38         for node in self.nodes:
39             # Issue RPC to stop nodes
40-            node.stop_node(wait=wait, wait_until_stopped=False)
41+            node.stop_node(wait=wait, wait_until_stopped=False, force=force)
42 
43         for node in self.nodes:
44             # Wait for nodes to stop
45--- a/test/functional/test_framework/test_node.py
46+++ b/test/functional/test_framework/test_node.py
47@@ -393,12 +393,30 @@ class TestNode():
48     def version_is_at_least(self, ver):
49         return self.version is None or self.version >= ver
50 
51-    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
52-        """Stop the node."""
53+    def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True, force=False):
54+        """Stop the node.
55+
56+        Args:
57+            force (bool): Stop the node forcefully if necessary, and change
58+                errors that would otherwise be exceptions into logged warnings.
59+
60+        Notes:
61+            - The force argument should only be set to true when a test has
62+            already failed, and the caller is already throwing an exception of
63+            its own, and wants its own exception to take precedence over errors
64+            from this function. Since the error that happened was presumably
65+            unknown and unexpected, the node may have already crashed or may
66+            need to be shut down forcefully.
67+        """
68         if not self.running:
69             return
70         self.log.debug("Stopping node")
71-        if self.rpc_connected:
72+
73+        # If self.running is true but self.rpc_connected is not true, it means
74+        # that the process has been started, but wait_for_rpc_connection() was
75+        # not able to make a successful connection. In this case, if force is
76+        # true, avoid trying to make another RPC call and just stop the process.
77+        if not force or self.rpc_connected:
78             try:
79                 # Do not use wait argument when testing older nodes, e.g. in wallet_backwards_compatibility.py
80                 if self.version_is_at_least(180000):

Note: The diff above is still calling the stop RPC instead of killing the node when self.rpc_connected is true. But it might actually be more desirable to change if not force or self.rpc_connected: there to if not force: and prefer to kill node processes instead of stopping them after test failures. Would not suggest doing this here, since it would change other behavior beyond the scope of the PR, but I could imagine cases where changing this might reduce noise and help debugging errors in CI.

hodlinator commented at 10:31 am on January 9, 2025:

It seems wrong to me to attempt using the RPC connection in stop_node() when we known that self.rpc_connected == False. New version has:

0if self.rpc_connected:
1    # Call stop-RPC...
2elif force:
3    # Kill etc...
4else:
5    raise RuntimeError(f"Cannot call stop-RPC as we don't have an RPC connection to process {self.process.pid}.")

Edit: (Ended up calling it avoid_exceptions which I think is slightly better than force).

in test/functional/test_framework/test_node.py:428 in 4ab7879ec6 outdated

415+            return_code = self.process.poll()
416+            if return_code is None:
417+                self.log.warning("Cannot call stop-RPC as we are not connected. "
418+                    f"Killing process {self.process.pid} so that wait_until_stopped will not time out.")
419+                self.expected_ret_code = 1 if platform.system() == "Windows" else -9
420+                self.process.kill()

ryanofsky commented at 4:05 pm on January 7, 2025:

In commit “test: Kill process if stop-RPC is not available” (4ab7879ec6b71804cc3a51bcf58fa6318fa103b0)

I think this is probably the simplest way to write this code and we should stick with it for now, but just want to point out there is a minor race condition here because the process could end between the time that poll and kill are called and then the kill call could throw an exception.

Another separate idea is that it might be safer to write self.expected_ret_code = self.process.wait() instead of hardcoding 1/-9 return codes. Again though would be inclined to stick with the working code unless there is a more of a reason to change.

hodlinator commented at 9:52 am on January 9, 2025:

Thanks for pointing out the race condition! It is not so likely if we get here because of RPC connection timeout, but still possible.

I tried adding a second .kill() after the first one and it does not raise an exception, and no exception is documented.

in test/functional/feature_framework_stop_node.py:45 in 608b9ff097 outdated

40+            self=self.nodes[0]
41+        )
42+        assert self.nodes[0].running, \
43+            "The process should still be flagged as running before calling stop_node()"
44+        self.log.debug("Explicitly stopping the node to verify it completes cleanly during the test")
45+        self.nodes[0].stop_node(expected_stderr=invalid_arg_error)

ryanofsky commented at 5:02 pm on January 7, 2025:

In commit “test: Add feature_framework_stop_node.py” (608b9ff097214bff3527cc240b7be313c62131d5)

One problem with this test as currently written is that it prints a real warning to the output (like TestFramework.node0 (WARNING): Process 579831 already died with return code 1). This is potentially confusing because this warning is expected and should not be a cause for concern, but you can’t know that from looking at test output.

I can think of a few possible solutions:

Redirecting log output during the stop_node call and checking for the expected warning with an assertion here.
Letting the stop_node function throw exceptions instead of converting errors into warnings, then catching and handling the exceptions in the places (just 2 places currently) where they are expected.
Not changing anything but just add a warning before the warning instructing reader to ignore the next warning.

hodlinator commented at 4:36 pm on January 9, 2025:

Went for option 3, making clear in the log message just above that a WARNING is expected.

in test/functional/feature_framework_rpc_failure.py:26 in f67d896357 outdated

21+
22+    def setup_network(self):
23+        self.log.info("Setting RPC timeout to 0 to simulate an unresponsive bitcoind, expect one warning.")
24+        self.rpc_timeout = 0
25+
26+        assert_raises_message(AssertionError, "[node 0] Unable to connect to bitcoind after 0s", BitcoinTestFramework.setup_network, self)

ryanofsky commented at 5:20 pm on January 7, 2025:

In commit “test: Add feature_framework_rpc_failure.py” (f67d8963575feecdd4914c1a56bab4e0bebc000d)

Note: there is no test coverage for the “ignored errors” and “latest error” parts of the message. I think it would be good to add some coverage for this, even if it just checking for 0 and None, so there is a at least a hint that more coverage could be added here, especially since it’s possible to imagine this functionality becoming broken in the future and more coverage helping prevent this.

hodlinator commented at 4:37 pm on January 9, 2025:

Added test/functional/feature_framework_rpc_failure_details.py.

in test/functional/test_framework/test_node.py:160 in db0e1580a5 outdated

156@@ -157,6 +157,7 @@ def __init__(self, i, datadir_path, *, chain, rpchost, timewait, timeout_factor,
157         self.rpc = None
158         self.url = None
159         self.log = logging.getLogger('TestFramework.node%d' % i)
160+        self.expected_ret_code = 0 # EXIT_SUCCESS

ryanofsky commented at 6:13 pm on January 7, 2025:

In commit “test: Allow changing expected internal return code” (db0e1580a5212a162908ac4f23e935e1f76e06dc)

Note: I like the changes in this commit, and think the new TestNode.expected_ret_code member could be a useful way to signal expected statuses when shutdown starts. But just want to point out that I don’t think adding a new expected_ret_code member is strictly necessary for this PR to work. The only place it is actually used later in the PR is in the stop_node function when it is killing a failed process, and it seems to me that stop_node could just as easily set self.running = False instead and call process.wait() itself to reap the pid without blocking.

hodlinator commented at 4:39 pm on January 9, 2025:

Interesting, I do agree there is a spaghetti/complexity factor to expected_ret_code, so might revisit this.

ryanofsky commented at 6:54 pm on January 7, 2025: contributor

Code review 608b9ff097214bff3527cc240b7be313c62131d5

Nice changes in this PR. I’ve been confused by the error output from wait_for_rpc_connection() before, so think this will be helpful.

One thing that wasn’t obvious to me from the PR description is that this improving error output in two separate ways:

One change here is adding error output to wait_for_rpc_connection. Previously errors that happened during polling before the final “Unable to connect to bitcoind” error were silently ignored, so less information would be visible about what was causing the problem. Now counts of the different errors and the most recent message are shown.
Another change here is suppressing error output after the wait_for_rpc_connection error, specifically suppressing exceptions from in stop_node when trying to call stop method fails.

I didn’t ACK the PR because I think the changes to the stop_node method are overbroad, and I suggested a way to limit them. Other than that I think this all looks very good.

hodlinator renamed this:
~~test: Shut down framework cleanly on RPC connection failure~~
qa: Shut down framework cleanly on RPC connection failure
on Jan 9, 2025

hodlinator force-pushed on Jan 9, 2025

hodlinator commented at 4:55 pm on January 9, 2025: contributor

Updated based off ryanofsky’s review.

Hopefully latest PR summary is slightly clearer.

Replaced “test” -> “qa” since PR is about functional tests and not unit tests, should conform better with CONTRIBUTING.md.

hodlinator force-pushed on Jan 9, 2025

DrahtBot added the label CI failed on Jan 9, 2025

DrahtBot commented at 6:21 pm on January 9, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/35384673501

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

hodlinator force-pushed on Jan 9, 2025

DrahtBot removed the label CI failed on Jan 9, 2025

in test/functional/test_framework/test_framework.py:572 in 8148e4da2c outdated

568@@ -569,7 +569,7 @@ def start_nodes(self, extra_args=None, *args, **kwargs):
569                 node.wait_for_rpc_connection()
570         except Exception:
571             # If one node failed to start, stop the others
572-            self.stop_nodes()
573+            self.stop_nodes(avoid_exceptions=True)

maflcko commented at 9:10 am on January 10, 2025:

coming from the conflicting pull, I don’t think this call should exist in the first place for two reasons:

It doesn’t work in most cases, because a failure to start any node likely means the RPC isn’t up for the other nodes either, so the call will just fail in most cases. Effectively just using the kill fallback.
It doesn’t make sense to special case startup. Nodes can at any time fail to start (not only in start_nodes), or crash, and adding this exception handling to all places isn’t useful nor trivially possible.

So I removed it in #31620.

(Just leaving a comment here to document the merge conflict between the two changes. Should be trivial to solve either way, so nothing to be done here, I think.)

hodlinator commented at 12:53 pm on January 17, 2025: contributor

Excuse me from being brief earlier.

Taking the time do document changes in latest push - git range-diff master 608b9ff 8148e4d

Inlined comments about different errno-codes instead of duplicating the identifiers in the comment.
Made missing cookie file comment slightly more informative.
Removed expect_error in favor of expected_ret_code after ryanofsky’s suggestion (grepping for uses of the former). This helps decrease the amount of added complexity to test_node.py.
Add avoid_exceptions parameter to stop_node() to make the behavior change more surgical, thanks to ryanofsky’s suggestion. We now only kill the node under more narrow conditions.
Clarified in log output of tests when a WARNING is expected (ryanofsky suggestion).
ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 - Added feature_framework_rpc_failure_details.py to verify that suppressed exception information is included in RPC connection timeout message (ryanofsky suggestion).
Switched feature_framework_stop_node.py from using TestNode.start() + TestNode.wait_for_rpc_connection() to BitcoinTestFramework.start_node() which combines the two.

DrahtBot added the label Needs rebase on Jan 28, 2025

hodlinator renamed this:
~~qa: Shut down framework cleanly on RPC connection failure~~
qa: Verify clean shutdown on startup failure
on Jan 28, 2025

hodlinator force-pushed on Jan 28, 2025

hodlinator commented at 10:45 pm on January 28, 2025: contributor

Resolved conflict with @maflcko’s #31620 which solved the core issue more elegantly.

I think there is still some use in the tests from this PR, as we want to avoid knock-on exceptions being reintroduced. I’ve reworked the tests from 3 to 1 Python file. They now launch a child process of which keeps the success-state of the parent BitcoinTestFramework separate from the expected failures of the child processes.

The final commit (f2e3721e018c3038d5d648ec7ae08e96418f3c29) is what remains of the core of what this PR used to be, but I haven’t been able to construe a test that exercises it. So if reviewers prefer, I can drop it.

Test instructions in updated PR summary.

hodlinator force-pushed on Jan 28, 2025

DrahtBot added the label CI failed on Jan 28, 2025

DrahtBot commented at 11:12 pm on January 28, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/36321182366

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot removed the label CI failed on Jan 29, 2025

DrahtBot removed the label Needs rebase on Jan 29, 2025

hodlinator force-pushed on Jan 29, 2025

hodlinator commented at 4:37 pm on January 29, 2025: contributor

Latest push adds 2347ce8f018b1b9b8d6ef1933a778023fd228b7f which fixes a Windows Python issue found by CI. It’s still a mystery to me why the old test from a previous version of this PR in ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 didn’t provoke the same issue on Windows CI (successful run including that commit). CI output states they’re running the same Python version.

Test was also updated to re-launch child processes using the same Python interpreter as the parent, as I was ending up with mixed 64/32-bit interpeters on Windows. Also switches from using subprocesses.getoutput() which is marked as legacy, to subprocess.run() with a timeout.

hodlinator force-pushed on Jan 29, 2025

hodlinator commented at 8:50 pm on January 29, 2025: contributor

Sorry about the churn here. Went back to ee4c9d7eba6ba6fb6dcb7abd539c23f5b62d5991 and was able to reproduce the TimeoutError.errno==None-issue there too, which is a relief. It seemed to be a bit timing-sensitive, so in the latest push I increased the still low rpc_timeout-override and applied self.options.timeout_factor to it in hopes of making it more robust.

Edit: The above was for a18f7c8a252bdffd181fe696ede8a91e285e1cce. Fixed trivial type-error in next push, 8102f3d8e0b6715433eeda78b9b57a83109fc45e.

DrahtBot added the label CI failed on Jan 29, 2025

hodlinator force-pushed on Jan 29, 2025

DrahtBot removed the label CI failed on Jan 30, 2025

in test/functional/test_framework/test_node.py:315 in 8fa04cb13a outdated

310@@ -311,23 +311,21 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
311                 self.rpc_connected = True
312                 self.url = self.rpc.rpc_url
313                 return
314-            except JSONRPCException as e:  # Initialization phase
315+            except JSONRPCException as e:
316+                # Suppress these in favor of a later outcome (success, FailedToStartError, or timeout).

ryanofsky commented at 4:38 pm on February 12, 2025:

In commit “qa refactor: wait_for_rpc_connection - Treat OSErrors the same” (8fa04cb13a4bf62a42b18e7c0c79151d13cb494b)

Various thoughts:

I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.
I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.
I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

Given all this, I’d suggest just keeping the comments in this except block unchanged.

But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.

hodlinator commented at 10:33 pm on February 12, 2025:

I think “initialization phase” comment that was removed provided more context and was more understandable than the new “Suppress these in favor of” comment.

As the commit message specifies, we are changing wait_for_rpc_connection(), so the only value the “Initialization phase” comment had came from adding some context to a ~80 line long function which is part of initialization IMO. (The mention of FailedToStartError also gives a hint at what phase we are in).

I don’t see why it is useful to list later outcomes that can happen after these errors are suppressed, especially when the list seems to contains every imaginable outcome.

I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned and https://github.com/bitcoin/bitcoin/blob/master/src/rpc/protocol.h doesn’t seem to contain that error code.

I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (review)

But if not doing that, I think “Suppress these” should be changed to something like “Suppress early init errors” so overall purpose of the except block is clear.

Would rather improve the comment at the top of the function if you think it doesn’t provide enough context? https://github.com/bitcoin/bitcoin/blob/8fa04cb13a4bf62a42b18e7c0c79151d13cb494b/test/functional/test_framework/test_node.py#L264-L265

(Edit: This comment was pending when maflcko left his below).

maflcko commented at 9:07 am on February 13, 2025:

I don’t understand why “-342 Service unavailable, RPC server started but is shutting down due to error” comment is being changed to “-342 Service unavailable, could be starting up or shutting down”. These descriptions seem inconsistent and I don’t know which one is correct since I don’t see where -342 is being returned

I’d guess that it is coming from authproxy and is pretty general:

 0$ git grep '\-342\>' ./test/
 1test/functional/feature_taproot.py:# Test Taproot softfork (BIPs 340-342)
 2test/functional/rpc_bind.py:        assert_raises_rpc_error(-342, "non-JSON HTTP response with '403 Forbidden' from server", self.run_allowip_test, ['1.1.1.1'], self.non_loopback_ip, self.defaultport)
 3test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code but no JSON-RPC error'}, status)
 4test/functional/test_framework/authproxy.py:                    'code': -342, 'message': 'non-200 HTTP status code'}, status)
 5test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'non-200 HTTP status code'}, status)
 6test/functional/test_framework/authproxy.py:                'code': -342, 'message': 'missing HTTP response from server'})
 7test/functional/test_framework/authproxy.py:                raise JSONRPCException({'code': -342, 'message': 'Content received with NO CONTENT status code'})
 8test/functional/test_framework/authproxy.py:                {'code': -342, 'message': 'non-JSON HTTP response with \'%i %s\' from server' % (http_response.status, http_response.reason)},
 9test/functional/test_framework/authproxy.py:                'code': -342, 'message': f'Cannot decode response in utf8 format, content: {data}, exception: {e}'})
10test/functional/test_framework/test_node.py:                # -342 Service unavailable, RPC server started but is shutting down due to error
11test/functional/test_framework/test_node.py:                if e.error['code'] != -28 and e.error['code'] != -342:

ryanofsky commented at 7:30 pm on March 13, 2025:

re: #30660 (review)

I think it is good to document why we are catching and suppressing some exceptions (while re-raising others).

It’s funny, because to me replacing “Initialization phase” with “Suppress these in favor of a later outcome” is literally removing the comment that explains why it is ok to suppress these errors. The previous comment made it clear these errors were being suppressed because they are expected during initialization. The new comment tells me they are being suppressed and that various other outcomes are possible later, which tells me nothing about why it is safe or good to suppress errors here.

Maybe the problem is that I’ve wasted too much time debugging issues and being traumatized by code that suppresses errors and fails to report useful information, but I found the new comment here hard to parse, and could not figure out what it was trying to tell me, and it made me not trust this code for seeming to suppress errors without a clear reason.

I think -342 comes from test/functional/test_framework/authproxy.py. This is what prompted my changed phrasing: #30660 (comment)

Thanks that helps. But the phrase “Service unavailable” would seem to connote that server is running but it is telling us it can’t process requests, when actually this is a low-level error coming from the JSON-RPC client. Would suggest replacing “-342 Service unavailable” with something like “-342 HTTP error” to be clearer about the source of this error.

hodlinator commented at 2:33 pm on March 14, 2025:

Thanks for taking the time to elaborate.

“Initialization phase” was too terse for me - of course we’re in the initialization phase… duh. :)

Changed to “Suppress these as they are expected during initialization.”

in test/functional/feature_framework_startup_failures.py:36 in 2226e14891 outdated

31+            self.rpc_timeout = self.options.rpc_timeout
32+        if self.options.extra_args:
33+            self.extra_args = [[self.options.extra_args]]
34+
35+    def add_options(self, parser):
36+        parser.add_argument("--rpc_timeout", dest="rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")

ryanofsky commented at 4:57 pm on February 12, 2025:

In commit “qa: Add feature_framework_startup_failures.py” (2226e1489101f363114ce474d7f5e56ec50c9c6a)

Maybe give these arguments a prefix like --internal- so it obvious they are internal, and to avoid risk of silent conflicts if real test arguments with the same names are added later.

in test/functional/test_framework/test_framework.py:374 in 9d0e6a1b2a outdated

327-            self.log.info("Not stopping nodes as test failed. The dangling processes will be cleaned up later.")
328-        else:
329-            self.log.info("Stopping nodes")
330-            if self.nodes:
331-                self.stop_nodes()
332+        if self.nodes:

ryanofsky commented at 5:10 pm on February 12, 2025:

In commit “qa: Only log node stopping info if we have any” (9d0e6a1b2aa5be186550658ff85e5ad643554338)

Would be helpful if commit description stated motivation for this change. After looking at the stop_nodes() method implementation it seem like this is a logging-only change and point is just to remove stopping log statements when there are no nodes. But it would be good if actual intent of change were explicit.

hodlinator commented at 8:33 am on February 13, 2025:

Saw the code still had an inner if self.nodes-check - removed in new push.
Slightly clarified commit message, describing the change (even though it adds some negation) rather than describing then end state of the change.

in test/functional/test_framework/test_node.py:403 in 8102f3d8e0 outdated

399@@ -400,14 +400,14 @@ def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
400         if not self.running:
401             return
402         self.log.debug("Stopping node")
403-        try:
404+        if self.rpc_connected:

ryanofsky commented at 5:12 pm on February 12, 2025:

In commit “qa: Avoid calling stop-RPC if not connected” (8102f3d8e0b6715433eeda78b9b57a83109fc45e)

This change seems good, but unclear what motivation is. Is it supposed to be a refactoring that makes code more robust, or were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.

hodlinator commented at 8:40 am on February 13, 2025:

Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

ryanofsky commented at 6:58 pm on March 13, 2025:

re: #30660 (review)

Thanks! This made me actually realize I could add a test for it and improve the exception message (+ commit message).

I’m still not clear when this case would ever be expected to happen.

IIUC, there is no runtime error that could trigger this, and nothing non-test code could could do to trigger it, and the only thing that could trigger it is incorrectly written test code?

If so, it would seem clearer to replace the exception with an assertion like assert self.rpc_connected, self._node_msg(...) to be clear this is a bug in the test code, not some other failure.

hodlinator commented at 2:28 pm on March 14, 2025:

Correct, especially now after maflcko’s change in fae3bf6b870eb0f9cddd1adac82ba72890806ae3 to not call stop() on test failure, I agree it makes sense to assert instead, good suggestion! Fixed.

were there actually any cases where CannotSendRequest exceptions were caught and suppressed? If so what were those cases? Commit description should say what goals of this change are and if CannotSendRequest is an actual or theoretical change in behavior.

It was a common knock-on issue from attempting to call stop-RPC on a non-functional RPC interface, see for example “CannotSendRequest” in the log of #21367. Should no longer happen due to maflcko’s change and us guarding against that kind of thing with the assert.

Clarified commit description.

ryanofsky approved

ryanofsky commented at 5:26 pm on February 12, 2025: contributor

Code review ACK 8102f3d8e0b6715433eeda78b9b57a83109fc45e. Looks like #31620 made some overlapping changes, but this PR still seems very useful.

From PR description:

Before #31620, failure to establish RPC connections would result in an attempt to stop the node via RPC, triggering knock-on exceptions. This would muddy the waters for developers looking for root causes of test failures. This PR used to solve that issue in a more roundabout way than aforementioned PR, but now merely adds tests to ensure knock-on errors/exceptions are not reintroduced.

Even after reviewing this PR twice and rereading this paragraph several times, I find this pretty difficult to understand. Would suggest giving a more straightforward description like:

This PR improves handling of startup errors in functional tests. Specifically it improves wait_for_rpc_connection so instead of returning a vague Unable to connect to bitcoind error, it returns details about specific failures leading up to it. This PR also adds comments, refactors code, improves logging, detects timeout errors on windows, adds a test to make sure error handling behavior is not broken in the future.

I think it is ok to mention #31620 and mention the earlier version of this PR in a postscript to provide context for early review comments, but the main goal of the PR description should be to describe the PR as it currently is.

DrahtBot requested review from m3dwards on Feb 12, 2025

hodlinator force-pushed on Feb 13, 2025

hodlinator commented at 10:23 am on February 13, 2025: contributor

Thanks for bearing with me @ryanofsky. :)

Updated PR description based on feedback and addressed inline comments.

hodlinator force-pushed on Feb 13, 2025

DrahtBot added the label CI failed on Feb 13, 2025

DrahtBot commented at 10:27 am on February 13, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/37155473613

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

DrahtBot removed the label CI failed on Feb 13, 2025

ryanofsky approved

ryanofsky commented at 8:02 pm on March 13, 2025: contributor

Code review ACK c107aa8c873cb3c031b341baef487638133823f1. Changes since last review were just updating some messages and expanding the test a little. Thanks for the updates. This still looks good to me and I left some more comments, but they are not important.

hodlinator force-pushed on Mar 14, 2025

in test/functional/test_framework/test_node.py:329 in fa2178e67f outdated

321@@ -322,13 +322,18 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
323                 latest_error = repr(e)
324             except OSError as e:
325+                error_num = e.errno
326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
327+                if error_num is None:
328+                    assert isinstance(e, TimeoutError)
329+                    error_num = errno.ETIMEDOUT

m3dwards commented at 5:35 pm on March 17, 2025:

The inclusion of an assert suggests to me that we expect that all OSErrors without errno are TimeoutErrors? Is this actually what we expect?

If not should the if statement be changed to if error_num is None and isinstance(e, TimeoutError):

hodlinator commented at 9:20 pm on March 17, 2025:

Had a feeling OSErrors must always have that field set (except for the socket TimeoutError bug), but the docs indicate it’s a bit more loose:

The second form of the constructor sets the corresponding attributes, described below. The attributes default to None if not specified.

Could change the condition as you suggestion, but it would be prudent to assert that error_num isn’t None for other cases:

0                if error_num is None and isinstance(e, TimeoutError):
1                    error_num = errno.ETIMEDOUT
2                else:
3                    assert error_num is not None

So not sure about that change as it would repeat one of the conditions. Maybe just adding an error message to what we have now would be okay?

0                if error_num is None:
1                    assert isinstance(e, TimeoutError), "We only currently support " \
2                        f"inferring errno from TimeoutErrors, encountered unhandled: {e!r}"
3                    error_num = errno.ETIMEDOUT

m3dwards commented at 12:29 pm on March 19, 2025:

This is only a NIT so feel free to leave it as it is but in my mind an assert is used to signal an expectation of predicted state. In Python asserts are not even executed if run in optimised mode. It seems like we are using it more like an exception here?

0                if error_num is None:
1                    if isinstance(e, TimeoutError):
2                        error_num = errno.ETIMEDOUT
3                    else:
4                        raise Exception("errno is required")

hodlinator commented at 2:02 pm on March 19, 2025:

Not so concerned about people running in optimised node since 5+ functional tests fail in that way, but changed in latest push to: https://github.com/bitcoin/bitcoin/blob/cac218dcbdd2b5682a7ac8659aac0306e68a65f5/test/functional/test_framework/test_node.py#L325-L331

ryanofsky commented at 1:51 pm on March 26, 2025:

re: #30660 (review)

In commit “qa: Workaround Windows issue with socket timeouts” (c92a8fcd79d76b23c366a07be1c8778db7bea891)

FWIW, I think the best way to write this would be simply:

0# Work around issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.
1if error_num is None and isinstance(e, TimeoutError):
2    error_num = errno.ETIMEDOUT
3
4if error_num not in [ ... ]:
5   raise

Raising a new “Cannot infer errno when unset” exception just adds complexity and could make it harder to debug since the original exception contains more information than the new exception.

hodlinator commented at 2:26 pm on March 26, 2025:

Thanks for re-reviewing this again!

Yes, in that way we would make error_num == None be one of the cases for which we raise. It’s obviously better to raise for something unexpected than to suppress, which should lead to it being tracked down and intentionally categorized for suppression or not.

Updated in latest push.

hodlinator force-pushed on Mar 19, 2025

ryanofsky approved

ryanofsky commented at 2:03 pm on March 26, 2025: contributor

Code review ACK cac218dcbdd2b5682a7ac8659aac0306e68a65f5. Just changed exception to assert and assert to exception since last review. Left a minor suggestion but it’s not important so feel free to ignore. Thanks for being so responsive with previous feedback.

I think this PR should be pretty helpful to developers writing python tests since it cleans up logging, returns more complete error information, provides better feedback if stop_node is used incorrectly, and adds test coverage for the test framework to make sure it works reliably.

DrahtBot requested review from m3dwards on Mar 26, 2025

hodlinator force-pushed on Mar 26, 2025

ryanofsky approved

ryanofsky commented at 3:13 pm on March 26, 2025: contributor

Code review ACK 6539598b42005f453b358a0076287db99d1b02eb. Since last review just simplified errno code a little as suggested

in test/functional/test_framework/test_node.py:326 in 52746d0c58 outdated

321@@ -322,13 +322,17 @@ def wait_for_rpc_connection(self, *, wait_for_import=True):
322                 suppressed_errors[f"JSONRPCException {e.error['code']}"] += 1
323                 latest_error = repr(e)
324             except OSError as e:
325+                error_num = e.errno
326+                # Workaround issue observed on Windows, Python v3.13.1 where socket timeouts don't have errno set.

l0rinc commented at 10:37 am on March 27, 2025:

Is this the issue? https://github.com/python/cpython/issues/109601 If so, it doesn’t seem to be Windows related.

maflcko commented at 11:34 am on March 27, 2025:

Also, the commit message seems to imply this is a bugfix for an issue observed on Windows. However, I don’t think anyone has observed this on current master. The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

hodlinator commented at 1:07 pm on March 27, 2025:

Thanks for finding that issue @l0rinc!

I’ve only observed it on Windows, maybe the socket subsystem or timeouts are different enough there to make it more of an issue.

However, I don’t think anyone has observed this on current master.

Maybe I was running under some less common setup like non-WSL MSVC Debug, or maybe my Windows network stack was having a bad day. (There’s an odd chance that I might have been using Linux->Windows cross-built binaries for some reason).

The first commit of this pull claims to be a “refactor”, so maybe the bug this is fixing was introduced there instead?

8ff738a6e96c19aaf14ebea52c8ae13751cf443b, the refactor-commit does indeed restructure how OSErrors are caught, at first I thought maybe it was filtering TimeoutError by type, but no, the approach on master also relies on errno.ETIMEDOUT (ConnectionResetError is partially filtered by type there, they are not direct descendants of each other).

hodlinator commented at 10:31 pm on March 27, 2025:

Pushed new version of this commit - 51b2115a348f681e12aabf514cb156da4420da2e. Now links to the Python issue in the comment, and only mentions that it was encountered on Windows in the commit message.

maflcko commented at 7:13 am on March 28, 2025:

Do you still know how often it happened and what the traceback looked like?

hodlinator commented at 1:27 pm on March 28, 2025:

Was able to easily reproduce the issue.

C:\Users\hodlinator\bitcoin>py build\test\functional\feature_framework_startup_failures.py 2025-03-28T13:09:19.459000Z TestFramework (INFO): PRNG seed is: 4595347679033165165 2025-03-28T13:09:19.519000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:19.519000Z TestFramework (INFO): Verifying timeout in connecting to bitcoind’s RPC interface results in only one exception. 2025-03-28T13:09:19.788000Z TestFramework (INFO): Verifying inability to connect to bitcoind’s RPC interface due to wrong port results in one exception containing at least one OSError. 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Found 0/1 occurrences of the specific exception: AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+)) 2025-03-28T13:09:21.378000Z TestFramework (ERROR): Assertion failed Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 135, in main self.run_test() ~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 111, in run_test self.test_wrong_rpc_port() ~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 85, in test_wrong_rpc_port self._run_test_internal( ~~~~~~~~~~~~~~~~~~~~~~~^ # Lower the timeout so we don’t wait that long. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ …<4 lines>… r"AssertionError: [node 0] Unable to connect to bitcoind after \d+s (ignored errors: {[^}]‘OSError \w+’?: \d+[^}]}, latest error: \w+([^)]+))" ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ) ^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 74, in _run_test_internal raise AssertionError(f"Child test didn’t contain (only) expected errors.\n:\n{result.stdout}\n\n") AssertionError: Child test didn’t contain (only) expected errors. : 2025-03-28T13:09:19.929000Z TestFramework (INFO): PRNG seed is: 1755806780689552109 2025-03-28T13:09:19.952000Z TestFramework (INFO): Initializing test directory C:\Users\HODLIN1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.234000Z TestFramework (ERROR): Unexpected exception caught during testing Traceback (most recent call last): File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 131, in main self.setup() ~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 313, in setup self.setup_network() ~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\feature_framework_startup_failures.py”, line 47, in setup_network BitcoinTestFramework.setup_network(self) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 404, in setup_network self.setup_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 426, in setup_nodes self.start_nodes() ~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_framework.py”, line 585, in start_nodes node.wait_for_rpc_connection() ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\test_node.py”, line 286, in wait_for_rpc_connection rpc.getblockcount() ~~~~~~~~~~~~~~~~~^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\coverage.py”, line 50, in call return_val = self.auth_service_proxy_instance.call(*args, **kwargs) File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 127, in call response, status = self._request(‘POST’, self.__url.path, postdata.encode(‘utf-8’)) ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\bitcoin\build\test\functional\test_framework\authproxy.py”, line 105, in _request self.__conn.request(method, path, postdata, headers) ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1336, in request self._send_request(method, url, body, headers, encode_chunked) ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1382, in _send_request self.endheaders(body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1331, in endheaders self._send_output(message_body, encode_chunked=encode_chunked) ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1091, in _send_output self.send(msg) ~~~~~~~~~^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1035, in send self.connect() ~~~~~~~~~~~~^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\http\client.py”, line 1001, in connect self.sock = self._create_connection( ~~~~~~~~~~~~~~~~~~~~~~~^ (self.host,self.port), self.timeout, self.source_address) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 864, in create_connection raise exceptions[0] File “C:\Users\hodlinator\AppData\Local\Programs\Python\Python313\Lib\socket.py”, line 849, in create_connection sock.connect(sa) ~~~~~~~~~~~~^^^^ TimeoutError: timed out 2025-03-28T13:09:21.344000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.345000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN~~1\AppData\Local\Temp\bitcoin_func_test_zahuvbto 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN~~1\AppData\Local\Temp\bitcoin_func_test_zahuvbto/test_framework.log 2025-03-28T13:09:21.345000Z TestFramework (ERROR): 2025-03-28T13:09:21.345000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_zahuvbto’ to consolidate all logs 2025-03-28T13:09:21.346000Z TestFramework (ERROR): 2025-03-28T13:09:21.346000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.346000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.346000Z TestFramework (ERROR): [node 0] Cleaning up leftover process

2025-03-28T13:09:21.431000Z TestFramework (INFO): Not stopping nodes as test failed. The dangling processes will be cleaned up later. 2025-03-28T13:09:21.431000Z TestFramework (WARNING): Not cleaning up dir C:\Users\HODLIN~~1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2 2025-03-28T13:09:21.431000Z TestFramework (ERROR): Test failed. Test logging available at C:\Users\HODLIN~~1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2/test_framework.log 2025-03-28T13:09:21.431000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): Hint: Call C:\Users\hodlinator\bitcoin\build\test\functional\combine_logs.py ‘C:\Users\HODLIN~1\AppData\Local\Temp\bitcoin_func_test_4hxul9x2’ to consolidate all logs 2025-03-28T13:09:21.432000Z TestFramework (ERROR): 2025-03-28T13:09:21.432000Z TestFramework (ERROR): If this failure happened unexpectedly or intermittently, please file a bug and provide a link or upload of the combined log. 2025-03-28T13:09:21.432000Z TestFramework (ERROR): https://github.com/bitcoin/bitcoin/issues 2025-03-28T13:09:21.432000Z TestFramework (ERROR):

My hypothesis is that we trigger the issue through the first test setting a 0 rpc_timeout:

https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L76-L81

Attempting to use the network stack with a 0 timeout seems to have different meanings on Linux and Windows. It’s not really a realistic test. I think the next test covers a superset of the one above, so could probably remove the first one.

https://github.com/bitcoin/bitcoin/blob/6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5/test/functional/feature_framework_startup_failures.py#L83-L92

hodlinator commented at 1:55 pm on March 28, 2025:

Dropped the test and workaround commit in latest push.

hodlinator commented at 10:10 pm on April 21, 2025:

This was re-added in later pushes as I was able to reproduce it on Windows with 5- second timeouts, not just 0 timeouts as hypothesized above (https://github.com/bitcoin/bitcoin/pull/30660#discussion_r2018664612). See 1f639efca5e71a0ff208415d94e408a74778d4db.

in test/functional/feature_framework_startup_failures.py:98 in 75134fef62 outdated

93+
94+    def test_init_error(self):
95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
96+        self._run_test_internal(
97+            ["--internal-extra_args=-nonexistentarg"],
98+            "FailedToStartError: \\[node 0\\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"

l0rinc commented at 11:12 am on March 27, 2025:

nit: for consistency I’d make this either a r"...\[node 0\] as well or do a re.escape if we want to be explicit. Though I think we’re probably matching way too much here, it doesn’t hint at which part of the error is important and which part is irrelevant and subject to change

hodlinator commented at 1:15 pm on March 27, 2025:

r""-string seems more consistent. Thanks for pointing it out! Will fix if I re-touch.

hodlinator commented at 10:31 pm on March 27, 2025:

Fixed.

l0rinc commented at 11:13 am on March 27, 2025: contributor

lightweight, code review only ACK 6539598b42005f453b358a0076287db99d1b02eb

hodlinator force-pushed on Mar 27, 2025

hodlinator commented at 10:33 pm on March 27, 2025: contributor

Pushed new version due to recent discussion around socket timeout issue.

in test/functional/feature_framework_startup_failures.py:98 in 6bbfb3eeef outdated

93+
94+    def test_init_error(self):
95+        self.log.info("Verify startup failure due to invalid arg results in only one exception.")
96+        self._run_test_internal(
97+            ["--internal-extra_args=-nonexistentarg"],
98+            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"

l0rinc commented at 9:24 am on March 28, 2025:

super-nit:

0            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization\. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"

hodlinator commented at 10:14 am on March 28, 2025:

Will take if anything else forces me to re-touch.

hodlinator commented at 1:53 pm on March 28, 2025:

Thread https://github.com/bitcoin/bitcoin/pull/30660/files#r2018221082: Taken.

l0rinc approved

l0rinc commented at 9:28 am on March 28, 2025: contributor

lightweight re-review-ack 6bbfb3eeef2e35560dd357bb2a8a7378e2b49eb5

hodlinator force-pushed on Mar 28, 2025

hodlinator commented at 2:06 pm on March 28, 2025: contributor

Latest push:

Removes test_instant_rpc_timeout which was artificially setting rpc_timeout = 0, which is not a supported case and which caused different behavior on Windows network stacks compared to Linux. That test was a rough subset of test_wrong_rpc_port. Removing test_instant_rpc_timeout allowed dropping the commit with the workaround for socket timeout with unset errno entirely. Special thanks to @maflcko for insisting that something was not right.
Fixed nit concerning escaping in regexp string.
Bumped subprocess.run-timeout since it was flaky under Windows.

ryanofsky approved

ryanofsky commented at 2:56 pm on March 28, 2025: contributor

Code review ACK fc88c2ec297dc93ba06008bd5ae10798e9f6aeac. Just dropped 0 timeout workaround and instant timeout test since the last review.

Ideally, I would want setting a 0 timeout to result in a comprehensible error message lke “Unable to connect to bitcoind after 0s” and not some harder to debug behavior, but it wouldn’t be worth restructuring the code or adding a special case for.

DrahtBot requested review from l0rinc on Mar 28, 2025

in test/functional/feature_framework_startup_failures.py:102 in 3ee7062ee3 outdated

 97+            ["--internal-start_stop=1"],
 98+            r"AssertionError: \[node 0\] Should only call stop_node\(\) on a running node after wait_for_rpc_connection\(\) succeeded\. Did you forget to call the latter after start\(\)\? Not connected to process: \d+"
 99+        )
100+
101+    def run_test(self):
102+        if self.options.internal_rpc_timeout is None and self.options.internal_extra_args is None:

l0rinc commented at 12:48 pm on April 18, 2025:

this is a quite confusing way to check whether we’re in a child process or not (not obvious to me why internal_start_stop is not even here)

ryanofsky commented at 1:26 pm on April 18, 2025:

this is a quite confusing way to check whether we’re in a child process or not (not obvious to me why internal_start_stop is not even here)

Do you have a suggestion to make it less confusing? I feel like checking for an internal argument is a mostly straightforward way of checking whether this was called internally.

hodlinator commented at 1:30 pm on April 18, 2025:

I’ve realized that all child test processes fail before reaching run_test(), so I’m leaning toward changing this into an assert.

in test/functional/feature_framework_startup_failures.py:105 in 3ee7062ee3 outdated

100+
101+    def run_test(self):
102+        if self.options.internal_rpc_timeout is None and self.options.internal_extra_args is None:
103+            self.test_wrong_rpc_port()
104+            self.test_init_error()
105+            self.test_start_stop()

l0rinc commented at 12:48 pm on April 18, 2025:

do we need to batch all 3 tests into a single test?

hodlinator commented at 8:36 pm on April 18, 2025:

That is the pattern for functional tests, see for example:

https://github.com/bitcoin/bitcoin/blob/3ee7062ee33c94c717131b8e7a8c01b37a1e9728/test/functional/feature_config_args.py#L488-L503

in test/functional/test_runner.py:413 in 3ee7062ee3 outdated

407@@ -408,6 +408,7 @@
408     'p2p_handshake.py --v2transport',
409     'feature_dirsymlinks.py',
410     'feature_help.py',
411+    'feature_framework_startup_failures.py',

l0rinc commented at 12:49 pm on April 18, 2025:

I know this is meant to be mostly random, but the existing feature ones seem to be in alphabetic order

hodlinator commented at 8:38 pm on April 18, 2025:

https://github.com/bitcoin/bitcoin/blob/3ee7062ee33c94c717131b8e7a8c01b37a1e9728/test/functional/test_runner.py#L371-L375

https://github.com/bitcoin/bitcoin/blob/3ee7062ee33c94c717131b8e7a8c01b37a1e9728/test/functional/test_runner.py#L389-L391

in test/functional/feature_framework_startup_failures.py:95 in 3ee7062ee3 outdated

90+            ["--internal-extra_args=-nonexistentarg"],
91+            r"FailedToStartError: \[node 0\] bitcoind exited with status 1 during initialization\. Error: Error parsing command line arguments: Invalid parameter -nonexistentarg"
92+        )
93+
94+    def test_start_stop(self):
95+        self.log.info("Verify start() then stop_node() on a node without wait_for_rpc_connection() in between triggers an assert.")

l0rinc commented at 12:50 pm on April 18, 2025:

could these be separate sub-tests instead of just test logs?

hodlinator commented at 9:03 pm on April 18, 2025:

Let me know if you think the final commit in the latest push is an improvement.

l0rinc commented at 7:24 am on April 19, 2025:

It is, thanks

in test/functional/feature_framework_startup_failures.py:58 in 3ee7062ee3 outdated

53+            print(f"Unexpected timeout, subprocess output:\n{e.output}\nSubprocess output end", file=sys.stderr)
54+            raise
55+
56+        success = True
57+
58+        traceback_count = len(re.findall("Traceback", result.stdout))

l0rinc commented at 12:50 pm on April 18, 2025:

could we add these into the try so that I don’t have to understand where result is coming from?

hodlinator commented at 8:40 pm on April 18, 2025:

Tried it out, but then one runs into the issue that it looks like subprocess.TimeoutExpired could be raised anywhere in that whole function, which I prefer avoiding.

l0rinc commented at 7:24 am on April 19, 2025:

looks like subprocess.TimeoutExpired could be raised anywhere in that whole function

Isn’t that the whole point of a try/catch, that the happy path isn’t disrupted, that you don’t have to know where it’s coming from? Otherwise we’re treating exceptions as return codes.

Quoting again from the Python docs https://docs.python.org/3/tutorial/errors.html

0try:
1    f = open('myfile.txt')
2    s = f.readline()
3    i = int(s.strip())
4except [...]

and not

0try:
1    f = open('myfile.txt')
2except [...]
3s = f.readline()
4i = int(s.strip())

hodlinator commented at 9:31 am on April 23, 2025:

The full example from the python docs is:

 0try:
 1    f = open('myfile.txt')
 2    s = f.readline()
 3    i = int(s.strip())
 4except OSError as err:
 5    print("OS error:", err)
 6except ValueError:
 7    print("Could not convert data to an integer.")
 8except Exception as err:
 9    print(f"Unexpected {err=}, {type(err)=}")
10    raise

So the first and last lines inside the try are expected to sometimes raise exceptions (maybe the readline() too), which does not map well to the code in _verify_startup_failure().

l0rinc commented at 9:38 am on April 23, 2025:

The point is that the happy path is separated from the error handling, we’re not using exceptions as if they were returned error codes

hodlinator commented at 9:51 am on April 23, 2025:

My intent is to keep as much of the code separate from error handling as possible. By narrowing the scope of the try it makes clear to the reader that there aren’t exceptions hiding in every line of the happy path.

l0rinc commented at 10:03 am on April 23, 2025:

I think the opposite is true, this way we’re forcing the reader to consider the error case when trying to understand the happy path (i.e. separation of concerns). But I’ll leave it, it’s not that important.

hodlinator commented at 11:54 am on April 23, 2025:

Another objection to your approach as a general rule: If one of the lines in the happy path is modified to call a function that raises exceptions under some circumstances, we run the risk of accidentally catching it by related (parent)type through a pre-existing except/catch-block, and treating it differently than we had if we intentionally designed the code.

Side note: I’m in favor of newer languages like Rust/Zig not supporting exceptions.

in test/functional/feature_framework_startup_failures.py:46 in 3ee7062ee3 outdated

54+            raise
55+
56+        success = True
57+
58+        traceback_count = len(re.findall("Traceback", result.stdout))
59+        if traceback_count != 1:

l0rinc commented at 12:51 pm on April 18, 2025:

this line might need explanation for why no Traceback is a success=False

hodlinator commented at 8:45 pm on April 18, 2025:

Changed the function name from _run_test_internal to _verify_startup_failure and added comment, hopefully that clarifies this aspect. I also find the error message quite explanatory: https://github.com/bitcoin/bitcoin/blob/fc88c2ec297dc93ba06008bd5ae10798e9f6aeac/test/functional/feature_framework_startup_failures.py#L59-L60

in test/functional/feature_framework_startup_failures.py:56 in 3ee7062ee3 outdated

51+            result = subprocess.run([sys.executable, __file__] + args, encoding="utf-8", stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60 * self.options.timeout_factor)
52+        except subprocess.TimeoutExpired as e:
53+            print(f"Unexpected timeout, subprocess output:\n{e.output}\nSubprocess output end", file=sys.stderr)
54+            raise
55+
56+        success = True

l0rinc commented at 12:51 pm on April 18, 2025:

Could we collect the errors instead of printing and counting?

hodlinator commented at 9:03 pm on April 18, 2025:

Taken!

in test/functional/feature_framework_startup_failures.py:37 in 3ee7062ee3 outdated

32+            self.extra_args = [[self.options.internal_extra_args]]
33+
34+    def add_options(self, parser):
35+        parser.add_argument("--internal-rpc_timeout", dest="internal_rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
36+        parser.add_argument("--internal-extra_args", dest="internal_extra_args", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
37+        parser.add_argument("--internal-start_stop", dest="internal_start_stop", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")

l0rinc commented at 12:53 pm on April 18, 2025:

Do we really need 3 separate internal dummy args? Seems like a framework with a custom arg and 3 separate tests with different behaviors checking only a single thing would untangle this slightly better

hodlinator commented at 9:05 pm on April 18, 2025:

I think it leaves flexibility for adding more tests which use a subset of them.

in test/functional/feature_framework_startup_failures.py:6 in 3ee7062ee3 outdated

0@@ -0,0 +1,109 @@
1+#!/usr/bin/env python3
2+# Copyright (c) 2025-present The Bitcoin Core developers
3+# Distributed under the MIT software license, see the accompanying
4+# file COPYING or http://www.opensource.org/licenses/mit-license.php.
5+"""
6+Verify various startup failures only raise one exception since multiple

l0rinc commented at 12:55 pm on April 18, 2025:

“various startup failures” doesn’t sound like a single test to me, could we maybe unwrap them somehow?

hodlinator commented at 9:08 pm on April 18, 2025:

It is rare for functional tests to only test one thing. But I agree “various” is a bit too loose, changed to “framework”.

in test/functional/feature_framework_startup_failures.py:11 in 3ee7062ee3 outdated

 6+Verify various startup failures only raise one exception since multiple
 7+exceptions being raised muddies the waters of what actually went wrong.
 8+We should maintain this bar of only raising one exception as long as
 9+additional maintenance and complexity is low.
10+
11+Test relaunches itself into a child processes in order to trigger failure

l0rinc commented at 12:55 pm on April 18, 2025:

I find the relaunches itself part quite confusing here

hodlinator commented at 8:47 pm on April 18, 2025:

I’d rather keep the overall test in one file. We need child process as we want the parent to always succeed, and the child processes to always fail.

Corrected grammar: “Test relaunches itself into child processes in order to trigger failures without the parent process’ BitcoinTestFramework also failing.”

in test/functional/test_framework/test_framework.py:1 in fc88c2ec29

l0rinc commented at 12:58 pm on April 18, 2025:

nit: I’d move this commit before the previous one (it’s more like the others, the previous one is different)

hodlinator commented at 9:14 pm on April 18, 2025:

Done. Also makes it easier to squash/drop the new final commit of introducing multiple BitcoinTestFramework-subclasses.

in test/functional/test_framework/test_node.py:338 in 8ff738a6e9 outdated

333+                # Suppress similarly to the above JSONRPCException errors.
334+                if e.errno not in [ errno.ECONNRESET,     # This might happen when the RPC server is in warmup,
335+                                                          # but shut down before the call to getblockcount succeeds.
336+                                    errno.ETIMEDOUT,      # Treat identical to ECONNRESET
337+                                    errno.ECONNREFUSED ]: # Port not yet open?
338                     raise  # unknown OS error

l0rinc commented at 1:02 pm on April 18, 2025:

Nit: this formatting pushes everything out of my screen:

0                # Suppress similarly to the above JSONRPCException errors.
1                if e.errno not in [
2                    errno.ECONNRESET,  # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount succeeds.
3                    errno.ETIMEDOUT,   # Treat identical to ECONNRESET
4                    errno.ECONNREFUSED # Port not yet open?
5                ]: 
6                    raise  # unknown OS error

hodlinator commented at 9:21 pm on April 18, 2025:

Fair. Taken.

in test/functional/feature_framework_startup_failures.py:40 in 3ee7062ee3 outdated

35+        parser.add_argument("--internal-rpc_timeout", dest="internal_rpc_timeout", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
36+        parser.add_argument("--internal-extra_args", dest="internal_extra_args", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
37+        parser.add_argument("--internal-start_stop", dest="internal_start_stop", help="ONLY TO BE USED WHEN TEST RELAUNCHES ITSELF")
38+
39+    def setup_network(self):
40+        # Avoid doing anything if num_nodes == 0, otherwise we fail.

l0rinc commented at 1:03 pm on April 18, 2025:

why?

hodlinator commented at 8:52 pm on April 18, 2025:

Improved comment: “Only proceed if num_nodes > 0 (child test processes), as 0 (parent) is not supported by BitcoinTestFramework.setup_network().”

Could explore making BitcoinTestFramework more permissive in this aspect but would rather defer that.

in test/functional/feature_framework_startup_failures.py:51 in 3ee7062ee3 outdated

46+            else:
47+                BitcoinTestFramework.setup_network(self)
48+
49+    def _run_test_internal(self, args, expected_exception):
50+        try:
51+            result = subprocess.run([sys.executable, __file__] + args, encoding="utf-8", stdout=subprocess.PIPE, stderr=subprocess.STDOUT, timeout=60 * self.options.timeout_factor)

l0rinc commented at 1:05 pm on April 18, 2025:

can we just add .stdout at the end here since it’s all we’re using?

l0rinc commented at 1:06 pm on April 18, 2025:

do we care about stdout and encoding here? Or can we do stdout=subprocess.DEVNULL maybe?

hodlinator commented at 8:55 pm on April 18, 2025:

Good point, taken!

hodlinator commented at 8:56 pm on April 18, 2025:

Requiring the expected output to be in stderr and not stdout seems petty/brittle to me, even if it is more correct.

encoding is required to avoid:

0    traceback_count = len(re.findall("Traceback", output))
1                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2  File "/nix/store/26yi95240650jxp5dj78xzch70i1kzlz-python3-3.12.9/lib/python3.12/re/__init__.py", line 217, in findall
3    return _compile(pattern, flags).findall(string)
4           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5TypeError: cannot use a string pattern on a bytes-like object

l0rinc commented at 7:23 am on April 19, 2025:

Thanks for checking

in test/functional/feature_framework_startup_failures.py:66 in 3ee7062ee3 outdated

61+            success = False
62+
63+        matching_exception_count = len(re.findall(expected_exception, result.stdout))
64+        if matching_exception_count != 1:
65+            self.log.error(f"Found {matching_exception_count}/1 occurrences of the specific exception: {expected_exception}")
66+            success = False

l0rinc commented at 1:10 pm on April 18, 2025:

we could likely use the walrus operator here to narrow the scope and name:

0		if (matches := len(re.findall(expected_exception, result.stdout))) != 1:
1		    self.log.error(f"Found {matches}/1 occurrences of the specific exception: {expected_exception}")
2		    success = False

hodlinator commented at 9:01 pm on April 18, 2025:

Prefer using the walrus for truthiness tests which don’t require the additional () != 1.

l0rinc commented at 7:24 am on April 19, 2025:

The point is to narrow the scope of variables so that we can give them more focused names. The example in the doc https://docs.python.org/3/whatsnew/3.8.html is basically this exact escenario

0if (n := len(a)) > 10:
1    print(f"List is too long ({n} elements, expected <= 10)")

l0rinc commented at 7:40 am on April 19, 2025:

Or we could avoid repetition as well by a more functional approach of separating data from algorithm:

0patterns = {
1    "Traceback": "tracebacks - expecting exactly one with no knock-on exceptions",
2    expected_exception: f"occurrences of the specific exception: {expected_exception}",
3    "Test failed. Test logging available at": "test failure output messages"
4}
5errors = [f"Found {count}/1 {message}."
6          for pattern, message in patterns.items()
7          if (count := len(re.findall(pattern, output))) != 1]

l0rinc changes_requested

l0rinc commented at 1:10 pm on April 18, 2025: contributor

As discussed in private, I’ve summarized my findings here shortly.

DrahtBot requested review from l0rinc on Apr 18, 2025

hodlinator force-pushed on Apr 18, 2025

hodlinator commented at 9:57 pm on April 18, 2025: contributor

Hi folks!

Asked @l0rinc what it would take for him to give this more than a lightweight A-C-K, and we went through the commits together, especially feature_framework_startup_failures.py. It made me agree some minor improvements are justified in order to make the code more readable & correct.

Added final commit - drop or squash?

We also agreed that it would be worth experimenting with breaking up the one test class into multiple to clarify things further. Let me know what you think about the result. @ryanofsky: Hope you tolerate the invalidation of your latest ACK. If you deem the next-to-last commit re-ACK:able, but would rather see the last commit dropped, I hereby grant you veto rights for time served within this PR.

Edit: Subsequent pushes unbreak the test when running in the context of test_runner.py with --cachedir being sent in, solve minor linter failure, and rebased on master in hopes of avoiding Windows CI file contention issue #32291 (which I only now realize hasn’t been solved yet).

hodlinator force-pushed on Apr 18, 2025

DrahtBot added the label CI failed on Apr 18, 2025

DrahtBot commented at 10:43 pm on April 18, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40802607410

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

hodlinator force-pushed on Apr 18, 2025

DrahtBot removed the label CI failed on Apr 19, 2025

l0rinc commented at 7:26 am on April 19, 2025: contributor

You could rebase your change before the fix reverted in #32302 to make it green

in test/functional/feature_framework_startup_failures.py:122 in 9e0fede6a6 outdated

117+if __name__ == '__main__':
118+    if not any('--internal_test=' in arg for arg in sys.argv):
119+        FeatureFrameworkStartupFailures(__file__).main()
120+    else:
121+        for arg in sys.argv:
122+            if arg[:16] == '--internal_test=':

l0rinc commented at 7:44 am on April 19, 2025:

nit: can probably be if arg.startswith('--internal_test='): or walrus to the rescue again:

0if arg.startswith(prefix := '--internal_test='):
1    class_name = arg[len(prefix):]

ryanofsky commented at 5:48 pm on April 21, 2025:

In commit “Drop or Squash? - Break out startup failures into distinct classes” (9e0fede6a6d608ff307b48b9a326c6d467b410cf)

I think could also simplify the sys.modules code and write globals()[class_name](__file__).main()

l0rinc approved

l0rinc commented at 7:46 am on April 19, 2025: contributor

Concept ACK, I like the new FeatureFrameworkStartupFailures, it’s untangled a lot better now in my opinion. I assume from the last commit message that it’s not final, right?

hodlinator marked this as a draft on Apr 19, 2025

qa refactor: wait_for_rpc_connection - Treat OSErrors the same

ConnectionResetError is an OSError as well (ECONNRESET), no reason to have a separate except-block for it.

Also improves comments for other exceptions and make condition above more Pythonic.

879243e81f

qa: Include ignored errors in RPC connection timeout

When an RPC connection attempt with bitcoind times out, include which ignored errors occurred in the exception message.

May provide clues of what has gone wrong.

6ad21b4c01

qa: Only allow calling TestNode.stop() after connecting

(Still tolerate calling it on a no longer (self.)running node, as in a node that has been queried for is_node_stopped() and modified state before returning True).

Tests should not attempt to use the non-functioning RPC interface to call stop() unless wait_for_connections() has succeeded.

No longer log and suppress http.client.CannotSendRequest as a consequence of stop()-RPC, as error conditions causing this knock-on issue are now guarded against before the call.

9b24a403fa

ryanofsky approved

ryanofsky commented at 5:53 pm on April 21, 2025: contributor

Code review ACK 9e0fede6a6d608ff307b48b9a326c6d467b410cf. Main change since last review were a lot of feature_framework_startup_failures.py cleanups. I do think the drop or squash commit 9e0fede6a6d608ff307b48b9a326c6d467b410cf is an improvement so would vote for squash, but no strong opinion

DrahtBot requested review from l0rinc on Apr 21, 2025

qa: Work around Python socket timeout issue

Observed on local machine running Windows / Python v3.13.1 when overriding rpc_timeout to small values (5- seconds). Next commit performs such overrides.

1f639efca5

hodlinator force-pushed on Apr 21, 2025

in test/functional/feature_framework_startup_failures.py:58 in 3ae99bf8d7 outdated

53+                                    stdout=subprocess.PIPE,
54+                                    stderr=subprocess.STDOUT,
55+                                    timeout=60 * self.options.timeout_factor).stdout
56+        except subprocess.TimeoutExpired as e:
57+            # e.output is returned as bytes on Linux and str on Windows.
58+            child_output = e.output if isinstance(e.output, str) else e.output.decode("utf-8")

l0rinc commented at 7:15 pm on April 21, 2025:

str on Windows

https://docs.python.org/3/library/subprocess.html#subprocess.TimeoutExpired.output claims:

This is always bytes when any output was captured

Can we simplify to:

0            child_output = e.output.decode("utf-8") if e.output else ""

?

hodlinator commented at 9:43 pm on April 21, 2025:

In this case real-world experience proved the documentation to be wrong. We do get a string on Windows, at least under some conditions.

It does make sense to invert the condition similar to what you did though! To avoid trying None.decode("utf-8"). None serializes okay (to "None") so we can return e.output for both the str and None cases.

0            # e.output is returned as bytes on Linux and str on Windows.
1            child_output = e.output.decode("utf-8") if isinstance(e.output, bytes) else e.output

ryanofsky commented at 5:57 pm on April 22, 2025:

re: #30660 (review)

I think you could drop all unnecessary encoding and decoding by just dropping encoding="utf-8" parameter to subprocess.run and printing the bytes directly with something like sys.stderr.buffer.write(b"<CHILD OUTPUT BEGIN>\n%s\n<CHILD OUTPUT END>" % e.output)

hodlinator commented at 9:04 am on April 23, 2025:

Pushed an alternative, still using print. When I tried sys.stderr.buffer.write that data ended up at the end of the failure output, instead of when the exception was caught.

in test/functional/feature_framework_startup_failures.py:132 in 3ae99bf8d7 outdated

115+        # nothing is on that port. Divide by timeout_factor to counter
116+        # multiplication in base, 2 * node_start_duration should be enough.
117+        self.rpc_timeout = max(3, 2 * self.options.node_start_duration) / self.options.timeout_factor
118+
119+    def run_test(self):
120+        assert False, "Should have failed earlier during startup."

l0rinc commented at 7:23 pm on April 21, 2025:

nit: assert False is a bit hacky in my opinion - can we use the same style that we did on line 60?

0        print("Should have failed earlier during startup.")
1        raise

hodlinator commented at 8:38 pm on April 21, 2025:

raise re-raises the currently caught exception. Think it’s roughly equivalent to throw; in C++. One can do:

0raise AssertionError("...")

But I prefer using assert for new code. Considered raising RuntimeError, but this really is a code invariant, mostly for documenting, but also for catching changed behavior.

l0rinc commented at 11:27 am on April 22, 2025:

yes, I meant either print + raise or just raise Exception("Should have failed earlier during startup"), as you’ve mentioned. assert False feels like a hack…

hodlinator commented at 9:07 am on April 23, 2025:

If there was some kind of UnreachableError I would use it. Looked into using our utility assert_raises(...stop_node...), but that swallows the exception, so the test fails.

in test/functional/feature_framework_startup_failures.py:146 in 3ae99bf8d7 outdated

141+        assert False, "Should have failed earlier during startup."
142+
143+
144+if __name__ == '__main__':
145+    internal_test = None
146+    for arg in sys.argv:

l0rinc commented at 7:30 pm on April 21, 2025:

I’m not a fan of a incomplete iteration, can we separate the generator expression to the first line, something like:

0    class_name = next((arg[16:] for arg in sys.argv if arg.startswith('--internal_test=')), None)

or use a regexp to deduplicate the match and substring (untested):

0        class_name = next((m[1] for arg in sys.argv if (m := re.match(r'--internal_test=(.*)', arg))), None)

hodlinator commented at 9:30 pm on April 21, 2025:

Thanks! Took regexp form.

in test/functional/test_framework/test_node.py:339 in 3ae99bf8d7 outdated

348+                    errno.ETIMEDOUT,    # Treat identical to ECONNRESET
349+                    errno.ECONNREFUSED  # Port not yet open?
350+                ]:
351                     raise  # unknown OS error
352-            except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
353+                suppressed_errors[f"OSError {errno.errorcode[error_num]}"] += 1

l0rinc commented at 7:36 pm on April 21, 2025:

Based on the checks we’re doing on e at the top, error_num can potentially be None here - is that expected?

hodlinator commented at 8:42 pm on April 21, 2025:

We will re-raise above for None, see: #30660 (review)

l0rinc commented at 11:25 am on April 22, 2025:

Indeed, thanks

in test/functional/feature_framework_startup_failures.py:46 in 3ae99bf8d7 outdated

41+        if self.options.nocleanup:
42+            args.append("--nocleanup")
43+        if self.options.tmpdir:
44+            args.append(f"--tmpdir={self.options.tmpdir}/{hash(expected_exception)}"),
45+        if self.options.randomseed is not None:
46+            args.append(f"--randomseed={self.options.randomseed + hash(expected_exception)}")

l0rinc commented at 7:43 pm on April 21, 2025:

If I understand this correctly, we may want to preserve the seed properties - and xor preserves the distribution properties better than addition:

0            args.append(f"--randomseed={self.options.randomseed ^ hash(expected_exception)}")

Q: is hash expected to be stable between runs - or is that not important here?

hodlinator commented at 9:29 pm on April 21, 2025:

Q: is hash expected to be stable between runs - or is that not important here?

Oof no! Switched to insecure & fast MD5.

in test/functional/feature_framework_startup_failures.py:63 in 3ae99bf8d7 outdated

58+            child_output = e.output if isinstance(e.output, str) else e.output.decode("utf-8")
59+            print(f"Unexpected child timeout:\n<CHILD OUTPUT BEGIN>\n{child_output}\n<CHILD OUTPUT END>", file=sys.stderr)
60+            raise
61+
62+        errors = []
63+        if (n := len(re.findall("Traceback", output))) != 1:

l0rinc commented at 7:46 pm on April 21, 2025:

do we need regular expression matching here or can we use output.count("Traceback") as well?

0        if (n := output.count("Traceback")) != 1:

hodlinator commented at 9:45 pm on April 21, 2025:

Taken.

l0rinc commented at 11:33 am on April 22, 2025:

I think it applies to the other findall instances after this

l0rinc commented at 7:52 pm on April 21, 2025: contributor

I also vote for squashing the last commit.

hodlinator force-pushed on Apr 21, 2025

hodlinator marked this as ready for review on Apr 21, 2025

hodlinator commented at 10:21 pm on April 21, 2025: contributor

Notable changes since previous update (git range-diff master 77776c5 ea3cc83).

Changed parent test process to run a node, and use it to measure startup-times. That measurement is then used to to cap how long we wait before timing out due to wrong port.
Forward test framework arguments like --tmpdir and --timeout-factor to child processes (now with stable hashing).
Dropped the commit to avoid logging node stop info as parent process now also runs a node.
Re-added the commit to handle socket timeout exception with missing errno (https://github.com/bitcoin/bitcoin/pull/30660#discussion_r2053051864), since I started experiencing the issue again for non-zero rpc_timeouts as well.
Added warning about dangling bitcoind processes upon subprocess timeout.
Edit: Also squashed commit breaking apart class in feature_framework_startup_failures.py based off #30660#pullrequestreview-2781869977 and #30660#pullrequestreview-2782051429.

in test/functional/feature_framework_startup_failures.py:33 in ea3cc8388b outdated

28+    def setup_network(self):
29+        # Don't start the node yet, as we want to measure during run_test.
30+        self.add_nodes(self.num_nodes, self.extra_args)
31+
32+    # Launches a child test process running this same file, but instantiating
33+    # a child test, and verifies that we (only) raise the expected exception.

l0rinc commented at 11:39 am on April 22, 2025:

Maybe we could make the tenses more uniform here:

0    # Launches a child test process that runs this same file, instantiates
1    # a child test, and verifies that it raises only the expected exception.

hodlinator commented at 12:50 pm on April 22, 2025:

Adjusted in similar direction.

in test/functional/feature_framework_startup_failures.py:48 in ea3cc8388b outdated

43+            args.append("--nocleanup")
44+        weak_hash = md5(expected_exception.encode("utf-8"))
45+        if self.options.tmpdir:
46+            args.append(f"--tmpdir={self.options.tmpdir}/{weak_hash.hexdigest()[:16]}")
47+        if self.options.randomseed is not None:
48+            args.append(f"--randomseed={self.options.randomseed ^ int.from_bytes(weak_hash.digest()[:8], 'little')}")

l0rinc commented at 11:44 am on April 22, 2025:

I’m fine with this as well, but to avoid complexity on declaration site and usage site (and to skip the “weak” part which seems out-of-place), we could try:

0        hash_id = int(sha256(expected_exception.encode("utf-8")).hexdigest()[:8], 16)
1        if self.options.tmpdir:
2            args.append(f"--tmpdir={self.options.tmpdir}/{hash_id}")
3        if self.options.randomseed is not None:
4            args.append(f"--randomseed={self.options.randomseed ^ hash_id}")

hodlinator commented at 12:51 pm on April 22, 2025:

Took part of the simplification.

in test/functional/feature_framework_startup_failures.py:74 in ea3cc8388b outdated

70+        if (n := len(re.findall(expected_exception, output))) != 1:
71+            errors.append(f"Found {n}/1 occurrences of the specific exception: {expected_exception}")
72+        if (n := len(re.findall("Test failed. Test logging available at", output))) != 1:
73+            errors.append(f"Found {n}/1 test failure output messages.")
74+
75+        assert not errors, f"Child test didn't contain (only) expected errors:\n{chr(10).join(errors)}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"

l0rinc commented at 11:45 am on April 22, 2025:

nit: ‘\n’ seems less cryptic:

0        assert not errors, f"Child test didn't contain (only) expected errors:\n{'\n'.join(errors)}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"

hodlinator commented at 12:52 pm on April 22, 2025:

This was the result of an unfortunate battle with the linter. Seems to not even have worked on older versions:

https://stackoverflow.com/questions/51775950/why-isnt-it-possible-to-use-backslashes-inside-the-braces-of-f-strings-how-can

l0rinc commented at 4:14 pm on April 26, 2025:

The above worked for me - seems it was fixed in Python 3.12, but we have to support earlier ones as well. But I haven’t fully given up on this :D. It’s not very important, just had to find a solution, feel free to ignore. Since empty strings are also falsy, the 🦭 operator can come to the rescue again.

0        assert not (msg := '\n'.join(errors)), f"Child test didn't contain (only) expected errors:\n{msg}\n<CHILD OUTPUT BEGIN>\n{output}\n<CHILD OUTPUT END>\n"

hodlinator commented at 9:16 pm on April 26, 2025:

Switched to use os.linesep which I think is more idiomatic… But I’m not entirely happy with how I ended up having to escape: https://github.com/bitcoin/bitcoin/blob/b19439b1eee57041fc37dcf509d5ee9f248263bc/test/functional/feature_framework_startup_failures.py#L76-L79

It’s a side-effect of using repr(e) in 28e282ef9ae94ede4aace6b97ff18c66cb72a001 which gives the escaped form.

in test/functional/feature_framework_startup_failures.py:150 in ea3cc8388b outdated

145+    def run_test(self):
146+        assert False, "Should have failed earlier during startup."
147+
148+
149+if __name__ == '__main__':
150+    if class_name := next((m[1] for arg in sys.argv if (m := re.match(r'--internal_test=(.*)', arg))), None):

l0rinc commented at 11:47 am on April 22, 2025:

nit: I know I wrote .* but we might want .+ instead, since we likely don’t support empty ones

l0rinc approved

l0rinc commented at 11:56 am on April 22, 2025: contributor

ACK ea3cc8388b4aefcaee61f75e8bf1bc03281a0c91

LGTM, left a few nits, but none are blockers. Before merging someone with more experience should also review to be sure.

DrahtBot requested review from ryanofsky on Apr 22, 2025

hodlinator force-pushed on Apr 22, 2025

ryanofsky approved

ryanofsky commented at 6:02 pm on April 22, 2025: contributor

Code review ACK ece168ecef6e57e3a994ec5f7bd341f9c7525721. Changes since last review were squashing and rebasing, doing a better job of passing seed, timeouts, and tmpdirs to child processes. Previously dropped workaround for None errno is also back, this time with link to upstream python bug.

DrahtBot requested review from l0rinc on Apr 22, 2025

hodlinator force-pushed on Apr 23, 2025

DrahtBot added the label CI failed on Apr 23, 2025

DrahtBot commented at 9:19 am on April 23, 2025: contributor

🚧 At least one of the CI tasks failed. Debug: https://github.com/bitcoin/bitcoin/runs/40998561125

Try to run the tests locally, according to the documentation. However, a CI failure may still happen due to a number of reasons, for example:

Possibly due to a silent merge conflict (the changes in this pull request being incompatible with the current code in the target branch). If so, make sure to rebase on the latest commit of the target branch.
A sanitizer issue, which can only be found by compiling with the sanitizer and running the affected test.
An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

in test/functional/feature_framework_startup_failures.py:43 in ece5442f05 outdated

38+            f"--cachedir={self.options.cachedir}",
39+            f"--portseed={self.options.port_seed}",
40+            f"--internal_test={test.__name__}",
41+        ] + internal_args
42+        if self.options.nocleanup:
43+            args.append("--nocleanup")

l0rinc commented at 10:28 am on April 23, 2025:

I must be missing something fundamental here, I don’t understand why these lines are never triggered while running the test. Are these triggered in some special circumstance that has happened before but isn’t happening anymore - and since it’s already testing other tests, there’s no point in covering it with actual tests?

I find it a bit confusing, wanted to understand why e.g. the nocleanup branch was added here - but disabling it still passes, so that doesn’t help…

0if self.options.nocleanup:
1    raise Exception("")
2    args.append("--nocleanup")

and

0if self.options.randomseed is not None:
1    raise Exception("")
2    args.append(f"--randomseed={self.options.randomseed ^ weak_hash}")

and

0if (n := output.count("Traceback")) != 1:
1    raise Exception("")
2    errors.append(f"Found {n}/1 tracebacks - expecting exactly one with no knock-on exceptions.")

hodlinator commented at 12:01 pm on April 23, 2025:

The forwarding of self.options to the child process is purely to obey the expectations of someone running feature_framework_startup_failures.py or test_runner.py with those arguments on the CLI/CI.

l0rinc commented at 12:06 pm on April 23, 2025:

Not sure I understand, so is it dead code? Would the above exceptions be triggered on CI? In either case I find it confusing.

hodlinator commented at 12:17 pm on April 23, 2025:

It is sometimes useful to debug tests by looking at what is left behind on the filesystem. In those cases one could invoke the parent with “–nocleanup”. Not forwarding it to child processes means they would not leave files behind.

l0rinc commented at 12:21 pm on April 23, 2025:

Sure, but why isn’t it used currently? The dead code part is what bothers me

hodlinator commented at 12:43 pm on April 23, 2025:

It is functionality that can be activated by the user/CI. Not having the code would leave such cases in more of a broken state.

l0rinc commented at 12:45 pm on April 23, 2025:

Can we cover the paths in the tests as well?

hodlinator commented at 12:51 pm on April 23, 2025:

We could definitely test more parts of the framework functionality in another PR. This PR is focused on keeping the error reporting clear for startup failures, to aid troubleshooting.

l0rinc commented at 12:55 pm on April 23, 2025:

But this PR introduced the dead code

hodlinator commented at 1:02 pm on April 23, 2025:

#30660 (review)

hodlinator commented at 6:57 am on April 25, 2025:

I felt we were going in circles here, that’s why I illustrated it by linking back up the thread. The code is not dead, it is there to avoid the behavior not being broken when those args are activated by a user or CI.

This change is about testing the behavior of the test framework. We could also add tests to test the testing of the test framework, but at some point, this recursion has to stop.

Maybe we’re talking past each other in some way I don’t perceive?

hodlinator commented at 10:05 am on April 25, 2025:

(We had a call and I will experiment with passing down the full sys.argv to the child processes (with some tweaking for tmpdir). Will also see if I can add a test-class which intentionally exercises the failure case of ...count("Traceback")) != 1 and other checks in _verify_startup_failure without making things too convoluted, as it will ensure the test tester is robust going forward).

hodlinator commented at 9:27 pm on April 25, 2025:

(Done in latest push from ~8 hours ago).

l0rinc commented at 4:16 pm on April 26, 2025:

Thanks for fixing this!

DrahtBot removed the label CI failed on Apr 23, 2025

hodlinator commented at 6:59 am on April 25, 2025: contributor

Drahtbot’s LLM is complaining about an assert string: #30660 (comment) The Python interpreter is fine with mixing regular ""-strings and f""-strings, so it’s a false positive.

maflcko commented at 7:13 am on April 25, 2025: member

Drahtbot’s LLM is complaining … #30660 (comment) … it’s a false positive.

Yes, there are some rare false-positives. However, there are many more true positives, including one logic bug that was already found in a test in another pull, so I think overall it is worth it. Happy to discuss further in https://github.com/maflcko/DrahtBot/issues, or anywhere else.

qa: assert_raises_message() - Stop assuming certain structure for exceptions 28e282ef9a

hodlinator force-pushed on Apr 25, 2025

in test/functional/feature_framework_startup_failures.py:50 in 0302eefd9a outdated

45+            if (m := re.match(r'--tm(p?(d?(i?(r?))))=(.+)', arg)):
46+                parent_args[i] = f"--tmpdir={m[5]}/{subdir}"
47+                tmpdir_arg = True
48+                break
49+        if not tmpdir_arg:
50+            parent_args.append(f"--tmpdir={self.options.tmpdir}/{subdir}")

l0rinc commented at 3:49 pm on April 26, 2025:

What’s the exact purpose of this, why would someone set --tmr=x or --tmdi=y? And if not tmpdir_arg is basically the same as the previous loop which inserts to existing location, it just appends and uses a different root folder, right? We could use those as defaults I think, and slice it in, something like:

0        i, path = next(((i, m[1]) for i, arg in enumerate(parent_args) if (m := re.match(r'--tmpdir=(.+)', arg))),
1                       (len(parent_args), self.options.tmpdir))
2        subdir = md5(expected_exception.encode('utf-8')).hexdigest()[:8]
3        parent_args[i:i + 1] = [f"--tmpdir={path}/{subdir}"]

((-1, self.options.tmpdir)) would also work, but the slice would look weird in that case)

Not a blocker from my part, will leave it up to you.

hodlinator commented at 7:53 pm on April 26, 2025:

Thanks! Took a variant of that, while still maintaining support for shortened forms of --tmpdir. The Python ArgumentParser supports doing that and I’ve personally used --time=2 in place of --timeout-factor=2 since it’s shorter and I don’t remember if it’s --timeout_factor.

l0rinc commented at 9:04 pm on April 26, 2025:

But this doesn’t just match prefixes, it makes every parameter optional, so the middle one can be missing - which would be weird.

Maybe you meant --tm(?:p(?:d(?:ir?)?)?)?= which would only match the prefixes

--tmpdir=
--tmpdi=
--tmpd=
--tmp=
--tm=

but the intermediary letters aren’t optional, so

--tmr=

won’t be matched anymore (and the groups are non-capturing so you can still use m[1] for the value)

hodlinator commented at 9:33 pm on April 26, 2025:

Thanks for catching that. I would have leaned towards --tm(p(d(i(r?)?)?)?)?= but I like the non-capturing groups, don’t recall coming across them before. Taken.

l0rinc commented at 4:20 pm on April 26, 2025: contributor

ACK 0302eefd9a826e1d8e81b3684ff8ebc6de9d4217

Latest push fixed dead code concerns, uses counts instead of find for non-regex cases consistently and uses less encoding rountrips.

DrahtBot requested review from ryanofsky on Apr 26, 2025

l0rinc approved

hodlinator force-pushed on Apr 26, 2025

l0rinc commented at 9:06 pm on April 26, 2025: contributor

ACK b19439b1eee57041fc37dcf509d5ee9f248263bc

qa: Add feature_framework_startup_failures.py 10845cd7cc

hodlinator force-pushed on Apr 26, 2025

l0rinc commented at 9:54 pm on April 26, 2025: contributor

ACK 10845cd7cc89fc83b4ee844dc577b391720bba9c

in test/functional/test_framework/test_node.py:342 in 6ad21b4c01 outdated

338                     raise
339+                suppressed_errors["missing_credentials"] += 1
340+                latest_error = repr(e)
341             time.sleep(1.0 / poll_per_s)
342-        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
343+        self._raise_assertion_error(f"Unable to connect to bitcoind after {self.rpc_timeout}s (ignored errors: {str(dict(suppressed_errors))}, latest error: {latest_error})")

ismaelsadeeq commented at 5:14 pm on May 8, 2025:

In “qa: Include ignored errors in RPC connection timeout” 6ad21b4c0114029d16d334bf8d437834708a295e

This is an example error after this change

0AssertionError: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 240}, latest error: ValueError('No RPC credentials'))

I think it will better if it will be consistent

e.g

0AssertionError: [node 0] Unable to connect to bitcoind after 60s (ignored errors: {'missing_credentials': 240}, latest error: missing_credentials))

ryanofsky commented at 7:53 pm on May 8, 2025:

re: #30660 (review)

Suggestion here seems to be to replace latest error repr with ignored errors str. Could also do the opposite and use repr for both. Either way seems find and would be nice to be consistent

hodlinator commented at 9:07 pm on May 9, 2025:

The ignored errors map are more of a categorization, while I intentionally wanted to include the full representation for the one latest error. I agree there is a drawback in them looking like different errors. Maybe something like this would be better?

0AssertionError: [node 0] Unable to connect to bitcoind after 60s. Ignored errors: {'missing_credentials': 240}, latest error: 'missing_credentials'/ValueError('No RPC credentials').

in test/functional/test_framework/test_node.py:399 in 9b24a403fa outdated

395@@ -396,15 +396,15 @@ def stop_node(self, expected_stderr='', *, wait=0, wait_until_stopped=True):
396         """Stop the node."""
397         if not self.running:
398             return
399+        assert self.rpc_connected, self._node_msg(

ismaelsadeeq commented at 5:39 pm on May 8, 2025:

In “qa: Only allow calling TestNode.stop() after connecting” 9b24a403fae4b896ff7705519bd48c877b4e621b

I am not sure whether this new error message is better, I think I will prefer the previous one as there is a case where the test node has start without a cookie file, hence wait_for_rpc_connection should not even be called.

And when you call stop_node it will tell you that there is no RPC connection.

ryanofsky commented at 7:58 pm on May 8, 2025:

re: #30660 (review)

I didn’t look into it closely but I think I don’t understand why wait_for_rpc_connection should not be called if stop if called. If there’s no cookie file because password authentication is used instead for example, i’d think you should still need for the RPC interface to wait before sending the stop RPC

hodlinator commented at 9:36 pm on May 9, 2025:

I think I will prefer the previous one as there is a case where the test node has start without a cookie file

If you can pinpoint a specific test that calls stop_node() without setting up some kind of RPC connection, let me know. Without being able to send the stop-RPC, we have no way of signalling the node to shut down AFAIK. We should be calling something like TestNode.kill_process() instead for such cases. So, seems like a logic error worthy of an assert.

in test/functional/feature_framework_startup_failures.py:45 in 10845cd7cc

40+        # as a cause of colliding with the parent dir.
41+        parent_args = sys.argv.copy()
42+        assert self.options.tmpdir, "Framework should always set tmpdir."
43+        i, path = next(((i, m[1]) for i, arg in enumerate(parent_args) if (m := re.match(r'--tm(?:p(?:d(?:ir?)?)?)?=(.+)', arg))),
44+                       (len(parent_args), self.options.tmpdir))
45+        subdir = md5(expected_exception.encode('utf-8')).hexdigest()[:8]

ismaelsadeeq commented at 5:57 pm on May 8, 2025:

🥲

0        subdir = sha1(expected_exception.encode('utf-8')).hexdigest()[:8]

hodlinator commented at 11:55 am on May 9, 2025:

I know right? :) Was aiming for speed, not pseudo-collision-resistance.

ismaelsadeeq commented at 6:05 pm on May 8, 2025: member

Nice cleanup, although I have one comment on the new log of stop_node.

in test/functional/test_framework/util.py:106 in 28e282ef9a outdated

105-            raise AssertionError(
106-                "Expected substring not found in error message:\nsubstring: '{}'\nerror message: '{}'.".format(
107-                    message, e.error['message']))
108+        if message is not None and message not in repr(e):
109+            raise AssertionError("Expected substring not found in exception:\n"
110+                                 f"substring: '{message}'\nexception: {repr(e)}.")

ryanofsky commented at 6:28 pm on May 8, 2025:

In commit “qa: assert_raises_message() - Stop assuming certain structure for exceptions” (28e282ef9ae94ede4aace6b97ff18c66cb72a001)

Would be nice if commit message mentioned why this change is being made. Currently there’s no explanation and the change seems pretty random.

Also could replace {repr(e)} with just {e!r}

hodlinator commented at 9:25 pm on May 9, 2025:

Since it’s been merged already, I’ll add some more context here:

AFAIK all existing calls to assert_raises_message had message==None. In this PR I wanted to use it in a later commit. The old way of checking e.error['message'] instead of repr(e) for occurrences of message was assuming a certain structure/layout of the exception type. Not sure if there’s more to say.

I’m not married to using repr(e) (as indicated #30660 (review)). I realize now that it could make other users of this feature no longer work (had they existed on some parallel branch), as they would need to add escaping of special characters like I had to do for (\r)\n.

in test/functional/feature_framework_startup_failures.py:79 in 10845cd7cc

74+        assert_raises_message(
75+            AssertionError,
76+            ("Child test didn't contain (only) expected errors:\n" +
77+             linesep.join(["Found 0/1 tracebacks - expecting exactly one with no knock-on exceptions.",
78+                           "Found 0/1 occurrences of the specific exception: NonExistentError",
79+                           "Found 0/1 test failure output messages."])).encode("unicode_escape").decode("utf-8"),

ryanofsky commented at 6:49 pm on May 8, 2025:

In commit “qa: Add feature_framework_startup_failures.py” (10845cd7cc89fc83b4ee844dc577b391720bba9c)

I think all these nested string operations make this code hard to decipher, and flattening could make it more readable. Would suggest:

 0--- a/test/functional/feature_framework_startup_failures.py
 1+++ b/test/functional/feature_framework_startup_failures.py
 2@@ -71,12 +71,13 @@ class FeatureFrameworkStartupFailures(BitcoinTestFramework):
 3 
 4     def run_test(self):
 5         self.log.info("Verifying _verify_startup_failure() functionality (self-check).")
 6+        eol = linesep.encode("unicode_escape").decode("utf8")
 7         assert_raises_message(
 8             AssertionError,
 9-            ("Child test didn't contain (only) expected errors:\n" +
10-             linesep.join(["Found 0/1 tracebacks - expecting exactly one with no knock-on exceptions.",
11-                           "Found 0/1 occurrences of the specific exception: NonExistentError",
12-                           "Found 0/1 test failure output messages."])).encode("unicode_escape").decode("utf-8"),
13+            (rf"Child test didn't contain (only) expected errors:\n"
14+             rf"Found 0/1 tracebacks - expecting exactly one with no knock-on exceptions.{eol}"
15+             rf"Found 0/1 occurrences of the specific exception: NonExistentError{eol}"
16+             rf"Found 0/1 test failure output messages.{eol}"),
17             self._verify_startup_failure,
18             TestSuccess, [],
19             "NonExistentError",

hodlinator commented at 9:44 pm on May 9, 2025:

~~One remaining \n, which would probably fail on Windows.~~ Agree it is marginally more readable.

in test/functional/feature_framework_startup_failures.py:44 in 10845cd7cc

39+        # Inherit args from parent, only modifying tmpdir so children don't fail
40+        # as a cause of colliding with the parent dir.
41+        parent_args = sys.argv.copy()
42+        assert self.options.tmpdir, "Framework should always set tmpdir."
43+        i, path = next(((i, m[1]) for i, arg in enumerate(parent_args) if (m := re.match(r'--tm(?:p(?:d(?:ir?)?)?)?=(.+)', arg))),
44+                       (len(parent_args), self.options.tmpdir))

ryanofsky commented at 7:06 pm on May 8, 2025:

In commit “qa: Add feature_framework_startup_failures.py” (10845cd7cc89fc83b4ee844dc577b391720bba9c)

This line is pretty dense. Might suggest adding a comment like “Get position and value of the first –tmpdir option, or the end position and self.options.tmpdir if none is specified.” so it’s not necessary to spend time deciphering this to understand the rest of the code.

Also I think it would probably be more correct to s/enumerate(parent_args)/reversed(enumerate(parent_args))/ to use the last tmpdir option instead of the first, because argparse will return the last option value if multiple are present.

Or actually, maybe this is not necessary at all given that argparse will use the last value? Following seems to work without a problem

0assert self.options.tmpdir, "Framework should always set tmpdir."
1subdir = md5(expected_exception.encode('utf-8')).hexdigest()[:8]
2args = [sys.executable] + sys.argv + [f"--tmpdir={self.options.tmpdir}/{subdir}", f"--internal_test={test.__name__}"] + internal_args

hodlinator commented at 11:57 am on May 9, 2025:

I like your final version for its simplicity and also being more correct as currently we only mutate the first found instance.

in test/functional/feature_framework_startup_failures.py:39 in 10845cd7cc

34+        self.add_nodes(self.num_nodes, self.extra_args)
35+
36+    # Launches a child test process that runs this same file, but instantiates
37+    # a child test. Verifies that it raises only the expected exception, once.
38+    def _verify_startup_failure(self, test, internal_args, expected_exception):
39+        # Inherit args from parent, only modifying tmpdir so children don't fail

ryanofsky commented at 7:13 pm on May 8, 2025:

In commit “qa: Add feature_framework_startup_failures.py” (10845cd7cc89fc83b4ee844dc577b391720bba9c)

Would suggest s/only modifying tmpdir/just appending a unique suffix to tmpdir/ to give a little more context and help the next part of the comment make sense.

ryanofsky approved

ryanofsky commented at 7:40 pm on May 8, 2025: contributor

Code review ACK 10845cd7cc89fc83b4ee844dc577b391720bba9c. Only changes since last review were adding a new commit tweaking assert_raises_message(), extending the new test to have a self-check, and to pass through all options to child tests instead of a hardcoded list of options. I left some cleanup suggestions below but they are not important.

I think I will go ahead and merge this with 2 current acks and many stale ones, since this had has a lot of heavy review already and it only affects test code. It would be nice to address the remaining comments here or in a followup, but not necessary.

ryanofsky assigned ryanofsky on May 8, 2025

ryanofsky merged this on May 8, 2025

ryanofsky closed this on May 8, 2025

hodlinator deleted the branch on May 9, 2025

hodlinator commented at 11:30 am on May 9, 2025: contributor

There was a seemingly intermittent CI failure in wallet_basic.py on the commit merging this into master. Tracked as #32456. The following push to master succeeded.

hodlinator commented at 9:49 pm on May 9, 2025: contributor

Thank you @ismaelsadeeq for your feedback & @ryanofsky for having mercy on this PR. :)

Aiming to address some of the remaining feedback in a follow-up PR next week.

hodlinator referenced this in commit 30b1486491 on May 15, 2025

hodlinator referenced this in commit c47f634718 on May 15, 2025

hodlinator commented at 7:05 pm on May 15, 2025: contributor

@ryanofsky & @ismaelsadeeq some of your remaining comments are addressed and ready for review in follow-up PR #32509. (Happy to get review from others as well of course).

qa: Verify clean shutdown on startup failure #30660

Testing

Code Coverage & Benchmarks

Reviews

Conflicts

Git-sleuthing to help motivate this PR

Was this ever working?

Fixes in the same area

Added final commit - drop or squash?