Usually they come with a follow-up error of Unable to connect to bitcoind after 2400s
(or similar).
This is a tracking issue, because all tests are affected.
Usually they come with a follow-up error of Unable to connect to bitcoind after 2400s
(or similar).
This is a tracking issue, because all tests are affected.
Seen recently here (wallet_bumpfee.py --descriptors
) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:
0 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
1 raise AssertionError(self._node_msg(msg))
2 AssertionError: [node 1] Unable to connect to bitcoind after 2400s
3 test 2024-07-09T21:07:49.462000Z TestFramework (DEBUG): Closing down network thread
Would something like this be useful in helping approach the root cause of this?
(Edit: updated diff based off 1 2)
0--- a/test/functional/test_framework/test_node.py
1+++ b/test/functional/test_framework/test_node.py
2@@ -5,6 +5,7 @@
3 """Class for bitcoind node under test"""
4
5 import contextlib
6+from collections import Counter
7 import decimal
8 import errno
9 from enum import Enum
10@@ -260,6 +261,7 @@ class TestNode():
11 """Sets up an RPC connection to the bitcoind process. Returns False if unable to connect."""
12 # Poll at a rate of four times per second
13 poll_per_s = 4
14+ ignored_errors = Counter()
15 for _ in range(poll_per_s * self.rpc_timeout):
16 if self.process.poll() is not None:
17 # Attach abrupt shutdown error/s to the exception message
18@@ -308,25 +310,32 @@ class TestNode():
19 return
20 except JSONRPCException as e: # Initialization phase
21 # -28 RPC in warmup
22+ if e.error['code'] == -28:
23+ ignored_errors["JSONRPCException -28"] += 1
24 # -342 Service unavailable, RPC server started but is shutting down due to error
25- if e.error['code'] != -28 and e.error['code'] != -342:
26+ # - Ignore the exception to properly raise the FailedToStartError.
27+ elif e.error['code'] == -342:
28+ ignored_errors["JSONRPCException -342"] += 1
29+ else:
30 raise # unknown JSON RPC exception
31- except ConnectionResetError:
32- # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
33- # succeeds. Try again to properly raise the FailedToStartError
34- pass
35 except OSError as e:
36- if e.errno == errno.ETIMEDOUT:
37- pass # Treat identical to ConnectionResetError
38+ if e.errno == errno.ECONNRESET:
39+ # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
40+ # succeeds. Try again to properly raise the FailedToStartError
41+ ignored_errors["OSError.ECONNRESET"] += 1
42+ elif e.errno == errno.ETIMEDOUT:
43+ ignored_errors["OSError.ETIMEDOUT"] += 1 # Treat identical to ECONNRESET
44 elif e.errno == errno.ECONNREFUSED:
45- pass # Port not yet open?
46+ ignored_errors["OSError.ECONNREFUSED"] += 1 # Port not yet open?
47 else:
48 raise # unknown OS error
49 except ValueError as e: # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
50- if "No RPC credentials" not in str(e):
51+ if "No RPC credentials" in str(e):
52+ ignored_errors["missing_credentials"] += 1
53+ else:
54 raise
55 time.sleep(1.0 / poll_per_s)
56- self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
57+ self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored_errors: {})".format(self.rpc_timeout, str(dict(ignored_errors))))
58
59 def wait_for_cookie_credentials(self):
60 """Ensures auth cookie credentials can be read, e.g. for testing CLI with -rpcwait before RPC connection is up."""
Not sure if patches like this could be applied to some subset of CI to help debug CI-only issues, slightly unclean, but maybe you’ve done similar things before?
Notable: I removed the case for “-342 Service unavailable, RPC server started but is shutting down due to error”. Seems like we should be failing if the server is shutting down instead of silently ignoring it.
Notable: I removed the case for “-342 Service unavailable, RPC server started but is shutting down due to error”. Seems like we should be failing if the server is shutting down instead of silently ignoring it.
That’d revert 62c304ea481d474bc87d950e21907b8b05134fe7 and thus break existing tests, no?
For reference: You can use git log -S "unique_code_blob"
to see when (and hopefully why) a piece of code was added. In this case it would be something like:
0$ git log -S ' != -342'
1commit 62c304ea481d474bc87d950e21907b8b05134fe7
2Date: Sat Oct 6 13:42:11 2018 +0800
3
4 tests: Allow closed http server in assert_start_raises_init_error
Other than that the patch
Seen recently here (
wallet_bumpfee.py --descriptors
) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:
This one looks different, because node1
never actually starts to start up. It doesn’t even get to emit an 10061
error.
errno.ECONNREFUSED
for WinError 10061?
Thanks, I should have searched for why the -342 check was introduced before I suggested removing it. :facepalm: And my git-fu could do with some leveling up, that’s impressive how it can single out the right commit.
Updated my suggested diff, bringing back the -342 check.
As for the ignored_errors counter, I presume it will just count errno.ECONNREFUSED for WinError 10061?
Yes, that’s what ConnectionRefusedError seems to contain. Also corrected & updated that part of the diff.
Other than that the patch
Seen recently here (
wallet_bumpfee.py --descriptors
) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:This one looks different, because
node1
never actually starts to start up. It doesn’t even get to emit an10061
error.
The linked log has:
0 node1 2024-07-09T20:27:21.198248Z [shutoff] [D:\a\bitcoin\bitcoin\src\init.cpp:386] [Shutdown] Shutdown: done
1 test 2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): Node stopped
2 test 2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): bitcoind started, waiting for RPC to come up
3 node0 2024-07-09T20:27:52.439635Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2365] [StartExtraBlockRelayPeers] [net] enabling extra block-relay-only peers
4 node0 2024-07-09T20:42:07.454184Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat 1ms
5 node0 2024-07-09T20:57:07.475871Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat 1ms
6 test 2024-07-09T21:07:49.462000Z TestFramework (ERROR): Assertion failed
7 Traceback (most recent call last):
8 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 132, in main
9 self.run_test()
10 File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 99, in run_test
11 test_bumpfee_with_abandoned_descendant_succeeds(self, rbf_node, rbf_node_address, dest_address)
12 File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 451, in test_bumpfee_with_abandoned_descendant_succeeds
13 self.restart_node(1, ['-minrelaytxfee=0.00005'] + self.extra_args[1])
14 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 602, in restart_node
15 self.start_node(i, extra_args)
16 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 554, in start_node
17 node.wait_for_rpc_connection()
18 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
19 self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
20 File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
21 raise AssertionError(self._node_msg(msg))
22 AssertionError: [node 1] Unable to connect to bitcoind after 2400s
The first 2 lines in the excerpt show how node 1 is shutting down right before starting. This made me think of issues reusing TCP ports when restarting a process (stackoverflow 1 2).
Maybe it would be better for the test nodes to use new port numbers upon restart, especially for tests like wallet_bumpfee.py
that use restart_node
?
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT
period.
Edit: I guess TIME_WAIT
does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
JSONRPCException
-342
in the beginning of June. Don’t see how that change would cause issues, but if it correlates with when you started seeing this, maybe it could be?)
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP
TIME_WAIT
period.
I don’t think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, …).
Edit: I guess TIME_WAIT does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
I don’t think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always?
However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP
TIME_WAIT
period.I don’t think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, …).
Yes, different tests would get different PIDs, which feed into p2p_port
etc, so port collisions between tests should not happen unless PIDs are reused within a too short timeperiod.
Edit: I guess TIME_WAIT does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
I don’t think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always?
That is my assumption too.
However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356
Yup, it sure is mysterious. At least having counters for the silent failures could provide some clues. Do you think I should create a PR for #30390 (comment)?
Something definitely seems to be causing bitcoind to keep running but stall somewhere for 2400s without returning an exit code from .poll()
. Or as you say, Python Popen may be having issues in even starting the process for some reason.
Tangent: In another project, I once saw issues with subprocesses stalling on writing to stdout
due to the parent process not reading from it quickly enough, so the OS buffer in between the processes would fill up. It was solved by spinning up a dedicated thread in the parent process which continuously read stdout
from the child. But in our case we seem to mostly be writing subprocess stdout
& stderr
to a file, so only running out of disk would potentially cause it to block in a similar fashion to the other project. But running out of disk would hopefully spawn other errors too, making the issue clear.
Read through the docs on Python Popen and didn’t find anything suspicious. Would assume that Popen
would raise an exception immediately if it fails to start bitcoind
. Seems more likely that bitcoind
has gone into an infinite loop/blocked for some reason early during startup.
Might be helpful if we could log the callstack(s) from the bitcoind
process when hitting the 2400s timeout.
Also #28509.
Ah nice! I forgot about it. I guess it also underlines that the issue lies in starting the process.
Maybe GHA has some kind of “Anti-Virus” software that may block some processes?
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP
TIME_WAIT
period.I don’t think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, …).
Yes, different tests would get different PIDs, which feed into
p2p_port
etc, so port collisions between tests should not happen unless PIDs are reused within a too short timeperiod.
Probably unrelated to this issue at this point, but the test_runner provides the --portseed
, so no PID is used.
Also #28509.
Thanks for the added context @hebasto! There seems to be some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link? Your comment links to an issue with subprocess.run
but it’s a higher-level function implemented on top of subprocess.Popen
which we use.
Maybe GHA has some kind of “Anti-Virus” software that may block some processes?
Little experience with GHA, but yeah, those kind of things slowing down CI is not unheard of.
Probably unrelated to this issue at this point, but the test_runner provides the
--portseed
, so no PID is used.
Thanks for pointing that out!
0test = self.test_list.pop(0)
1portseed = len(self.test_list)
2portseed_arg = ["--portseed={}".format(portseed)]
..that should indeed circumvent possible port-reuse issues stemming from OS PID reuse choices.
Given..
0def p2p_port(n):
1 return PORT_MIN + n + (MAX_NODES * PortSeed.n) % (PORT_RANGE - 1 - MAX_NODES)
2 # Constants: 11000 + n + (12 * PortSeed.n) % (5000 - 1 - 12)
3 # = 11000 + n + (12 * PortSeed.n) % (4987)
..we should be able to have 415 tests (4987 / 12) before tests at the beginning of the test_list
risk colliding ports with tests at the end. From what I can see in test/functional/test_runner.py we have ~300 tests. :+1:
There seems to some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link?
This was just a wrong comment by myself. See #28509 (comment)
procdump -mm <PID>
on the bitcoind.exe
process and hopefully figure out where the process is stalling?
Alright, I can live with having it a pending PR until it has proven itself useful on CI. :)
Will see if I can get a Windows/Wine environment going locally and study how to integrate it into GHA.
Seen in https://github.com/bitcoin/bitcoin/actions/runs/10851531017/job/30115514734?pr=24123#step:12:242:
0 File "C:\hostedtoolcache\windows\Python\3.12.5\x64\Lib\socket.py", line 850, in create_connection
1
2 sock.connect(sa)
3
4ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it