Usually they come with a follow-up error of Unable to connect to bitcoind after 2400s (or similar).
This is a tracking issue, because all tests are affected.
Given that no one has reproduced this locally yet, this may be an issue limited to GHA.
Seen recently here (wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
raise AssertionError(self._node_msg(msg))
AssertionError: [node 1] Unable to connect to bitcoind after 2400s
test 2024-07-09T21:07:49.462000Z TestFramework (DEBUG): Closing down network thread
Would something like this be useful in helping approach the root cause of this?
(Edit: updated diff based off 1 2)
--- a/test/functional/test_framework/test_node.py
+++ b/test/functional/test_framework/test_node.py
@@ -5,6 +5,7 @@
"""Class for bitcoind node under test"""
import contextlib
+from collections import Counter
import decimal
import errno
from enum import Enum
@@ -260,6 +261,7 @@ class TestNode():
"""Sets up an RPC connection to the bitcoind process. Returns False if unable to connect."""
# Poll at a rate of four times per second
poll_per_s = 4
+ ignored_errors = Counter()
for _ in range(poll_per_s * self.rpc_timeout):
if self.process.poll() is not None:
# Attach abrupt shutdown error/s to the exception message
@@ -308,25 +310,32 @@ class TestNode():
return
except JSONRPCException as e: # Initialization phase
# -28 RPC in warmup
+ if e.error['code'] == -28:
+ ignored_errors["JSONRPCException -28"] += 1
# -342 Service unavailable, RPC server started but is shutting down due to error
- if e.error['code'] != -28 and e.error['code'] != -342:
+ # - Ignore the exception to properly raise the FailedToStartError.
+ elif e.error['code'] == -342:
+ ignored_errors["JSONRPCException -342"] += 1
+ else:
raise # unknown JSON RPC exception
- except ConnectionResetError:
- # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
- # succeeds. Try again to properly raise the FailedToStartError
- pass
except OSError as e:
- if e.errno == errno.ETIMEDOUT:
- pass # Treat identical to ConnectionResetError
+ if e.errno == errno.ECONNRESET:
+ # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
+ # succeeds. Try again to properly raise the FailedToStartError
+ ignored_errors["OSError.ECONNRESET"] += 1
+ elif e.errno == errno.ETIMEDOUT:
+ ignored_errors["OSError.ETIMEDOUT"] += 1 # Treat identical to ECONNRESET
elif e.errno == errno.ECONNREFUSED:
- pass # Port not yet open?
+ ignored_errors["OSError.ECONNREFUSED"] += 1 # Port not yet open?
else:
raise # unknown OS error
except ValueError as e: # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
- if "No RPC credentials" not in str(e):
+ if "No RPC credentials" in str(e):
+ ignored_errors["missing_credentials"] += 1
+ else:
raise
time.sleep(1.0 / poll_per_s)
- self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
+ self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored_errors: {})".format(self.rpc_timeout, str(dict(ignored_errors))))
def wait_for_cookie_credentials(self):
"""Ensures auth cookie credentials can be read, e.g. for testing CLI with -rpcwait before RPC connection is up."""
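In isolation, the counting pattern the diff adds looks something like this (a standalone sketch; `poll_until_connected` and `try_connect` are illustrative names, not the actual test framework API):

```python
from collections import Counter

# Standalone sketch of the error-aggregation pattern from the diff above:
# count each ignored error instead of silently passing, then surface the
# tally in the final failure message so the CI log shows what was swallowed.
def poll_until_connected(try_connect, attempts):
    ignored_errors = Counter()
    for _ in range(attempts):
        try:
            return try_connect()
        except ConnectionRefusedError:
            ignored_errors["OSError.ECONNREFUSED"] += 1  # port not yet open?
    raise AssertionError(
        "Unable to connect (ignored_errors: {})".format(dict(ignored_errors)))
```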
Not sure if patches like this could be applied to some subset of CI to help debug CI-only issues. It's slightly unclean, but maybe you've done similar things before?
~Notable: I removed the case for "-342 Service unavailable, RPC server started but is shutting down due to error". Seems like we should be failing if the server is shutting down instead of silently ignoring it.~
Notable: I removed the case for "-342 Service unavailable, RPC server started but is shutting down due to error". Seems like we should be failing if the server is shutting down instead of silently ignoring it.
That'd revert 62c304ea481d474bc87d950e21907b8b05134fe7 and thus break existing tests, no?
For reference: You can use git log -S "unique_code_blob" to see when (and hopefully why) a piece of code was added. In this case it would be something like:
$ git log -S ' != -342'
commit 62c304ea481d474bc87d950e21907b8b05134fe7
Date: Sat Oct 6 13:42:11 2018 +0800
tests: Allow closed http server in assert_start_raises_init_error
Other than that the patch
Seen recently here (wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:
This one looks different, because node1 never actually starts to start up. It doesn't even get to emit a 10061 error.
As for the ignored_errors counter, I presume it will just count errno.ECONNREFUSED for WinError 10061?
Thanks, I should have searched for why the -342 check was introduced before I suggested removing it. :facepalm: And my git-fu could do with some leveling up, that's impressive how it can single out the right commit.
Updated my suggested diff, bringing back the -342 check.
As for the ignored_errors counter, I presume it will just count errno.ECONNREFUSED for WinError 10061?
Yes, that's what ConnectionRefusedError seems to contain. Also corrected & updated that part of the diff.
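A quick way to confirm what ConnectionRefusedError carries (on Windows the OS-level code is WinError 10061, but Python appears to normalize e.errno to errno.ECONNREFUSED across platforms, which is what the counter would record):

```python
import socket

# Connect to a local port that nothing is listening on and inspect the
# errno on the resulting ConnectionRefusedError.
def refused_errno():
    # Bind (without listening) to grab a port that was just free, then
    # release it so the connect below is almost certainly refused.
    probe = socket.socket()
    probe.bind(("127.0.0.1", 0))
    port = probe.getsockname()[1]
    probe.close()
    s = socket.socket()
    try:
        s.connect(("127.0.0.1", port))
    except ConnectionRefusedError as e:
        return e.errno
    finally:
        s.close()
```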
Other than that the patch
Seen recently here (
wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:
This one looks different, because node1 never actually starts to start up. It doesn't even get to emit a 10061 error.
The linked log has:
node1 2024-07-09T20:27:21.198248Z [shutoff] [D:\a\bitcoin\bitcoin\src\init.cpp:386] [Shutdown] Shutdown: done
test 2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): Node stopped
test 2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): bitcoind started, waiting for RPC to come up
node0 2024-07-09T20:27:52.439635Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2365] [StartExtraBlockRelayPeers] [net] enabling extra block-relay-only peers
node0 2024-07-09T20:42:07.454184Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat 1ms
node0 2024-07-09T20:57:07.475871Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat 1ms
test 2024-07-09T21:07:49.462000Z TestFramework (ERROR): Assertion failed
Traceback (most recent call last):
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 132, in main
self.run_test()
File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 99, in run_test
test_bumpfee_with_abandoned_descendant_succeeds(self, rbf_node, rbf_node_address, dest_address)
File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 451, in test_bumpfee_with_abandoned_descendant_succeeds
self.restart_node(1, ['-minrelaytxfee=0.00005'] + self.extra_args[1])
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 602, in restart_node
self.start_node(i, extra_args)
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 554, in start_node
node.wait_for_rpc_connection()
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
raise AssertionError(self._node_msg(msg))
AssertionError: [node 1] Unable to connect to bitcoind after 2400s
The first 2 lines in the excerpt show how node 1 is shutting down right before starting. This made me think of issues reusing TCP ports when restarting a process (stackoverflow 1 2).
Maybe it would be better for the test nodes to use new port numbers upon restart, especially for tests like wallet_bumpfee.py that use restart_node?
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.
Edit: I guess TIME_WAIT does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
(Btw, b22529529823c0cb5916ac318c8536e9107b7e78 did touch code raising JSONRPCException -342 in the beginning of June. Don't see how that change would cause issues, but if it correlates with when you started seeing this, maybe it could be?)
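As an aside, the usual mitigation for TIME_WAIT bind failures on restart is SO_REUSEADDR; a minimal sketch of the idea (Unix-flavored illustration, not a proposed fix for the framework; Windows semantics for this option differ):

```python
import socket

# A restarted listener normally can't re-bind its old port while connections
# from the previous process linger in TIME_WAIT; SO_REUSEADDR relaxes that
# restriction so the bind succeeds immediately.
def bind_listener(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", port))
    s.listen()
    return s
```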
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.
I don't think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, ...).
Edit: I guess TIME_WAIT does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
I don't think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always?
However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.
I don't think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, ...).
Yes, different tests would get different PIDs, which feed into p2p_port etc, so port collisions between tests should not happen unless PIDs are reused within too short a time period.
Edit: I guess TIME_WAIT does not directly explain why connection still fails after 2400s. But it could be a contributing factor.
I don't think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always?
That is my assumption too.
However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356
Yup, it sure is mysterious. At least having counters for the silent failures could provide some clues. Do you think I should create a PR for #30390 (comment)?
Something definitely seems to be causing bitcoind to keep running but stall somewhere for 2400s without returning an exit code from .poll(). Or as you say, Python Popen may be having issues in even starting the process for some reason.
Tangent: In another project, I once saw issues with subprocesses stalling on writing to stdout due to the parent process not reading from it quickly enough, so the OS buffer in between the processes would fill up. It was solved by spinning up a dedicated thread in the parent process which continuously read stdout from the child. But in our case we seem to mostly be writing subprocess stdout & stderr to a file, so only running out of disk would potentially cause it to block in a similar fashion to the other project. But running out of disk would hopefully spawn other errors too, making the issue clear.
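The dedicated-reader-thread workaround from that other project can be sketched like this (illustrative names; the actual framework writes to files instead, as noted above):

```python
import subprocess
import threading

# Sketch of the stdout-drain pattern: a dedicated thread keeps reading the
# child's stdout so the OS pipe buffer between the processes never fills up
# and blocks the child on write.
def run_with_drain(cmd):
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = []

    def drain():
        # Reads continuously until the child closes its end of the pipe.
        for line in proc.stdout:
            lines.append(line)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    proc.wait()
    t.join()
    return lines
```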
Read through the docs on Python Popen and didn't find anything suspicious. Would assume that Popen would raise an exception immediately if it fails to start bitcoind. Seems more likely that bitcoind has gone into an infinite loop/blocked for some reason early during startup.
Might be helpful if we could log the callstack(s) from the bitcoind process when hitting the 2400s timeout.
Also #28509.
Ah nice! I forgot about it. I guess it also underlines that the issue lies in starting the process.
Maybe GHA has some kind of "Anti-Virus" software that may block some processes?
This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.
I don't think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, ...).
Yes, different tests would get different PIDs, which feed into p2p_port etc, so port collisions between tests should not happen unless PIDs are reused within too short a time period.
Probably unrelated to this issue at this point, but the test_runner provides the --portseed, so no PID is used.
Also #28509.
Thanks for the added context @hebasto! ~There seems to be some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link? Your comment links to an issue with subprocess.run but it's a higher-level function implemented on top of subprocess.Popen which we use.~
Maybe GHA has some kind of "Anti-Virus" software that may block some processes?
Little experience with GHA, but yeah, those kinds of things slowing down CI are not unheard of.
Probably unrelated to this issue at this point, but the test_runner provides the --portseed, so no PID is used.
Thanks for pointing that out!
test = self.test_list.pop(0)
portseed = len(self.test_list)
portseed_arg = ["--portseed={}".format(portseed)]
..that should indeed circumvent possible port-reuse issues stemming from OS PID reuse choices.
Given..
def p2p_port(n):
    return PORT_MIN + n + (MAX_NODES * PortSeed.n) % (PORT_RANGE - 1 - MAX_NODES)
    # Constants: 11000 + n + (12 * PortSeed.n) % (5000 - 1 - 12)
    #          = 11000 + n + (12 * PortSeed.n) % (4987)
..we should be able to have 415 tests (4987 / 12) before tests at the beginning of the test_list risk colliding ports with tests at the end. From what I can see in test/functional/test_runner.py we have ~300 tests. :+1:
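Plugging in the constants, the arithmetic can be spot-checked with a standalone adaptation of p2p_port that takes the portseed as a parameter instead of the PortSeed.n global:

```python
MAX_NODES = 12
PORT_MIN = 11000
PORT_RANGE = 5000

def p2p_port(n, portseed):
    return PORT_MIN + n + (MAX_NODES * portseed) % (PORT_RANGE - 1 - MAX_NODES)

# For portseeds 0..414 the modulo never wraps (12 * 414 = 4968 < 4987), so
# each test gets its own disjoint span of MAX_NODES ports.
spans = [{p2p_port(n, seed) for n in range(MAX_NODES)} for seed in range(415)]
all_ports = set().union(*spans)
assert len(all_ports) == 415 * MAX_NODES  # no collisions among 415 tests
```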
There seems to be some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link?
This was just a wrong comment by myself. See #28509 (comment)
Regarding process start issues, would there be any support for me attempting to integrate Microsoft SysInternals ProcDump into CI & test framework to be able to call procdump -mm <PID> on the bitcoind.exe process and hopefully figure out where the process is stalling?
Seems fine to integrate in a pull request and then keep the pull request unmerged, and only merge the pull request in the GHA CI, as that is the only place where it reproduces. Adding temporary test-only code for everyone else may be a bit too much.
Alright, I can live with having it a pending PR until it has proven itself useful on CI. :)
Will see if I can get a Windows/Wine environment going locally and study how to integrate it into GHA.
Seen in https://github.com/bitcoin/bitcoin/actions/runs/10851531017/job/30115514734?pr=24123#step:12:242:
File "C:\hostedtoolcache\windows\Python\3.12.5\x64\Lib\socket.py", line 850, in create_connection
sock.connect(sa)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it