Usually these failures come with a follow-up error of "Unable to connect to bitcoind after 2400s" (or similar).
This is a tracking issue, because all tests are affected.
Seen recently here (wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496:

      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
        raise AssertionError(self._node_msg(msg))
    AssertionError: [node 1] Unable to connect to bitcoind after 2400s
    test  2024-07-09T21:07:49.462000Z TestFramework (DEBUG): Closing down network thread
Would something like this be useful in approaching the root cause of this?
(Edit: updated diff based off 1 2)
    --- a/test/functional/test_framework/test_node.py
    +++ b/test/functional/test_framework/test_node.py
    @@ -5,6 +5,7 @@
     """Class for bitcoind node under test"""
     
     import contextlib
    +from collections import Counter
     import decimal
     import errno
     from enum import Enum
    @@ -260,6 +261,7 @@ class TestNode():
             """Sets up an RPC connection to the bitcoind process. Returns False if unable to connect."""
             # Poll at a rate of four times per second
             poll_per_s = 4
    +        ignored_errors = Counter()
             for _ in range(poll_per_s * self.rpc_timeout):
                 if self.process.poll() is not None:
                     # Attach abrupt shutdown error/s to the exception message
    @@ -308,25 +310,32 @@ class TestNode():
                     return
                 except JSONRPCException as e:  # Initialization phase
                     # -28 RPC in warmup
    +                if e.error['code'] == -28:
    +                    ignored_errors["JSONRPCException -28"] += 1
                     # -342 Service unavailable, RPC server started but is shutting down due to error
    -                if e.error['code'] != -28 and e.error['code'] != -342:
    +                # - Ignore the exception to properly raise the FailedToStartError.
    +                elif e.error['code'] == -342:
    +                    ignored_errors["JSONRPCException -342"] += 1
    +                else:
                         raise  # unknown JSON RPC exception
    -            except ConnectionResetError:
    -                # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
    -                # succeeds. Try again to properly raise the FailedToStartError
    -                pass
                 except OSError as e:
    -                if e.errno == errno.ETIMEDOUT:
    -                    pass  # Treat identical to ConnectionResetError
    +                if e.errno == errno.ECONNRESET:
    +                    # This might happen when the RPC server is in warmup, but shut down before the call to getblockcount
    +                    # succeeds. Try again to properly raise the FailedToStartError
    +                    ignored_errors["OSError.ECONNRESET"] += 1
    +                elif e.errno == errno.ETIMEDOUT:
    +                    ignored_errors["OSError.ETIMEDOUT"] += 1  # Treat identical to ECONNRESET
                     elif e.errno == errno.ECONNREFUSED:
    -                    pass  # Port not yet open?
    +                    ignored_errors["OSError.ECONNREFUSED"] += 1  # Port not yet open?
                     else:
                         raise  # unknown OS error
                 except ValueError as e:  # cookie file not found and no rpcuser or rpcpassword; bitcoind is still starting
    -                if "No RPC credentials" not in str(e):
    +                if "No RPC credentials" in str(e):
    +                    ignored_errors["missing_credentials"] += 1
    +                else:
                         raise
                 time.sleep(1.0 / poll_per_s)
    -        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
    +        self._raise_assertion_error("Unable to connect to bitcoind after {}s (ignored_errors: {})".format(self.rpc_timeout, str(dict(ignored_errors))))
     
         def wait_for_cookie_credentials(self):
             """Ensures auth cookie credentials can be read, e.g. for testing CLI with -rpcwait before RPC connection is up."""
Not sure whether patches like this could be applied to some subset of CI to help debug CI-only issues. It's slightly unclean, but maybe you've done similar things before?
~~Notable: I removed the case for "-342 Service unavailable, RPC server started but is shutting down due to error". Seems like we should be failing if the server is shutting down instead of silently ignoring it.~~
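For illustration, here is a standalone sketch (not part of the diff; the tallies below are made up) of the Counter bookkeeping the patch adds and the assertion message it would produce:

```python
from collections import Counter

# Each ignored exception increments a named bucket; the final assertion
# message dumps the tallies so CI logs show what was silently retried.
ignored_errors = Counter()
ignored_errors["OSError.ECONNREFUSED"] += 1
ignored_errors["JSONRPCException -28"] += 1
ignored_errors["OSError.ECONNREFUSED"] += 1

print("Unable to connect to bitcoind after {}s (ignored_errors: {})".format(
    2400, dict(ignored_errors)))
# Unable to connect to bitcoind after 2400s (ignored_errors: {'OSError.ECONNREFUSED': 2, 'JSONRPCException -28': 1})
```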
That’d revert 62c304ea481d474bc87d950e21907b8b05134fe7 and thus break existing tests, no?
For reference: You can use git log -S "unique_code_blob" to see when (and hopefully why) a piece of code was added. In this case it would be something like:

    $ git log -S ' != -342'
    commit 62c304ea481d474bc87d950e21907b8b05134fe7
    Date:   Sat Oct 6 13:42:11 2018 +0800

        tests: Allow closed http server in assert_start_raises_init_error
Other than that the patch

Seen recently here (wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496: This one looks different, because node1 never actually starts to start up. It doesn’t even get to emit a 10061 error.

As for the ignored_errors counter, I presume it will just count errno.ECONNREFUSED for WinError 10061?
Thanks, I should have searched for why the -342 check was introduced before I suggested removing it. :facepalm: And my git-fu could do with some leveling up; it's impressive how it can single out the right commit.

Updated my suggested diff, bringing back the -342 check.

> As for the ignored_errors counter, I presume it will just count errno.ECONNREFUSED for WinError 10061?

Yes, that’s what ConnectionRefusedError seems to contain. Also corrected & updated that part of the diff.
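A quick local check of that mapping (a sketch, not from the thread; it assumes nothing is listening on the target port and that the OS refuses rather than silently drops the connection):

```python
import errno
import socket

# Attempt a TCP connection to a local port that is presumably closed.
# Python raises ConnectionRefusedError, an OSError subclass whose errno
# is errno.ECONNREFUSED even when the native code is WinError 10061.
try:
    socket.create_connection(("127.0.0.1", 1), timeout=5)
except ConnectionRefusedError as e:
    assert isinstance(e, OSError)
    assert e.errno == errno.ECONNREFUSED
    print(e)  # on Windows: [WinError 10061] No connection could be made ...
```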
> Seen recently here (wallet_bumpfee.py --descriptors) https://github.com/bitcoin/bitcoin/actions/runs/9863555844/job/27236587622#step:27:12496: This one looks different, because node1 never actually starts to start up. It doesn’t even get to emit a 10061 error.
The linked log has:

    node1 2024-07-09T20:27:21.198248Z [shutoff] [D:\a\bitcoin\bitcoin\src\init.cpp:386] [Shutdown] Shutdown: done
    test  2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): Node stopped
    test  2024-07-09T20:27:21.233000Z TestFramework.node1 (DEBUG): bitcoind started, waiting for RPC to come up
    node0 2024-07-09T20:27:52.439635Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2365] [StartExtraBlockRelayPeers] [net] enabling extra block-relay-only peers
    node0 2024-07-09T20:42:07.454184Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat  1ms
    node0 2024-07-09T20:57:07.475871Z [scheduler] [D:\a\bitcoin\bitcoin\src\net.cpp:2328] [DumpAddresses] [net] Flushed 0 addresses to peers.dat  1ms
    test  2024-07-09T21:07:49.462000Z TestFramework (ERROR): Assertion failed
    Traceback (most recent call last):
      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 132, in main
        self.run_test()
      File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 99, in run_test
        test_bumpfee_with_abandoned_descendant_succeeds(self, rbf_node, rbf_node_address, dest_address)
      File "D:\a\bitcoin\bitcoin/test/functional/wallet_bumpfee.py", line 451, in test_bumpfee_with_abandoned_descendant_succeeds
        self.restart_node(1, ['-minrelaytxfee=0.00005'] + self.extra_args[1])
      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 602, in restart_node
        self.start_node(i, extra_args)
      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_framework.py", line 554, in start_node
        node.wait_for_rpc_connection()
      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 314, in wait_for_rpc_connection
        self._raise_assertion_error("Unable to connect to bitcoind after {}s".format(self.rpc_timeout))
      File "D:\a\bitcoin\bitcoin\test\functional\test_framework\test_node.py", line 188, in _raise_assertion_error
        raise AssertionError(self._node_msg(msg))
    AssertionError: [node 1] Unable to connect to bitcoind after 2400s
The first 2 lines in the excerpt show how node 1 is shutting down right before starting. This made me think of issues with reusing TCP ports when restarting a process (stackoverflow 1 2).

Maybe it would be better for the test nodes to use new port numbers upon restart, especially for tests like wallet_bumpfee.py that use restart_node?

This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.

Edit: I guess TIME_WAIT does not directly explain why the connection still fails after 2400s. But it could be a contributing factor.
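A rough sketch of that failure mode (not from the thread; the port number is arbitrary and exact behavior is OS-dependent): the closed connection's 4-tuple keeps the server port in TIME_WAIT, so an immediate re-bind can fail.

```python
import socket

# Open a listener, accept one connection, and close it from the server
# side first (active close), which leaves that endpoint in TIME_WAIT.
# A second bind to the same port may then fail with EADDRINUSE until the
# 2*MSL timer expires, unless SO_REUSEADDR is set (and Windows has its
# own, different semantics for SO_REUSEADDR).
def serve_once(port):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    cli = socket.create_connection(("127.0.0.1", port))
    conn, _ = srv.accept()
    conn.close()  # server closes first -> this endpoint heads to TIME_WAIT
    cli.close()
    srv.close()

serve_once(18444)
try:
    serve_once(18444)
except OSError as e:
    print("re-bind failed:", e)  # e.g. [Errno 98] Address already in use
```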
(Noticed that the handling of JSONRPCException -342 was changed in the beginning of June. Don’t see how that change would cause issues, but if it correlates with when you started seeing this, maybe it could be?)
> This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.

I don’t think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, …).
> Edit: I guess TIME_WAIT does not directly explain why the connection still fails after 2400s. But it could be a contributing factor.

I don’t think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always? However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356
> > This could potentially be an issue between two different tests as well, if they shut down/start up quickly enough to still be within the TCP TIME_WAIT period.
>
> I don’t think different tests affect each other, because the test_runner assigns non-overlapping spans of ports for each test case (rpc, p2p, …).

Yes, different tests would get different PIDs, which feed into p2p_port etc., so port collisions between tests should not happen unless PIDs are reused within a too short time period.

> > Edit: I guess TIME_WAIT does not directly explain why the connection still fails after 2400s. But it could be a contributing factor.
>
> I don’t think Bitcoin Core will re-try to start the HTTP RPC server, so if it fails once, it should fail always?

That is my assumption too.

> However, it looks like the node is never started at all. Same in https://github.com/bitcoin/bitcoin/actions/runs/9789643429/job/27029722411?pr=30356

Yup, it sure is mysterious. At least having counters for the silent failures could provide some clues. Do you think I should create a PR for #30390 (comment)?
Something definitely seems to be causing bitcoind to keep running but stall somewhere for 2400s without returning an exit code from .poll(). Or, as you say, Python Popen may be having issues even starting the process for some reason.

Tangent: In another project, I once saw issues with subprocesses stalling on writing to stdout due to the parent process not reading from it quickly enough, so the OS buffer in between the processes would fill up. It was solved by spinning up a dedicated thread in the parent process which continuously read stdout from the child. But in our case we seem to mostly be writing subprocess stdout & stderr to a file, so only running out of disk would potentially cause it to block in a similar fashion to the other project. But running out of disk would hopefully spawn other errors too, making the issue clear.
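For reference, a minimal sketch of that drain-thread workaround (names are illustrative, not from the test framework):

```python
import subprocess
import threading

# A daemon thread keeps reading the child's stdout so the OS pipe buffer
# between the processes can never fill up and block the child's writes.
def run_with_drain(cmd):
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    captured = []

    def drain():
        for line in proc.stdout:  # returns once the child closes stdout
            captured.append(line)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    proc.wait()
    t.join()
    return captured
```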
Read through the docs on Python Popen and didn't find anything suspicious. Would assume that Popen would raise an exception immediately if it fails to start bitcoind. Seems more likely that bitcoind has gone into an infinite loop/blocked for some reason early during startup.

Might be helpful if we could log the callstack(s) from the bitcoind process when hitting the 2400s timeout.
Also #28509.
Ah nice! I forgot about it. I guess it also underlines that the issue lies in starting the process.
Maybe GHA has some kind of “Anti-Virus” software that may block some processes?
> Yes, different tests would get different PIDs, which feed into p2p_port etc., so port collisions between tests should not happen unless PIDs are reused within a too short time period.

Probably unrelated to this issue at this point, but the test_runner provides the --portseed, so no PID is used.
> Also #28509.

Thanks for the added context @hebasto! ~~There seems to be some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link? Your comment links to an issue with subprocess.run, but that's a higher-level function implemented on top of subprocess.Popen, which we use.~~

> Maybe GHA has some kind of "Anti-Virus" software that may block some processes?

Little experience with GHA, but yeah, those kinds of things slowing down CI are not unheard of.
> Probably unrelated to this issue at this point, but the test_runner provides the --portseed, so no PID is used.

Thanks for pointing that out!

    test = self.test_list.pop(0)
    portseed = len(self.test_list)
    portseed_arg = ["--portseed={}".format(portseed)]

..that should indeed circumvent possible port-reuse issues stemming from OS PID reuse choices.

Given..

    def p2p_port(n):
        return PORT_MIN + n + (MAX_NODES * PortSeed.n) % (PORT_RANGE - 1 - MAX_NODES)
        # Constants: 11000 + n + (12 * PortSeed.n) % (5000 - 1 - 12)
        #          = 11000 + n + (12 * PortSeed.n) % (4987)

..we should be able to have 415 tests (4987 / 12) before tests at the beginning of the test_list risk colliding ports with tests at the end. From what I can see in test/functional/test_runner.py we have ~300 tests. :+1:
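As a sanity check on that arithmetic, a standalone sketch using the constants quoted above (with PortSeed.n passed in explicitly):

```python
PORT_MIN = 11000
PORT_RANGE = 5000
MAX_NODES = 12

def p2p_port(n, portseed):
    # Mirrors the quoted helper, parameterized on the portseed.
    return PORT_MIN + n + (MAX_NODES * portseed) % (PORT_RANGE - 1 - MAX_NODES)

# Consecutive portseeds are spaced 12 ports apart modulo 4987, so the
# offsets only wrap around (and can collide) after 4987 // 12 seeds.
print((PORT_RANGE - 1 - MAX_NODES) // MAX_NODES)  # 415
assert p2p_port(0, 0) != p2p_port(0, 1)  # 11000 vs 11012
```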
> There seems to be some agreement that upgrading to Python 3.12 would fix the issue. However, I was unable to find a documented fix, do you have a link?

This was just a wrong comment by myself. See #28509 (comment)
Maybe we could run procdump -mm <PID> on the bitcoind.exe process and hopefully figure out where the process is stalling?
Alright, I can live with having it as a pending PR until it has proven itself useful on CI. :)
Will see if I can get a Windows/Wine environment going locally and study how to integrate it into GHA.
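For the record, a rough sketch of what such an integration could call (a hypothetical helper, not part of the framework; it assumes Sysinternals ProcDump is on PATH on a Windows runner):

```python
import subprocess

def dump_stalled_node(pid, out_dir):
    # Write a minidump of the given process with Sysinternals ProcDump
    # (Windows only), so the stalled callstacks can be inspected after
    # the 2400s timeout fires.
    subprocess.run(
        ["procdump", "-accepteula", "-mm", str(pid), out_dir],
        check=True,
        timeout=60,
    )
```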
Seen in https://github.com/bitcoin/bitcoin/actions/runs/10851531017/job/30115514734?pr=24123#step:12:242:

      File "C:\hostedtoolcache\windows\Python\3.12.5\x64\Lib\socket.py", line 850, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it