Looks like this happens only for some workers/users on the same machine. For example, it happens for `ci_worker_1738693519_026122670`, but for none of the others. It also tends to persist for several hours: once it has happened at least once, it happens on almost all subsequent runs on that worker.
Example failures:
- https://cirrus-ci.com/task/6482591089426432?logs=ci#L603
- https://cirrus-ci.com/task/6102859474796544?logs=ci#L624
- https://cirrus-ci.com/task/6278234062454784?logs=ci#L614
However, the same worker sometimes still passes, which can also be seen in the worker summary: https://0xb10c.github.io/bitcoin-core-ci-stats/tags/workerci_worker_1738693519_026122670/
On the passing run, the command `podman container rm --force --all` prints:

```
[05:49:48.929] time="2025-02-17T05:49:48-05:00" level=warning msg="StopSignal SIGTERM failed to stop container ci_native_nowallet_libbitcoinkernel in 10 seconds, resorting to SIGKILL"
[05:49:50.607] time="2025-02-17T05:49:50-05:00" level=error msg="Unable to clean up network for container be7a1370684d21a332fca91b3323679159787b30b9aa5469b8960e4ec632404c: \"tearing down network namespace configuration for container be7a1370684d21a332fca91b3323679159787b30b9aa5469b8960e4ec632404c: netavark: IO error: aardvark pid not found\""
[05:49:52.396] be7a1370684d21a332fca91b3323679159787b30b9aa5469b8960e4ec632404c
```
Also, the last aborted run seems to have happened during `podman run`: https://cirrus-ci.com/task/6702139617050624?logs=ci#L268
Thus, I believe this may be an upstream race bug in podman.
It would be good to minimize, report, and fix the bug upstream.
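A rough sketch of what a minimization attempt could look like, assuming the race is between the network namespace teardown and the aardvark-dns process going away when many containers are removed at once; the image and container names below are placeholders and not taken from the actual CI config:

```bash
#!/usr/bin/env bash
# Hypothetical stress loop to try to trigger the netavark/aardvark teardown
# race ("aardvark pid not found"). Not the real CI script; image and
# container names are placeholders.
set -euo pipefail

for i in $(seq 1 50); do
  # Each container gets its own network namespace managed by netavark,
  # with aardvark-dns staying alive while containers are running.
  podman run -d --name "netavark_repro_${i}" \
    docker.io/library/alpine:latest sleep 300
done

# Tear everything down at once, mirroring the CI cleanup step that
# produced the warning shown above.
podman container rm --force --all
```

If this reproduces the error, the container count and timing could be reduced step by step until a minimal case remains, which would make an upstream report against podman/netavark much easier to act on.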