This should fix #33640.
It also contains a few refactor cleanups, which are explained in the corresponding commits.
The other code in this file is using this pattern to throw when a key is
unset, instead of silently returning a None when using os.getenv(key)
with no default value specified.
So use the pattern here as well. As the env vars are always set, this
should be a refactor that does not change the behavior.
The bool is forced to check=True, but some commands may want to
explicitly use kwargs to disable it.
This refactor is needed for the next commit.
The build scripts inside the image retry after a failure. However, there
may be some rare network failures inside the container engine. For
example, when pulling the underlying base image, or when pulling the
docker cache.
Thus, retry after a failure once, which should hopefully fix
https://github.com/bitcoin/bitcoin/issues/33640.
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33677.
See the guideline for information on the review process.
| Type | Reviewers |
|---|---|
| ACK | l0rinc, kevkevinpal |
| Concept ACK | davidgumberg |
If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.
Reviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.
127+ f"--mount=type=bind,src={os.environ['BASE_READ_ONLY_DIR']},dst={os.environ['BASE_READ_ONLY_DIR']},readonly",
128+ f"--mount={CI_CCACHE_MOUNT}",
129+ f"--mount={CI_DEPENDS_MOUNT}",
130+ f"--mount={CI_DEPENDS_SOURCES_MOUNT}",
131+ f"--mount={CI_PREVIOUS_RELEASES_MOUNT}",
132+ *CI_BUILD_MOUNT,
Similar to this comment, #33639 (review)
maybe avoid using concats (e.g. via *list)
yea maybe something like this? Let me know if this looks good or if there are any issues you see with it
https://github.com/kevkevinpal/bitcoin/pull/184/commits/558c116d3fa49e3a103d70e0c18c873c1d534ae8
Hmm, it is interesting to see CI seemingly pass on your commit. However, I think you are just exploiting a bug in Docker. Passing an empty string as an argument is normally not equal to passing no argument. Maybe Docker is doing that internally, but Podman is not, and it will fail there.
You can also check this locally:
0$ podman run --rm "--cap-add=LINUX_IMMUTABLE" "" 'alpine:latest' echo hi
1Error: parsing reference "": repository name must have at least one component
or via the CI:
0# MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_native_alpine_musl.sh" ./ci/test_run_all.sh
1...
2Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
3+ docker run --rm --interactive --detach --tty --cap-add=LINUX_IMMUTABLE --mount=type=bind,src=/root/b-c-ci,dst=/root/b-c-ci,readonly --mount=type=volume,src=ci_native_alpine_musl_ccache,dst=/ci_container_base/ci/scratch/ccache --mount=type=volume,src=ci_native_alpine_musl_depends,dst=/ci_container_base/depends/built --mount=type=volume,src=ci_native_alpine_musl_depends_sources,dst=/ci_container_base/depends/sources --mount=type=volume,src=ci_native_alpine_musl_previous_releases,dst=/ci_container_base/prev_releases '' --env-file=/tmp/env-root-ci_native_alpine_musl --name=ci_native_alpine_musl --network=ci-ip6net --platform=linux ci_native_alpine_musl
4Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
5Error: parsing reference "": repository name must have at least one component
6Command '['docker', 'run', '--rm', '--interactive', '--detach', '--tty', '--cap-add=LINUX_IMMUTABLE', '--mount=type=bind,src=/root/b-c-ci,dst=/root/b-c-ci,readonly', '--mount=type=volume,src=ci_native_alpine_musl_ccache,dst=/ci_container_base/ci/scratch/ccache', '--mount=type=volume,src=ci_native_alpine_musl_depends,dst=/ci_container_base/depends/built', '--mount=type=volume,src=ci_native_alpine_musl_depends_sources,dst=/ci_container_base/depends/sources', '--mount=type=volume,src=ci_native_alpine_musl_previous_releases,dst=/ci_container_base/prev_releases', '', '--env-file=/tmp/env-root-ci_native_alpine_musl', '--name=ci_native_alpine_musl', '--network=ci-ip6net', '--platform=linux', 'ci_native_alpine_musl']' returned non-zero exit status 125.
7
8
9# git log -1 --oneline
10558c116 (HEAD) ci: CI_BUILD_MOUNT to f string instead of array
The only way to make that consistently pass is to create a single string of the command and pass it to a shell with the shellcheck SC2086 warning disabled. However, the whole goal of this commit is to do proper word splitting, as explained in the commit message.
ahh I didnt realize that Podman would fail with empty string
ok np what you have currently looks best then
92+
93+ CI_CCACHE_MOUNT = f"type=bind,src={os.environ['CCACHE_DIR']},dst={os.environ['CCACHE_DIR']}"
94+ CI_DEPENDS_MOUNT = f"type=bind,src={os.environ['DEPENDS_DIR']}/built,dst={os.environ['DEPENDS_DIR']}/built"
95+ CI_DEPENDS_SOURCES_MOUNT = f"type=bind,src={os.environ['DEPENDS_DIR']}/sources,dst={os.environ['DEPENDS_DIR']}/sources"
96+ CI_PREVIOUS_RELEASES_MOUNT = f"type=bind,src={os.environ['PREVIOUS_RELEASES_DIR']},dst={os.environ['PREVIOUS_RELEASES_DIR']}"
97+ CI_BUILD_MOUNT = [f"--mount=type=bind,src={os.environ['BASE_BUILD_DIR']},dst={os.environ['BASE_BUILD_DIR']}"]
Doesn’t this make more sense? And then you can change *CI_BUILD_MOUNT to CI_BUILD_MOUNT
0CI_BUILD_MOUNT = []
1...
2CI_BUILD_MOUNT = f"--mount=type=bind,src={os.environ['BASE_BUILD_DIR']},dst={os.environ['BASE_BUILD_DIR']}"
yea maybe something like this? Let me know if this looks good or if there are any issues you see with it
https://github.com/kevkevinpal/bitcoin/pull/184/commits/558c116d3fa49e3a103d70e0c18c873c1d534ae8
111+ # Still prune everything in case the filtered pruning doesn't work, or if labels were not set
112+ # on a previous run. Belt and suspenders approach, should be fine to remove in the future.
113+ # Prune images used by --external containers (e.g. build containers) when
114+ # using podman.
115+ print("Prune all dangling images")
116+ run(["podman", "image", "prune", "--force", "--external"], check=False)
check=False for these commands when previously there was no || true appended to the original statements
check=False.
118+ print(f"Prune all dangling {CI_IMAGE_LABEL} images")
119+ # When detecting podman-docker, `--external` should be added.
120+ run(["docker", "image", "prune", "--force", "--filter", f"label={CI_IMAGE_LABEL}"])
121+
122+ cmd_run = ["docker", "run", "--rm", "--interactive", "--detach", "--tty"]
123+ cmd_run += [
nit to avoid modification after creation:
0 cmd_run = [
1 "docker", "run", "--rm", "--interactive", "--detach", "--tty",
73- --name "$CONTAINER_NAME" \
74- --network ci-ip6net \
75- --platform="${CI_IMAGE_PLATFORM}" \
76- "$CONTAINER_NAME")
77- export CI_CONTAINER_ID
78 export CI_EXEC_CMD_PREFIX="docker exec ${CI_CONTAINER_ID}"
CI_CONTAINER_ID is not exported before (only in os.environ["CI_CONTAINER_ID"] = container_id)?
Yes, it should be correct. My understanding is that export CI_CONTAINER_ID is roughly equal to os.environ["CI_CONTAINER_ID"] = container_id in Python.
You can also see in the CI logs that it works. https://github.com/bitcoin/bitcoin/actions/runs/18776395325/job/53571955105?pr=33677#step:9:247
65@@ -65,7 +66,10 @@ def main():
66 cmd_build += [os.environ["BASE_READ_ONLY_DIR"]]
67
68 print(f"Building {os.environ['CONTAINER_NAME']} image tag to run in")
69- run(cmd_build)
70+ if run(cmd_build, check=False).returncode != 0:
71+ print(f"Retry building {os.environ['CONTAINER_NAME']} image tag after failure")
72+ time.sleep(3)
if I remember correctly, whenever this error occurred, all concurrent CI builds failed so 3 seconds and a single retry may not suffice in this case - but I don’t mind trying as such and add some exponential backoff in a future PR if this isn’t enough (e.g. 5 runs with 2 ** retry_count sleep).
Q: do we need to clean up anything after the first try to make sure the second call has more chance of success?
if I remember correctly, whenever this error occurred, all concurrent CI builds failed
In the linked issue in OP only one task failed
Q: do we need to clean up anything after the first try to make sure the second call has more chance of success?
If we had to clean up anything, it would be an upstream bug in Docker or Podman, which makes them less reproducible.
65@@ -65,7 +66,10 @@ def main():
66 cmd_build += [os.environ["BASE_READ_ONLY_DIR"]]
67
68 print(f"Building {os.environ['CONTAINER_NAME']} image tag to run in")
69- run(cmd_build)
70+ if run(cmd_build, check=False).returncode != 0:
is my understanding correct that previously check was implicit, raising an exception on failure and now the first call is allowed to fail silently in which case we retry once with checks enabled? So it’s basically:
0try:
1 run(cmd_build)
2except subprocess.CalledProcessError:
3 time.sleep(3)
4 run(cmd_build) # Retry once
run does not throw. You can find the function above and it behaves roughly equal to set -o errexit in Bash (exit immediately on failure).
139+
140+ container_id = run(
141+ cmd_run,
142+ stdout=subprocess.PIPE,
143+ text=True,
144+ encoding="ascii",
It seems to me we don’t usually use this when text is set:
0% git grep -r 'text=True' . | wc -l
1 37
2% git grep -r 'encoding="ascii"' . | wc -l
3 3
test/lint/lint-python-utf8-encoding.py give different vibes. Not sure what to do, but I can remove it in a follow-up or the next push.
LC_ALL here.
73+ run(cmd_build)
74+
75+ run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_ccache"], check=False)
76+ run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_depends"], check=False)
77+ run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_depends_sources"], check=False)
78+ run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_previous_releases"], check=False)
I don’t see docker supporting multiple volume creations but maybe we can loop to avoid the repetition:
0for suffix in ["ccache", "depends", "depends_sources", "previous_releases"]:
1 run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_{suffix}"], check=False)
83+ CI_PREVIOUS_RELEASES_MOUNT = f"type=volume,src={os.environ['CONTAINER_NAME']}_previous_releases,dst={os.environ['PREVIOUS_RELEASES_DIR']}"
84+ CI_BUILD_MOUNT = []
85+
86+ if os.getenv("DANGER_CI_ON_HOST_FOLDERS"):
87+ # ensure the directories exist
88+ run(["mkdir", "-p", os.environ["CCACHE_DIR"]])
https://man7.org/linux/man-pages/man1/mkdir.1.html seems to support batch creation so maybe we can reduce the noise a bit:
0run([
1 "mkdir", "-p",
2 os.environ["CCACHE_DIR"],
3 f"{os.environ['DEPENDS_DIR']}/built",
4 f"{os.environ['DEPENDS_DIR']}/sources",
5 os.environ["PREVIOUS_RELEASES_DIR"],
6 os.environ["BASE_BUILD_DIR"], # Unset by default, must be defined externally
7])
But Python does support native folder creation, not sure we should delegate to bash here:
0for p in [
1 os.environ["CCACHE_DIR"],
2 f"{os.environ['DEPENDS_DIR']}/built",
3 f"{os.environ['DEPENDS_DIR']}/sources",
4 os.environ["PREVIOUS_RELEASES_DIR"],
5 os.environ["BASE_BUILD_DIR"], # Unset by default, must be defined externally
6]:
7 Path(p).mkdir(parents=True, exist_ok=True)
105+
106+ run(["docker", "network", "create", "--ipv6", "--subnet", "1111:1111::/112", "ci-ip6net"], check=False)
107+
108+ if os.getenv("RESTART_CI_DOCKER_BEFORE_RUN"):
109+ print("Restart docker before run to stop and clear all containers started with --rm")
110+ run(["podman", "container", "rm", "--force", "--all"]) # Similar to "systemctl restart docker"
timeout=12345 for a few of these or will the default suffice?
143@@ -144,6 +144,16 @@ def main():
144 ).stdout.strip()
145 os.environ["CI_CONTAINER_ID"] = container_id
146
147+ # GNU getopt is required for the CI_RETRY_EXE script
148+ if os.getenv("CI_OS_NAME") == "macos":
if not os.getenv("DANGER_RUN_CI_ON_HOST"): branch, so checking the CI host seems correct
Code review ACK fa586e86c4aa20c07f37488c13b2b797b35fdbd9
The small focused commits with detailed messages made the review very easy. I left a few suggestions for cases that could be made more pythonic (i.e. partial bash migration), but I don’t mind keeping it as is, seems correct to me.
This refactor does not change behavior, but it has a few benefits:
* The shellcheck SC2086 warning is disabled for the whole command, but
is only needed for the CI_CONTAINER_CAP env var. So in Python, only
pass this one env var to shlex.split() for proper word splitting.
* Future logic improvements can be implemented in Python.
The comments are moved, which can be checked via the git options:
--color-moved=dimmed-zebra --color-moved-ws=ignore-all-space
This was added in commit b705bade44973e61655d5f847f49d97fb5bb8393, but I
keep forgetting the background that this is needed for the retry Bash
script. So document it.
ACK 5555bce
Retrying building image on failure makes sense to me, and the refactor looks good
crACK https://github.com/bitcoin/bitcoin/pull/33677/commits/5555bce994b648f836c35a02570f22ae9ad36da3
Concept ACK on moving the CI script from bash to python and crACK https://github.com/bitcoin/bitcoin/pull/33677/commits/fabe516440c96bb7a6a6902195684d3802d64139 and https://github.com/bitcoin/bitcoin/pull/33677/commits/5555bce994b648f836c35a02570f22ae9ad36da3 as refactors with no behavior change, but those changes seem orthogonal to the PR title and linked issue and I think they would ideally be in a separate PR.
refactors with no behavior change, but those changes seem orthogonal to the PR title and linked issue and I think they would ideally be in a separate PR.
thx, i’ll open a dedicated follow-up for the remaining refactors. Though, I’ll leave this one as-is for now, as it already has 3 acks and seems rfm.