ci: Retry image building once on failure #33677

pull maflcko wants to merge 5 commits into bitcoin:master from maflcko:2510-ci-retry changing 2 files +92 −75
  1. maflcko commented at 12:21 pm on October 22, 2025: member

    This should fix #33640.

    It also contains a few refactor cleanups, which are explained in the corresponding commits.

  2. ci: Use os.environ[key] access when value must be set
    The other code in this file is using this pattern to throw when a key is
    unset, instead of silently returning a None when using os.getenv(key)
    with no default value specified.
    
    So use the pattern here as well. As the env vars are always set, this
    should be a refactor that does not change the behavior.
    fa8e4de5c3
  3. ci: Allow overwriting check option in run() helper
    The bool is forced to check=True, but some commands may want to
    explicitly use kwargs to disable it.
    
    This refactor is needed for the next commit.
    fa4dbe04d7
  4. ci: Retry image building once on failure
    The build scripts inside the image retry after a failure. However, there
    may be some rare network failures inside the container engine. For
    example, when pulling the underlying base image, or when pulling the
    docker cache.
    
    Thus, retry after a failure once, which should hopefully fix
    https://github.com/bitcoin/bitcoin/issues/33640.
    fa6aa9f42f
  5. DrahtBot renamed this:
    ci: Retry image building once on failure
    ci: Retry image building once on failure
    on Oct 22, 2025
  6. DrahtBot added the label Tests on Oct 22, 2025
  7. DrahtBot commented at 12:21 pm on October 22, 2025: contributor

    The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

    Code Coverage & Benchmarks

    For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/33677.

    Reviews

    See the guideline for information on the review process.

    Type Reviewers
    ACK l0rinc, kevkevinpal
    Concept ACK davidgumberg

    If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

    Conflicts

    Reviewers, this pull request conflicts with the following ones:

    • #33362 (Run feature_bind_port_(discover|externalip).py in CI by vasild)
    • #31349 (ci: detect outbound internet traffic generated while running tests by vasild)

    If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

  8. maflcko force-pushed on Oct 22, 2025
  9. maflcko force-pushed on Oct 22, 2025
  10. in ci/test/02_run_container.py:133 in fa6aaf16a5 outdated
    127+            f"--mount=type=bind,src={os.environ['BASE_READ_ONLY_DIR']},dst={os.environ['BASE_READ_ONLY_DIR']},readonly",
    128+            f"--mount={CI_CCACHE_MOUNT}",
    129+            f"--mount={CI_DEPENDS_MOUNT}",
    130+            f"--mount={CI_DEPENDS_SOURCES_MOUNT}",
    131+            f"--mount={CI_PREVIOUS_RELEASES_MOUNT}",
    132+            *CI_BUILD_MOUNT,
    


    kevkevinpal commented at 6:09 pm on October 22, 2025:

    Similar to this comment, #33639 (review)

    maybe avoid using concats (e.g. via *list)


    maflcko commented at 6:57 am on October 23, 2025:
    Yes, that is what the code here is already doing. I don’t understand the suggestion, do you have a diff I can push?

    kevkevinpal commented at 9:10 pm on October 23, 2025:

    yea maybe something like this? Let me know if this looks good or if there are any issues you see with it

    https://github.com/kevkevinpal/bitcoin/pull/184/commits/558c116d3fa49e3a103d70e0c18c873c1d534ae8


    maflcko commented at 8:09 am on October 24, 2025:

    Hmm, it is interesting to see CI seemingly pass on your commit. However, I think you are just exploiting a bug in Docker. Passing an empty string as an argument is normally not equal to passing no argument. Maybe Docker is doing that internally, but Podman is not, and it will fail there.

    You can also check this locally:

    0$ podman run --rm "--cap-add=LINUX_IMMUTABLE" "" 'alpine:latest' echo hi
    1Error: parsing reference "": repository name must have at least one component
    

    or via the CI:

     0# MAKEJOBS="-j$(nproc)" FILE_ENV="./ci/test/00_setup_env_native_alpine_musl.sh" ./ci/test_run_all.sh
     1...
     2Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
     3+ docker run --rm --interactive --detach --tty --cap-add=LINUX_IMMUTABLE --mount=type=bind,src=/root/b-c-ci,dst=/root/b-c-ci,readonly --mount=type=volume,src=ci_native_alpine_musl_ccache,dst=/ci_container_base/ci/scratch/ccache --mount=type=volume,src=ci_native_alpine_musl_depends,dst=/ci_container_base/depends/built --mount=type=volume,src=ci_native_alpine_musl_depends_sources,dst=/ci_container_base/depends/sources --mount=type=volume,src=ci_native_alpine_musl_previous_releases,dst=/ci_container_base/prev_releases '' --env-file=/tmp/env-root-ci_native_alpine_musl --name=ci_native_alpine_musl --network=ci-ip6net --platform=linux ci_native_alpine_musl
     4Emulate Docker CLI using podman. Create /etc/containers/nodocker to quiet msg.
     5Error: parsing reference "": repository name must have at least one component
     6Command '['docker', 'run', '--rm', '--interactive', '--detach', '--tty', '--cap-add=LINUX_IMMUTABLE', '--mount=type=bind,src=/root/b-c-ci,dst=/root/b-c-ci,readonly', '--mount=type=volume,src=ci_native_alpine_musl_ccache,dst=/ci_container_base/ci/scratch/ccache', '--mount=type=volume,src=ci_native_alpine_musl_depends,dst=/ci_container_base/depends/built', '--mount=type=volume,src=ci_native_alpine_musl_depends_sources,dst=/ci_container_base/depends/sources', '--mount=type=volume,src=ci_native_alpine_musl_previous_releases,dst=/ci_container_base/prev_releases', '', '--env-file=/tmp/env-root-ci_native_alpine_musl', '--name=ci_native_alpine_musl', '--network=ci-ip6net', '--platform=linux', 'ci_native_alpine_musl']' returned non-zero exit status 125.
     7
     8
     9# git log -1 --oneline
    10558c116 (HEAD) ci: CI_BUILD_MOUNT to f string instead of array
    

    The only way to make that consistently pass is to create a single string of the command and pass it to a shell with the shellcheck SC2086 warning disabled. However, the whole goal of this commit is to do proper word splitting, as explained in the commit message.


    kevkevinpal commented at 1:22 pm on October 24, 2025:

    ahh I didnt realize that Podman would fail with empty string

    ok np what you have currently looks best then

  11. in ci/test/02_run_container.py:99 in fa6aaf16a5 outdated
    92+
    93+            CI_CCACHE_MOUNT = f"type=bind,src={os.environ['CCACHE_DIR']},dst={os.environ['CCACHE_DIR']}"
    94+            CI_DEPENDS_MOUNT = f"type=bind,src={os.environ['DEPENDS_DIR']}/built,dst={os.environ['DEPENDS_DIR']}/built"
    95+            CI_DEPENDS_SOURCES_MOUNT = f"type=bind,src={os.environ['DEPENDS_DIR']}/sources,dst={os.environ['DEPENDS_DIR']}/sources"
    96+            CI_PREVIOUS_RELEASES_MOUNT = f"type=bind,src={os.environ['PREVIOUS_RELEASES_DIR']},dst={os.environ['PREVIOUS_RELEASES_DIR']}"
    97+            CI_BUILD_MOUNT = [f"--mount=type=bind,src={os.environ['BASE_BUILD_DIR']},dst={os.environ['BASE_BUILD_DIR']}"]
    


    kevkevinpal commented at 6:29 pm on October 22, 2025:

    Doesn’t this make more sense? And then you can change *CI_BUILD_MOUNT to CI_BUILD_MOUNT

    0CI_BUILD_MOUNT = []
    1...
    2CI_BUILD_MOUNT = f"--mount=type=bind,src={os.environ['BASE_BUILD_DIR']},dst={os.environ['BASE_BUILD_DIR']}"
    

    maflcko commented at 6:56 am on October 23, 2025:
    I don’t think I can. do you have a working diff I can push?

    kevkevinpal commented at 9:10 pm on October 23, 2025:

    yea maybe something like this? Let me know if this looks good or if there are any issues you see with it

    https://github.com/kevkevinpal/bitcoin/pull/184/commits/558c116d3fa49e3a103d70e0c18c873c1d534ae8

  12. in ci/test/02_run_container.py:116 in fa6aaf16a5
    111+            # Still prune everything in case the filtered pruning doesn't work, or if labels were not set
    112+            # on a previous run. Belt and suspenders approach, should be fine to remove in the future.
    113+            # Prune images used by --external containers (e.g. build containers) when
    114+            # using podman.
    115+            print("Prune all dangling images")
    116+            run(["podman", "image", "prune", "--force", "--external"], check=False)
    


    kevkevinpal commented at 6:48 pm on October 22, 2025:
    Why add check=False for these commands when previously there was no || true appended to the original statements

    maflcko commented at 6:56 am on October 23, 2025:
    Ah, nice catch. Removed the check=False.
  13. maflcko force-pushed on Oct 23, 2025
  14. in ci/test/02_run_container.py:125 in fa1b8611a0 outdated
    118+        print(f"Prune all dangling {CI_IMAGE_LABEL} images")
    119+        # When detecting podman-docker, `--external` should be added.
    120+        run(["docker", "image", "prune", "--force", "--filter", f"label={CI_IMAGE_LABEL}"])
    121+
    122+        cmd_run = ["docker", "run", "--rm", "--interactive", "--detach", "--tty"]
    123+        cmd_run += [
    


    l0rinc commented at 9:36 am on October 24, 2025:

    nit to avoid modification after creation:

    0        cmd_run = [
    1              "docker", "run", "--rm", "--interactive", "--detach", "--tty",
    

    maflcko commented at 11:13 am on October 24, 2025:
    I don’t think this works with auto-formatters
  15. in ci/test/02_run_container.sh:12 in fa1b8611a0 outdated
    73-                  --name "$CONTAINER_NAME" \
    74-                  --network ci-ip6net \
    75-                  --platform="${CI_IMAGE_PLATFORM}" \
    76-                  "$CONTAINER_NAME")
    77-  export CI_CONTAINER_ID
    78   export CI_EXEC_CMD_PREFIX="docker exec ${CI_CONTAINER_ID}"
    


    l0rinc commented at 9:37 am on October 24, 2025:
    Seems correct to me, but to be sure: does this still work correctly now that CI_CONTAINER_ID is not exported before (only in os.environ["CI_CONTAINER_ID"] = container_id)?

    maflcko commented at 11:13 am on October 24, 2025:

    Yes, it should be correct. My understanding is that export CI_CONTAINER_ID is roughly equal to os.environ["CI_CONTAINER_ID"] = container_id in Python.

    You can also see in the CI logs that it works. https://github.com/bitcoin/bitcoin/actions/runs/18776395325/job/53571955105?pr=33677#step:9:247

  16. in ci/test/02_run_container.py:71 in fa6aa9f42f outdated
    65@@ -65,7 +66,10 @@ def main():
    66         cmd_build += [os.environ["BASE_READ_ONLY_DIR"]]
    67 
    68         print(f"Building {os.environ['CONTAINER_NAME']} image tag to run in")
    69-        run(cmd_build)
    70+        if run(cmd_build, check=False).returncode != 0:
    71+            print(f"Retry building {os.environ['CONTAINER_NAME']} image tag after failure")
    72+            time.sleep(3)
    


    l0rinc commented at 9:39 am on October 24, 2025:

    if I remember correctly, whenever this error occurred, all concurrent CI builds failed so 3 seconds and a single retry may not suffice in this case - but I don’t mind trying as such and add some exponential backoff in a future PR if this isn’t enough (e.g. 5 runs with 2 ** retry_count sleep).

    Q: do we need to clean up anything after the first try to make sure the second call has more chance of success?


    maflcko commented at 11:13 am on October 24, 2025:

    if I remember correctly, whenever this error occurred, all concurrent CI builds failed

    In the linked issue in OP only one task failed

    Q: do we need to clean up anything after the first try to make sure the second call has more chance of success?

    If we had to clean up anything, it would be an upstream bug in Docker or Podman, which makes them less reproducible.

  17. in ci/test/02_run_container.py:69 in fa6aa9f42f outdated
    65@@ -65,7 +66,10 @@ def main():
    66         cmd_build += [os.environ["BASE_READ_ONLY_DIR"]]
    67 
    68         print(f"Building {os.environ['CONTAINER_NAME']} image tag to run in")
    69-        run(cmd_build)
    70+        if run(cmd_build, check=False).returncode != 0:
    


    l0rinc commented at 9:41 am on October 24, 2025:

    is my understanding correct that previously check was implicit, raising an exception on failure and now the first call is allowed to fail silently in which case we retry once with checks enabled? So it’s basically:

    0try:
    1    run(cmd_build)
    2except subprocess.CalledProcessError:
    3    time.sleep(3)
    4    run(cmd_build)  # Retry once
    

    maflcko commented at 11:13 am on October 24, 2025:
    Yes, but run does not throw. You can find the function above and it behaves roughly equal to set -o errexit in Bash (exit immediately on failure).
  18. in ci/test/02_run_container.py:144 in fa1b8611a0 outdated
    139+
    140+        container_id = run(
    141+            cmd_run,
    142+            stdout=subprocess.PIPE,
    143+            text=True,
    144+            encoding="ascii",
    


    l0rinc commented at 9:53 am on October 24, 2025:

    It seems to me we don’t usually use this when text is set:

    0% git grep -r 'text=True' . | wc -l
    1      37
    2% git grep -r 'encoding="ascii"' . | wc -l
    3       3
    

    maflcko commented at 11:04 am on October 24, 2025:
    I actually don’t know if it is needed. The existing code and test/lint/lint-python-utf8-encoding.py give different vibes. Not sure what to do, but I can remove it in a follow-up or the next push.

    maflcko commented at 11:09 am on October 24, 2025:
    I guess it can be dropped after https://peps.python.org/pep-0686/, or even earlier, since we force utf8 via LC_ALL here.
  19. in ci/test/02_run_container.py:77 in fa1b8611a0 outdated
    73+            run(cmd_build)
    74+
    75+        run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_ccache"], check=False)
    76+        run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_depends"], check=False)
    77+        run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_depends_sources"], check=False)
    78+        run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_previous_releases"], check=False)
    


    l0rinc commented at 9:57 am on October 24, 2025:

    I don’t see docker supporting multiple volume creations but maybe we can loop to avoid the repetition:

    0for suffix in ["ccache", "depends", "depends_sources", "previous_releases"]:
    1    run(["docker", "volume", "create", f"{os.environ['CONTAINER_NAME']}_{suffix}"], check=False)
    

    maflcko commented at 11:04 am on October 24, 2025:
    thx, done
  20. in ci/test/02_run_container.py:87 in fa1b8611a0 outdated
    83+        CI_PREVIOUS_RELEASES_MOUNT = f"type=volume,src={os.environ['CONTAINER_NAME']}_previous_releases,dst={os.environ['PREVIOUS_RELEASES_DIR']}"
    84+        CI_BUILD_MOUNT = []
    85+
    86+        if os.getenv("DANGER_CI_ON_HOST_FOLDERS"):
    87+            # ensure the directories exist
    88+            run(["mkdir", "-p", os.environ["CCACHE_DIR"]])
    


    l0rinc commented at 9:58 am on October 24, 2025:

    https://man7.org/linux/man-pages/man1/mkdir.1.html seems to support batch creation so maybe we can reduce the noise a bit:

    0run([
    1    "mkdir", "-p",
    2    os.environ["CCACHE_DIR"],
    3    f"{os.environ['DEPENDS_DIR']}/built",
    4    f"{os.environ['DEPENDS_DIR']}/sources",
    5    os.environ["PREVIOUS_RELEASES_DIR"],
    6    os.environ["BASE_BUILD_DIR"],  # Unset by default, must be defined externally
    7])
    

    But Python does support native folder creation, not sure we should delegate to bash here:

    0for p in [
    1    os.environ["CCACHE_DIR"],
    2    f"{os.environ['DEPENDS_DIR']}/built",
    3    f"{os.environ['DEPENDS_DIR']}/sources",
    4    os.environ["PREVIOUS_RELEASES_DIR"],
    5    os.environ["BASE_BUILD_DIR"],   # Unset by default, must be defined externally
    6]:
    7    Path(p).mkdir(parents=True, exist_ok=True)
    

    maflcko commented at 11:04 am on October 24, 2025:
    thx, done
  21. in ci/test/02_run_container.py:111 in fa586e86c4 outdated
    105+
    106+        run(["docker", "network", "create", "--ipv6", "--subnet", "1111:1111::/112", "ci-ip6net"], check=False)
    107+
    108+        if os.getenv("RESTART_CI_DOCKER_BEFORE_RUN"):
    109+            print("Restart docker before run to stop and clear all containers started with --rm")
    110+            run(["podman", "container", "rm", "--force", "--all"])  # Similar to "systemctl restart docker"
    


    l0rinc commented at 10:12 am on October 24, 2025:
    do we need to add timeout=12345 for a few of these or will the default suffice?

    maflcko commented at 11:04 am on October 24, 2025:
    I want to keep if move-only for now.
  22. in ci/test/02_run_container.py:149 in fa586e86c4 outdated
    143@@ -144,6 +144,16 @@ def main():
    144         ).stdout.strip()
    145         os.environ["CI_CONTAINER_ID"] = container_id
    146 
    147+    # GNU getopt is required for the CI_RETRY_EXE script
    148+    if os.getenv("CI_OS_NAME") == "macos":
    


    l0rinc commented at 10:27 am on October 24, 2025:
    Ah, this is outside of the if not os.getenv("DANGER_RUN_CI_ON_HOST"): branch, so checking the CI host seems correct
  23. l0rinc commented at 10:32 am on October 24, 2025: contributor

    Code review ACK fa586e86c4aa20c07f37488c13b2b797b35fdbd9

    The small focused commits with detailed messages made the review very easy. I left a few suggestions for cases that could be made more pythonic (i.e. partial bash migration), but I don’t mind keeping it as is, seems correct to me.

  24. l0rinc approved
  25. maflcko force-pushed on Oct 24, 2025
  26. ci: Export the container id in python script
    This refactor does not change behavior, but it has a few benefits:
    
    * The shellcheck SC2086 warning is disabled for the whole command, but
      is only needed for the CI_CONTAINER_CAP env var. So in Python, only
      pass this one env var to shlex.split() for proper word splitting.
    * Future logic improvements can be implemented in Python.
    
    The comments are moved, which can be checked via the git options:
    --color-moved=dimmed-zebra --color-moved-ws=ignore-all-space
    fabe516440
  27. ci: Document why IN_GETOPT_BIN env var is needed on macOS
    This was added in commit b705bade44973e61655d5f847f49d97fb5bb8393, but I
    keep forgetting the background that this is needed for the retry Bash
    script. So document it.
    5555bce994
  28. maflcko force-pushed on Oct 24, 2025
  29. l0rinc commented at 11:18 am on October 24, 2025: contributor
    Code review reACK 5555bce994b648f836c35a02570f22ae9ad36da3 (only suggested nits applied since last review)
  30. kevkevinpal commented at 1:23 pm on October 24, 2025: contributor

    ACK 5555bce

    Retrying building image on failure makes sense to me, and the refactor looks good

  31. fanquake commented at 1:39 pm on October 24, 2025: member
  32. davidgumberg commented at 8:25 pm on October 28, 2025: contributor

    crACK https://github.com/bitcoin/bitcoin/pull/33677/commits/5555bce994b648f836c35a02570f22ae9ad36da3

    Concept ACK on moving the CI script from bash to python and crACK https://github.com/bitcoin/bitcoin/pull/33677/commits/fabe516440c96bb7a6a6902195684d3802d64139 and https://github.com/bitcoin/bitcoin/pull/33677/commits/5555bce994b648f836c35a02570f22ae9ad36da3 as refactors with no behavior change, but those changes seem orthogonal to the PR title and linked issue and I think they would ideally be in a separate PR.

  33. davidgumberg approved
  34. maflcko commented at 9:07 pm on October 28, 2025: member

    refactors with no behavior change, but those changes seem orthogonal to the PR title and linked issue and I think they would ideally be in a separate PR.

    thx, i’ll open a dedicated follow-up for the remaining refactors. Though, I’ll leave this one as-is for now, as it already has 3 acks and seems rfm.

  35. fanquake merged this on Oct 29, 2025
  36. fanquake closed this on Oct 29, 2025

  37. maflcko deleted the branch on Oct 29, 2025
  38. maflcko commented at 12:25 pm on October 29, 2025: member

github-metadata-mirror

This is a metadata mirror of the GitHub repository bitcoin/bitcoin. This site is not affiliated with GitHub. Content is generated from a GitHub metadata backup.
generated: 2025-11-02 18:12 UTC

This site is hosted by @0xB10C
More mirrored repositories can be found on mirror.b10c.me