Revisiting us self-hosting parts of our CI #31965

0xB10C opened this issue on February 28, 2025
  1. 0xB10C commented at 4:42 pm on February 28, 2025: contributor

    The project currently self-hosts the runners for ~half of our CI jobs. The other half of the CI jobs run on GitHub Actions, hosted by GitHub. In an offline discussion, the question came up whether we should move the self-hosted runners to some other, non-self-hosted infrastructure, or whether there are other ways to improve the following areas:

    • reduce infrastructure maintenance burden: someone has to take care of the servers, but that’s (normally) not too much work. It shouldn’t be a single person maintaining it, to reduce the bus factor and avoid a single point of failure.
    • increase performance: generally, faster CI is welcome, maybe solvable with more CPU, but also related to #30852 - it’s not clear yet what exactly is targeted - and performance could be improved with self-hosted CI too, so there’s no need to go non-self-hosted for this alone.
    • increase security/robustness: generally, running public CI jobs on self-hosted hardware is a security risk and not recommended - making it (more) secure is possible, but probably a lot of work that we might want to outsource, e.g. to a company that specializes in this

    Currently, the self-hosted CI has implicit caches for Docker images, ccache, and depends (sources and built). Caching would need to be implemented on non-self-hosted runners too. Some more discussion about the implicit caches is here.
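
    For illustration only (the paths, key scheme, and action version below are assumptions, not the project’s actual configuration), ccache and depends caching on GitHub-hosted runners could be wired up with actions/cache roughly like this, subject to the cache size limit mentioned below:

      # Hypothetical step (not the project's actual config): persist ccache and
      # the built depends tree between runs on GitHub-hosted runners.
      - name: Restore ccache and depends
        uses: actions/cache@v4
        with:
          path: |
            ~/.ccache
            depends/built
            depends/sources
          key: ${{ runner.os }}-ci-${{ github.job }}-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-ci-${{ github.job }}-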

    Some mentioned ideas, but there are probably more approaches worth exploring:

    • GitHub Actions: move all jobs to GitHub Actions. The GitHub cache is limited to 10 GB, which is too small. PR authors can’t re-run PRs. Even more dependency on GitHub/Microsoft, but we already have half of the jobs there.
      See also #30304 (ci: Move more tasks to GHA?).
    • runs-on.com: would need to move the Cirrus jobs to the GitHub Actions format. Allows using a much bigger cache, and is potentially cheaper than GitHub Actions?
    • actions-runner-controller: bring our own machines for a Kubernetes cluster, or rent pods on demand, and let the controller spin things up for new jobs. In theory, this helps with performance and probably with security. Would need to move from Cirrus to GitHub Actions.
    • Cirrus CI on cloud machines: use cloud machines and keep Cirrus. Requires us to figure out caching.
    • bitcoin-core-cirrus-runner: an experimental NixOS module that runs jobs in an isolated, ephemeral, and minimal QEMU VM. Would still be self-hosted CI; it addresses some of the security concerns about the current CI, but it’s not production-ready yet and is probably a more complex solution that wouldn’t reduce the maintenance burden in the long run.
    • Keep the status quo: use self-hosted, persistent, but potentially insecure Cirrus runners - “if it ain’t broken, don’t fix it”
  2. rishavtarway commented at 4:16 am on March 2, 2025: none

    @0xB10C Would it be possible to think some more about this idea:

    1. Hybrid Approach with Optimized Caching:

    => Key factors of this approach:

    i) Instead of a complete switch, maintain a smaller, more secure self-hosted infrastructure specifically for caching.

    ii) Use GitHub Actions or another cloud-based CI for the majority of the job execution.

    iii) Develop a robust, distributed caching system (e.g., using a dedicated cloud storage service or a distributed cache like Redis) that integrates with both self-hosted and cloud CI.

    This way, it allows for the speed of cloud CI while utilizing the control of self-hosted infrastructure for critical caching.

  3. m3dwards commented at 2:29 pm on March 3, 2025: contributor

    CC @willcl-ark

    There are also some companies that offer drop-in replacements for GitHub runners, such as BuildJet. They will also often provide their own caching options.

    I quite like the idea of all of CI being defined as GitHub Actions, which means all jobs should be runnable on forks (if a little slower, as they wouldn’t be on bigger runners).

    Assuming we were happy to migrate the jobs to GitHub Actions, the next consideration is which runners to use:

    To put these on a spectrum from easy to hard to administer and maintain, it would be:

    GitHub’s runners -> BuildJet -> runs-on.com -> actions-runner-controller

    Then, considering the cost per minute on an 8-core runner, from expensive to inexpensive:

    GitHub’s runners $0.032 -> actions-runner-controller $?? -> BuildJet $0.016 -> runs-on.com $0.0034

    The cost of actions-runner-controller is hard to estimate, as it would depend on the autoscaling of the k8s cluster.

    It’s my opinion that the sweet spot between cost and administration burden is between BuildJet and Runs-On. BuildJet requires almost no administration, and Runs-On advertises itself as only needing a few hours a year, but I think we should assume that there would need to be a few people comfortable enough with AWS to fix things once a year or so if something goes wrong. Runs-On is provisioned using a CloudFormation template.

    Assuming we have enough people happy to administer a runs-on.com AWS stack (I would be), then I think it is perhaps worth exploring moving jobs there. If people want as little administration as possible, perhaps we should look at BuildJet (or equivalents).

  4. maflcko added the label Tests on Mar 9, 2025
  5. maflcko commented at 1:26 pm on March 9, 2025: member

    Thanks for the survey and summary, as well as the additional input on this topic!

    I don’t have a strong opinion on this, and anything is probably fine, as long as it “works”.

    Just some random thoughts:

    • The GHA runners that are included for public repositories are likely too weak to run all of the CI tasks in the maximum timeout (apart from the cache being too small as well, as mentioned above). So the goal that more of the CI tasks can run on forks trivially may not be achievable.
    • The 32-bit arm CI task (https://github.com/bitcoin/bitcoin/blob/4637cb1eec48d1af8d23eeae1bb4c6f8de55eed9/ci/test/00_setup_env_arm.sh#L9-L10) likely can’t run on any of the mentioned alternatives? It should be possible to drop the 32-bit mode from the task, emulate it (slower), or move it to a “nightly” task. However, another CI task may be added in the future that requires special hardware that isn’t available. So the goal that all tasks are moved to a single provider/hoster and will stay there may not be achievable without further changes to the CI tasks.
    • The “server-part” of CI maintenance with persistent runners is a smaller chunk of work. The majority of CI issues are unrelated to the infra and will have to be fixed regardless of where the CI runs. Roughly, they can be found via the “CI failed” label: https://github.com/bitcoin/bitcoin/issues?q=label%3A%22CI%20failed%22%20is%3Aissue
    • It should be easy to increase the CI capacity (https://github.com/bitcoin/bitcoin/issues/30852) with the current setup, but it’s probably not worth looking into, given that the direction isn’t clear yet.

    However, I haven’t confirmed my thoughts by doing a full test of all available options. (I don’t really have the time to create accounts for each cloud provider (GCE/AWS/AZ), then an account for each CI provider (ARC/runs-on/build-jet/…), and come up with the CI config specific to each provider …). If someone wants to take a look here, I am happy to answer questions and review pull requests. As mentioned above, anything that “works” should be fine and the requirements are:

    • All current CI tasks should be covered fully, only dropping coverage where it is reasonable and required.
    • Caching, as mentioned by @0xB10C above. Specifically, the msan task needs some kind of caching (likely Docker build layer caching?), as it is not possible to do without: see #31850 (comment). All other tasks can probably do without install caching, but they need depends caching and ccache caching. (depends and ccache caching is required, because most CI tasks have a 100% hit-rate, according to https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache/, and reducing it will likely result in a longer run-time.) A rough sketch of the Docker layer caching follows below.
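
    For illustration only (the action versions, file name, and scope are assumptions; the actual CI builds the image from its own scripts), Docker build-layer caching backed by the GHA cache could look roughly like this, mirroring the --cache-from/--cache-to type=gha flags visible in the logs quoted later in this thread:

      # Hypothetical workflow steps: build the CI image with BuildKit layer
      # caching stored in the GitHub Actions cache backend.
      - uses: docker/setup-buildx-action@v3
      - name: Build CI image with layer cache
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ci/test_imagefile
          tags: ci_native_fuzz
          load: true
          cache-from: type=gha,scope=ci_native_fuzz
          cache-to: type=gha,mode=max,ignore-error=true,scope=ci_native_fuzz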

    If someone has a proof-of-concept (or two), I’d say it should be fine to apply it to one (or two) CI tasks for a month to see how it goes on master, and then revert it within one month, or apply it to other tasks as well.

  6. maflcko commented at 6:21 pm on March 12, 2025: member
    Forgot to mention it, but if Cirrus is replaced, it may be good to also update DrahtBot to properly assign the “CI failed” label, so that it is easy to monitor for (intermittent) CI failures via https://github.com/bitcoin/bitcoin/pulls?q=is%3Aopen+label%3A%22CI+failed%22+sort%3Aupdated-desc++is%3Apr+-label%3A%22Needs+rebase%22++
  7. fkorotkov commented at 10:19 pm on May 15, 2025: none
    Cirrus CI founder here. Just wanted to add my 2 cents that we now also have a service of self-hosted runners called Cirrus Runners, with fast machines and bigger caches. Another main difference is that it’s a fixed price for unlimited minutes per month, hence we buy servers off traditional clouds like AWS/GCP/Azure.
  8. maflcko commented at 2:51 pm on August 4, 2025: member

    Hmm, so I looked into re-running GHA CI tasks to catch silent merge conflicts. However, GitHub does not allow this and blocks any re-run if the push is older than 30 days, even if the task has been re-run recently. Example: https://github.com/bitcoin/bitcoin/actions/runs/16022448902/job/47185933876?pr=32856 (cannot be re-run anymore)

    If the ability to re-run CI tasks is removed without replacement, this will likely lead to more pull requests being merged into master, where the CI fails.

    The solution by GitHub is to use “merge trains”, which is incompatible with our merge process.

    Any ideas on how to handle this?

  9. achow101 commented at 7:26 pm on August 4, 2025: member

    Any ideas on how to handle this?

    From https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#pull_request, it looks like there are other activity types that we can enable to trigger a workflow run. Maybe we could have a workflow that is also triggered by the labeled activity type when some “periodic CI” label is added; DrahtBot could then add that label to PRs to run a particular task again?

    Maybe we could use https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#workflow_dispatch to make a workflow that can only be triggered manually, and this would be what DrahtBot runs periodically? This assumes that a manually triggered workflow is different from a re-run, so it isn’t limited by the 30 days.
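
    For illustration only (the label name, job name, and step content are placeholders, not an agreed design), both suggestions could be combined in a single workflow along these lines:

      # Hypothetical workflow: runs when a "periodic CI" label is added to a PR
      # (e.g. by DrahtBot), or when it is dispatched manually/via the API.
      on:
        pull_request:
          types: [labeled]
        workflow_dispatch:

      jobs:
        periodic-ci:
          if: github.event_name == 'workflow_dispatch' || github.event.label.name == 'periodic CI'
          runs-on: ubuntu-latest
          steps:
            - uses: actions/checkout@v4
            # Placeholder step; the actual re-run of the desired CI task would go here.
            - run: echo "re-run the desired CI task"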

  10. maflcko commented at 11:13 am on September 3, 2025: member
    Another follow-up idea would be to port https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ to GHA, to see if this improved/worsened scheduling/runtime.
  11. 0xB10C commented at 11:23 am on September 3, 2025: contributor

    Another follow-up idea would be to port https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ to GHA, to see if this improved/worsened scheduling/runtime.

    Tracking this here: https://github.com/0xB10C/bitcoin-core-ci-stats/issues/8

  12. purpleKarrot commented at 2:50 pm on September 3, 2025: contributor

    I quite like the idea of all of CI being defined as Github Actions

    I prefer the idea that all CI is decoupled from the project, and that both the project and the CI are implemented against a defined interface: the cmake project definition. I presented this approach at C++Now 2025 as Effective CTest.

    Defining CI in a YAML file that is part of the project essentially states “it works on my machine(s)”. The project will start to support variables that are defined only in the CI configuration, and it will deviate from the default cmake workflow (example: running ctest directly will fail, and a special script or a custom target is required to run all tests).

    A better approach is to define several CI build scripts that are project-independent and then run the same CI scripts on a wide range of projects. Also: ensure that all projects are CI-script-agnostic. Ideally, aggregate build results from external parties on a common dashboard.

  13. maflcko commented at 3:24 pm on September 3, 2025: member

    Defining CI in a YAML file that is part of the project essentially states “it works on my machine(s)”. The project will start to support variables that are defined only in the CI configuration, and it will deviate from the default cmake workflow (example: running ctest directly will fail, and a special script or a custom target is required to run all tests).

    I think this is a pre-existing problem. It was never possible to run all tests (fuzz, functional, lint) from the build system (autotools, cmake, ctest, …). So I think this may be best tracked in a separate issue.

    As for the CI scripts: they are written in a way to ideally be reproducible: https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts. If a task fails/passes on one machine, it should also fail/pass on any other. There is an effort to confirm this, and it should be easy to set up, but of course more people running nightly backup CI can be useful (e.g. running on different filesystems, hardware, architectures, …). But this also seems unrelated to moving the CI from one machine to another and is best discussed in a separate issue?

  14. m3dwards commented at 3:28 pm on September 3, 2025: contributor
    Further, a lot of the YAML is actually setting up things like caching, which is not only essential for reasonable performance but is also vendor-specific, and so couldn’t really exist in reproducible CI scripts, either as they are written now or in cmake.
  15. purpleKarrot commented at 3:53 pm on September 3, 2025: contributor

    Further, a lot of the YAML is actually setting things up like caching which is not only essential for reasonable performance but also is vendor specific and so couldn’t really exist in reproducible CI scripts either written as they are or in cmake.

    I don’t see how setting up a cache could require YAML in a way that is not possible to reproduce in the cmake language. I explicitly covered caching in my presentation.

  16. maflcko commented at 8:15 am on September 4, 2025: member

    When testing the GHA caching, it seems to be down (at least yesterday and today):

    https://github.com/maflcko/bitcoin-core-qa-assets/actions/runs/17436331249/job/49574416227#step:7:173

     + ./ci/test/02_run_container.sh
     + '[' -z '' ']'
     + MAYBE_CPUSET=
     + '[' '' ']'
     + echo 'Creating mirror.gcr.io/ubuntu:24.04 container to run in'
     + docker buildx build --file /home/runner/work/bitcoin-core-qa-assets/bitcoin-core-qa-assets/ci/test_imagefile --build-arg CI_IMAGE_NAME_TAG=mirror.gcr.io/ubuntu:24.04 --build-arg FILE_ENV=./ci/test/00_setup_env_native_fuzz.sh --build-arg BASE_ROOT_DIR=/home/runner/work/_temp --platform=linux --label=bitcoin-ci-test --tag=ci_native_fuzz --cache-from type=gha,scope=ci_native_fuzz --cache-to type=gha,mode=max,ignore-error=true,scope=ci_native_fuzz --load /home/runner/work/bitcoin-core-qa-assets/bitcoin-core-qa-assets
     Creating mirror.gcr.io/ubuntu:24.04 container to run in
     #0 building with "builder-5b3273a2-41c7-4250-8250-02abbf1924d6" instance using docker-container driver

     #1 [internal] load build definition from test_imagefile
     #1 transferring dockerfile: 722B done
     #1 DONE 0.0s

     #2 [internal] load metadata for mirror.gcr.io/ubuntu:24.04
     #2 DONE 3.0s

     #3 [internal] load .dockerignore
     #3 transferring context: 2B done
     #3 DONE 0.0s

     #4 importing cache manifest from gha:5102315559629195803
     #4 ERROR: failed to parse error response 400: <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>0pkm5aAAAAACfachw/pXoToewnJYxmbvoUEhYMzFFREdFMDUwNgBFZGdl: invalid character '<' looking for beginning of value

    Does anyone know if this ever worked, and has a log of what it should look like?

  17. maflcko commented at 8:45 am on September 4, 2025: member

    Interestingly, the armhf task is using GHA runners even in this repo, and it fails with a different error message:

    https://github.com/bitcoin/bitcoin/actions/runs/17447989750/job/49546894060?pr=33300#step:8:154:

     + echo 'Creating mirror.gcr.io/ubuntu:24.04 container to run in'
     + docker buildx build --file /home/runner/work/bitcoin/bitcoin/ci/test_imagefile --build-arg CI_IMAGE_NAME_TAG=mirror.gcr.io/ubuntu:24.04 --build-arg FILE_ENV=./ci/test/00_setup_env_arm.sh --build-arg BASE_ROOT_DIR=/home/runner/work/_temp --platform=linux/arm64 --label=bitcoin-ci-test --tag=ci_arm_linux --cache-from type=gha,url=http://127.0.0.1:12321/,url_v2=http://127.0.0.1:12321/,scope=ci_arm_linux --load /home/runner/work/bitcoin/bitcoin
     BASE_READ_ONLY_DIR=/home/runner/work/bitcoin/bitcoin
     LC_ALL=C.UTF-8
     CI_LIMIT_STACK_SIZE=1
     PREVIOUS_RELEASES_DIR=/home/runner/work/_temp/previous_releases
     DEBIAN_FRONTEND=noninteractive
     + ./ci/test/02_run_container.sh
     Creating mirror.gcr.io/ubuntu:24.04 container to run in
     #0 building with "builder-1bdaf7c5-2392-4278-86a6-8b98d6baa155" instance using docker-container driver

     #1 [internal] load build definition from test_imagefile
     #1 transferring dockerfile: 722B done
     #1 DONE 0.1s

     #2 [internal] load metadata for mirror.gcr.io/ubuntu:24.04
     #2 DONE 0.9s

     #3 [internal] load .dockerignore
     #3 transferring context: 2B done
     #3 DONE 0.0s

     #4 importing cache manifest from gha:3148970939802758695
     #4 ERROR: Post "http://127.0.0.1:12321/twirp/github.actions.results.api.v1.CacheService/GetCacheEntryDownloadURL": dial tcp 127.0.0.1:12321: connect: connection refused
  18. willcl-ark commented at 9:09 am on September 4, 2025: member

    Yes, I’ve seen GHA caching work; I’ll try to find logs.

    To your second comment: nice catch! When we switched back to using a combination of GH runners (for 32-bit ARM) and Cirrus for the rest, we didn’t handle configuring Docker differently. I will make a PR for this.

  19. willcl-ark commented at 9:18 am on September 4, 2025: member
    Opened #33302, which should sort this out.
  20. maflcko commented at 12:41 pm on September 4, 2025: member
    Created https://github.com/maflcko/DrahtBot/issues/59, because I don’t know how to fetch the pull request number from a check suite or check run.
