Revisiting us self-hosting parts of our CI #31965

0xB10C opened this issue on February 28, 2025
  1. 0xB10C commented at 4:42 pm on February 28, 2025: contributor

    The project currently self-hosts the runners for ~half of our CI jobs. The other half of the CI jobs run on GitHub Actions, hosted by GitHub. In an offline discussion, the question came up whether we should move the self-hosted runners to some other, non-self-hosted infrastructure, or whether there are other ways to improve the following areas:

    • reduce infrastructure maintenance burden: someone has to take care of the servers, but that’s (normally) not too much work. It shouldn’t be a single person maintaining it, to reduce the bus factor and avoid a single point of failure.
    • increase performance: generally, faster CI is welcome, maybe solvable with more CPU, but also related to #30852 - it’s not clear yet what exactly is targeted - and performance could be improved with self-hosted CI too, so there’s no need to go non-self-hosted for this alone.
    • increase security/robustness: generally, running public CI jobs on self-hosted hardware is a security risk and not recommended - making it (more) secure is possible, but probably a lot of work that we might want to outsource, e.g. to a company that specializes in this

    Currently, the self-hosted CI has implicit caches for Docker images, ccache, and depends (sources and built). Caching would need to be implemented on non-self-hosted runners too. Some more discussion about the implicit caches is here.
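
    For illustration only (the paths, key scheme, and action version below are assumptions, not the project’s actual configuration), ccache and depends caching on GitHub-hosted runners could be wired up with actions/cache roughly like this, subject to the cache size limit mentioned below:

      # Hypothetical step (not the project's actual config): persist ccache and
      # the built depends tree between runs on GitHub-hosted runners.
      - name: Restore ccache and depends
        uses: actions/cache@v4
        with:
          path: |
            ~/.ccache
            depends/built
            depends/sources
          key: ${{ runner.os }}-ci-${{ github.job }}-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-ci-${{ github.job }}-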

    Some mentioned ideas, but there are probably more approaches worth exploring:

    • GitHub Actions: move all jobs to GitHub Actions. The GitHub cache is limited to 10 GB, which is too small. PR authors can’t re-run PRs. Even more dependency on GitHub/Microsoft, but we already have half of the jobs there.
      See also #30304 (ci: Move more tasks to GHA?).
    • runs-on.com: would need to move the Cirrus jobs to the GitHub Actions format. Allows using a much bigger cache, and is potentially cheaper than GitHub Actions?
    • actions-runner-controller: bring our own machines for a Kubernetes cluster, or rent pods on demand, and let the controller spin things up for new jobs. In theory, this helps with performance and probably with security. Would need to move from Cirrus to GitHub Actions.
    • Cirrus CI on cloud machines: use cloud machines and keep Cirrus. Requires us to figure out caching.
    • bitcoin-core-cirrus-runner: an experimental NixOS module that runs jobs in an isolated, ephemeral, and minimal QEMU VM. Would still be self-hosted CI; it addresses some of the security concerns about the current CI, but it’s not production-ready yet and is probably a more complex solution that wouldn’t reduce the maintenance burden in the long run.
    • Keep the status quo: use self-hosted, persistent, but potentially insecure Cirrus runners - “if it ain’t broken, don’t fix it”
  2. rishavtarway commented at 4:16 am on March 2, 2025: none

    @0xB10C Would it be possible to think some more about this idea:

    1. Hybrid Approach with Optimized Caching:

    => Key factors of this approach:

    i) Instead of a complete switch, maintain a smaller, more secure self-hosted infrastructure specifically for caching.

    ii) Use GitHub Actions or another cloud-based CI for the majority of the job execution.

    iii) Develop a robust, distributed caching system (e.g., using a dedicated cloud storage service or a distributed cache like Redis) that integrates with both self-hosted and cloud CI.

    This way, it allows for the speed of cloud CI while utilizing the control of self-hosted infrastructure for critical caching.

  3. m3dwards commented at 2:29 pm on March 3, 2025: contributor

    CC @willcl-ark

    There are also some companies that offer drop-in replacements for GitHub runners, such as BuildJet. They will also often provide their own caching options.

    I quite like the idea of all of CI being defined as GitHub Actions, which means all jobs should be runnable on forks (if a little slower, as they wouldn’t be on bigger runners).

    Assuming we were happy to migrate the jobs to GitHub Actions, the next consideration is which runners to use:

    To put these on a spectrum from easy to hard to administer and maintain, it would be:

    GitHub’s runners -> BuildJet -> runs-on.com -> actions-runner-controller

    Then, considering the cost per minute on an 8-core runner, from expensive to inexpensive:

    GitHub’s runners $0.032 -> actions-runner-controller $?? -> BuildJet $0.016 -> runs-on.com $0.0034

    The cost of actions-runner-controller is hard to estimate, as it would depend on the autoscaling of the k8s cluster.

    It’s my opinion that the sweet spot between cost and administration burden is between BuildJet and Runs-On. BuildJet requires almost no administration, and Runs-On advertises itself as only needing a few hours a year, but I think we should assume that there would need to be a few people comfortable enough with AWS to fix things once a year or so if something goes wrong. Runs-On is provisioned using a CloudFormation template.

    Assuming we have enough people happy to administer a runs-on.com AWS stack (I would be), then I think it is perhaps worth exploring moving jobs there. If people want as little administration as possible, perhaps we should look at BuildJet (or equivalents).

  4. maflcko added the label Tests on Mar 9, 2025
  5. maflcko commented at 1:26 pm on March 9, 2025: member

    Thanks for the survey and summary, as well as the additional input on this topic!

    I don’t have a strong opinion on this, and anything is probably fine, as long as it “works”.

    Just some random thoughts:

    • The GHA runners that are included for public repositories are likely too weak to run all of the CI tasks in the maximum timeout (apart from the cache being too small as well, as mentioned above). So the goal that more of the CI tasks can run on forks trivially may not be achievable.
    • The 32-bit arm CI task (https://github.com/bitcoin/bitcoin/blob/4637cb1eec48d1af8d23eeae1bb4c6f8de55eed9/ci/test/00_setup_env_arm.sh#L9-L10) likely can’t run on any of the mentioned alternatives? It should be possible to drop the 32-bit mode from the task, emulate it (slower), or move it to a “nightly” task. However, another CI task may be added in the future that requires special hardware that isn’t available. So the goal that all tasks are moved to a single provider/hoster and will stay there may not be achievable without further changes to the CI tasks.
    • The “server-part” of CI maintenance with persistent runners is a smaller chunk of work. The majority of CI issues are unrelated to the infra and will have to be fixed regardless of where the CI runs. Roughly, they can be found via the “CI failed” label: https://github.com/bitcoin/bitcoin/issues?q=label%3A%22CI%20failed%22%20is%3Aissue
    • It should be easy to increase the CI capacity (https://github.com/bitcoin/bitcoin/issues/30852) with the current setup, but it’s probably not worth looking into, given that the direction isn’t clear yet.

    However, I haven’t confirmed my thoughts by doing a full test of all available options. (I don’t really have the time to create accounts for each cloud provider (GCE/AWS/AZ), then an account for each CI provider (ARC/runs-on/build-jet/…), and come up with the CI config specific to each provider …). If someone wants to take a look here, I am happy to answer questions and review pull requests. As mentioned above, anything that “works” should be fine and the requirements are:

    • All current CI tasks should be covered fully, only dropping coverage where it is reasonable and required.
    • Caching, as mentioned by @0xB10C above. Specifically, the msan task needs some kind of caching (likely Docker build layer caching?), as it is not possible to do without: see #31850 (comment). All other tasks can probably do without install caching, but they need depends caching and ccache caching. (depends and ccache caching is required, because most CI tasks have a 100% hit-rate, according to https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ccache/, and reducing it will likely result in a longer run-time.) A rough sketch of the Docker layer caching follows below.
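
    For illustration only (the action versions, file name, and scope are assumptions; the actual CI builds the image from its own scripts), Docker build-layer caching backed by the GHA cache could look roughly like this, mirroring the --cache-from/--cache-to type=gha flags visible in the logs quoted later in this thread:

      # Hypothetical workflow steps: build the CI image with BuildKit layer
      # caching stored in the GitHub Actions cache backend.
      - uses: docker/setup-buildx-action@v3
      - name: Build CI image with layer cache
        uses: docker/build-push-action@v6
        with:
          context: .
          file: ci/test_imagefile
          tags: ci_native_fuzz
          load: true
          cache-from: type=gha,scope=ci_native_fuzz
          cache-to: type=gha,mode=max,ignore-error=true,scope=ci_native_fuzz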

    If someone has a proof-of-concept (or two), I’d say it should be fine to apply it to one (or two) CI tasks for a month to see how it goes on master, and then revert it within one month, or apply it to other tasks as well.

  6. maflcko commented at 6:21 pm on March 12, 2025: member
    Forgot to mention it, but if Cirrus is replaced, it may be good to also update DrahtBot to properly assign the “CI failed” label, so that it is easy to monitor for (intermittent) CI failures via https://github.com/bitcoin/bitcoin/pulls?q=is%3Aopen+label%3A%22CI+failed%22+sort%3Aupdated-desc++is%3Apr+-label%3A%22Needs+rebase%22++
  7. fkorotkov commented at 10:19 pm on May 15, 2025: none
    Cirrus CI founder here. Just wanted to add my 2 cents that we now also have a service of self-hosted runners called Cirrus Runners, with fast machines and bigger caches. Another main difference is that it’s a fixed price for unlimited minutes per month, hence we buy servers off traditional clouds like AWS/GCP/Azure.
  8. maflcko commented at 2:51 pm on August 4, 2025: member

    Hmm, so I looked into re-running GHA CI tasks to catch silent merge conflicts. However, GitHub does not allow this and blocks any re-run if the push is older than 30 days, even if the task has been re-run recently. Example: https://github.com/bitcoin/bitcoin/actions/runs/16022448902/job/47185933876?pr=32856 (cannot be re-run anymore)

    If the ability to re-run CI tasks is removed without replacement, this will likely lead to more pull requests being merged into master, where the CI fails.

    The solution by GitHub is to use “merge trains”, which is incompatible with our merge process.

    Any ideas on how to handle this?

  9. achow101 commented at 7:26 pm on August 4, 2025: member

    Any ideas on how to handle this?

    From https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#pull_request, it looks like there are other activity types that we can enable to trigger a workflow run. Maybe we could have a workflow that is also triggered by the labeled activity type when some “periodic CI” label is added; DrahtBot could then add that label to PRs to run a particular task again?

    Maybe we could use https://docs.github.com/en/actions/reference/workflows-and-actions/events-that-trigger-workflows#workflow_dispatch to make a workflow that can only be triggered manually, and this would be what DrahtBot runs periodically? This assumes that a manually triggered workflow is different from a re-run, so it isn’t limited by the 30 days.
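
    For illustration only (the label name, job name, and step content are placeholders, not an agreed design), both suggestions could be combined in a single workflow along these lines:

      # Hypothetical workflow: runs when a "periodic CI" label is added to a PR
      # (e.g. by DrahtBot), or when it is dispatched manually/via the API.
      on:
        pull_request:
          types: [labeled]
        workflow_dispatch:

      jobs:
        periodic-ci:
          if: github.event_name == 'workflow_dispatch' || github.event.label.name == 'periodic CI'
          runs-on: ubuntu-latest
          steps:
            - uses: actions/checkout@v4
            # Placeholder step; the actual re-run of the desired CI task would go here.
            - run: echo "re-run the desired CI task"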

  10. maflcko commented at 11:13 am on September 3, 2025: member
    Another follow-up idea would be to port https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ to GHA, to see if this improved/worsened scheduling/runtime.
  11. 0xB10C commented at 11:23 am on September 3, 2025: contributor

    Another follow-up idea would be to port https://0xb10c.github.io/bitcoin-core-ci-stats/graph/ to GHA, to see if this improved/worsened scheduling/runtime.

    Tracking this here: https://github.com/0xB10C/bitcoin-core-ci-stats/issues/8

  12. purpleKarrot commented at 2:50 pm on September 3, 2025: contributor

    I quite like the idea of all of CI being defined as Github Actions

    I prefer the idea that all CI is decoupled from the project, and that both the project and the CI are implemented against a defined interface: the cmake project definition. I presented this approach at C++Now 2025 as Effective CTest.

    Defining CI in a YAML file that is part of the project essentially states “it works on my machine(s)”. The project will start to support variables that are defined only in the CI configuration, and it will deviate from the default cmake workflow (example: running ctest directly will fail, and a special script or a custom target is required to run all tests).

    A better approach is to define several CI build scripts that are project-independent and then run the same CI scripts on a wide range of projects. Also: ensure that all projects are CI-script-agnostic. Ideally, aggregate build results from external parties on a common dashboard.

  13. maflcko commented at 3:24 pm on September 3, 2025: member

    Defining CI in a YAML file that is part of the project essentially states “it works on my machine(s)”. The project will start to support variables that are defined only in the CI configuration, and it will deviate from the default cmake workflow (example: running ctest directly will fail, and a special script or a custom target is required to run all tests).

    I think this is a pre-existing problem. It was never possible to run all tests (fuzz, functional, lint) from the build system (autotools, cmake, ctest, …). So I think this may be best tracked in a separate issue.

    As for the CI scripts: they are written in a way to ideally be reproducible: https://github.com/bitcoin/bitcoin/tree/master/ci#ci-scripts. If a task fails/passes on one machine, it should also fail/pass on any other. There is an effort to confirm this, and it should be easy to set up, but of course more people running nightly backup CI can be useful (e.g. running on different filesystems, hardware, architectures, …). But this also seems unrelated to moving the CI from one machine to another and is best discussed in a separate issue?

  14. m3dwards commented at 3:28 pm on September 3, 2025: contributor
    Further, a lot of the YAML is actually setting up things like caching, which is not only essential for reasonable performance but is also vendor-specific, and so couldn’t really exist in reproducible CI scripts, either as they are written now or in cmake.
  15. purpleKarrot commented at 3:53 pm on September 3, 2025: contributor

    Further, a lot of the YAML is actually setting things up like caching which is not only essential for reasonable performance but also is vendor specific and so couldn’t really exist in reproducible CI scripts either written as they are or in cmake.

    I don’t see how setting up a cache could require YAML in a way that is not possible to reproduce in the cmake language. I explicitly covered caching in my presentation.

  16. maflcko commented at 8:15 am on September 4, 2025: member

    When testing the GHA caching, it seems to be down (at least yesterday and today):

    https://github.com/maflcko/bitcoin-core-qa-assets/actions/runs/17436331249/job/49574416227#step:7:173

     + ./ci/test/02_run_container.sh
     + '[' -z '' ']'
     + MAYBE_CPUSET=
     + '[' '' ']'
     + echo 'Creating mirror.gcr.io/ubuntu:24.04 container to run in'
     + docker buildx build --file /home/runner/work/bitcoin-core-qa-assets/bitcoin-core-qa-assets/ci/test_imagefile --build-arg CI_IMAGE_NAME_TAG=mirror.gcr.io/ubuntu:24.04 --build-arg FILE_ENV=./ci/test/00_setup_env_native_fuzz.sh --build-arg BASE_ROOT_DIR=/home/runner/work/_temp --platform=linux --label=bitcoin-ci-test --tag=ci_native_fuzz --cache-from type=gha,scope=ci_native_fuzz --cache-to type=gha,mode=max,ignore-error=true,scope=ci_native_fuzz --load /home/runner/work/bitcoin-core-qa-assets/bitcoin-core-qa-assets
     Creating mirror.gcr.io/ubuntu:24.04 container to run in
     #0 building with "builder-5b3273a2-41c7-4250-8250-02abbf1924d6" instance using docker-container driver

     #1 [internal] load build definition from test_imagefile
     #1 transferring dockerfile: 722B done
     #1 DONE 0.0s

     #2 [internal] load metadata for mirror.gcr.io/ubuntu:24.04
     #2 DONE 3.0s

     #3 [internal] load .dockerignore
     #3 transferring context: 2B done
     #3 DONE 0.0s

     #4 importing cache manifest from gha:5102315559629195803
     #4 ERROR: failed to parse error response 400: <h2>Our services aren't available right now</h2><p>We're working to restore all services as soon as possible. Please check back soon.</p>0pkm5aAAAAACfachw/pXoToewnJYxmbvoUEhYMzFFREdFMDUwNgBFZGdl: invalid character '<' looking for beginning of value

    Does anyone know if this ever worked, and has a log of what it should look like?

  17. maflcko commented at 8:45 am on September 4, 2025: member

    Interestingly, the armhf task is using GHA runners even in this repo, and it fails with a different error message:

    https://github.com/bitcoin/bitcoin/actions/runs/17447989750/job/49546894060?pr=33300#step:8:154:

     + echo 'Creating mirror.gcr.io/ubuntu:24.04 container to run in'
     + docker buildx build --file /home/runner/work/bitcoin/bitcoin/ci/test_imagefile --build-arg CI_IMAGE_NAME_TAG=mirror.gcr.io/ubuntu:24.04 --build-arg FILE_ENV=./ci/test/00_setup_env_arm.sh --build-arg BASE_ROOT_DIR=/home/runner/work/_temp --platform=linux/arm64 --label=bitcoin-ci-test --tag=ci_arm_linux --cache-from type=gha,url=http://127.0.0.1:12321/,url_v2=http://127.0.0.1:12321/,scope=ci_arm_linux --load /home/runner/work/bitcoin/bitcoin
     BASE_READ_ONLY_DIR=/home/runner/work/bitcoin/bitcoin
     LC_ALL=C.UTF-8
     CI_LIMIT_STACK_SIZE=1
     PREVIOUS_RELEASES_DIR=/home/runner/work/_temp/previous_releases
     DEBIAN_FRONTEND=noninteractive
     + ./ci/test/02_run_container.sh
     Creating mirror.gcr.io/ubuntu:24.04 container to run in
     #0 building with "builder-1bdaf7c5-2392-4278-86a6-8b98d6baa155" instance using docker-container driver

     #1 [internal] load build definition from test_imagefile
     #1 transferring dockerfile: 722B done
     #1 DONE 0.1s

     #2 [internal] load metadata for mirror.gcr.io/ubuntu:24.04
     #2 DONE 0.9s

     #3 [internal] load .dockerignore
     #3 transferring context: 2B done
     #3 DONE 0.0s

     #4 importing cache manifest from gha:3148970939802758695
     #4 ERROR: Post "http://127.0.0.1:12321/twirp/github.actions.results.api.v1.CacheService/GetCacheEntryDownloadURL": dial tcp 127.0.0.1:12321: connect: connection refused
  18. willcl-ark commented at 9:09 am on September 4, 2025: member

    Yes, I’ve seen GHA caching work; I’ll try to find logs.

    To your second comment: nice catch! When we switched back to using a combination of GH runners (for 32-bit ARM) and Cirrus for the rest, we didn’t handle configuring Docker differently. I will make a PR for this.

  19. willcl-ark commented at 9:18 am on September 4, 2025: member
    Opened #33302, which should sort this out.
  20. maflcko commented at 12:41 pm on September 4, 2025: member
    Created https://github.com/maflcko/DrahtBot/issues/59, because I don’t know how to fetch the pull request number from a check suite or check run.
