This changeset migrates all current self-hosted CI jobs over to hosted Cirrus Runners.
These runners cost a flat rate of $150/month per runner, and we qualify for a 50% open source discount, bringing the cost to $75/month/runner.
One “runner” should more accurately be thought of in terms of the number of vCPUs you are purchasing (https://cirrus-runners.app/pricing/), or in terms of “concurrency”, where 1 runner gets you 1.0 concurrency. For example, a Linux x86 runner gets you 16 vCPUs (1.0 concurrency) and 64GB RAM to be provisioned as you choose, amongst one or more jobs.
Cirrus Runners currently only support Linux (x86 and Arm64) and macOS (Arm64). This changeset does not move the existing GitHub Actions native macOS runners away from GitHub’s infrastructure; that could be a follow-up optimisation.
Runs from this changeset using Cirrus Runners can be found at: https://github.com/testing-cirrus-runners/bitcoin2/actions which shows an uncached run on master (CI#1), an outside pull request (CI#3) and an updated push to master (CI#4).
These workflows were run on 10 runners, and we would recommend purchasing a similar number for our CI in this repo to achieve the speed and concurrency we expect.
We include some optional performance commits, but these could be split out and made into followups or dropped entirely.
Benefits
Maintenance
As we are not self-hosting, nobody needs to maintain servers, disks etc.
Bus factor
Currently we have a very small number of people with the know-how to work on server setup and maintenance. This setup fixes that, so that “anyone” familiar with GitHub-style CI systems can work on it.
Scaling
These do not “auto-scale”/have “unlimited concurrency” like some solutions, but if we want more workers/CPUs to increase parallelism, or to increase the runner size of certain jobs for a speed-up, we can simply buy more concurrency using the web interface.
Speed
Runtimes approximate current runtimes pretty well, with some jobs being faster. Caching improvements for pull request (re-)runs are left as future optimisations to the current changeset (see below).
GitHub workflow syntax
With a migration to the more commonly used GitHub workflow syntax, a move to other providers in the future is often as simple as a one-line change (and installing a new GitHub app on the repo).
If we decide to self-host again, then we can also self-host GitHub runners (using https://github.com/actions/runner) and retain the new GH-style CI syntax.
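For example, switching hosted-runner providers typically only means swapping the runner label; both labels below are illustrative placeholders, not the exact labels used here:

```yaml
jobs:
  test:
    # before (hypothetical Cirrus Runners label)
    # runs-on: ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04
    # after (hypothetical label for another hosted-runner provider)
    runs-on: some-other-provider-16vcpu-ubuntu-24.04
```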
Reporting
GitHub workflows provide nicer built-in reporting directly on the “Checks” page of a PR. This includes more detailed action reporting and a host of nice integrated features, such as Workflow Commands for creating annotations that can print messages during runs. See, for example, the bottom of this window, where we report the ccache hit rate if it was below 90%: https://github.com/testing-cirrus-runners/bitcoin/actions/runs/16163449125?pr=1
Such annotations could be added conditionally in our CI scripts to report other interesting information.
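As a rough illustration of the idea (the threshold, stats parsing and step layout are hypothetical, not the exact step used in this changeset):

```yaml
# Hypothetical step: emit a warning annotation when the ccache hit rate is low.
- name: Report low ccache hit rate
  shell: bash
  run: |
    # Parsing is illustrative only; adapt it to the ccache version in the CI image.
    hit_rate=$(ccache -s | grep -oP 'Hits:.*\(\K[0-9]+' | head -n1)
    if [ "${hit_rate:-0}" -lt 90 ]; then
      # GitHub workflow command: creates an annotation visible on the run summary page.
      echo "::warning title=Low ccache hit rate::ccache hit rate was only ${hit_rate:-0}%"
    fi
```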
Costs
Financial
Relative to competitors, Cirrus Runners are cheap for the hosted-CI world. However, they are likely more expensive than our current setup, or a well-configured (new) self-hosted setup.
If we started with 10 runners to be shared amongst all migrated jobs, this would total $750/mo = $9000/yr.
Note that we are not trying to compete here on cost directly.
Dependencies
We would be dependent on Cirrus infra, and on any registry cache that we sign up to (e.g. Quay.io).
Forks
- Forks should be able to run CI without paid Cirrus runners. This behaviour is achieved through a rather verbose `runs-on:` directive (see the sketch after this list).
  - This directive hardcodes the main repo (unfortunately you cannot use the `env` github context in this field in particular, for some reason).
  - This directive also allows a fork which has its own Cirrus runners to set the `USE_CIRRUS_RUNNERS` repository env var to use their own Cirrus runners on `push` to their fork.
  - If this variable is not set, or we are not in the context of the main repo, the workflow will fall back to the GitHub free runners.
- The cirrus cache action transparently falls back to the GitHub Actions cache when not running on Cirrus, so forks will get some free GitHub caching (10GB per repo).

All jobs work on forks, but will run (slowly) on GitHub's native free hosted runners instead of Cirrus runners. They will also suffer from poor cache hit-rates, but there’s nothing that can be done about that, and the situation is an improvement on today.
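A minimal sketch of the idea; the runner label, `ORG/REPO` name and expression details are illustrative placeholders rather than the exact directive in this changeset:

```yaml
jobs:
  example-job:
    # Use a Cirrus runner when we are in the main repo, or when a fork has opted in
    # by setting the USE_CIRRUS_RUNNERS repository variable; otherwise fall back to
    # a free GitHub-hosted runner. 'ORG/REPO' and the runner label are placeholders.
    runs-on: ${{ (github.repository == 'ORG/REPO' || vars.USE_CIRRUS_RUNNERS == 'true') && 'ghcr.io/cirruslabs/ubuntu-runner-amd64:24.04' || 'ubuntu-24.04' }}
```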
Migration process
In addition to pulling the code changes, the main org should also:

- Set up a Quay.io account (free is fine)
- Configure a quay.io “robot account”
- Set the robot account name as the repo environment variable `QUAY_USERNAME`
- Set the robot account access token as the repo secret `QUAY_ROBOT_TOKEN`
- Create a new docker repo on quay.io for the build cache to use
- Give the robot account read/write access to this repo
- Set up a https://cirrus-runners.app/ account, install the GitHub app to the repository, and buy/provision the required number of runners.
- Update this PR with the repo `ORG` and `REPO`.
- Permit the actions `docker/setup-buildx-action@v3` and `docker/login-action@v3` to be run in this repo.
Caching
For the number of CI jobs we have, cache usage on GitHub would be an issue as GH only provides 10GB of cache space per repo. However, Cirrus provides 10GB per runner, which scales better with the number of runners.
The `cirruslabs/actions/[restore|save]` cache actions we use here redirect this to Cirrus’ own cache, which is both faster and larger.
In the case that a user is running CI on a fork, the Cirrus cache falls back transparently to the GitHub default cache without error.
ccache, depends-sources, built-depends
- Cached as blobs via the `cirruslabs/actions/cache` action.
- Current implementation:
  - On `push`: restores and saves caches.
  - On `pull_request`: restores but does not save caches.

This means a new pull request should hit a pretty relevant cache. Old pull requests which are not being rebased on master may suffer from a lower cache hit-rate.
If we save caches on all pull request runs we run the risk of evicting recent (and more relevant) cache blobs. It may be possible in a future optimisation to widen this to save on pull request runs too, but it will also depend on how many runners we provision and what cache churn rates are like in the main repo.
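The restore/save split described above could look roughly like the following; the paths, keys, and the exact cirruslabs action references and versions are illustrative placeholders rather than the exact steps in this changeset:

```yaml
steps:
  # Restore runs on both push and pull_request events.
  - name: Restore ccache
    uses: cirruslabs/actions/cache/restore@v4   # placeholder action ref/version
    with:
      path: ~/.cache/ccache
      key: ccache-${{ runner.os }}-${{ github.sha }}
      restore-keys: ccache-${{ runner.os }}-

  # ... configure, build and test ...

  # Save only on pushes to master, so pull requests cannot evict the more
  # broadly useful cache blobs.
  - name: Save ccache
    if: github.event_name == 'push'
    uses: cirruslabs/actions/cache/save@v4      # placeholder action ref/version
    with:
      path: ~/.cache/ccache
      key: ccache-${{ runner.os }}-${{ github.sha }}
```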
Docker build layer caching
- Cached using a registry backend.
  - Currently configured to use Quay.io.
- For pushing, this requires i) an account with them, and ii) login using a “robot account” (with push access).
- Login is only performed `on: push`, therefore caches are again (like the other caches) essentially only saved on pushes to master.
  - It is insecure to log in to the registry `on: pull_request`, as these originate from external sources.
  - It is possible to log in, but the workflow must be changed to use `on: pull_request_target`, which permits a PR to leak all your repo variables and secrets, amongst other pitfalls such as not using a merge commit with the base branch as the default checkout target. We recommend this be avoided.
- PRs (and forks) can pull layers from the cache without login (allegedly without rate limits on Quay.io, in contrast to Dockerhub).
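A rough sketch of this registry-cache wiring; the image name, file path and build-push-action version are placeholders, and only the two docker actions named in the migration steps are taken from this changeset:

```yaml
steps:
  - uses: docker/setup-buildx-action@v3

  # Log in only on push events; pull_request runs pull cache layers anonymously
  # and never write to the registry.
  - name: Login to Quay.io
    if: github.event_name == 'push'
    uses: docker/login-action@v3
    with:
      registry: quay.io
      username: ${{ vars.QUAY_USERNAME }}
      password: ${{ secrets.QUAY_ROBOT_TOKEN }}

  - name: Build CI image
    uses: docker/build-push-action@v6            # placeholder version
    with:
      context: .
      file: ci/ci_imagefile                      # placeholder path
      # Read the shared layer cache on every run; write it back only on push.
      cache-from: type=registry,ref=quay.io/ORG/ci-cache:latest
      cache-to: ${{ github.event_name == 'push' && 'type=registry,ref=quay.io/ORG/ci-cache:latest,mode=max' || '' }}
```

Locally, the same `--cache-from type=registry,ref=...` argument can be passed to a buildx build to reuse the CI layer cache, as noted further below.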
We have experimented with the `gha` cache backend type, which stores the build cache as a series of blobs (using the `cirruslabs/actions/cache` action), and therefore shares the same 10GB × num_runners budget as all other caches, whilst negating the requirement on any registry.
This seems to simply not work in many cases, and without good visibility into the cache itself (currently there is no UI/CLI access), it is not easy to see why build cache layers are seemingly missed in many runs.
Using a registry cache has advantages and disadvantages:
- Requires an account + login for pushing.
- The registry may in theory rate limit us (despite having “unlimited” usage for public repos).
- We do not know precisely their garbage collection policy on the registry backend (although this likely doesn’t matter if they do something like Least Recently Used).
- Well supported by both docker and podman.
- Appears to operate much better in general; at least we do not see the cache misses we regularly saw using the `gha` backend.
- In theory, CI being run locally can also use the `--cache-from` build args to share the CI build cache.
Both backends require the same network i/o in a CI setting (and so are marginally slower than our current disk i/o cache).
Cirrus are apparently in the process of creating their own container registry for their Runner product. If this materialises then we should be able to trivially switch from Quay.io to the cirrus registry for some speed improvements (and presumably to reduce the risk of being rate-limited).
But what about… `x`?
We have tested many other providers, including Runs-On, BuildJet, WarpBuild, and GitHub hosted runners (and investigated even more). But they all fall short in one way or another.
- Runs-On and BuildJet (and others) require installing GH apps with much too-liberal permissions (e.g. `Administration: Read|Write`) for our use-case.
- GitHub hosted runners suffer from all of: high costs, lower speed, small cache, and the requirement for a GitHub Teams subscription.
- WarpBuild seems to be simply too expensive.
- WarpBuild seems to be simply too expensive.
TODO:
To complete the migration from self-hosted to hosted runners for this repo, the backport branches `27.x`, `28.x` and `29.x` would also need their CI ported, but these are left for followups to this change (and pending review/changes here first).
Work and experimentation undertaken with @m3dwards