The project currently self-hosts the runners for about half of our CI jobs. The other half runs on GitHub Actions, hosted by GitHub. In an offline discussion, the question came up whether we should move the self-hosted runners to some other non-self-hosted infrastructure, or whether there are other ways to improve the following areas:
- reduce infrastructure maintenance burden: someone has to take care of the servers, but that’s (normally) not too much work. It shouldn’t be a single person maintaining them, to improve the bus factor and avoid a single point of failure.
- increase performance: faster CI is generally welcome and may be solvable with more CPU. This is also related to #30852, but it is not yet clear what exactly is targeted. Performance could be improved with self-hosted CI too, so there is no need to go non-self-hosted for this alone.
- increase security/robustness: generally, running public CI jobs on self-hosted hardware is a security risk and not recommended. Making it (more) secure is possible, but probably a lot of work that we might want to outsource, e.g. to a company that specializes in it.
Currently, the self-hosted CI has implicit caches for Docker images, ccache, and depends (sources and built). Caching would need to be implemented on non-self-hosted runners too. Some more discussion about the implicit caches is here.
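To illustrate what replacing an implicit cache with an explicit one might involve: hosted runners start from a clean machine, so each cache needs a key that changes when its inputs change. Below is a minimal Python sketch of such a key, not what the CI does today. The `cache_key` helper, the `depends` prefix, and the choice of input files are all hypothetical and only meant to show the idea.

```python
import hashlib
from pathlib import Path


def cache_key(prefix: str, inputs: list[Path]) -> str:
    """Hypothetical helper: build a cache key from the given input files.

    The key changes whenever the path or content of any input changes,
    so a stale cache is never restored for modified inputs.
    """
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order the files were listed in.
    for path in sorted(inputs):
        # Hash the path too, so renaming a file also invalidates the key.
        digest.update(path.as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    # A shortened hex digest keeps the key readable in CI logs.
    return f"{prefix}-{digest.hexdigest()[:16]}"
```

A job could, for example, compute `cache_key("depends", sorted(Path("depends/packages").glob("*.mk")))` and restore the depends cache only when that key matches. ccache and Docker image caches would likely need different inputs and probably a fallback to the most recent cache.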
Some mentioned ideas, but there are probably more approaches worth exploring:
- GitHub Actions: move all jobs to GitHub Actions. The GitHub cache is limited to 10GB, which is too small. PR authors can’t rerun CI on their PRs. Even more dependency on GitHub/Microsoft, but we already have half of the jobs there.
  See also #30304 (ci: Move more tasks to GHA?).
- runson.com: would need to move the Cirrus jobs to the GitHub Actions format. Allows using a much bigger cache and is potentially cheaper than GitHub Actions?
- actions-runner-controller: bring our own machines for a Kubernetes cluster, or rent pods on demand and let the controller spin things up for new jobs. In theory, this helps with performance and probably with security. Would need to move from Cirrus to GitHub Actions.
- Cirrus-CI on Cloud machines: use cloud machines and keep Cirrus. Needs us to figure out caching.
- bitcoin-core-cirrus-runner: an experimental NixOS module that runs jobs in an isolated, ephemeral, and minimal QEMU VM. Would still be self-hosted CI. It does address some of the security concerns about the current CI, but is not production ready yet and is probably a more complex solution that wouldn’t reduce the maintenance burden in the long run.
- Keep the status quo: use self-hosted, persistent, but potentially insecure Cirrus runners - “if it ain’t broken, don’t fix it”