Disclaimer: the following should not replace us investigating and fixing the root causes of timeouts and erratic test runtimes.
Now seems an opportune time to open a discussion on some investigation I have been doing into our self-hosted runners, as our CI has been struggling again recently.
I wanted to see what the cost/benefit implications of upgrading our self-hosted runners would look like. Hooking up a single Hetzner AX52 (70€/month) as a self-hosted runner saw each job run on average 3-5x faster (results shown at the end), which is not surprising in itself.
IIUC we currently have about 25 low-powered, shared-vCPU x86_64 runners (plus a sprinkling of ARM ones). If we had the appetite, and could find the funding, we might consider one of the following:
- Upgrade the x86_64 runners to 12 dedicated-CPU servers. At 70€ each this would total 840€ per month, or 10080€ per year, vs a current spend of ~3840€ per year, so roughly 2.6x the cost for 3-5x the speed. This feels like a decent return.
alternatively
- Bump our current (shared-vCPU) runners to the next “level” up. If, e.g., these are the runners we use today, we could bump the CPX21s to CPX31s and the CPX31s to CPX41s, for a monthly cost of 505.60€ vs a current spend of 320€. I did not test the performance gains of this path. (The cost arithmetic for both options is sketched in the snippet below.)
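To make the comparison easier to poke at, here is a rough back-of-the-envelope sketch of the arithmetic using only the figures quoted above (prices, current spend, and the 3-5x speedup from the AX52 test); treat it as a starting point, not a pricing model:

```python
# Back-of-the-envelope comparison of the two upgrade paths, using the figures
# quoted in this post. Adjust the numbers if/when we get better measurements.

CURRENT_MONTHLY_SPEND = 320.0  # ~25 shared-vCPU runners, roughly 3840€/year

options = {
    "12x dedicated AX52":             {"monthly_cost": 12 * 70.0, "speedup": (3.0, 5.0)},
    "CPX21 -> CPX31, CPX31 -> CPX41": {"monthly_cost": 505.60,    "speedup": None},  # not measured
}

for name, opt in options.items():
    cost_ratio = opt["monthly_cost"] / CURRENT_MONTHLY_SPEND
    line = (f"{name}: {opt['monthly_cost']:.2f}€/month "
            f"({opt['monthly_cost'] * 12:.0f}€/year), {cost_ratio:.2f}x current spend")
    if opt["speedup"] is not None:
        lo, hi = opt["speedup"]
        # >1 here means we get more speed per euro than with the status quo
        line += f"; {lo:.0f}-{hi:.0f}x faster, i.e. {lo/cost_ratio:.1f}-{hi/cost_ratio:.1f}x more speed per €"
    print(line)
```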
We could also" just spend more" buying larger numbers of the same (low-powered) runners, but IMO this would not be as effective as reducing the runtime of CI jobs, and eliminating CI run timeouts. Moving away from vCPUs feels like the correct choice, if we can, as it’s possible that random contention on these could contribute to “random” timeouts and failures.
Additional thoughts in no particular order:
- I have likely not considered all the benefits of having larger numbers of lower-powered runners. Comments welcome on this.
- These more powerful runners also (all) come with (some) more disk space, so we could potentially do things like configure the total ccache size across all jobs (e.g. via ccache’s max_size setting) to be something like 100s of GB, and try to maximize those cache hits!
- I am not sure what the developer tradeoff is between “CI startup time” (i.e. how long until a job is picked up by a runner) and “CI runtime”: with fewer, more powerful runners, jobs might queue for longer but run faster once picked up.
- Should we employ a more scientific approach, e.g. calculate total compute per € and just buy whatever wins by that metric? (A rough sketch of what that calculation could look like is below.)
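On that last point, here is a minimal sketch of what a compute-per-€ ranking could look like, assuming we first benchmark a representative CI job on each candidate server type. The prices and job times below are placeholders I made up for illustration (only the AX52 time is loosely anchored to the 3-5x speedup seen in the test), so don't read anything into the specific values:

```python
# Hypothetical compute-per-euro ranking: for each candidate server type, divide
# throughput (jobs per month, derived from a measured job time) by monthly price.
# The prices and job_minutes values are PLACEHOLDERS, not real quotes or measurements.

candidates = {
    # name: (monthly_price_eur, representative_job_minutes) -- placeholder values
    "CPX21 (shared vCPU)": (8.0, 60.0),
    "CPX41 (shared vCPU)": (25.0, 35.0),
    "AX52 (dedicated)":    (70.0, 15.0),  # ~4x faster than the baseline, per the test above
}

def jobs_per_euro(monthly_price_eur: float, job_minutes: float) -> float:
    """Jobs one runner can complete per month, per euro spent (assumes it is kept busy)."""
    jobs_per_month = (30 * 24 * 60) / job_minutes
    return jobs_per_month / monthly_price_eur

for name, (price, minutes) in sorted(
    candidates.items(), key=lambda kv: jobs_per_euro(*kv[1]), reverse=True
):
    print(f"{name}: {jobs_per_euro(price, minutes):.0f} jobs per € per month")
```

With made-up numbers like these, the cheap shared-vCPU boxes can easily “win” on raw throughput per euro, which is why I suspect the metric would also need to account for per-job latency and timeouts rather than throughput alone.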
I’d be curious to hear thoughts on whether this is worth looking into further, or whether folks have investigated this before and what they found.
AX52 test
- A typical CI run using current runners (on a good day):
- Jobs run using a single AX52 runner: