Disclaimer: the following should not replace us investigating and fixing the root causes of timeouts and erratic test runtimes.
Now seems an opportune time to open a discussion on some investigation I have been doing into our self-hosted runners, as our CI has been struggling again recently.
I wanted to see what the cost/benefit implications of upgrading our self-hosted runners would look like. Hooking up a single Hetzner AX52 (70€/month) as a self-hosted runner saw each job run on average 3-5x faster (results shown at the end), which is not surprising in itself.
IIUC we currently have about 25 low-powered, shared-vCPU x86_64 runners (plus a sprinkling of ARM ones). If we had the appetite, and could find the funding, we might consider one of the following:
- Upgrade the x86_64 runners to 12 dedicated-CPU servers. At 70€ each this would total 840€ per month, or 10080€ per year, vs a current spend of ~3840€ per year, so roughly 2.6x the cost for 3-5x the speed. This feels like a decent return.
alternatively
- Bump our current (shared-vCPU) runners to the next “level” up. If, e.g., these are the runners we use today, we could bump the CPX21s to CPX31s and the CPX31s to CPX41s, for a monthly cost of 505.60€ vs a current spend of 320€. I did not test the performance gains of this path. (The cost arithmetic for both options is sketched in the snippet below.)
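To make the comparison easier to poke at, here is a rough back-of-the-envelope sketch of the arithmetic using only the figures quoted above (prices, current spend, and the 3-5x speedup from the AX52 test); treat it as a starting point, not a pricing model:

```python
# Back-of-the-envelope comparison of the two upgrade paths, using the figures
# quoted in this post. Adjust the numbers if/when we get better measurements.

CURRENT_MONTHLY_SPEND = 320.0  # ~25 shared-vCPU runners, roughly 3840€/year

options = {
    "12x dedicated AX52":             {"monthly_cost": 12 * 70.0, "speedup": (3.0, 5.0)},
    "CPX21 -> CPX31, CPX31 -> CPX41": {"monthly_cost": 505.60,    "speedup": None},  # not measured
}

for name, opt in options.items():
    cost_ratio = opt["monthly_cost"] / CURRENT_MONTHLY_SPEND
    line = (f"{name}: {opt['monthly_cost']:.2f}€/month "
            f"({opt['monthly_cost'] * 12:.0f}€/year), {cost_ratio:.2f}x current spend")
    if opt["speedup"] is not None:
        lo, hi = opt["speedup"]
        # >1 here means we get more speed per euro than with the status quo
        line += f"; {lo:.0f}-{hi:.0f}x faster, i.e. {lo/cost_ratio:.1f}-{hi/cost_ratio:.1f}x more speed per €"
    print(line)
```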
We could also" just spend more" buying larger numbers of the same (low-powered) runners, but IMO this would not be as effective as reducing the runtime of CI jobs, and eliminating CI run timeouts. Moving away from vCPUs feels like the correct choice, if we can, as it’s possible that random contention on these could contribute to “random” timeouts and failures.
Additional thoughts in no particular order:
- I have likely not considered all the benefits of having larger numbers of lower-powered runners. Comments welcome on this.
- These more powerful runners also (all) come with (some) more disk space, so we could potentially do things like configure the total ccache size across all jobs (e.g. via ccache’s max_size setting) to be something like 100s of GB, and try to maximize those cache hits!
- I am not sure what the developer tradeoff is between “CI startup time” (i.e. how long until a job is picked up by a runner) and “CI runtime”: with fewer, more powerful runners, jobs might queue for longer but run faster once picked up.
- Should we employ a more scientific approach, e.g. calculate total compute per € and just buy whatever wins by that metric? (A rough sketch of what that calculation could look like is below.)
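On that last point, here is a minimal sketch of what a compute-per-€ ranking could look like, assuming we first benchmark a representative CI job on each candidate server type. The prices and job times below are placeholders I made up for illustration (only the AX52 time is loosely anchored to the 3-5x speedup seen in the test), so don't read anything into the specific values:

```python
# Hypothetical compute-per-euro ranking: for each candidate server type, divide
# throughput (jobs per month, derived from a measured job time) by monthly price.
# The prices and job_minutes values are PLACEHOLDERS, not real quotes or measurements.

candidates = {
    # name: (monthly_price_eur, representative_job_minutes) -- placeholder values
    "CPX21 (shared vCPU)": (8.0, 60.0),
    "CPX41 (shared vCPU)": (25.0, 35.0),
    "AX52 (dedicated)":    (70.0, 15.0),  # ~4x faster than the baseline, per the test above
}

def jobs_per_euro(monthly_price_eur: float, job_minutes: float) -> float:
    """Jobs one runner can complete per month, per euro spent (assumes it is kept busy)."""
    jobs_per_month = (30 * 24 * 60) / job_minutes
    return jobs_per_month / monthly_price_eur

for name, (price, minutes) in sorted(
    candidates.items(), key=lambda kv: jobs_per_euro(*kv[1]), reverse=True
):
    print(f"{name}: {jobs_per_euro(price, minutes):.0f} jobs per € per month")
```

With made-up numbers like these, the cheap shared-vCPU boxes can easily “win” on raw throughput per euro, which is why I suspect the metric would also need to account for per-job latency and timeouts rather than throughput alone.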
I’d be curious to hear thoughts on whether this is worth looking into further, or whether folks have investigated this before and what they found.
AX52 test
- A typical CI run using current runners (on a good day):
- Jobs run using a single AX52 runner: