At first glance, this looks very cool!

Speeding up our GitHub Actions CI runs is coming up in my queue at work, and I'll definitely take a look at this. Our runs aren't particularly slow by most standards (about 4 minutes), but I would really love to make them much faster. I'm not sure if 1 minute would be possible, but one can dream ;-). Still, if I can run our test suite on my MacBook Air M2 in about a minute, I don't see why I can't get CI close to that without spending a fortune. I feel like so much time is wasted in our GHA workflows downloading the same container images and dependencies over and over. Anecdotally, I also find the GHA hosted runners sometimes have huge performance swings, where some runs are 25-50% slower for no apparent reason (although time of day seems to affect it). I'm thinking running on EC2 might help with that too.

I've considered some of the third-party hosted runners (e.g. BuildJet), but didn't love the idea of trusting them with our code base. On the other hand, the projects I looked at for running self-hosted GHA runners seemed like they could require a decent amount of "babysitting", and I didn't see any that supported persistent disks.

Just out of curiosity, can you explain how the persistent disks work in a little more detail? Does it work something like the following:

1. Create EC2 Instance for Runner #1

2. Create new EBS volume and attach it to Runner #1

3. Runner #1 shuts down due to inactivity and EBS volume is detached.

4. Create EC2 Instance For Runner #1 (or does it just stop/start an existing instance?)

5. Attach existing EBS volume created in step #2

Assuming you had multiple runners, would it check for an unattached EBS volume first before trying to create a new one?

Another question: do you manage the AMI that the runner uses? Is it the latest Ubuntu, like GHA uses?

xjia|2 years ago

Persistent disks are implemented as EBS snapshots, so the process is something like:

1. Create EC2 instance for runner #1. Find out there is no existing snapshot, so an empty volume is created and attached.

2. Runner #1 runs exactly 1 job and shuts down. A snapshot is taken for the persistent volume. That's going to be used by later runners.

3. Create EC2 instance for runner #2. Create a new volume based on the last snapshot.

4. If #2 is still running when a new job comes in, create an EC2 instance for runner #3 and create its volume from the same latest snapshot.

5. Whenever a runner finishes, its persistent volume gets a snapshot taken. Outdated snapshots are automatically removed.
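To make the steps above concrete, here is a rough in-memory sketch of the lifecycle in Python. It is only a toy model, not the actual implementation: `SnapshotStore`, `Volume`, and the `keep` parameter are hypothetical names invented for illustration, and no AWS API is called. One thing the model makes visible is that when two runners start from the same snapshot, whichever finishes last defines the next "latest" snapshot (at least under the assumptions made here).

```python
from dataclasses import dataclass, field
from itertools import count
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Snapshot:
    """Immutable point-in-time copy of a volume (stand-in for an EBS snapshot)."""
    snapshot_id: int
    files: FrozenSet[str]


@dataclass
class Volume:
    """Mutable per-runner disk (stand-in for an EBS volume)."""
    files: Set[str] = field(default_factory=set)


class SnapshotStore:
    """Hypothetical bookkeeping for the snapshot-based persistent disk."""

    def __init__(self, keep: int = 1) -> None:
        if keep < 1:
            raise ValueError("keep must be at least 1")
        self.keep = keep
        self.snapshots: List[Snapshot] = []
        self._ids = count(1)

    def latest(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def new_volume(self) -> Volume:
        # Steps 1, 3, 4: empty volume if no snapshot exists,
        # otherwise a fresh copy of the latest snapshot.
        snap = self.latest()
        return Volume(files=set(snap.files) if snap else set())

    def take_snapshot(self, volume: Volume) -> Snapshot:
        # Steps 2, 5: snapshot the finished runner's volume,
        # then drop everything except the newest `keep` snapshots.
        snap = Snapshot(next(self._ids), frozenset(volume.files))
        self.snapshots.append(snap)
        del self.snapshots[:-self.keep]
        return snap


if __name__ == "__main__":
    store = SnapshotStore(keep=1)

    v1 = store.new_volume()            # runner #1: no snapshot yet, empty volume
    v1.files.add("dependency-cache")
    store.take_snapshot(v1)

    v2 = store.new_volume()            # runner #2: from the latest snapshot
    v3 = store.new_volume()            # runner #3: same snapshot, #2 still running
    v2.files.add("docker-layers")
    store.take_snapshot(v2)
    store.take_snapshot(v3)            # finishes last, so it becomes the latest

    print(sorted(store.latest().files))
```

In this model runner #3 never sees what runner #2 added, and runner #2's additions are dropped once its snapshot is pruned. Whether the real system behaves exactly that way is an assumption on my part.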

And yes, we manage the AMI that the runner uses. We try our best to follow https://github.com/actions/runner-images and will automate this process very soon so it's always up-to-date.

Edit: formatting

cswilliams|2 years ago

Thanks for answering! Unless I'm misunderstanding, one issue with this method is since you're creating a new EBS volume from a snapshot every time the runner starts, the volume will be cold and there will be additional latency on the first reads from the volume. Seems like you could run into this penalty fairly often if you were constantly spinning up and down runners due to inactivity. Maybe something worth considering for v3 (spot instances would be nice to have too).
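For what it's worth, the usual workaround for the cold-volume penalty is to read every block once before real work starts (AWS calls this initializing the volume and suggests tools like `dd` or `fio`). Below is a minimal Python sketch of that idea, not something the project is known to do. The function name `prewarm` is made up, and the device path in the comment is a hypothetical example; reading a raw device normally requires root.

```python
def prewarm(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Read `path` sequentially from start to end and return the bytes read.

    `path` can be a regular file or a block device such as /dev/nvme1n1
    (hypothetical example). The data is discarded; the point is only to
    force each block to be fetched once.
    """
    total = 0
    # buffering=0 gives a raw, unbuffered file object at the Python level.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:          # empty bytes means end of file/device
                break
            total += len(chunk)
    return total
```

A full sequential read of a large volume can take longer than a 4-minute CI run, so in practice it probably only makes sense for the directories a job actually touches, or as a background task.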