top | item 33656767

Fast CI with MicroVMs

132 points | alexellisuk | 3 years ago | blog.alexellis.io

91 comments

[+] throwawaaarrgh|3 years ago|reply
> I spoke to the GitHub Actions engineering team, who told me that using an ephemeral VM and an immutable OS image would solve the concerns.

that doesn't solve them all. the main problem is secrets. if a job has access to an api token that can be used to modify your code or access a cloud service, a PR can abuse that to modify things it shouldn't. a second problem is even if you don't have secrets exposed, a PR can run a crypto miner, wasting your money. finally, a self-hosted runner is a step into your private network and can be used for attacks, which firecracker can help mitigate but never eliminate.

the best solution to these problems is 1) don't allow repos to trigger your CI unless the user is trusted or the change has been reviewed, 2) always use least privilege and zero-trust for all access (yes even for dev services), 3) add basic constraints by default on all jobs running to prevent misuse, and then finally 4) provide strong isolation in addition to ephemeral environments.
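Points 1 and 3 above can be sketched in miniature. This is an illustrative sketch, not any real CI system's API; all names are invented:

```python
from dataclasses import dataclass

# Illustrative sketch of rules 1 and 3 above: gate untrusted changes,
# and give every job conservative defaults to limit miner-style abuse.

@dataclass
class Job:
    author_trusted: bool
    reviewed: bool
    cpu_limit: int = 2            # rule 3: basic constraints by default
    timeout_minutes: int = 30
    network_egress: bool = False  # no outbound traffic unless opted in

def should_run(job: Job) -> bool:
    """Rule 1: only trusted authors or reviewed changes reach a runner."""
    return job.author_trusted or job.reviewed
```

Rules 2 and 4 (least privilege, strong isolation) live outside the scheduler, in how tokens are scoped and how the runner itself is sandboxed.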

[+] alexellisuk|3 years ago|reply
You still have those same problems with hosted runners, don't you?

We're trying to re-create the hosted experience, but with self-hosted, faster infrastructure, without needing to account for metered billing.

[+] no_wizard|3 years ago|reply
Firecracker is pretty great, good to see it can be used in a CI environment like this, definitely piquing my interest.

I know it's the backbone of what runs fly.io[0] as well

[0]: https://fly.io/docs/reference/architecture/#microvms

[+] intelVISA|3 years ago|reply
Does Fly.io have an in-house engineering team? Seems like it's mostly AWS lite.
[+] ignoramous|3 years ago|reply
Sounds similar to webapp.io (layerci) that has been discussed quite a few times here: https://news.ycombinator.com/item?id=31062301

> Friction starts when the 7GB of RAM and 2 cores allocated causes issues for us

Well, I just create a 20GB swap. There's ample disk space but swap is slow for sure.

> MicroVM

Coincidentally, QEMU now sports a firecracker-inspired microvm: https://github.com/qemu/qemu/blob/a082fab9d25/docs/system/i3... / https://mergeboard.com/blog/2-qemu-microvm-docker/

[+] alexellisuk|3 years ago|reply
Hi, I'd not heard of webapp.io before, so thanks for mentioning it. Actuated is not a preview-branch product; that's an interesting area, but not the problem we're trying to solve.

Actuated is not trying to be a CI system, or a replacement for one like webapp.io.

It's a direct integration with GitHub Actions, and as we get interest from pilot customers for GitLab etc., we'll consider adding support for those platforms too.

Unopinionated, without lock-in. We want to create the hosted experience, with safety and speed built in.

[+] colinchartier|3 years ago|reply
Hey, yeah this looks somewhat similar to what we're building at https://webapp.io (nee LayerCI, YC S20)

We migrated to a fork of firecracker, but we're a fully hosted product that doesn't directly interact with GHA at all (similar to how CircleCI works), so there's some positioning difference between us and OP at the very least.

Always happy to see innovation in the space :)

[+] synergy20|3 years ago|reply
did not know qemu has its own firecracker machine now, thanks for the info! going to test how fast it boots.
[+] avita1|3 years ago|reply
Something I've increasingly wondered is if the model of CI where a totally pristine container (or VM) gets spun up on each change for each test set imposes a floor on how fast CI can run.

Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.

If I had infinity time, I'd build a CI system that found a runner that maintained some state (gasp!) about the build and went to a test runner that had most of its local build cache downloaded, source code cloned, and toolchain bootstrapped.
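The runner-selection part of that hypothetical system can be sketched as a scoring problem: route each job to the runner with the most state already warmed up. This is an illustrative sketch; the field names are invented:

```python
# Hypothetical sketch of the "stateful runner" idea: prefer the runner
# that already has build cache, clone, and toolchain in place, so the
# job pays the least setup cost before its first real command.

def warmth(runner: dict) -> int:
    """Higher score means less setup work before the job can start."""
    return (3 * bool(runner.get("build_cache"))
            + 2 * bool(runner.get("source_clone"))
            + 1 * bool(runner.get("toolchain")))

def pick_runner(runners: list[dict]) -> dict:
    return max(runners, key=warmth)
```

The weights encode that a warm build cache saves more time than a pre-done clone, which in turn saves more than a pre-installed toolchain.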

[+] capableweb|3 years ago|reply
You'd love a service like that, until you have some weird stuff working in CI but not locally (or vice versa). That's why things are built from scratch every time: to prevent any such issues from happening.

Npm was (still is?) famously bad at installing dependencies, where sometimes the fix is to remove node_modules and simply reinstall. Back when npm was more brittle (yes, possible) it was nearly impossible to maintain caches of node_modules directories, as they ended up being different than if you reinstalled with no existing node_modules directory.

[+] maccard|3 years ago|reply
I work in games, our repository is ~100GB (20m download) and a clean compile is 2 hours on a 16 core machine with 32GB ram (c6i.4xlarge for any Aws friends). Actually building a runnable version of the game takes two clean compiles (one editor and one client) plus an asset processing task that takes about another 2 hours clean.

Our toolchain install takes about 30 minutes (although that includes making a snapshot of the EBS volume to make an AMI out of).

That's ~7 hours for a clean build.

We have a somewhat better system than this - our base ami contains the entire toolchain, and we do an initial clone on the ami to get the bulk of the download done too. We store all the intermediates on a separate drive and we just mount it, build incrementally and unmount again. Sometimes we end up with duplicated work but overall it works pretty well. Our full builds are down from 7 hours (in theory) to about 30 minutes, including artifact deployments.

[+] Too|3 years ago|reply
This is how CI systems have always behaved traditionally. Just install a Jenkins agent on any computer/VM and it will maintain a persistent workspace on disk for each job to reuse in incremental builds. There are countless other tools that work in the same way. This also solves the problem of isolating builds if your CI only checks out the code and then launches a constrained docker container executing the build. This can easily be extended to use persistent network disks and scaled-up workers, but is usually not worth the cost.
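The incremental reuse a persistent workspace gives you boils down to a make-style staleness check: rebuild only when the output is missing or older than its source. A minimal illustrative sketch, not any specific tool's code:

```python
import os

# Make-style staleness check over a persistent workspace: a source file
# is recompiled only if its output is missing or older than the source.

def needs_rebuild(src: str, out: str) -> bool:
    if not os.path.exists(out):
        return True
    return os.path.getmtime(src) > os.path.getmtime(out)
```

In a pristine worker every output is always missing, so every check returns True and the whole build runs from scratch each time.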

It's baffling to see this new trend of yaml actions running in pristine workers, redownloading the whole npm-universe from scratch on every change, birthing hundreds of startups trying to "solve" CI by presenting solutions to non-problems and then wrapping things in even more layers of lock-in and micro-VMs and detaching yourself from the integration.

While Jenkins might not be the best tool in the world, the industry needs a wake-up shower on how to simplify and keep in touch with reality, not hidden behind layers of SaaS-abstractions.

[+] jacobwg|3 years ago|reply
Agreed, this is more or less the inspiration behind Depot (https://depot.dev). Today it builds Docker images with this philosophy, but we'll be expanding to other more general inputs as well. Builds get routed to runner instances pre-configured to build as fast as possible, with local SSD cache and pre-installed toolchains, but without needing to set up any of that orchestration yourself.
[+] colinchartier|3 years ago|reply
This was the idea behind https://webapp.io (YC S20):

- Run a linear series of steps

- Watch which files are read (at the OS level) during each step, and snapshot the entire RAM/disk state of the MicroVM

- When you next push, just skip ahead to the latest snapshot

In practice this makes a generalized version of "cache keys" where you can snapshot the VM as it builds, and then restore the most appropriate snapshot for any given change.
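That "generalized cache key" can be sketched as a hash chain: each step's key hashes the previous key plus the contents of the files that step read, and the next push restores the deepest snapshot whose key chain still matches. An illustrative sketch, not webapp.io's actual implementation:

```python
import hashlib

# Chain of cache keys: a step's key depends on everything before it
# plus the files it read, so any upstream change invalidates it.

def step_key(prev_key: str, files_read: dict[str, bytes]) -> str:
    h = hashlib.sha256(prev_key.encode())
    for path in sorted(files_read):
        h.update(path.encode())
        h.update(files_read[path])
    return h.hexdigest()

def deepest_snapshot(steps: list[dict[str, bytes]], snapshots: dict[str, str]):
    """Return the snapshot id for the longest matching prefix of steps."""
    key, best = "", None
    for files in steps:
        key = step_key(key, files)
        if key not in snapshots:
            break
        best = snapshots[key]
    return best
```

A change to a late step's inputs still lets you restore everything before it, which is exactly the "skip ahead" behaviour described above.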

[+] lytedev|3 years ago|reply
I have zero experience with bazel, but I believe it offers the possibility of mechanisms similar to this? Or a mechanism that makes this "somewhat safe"?
[+] skissane|3 years ago|reply
> Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.

Couldn’t this be addressed if every node had a local caching proxy server container/VM, and all the other containers/VMs on the node used it for Git checkouts, image/package downloads, etc?
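The node-local cache suggested here is essentially a content-addressed memo on disk: every container/VM on the host asks the cache first, and only misses hit the network. An illustrative sketch, with `fetch` standing in for the real git/registry/package client:

```python
import hashlib
import os

# Node-local download cache: repeated requests for the same URL are
# served from disk; only the first request goes over the network.

def cached_fetch(url: str, cache_dir: str, fetch) -> bytes:
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, hashlib.sha256(url.encode()).hexdigest())
    if os.path.exists(path):            # cache hit: no network round-trip
        with open(path, "rb") as f:
            return f.read()
    data = fetch(url)                   # cache miss: fetch once, keep on disk
    with open(path, "wb") as f:
        f.write(data)
    return data
```

A real proxy (e.g. for Git or a package registry) also has to handle invalidation and concurrent writers, which this sketch ignores.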

[+] quesera|3 years ago|reply
> the model of CI where a totally pristine container (or VM) gets spun up on each change for each test set imposes a floor on how fast CI can run

I believe this is the motivation behind https://brisktest.com/

[+] mattbillenstein|3 years ago|reply
I'm using Buildkite, which lets me run the workers myself. These are long-lived Ubuntu systems set up with the same code we use on dev and production, running all the same software dependencies. Tests are fast and it works pretty nicely.
[+] Shish2k|3 years ago|reply
> Each job will always have to run a clone

You can create a base filesystem image with the code and tools checked out, then create a VM which uses that in a copy-on-write way

[+] bkq|3 years ago|reply
Good article. Firecracker is something that has definitely piqued my interest when it comes to quickly spinning up a throwaway environment to use for either development or CI. I run a CI platform [1], which currently uses QEMU for the build environments (Docker is also supported, but currently disabled on the hosted offering). Startup times are OK, but having a boot time of 1-2s is definitely highly appealing. I will have to investigate Firecracker further to see if I could incorporate this into what I'm doing.

Julia Evans has also written about Firecracker in the past too [2][3].

[1] - https://about.djinn-ci.com

[2] - https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-in-l...

[3] - https://news.ycombinator.com/item?id=25883253

[+] alexellisuk|3 years ago|reply
Thanks for commenting, and your product looks cool btw.

Yeah a lot of people have talked about Firecracker in the past, that's why I focus on the pain and the problem being solved. The tech is cool, but it's not the only thing that matters.

People need to know that there are better alternatives to sharing a docker socket or using DIND with K8s runners.

[+] lxe|3 years ago|reply
Firecracker is nice but still very limited in what it can do.

My gripe with all CI systems is that as an industry standard we've universally sacrificed performance for hermeticity and re-entrancy, even when it doesn't really give us a practical advantage. Downloading and re-running containers and VMs, endlessly checking out code, installing deps over and over is just a waste of time, even with caching, COW, and other optimizations.

[+] jxf|3 years ago|reply
> My gripe with all CI systems is that as an industry standard we've universally sacrificed performance for hermeticity and re-entrancy, even when it doesn't really give us a practical advantage.

The perceived practical advantage is the incremental confidence that the thing you built won't blow up in production.

> even with caching, COW, and other optimizations

Many CI systems do employ caching. For example, Circle.

[+] IshKebab|3 years ago|reply
Hermeticity is precisely what allows you to avoid endlessly downloading and building the same dependencies. Without hermeticity you can't rely on caching.

I feel like 90% of the computer industry is ignoring the lessons of Bazel and is probably going to wake up in 10 years and go "ooooooh, that's how we should have been doing it".
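The argument in the first paragraph can be shown in miniature: when an action declares its command and *all* of its inputs, its cache key fully determines the output, so a hit is always safe to reuse. An illustrative sketch, not Bazel's actual protocol:

```python
import hashlib

# Hermetic action cache: the key covers the command and every declared
# input, so identical keys imply identical outputs and a cached result
# can be reused without rebuilding.

def action_key(command: list[str], inputs: dict[str, bytes]) -> str:
    h = hashlib.sha256("\0".join(command).encode())
    for path in sorted(inputs):
        h.update(path.encode())
        h.update(inputs[path])
    return h.hexdigest()

def run_cached(command, inputs, execute, cache: dict) -> bytes:
    key = action_key(command, inputs)
    if key not in cache:               # only run on a true cache miss
        cache[key] = execute(command, inputs)
    return cache[key]
```

Without hermeticity an undeclared input (an environment variable, a system library) can change the output without changing the key, which is exactly why non-hermetic caches can't be trusted.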

[+] throwaway894345|3 years ago|reply
Honestly, I've never missed the shared mutable environment approach one bit. It might have been marginally faster, but I'd trade a whole bunch of performance for consistency (and the optimizations mean there's not much of a performance difference). Moreover, most of the time spent in CI is not container/VM overhead, but rather crappy Docker images, slow toolchains, slow tests, etc.
[+] alexellisuk|3 years ago|reply
When you say it's limited in what it can do, what are you comparing it to? And what do you wish it could do?

Fly has a lot of ideas here, and we've also been able to optimize how things work in terms of downloads; as for boot-up speed, it's less than 1-2s before a runner is connected.

[+] lijogdfljk|3 years ago|reply
I'm a bit surprised I don't see NixOS-like tooling in container orchestration for this reason.
[+] fideloper|3 years ago|reply
This project looks really neat!

Firecracker is very cool, I wish/hope tooling around it matures enough to be super easy. I'd love to see the technical details on how this is run. It looks like it's closed source?

The need for baremetal for Firecracker is a bit of a shame, but it's still wicked cool. (You can run it on a DO droplet but nested virtualization feels a bit icky?)

I run a CI app myself, and have looked at Firecracker. Right now I'm working on moving some compute to Fly.io and its Machines API, which is well suited for on-demand compute.

[+] alexellisuk|3 years ago|reply
Hey thanks for the interest, this is probably the best resource I have on Firecracker, hope you enjoy it:

https://www.youtube.com/watch?v=CYCsa5e2vqg

For info on actuated, check out the FAQ or the docs: https://docs.actuated.dev

We're running a pilot and looking for customers who want to make CI faster for public or self-hosted runners, want to avoid side-effects and security compromise of DIND / sharing a Docker socket or need to build on ARM64 for speed.

Feel free to reach out

[+] ridiculous_fish|3 years ago|reply
The article does not say what a MicroVM is. From what I can gather, it's using KVM to virtualize specifically a Linux kernel. In this way, Firecracker is somewhat intermediate between Docker (which shares the host kernel) and Vagrant (which is not limited to running Linux). Is that accurate?

Is it possible to use a MicroVM to virtualize a non-Linux OS?

[+] alexellisuk|3 years ago|reply
Thanks for the feedback.

That video covers this in great detail. Click on the video under 1) and have a watch; it should answer all your questions.

I didn't want to repeat the content there

[+] f0e4c2f7|3 years ago|reply
This seems pretty interesting to me. I haven't messed with firecracker yet but it seems like a possible alternative to docker in the future.
[+] alexellisuk|3 years ago|reply
It is, but is also a very low-level tool, and there is very little support around it. We've been building this platform since the summer and there are many nuances and edge cases to cater for.

But if you just want to try out Firecracker, I've got a free lab listed in the blog post.

I hear Podman desktop is also getting some traction, if you have particular issues with Docker Desktop.

[+] kernelbugs|3 years ago|reply
Would have loved to see more of the technical details involved in spinning up Firecracker VMs on demand for Github Actions.
[+] alexellisuk|3 years ago|reply
Hey thanks for the feedback. We may do some more around this. What kinds of things do you want to know?

To get hands-on, you can run my Firecracker lab that I shared in the blog post; then adding a runner can be done with "arkade system install actions-runner"

We also explain how it works here: https://docs.actuated.dev/faq/

[+] Sytten|3 years ago|reply
Wondering if it would be possible to run macOS. The hosted runners of GitHub Actions for macOS are really, really horrible; our builds easily take 2x to 3x more time than on hosted Windows and Linux machines.
[+] rad_gruchalski|3 years ago|reply
Congratulations on the launch.

The interesting part of this is that the client supplies the most difficult resource to get for this setup. As in, a machine on which Firecracker can run.

[+] alexellisuk|3 years ago|reply
Users provide a number of hosts and run a simple agent. We maintain the OS image, Kernel configuration and control plane service, with support for ARM64 too.
[+] imachine1980_|3 years ago|reply
Really cool! What is the license? Is there any way I can contribute code/tests/documentation to this project?
[+] a-dub|3 years ago|reply
this is cool. throwing firecracker at CI is something i've been thinking about since i first read about firecracker.

i was thinking more along the lines of, can you checkpoint a bunch of common initialization and startup and then massively parallelize?

[+] deltaci|3 years ago|reply
congratulations on the launch. it looks pretty much like a self-hosted version of https://buildjet.com/for-github-actions
[+] alexellisuk|3 years ago|reply
Thanks for commenting.

It seems like BuildJet is competing directly with GitHub on price (GitHub has bigger runners available now, pay per minute), and GitHub will always win because its owner, Microsoft, also owns Azure, so I'm not sure what their USP is and worry they will get commoditised and then lose their market share.

Actuated is hybrid, not self-hosted. We run actuated as a managed service and scheduler, you provide your own compute and run our agent, then it's a very hands-off experience. This comes with support from our team, and extensive documentation.

Agents can even be cheap-ish VMs using nested virtualisation, you can learn a bit more here: https://docs.actuated.dev/add-agent/