
Show HN: Sub-millisecond VM sandboxes using CoW memory forking

310 points | adammiribyan | 12 days ago | github.com | reply

I wanted to see how fast an isolated code sandbox could start if I never had to boot a fresh VM.

So instead of launching a new microVM per execution, I boot Firecracker once with Python and numpy already loaded, then snapshot the full VM state. Every execution after that creates a new KVM VM backed by a `MAP_PRIVATE` mapping of the snapshot memory, so Linux gives me copy-on-write pages automatically.

That means each sandbox starts from an already-running Python process inside a real VM, runs the code, and exits.

These are real KVM VMs, not containers: separate guest kernel, separate guest memory, separate page tables. When a VM writes to memory, it gets a private copy of that page.
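The CoW trick here is plain Linux `mmap` semantics rather than anything exotic. A minimal Python sketch of the same mechanism, with a small temp file standing in for the saved guest memory (this is an illustration, not the project's actual code):

```python
import mmap
import os
import tempfile

# A small file standing in for the saved guest-memory snapshot.
fd, path = tempfile.mkstemp()
os.write(fd, b"A" * mmap.PAGESIZE)

# MAP_PRIVATE = copy-on-write: reads share the page cache with every
# other mapping, but the first write to a page gives this mapping its
# own private copy of that page.
snap = mmap.mmap(fd, mmap.PAGESIZE, flags=mmap.MAP_PRIVATE)
snap[0:4] = b"fork"               # triggers CoW on this one page

private_view = bytes(snap[0:4])   # the fork sees its own write
os.lseek(fd, 0, os.SEEK_SET)
file_view = os.read(fd, 4)        # the base snapshot is untouched

snap.close()
os.close(fd)
os.unlink(path)
```

Every forked VM can map the same snapshot file this way; only the pages a given fork actually dirties cost it memory.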

The hard part was not CoW itself. The hard part was resuming the snapshotted VM correctly.

Rust, Apache 2.0.

72 comments

order
[+] cperciva|12 days ago|reply
Don't forget about entropy! You've just created two identical copies of all of your random number generators, which could be very very bad for security.

The Firecracker team wrote a very good paper about addressing this when they added snapshot support.
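For illustration only (this shows userspace RNG state, not the guest kernel's entropy pool, which is the harder half of the problem): cloning a seeded generator reproduces the exact same stream in every fork until each one is re-seeded from fresh entropy.

```python
import os
import random

# The "snapshotted" RNG inside the warm base VM.
base = random.Random(1234)
base.random()                      # state has advanced by boot time

# Every fork resumes with an identical copy of that state.
state = base.getstate()
fork_a, fork_b = random.Random(), random.Random()
fork_a.setstate(state)
fork_b.setstate(state)

# Identical state means identical "random" output in every clone.
cloned_same = [fork_a.random() for _ in range(3)] == \
              [fork_b.random() for _ in range(3)]

# Re-seeding each fork from host entropy after resume restores divergence.
fork_a.seed(os.urandom(16))
fork_b.seed(os.urandom(16))
reseeded_same = [fork_a.random() for _ in range(3)] == \
                [fork_b.random() for _ in range(3)]
```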

[+] Retr0id|12 days ago|reply
I suppose it'd be easy enough to re-seed RNGs, but re-relocating ASLR sounds like a pain. (Although I suppose for Python that doesn't matter)
[+] injidup|12 days ago|reply
It's so frustrating seeing all this sandbox tooling pop up for Linux while Windows is soooooo far behind. I mean, Windows Sandbox ( https://learn.microsoft.com/en-us/windows/security/applicati... ) doesn't even have customizable networking whitelists. You can turn networking on or off, but that's about as fine-grained as it gets. So those of us still writing desktop Windows stuff are left without a good way of easily putting our agents in a blast-proof box.
[+] benterix|12 days ago|reply
I don't mean to turn this into a religious war, but honestly, I sometimes wonder what would be the net benefit for humanity if Windows slowly disappeared. And I'm saying this as someone who appreciates the good stuff done by Microsoft in the past (windows 9* UI, decades-long support for Win32 APIs etc.).
[+] CTDOCodebases|12 days ago|reply
Web browsers don't even work properly in Windows Sandbox. There is a bug that hasn't been patched in over a year whereby web browsers can't use the GPU to render a page so all it displays is a white page. Users have to create a configuration file that turns off vGPU and launch Windows Sandbox from that.
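For anyone hitting this, the workaround is launching the sandbox from a `.wsb` configuration file roughly like the following (element names from memory; double-check against Microsoft's Windows Sandbox config docs):

```xml
<!-- no-vgpu.wsb: double-click this file instead of the default
     Windows Sandbox shortcut to launch with the vGPU disabled -->
<Configuration>
  <VGpu>Disable</VGpu>
</Configuration>
```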
[+] BornaP|11 days ago|reply
Feel you. That's why we're actively working on Windows and macOS sandbox support at Daytona, with proper isolation, agent tools, dynamic resizing etc., not just "networking on/off" level controls.

If you're building agents on Windows and want to give it a spin, reach out for early access.

[+] ddtaylor|11 days ago|reply
I guess get busy contacting Microsoft or get busy using Open Source software instead.
[+] RyleHisk|10 days ago|reply
You can run WSL on Windows — then you've got access to all the Linux sandbox tools.
[+] BornaP|11 days ago|reply
Really impressive work. Sub-millisecond cold starts via CoW forking is a pretty clever approach.

The tricky part we keep running into with agent sandboxes is that code execution is just one piece, because agents also need file system access, memory, git, a pty, and a bunch of other tools all wired up and isolated together. That's where things get hairy fast.

[+] crawshaw|12 days ago|reply
Nice to see this work! I experimented with this for exe.dev before we launched. The VM itself worked really well, but there was a lot of setup to get the networking functioning. And in the end, our target use cases don't mind a ~1-second startup time, which meant doing a clean systemd start each time was easier.

That said, I have seen several use cases where people want a VM for something minimal, like a Python interpreter, and this is absolutely the sort of approach they should be using. Lots of promise here, excited to see how far you can push it!

[+] hrmtst93837|12 days ago|reply
The thing people tend to gloss over is how CoW shines until you need to update the base image, then you start playing whack-a-mole with stale memory and hotpatching. Snapshots give you a magic boot, but god help you when you need to roll out a security fix to hundreds of forks with divergent state.

Fast startup is nice. If the workload is "run plain Python on a trusted codebase" you win, but once it gets hairier the maintenance overhead sends you straight back to yak shaving.

[+] indigodaddy|12 days ago|reply
simonw seems like he's always wanting what you describe, maybe more for wasm though
[+] skwuwu|12 days ago|reply
I noticed that you implemented a high-performance VM fork. However, to me, it seems like a general-purpose KVM project. Is there a reason why you say it is specialized for running AI agents?
[+] adammiribyan|12 days ago|reply
Fair question. The fork engine itself is general purpose -- you could use it for anything that needs fast isolated execution. We say 'AI agents' because that's where the demand is right now. Every agent framework (LangChain, CrewAI, OpenAI Assistants) needs sandboxed code execution as a tool call, and the existing options (E2B, Daytona, Modal) all boot or restore a VM/container per execution. At sub-millisecond fork times, you can do things that aren't practical with 100-200ms startup: speculative parallel execution (fork 10 VMs, try 10 approaches, keep the best), treating code execution like a function call instead of an infrastructure decision, etc.
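A sketch of the speculative pattern, where the made-up `run_in_fork` stands in for "fork a sandbox, execute one candidate, score the result" (the candidates and scores below are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_fork(candidate):
    # Stand-in: in the real system this would fork a sandbox VM and
    # execute one candidate snippet inside it, returning a score.
    approach, score = candidate()
    return approach, score

# Invented (approach, score) candidates; lower score = better.
candidates = [
    lambda: ("bubble sort", 120),
    lambda: ("quicksort", 15),
    lambda: ("timsort", 9),
]

# With sub-ms forks, trying all candidates in parallel is cheap.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_in_fork, candidates))

# Keep the best attempt; the losing forks just exit and their CoW pages vanish.
best = min(results, key=lambda r: r[1])
```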
[+] vmg12|12 days ago|reply
Does it only work with that specific version of firecracker and only with vms with 1 vcpu?

More than the sub-ms startup time, the 258 KB of RAM per VM is huge.

[+] adammiribyan|12 days ago|reply
1 vCPU per fork currently. Multi-vCPU is doable (per-vCPU state restore in a loop) but would multiply fork time.

On Firecracker version: tested with v1.12, but the vmstate parser auto-detects offsets rather than hardcoding them, so it should work across versions.

[+] buckle8017|12 days ago|reply
This is how Android processes work (zygote forking), but it's a security problem: it breaks some ASLR-type protections.
[+] indigodaddy|12 days ago|reply
Your write-up made me think of:

https://codesandbox.io/blog/how-we-clone-a-running-vm-in-2-s...

Are there parallels?

[+] deivid|12 days ago|reply
Niiiiiice, I've been working on something like this, but reducing linux boot time instead of snapshot restore time; obviously my solution doesn't work for heavy runtimes
[+] jauntywundrkind|12 days ago|reply
I keep so so so many opencode windows going. I wish I had bought a better SSD, because I have so much swap space to support it all.

I keep thinking I need to see if CRIU (checkpoint restore in userspace) is going to work here. So I can put work down for longer time, be able to close & restore instances sort of on-demand.

I don't really love the idea of using VMs more, but I super love this project. Heck yes forking our processes/VMs.

[+] adammiribyan|12 days ago|reply
CRIU is great for save/restore. The nice thing about CoW forking is it's cheap branching, not just checkpointing. You can clone a running state thousands of times at a few hundred KB each.
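The process-level analogue in Python (plain `os.fork`, not the project's VM fork) shows the branching property: each child gets a copy-on-write view of parent state and can diverge without touching the original.

```python
import os

# Parent builds state once; fork() then gives the child a copy-on-write
# view of it, analogous to forking a snapshotted VM.
state = bytearray(b"base snapshot " * 1024)

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # Child: mutate the "snapshot"; CoW hands it private pages.
    os.close(r)
    state[0:4] = b"FORK"
    os.write(w, bytes(state[0:4]))
    os._exit(0)

# Parent: the child's write never touches this copy.
os.close(w)
child_view = os.read(r, 4)        # what the fork saw: b"FORK"
os.close(r)
os.waitpid(pid, 0)
parent_view = bytes(state[0:4])   # still b"base"
```

Forking thousands of children this way costs only the pages each one dirties, which is why per-clone overhead stays in the hundreds of KB.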
[+] aftbit|11 days ago|reply
Is this a service or a library? The README has curl and an API key... Can I run this myself on my own hardware?
[+] adammiribyan|11 days ago|reply
Both. The engine is open source. You can self-host it on any Linux box with KVM. There's also a live API you can hit right now (curl example in the README). Building the managed service for teams that don't want to run their own infra.
[+] indigodaddy|12 days ago|reply
Does this need passthrough or might we be able to leverage PVM with it on a passthrough-less cloud VM/VPS?
[+] dizhn|12 days ago|reply
I am not sure exactly what you are asking but firecracker does need access to /dev/kvm so nesting needs to be enabled on the VM.
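A trivial preflight check before trying to run Firecracker on a cloud VM (`kvm_ready` is a hypothetical helper name, not part of any tool mentioned here):

```python
import os

def kvm_ready(dev="/dev/kvm"):
    # True if the KVM device node exists and this process can open it
    # read/write, i.e. (nested) virtualization is exposed to this VM.
    return os.path.exists(dev) and os.access(dev, os.R_OK | os.W_OK)
```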
[+] aa-jv|11 days ago|reply
Nice.

Now I want the ability to freeze the VM cryogenically and move it to another machine automagically, defrosting and running as seamlessly as possible.

I know this is gonna happen soon enough, I've been waiting since the death of TandemOS for just this feature...

[+] diptanu|12 days ago|reply
The tricky part of doing this in production is cloning sandboxes across nodes. You would have to snapshot the resident memory, file system (or a CoW layer on top of the rootfs), move the data across nodes, etc.
[+] Rygian|12 days ago|reply
If each node has its own warmed-up VM waiting since startup, there's no need to clone across nodes.
[+] adammiribyan|12 days ago|reply
Agreed, cross-node is the hard next step. For now single-node density gets you surprisingly far. 1000 concurrent sandboxes on one $50 box. When we need multi-node, userfaultfd with remote page fetch is the likely path.
[+] polskibus|11 days ago|reply
Is it possible to run minikube inside ? I’d love to use it for ephemeral clusters for testing .
[+] yagizdagabak|12 days ago|reply
Cool approach. Are you guys planning on creating a managed version?
[+] adammiribyan|12 days ago|reply
The API in the readme is live right now -- you can curl it. Plan is multi-region, custom templates with your own dependencies, and usage-based pricing. Email in my profile if you want early access.
[+] adammiribyan|12 days ago|reply
Thanks! Yes, there's going to be a managed version.
[+] huksley|11 days ago|reply
Any plans to offer self-hosted / open-source version?
[+] wang_cong|10 days ago|reply
No networking inside forks? This is not usable.
[+] adammiribyan|9 days ago|reply
Not yet. Current design is run code, return result. Adding virtio-net to forks is on the roadmap. What's your use case that needs it?