amitprasad | 1 month ago

If you're not looking for GPU snapshotting, the ecosystem is relatively mature. Specifically, CPU-only VM-based snapshotting techniques are pretty well understood. However, if you need GPUs, this is a notoriously hard problem. IIRC Fly was also planning on using gVisor (EDIT: cloud-hypervisor) for their GPU cloud, but abandoned the effort [1].

Kata runs atop many things, but is a little awkward because it creates a "pod" (VM) inside which it creates 1+ containers (runc/gVisor). Firecracker is also awkward because GPU support is pretty hard / impossible.

[1] https://fly.io/blog/wrong-about-gpu/

Imustaskforhelp|1 month ago

Ohh, this makes sense now. Firecracker is good for compute-related workflows, but gVisor is better suited to GPU-related workflows, gotcha.

For my use cases it's usually Firecracker, but I can now see why a company like Modal would use gVisor, because they focus a lot (and I mean a lot) on providing GPU access. I think it's one of their largest selling points. For them, compute is a secondary concern, and gVisor's compute performance hit is a trade-off well worth making.

Thanks for trying to explain the situation!