zeotroph | 1 year ago
Since everyone is treating containers as cattle, CRIU doesn't seem to get much attention, which might be why a video and not a blog post was my first introduction.
yencabulator|1 year ago
Nah, it's more like "I don't trust that thing to not cause weird behavior in production".
VM-level snapshots are standard practice[1] because the abstraction there is right-sized for being able to do that reliably. CRIU isn't, because it's trying to solve a much harder problem.
[1]: And even there, beware of cloning running memory state: you can get weird interactions from two identical parties, separated by time, trying to talk to the same third service. Cloning disk snapshots is much safer, and even there you can screw up because of duplicate machine IDs, crypto keys, nonces, etc.
__turbobrew__|1 year ago
I'm sure there are some niche applications for container checkpointing, but I don’t really see the complexity being worth it. Maybe checkpointing some long-running batch jobs could save you some money, but you should just make your jobs checkpoint their state to an external store such as Ceph or S3 and make the jobs smart enough to load any state from those stores if they are preempted.
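To make the application-level approach concrete, here is a minimal sketch of a batch job that checkpoints its own progress and resumes after a preemption. Everything in it is an assumption for illustration: `FileStore` is a hypothetical stand-in that writes to a local directory where a real job would talk to Ceph or S3, and `run_job`, `checkpoint_every`, and `stop_after` (which only simulates a preemption) are made-up names.

```python
import json
import os
import tempfile


class FileStore:
    """Hypothetical stand-in for an external store such as Ceph or S3.

    It writes JSON files into a local directory so the sketch stays runnable.
    """

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, key, state):
        # Write to a temp file and rename it into place, so a preemption
        # in the middle of a write never leaves a torn checkpoint behind.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, os.path.join(self.root, key))

    def load(self, key):
        try:
            with open(os.path.join(self.root, key)) as f:
                return json.load(f)
        except FileNotFoundError:
            return None  # no checkpoint yet: start from scratch


def run_job(store, key, items, checkpoint_every=100, stop_after=None):
    """Sum `items`, checkpointing progress every `checkpoint_every` items.

    `stop_after` simulates a preemption: the job returns None after that
    many items without saving, as if the machine had vanished.
    """
    state = store.load(key) or {"next_index": 0, "total": 0}
    processed = 0
    while state["next_index"] < len(items):
        if stop_after is not None and processed >= stop_after:
            return None  # simulated preemption, unsaved work is lost
        state["total"] += items[state["next_index"]]
        state["next_index"] += 1
        processed += 1
        if state["next_index"] % checkpoint_every == 0:
            store.save(key, state)
    store.save(key, state)  # final checkpoint marks the job complete
    return state["total"]
```

Running the job once with `stop_after` set and then again without it picks up from the last saved checkpoint. The work done since that checkpoint is repeated, but it is not double counted, because the running total is saved together with the index it corresponds to. That pairing of position and accumulated result in one atomic write is the part the job has to get right, and it is the piece CRIU would otherwise try to capture from outside the process.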
yourapostasy|1 year ago
Hopefully though, my trepidation is wrong. What is the most complex piece of software others have run under CRIU in production, and for how long?
znpy|1 year ago
small nit: podman is not a docker fork, it's a completely different codebase written from scratch
cpuguy83|1 year ago
JeremyNT|1 year ago
Yeah, I guess that's probably the reason. If you're engineering your workloads with the idea that the world might "poof" out from under you at any moment you'd never wonder about / reach for something like CRIU.
It's a trick that I'd never much thought about, but now that I've learned it exists (so many years late) I find myself wondering about the path not taken here. It feels like it should be incredibly useful... but I can't figure out exactly what I'd want to do with it myself.
yourapostasy|1 year ago
Check out mainframes and Tandem systems for a peek at that path. Lots of support in those systems for the notion that your application’s substrate might suddenly go poof, and you need it to recover from where it left off as instantaneously as possible.
It’s expensive.