
The What, Why and How of Containers

226 points | ben_s | 1 year ago | annwan.me

128 comments


cwillu|1 year ago

Computing is an endless cycle of inventing ways to isolate code in a private machine, followed by inventing ways to make it easier for those machines to interoperate.

npteljes|1 year ago

Absolutely. I feel like society goes through changes in a similar cyclic way. We, as humans, have a finite span of understanding and attention, and so we create cycles that are longer than that.

cjk2|1 year ago

Don't forget an endless cycle of inventing ways to make debugging and problem solving harder by adding isolation boundaries and complexity :)

lioeters|1 year ago

> inventing ways to isolate code in a private machine

Reminds me of how Alan Kay described OOP as communicating objects, where each object is a kind of computer.

"I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages."

pjmlp|1 year ago

It is like the endless cycle of "microservices" since distributed computing was invented, after computer networks came to be.

nurple|1 year ago

This is a really interesting way to think about the progression.

As a timeline I like to plot the ratio of users to isolated compute. We've moved through points like users per building, users per room, users per computer, computers per user, kernels per user, and processes per user.

Containers enabled the latest shift.

begueradj|1 year ago

That's a wise statement.

adamgordonbell|1 year ago

If you use chroot to run something, it's interesting how the set of dynamic libs you need to put in place grows until you are mirroring a whole Linux in a subtree. It gives you a sense for how you end up with containers.

One thing that is wild to me is how nix solves this problem of things needing to be linked together. It doesn't solve it with containers, but by rewriting the location of the links in the executable to point into the nix store. You can run ldd and see it in action.

To me, all that points at containers being in some way a solution to dynamic linking. And maybe an over-the-top solution.

Should we be doing more static linking? Not even depending on libc? What are the challenges with that?
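A minimal sketch of that chroot spiral, assuming a glibc-based Linux where `ldd` is available (the /tmp/jail path is made up for illustration):

```shell
# Copy a single binary into a jail, then mirror every shared library
# ldd reports into the same subtree; repeat this per binary and the
# subtree starts to look like a whole Linux install.
mkdir -p /tmp/jail/bin
cp /bin/ls /tmp/jail/bin/
for lib in $(ldd /bin/ls | grep -o '/[^ )]*'); do
  mkdir -p "/tmp/jail$(dirname "$lib")"
  cp "$lib" "/tmp/jail$lib"
done
# With the libraries mirrored into place, the binary runs jailed:
# sudo chroot /tmp/jail /bin/ls /
```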

nurple|1 year ago

Containers are a solution to dependency management for sure. In the world of the FHS, the dynamic linker is meant to solve a number of problems, like space saving and security updates, at the OS level by discovering dynamic deps installed in special library paths. One thing I've never liked about FHS is how everything is organized by kind.

The interesting thing about how Nix approaches the problem is to replace the concept of FHS almost entirely (only a couple of binaries are linked to /) by hijacking PATH and the linker configs like you mentioned. The biggest difference being that the whole version-pinned dep tree is encoded in a nix package (and in the linker config of the binaries it produces) rather than just the package itself.

At some level you could say there is no "dynamic" runtime linking in nix at all: instead of the linker discovering partially specified deps at runtime, all of the link bindings happen at build time.

The FHS did attempt to solve the issue of multi-version dependencies with an interesting naming and symlink setup, but deps are usually still bound by fairly loose version constraints (like major version). Containers are a lot more like nix in this way, where deps are "resolved" at build time by the distro's package manager by virtue of controlling the process' filesystem.

This is one major issue with the reproducibility of container builds: the distro package managers are not deterministic, so you could run a build back-to-back and get different deps depending on your timing (yes, even between test and build CI steps).
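One way to see that build-time binding, assuming binutils' `readelf` is installed: a Nix-built binary bakes absolute /nix/store paths into its RUNPATH, while a typical FHS binary leaves that field empty and relies on the system search paths.

```shell
# Dump the dynamic-section entries the runtime linker consults.
# On a Nix-built binary, RUNPATH points into /nix/store; on an FHS
# distro it is usually absent and /lib, /usr/lib etc. are searched.
readelf -d "$(command -v ls)" | grep -E 'NEEDED|RPATH|RUNPATH'
```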

bayindirh|1 year ago

Disclaimer: I'm not a strong containerization proponent.

The good part of containers is that you isolate the thing you're running. I'm very much against resource waste, but if I can spend 90MB on a container image instead of installing a complete software stack to run a task which executes weekly and runs for 10 minutes, I'd prefer that. Plus, I can create a virtual network and storage stack around the container(s) if I need to.

Case in point: I use imap-backup to backup my e-mail accounts, but it's a ruby application and I need to install the whole stack of things, plus the gems. Instead I containerize it, and keep my system clean.

Nix is something different and doesn't solve the "many foreign processes not seeing each other on the same OS" problem. Heck, even Docker doesn't solve all problems, so we have "user level containers" which do not require root access and are designed to run on multitenant systems.

tutfbhuf|1 year ago

One issue with static linking is that your dependencies will likely have critical CVEs over time. If you keep all your libraries separate on the filesystem, you can just do an "apt update; apt upgrade" and you will have all the latest patches. This will patch security issues in e.g. libssl or libc for all your applications that are dynamically linked against these shared libraries, which can be quite a few. In static binaries, the version of the libraries is not obvious from the outside. If you have, for example, 100 fully static binaries, these can come with 100 different major/minor/patch-level versions of their dependencies. You now have to patch each binary separately, upgrading and recompiling 100 times, which requires much more time and energy.
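That upside of shared libraries can be made concrete with a quick scan, a sketch assuming dynamically linked binaries under /usr/bin:

```shell
# Ask which binaries would be patched by a single libssl upgrade.
# A static binary would show nothing here; its embedded library
# versions are only discoverable by rebuilding or inspecting
# build metadata.
for bin in /usr/bin/*; do
  if ldd "$bin" 2>/dev/null | grep -q 'libssl'; then
    echo "$bin"
  fi
done
```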

jayd16|1 year ago

Containers are powerful because they solve many computing issues, one of which is being able to act as a (lower case c) static container for dynamically linked apps as well as for cross-language, multi-executable meta-apps.

Containers also provide many forms of isolation (network, file system, etc.), a modern versioning and distribution scheme, and composability (use another container as a base image).

All of these things can, and perhaps should, be done at the language level as well, but containers also work across languages, across linking paradigms, and with existing binaries.

nonameiguess|1 year ago

I haven't seen any other response mention it yet, but containers are also heavily used for web-exposed services in part because of address space and port contention. Network namespaces allow you to graft an overlay network onto your physical network in a relatively simple and easy way (not that it's actually easy, but networking never is).

Otherwise, sure, nix can rewrite the RPATH in your ELF file to make it pull dynamic libs from the nix store, but what does it do when two processes both want to listen on ports 80 and 443?

Possibly, if the Internet ever actually goes pure IPv6, one LAN will have enough addresses to assign one to each process instead of each host.

There are, of course, other ways to handle it. People used vhosts, predominantly defined in a dedicated web server that was really only a reverse proxy, but now you need nix and nginx. Then you discover you also want resource isolation. Is there a userspace alternative to cgroups? I don't see how there could even in principle be an alternative to PID/UID namespaces and UID/GID submapping. Some things have to happen in the kernel, and that means containers of some sort. It doesn't have to be the exact OCI standard that grew out of Docker and, later, Kubernetes, but some kind of container.
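The port-contention point can be poked at directly with an unprivileged user+network namespace, assuming the kernel permits them (no root needed):

```shell
# A new network namespace starts with an empty stack: only a down
# loopback device, its own port table, and no view of the host's
# listeners, so a process inside can bind 80/443 freely.
unshare --user --map-root-user --net ip link show
```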

Already__Taken|1 year ago

Dynamic linking to me seems like solving a problem that filesystems should: deduplicating data.

silverquiet|1 year ago

Funny - it looks like you’re being downvoted for asking what I think is a very natural question. It’s one I’ve asked before; have we just created a more elaborate statically-linked executable via containerization? In the end, Docker/OCI seems like the universal Linux package manager.

I’m sure I don’t have the full picture since I’m far more ops than dev though.

pjmlp|1 year ago

That wasn't the deal in HP-UX vaults, Solaris Zones, System 360/MVS virtualization,...

The way they are used on GNU/Linux is indeed a solution to GNU/Linux's software distribution issues on a highly fragmented landscape.

kqr|1 year ago

I wish I had read this article a decade ago. For many years I have been wondering "why the heck would I use containers when I have chroot, cgroups and namespaces?"

Turns out that's exactly what containers are a packaging of! And I only found out about two years ago.

Although this article doesn't go into it, the benefit I've found of using containers rather than rolling isolation by hand is that a lot of semi-standardised monitoring, deployment, and workload management tooling expects things to come packaged as containers.

otabdeveloper4|1 year ago

> Turns out that's exactly what containers are a packaging of!

Well, no. When people say "containers", they always mean "Docker".

And Docker also comes with a daemon with full root permissions and ridiculous security policies. (Like, for example, forcefully turning off your machine's firewall, #yolo. WTF!)

P.S. I actually run systemd-nspawn in production, but I am probably the only person on earth to do so.

disconnect3d|1 year ago

It's a nice blog post but it still misses a few important building blocks without which it would be trivial to escape a container running as root.

Apart from chroot, cgroups and namespaces, containers are also built upon:

1) Linux capabilities - these split the privileges of the root user into "capabilities", allowing you to limit the actions a root user can perform (see `man 7 capabilities`, `cat /proc/self/status | grep Cap` or `capsh --decode=a80425fb`)

2) seccomp - which is used to filter the syscalls, and their arguments, that a process can execute (fwiw Docker renders its seccomp policy based on the capabilities requested by the container)

3) AppArmor (or SELinux, though AppArmor is the default) - an LSM (Linux Security Module) used to limit access to certain paths on the system and to certain syscalls

4) masked paths - container engines bind-mount certain sensitive paths so they can't be read or written to (like /proc/sysrq-trigger, /proc/irq, /proc/kcore etc.)

5) NoNewPrivs flag - while not enabled by default (e.g., in Docker), this prevents the user from gaining more privileges (e.g., suid binaries won't change the uid)
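For building block 1, a quick way to poke at capabilities on your own system (the mask passed to capsh is the example value from above; yours will differ):

```shell
# Print the capability bitmasks of the current process; CapEff is the
# set actually in effect.
grep Cap /proc/self/status
# Decode a mask into capability names (requires the libcap tools,
# which may not be installed, hence the fallback):
capsh --decode=a80425fb 2>/dev/null || true
```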

If anyone is interested in reading more about those topics and the security of containers, you may want to read a blog post [0] where I dissected a privileged Docker escape technique (note: with --privileged, you could just mount the disk device and read/write to it) and the slides from a talk [1] I gave which detail the Docker container building blocks and show how we can investigate them.

[0] https://blog.trailofbits.com/2019/07/19/understanding-docker...

[1] https://docs.google.com/presentation/d/1tCqmGSOJJzi6ZK7TNhbz...

nurple|1 year ago

Excellent info! I started head-deving a project similar to nix-snapshotter[0] and I was thinking "ok, I can probably just build a CRI impl that builds a rootfs dir with nix and shell out to bubblewrap to make a 'container'".

But once I went through that mental exercise I started reading the code in containerd and cri-o. Wow, these are _not_ simple projects; containerd itself has a full GRPC-based service registry for driving dynamic logic via config.

One thing I was pretty disappointed about is how deeply ingrained OCI images are in the whole ecosystem. While you can replace almost all functional parts of the runtime, you can't really replace the concept of images. I think images are a poor solution to the problem they solve, and a big downside of this is a bunch of complexity in the runtimes trying to work around how images work (like remote snapshotters).

[0] https://github.com/pdtpartners/nix-snapshotter

mikewarot|1 year ago

Containers are a bad take on a solved problem. The problem was encountered, studied[0] and solved, decades ago.

During the Vietnam conflict, the Air Force needed to plan missions with multiple levels of classified data. This couldn't be done with the systems of that era, which resulted in research and development of multi-level security, the Bell-LaPadula model[2], and capability-based security[1].

Conceptually, it's elegant, and requires almost no changes in user behavior while solving entire classes of problems with minimal code changes. It's a matter of changing the default from all access to no access, all the way down to the kernel.

[0] https://csrc.nist.rip/publications/history/ande72.pdf

[1] https://en.wikipedia.org/wiki/Capability-based_security

[2] https://en.wikipedia.org/wiki/Bell%E2%80%93LaPadula_model

remram|1 year ago

Containers are not a security mechanism, they are a deployment mechanism.

simpaticoder|1 year ago

Conceptually, I've come to think of containers as a kind of "known-good starting point", the origin of a coordinate system where "movement" is adding things. A set of Dockerfiles forms a trie where each line of a Dockerfile is a node along one of that trie's branches. The great benefit of containers is that they allow you to reach any possible point in the space for a single process, without affecting any other. The other features of containers are, to me, secondary: things like container images, or even access or resource control. The main draw of the tool is giving the user a declarative way to move reliably and repeatedly through system-space, and to do so for any number of processes. (The main cost is the ~20% overhead such a system incurs.)

Zambyte|1 year ago

Unfortunately this skips over the history of microkernels, which solve the same problems in a much more elegant way than containers.

mati365|1 year ago

Can you elaborate?

ahepp|1 year ago

If this struck your interest, but you want more nitty gritty examples and details, you may find the following article interesting: https://ericchiang.github.io/post/containers-from-scratch/

If I'm remembering correctly from when I ran through the instructions at home, it was written for the original cgroup sysfs interface rather than the more modern cgroup2 [0]. You can figure out which you're running with

> mount | grep cgroup
> cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

which turns the examples into a nice "check your understanding".

[0]: https://docs.kernel.org/admin-guide/cgroup-v2.html#basic-ope...

the_duke|1 year ago

Great high level explanation.

Note: if you look into the details of how setting up namespaces and cgroups works you'll run away in horror. The APIs are very iteratively evolved piecework, not really a coherent(ly designed) abstraction.

zinodaur|1 year ago

I'm still not sold on the "why" wrt kubernetes. I hate that my resource-hog map-reduce jobs run on the same kernel and contend for the same resources as my user-facing live site service.

kube-system|1 year ago

That is one of the reasons why. Containers share the kernel and system resources. When you want to start running a bunch of containers in a particular configuration, that's when you'd use a container orchestration tool like kubernetes to define how and where you want those containers to run across multiple systems.

While you could schedule containers manually, or just run your application on VMs or hardware manually, something like kubernetes will let you define rules which it will dynamically evaluate against your infrastructure. You can instruct kubernetes to run your map reduce jobs on different nodes than your user-facing site... and you can give kubernetes an arbitrary number of nodes to work with, and it can scale your workloads for you automatically while also following your rules.

EraYaN|1 year ago

But kubernetes has very good support for segmenting applications and long-running processes; you don't even have to segment the nodes, you can just "let it happen" (although you should probably segment the nodes somewhat). You can set (anti-)affinity, for example, to make applications not tolerate each other when scheduled. And there are quite a few more knobs the scheduler has that you can tune.
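The (anti-)affinity knob mentioned looks roughly like this as a pod-spec fragment; the `app: live-site` label is hypothetical:

```yaml
# Keep this workload off any node already running a pod labeled
# app: live-site (e.g. batch jobs avoiding the user-facing service).
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: live-site          # hypothetical label
        topologyKey: kubernetes.io/hostname
```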

jason2323|1 year ago

Is there a guide around that teaches you to build a container from scratch with chroot, namespaces and cgroups?

HeyLaughingBoy|1 year ago

As someone whose primary area of development is embedded systems and has never used a Container, I really like this ELI5 explanation.

dzonga|1 year ago

Something that is nice in the container world, and better than Docker, is LXC containers, but the steward of the project, Canonical, seems to have done a bad job with it. Last time I played with LXC the UX was clunky.

If you could have the automation/configuration of Docker/Podman for LXC, that would be nice.

hunter2_|1 year ago

I've used Proxmox to manage my LXC workloads for years and it's been great, although I'm unaware to what extent it meets your criteria of offering automation. I find its interface to do roughly what a VM host (VirtualBox, VMWare, etc.) can do, but for LXC containers (and QEMU VMs) instead of VMs.

pjmlp|1 year ago

Misses HP-UX Vault, Solaris Zones, Aix LPAR, and whatever IBM was doing with System 360 and MVS.

lysecret|1 year ago

Oh nice I really like this style of explanation.