
How Containers Work: Overlayfs

582 points | saranshk | 6 years ago | jvns.ca

72 comments

[+] Cedricgc|6 years ago|reply
I enjoyed this blog post. Julia does a great job of distilling an idea down with examples.

I am fairly comfortable with Linux as a user for things like understanding processes, ports, key files and utilities, etc. The way I understand how to model abstractions like containers is to know the various OS primitives like cgroups, changing root, network isolation. Once one sees how those pieces come together to create the container abstraction, they can be mapped to the system calls provided by the OS. Usually they also have utilities bundled (like `chroot`) to interface with those primitives as an operator.
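As a rough illustration of how those primitives compose, here is a hedged sketch (not any particular runtime's implementation) that combines namespaces, `chroot`, and cgroups into a crude "container". It assumes a root filesystem has already been unpacked at `./rootfs` and that you have root privileges:

```shell
# Crude "container" from OS primitives (sketch only; needs root and
# a populated ./rootfs with /bin/sh inside it).

# New PID, mount, UTS, and network namespaces, then change root:
sudo unshare --pid --fork --mount --uts --net \
    chroot ./rootfs /bin/sh

# Resource limits come from cgroups, e.g. with cgroup v2:
#   sudo mkdir /sys/fs/cgroup/demo
#   echo 100M | sudo tee /sys/fs/cgroup/demo/memory.max
```

Real runtimes use the underlying syscalls (`clone`, `pivot_root`, etc.) rather than these wrapper utilities, but the moving parts are the same.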

[+] whytaka|6 years ago|reply
I have been confused about containers for so long but having read your comment and looking up the terms you mentioned allowed me to finally find the right articles that explained containers to me. Thanks!
[+] pcr910303|6 years ago|reply
Hmm... how are Overlayfs and Unionfs different? From the explanation I can't find any differences...

Unionfs: A Stackable Unification File System[0]:

> This project builds a stackable unification file system, which can appear to merge the contents of several directories (branches), while keeping their physical content separate.

> Unionfs allows any mix of read-only and read-write branches, as well as insertion and deletion of branches anywhere in the fan-out.

> To maintain unix semantics, Unionfs handles elimination of duplicates, partial-error conditions, and more.

If it is the same thing (but maybe more maintained, or with more features...), can we implement something like the trip[1][2] package manager on top of Overlayfs?

(Porg is a package manager where all files installed by `make install` are tracked and mounted on a Unionfs layer.)

[0] http://unionfs.filesystems.org

[1] https://github.com/grencez/trip

[2] http://www.linuxfromscratch.org/hints/downloads/files/packag...

[+] Jasper_|6 years ago|reply
They're mostly the same, with different answers to tricky questions. e.g. if I stack filesystems A, B, C, and have the same file /foo/bar in all of the layers, and then do rm /foo/bar, what happens:

1. Does /foo/bar get removed from the topmost layer, exposing the one below?

2. Does /foo/bar get removed from all three layers?

3. Does /foo/bar get replaced with a "tombstone" record to pretend that it was deleted, while still appearing in some of A, B, or C on its own?

These semantics are tricky to get right. During the process of upstreaming unionfs to the kernel, they made some incompatible changes to the model, chose different answers to these questions, and as a result renamed it overlayfs.
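For what it's worth, overlayfs's answer is closest to option 3: deleting a lower-layer file leaves the lower copy untouched and records a "whiteout" in the upper layer. A small sketch (needs root; directory names are arbitrary):

```shell
# Demonstrate overlayfs whiteouts (sketch; requires root).
mkdir lower upper work merged
touch lower/foo
sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work merged

rm merged/foo     # foo disappears from the merged view...
ls -l lower/foo   # ...but the lower copy is untouched
ls -l upper/foo   # upper/foo is now a character device 0,0:
                  # the whiteout ("tombstone") marker
```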

[+] loeg|6 years ago|reply
Note that there is also a BSD UnionFS filesystem doing much the same thing. I don't know what relation it has to the Linux OverlayFS, only that they (superficially) do very similar things.
[+] tyingq|6 years ago|reply
They are very similar. Overlayfs is in the default kernel tree, which would be the biggest difference.
[+] nwellinghoff|6 years ago|reply
I recently made a system that uses overlays to provide workspaces for a complex build process. However, there is significant overhead paid for unmounting and deleting the files in the overlay after all the work is done. I was thinking about changing the system so that I allocate a partition ahead of time, write all the overlays there, and on success just blow away the partition and with it the overlays. This is kind of a pain in the ass. Can anyone suggest a method for rapidly deleting all the data generated by using hundreds of overlays? Maybe BtrFS snapshots would be better? What are the pros and cons? Thank you so much and I apologize for "anything" up front :)
[+] the8472|6 years ago|reply
With recent kernels you can combine overlayfs and btrfs.

btrfs subvolumes/snapshots have their own costs: they can get slow once you accumulate thousands of them (it's fine if you just use a few at a time). But you can create a single btrfs subvolume, store your overlays in there, and then delete the subvolume when you're done.
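Something along these lines (a sketch assuming `/mnt/scratch` is a btrfs mount point; needs root):

```shell
# Create a throwaway subvolume to hold all the overlay directories:
sudo btrfs subvolume create /mnt/scratch/build-123

# ... create lower/upper/work dirs and overlay mounts inside it,
#     run the build, then unmount the overlays ...

# Drop everything in one metadata operation instead of an rm -rf
# that walks every file:
sudo btrfs subvolume delete /mnt/scratch/build-123
```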

[+] viraptor|6 years ago|reply
I'm not sure I understand your use case well, but have you tried lvm's snapshots? That could be the simplest solution.

If you're going to try btrfs, check if your system/tools handle the space overcommit correctly. Some ways to check the available space don't really play well with snapshots. (As in, they report less space available)

[+] tlb|6 years ago|reply
Can you use a Linux tmpfs as the overlay? It's much faster to create/delete files, and you can simply unmount it at the end and its memory is immediately reclaimed.
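A sketch of that setup (needs root; `/srv/base` stands in for whatever the read-only base is). Note the upperdir and workdir must be on the same filesystem, which is satisfied here by putting both in the tmpfs:

```shell
# tmpfs-backed upper layer: all writes live in RAM and are
# reclaimed as soon as the tmpfs is unmounted.
mkdir -p /tmp/ovl merged
sudo mount -t tmpfs -o size=2G tmpfs /tmp/ovl
mkdir /tmp/ovl/upper /tmp/ovl/work

sudo mount -t overlay overlay \
    -o lowerdir=/srv/base,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
    merged

# ... do the work in merged/ ...
sudo umount merged
sudo umount /tmp/ovl   # scratch data gone, memory reclaimed
```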
[+] rzzzt|6 years ago|reply
Tangentially related: are there any good solutions for shipping data with containers (for which the CoW mechanism is not suited particularly well)? Is there a "hub" for volumes?
[+] dfox|6 years ago|reply
I somehow feel compelled to point out that this idea of union/overlay FS layers has nothing to do with containers per se. But on the other hand, it is critical to why containers got popular, as it is what makes the whole thing efficient both in terms of hardware resources and developer time.
[+] seminatl|6 years ago|reply
Yeah the title should be "unionfs: a kernel feature having nothing whatever to do with containers, and some ways to use it" but I guess that's too long :-) . Problem is there is not some central marketing department for Linux that can even tell us what "containers" means. There are lots of people who think they are "using containers" who do not use this style of mount, and there are lots of people using this style of mount who do not consider themselves container users.
[+] mikepurvis|6 years ago|reply
They really don't, and it was funny during that period when you'd see Dockerfiles with all the commands in a single invocation to avoid "bloating" the resulting image with unnecessary intermediate products that ended up deleted.

Maybe it's out there and I've just missed it, but I really wish there were richer ways to build up a container FS than just the usual layers approach with sets of invocations to get from one layer to the next, especially when it's common to see invocations that mean totally different things depending on when they're run (eg "apt update") and then special cache-busting steps like printing the current time.

I know part of this is just using a better package manager like nix, but I feel like on the docker side you could do interesting stuff like independently run steps A, B, and C against some base image X, and then create a container that's the FS of X+A+B+C, even though the original source containers were X+A, X+B, and X+C.

[+] krab|6 years ago|reply
As I understand it, containers are just a set of concepts and kernel features put together to provide an abstraction that's not that different from virtual machines for common use cases.
[+] fulafel|6 years ago|reply
In common usage, "containers" seems synonymous with "what Docker does." Meanwhile, "Docker" itself has blurred in meaning as various other things were called Docker, such as the non-native product being called "Native Docker," or whatever "Docker Enterprise" is (impossible to tell from the landing page description).
[+] djsumdog|6 years ago|reply
True, you can use other stacking filesystems with Docker (I believe it had/has a ZFS driver at one time?). The examples she shows in the comic are just about the filesystem and leave out the Docker pieces, so I'm wondering if this is just one part of a series.
[+] 2019119|6 years ago|reply
I wrote this script[1] a while ago which creates an overlay and chroots into it. It's pretty useful; with it, you can do something like

> overlay-here

> make install (or ./weird-application or 'npm install' or whatever)

> exit

and all changes made to the filesystem in that shell are written to the upperdir instead. So, in the above example, the upperdir would contain files such as upperdir/usr/bin/app and upperdir/usr/include/header.h.

It's useful when

* You want to create a deb/rpm/tgz package, but a makefile does not have support for DESTDIR

* An application writes configuration files somewhere, but you don't know where

* An application would pollute your filesystem by writing files in random places, and you want to keep this application local to "this directory"

* In general when you want to know "which files does this application write to" without resorting to strace

* or when you want to isolate writes, but not reads, that an application does

[1]: https://gist.github.com/dbeecham/183c122059f7ba288397e8c3320...
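The linked gist aside, the core of such a tool might look something like this hedged sketch (not the actual script; needs root). The upper and work directories live on a tmpfs so they don't overlap with the lowerdir (`/`), which modern kernels reject:

```shell
#!/bin/sh
# "overlay-here" style sketch: overlay / and chroot into the merged
# view, so every write lands in the upper layer instead of the real
# filesystem. Requires root.
set -e
scratch=$(mktemp -d)
mount -t tmpfs tmpfs "$scratch"
mkdir "$scratch/upper" "$scratch/work" "$scratch/merged"
mount -t overlay overlay \
    -o lowerdir=/,upperdir=$scratch/upper,workdir=$scratch/work \
    "$scratch/merged"
chroot "$scratch/merged" /bin/sh   # run `make install` etc. in here
umount "$scratch/merged"           # then inspect $scratch/upper
```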

[+] ChrisSD|6 years ago|reply
I'd be wary of that last point depending on what you mean by "isolate". Chroot is not a security feature so the isolation is not perfect. This shouldn't matter if you trust the application but if it could be malicious (or manipulated by something malicious) then you'd want a harder boundary. `pivot_root` perhaps?
[+] xyzzy_plugh|6 years ago|reply
Debian's schroot was made to do pretty much this, though largely obviated by modern container runtimes.
[+] ZoomZoomZoom|6 years ago|reply
There was a practice of using MergerFS/OverlayFS for pooling multiple drives (often by SnapRAID users), but what's still missing (to my knowledge) is some sort of a balancing layer, that could distribute writes.

I got this idea many years ago, when the first personal cloud storage services appeared and offered some limited space for free. I thought it would be nice if I could pool them and fill them taking their different capacities into account. And if I could also stripe and EC them for higher availability...

I still wonder if there's something that can do this and if there isn't I would like to know why, since it looks like a useful and obvious idea.

[+] asdfaoeu|6 years ago|reply
aufs sort of could do that at the file layer. The issue is you run into a bunch of incompatibilities with how applications expect it to work. As soon as you want to start looking at striping and EC, you really need to just go with something like ZFS or btrfs.
[+] koffiezet|6 years ago|reply
This is crucial tech to understand docker-style containers...

I used this to build custom ISO images based on an existing ISO file. The old method mounted the ISO image, rsync'd the entire contents, copied and updated some files, and then created a new image. This took quite a while and (temporarily) wasted a lot of disk space. It was initially sped up by copying the temp ISO to a RAM disk, which presented its own challenges and still wasn't as fast as the eventual solution: using aufs on top of the ISO mount to patch the image. Worked like a charm and sped up the ISO building considerably :)

[+] brokenmachine|6 years ago|reply
What happens if you want to delete a file from the ISO?

edit: oops, I've read the article now, I guess aufs acts the same and creates a tombstone file.

[+] hackerm0nkey|6 years ago|reply
Great article, and a nice style, distilling all this into bite-size chunks.

Is it just me, or is the title a little inaccurate, in the sense that there's more to "how containers work" than overlays? It made me think that it covers more than it actually does, e.g. cgroups, namespaces, etc...

Does anyone know of a more in-depth article on the building blocks of containers, one that lets you build a rudimentary container from scratch to appreciate what goes into building one?

[+] skywhopper|6 years ago|reply
The title just means it's one piece of the puzzle. She's thinking about making a comic about how containers work, and one important piece of that is overlays. So this is that piece.
[+] djsumdog|6 years ago|reply
Yea, it did a great job of covering overlays, but didn't get into how Docker uses a hash value for each overlay piece. Maybe this will be part of a series where she does more of that?

This was posted a few months back on here and it's a cool little tool for seeing how Docker fits layers together:

https://github.com/wagoodman/dive

[+] wooptoo|6 years ago|reply
OverlayFS is pretty useful for day-to-day things like union folders for a media collection spanning different hard drives. It does have a few quirks, like inotify not working properly, so changes need to be watched for on the underlying fs.
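That media-pooling setup can be a lowerdir-only overlay mount, which is read-only by design (a sketch; the disk paths are hypothetical and it needs root):

```shell
# Union of a media collection spread over two drives.
mkdir -p /mnt/media
sudo mount -t overlay overlay \
    -o lowerdir=/mnt/disk1/media:/mnt/disk2/media /mnt/media

# With no upperdir, the merged view is read-only, and inotify
# watches must go on /mnt/disk1 and /mnt/disk2 directly.
```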
[+] marmaduke|6 years ago|reply
A lot of Docker's utility comes from incremental (cached) builds based on overlays, but in fact you can get the same from any CoW system, such as LVM/ZFS/BtrFS snapshots.
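The same copy-on-write pattern with LVM, as a rough sketch (volume group and volume names are hypothetical; needs root):

```shell
# Snapshot a base volume, work against the snapshot, throw it away:
sudo lvcreate --size 1G --snapshot --name build-snap /dev/vg0/base
sudo mount /dev/vg0/build-snap /mnt/build

# ... modify /mnt/build freely; the base volume is untouched ...

sudo umount /mnt/build
sudo lvremove -y /dev/vg0/build-snap
```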
[+] jzl|6 years ago|reply
In fact, this is explained in the article which also has an interesting anecdote about btrfs.
[+] z3t4|6 years ago|reply
When working with containers, also consider hardware abstractions like virtual machines. Startup time can be optimized from minutes down to milliseconds. And also consider statically linked binaries if all you want is to solve DLL hell.
[+] henesy|6 years ago|reply
s/o to unionfs for plan9 by kvik: http://code.a-b.xyz/unionfs

Userspace implementation of union directories, with control over which half of the union mount gets precedence for certain operations such as file creation, etc.

[+] crtlaltdel|6 years ago|reply
used overlayfs for an embedded linux project! great stuff!