Your Docker image might be broken without you knowing it

[+] xal|12 years ago|reply

The comments about the init process are true. It makes sense to run a proper PID1 system here such as runit.

I'd argue with the rest of the post. The problem is that phusion makes the common mistake of thinking of containers as faster VMs. That's fine, this is where almost everyone starts when first looking at Docker paradigm.

A good rule of thumb is: If you feel like your container should have Cron[1] or SSH[2], you are trying to build a VM not a container.

VMs are something that you run a few of on a particular computer. Containers are something that you will run thousands or tens of thousands on a single server. They are a lot more lightweight and loading them up with VM cruft doesn't help there.

[1] Cron: use the cron of the outer machine with docker run [2] SSH: use lxc attach

[+] FooBarWidget|12 years ago|reply

I disagree with running cron outside the Docker container. One of the reasons for using Docker is to lower deployment pain. The moment you use cron on the host machine you've introduced yet another moving part, and yet another dependency that must be installed on the host.

I also disagree that there is a mistake here involving thinking of containers as faster VMs. Yes, you can think them of applications, but the fact is still that there is a whole operating system running inside the container, and that many apps rely on cron and other stuff. Given that crond is so small and lightweight, and that a lot of people don't know what depends on cron and what not, I think it's better to turn it on by default. If you know for sure that you don't use cron, you can still turn it off.

Remember, the goal of baseimage-docker is to provide a base image that is correct for most people, especially people who are not intimately familiar with the Unix system model.

Lxc-attach, although it works, has several problems:

* You are doing things outside Docker so you won't be able to track it (logs, attach, etc). Also, docker might use LXC right now, but there is no warranty it will do so forever. For example, what if you're using Docker on OS X? No lxc-attach there.

* It does not allow you to limit access. What if you want to give a person only access to a specific container? You can do that with SSH through the use of keys.

* lxc-attach has caveats with --elevated-privileges, documented in the man page.

[+] tinco|12 years ago|reply

I am not sure whether it is a common mistake to see containers as faster VMs. The way we use Docker at Phusion is on one side as sandboxed environments to run our integration tests in, and on the other side as isolated containers to run long running services in (like apps, databases) to lessen administrative overhead. Both use cases work best in an environment that's as much as a fully functional OS as possible.

Anyway, what's wrong with using containers as ultra light weight VMs, isn't the whole idea of the recent VM upmarch to use VMs as application containers?

As long as you keep in mind that the security of LXC hasn't been thoroughly battle test yet, I think it's a fine idea to use docker to build light weight VMs.

[+] rdtsc|12 years ago|reply

Well they are both, it is a weaker VM as far as capabilities, but also have more ability than a single isolated OS process.

There is no need to blackbox the concept un-necessarily. It is an LXC container, read about LXC containers, its capabilities and restrictions. If you are building at core of your infrastructure don't trust simplistic mental analogies from HN.

Then Docker works on top of that. Read about what Docker adds or restricts on top of the base LXC containers.

I also think Docker is a little too wishy washy in their marketing and there are enough people who get confused about what exactly is going on underneath.

From the slides "More technical explanations: Run everywhere, regardless of kernel version [+], regardless of host distro, physical or virtual, cloud or not [o]

[+] kernel version must be 2.6.32+ [o] container and host architecture must match. "

No mention of LXC containers, and that is a "technical" description. Why not just put a mention of the underlying core technology? It might change tomorrow sure and just update.

[+] coldtea|12 years ago|reply

>Containers are something that you will run thousands or tens of thousands on a single server.

Citation needed. Who'd run "thousands of containers" on a single server? Some VPS service? I don't think Docker is meant to be used like those use Xen and the like.

[+] nailer|12 years ago|reply

Fascinating. My first inclination, when I started running Docker, was to run /sbin/init and launch a full systemd and all services.

I even asked on ServerFault (ie, StackOverflow for servers) about it and was told, quite aggressively, that running a full OS is wrong:

http://serverfault.com/questions/573378/how-can-i-persistent...

Addressed individually:

1. Reaping orphans inside the container.

Yup. If your app's parent process crashes, its child processes may now be orphans. However in this case your monitoring should also restart the entire container.

2. Logging.

Assuming you run your docker image in a .service file (which is what CoreOS uses as standard), systemd-journald on the host will log everything as coming from whatever your unit (.service) name is. So if you `systemctl myapp start` output and errors will show up in `journalctl -u myapp` in the parent OS.

3. Scheduled tasks.

For things like logrotate, it really depends whether you're handing logs inside or outside the container. Again, I'd use systemd-journald in CoreOS, rather than individual containers, for logs, so they'd be rotated in CoreOS. For other scheduled tasks it depends.

4. SSHd

It depends. SSH isn't the only way to access a container, you can run `lxc-attach` or similar from the host to go directly to a container.

I do mention CoreOS here because that's what I use, but RHEL 7 beta, recent Fedoras, and upcoming Debian/Ubuntus would all operate similarly.

[+] FooBarWidget|12 years ago|reply

Regarding reaping orphans: orphans do not necessarily imply that something crashed or that something went wrong. Orphans is a very normal part of system operation. Consider an app that daemonizes by double forking. Double forking a common technique with the intention of making the process become adopted by init. It expects PID 1 to be able to handle that correctly.

Regarding logging: that only holds for output that is written to the terminal. There are lots of software out there that never log to the terminal and log only to syslog.

As for all the other stuff: it is up to debate whether they should be handled inside or outside the container. The right solution probably varies in a case-by-case basis. Baseimage-docker's point is not to tell that everyone must do things the way it does. It's to provide a good and useful default, for the people who have no idea that these issues exist. If you are aware of these issues and you think that it makes sense to do things in another way, go ahead.

[+] markbnj|12 years ago|reply

I've only been working with Docker for a couple of months, and I find this discussion really interesting. The goal of trying to get containers to behave more like a full system across various lifecycle events is somewhat orthogonal to my own aims, which have been to get my containers as close to stateless as I can.

Like some other posters here I view containers less as a lightweight VM, and more as a process sandbox. In the context of a scalable architecture I would like a container to represent a single abstract component, which can be spun up (perhaps in response to autoscaling events), grabs its config, connects to the appropriate resources, streams its logs/events out to sinks, reads and writes files from external volumes, and runs until it faults or you shut it down.

Ideally there would be nothing inside the container at shutdown that you care about. After shutdown the container, and potentially the instance it was running on, disappear. Spinning up another one is a matter of launching a new container from a reference image.

So far, in cases where I have needed daemons running in the container, I have pointed my CMD at a launch script that starts the appropriate services, and then launches the application components, typically using supervisord. That has worked fine, but I admit to not understanding the PID1 issue well-enough up to this point.

[+] FooBarWidget|12 years ago|reply

Baseimage-docker does not imply that your container becomes stateful. Using services like cron and SSH do not imply statefulness.

I also think that the container should be as stateless as possible. When state is necessary, it can be saved at a bind mounted directory.

The point of baseimage-docker is to ensure that the system works correctly. See its description about the role of PID 1. It has got nothing to do with the statefulness discussion.

[+] josh-wrale|12 years ago|reply

Cross-distro support notwithstanding, why not just skip Docker, LXC and VMs. Instead, use cgroups on bare-metal to make processes behave. On that note, forget bridging, use SR-IOV virtual functions with VLANs for QoS and _Profit_.

Edit: It seems this comment has been voted down. I think perhaps this is seen as irrelevant, but I would disagree, because Docker uses LXC and masks its function in much the same way as LXC uses cgroups and masks their function. cgroups can be used to achieve similar goals without these many layers of abstraction. In this way, I believe this comment to be relevant to the discussion of full vs. application containers on Linux. There are certainly many reasons for using containers, but one of the leading reasons is process limits (e.g. RAM, network namespace). Limiting process usage of those resources, using only cgroups, is quite easy in comparison to all Phusion has gone through here to something with similar (though admittedly different) aims. Example: http://www.andrewklau.com//controlling-glusterfsd-cpu-outbre...

Edit 2: I would also appreciate constructive criticism. That is, I've been downvoted without useful feedback. Specific feedback as to what is wrong with my comment would enable me to contribute more constructively to this discussion. Without such feedback, I believe the downvote can be seen as a simple and tribal "go away".

[+] pekk|12 years ago|reply

Constructive criticism, about how this sounds: it isn't clear that what you propose is actually more valuable than using Docker. It sounds like it's complex and requires a lot of manual intervention. It doesn't sound like your alternative covers Docker's use cases.

Your idea may need to be more fleshed out, but at a minimum it needs to be explained in a way that makes it clear why most users of Docker would see a significant benefit to use your approach instead.

[+] derefr|12 years ago|reply

> There are certainly many reasons for using containers, but one of the leading reasons is process limits

On the other hand, the main reason people tend to use Docker, as far as I know, is not anything to do with quotas or limits; it's guaranteed reproducibility of deploys. (The same thing you get on Heroku with "slugs", etc.)

[+] wmf|12 years ago|reply

The layers of abstraction in Docker result in something that's a lot simpler to use than managing cgroups manually. Also, I suspect people care more about the cross-distro support and configuration isolation that Docker provides than resource management.

[+] Eiwatah4|12 years ago|reply

My experience with cgroups is that it's incredibly difficult to get them to do what you want them to do. But systemd seems to be changing that, so maybe their use will get more mainstream soon.

[+] nl|12 years ago|reply

I think those many layers of abstraction are something many see as a feature, not a problem. I certainly appreciate the clean abstraction Docker provides to LXC.

[+] philips|12 years ago|reply

You really should not run ssh in your containers. If you have a ton of containers then key management and security updates of SSH will be a pain. There are two tools that can easily help out:

- nsenter lets you pick and chose what namespaces you enter. Say the host OS has tcpdump but your container doesn't. Then you can use nsenter to enter the network namespace but not the mount namespace: sudo nsenter -t 772 -n tcpdump -i lo

- lxc-attach will let you run a command inside of an existing container. This is lxc specific I believe and probably not a great long term solution. But, most people have it installed.

[+] ewindisch|12 years ago|reply

I disagree with the premise that using Docker to run individual processes is "wrong". Phusion is doing a disservice by suggesting as such. There ARE use-cases where such a base-image is useful, but I believe these should be the uncommon case, not the common one. Even yet, if running multiple processes in a container is needed, it's preferable to use Docker-in-Docker.

I suppose part of the problem is the two benefits of Docker and containerization are frequently confused. Docker provides portability and build bundling, but ALSO provide loose process isolation. You should want to take advantage of that process isolation and by doing so, should want to run SSH or cron in their own containers, not in a single container with your application process. If your application has multiple processes, each should have their own containers. These containers can be linked and share volumes, devices, namespaces, etc. Granted, some of the functionality one might desire for this model is still missing or in development, but much of it is there already and that's the model I aspire Docker to follow.

It might also be to some degree a matter of legacy versus green-field applications. For instance, I've been deploying OpenStack's 'devstack' developer environment (which forks dozens of binaries) inside of a single Docker container. In this case, the Phusion base-image might make sense. However, the proper way of using Docker would be to run dozens of containers, each running a single service.

The reason I don't do this is because the OpenStack development/testing tools provide this forking and enforce this model, using 'screen' as a pseudo-init process. From the Docker perspective, this is a legacy application. I could and probably will change those development tools to create multiple containers, but until then, it's easiest to stick to a single container.

[+] tinco|12 years ago|reply

    I disagree with the premise that using Docker to run individual processes is "wrong". Phusion is doing a disservice by suggesting as such.

This is not the premise of the article. The premise is that someone goes 'from ubuntu; apt-get install memcache; cmd ["memcached"]' and thinks everything is going to be alright, when in reality they've just set up a rather buggy system.

If you're absolutely certain your app is going to be fine running as the sole (PID1) process in the container, then this article has no problem with that. It just says that if you're going to run something you've got from apt-get, then chances are, your system is going to have to be a little more like a Debian system.

[+] bashcoder|12 years ago|reply

Yeah, Phusion oversells their case when they say this is the "right way" to do it. It's one way to do it, and this methodology probably addresses customer support issues they are having. Many of their customers likely misunderstand what's actually going in a container by default.

I'd rather see a more balanced approach that shows a range of options, without opining so much about how Docker containers should or should not be used. Better to fit the solution to a particular use case.

[+] zimbatm|12 years ago|reply

It will work but things are addressed on the wrong level in my opinion.

syslog: each container now has it's own logs to handle. If you want them to be persistent/forwarded it might be better if all containers could share the /dev/log device of the host (not sure of the implications though).

ssh: lxc-attach. Docker should expose that.

zombies: it's a bug in the program to not wait(1) on child processes.

cron: make a separate container that runs cron.

init crashes: bug in the program again. it's possible to use the hosts's init system to restart a container if necessary.

[+] FooBarWidget|12 years ago|reply

lxc-attach: see https://news.ycombinator.com/item?id=7258242 about why I think SSH is more appropriate.

Zombies: this is not about child processes created by the program. It's about child processes created by child processes! For example what if your app spawns another app that daemonizes by double forking? Your PID 1 has to reap all adopted child processes, not just the ones it spawned.

[+] thu|12 years ago|reply

It may be a matter of opinion but advocating to run cron, sshd, and so on in your containers, let alone in every single one by providing a base image to do that seems plain wrong.

Let's take an example. You have Nginx, a web app, and a database. You can put everything in the same container or not. If you choose to put everything in different containers, you will be able to use tools at the Docker level to manage them (e.g. replace one of those processes).

And the fundamental idea is that we expect to have plenty of Docker images around that you can pick and play with, and those Docker-level tools will be able to manage all those things.

Now if you put everything in the same container, you're back to square one, reinventing the tools to manage those individual process. You can say that you don't need to re-invent anything, because you're used to full-fledged operating systems. Still, if you have a nice story to deploy containers on multiple hosts, to send logs across those hosts, and so on, the road will be more straightfoward when you decide to use multiple hosts.

This is about uniformity. I want processes (and containers around them), and hosts, that's it. I don't want additional levels. I don't want processes, arbitrarily grouped inside some VMs (or containers), and hosts. Two levels instead of three.

[+] FooBarWidget|12 years ago|reply

Right, cron and sshd are open for debate, but at the very least you have to make your PID 1 behave correctly by reaping adopted child processes. That is a major part of baseimage-docker.

Baseimage-docker is not advocating putting everything in the same container. It's advocating putting all the necessary, low-level services in the same container. What if your app happens to use a library that needs to schedule things to run periodically using cron? To me it doesn't make sense to split that cron job to another container. The app might physically consist of multiple processes and components, but I think it should logically behave as a single unit.

For stuff like Nginx and the database, it's not so clear what is the right thing to do. It depends your use case. I don't think that putting those major services in the same container is always correct (though it might be), but I also don't think that splitting them out to Docker containers is always the right thing to do.

You say that that putting stuff in the same container puts us back to square one. I think splitting them puts us back to square one. Your base OS already runs all your processes as single units. You have to worry about each one of them separately, resulting in lots of moving parts that all increase deployment complexity. The beauty of Docker should be that you can group things. If you don't group things then why would you be using Docker? You might as well apt-get install your app and have it run as a normal daemon.

One use case where it really really makes sense to put everything in the same container: when distributing an app to end users who have little to no system administration knowledge. For example, what if you want to distribute the Discourse forum software? It depends on Rails, Nginx and PostgreSQL. Users are already having a lot of trouble installing Ruby, running 'bundle install', setting up Nginx and setting up PostgreSQL. Imagine if they can just 'docker run discourse' and it immediately listens on port 80, or whatever port they prefer, with the database and everything already taken care of for them.

[+] unknown|12 years ago|reply

[deleted]

[+] hrjet|12 years ago|reply

Why not just use

CMD ["/sbin/init"]

And start your app through an init.rd script?

The article says "upstart" is designed to be run on real hardware and not a virtualised system. If that is true, then perhaps there is value in baseimage-docker, but details are lacking.

[+] FooBarWidget|12 years ago|reply

So why don't you try it and see whether it works?

One of the things /sbin/init does is checking and mounting your filesystems. But you can't do that in an unprivileged Docker container because you don't have direct hardware access. This is only one example of where things go wrong. The entire init process is full of these kinds of code where it is assumed that there is direct hardware access.

Even when your container is started with -privileged, you still can't do that. The host OS is already controlling the hardware.

Also, /sbin/init usually does not like having SIGTERM sent to it, which is what 'docker stop' does. Depending on the implementation, /sbin/init either terminates uncleanly (causing the entire container to be killed uncleanly) or ignores the signal outright (causing the 'docker stop' timeout to kick in, also causing the container to be killed uncleanly).

[+] pjc50|12 years ago|reply

Old systemv init would work for this purpose, but not upstart.

[+] tomgruner|12 years ago|reply

Docker is a container for running processes, or a process. Containers should be disposable and transient. I have begun to think of it in terms similar to OOP. Images are your Classes. Containers are your class Instances. When you are done with an instance, you discard it and make a new instance. So don't go shoving all kinds of crap into the instance like crons and sshd that don't belong there. Most devs don't expect to have their code be free of memory leaks when it comes to interpreted languages. And docker containers don't need to worry about child processes being stopped - they should just be disposed of and you make a new container from your image. Keeping containers around would be like trying to pickle a python class instance perpetually that has references to who knows what... Just make a new instance when you need it. And just make a new container when you need one. I use named containers and a Makefile that stops and deletes existing containers with the same name before starting a new one.

[+] FooBarWidget|12 years ago|reply

To me, that does not make any sense. Your program executable is already like a Class. A normal process is already like instances of your classes. If all you want is OOP, then why are you using Docker? Your Unix system has been doing that for 30+ years!

[+] damm|12 years ago|reply

I think there's a lot of assumptions made, and there are a lot of assumptions made about your base image.

The Ubuntu base image (or how it's built) can be found https://github.com/tianon/docker-brew-ubuntu

Some excellent examples of how to use them with /sbin/init can be found https://github.com/tianon/dockerfiles/tree/master/sbin-init/...

Not everyone who uses Docker uses CRON, nor considers them long-term containers; rather short term process containers.

Docker is growing and how we use docker will change so be flexible and realize what you considered useful yesterday may not be required tomorrow. We will have to re-learn best practices and keep learning after that.

Note, the Ubuntu image isn't made by Ubuntu. Maybe Phusion should host their own Ubuntu image just for sakes of sakes.

[+] rschmitty|12 years ago|reply

Is there a "explain Docker to me like I'm 5" post?

This seems like the old "I have problems with managing everything I need for my app so I'll just run docker containers. Now I have 2 problems"

[+] tinco|12 years ago|reply

The Linux kernel has some features that make it possible to isolate a process from all (most) system resources, without actually running it inside a VM.

Docker is a tool that makes it easy to launch such isolated processes. You just specify what the filesystem environment should be for the new process, and what process to run in a small file and off it goes.

In theory this could make provisioning easier, having each application come in a Docker container that satisfies its own system level dependencies, and does its service level dependencies over connections to other containers/external hosts.

[+] Ruska|12 years ago|reply

The slideshow on this page gives a decent overview:

https://www.docker.io/learn_more/

[+] peterwwillis|12 years ago|reply

To understand Docker [0] in a complete way you first have to understand a lot of other concepts.

You need to understand what "linux containers" [1] are. To understand that you need to understand cgroups [2]. To understand cgroups you have to understand process groups/sessions [3] and namespace isolation [4], as well as how the kernel implements cgroups. [5] [6]

You need to understand what chroot [7] is. (it just prepends a path to all pathname lookups for a process and its children)

You need to understand what aufs [8] [9] is. To help understand aufs, you need to understand how union filesystems [10] work, how copy-on-write [11] works, and how virtual filesystems work [12] [13] [14].

You need to understand what rsync is [15]. To understand that you need to understand delta compression [16], and how rolling checksums [17] are used.

You should also understand things like how bash works, how processes signal each other and return status on exit, how file descriptors (like stdin/stdout/stderr) work, and other basic UNIXy concepts, to understand how other parts of docker works.

Docker is a frankenstein amalgamation of all these things, working together to allow you to basically run an arbitrary command in a way that is as isolated from your operating system as is practical while still remaining "light weight". Other solutions have other benefits or tradeoffs as evidenced here [18].

[+] Shish2k|12 years ago|reply

If you happen to be a python dev:

Docker is basically like virtualenv, but for everything instead of just for one aspect of one language

[+] brokenparser|12 years ago|reply

So if you run anything other than Ubuntu inside Docker, this is useless because the steps to build your own aren't outlined.

I find Docker to be horribly counter-intuitive and ass-backwards anyway, so not much harm done there as people are in general better off with something else entirely (plain lxc, libvirt, virtualbox, xen, openvz...). I recommend to steer away from it at least until 1.0 is out.

EDIT: I put it in my .plan to build a better BusyBox image aimed at running statically compiled programs with minimal baggage, but I'm not sure when I'll get a round tuit*

*: http://i.ebayimg.com/00/s/NDgwWDY0MA==/z/z-4AAOxyUrZSr82N/$_...

[+] FooBarWidget|12 years ago|reply

Why do you think it isn't outlined? The website explains exactly what the modifications are, what they do, and what they are good for. The Dockerfile is on Github for everyone to see. The website makes explicit mention that the init system is /sbin/my_init, for which the full source is available on Github. It's trivial to take the my_init script and integrate it into your non-Ubuntu container. You can even write your own init system based on the website description if you so choose.

[+] pini42|12 years ago|reply

I think it is not related to Docker itself, but to the fact the it is using all purpose Linux distributions. I'm pretty sure that very soon we will see explosion on new distros addressing exactly these problem and built explicitly for running inside containers.

[+] krakensden|12 years ago|reply

I used to think that that was what CoreOS was, but I have since become confused.

[+] tel|12 years ago|reply

How does this play with the CoreOS premise where each docker should be hosting a single process managed intelligently through something like systemd?

Under this model I'd expect that systemd's pgroup support should help with zombie processes and generally take over many of the services that baseimage-docker is suggesting here. As other have mentioned in this thread, there's a fairly large difference of opinion between running containers like fast VMs or like thin layers around single processes—does baseimage-docker make sense only in the latter?

[+] tinco|12 years ago|reply

baseimage-docker is meant to make it easier to make a correct environment for the processes you run in it, so perhaps make it be more like a fast VM.

From what we've seen the CoreOS people and perhaps the Docker people as well like to see Docker more as a thin layer around processes, being managed by external services.

[+] DanHulton|12 years ago|reply

Off-topic, but I'd thought I'd screwed up my DNS for a moment and this article redirected to the silly side-project I've been working on: ipaidthemost.com.

I guess we borrowed the same template?

[+] krakensden|12 years ago|reply

I'm pretty suspicious of using runit instead of Upstart- nobody tests Ubuntu with runit, and you're liable to get in trouble if you depend on some other service running on the machine. Although clearly it works well enough for them.

I also sort of suspect that the closer you are to running a full distribution in your containers, the less benefit you're getting from the containers.

[+] FooBarWidget|12 years ago|reply

Baseimage-docker uses runit exactly to not run a full distribution in the container. Upstart tries to boot a full Ubuntu. A full Ubuntu is not necessary inside the container. Therefore, baseimage-docker provides a custom init system that boots only the minimal subset of Ubuntu that is necessary for it to run correctly in Docker.

[+] akerl_|12 years ago|reply

I was super stoked to read this, and went diving to borrow some of their work for my own Docker usage. However, I'm confused by their choice of Python for the my_init script. The site claims they chose runit because it is more lightweight than supervisord, a Python tool of similar merit. Making the init process depend on Python seems to negate that advantage.

[+] FooBarWidget|12 years ago|reply

It's not only Python that makes supervisord relatively heavy compared to runit. It's also the amount of code in supervisord (and its dependencies). my_init is only a single file, less than 300 lines, with minimal dependencies.

Baseimage-docker is also in a "minimal viable product" phase. We're still trying to tweak things until they're right. For example my_init recently received some features which are important in certain use cases; features which would have been much lower to implement in C.

In the future we may optimize things by rewriting my_init in C. Right now it's laziness on our part.

[+] peterwwillis|12 years ago|reply

They just described implementing an OpenVZ VM.

[+] jaybuff|12 years ago|reply

"Note that the shell script must run the daemon without letting it daemonize/fork it. Usually, daemons provide a command line flag or a config file option for that."

fghack is an anti-backgrounding tool. http://cr.yp.to/daemontools/fghack.html

90 comments