Jails are actually very similar to Linux namespaces / unshare. Much more similar than most people in this thread think.
There's one difference though:
In namespaces, you start with no isolation, from zero, and you add whatever you want — mount, PID, network, hostname, user, IPC namespaces.
In jails, you start with a reasonably secure baseline — processes, users, POSIX IPC and mounts are always isolated. But! You can isolate the filesystem root — or not (by specifying /). You can keep the host networking or restrict IP addresses or create a virtual interface. You can isolate SysV IPC (yay postgres!) — or keep the host IPC namespace, or ban IPC outright. See? The interesting parts are still flexible! Okay, not as flexible as "sharing PIDs with one jail and IPC with another", but still.
So unlike namespaces, where user isolation is done with weird UID mapping ("uid 1 in the container is uid 1000001 outside") and PID isolation I don't even know how, jails are at their core just one more column in the process table. PID, UID, and now JID (Jail ID). (The host is JID 0.) No need for weird mappings, the system just takes JID into account when answering system calls.
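To make the namespace side of that comparison concrete, here's a minimal sketch of the Linux "start with no isolation and add whatever you want" model using clone(2). It assumes root (or an additional user namespace), since the namespaces below are privileged, and error handling is trimmed:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char stack[1024 * 1024];

    /* Runs inside the new namespaces: sees itself as PID 1, with its own
       hostname, mount table and (empty) network stack. */
    static int child(void *arg) {
        (void)arg;
        sethostname("sandbox", 7);
        printf("inside: pid=%d\n", (int)getpid());   /* prints 1 */
        execlp("/bin/sh", "sh", (char *)NULL);
        return 1;
    }

    int main(void) {
        /* Isolation is opt-in: each CLONE_NEW* flag adds one namespace. */
        int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWNET;
        /* Pass the top of the child's stack (it grows down on x86). */
        pid_t pid = clone(child, stack + sizeof(stack), flags | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);
        return 0;
    }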
By the way, you definitely can run X11 apps in a jail :) Even with hardware accelerated graphics (just allow /dev/dri in your devfs ruleset).
P.S. one area where Linux did something years before FreeBSD is resource accounting and limits (cgroups). FreeBSD's answer is simple and pleasant to use though: https://www.freebsd.org/cgi/man.cgi?rctl
While I'm not sure I agree entirely with the "Complexity == Bugs" section, the main point, that containers are not first-class citizens but rather a (useful) combination of independent mechanisms, is spot-on. This has real repercussions: most people I've spoken to don't know these things exist. They know containers do, they have a very vague idea what containers are, but they have no fundamental understanding of the underlying concepts. (And who can blame them? Really, it was marketed that way.)
For example, pid_namespaces and subreapers are awesome features¹, and are extremely handy if you have a daemon that needs to keep track of a set of child jobs that may or may not be well behaved. pid_namespaces ensure that if something bad happens to the parent, the children are terminated; they don't ignorantly continue executing after being reparented to init. Subreapers (if a parent dies, reparent the children to this process, not init) solve the problem of grandchildren getting orphaned to init if the parent dies. Both are excellent features for managing subtrees of processes, which is why they're useful for containers. Just not only containers.
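A quick sketch of the subreaper part, since it's literally one prctl(2) call (grandchild timing simplified for illustration):

    #include <stdio.h>
    #include <sys/prctl.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Any descendant that gets orphaned is re-parented to us
           instead of to init, so we still get to wait() on it. */
        prctl(PR_SET_CHILD_SUBREAPER, 1, 0, 0, 0);

        if (fork() == 0) {              /* child */
            if (fork() == 0) {          /* grandchild */
                sleep(1);               /* outlive its parent */
                printf("grandchild re-parented to pid %d\n", (int)getppid());
                _exit(0);
            }
            _exit(0);                   /* exit immediately, orphaning the grandchild */
        }

        wait(NULL);   /* reap the child */
        wait(NULL);   /* ...and then the re-parented grandchild */
        return 0;
    }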
But developers aren't going to take advantage of syscalls they have no idea exist, of course.
¹although I wish someone could tell me why pid_namespaces are root-only: what's the security risk of allowing unprivileged users to create pid_namespaces?
This is definitely true, but only as long as docker (or $container_runtime) remains lightweight enough that you can still use those independent parts on their own, compatibly with docker. The risk is that docker grows in complexity such that it creates new dependencies between these independent parts and therefore handicaps their power when used individually.
As an example, it's easy to create network namespaces and add routing rules, interfaces, packet forwarding logic, etc all by using `ip netns exec`. But there is no easy way to launch a docker container into an existing netns. You need to use docker's own network tooling or build your own network driver, which may be more complex than what you need. This strikes me as a code smell in docker.
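For comparison, joining an existing netns by hand is tiny: `ip netns add demo` pins the namespace at /var/run/netns/demo, and `ip netns exec` is essentially open() plus setns() before exec'ing the command (the "demo" name here is just a placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Namespace handle created by "ip netns add demo" (path assumed). */
        int fd = open("/var/run/netns/demo", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Move this process into that network namespace (needs CAP_SYS_ADMIN). */
        if (setns(fd, CLONE_NEWNET) != 0) { perror("setns"); return 1; }
        close(fd);

        /* Everything exec'd from here sees only that namespace's interfaces. */
        execlp("ip", "ip", "addr", (char *)NULL);
        perror("execlp");
        return 1;
    }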
Docker is to containers what OAuth 2.0 is to cryptography: a roll-your-own solution with a wide surface of complexity.
Whereas jails/zones/VMs have complexity that is shared (mutualized), Docker has the advantage of being more flexible, which comes at the price that you may introduce more escape scenarios.
As a result, as in cryptography, Docker is kind of a roll-your-own crypto solution, secured by obfuscation, which may become your own poison if you don't have a lot of knowledge on the topic.
From this article you can derive 2 conclusions:
- docker is good for a big business with enough knowledge to devote a specialized team to handling the topic, because FEATURES
- jails/zones are better suited to securing a small business
If I could create pid namespaces for my user-space apps, then every program I write forever would, as its first step, launch into a pid namespace.
Is it possible to create pid_namespaces for unprivileged users by wrapping pid_namespaces creation in a suid shell script that will take care of loading everything using the current unprivileged user?
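For what it's worth, on kernels that allow unprivileged user namespaces you don't need a suid wrapper: creating a user namespace first gives you the in-namespace privileges required to also create a PID namespace. A rough sketch (assumes unprivileged user namespaces are enabled, which is a distro-dependent setting):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* As an ordinary user: the new user namespace grants us capabilities
           inside it, which is enough to unshare a PID namespace too. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0) {
            perror("unshare");
            return 1;
        }

        /* The new PID namespace only applies to children, so fork once. */
        pid_t pid = fork();
        if (pid == 0) {
            printf("in new pid namespace: pid=%d\n", (int)getpid());  /* prints 1 */
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }

(This is roughly what `unshare -Urpf /bin/sh` from util-linux does.)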
Ignorance admission time: I still have no idea what problem containers are supposed to solve. I understand VMs. I understand chroot. I understand SELinux. Hell, I even understand monads a little bit. But I have no idea what containers do or why I should care. And I've tried.
Containers are just advanced chroots. They do the same with the network interface, process list and your local user list as chroot does with your filesystem. In addition, containers often throttle resource consumption of CPU, memory, block I/O and network I/O of the running application to provide some QoS for other colocated applications on the same machine.
It is the spot between chroot and VM. Looks like a VM from the inside, provides some degree of resource usage QoS and does not require you to run a full operating system like a VM.
Another concept that is now also often automatically connected to containers is the distribution mechanism that Docker brought. While provisioning is an orthogonal topic to runtime, it is nice that these two operational topics are solved at the same time in a convenient way.
rkt did some nice work to allow you to choose the runtime isolation level while sticking to the same provisioning mechanism:
https://coreos.com/rkt/docs/latest/devel/architecture.html#s...
I'm with you, but I've found a single use case that I'm running with, and potentially a second that I'm becoming sold on. So far, the most useful thing for me is being able to take a small application I've written, package it as a container, and package it in a manner where I know it will run identically on multiple remote machines that I will not have proximity to manage should something go wrong. I can also make a Big Red Button to blow the whole thing away and redownload the container if need be, since I was (correctly) forced to externalize storage and database. I can also push application updates just by having a second Big Red Button marked "update" which performs a docker pull and redeploy. So now, what was a small, single-purpose Rails app can be pushed to a dozen or so remote Mac minis with a very simple GUI to orchestrate docker commands, and less-than-tech-savvy field workers can manage this app pretty simply.
I'm also becoming more sold on the Kubernetes model, which relies on containers. Build your small service, let the system scale it for you. I don't have as much hands-on here yet, but so far it seems pretty great.
Neither of those are the same problems that VMs or chroot are solving, as I see it, but a completely different problem that gets much less press.
Everyone says containers help resource utilization, but I think their killer raison d'être is that they are a common static-binary-style packaging mechanism. I can ship Java, Go, Python, or whatever, and the download-and-run mechanism is all abstracted away.
I'm very new to containers, but I think I'm starting to get the hype a bit. Recently I was working on a couple of personal projects, and for one I wanted a Postgres server, and for the other PhantomJS so that I could do some webscraping. Since I try to keep my projects self-contained I try to avoid installing software onto my Mac. So my usual workflow would be to use Vagrant (sometimes with Ansible) to configure a VM. I do this infrequently enough that I can never remember the syntax, and there's a relatively long feedback loop when trying to debug install commands, permissions etc. I gave Docker a try out of frustration, but was simply delighted when I discovered that I could just download and start Postgres in a self-contained way. And reset it or remove it trivially. I know there's a lot more to containers than this, but it was an eye-opener for me.
Yeah, I think a lot of it is better resource utilization compared to VMs. At the same time, though, I don't think containers are the thing, but just a thing that paves the way for something very powerful: datacenter-level operating systems.
In 2010, Zaharia et al. presented [1], which basically made the argument that increasing scale of deployments and variety of distributed applications means that we need better deployment primitives than just at the machine level. On the topic of virtualization, it observed:
> The largest datacenter operators, including Google, Microsoft, and Yahoo!, do not appear to use virtualization due to concerns about overhead. However, as virtualization overhead goes down, it is natural to ask whether virtualization could simplify scheduling.
But what they didn't know was that Google had been using containers for a long time. [2] They're deployed with Borg, an internal cluster scheduler (probably better known as the predecessor to the open-source Kubernetes), which essentially serves as exactly the kind of operating system for datacenters that Zaharia et al. described. When you think about it that way, a container is better thought of not as a thinner VM, but as a thicker process.
> Because well-designed containers and container images are scoped to a single application, managing containers means managing applications rather than machines.
In the open-source world, we now have projects like Kubernetes and Mesos. They're not mature enough yet, but they're on the way.
[1] https://cs.stanford.edu/~matei/papers/2011/hotcloud_datacent...
[2] http://queue.acm.org/detail.cfm?id=2898444
- have different versions of libs/apps on the same OS (or run different OS's)
- tinker with linux kernel, etc without breaking your box (remember the 90's?)
- building immutable images packed with dependencies, ready for deploy
- testing distributed software without VMs (because containers are faster to run)
- if you have a big box (say 64gb, eight core or whateva) or multiple big boxes, you can manage the box resources through containerization, which can be useful if you need to run different software. Say every team builds a container image, then you can deploy any image, do HA, Load balancing, etc. Ofc this use case is highly debatable
These comments are helpful. Thanks. Sounds like for a given piece of hardware you might be able to fit 2 or 3 VMs on it, or a lot more containers. But without the security barriers of VMs.
That being the case, why not just use the OS? And processes and shared libraries?
Increase server utilization by packing multiple non-hostile tenants on it, quickly create test environments, have a volatile env. You can have all of those with VMs although at much higher CPU, RAM usage cost.
1. More efficient use of hardware (including spin up time) 2. Better mechanisms for tying together and sharing resources across boundaries.
But in the end they don't really do anything you couldn't do with a VM. It's just that people realized that VMs are overkill for many use cases.
Ended with a spectacular data loss, of my own company's financial data. Luckily I had 7-day old SQL exports.
Same with me. This plays right into the complexity issue.
Even if you understand them, you have to understand the specific configuration (unlike VMs, where you have a very limited set of configurable options, and the isolation guarantees are pretty much clear).
They're VM's but much more efficient and start faster. There's a clever but shockingly naive build system involved. That's pretty much it.
Going beyond this you get orchestration - which you can certainly do with VM's but it's slow; and various hangovers from SOA rebadged and called microservices.
But they're really, really efficient compared to VM's.
We have to gaffer tape with AppArmor and SELinux to fix all the holes the kernel doesn't care about: https://github.com/lxc/lxc/blob/master/config/apparmor/conta...
Solaris Zones are more designed and an evolution from FreeBSD Jails. Okay, the military likely paid for that: https://blogs.oracle.com/darren/entry/overview_of_solaris_ke...
Maybe it's Deathstar vs. Lego. But I assume you can survive a lot longer in a Deathstar in vacuum than in your Lego spaceship hardened by gaffa tape.
1: I have utmost respect for anyone working on this stuff. No offense, but as a user, the lack of design and implementation of bigger concepts (not as in more code, but better design, more secure) in the Linux world is sometimes sad. It's probably the only way to move forward, but you could read on @grsecurity's Twitter years ago that this idea was going to be a fun ride full of security bugs. There might be a better way?
I really wish this post went into more detail. It feels too high level to be useful.
I ran into the memory issue recently. In DC/OS when you use the marathon scheduler, if you go above the allocated memory limit, the scheduler kills your task and restarts it.
The trouble is, if you ran top inside your container, and you're running on a DC/OS node with 32GB of memory, top reports all 32GB of memory. So runtimes that do garbage collection (like the JVM) will just continue to eat memory if you don't specify the right limits/parameters. The OS will even let them allocate past the container limit, but just kill the container afterwards.
Now the container limit is available somewhere under the cgroup filesystem (/sys/fs/cgroup), but now runtimes need to check to see if they're running in a container and adjust everything accordingly.
Of course you could always tell your scheduler not to hard kill something when it goes over a memory limit, which is why we never ran into that when we were running things on CoreOS since we didn't configure hard limits per container.
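For reference, the limit actually lives in the cgroup filesystem rather than /proc, and reading it from inside the container is straightforward; a rough sketch assuming a cgroup v1 memory controller mounted at the usual place (cgroup v2 exposes memory.max instead):

    #include <stdio.h>

    int main(void) {
        /* cgroup v1 path as commonly mounted inside the container; the value
           here is the scheduler-imposed limit, not what top reports. */
        const char *path = "/sys/fs/cgroup/memory/memory.limit_in_bytes";
        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }

        unsigned long long limit = 0;
        if (fscanf(f, "%llu", &limit) != 1) { fclose(f); return 1; }
        fclose(f);

        printf("cgroup memory limit: %llu bytes\n", limit);
        return 0;
    }

Newer JVMs eventually grew container awareness for exactly this reason, but at the time you had to size the heap by hand.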
Yeah, Linux security features work like: throw ... against a wall and see what sticks. I find it amusing when people say: "We should write a new kernel" and their only proposed security feature is using that memory safe language (TM)... they'd have my attention if they said "We should write a new kernel and design all the permissions/isolations/resource limits from the ground up".
Yes, and as long as your life isn't threatened and you live in a world full of other people, problems, opportunities, the lego ship with gaffa tape is way more useful.
It feels like Ms Frazelle's essay ends abruptly. I was looking forward to the other use cases of non-Linux containers.
I think most people are considering these OS-level virtualization systems for the same or very similar use cases: familiar, scalable, performant and maintainable general purpose computing. Linux containers win because Linux won. Linux didn't have to be designed for OS virt. People have been patient as long as they've continued to see progress -- and be able to rely on hardware virt. Containers are a great example of where, even with all of the diverse stakeholders of Linux, the community continues to be adaptive and create a better and better system at a consistent pace in and around the kernel.
That my $job - 2, Joyent, re-booted Lx-branded zones to make Linux applications run on illumos (a descendant of OpenSolaris) is more than a "can't beat them, join them" strategy, as it allows their Triton (OSS) users full access not only to the Linux API and toolchains, but to the Docker APIs and image ecosystem, and has been an environment for their own continued participation in microservices evolution.
Although Joyent adds an additional flavor, it targets the same scalable, performant and maintainable cloud/IaaS/PaaS-ish use case. In hindsight, it's crazy that I worked at three companies in a row in this space, Piston Cloud, Joyent, Apcera, and each time I didn't think I'd be competing against my former company, but each time the business models as a result of the ecosystems shifted. Thankfully with $job I'm now a consumer of all of the awesome innovations in this space.
I think an interesting bit here is that e.g. Solaris first had Zones (i.e. "Containers"), while virtualisation was added later (sun4v); for Linux the story is exactly the other way around.
It's probably a good time to stop using "containers" to mean LXC, considering the new OCI runc spec specifies containers on Solaris using Zones and on Windows using Hyper-V:
https://github.com/opencontainers/runtime-spec/blob/master/s...
I don't think anyone in the container dev community thinks that containers means LXC only. Even back in 2013, docker's front end api was designed to support other runtimes such as VMs and chroot. Perhaps this is a marketing story gone awry?
I think it's important to realize that the reduced isolation of containers can also have pretty significant upsides.
For example monitoring the host and all running containers and all future containers only means running one extra (privileged) container on each host. I don't need to modify the host itself, or any of the other containers, and no matter who builds the containers my monitoring will always work the same.
The same goes for logging. Mainly there is an agreed-upon standard that containers should just log to stdout/stderr, which makes it very flexible to process the logs however you want on the host. But also if your application uses a log file somewhere inside the container, I can start another container (often called "sidecar") with my tools that can have access to that file and pipe it into my logging infrastructure.
If I want, multiple containers can share the same network namespace. So I listen on "localhost:8080" in one container, and connect to "localhost:8080" in another, and that just works without any overhead. I can share socket files just the same.
I can run one (privileged) container on each host that starts more containers and bootstraps e.g. a whole kubernetes cluster with many more components.
You can save yourself much "infrastructure" stuff with containers, because the host provides it or it is handled in a conceptually different way. For example ntp, ssh, cron, syslog, monitoring, configuration management, security updates, dhcp/dns, network access to internal or external services like package repositories.
My main point is that by embracing what containers are and using that to your advantage, you gain much more than by just viewing them as lightweight virtualisation with lower overhead and a nicer image distribution.
Edit: I want to add that not all of that is necessarily exclusive to containers or mandatory. For example throwing away the whole VM and booting a new one for rolling updates is done a lot, but with containers it became a very integral and universally accepted standard workflow and way of thinking, and you will get looked at funny if you DON'T do it that way.
The meme image ("Can't have 0days or bugs... if I don't write any code") is incorrect.
You can't have bugs if you don't have any code, but not writing code just means that your bugs are guaranteed to be someone else's bugs. Now, this may be a good thing -- other people's code has probably been reviewed more closely than yours, for one thing -- but using other people's code doesn't make you invulnerable, and other people's code often doesn't necessarily match your precise requirements.
If you have a choice between writing 10 lines of code or reusing 100,000 lines of someone else's code, unless you're a truly awful coder you'll end up with fewer bugs if you take the "10 lines of code" option.
There's probably no good way to pick up this context from the article, but the meaning of that particular meme is that the caption is supposed to be a shortsighted analysis. See http://knowyourmeme.com/memes/roll-safe , which lists examples like "You can't be broke if you don't check your bank account" or "If you're already late.. Take your time.. You can't be late twice."
> If you have a choice between writing 10 lines of code or reusing 100,000 lines of someone else's code, unless you're a truly awful coder you'll end up with fewer bugs if you take the "10 lines of code" option.
I disagree, this is only true if you understand why the other code has 100k lines [Although this example is a bit extreme].
A good example that could send a junior developer astray is date handling. Or most likely date mishandling if they are coding it themselves.
These container and container-like solutions are not 10 lines of code; no implementation will be 10 lines. Therefore solutions which have had time to stabilize will be better, since 10 lines of code isn't even a valid solution. New code causes new issues and increased complexity; that's the only point to be made by the meme.
Nobody mentioned unikernels yet? It's a bit unrelated to the containers discussion in this thread, but I thought I'd mention it anyway. They let you create an operating system image, which only includes the code you need. Nothing more, nothing less. This improves security, because the attack surface is reduced.
It makes a lot of sense to me when I think about how cloud computing works. Most of the time an operating system container, zone, jail, VM... is booted just to run a select number of processes. There is absolutely no need for a general purpose system. I think unikernels could really shine in this area.
MirageOS is a project that lets you create unikernels. It's written in OCaml, so it's interesting in more than one way. MirageOS images mostly run on Xen, by the way.
[1] https://en.wikipedia.org/wiki/Unikernel
[2] https://mirage.io/
It's a sad reflection on a technical community when, 3 years later, many still do not seem to clearly understand the bare basics of how containers work. HN has been complicit in massively hyping containers without a corresponding understanding of how they work outside the context of docker.
How many container users understand namespaces and how easy it is to launch a process in its own namespace, both as root and non root users? Or know overlay file systems and how they work. Or linux basics like bind mounts, and networking.
The docker team leveraged LXC to grow, from its tooling to container images, but didn't shy away from rubbishing it and misleading users about what it is. LXC was presented as 'some low level kernel layer' when it has always been a front-end manager for containers, like Docker; the only difference is that LXC launches a process manager in the container and Docker doesn't. Just clearly articulating this in the beginning would have led to a much better understanding of containers and Docker itself among users and the wider community.
How many docker users know the authors of aufs and overlayfs? The hype is so intense around the front end tools that few know or care to know the underlying tools. This has led to a complete lack of understanding of how things work and an unhealthy ecosystem as critical back end tools do not get funding and recognition, with the focus solely on front ends as they 'wrap' projects, make things more complex and build walls to justify their value. Launch 5000 nodes and 500000 containers. How many users need this?
And this complexity has a huge cost in technical debt, when you are scaling (as many stories here report) and when you are trying to figure out the ecosystem, so much so that it's now at risk of putting people off containers.
A stateless PAAS has never been the general use case; it's a single use case pushed as a generic solution because that's Docker's origin as a PAAS provider. The whole problem with scaling for the vast majority is managing state. Running stateless containers or instances does not even begin to solve that in any remote way. Yes, it sounds good to launch 5000 stateless instances, but how is it useful? Without state, scaling has never been a problem. A few bash scripts (which is what Dockerfiles are) will do it. But now, because of the hype around Docker and Kubernetes, users must deal with needless complexity around basic process management, networking and storage, and re-architect their stack to make it stateless, without any tools to manage state. Congratulations on becoming a PAAS provider.
A couple of observations from someone not-so-familiar with containers:
If the consensus is that containers for the most part are just a way to ship and manage packages along with their dependencies to ease library and host OS dependencies, I'm missing a discussion about container runtimes themselves being a dependency. For example, Docker has a quarterly release cadence I believe. So when your goal was to become independent of OS and library versions, you're now dependent on Docker versions, aren't you? If your goal as IT manager is to reduce long-term maintenance cost and have the result of an internally developed project run on Docker without having to do a deep dive into the project long after the project has been completed, then you may find yourself still not being able to run older Docker images because the host OS/kernel and Docker have evolved since the project was completed. If that's the case, the dependency isolation that Docker provides might prove insufficient for this use case.
Another point: if your goal is to leverage the Docker ecosystem to ultimately save ops costs, managing Docker image landscapes with eg. kubernetes (or to a lesser degree Mesos) might prove extremely costly after all since these setups can turn out to be extremely complex, and absolutely require expert knowledge in container tech across your ops staff, and are also evolving quickly at the same time.
Another problem and weak point of Docker might be identity management for internally used apps; e.g. containers don't isolate Unix/Linux user/group IDs and permissions, but take away resolution mechanisms like (in the simplest case) /etc/passwd and /etc/group or PAM/LDAP. Hence you routinely need complex replacements for it, adding to the previous point.
As a sysadmin I just want to point out to this mostly dev crowd, that my current favorite method of operations is to have multiple compartmentalized VM's which then may or may not hold containers or jails.
Why do I do it this way? Because having a full stack VM for each use-case on a good server is realistically not that much more resource hungry than a container, but the benefits are noticeable.
Lots of the core reason stems from security concerns. For example, there are quite a few Microsoft Small Business Server styled linux attempts at hitting the business space, but instead of playing to the strengths of modern hardware, they all mostly throw every service on the same OS just like SBS does... which is a major weakness. So instead of an AD server that also does dns and dhcp and the list goes on, each thing in my environments gets its own separate VM (eg, SAMBA4 by itself, bind by itself, isc-kea by itself, and so on)
Another reason for this is log parsing related. It's much easier to know that when the bind VM OSSEC logs go full alert, I know exactly what to fix. On multi container systems, a single failure or compromise can end up affecting many containerizations and convoluting the problem/solution process.
Of course, the main weakness of such a system is that any attempt to break out of the VM space illicitly could compromise many systems, but that's why you harden the VM's and have good logging in the first place, but also do it to the host system, along with using distributed separation of hosts and good backups.
Just some real world usage from a sysadmin I wanted to convey. I still will do a container or a VM with many containers for the devs if needed, but when it comes time to deploy to prod, I tend to use a full stack VM. I'm also open to talk about weaknesses in this system, as I'd be curious to hear what devs think.
To be fair, I still haven't fully caught up with the whole devops movement either, so perhaps I'm behind.
Also, a big shoutout to proxmox for a virtual environment system, FOSS and production quality since 4.0. I have also run BSD systems with jails in a similar way. The key point of the article is that zones/jails/vms are top level isolations and containers are not (but that doesn't make containers bad!)
A little bit off topic, but I've been following Jess for a while and I think that developers like her are great. In my country it's hard to see a happy developer, and she seems to enjoy everything she does. That's why I follow her, because of her great work and great personality. I'm happy to see one of her blog posts here on HN.
In this post the author links to one of her previous posts[0], where she wrote:
> As a proof of concept of unprivileged containers without cgroups I made binctr. Which spawned a mailing list thread for implementing this in runc/libcontainer. Aleksa Sarai has started on a few patches and this might actually be a reality pretty soon!
Does anybody know if this made it into runc/libcontainer? I'm not an expert on these technologies but would love to read through docs if it has been implemented.
[0] https://blog.jessfraz.com/post/getting-towards-real-sandbox-...
I had to read the post twice before I really got what she was saying. I think the distinction I would make is that while there are many more use cases that you can apply to Containers that may not apply to Jails, Zones, or VMs, the most common use case of "run an app inside a pre-built environment" applies to all of them. Since I believe most users (or potential users) of Containers are only looking at that use case, it's harder to see the differences between the different technologies.
My only hope is that anyone in a position of making a decision on which technology to use can at least explain at a high level the difference between a Container and a VM.
I'm not an OS person, so forgive me if this is a stupid question: Lots of people are excited about Intel SGX and similar things. Are there any interesting ways people are thinking about combining, like, Docker containers with SGX enclaves and such? One could imagine (e.g.) using remote attestation to verify an entire container image.
It doesn't matter how many distinctions you make on these things (first-class, last-class, second-class, poor-class, etc...). These kinds of discussions are always relative.
All is good as long as your decision is conscious of the compromises taken by each approach and what they entail (what other security mechanisms do you have at your disposal ? how could they enhance your app ? will your solution depend on external tools like ansible/puppet/etc ? do you actually need "containers" or jails or [insert your favorite trendy tech here] ?).
Running a *BSD or a Linux is a way bigger design decision than what kind of isolation mechanisms you have as many of the underlying parts are becoming different.
I'm trying to understand something. At my last work we had a big problem with "works for me". We started using Vagrant and all those problems disappeared. Then Docker became popular and all of a sudden people wanted to use that instead.
But is Docker really suitable for this? While each Vagrant instance is exactly the same, Docker runs on the host system. It feels like it will be prone to all sorts of dissimilarities.
It depends on what you want to keep constant between hosts; for many (dare I say most?) projects, docker will provide a sufficiently identical environment that it "just works", because the filesystem as seen by the same image on two different computers will be exactly the same. This is sufficient for many, many projects, and is the primary source of "works for me" problems, in my experience.
However, if your application requires things like the CPU feature set being the same, or the amount of memory available to the process being the same, etc., then no, docker will not give you this level of "sameness". I have had bugs that manifested on one docker host and not another because one host had certain x86 extensions available (AVX, or something similar) and the other didn't, and this caused a certain codepath to be followed. However this is probably extremely rare for non-performance-targeted code, as most projects will compile for a conservative subset of x86_64 instructions so that they can run everywhere.
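For what it's worth, that class of difference is easy to probe from inside the very same image, e.g. with the x86 CPU-feature builtins that GCC and Clang provide (rough sketch, x86-only):

    #include <stdio.h>

    int main(void) {
        /* Same image on two hosts: identical filesystem, but the CPU is
           whatever the host happens to have. */
        __builtin_cpu_init();
        printf("avx:    %s\n", __builtin_cpu_supports("avx")    ? "yes" : "no");
        printf("avx2:   %s\n", __builtin_cpu_supports("avx2")   ? "yes" : "no");
        printf("sse4.2: %s\n", __builtin_cpu_supports("sse4.2") ? "yes" : "no");
        return 0;
    }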
Anecdotally, I find learning the "Docker way" has transformed my development paradigm in a much more fundamental way than Vagrant ever did, mostly because docker containers are almost instantaneous to start, which enables their use for a whole range of applications that I would never have even considered for something like Vagrant. Because I can launch a docker container in ~300ms, I can use containers like native applications. I do everything from building software to compiling latex documents using docker just so that I don't have to worry about installing toolchains or configuring things just right; I get an image that does what I want, and then I invoke the docker containers as if I were invoking the actual tools themselves.
Docker is just a tool, but it's a darned powerful one, and it's pretty fun to see how it continues to evolve right now. I highly suggest you check it out.
In my experience Vagrant only solves 50% of the problem. With complex setups it's not really reproducible enough. Docker until now has always done what I asked of it. So I'm confident that it at least solves 75% of the problem. I can suggest fully moving to docker.