
A better Kubernetes from the ground up

261 points | mr-karan | 5 years ago | blog.dave.tf | reply

152 comments

[+] dijit|5 years ago|reply
The opening is weak (mutable pods are considered an inherent anti-pattern in k8s), but I think he’s got a lot of good points about networking.

Every time I look at k8s networking seriously it gives me great pause on whether I should continue to run such a complex system. IPv6+EtcD would solve this matter really well.

[+] nvarsj|5 years ago|reply
I came here to post pretty much the same thing.

Networking is the most poorly designed aspect of Kubernetes. So much so that I'm honestly surprised it didn't kill off k8s early on. You still can't get proper service-layer load balancing out of the box with iptables.

I'm not sure of the history - it's the simplest thing possible, I guess, but it doesn't perform very well at the most basic of tasks. To try to solve this, we have the current mess of service meshes, CNI, etc. It's going back to the middleware days of yore - which anyone with operational experience should know we really need to avoid. Having multiple layers of network proxies and masquerading between service calls, on an internal network, is just ridiculous and difficult to debug or operate at scale.

[+] jmillikin|5 years ago|reply
You don't need to use the complex stateful overlay networks. At Stripe we have a network overlay using IPv6 and the Linux kernel's built-in stateless tunnel device, so there's effectively unlimited addresses with no coordination between worker machines and no iptables port remapping.
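A toy sketch of the stateless-addressing idea (purely illustrative; the ULA prefix and layout are made up, not Stripe's actual design): with IPv6 there are enough bits that each worker can derive its own pod subnet deterministically, so there is no central allocator to coordinate with and no port remapping.

```python
# Stateless IPv6 pod addressing sketch: each worker derives its own /64 pod
# prefix from a cluster-wide prefix and its node index, so no coordination
# between machines and no iptables port remapping is needed.
import ipaddress

CLUSTER_PREFIX = ipaddress.IPv6Network("fd00:dead:beef::/48")  # hypothetical ULA prefix

def node_pod_subnet(node_index: int) -> ipaddress.IPv6Network:
    """Carve the node's /64 out of the /48 deterministically."""
    return list(CLUSTER_PREFIX.subnets(new_prefix=64))[node_index]

def pod_address(node_index: int, pod_index: int) -> ipaddress.IPv6Address:
    """Give each pod a unique, cluster-routable address within its node's /64."""
    return node_pod_subnet(node_index)[pod_index + 1]  # skip the subnet's zero address

print(node_pod_subnet(3))   # fd00:dead:beef:3::/64
print(pod_address(3, 7))    # fd00:dead:beef:3::8
```

Because the node-to-subnet mapping is a pure function of the node index, any machine can compute routes to any other machine's pods with no shared state.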
[+] joshuak|5 years ago|reply
Yeah, the value of eliminating pod mutation as a source of trouble is pretty hard to outweigh with any feature I can imagine mutability offering. Certainly not the example given. Actually, the example must be over my head, because I do the equivalent of SIGTERM restarts with new configs on pods with a single command all day long when building services.
[+] hodgesrm|5 years ago|reply
> IPv6+EtcD would solve this matter really well.

Like others on this thread I completely agree that k8s networking is over-the-top complex. Still, it is hard to understand how this addresses relatively common use cases like providing intelligent load balancing for clients outside the cluster or secure multitenancy within it.

Solving this requires a step back to think through simplifying networking itself. I thought amazon did the world a favor by removing L2 networking from their model. We need a better set of primitives that abstracts out as much of the low level implementation as possible. The current situation feels like data management before the advent of the relational model.

[+] stingraycharles|5 years ago|reply
Yeah I’ve been dying to have a “simple” k8s solution that either uses just a VLAN or ipv6. It’s been one of the few components of k8s that I simply do not understand the rationale for, it seems so much engineering and complexity for what, exactly?
[+] bboreham|5 years ago|reply
He’s overestimating how much money is made by selling Kubernetes network addons.
[+] fulafel|5 years ago|reply
Hear, hear. The unquestioning attitude towards the complex system of application level proxies is disquieting.
[+] andrewrothman|5 years ago|reply
Interesting article. Thanks for sharing!

My dream platform:

1. Single binary install on nodes, and easy to join them into a cluster.

2. Resources defined as JSON with comments in a simple format with JSON Schema URLs denoting resource types - I should be able to run 1 container with 3 lines of resource definition.

3. Everything as a CRD... No resources or functionality pre-installed; instead, everything is available via publicly hosted HTTPS schema URLs.

4. Pluggable / auto-installed runtimes based on the schema URL or a "runtime" field: containers, vms, firecracker, wasm, maybe even bare processes, etc.

5. A standard web dashboard with a marketplace that can install containers, VMs, or wasm workloads (or anything via a copy-pasted HTTPS URL), or a standard Electron app that lets me connect to clusters or manage local deployments on my devbox.

6. Apps can provide a JSON schema with config options, which map to env vars and volume mounts and can be displayed in a user friendly way in the web dashboard.
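A purely hypothetical resource definition in the spirit of items 2 and 3 (the schema URL is made up): the `$schema` URL names the resource type and could also drive runtime selection per item 4, leaving roughly three lines of actual substance to run one container.

```json
{
  "$schema": "https://example.invalid/schemas/container/v1.json",
  "image": "nginx:1.19",
  "ports": [80]
}
```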

I feel something like this could standardize even more than kube. I'd love if one system allowed me to manage AWS EC2 / DigitalOcean / Proxmox instances, as well as manage services on my devbox for daily use (like daemons, etc.)

With that said, I do like a standard format across service providers. While I do find kube complex at times, I like that something "won" this battle, and I also like the push towards containers vs vms.

I'd love to see kube start paring down and innovating toward making things easier for smaller clusters. Does anyone know if there are discussions about that going on, or if there are resources for managing kube in smaller teams? Anyone interested in this?

[+] asim|5 years ago|reply
Working on something that will probably end up covering a lot of those requirements in the long run. https://micro.mu
[+] crudbug|5 years ago|reply
K8s forced the industry to adopt a consistent, open cluster API that will drive innovation and competition.

We can have multiple implementations of the same API, but what I am seeing currently from the "commercial" vendors is the base K8s with UI changes. I hope we will have multiple implementations of the spec. The specs also have to evolve with time.

Last decade, systemd had its fair share of criticism. But it provided a consistent API to run "Compute Units" locally. K8s can benefit from the same principles to manage "Compute Pods" across the cluster. The concepts of promise theory [1] provide interesting control loop co-ordination.

[1] https://en.wikipedia.org/wiki/Promise_theory

[+] joana035|5 years ago|reply
I'm quite confident that one day kubernetes will make use of systemd-nspawn. Let's see... :)
[+] stevefan1999|5 years ago|reply
> That means the computers in my home server rack, the DigitalOcean VM I have, and the couple other servers dotted around the internet. These should all be part of one cluster, and behave as such.

YES! This is exactly what we should have been able to do in K8S swiftly, but I would like to suggest one additional goal: a highly distributed, fault-tolerant cluster spanning different networks without hassle (as a means to spread risk), while also keeping TCO as low as possible through a heterogeneous architecture.

So this means I could use AWS Graviton instances for high-performance, near-bare-metal microservices, and also run typical x86 workloads on various cheap VPS providers such as Vultr, DO, and any other cheap hosts out in the wild to handle the normal stuff that typically wouldn't run well on ARM, such as GitLab, Prometheus, and Keycloak. The idea is that x86 has an affinity for the "heavy stuff" while ARM plays the "lightweight stuff" role. This is not really possible given today's shape of the k8s ecosystem, since the majority of images on Docker Hub are x86/amd64 only; my wild guess is around 90%. By the way, I use ServerHunter[1] to scavenge for such cheap x86 servers.

Also, I'm a k3s[2] user, and I have attempted this with pretty good results. Given kernel support, you can even strap WireGuard on natively (though my VPS bandwidth bill tells me this can get quite heavy).

While I could certainly run k8s using kubeadm, I just like k3s's philosophy of being batteries-included: it installs Traefik (although my recent need to run nginx-ingress ultimately led me to turn it off), CoreDNS, Flannel, simple local storage, and extra goodies out of the box in your cluster, and you can opt out of any of them whenever you think they are insufficient for your kind of workload (while they are adequate for 90% of cases). To be honest, this is how simple k8s should have been from the very beginning.

[1]: https://www.serverhunter.com/ [2]: https://k3s.io

[+] q3k|5 years ago|reply
Counter-argument: having your entire world-wide deployment operate under a single control plane is a recipe for global outages. There should be no single command that one can fat-finger that will bring down your system globally.

One-cluster-per-region (with some tie-in into one region being its own failure domain, both at the underlying infrastructure and application level) is the way to go for reliability.

[+] dnautics|5 years ago|reply
> YES! This is exactly what we should have been able to do in K8S swiftly, but I would like to suggest one additional goal: a highly distributed, fault-tolerant cluster spanning different networks without hassle (as a means to spread risk)

In practice this is going to be tricky unless your services are completely stateless. For 80% of people that's going to be true, but if you have customers with large datasets (I'm thinking mostly media), you do not want to be schlepping those between clouds, or between cloud and on-prem, or even between on-prems.

[+] yowlingcat|5 years ago|reply
One potential gotcha in your model is egress costs. For the big three, it's anywhere from $0.03/GB to $0.10/GB. The million-dollar question suggested by your model is: how do you store and back up your persistent state in a manner that is not just conceptually robust but cost-optimal?
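To put rough numbers on that (using the $/GB range quoted above as an illustration, not current pricing):

```python
# Back-of-the-envelope egress math for a multi-cloud setup.
# Rates are the rough $/GB figures quoted in the comment, not current pricing.
EGRESS_LOW, EGRESS_HIGH = 0.03, 0.10  # $/GB, big-three range

def monthly_egress_cost(gb_per_month: float, rate: float) -> float:
    """Dollar cost of moving gb_per_month gigabytes out of a cloud."""
    return gb_per_month * rate

# Example: replicating a 2 TiB dataset across clouds once a month.
gb = 2 * 1024
print(f"${monthly_egress_cost(gb, EGRESS_LOW):,.2f} - ${monthly_egress_cost(gb, EGRESS_HIGH):,.2f}")
```

Even a modest 2 TiB of monthly cross-cloud replication lands in the tens to hundreds of dollars, which is why stateful multi-cloud spreads get expensive fast.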
[+] chucky_z|5 years ago|reply
I do this today with Nomad/Consul. Hashicorp's Raft+Serf implementation allows for shocking amounts of latency between servers/clients. I have several centralized server clusters (the "servers" _should_ be geographically close; the "agents" can be far and wide), with agents more than 200ms away, across multiple clouds and on-prem. Everything works just fine.

I've legitimately considered running some kind of simple SaaS and behind the scenes running a Nomad cluster and having the remote SaaS agent just be a Nomad client.

[+] brown9-2|5 years ago|reply
Both of these taught me that Kubernetes is extremely complex, and that most people who are trying to use it are not prepared for the sheer amount of work that lies between the marketing brochure and the system those brochures promise.

...

GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale.

This is rough to hear coming from a former engineer at a top cloud company, which has often led the way on “marketing brochures”.

[+] markbnj|5 years ago|reply
It caught my eye too, but if this perception is true they do a pretty good job of faking it. I don't know if our clusters are "at scale" - the largest of them has about 120 nodes - but we have had very few issues in over three years of running production workloads on GKE.
[+] corytheboyd|5 years ago|reply
Out of curiosity, is it rough hearing that the author felt lied to by the marketing, or that you worked for one of the companies doing it, or something else?
[+] ForHackernews|5 years ago|reply
> GKE SRE taught me that even the foremost Kubernetes experts cannot safely operate Kubernetes at scale.

This is a pretty damning indictment, honestly.

But has k8s gotten so influential that it'll be impossible to dislodge? I guess k8s has no problem breaking backwards-compatibility with new versions, so maybe somebody can propose something better even if it breaks compatibility.

[+] pjmlp|5 years ago|reply
In the Java and .NET world, with projects like Tye and Quarkus, you get to automate the interactions with Kubernetes from the language tooling side, so it becomes less painful to deal with all of Kubernetes' idiosyncrasies.
[+] FroshKiller|5 years ago|reply
I think you meant “indictment” rather than “indigent.”
[+] bogomipz|5 years ago|reply
>"So, for starters, let’s rip out all k8s networking. Overlay networks, gone. Services, gone. CNI, gone. kube-proxy, gone. Network addons, gone."

Then they go on to suggest:

>"If you have more elaborate connectivity needs, you bolt those on as additional network interfaces and boring, predictable IPv6 routes. Need to secure node-to-node comms? Bring up wireguard tunnels, add routes to push node IPs through the wireguard tunnel, and you’re done."

and

>"We could also have some fun with NAT64 and CLAT: make the entire network IPv6-only, but use CLAT to trick pods into thinking they have v4 connectivity. Within the pod, do 4-to-6 translation and send the traffic onwards to a NAT64 gateway."

So swapping one type of NAT translation for another and throwing in some additional tunneling? How is that any simpler, more elegant, or more manageable than the current state of K8S networking? There is no requirement to have an overlay network at all. If you bring up an EKS cluster in AWS today, the default is the aws-vpc CNI, which gives pods addresses from the same flat address space as your VPC; there is no overlay.
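For reference, the NAT64/CLAT mechanism being debated here reduces to RFC 6052 address mapping; a minimal sketch using the well-known 64:ff9b::/96 prefix:

```python
# RFC 6052 address synthesis: a CLAT/NAT64 path embeds an IPv4 address in the
# low 32 bits of an IPv6 prefix, and the NAT64 gateway extracts it again.
import ipaddress

WKP = ipaddress.IPv6Network("64:ff9b::/96")  # NAT64 well-known prefix

def synthesize_nat64(v4: str) -> ipaddress.IPv6Address:
    """Embed an IPv4 address in the low 32 bits of the NAT64 prefix."""
    return ipaddress.IPv6Address(int(WKP.network_address) | int(ipaddress.IPv4Address(v4)))

def extract_v4(v6: ipaddress.IPv6Address) -> ipaddress.IPv4Address:
    """Recover the original IPv4 address, as the NAT64 gateway does."""
    return ipaddress.IPv4Address(int(v6) & 0xFFFFFFFF)

addr = synthesize_nat64("192.0.2.1")
print(addr)              # 64:ff9b::c000:201
print(extract_v4(addr))  # 192.0.2.1
```

The mapping itself is stateless and trivial; the criticism above is that you still end up operating translators and tunnels, just different ones.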

Then further:

>"Sticking at the pod layer for a bit longer: now that they’re mutable, the next obvious thing I want is rollbacks. For that, let’s keep old versions of pod definitions around, and make it trivial to “go back to version N”."

>"Now, a pod update looks like: write an updated definition of the pod, and it updates to match. Update broken? Write back version N-1, and you’re done."

This is exactly what using a GitOps operator does. Then they go on in the next sentence to call GitOps "nonsense"?

Not much of this is convincing or even well-thought-out. This is definitely not "from the ground up." It's more like throwing some shit against the wall and seeing if something sticks.
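For what it's worth, the "write back version N-1" behavior quoted above is easy to model (a toy sketch of my own, not the article's design), which arguably supports the point that it's just versioned declarative state, i.e. what GitOps already provides:

```python
# Toy model of "keep old versions of pod definitions around and make it
# trivial to go back to version N": rollback is just re-applying an old spec.
class PodHistory:
    def __init__(self):
        self.versions = []  # versions[n] is the pod spec at version n

    def apply(self, spec: dict) -> int:
        """Writing an updated definition appends a new version."""
        self.versions.append(spec)
        return len(self.versions) - 1

    def rollback_to(self, n: int) -> dict:
        """'Write back version N': re-apply an old spec as the newest version."""
        spec = self.versions[n]
        self.apply(spec)
        return spec

h = PodHistory()
h.apply({"image": "app:v1"})
h.apply({"image": "app:v2"})  # broken update
h.rollback_to(0)
print(h.versions[-1])         # {'image': 'app:v1'}
```

A git history of manifests plus an operator that applies HEAD gives exactly this behavior, with the version log stored in git.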

[+] astral303|5 years ago|reply
I must be boring because I manage plain old (virtual) machines with a load balancer at the edge.

It is easy as pie to manage and is easy to understand and secure.

I build the clustering into relevant services as needed, because what it means to be in a cluster together is highly service-specific, so you are deluding yourself if you think a generic external clustering framework is anywhere near the answer.

If all you've ever seen is the Google-prescribed order of the world (Google SRE), then of course you would contemplate rewriting K8s instead of throwing it out.

[+] reacharavindh|5 years ago|reply
Maybe there can be another form of Kubernetes that is even simpler and more opinionated.

A project that picks one of [firecracker, gvisor, runc] and aggressively supports it, and combines it with a barebones network overlay that assumes nodes are running on a single subnet. Perhaps it also assumes that the control plane runs on dependable hardware rather than distributing everything. Maybe it exists already and I just don't know it...

The fewer moving parts the better, for people like me who are not looking at huge scale but still enough scale to span several servers in a few racks.

[+] StreamBright|5 years ago|reply
Doesn’t Nomad already support FC?
[+] gabereiser|5 years ago|reply
This was a good read. My question, though, is: what’s to stop this Kubenext from becoming the next Kubernetes? To use the analogy, what’s to keep this sleek Go orchestration from becoming the C++ everyone wants to migrate away from?

The networking bit reminds me a lot of Mesos, which was utterly hammered out of existence by corporations' blind bandwagon-riding of Kubernetes. Mesos networking ran alongside Docker’s and required you to do Borg-style service port mapping (albeit in an atomic-number way). What I don’t like about 90% of this (and why ECS is my jam) is that I want an orchestration of containers; I don’t want ZooKeeper, etcd, dns-proxy, and the plethora of other services needed to make my orchestration orchestrate.

I spent a good few years running cloud architecture and infrastructure and managed a few SREs. The things that made our lives easier were guaranteed red/blue deployments (mentioned a bit in the article with PinnedDeployments), auto-sizing clusters to resource totals (and back down again after deployments), Terraform/CloudFormation or really anything IaC, and a Slackbot for deployments with rapid feedback of status through the CI chain. Akin to “bot deploy <repo> to <env>”, letting the bot figure out the config from YAML in the repos.

I had the most joy using DC/OS. I had the most commercial success with ECS. I’ve had the most requests for kubernetes. I’ve had the worst headaches with abstractions of kubernetes (packaged installers, canonical...)

[+] corytheboyd|5 years ago|reply
A bit OT. I’ve been putting off learning how to setup and run k8s, and am unfortunately in a situation where I don’t have anyone at work to learn from.

For context I’m no stranger to what the shape of production ready systems should be and can fill in the gaps given enough time to research and educate myself, but I don’t do operational work day-to-day.

I’m bringing a project to life right now and can’t but feel like while tough to learn and manage, k8s would be a good investment to make. I don’t have anyone using it yet, so I just did the bare minimum to get my docker-compose stack up and running on a Linode box. It works great for making sure what I have now works in a remote environment too, and I had to do a decent amount of configuration rework to get it ready, which should be transferable.

Now I’m wondering, how will things like rolling deployments work? I want to decouple the monitoring stack from the application stack, how will I handle adding another physical machine to my setup? I’m sure more questions will come up like this as I run into them, but would be curious to hear initial thoughts from anyone here to help me make a decision :)

[+] ahnick|5 years ago|reply
I think the easiest way to get started is to use DigitalOcean's managed Kubernetes offering. If you already know Docker, then basically you'll just be learning how to set up a cluster on Kubernetes, install your app and other apps to the cluster using kubectl and Helm, and then set up an ingress into the cluster (likely nginx-ingress, although there are other options).

In regards to your specific questions... adding another physical machine to your cluster would be really simple if you are using Terraform: you would just increment your node count and re-apply the Terraform config. Here is an example Terraform config (main.tf) from a simple project I have:

  variable "do_token" {}

  provider "digitalocean" {
    token = var.do_token
  }

  resource "digitalocean_kubernetes_cluster" "your-cluster" {
    name    = "your-cluster"
    region  = "sfo2"
    # Grab the latest version slug from `doctl kubernetes options versions`
    version = "1.16.6-do.2"

    node_pool {
      name       = "your-pool-1"
      size       = "s-1vcpu-2gb"
      node_count = 3
    }
  }
For deployments I would honestly just keep it simple and increment the docker tag version on your image each time you deploy. Then when you apply your new image (e.g. kubectl apply -f deployment.yaml), the new image will be pulled onto the nodes and the application's pods will be restarted one by one.

If you are running this on another test/development cluster (e.g. minikube) prior to deployment, then you should have great confidence that this will succeed. In the event that you did run into an issue, just roll back the docker image version number and reapply the yaml using kubectl again.

I've been following this method for a while and have never had any downtime with deployments. Eventually if you get sophisticated you'll want to add these steps into an automated CI/CD pipeline, but kubectl apply can carry you pretty far in solo operations.

[+] scott_s|5 years ago|reply
My colleagues and I came up with an abstraction called conductors for the orchestration problem. The short version: when you depend on multiple resources being in a particular state before creating or updating yet another resource, use a conductor. Conductors observe multiple resource kinds and operate an FSM internally. State transitions happen upon receiving an event from an observed resource. The conductor's logs make it easy to debug dependency problems, because you can easily see that its FSM is in a particular state, waiting to see some number of a particular resource.

The other principles we developed are “controllers” only control one resource and all resource updates for a resource kind must be serialized through a coordinator. Our paper has much more detail: A Cloud Native Platform for Stateful Streaming, https://arxiv.org/abs/2006.00064
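A toy sketch of the conductor pattern as described above (the names and states here are mine, not from the paper): an FSM that only transitions on events from observed resources, and only acts once its dependencies are satisfied.

```python
# Minimal "conductor": waits for a required number of observed resources to be
# Ready before entering the state where the dependent resource may be updated.
class Conductor:
    def __init__(self, required_ready: int):
        self.required_ready = required_ready
        self.ready = set()
        self.state = "WaitingForDependencies"

    def on_event(self, resource_name: str, status: str) -> str:
        """State transitions happen only on events from observed resources."""
        if status == "Ready":
            self.ready.add(resource_name)
        else:
            self.ready.discard(resource_name)
        if self.state == "WaitingForDependencies" and len(self.ready) >= self.required_ready:
            self.state = "Reconciling"  # safe to create/update the dependent resource
        elif self.state == "Reconciling" and len(self.ready) < self.required_ready:
            self.state = "WaitingForDependencies"
        return self.state

c = Conductor(required_ready=2)
print(c.on_event("volume-0", "Ready"))   # WaitingForDependencies
print(c.on_event("service-a", "Ready"))  # Reconciling
```

Debugging then amounts to reading which state the FSM is in and which observed resources it is still waiting on.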

[+] hbogert|5 years ago|reply
Whenever people use Go in an analogy and depict it as simple, I'm out. Easy, yes; simple, no.
[+] znpy|5 years ago|reply
This is beautiful. I hope this person goes on and actually implements what they're talking about.

I really liked the part about simplifying networking. I really feel there should be a more general push towards both IPv6 and SRV records.

[+] jeffbee|5 years ago|reply
Networking is the most befuddling thing in k8s. I really don’t see how they got from Borg to that. I think they thought people just wouldn’t accept the free-for-all Borg model. But I’d rather have some slight complexity in my name service (DNS sucks anyway) than configure networking in k8s.
[+] Rillen|5 years ago|reply
Making pods mutable would break the core benefit of what Kubernetes' declarative system does for you.

This GitOps 'nonsense' gives me a well-defined and automatically backed-up infrastructure setup with auditing built in. It doesn't allow someone to snowflake around, which is brilliant, and it forces you and your colleagues to write things down as manifests instead of forgetting them and degrading your system over time (Nix and NixOS are also great examples of such systems).

This reminds me of the time I learned HTML and wanted to insert manual line breaks all over a text instead of using proper paragraphs and letting HTML take care of the formatting.

I would like to see a better/stronger StatefulSet, though: as long as a pod is alive, make sure its state is not interrupted, e.g. by allowing a pod to be migrated to another node.

Nonetheless, I'm in the middle of setting up Kubernetes with kubeadm and the Cilium network. It's already really easy to do. It will only get easier and more stable over time, and it's already great.

When you look at the storage example: yes, it's more difficult than just using a hard drive. But that ignores the issues with one hard drive: backup, checksums/bit rot, and recovery. With a storage layer, you can actually increase the replica count, and you can back up ALL storage volumes automatically.

The same goes for networking: with Cilium you can now have a lightweight firewall with DNS support.

It is much more critical for the whole industry to start rebuilding software to be more cloud/container native. This will reduce the pain points we have right now and make operations more resilient. For example Jenkins: instead of one big master, have an HA setup for your work queue, a pod for the dashboard, and schedule workers on demand.

My personal conclusion: Don't use it, if you don't need it. If you need it, embrace the advantages.

[+] throwaway894345|5 years ago|reply
> Making pods mutable would break the core benefit of what Kubernetes' declarative system does for you.

Which core benefit is that? I’m not following.

> This GitOps 'nonsense' gives me a well defined and automatically backuped infrastructure setup with audit build in. It doesn't allow someone to snowflake around which is brilliant and forces you and your colleges to manifest stuff and not forgetting it and degrading your system over time (nix and nixos are also great examples of such systems)

TFA says you can still use the GitOps “nonsense” if you want under his proposal.

[+] lazyresearcher|5 years ago|reply
Can't you use node affinity to stop k8s from moving a pod to a different node whilst alive?
[+] dnautics|5 years ago|reply
Having worked in the Erlang VM for a few years now, this is something I have wished for so hard (Erlang + systemd gets you very far, but not quite there). But, like the author, I have been happy doing orchestration on metal, and I don't have the heart to try to build this (and try to get mindshare for it) myself.
[+] pst|5 years ago|reply
The article lost me at making pods mutable.
[+] dilyevsky|5 years ago|reply
The author complains the system is too complicated, then proposes a bunch more features that would make it even more complex (particularly mutable pods).

> A modest expansion of the previous section: make each field of an object owned explicitly by a particular control loop. That loop is the only one allowed to write to that field. If no owner is defined, the field is writable by the cluster operator, and nothing else

This is already a thing starting with 1.17, I think, with server-side apply https://kubernetes.io/docs/reference/using-api/server-side-a... (except it’s opt-in)

[+] blinkingled|5 years ago|reply
> Versioning - Update broken? Write back version N-1, and you’re done.

Doesn't kubectl rollout undo already do this for deployments?

> Pinned deployments

Service meshes like Istio let you run multiple versions of services that you can selectively route traffic to, and you can opt in if you need it. What value do pinned deployments add over that?

I kind of get the problem parts of k8s networking. But other than that, this seems to make already-complicated Kubernetes even more complicated for not-so-convincing reasons.