
Kubernetes is hard

189 points | suralind | 2 years ago | rcwz.pl

158 comments

[+] 0xbadcafebee|2 years ago|reply
> Kubernetes is complex and I think they are partially right

Kubernetes is a distributed centralized operating system which itself depends on a distributed decentralized database, and has a varying network topology, permissions system, plugins, scheduler, storage, and much more, depending on how & where it was built, and runs applications as independent containerized environments (often deeply dependent on Linux kernel features) which can all have their own base operating systems. All of which must be maintained, upgraded, patched, and secured, separately, and frequently.

Kubernetes is literally the most complex single system that almost anyone in the world will ever use. It is the Katamari Damacy of the cloud.

> It allows dev teams to not worry about all these things; all they must do is to write a simple YAML file.

cackles, then sobs

> More importantly, teams no longer need to ask DevOps/infra folks to add DNS entry and create a Load Balancer just to expose a service.

more sobbing

> Should you use Kubernetes?

Should you change a tire with a crowbar, a can of WD40, and a lighter? Given an alternative, the alternative is usually better, but sometimes you don't have an alternative.

[+] hartator|2 years ago|reply
> It allows dev teams to not worry about all these things; all they must do is to write a simple YAML file.

A "simple" YAML file.
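For anyone who hasn't seen one, here's roughly the minimum "simple" YAML to run one container and expose it inside the cluster (the names and image are invented for illustration):

```yaml
# Hypothetical minimal Deployment + Service. Real-world manifests grow
# probes, resource limits, affinity rules, security contexts, etc.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: hello
          image: example/hello:1.0
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
  ports:
    - port: 80
      targetPort: 8080
```

And note this still doesn't get you external DNS or a load balancer; that's another resource or two on top.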

[+] Rantenki|2 years ago|reply
OMFG so much this.

I worked at a large company that deployed its own Kubernetes stack, on a VERY large number of physical hosts. The theory was that K8s would simplify our devops story enough that we could iterate quickly and scale linearly.

In reality, the K8s team ended up being literally 10x larger than the team building the application we were deploying on it. In addition, K8s introduced entirely new categories of failure mode (ahem: CNI updates/restarts/bedsh*tting, operator/custom resource failures, and tons of other ego-driven footguns).

The worst part? The application itself ran fine on a single dev workstation, but also on any random assortment of VMs. Just pass the Consul details as environment variables. I am not saying everybody on K8s is in the same boat, but I think that far more people are planning on becoming a unicorn cloud service than have any hope of becoming a unicorn cloud service.

TL;DR: If your hosting solution requires more maintenance than the application itself, you made a boo-boo.

[+] dekhn|2 years ago|reply
if you think k8s is the most complex system anyone in the world will ever use: 50K Googlers using Borg beg to disagree.
[+] overgard|2 years ago|reply
I agree with the point that production is hard. There's so many things you just don't think about as a developer that end up being important. Log storage, certificate renewal, etc.

I think how "hard" kubernetes is depends on how deep you go. If you're building a cluster from scratch, on your own hardware, setting up the control plane yourself etc. it's very very hard. On the other hand, if you're using a hosted service like EKS and you can hand off the hardware and control plane management to someone else, IMO it's actually very easy to use; I actually find it a lot easier than working with the constellation of services amazon has to offer for instance.

I do think there are parts of it where "best practices" are still being worked out though, like managing YAML files. There's also definitely some rough edges. Like, Helm charts are great... to use. They're an absolute nightmare to write, and there's all sorts of delightful corner cases like not being able to reliably upgrade things that use StatefulSet (last I used anyway). It's not perfect, but honestly if you learn the core concepts and use a hosted service you can get a lot out of it.
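As a taste of why chart-writing is the painful side: most of the sharp edges are in templating. A hypothetical fragment (all names and values invented) showing the kind of corner cases involved:

```yaml
# Hypothetical Helm template fragment illustrating two classic traps.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "mychart.fullname" . }}
spec:
  replicas: {{ .Values.replicaCount | default 2 }}
  template:
    metadata:
      annotations:
        # Trap 1: whitespace control. Forgetting nindent (or getting
        # the indent count wrong) produces invalid YAML, and only at
        # render time against a particular values file.
        {{- toYaml .Values.podAnnotations | nindent 8 }}
    spec:
      containers:
        - name: app
          # Trap 2: quoting. An unquoted tag like 1.20 in values.yaml
          # parses as the float 1.2; quoting the whole image string
          # sidesteps that.
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

None of this is visible to a chart *user*, which is exactly the "great to use, nightmare to write" asymmetry.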

[+] endisneigh|2 years ago|reply
Cost aside, I wonder how far you can get with something like a managed newsql database (Spanner, CockroachDB, Vitess, etc.) and serverless.

Most providers at this point offer ephemeral containers or serverless functions.

Does a product focused, non infra startup even need k8s? In my honest opinion people should be using Cloud Run. It’s by far Google’s best cloud product.

Anyway, going back to the article - k8s is hard if you’re doing hard things. It’s pretty trivial to do easy things using k8s, which only leads to the question - why not use the cloud equivalents of all the “easy” things? Monitoring, logging, pub/sub, etc. Basically all of these things have cloud equivalents as services.

The question is, cost aside, why use k8s? Of course, if you are cost constrained you might do bare metal, or a cheaper collocation, or maybe even a cheap cloud like DigitalOcean. Regardless, you will bear the cost one way or another.

If it were really so easy to use k8s to productionize services to then offer as a SaaS, everyone would do it. Therefore I assert, unless those things are your service, you should use the cloud services. Don’t use cloud vms, use cloud services, and preserve your sanity. After all, if you’re not willing to pay someone else to be oncall, that implies the arbitrage isn’t really there enough to drive the cost down enough for you to pay, which might imply it isn’t worth your time either (infra companies aside).

[+] mirekrusin|2 years ago|reply
Cloud services are shit unless I can run them locally when developing and testing.
[+] davnicwil|2 years ago|reply
Or just app engine honestly.

Works with docker containers so you can run the same simple stack locally as in prod. No need for more exotic serverless architectures.

Generous free tier, too!

Have only good things to say about it for quickly firing up a product.
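For concreteness, the whole App Engine config for a Dockerized app can be about this small (a sketch of the flexible-environment setup described; the scaling numbers are made up):

```yaml
# app.yaml — "custom" runtime tells App Engine to build and run the
# Dockerfile in the project root.
runtime: custom
env: flex
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 3
```

Deploying is then a single `gcloud app deploy`.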

[+] suralind|2 years ago|reply
I think you could push that setup far. I'm not familiar with GCP or Cloud Run, but it probably integrates nicely with other services GCP offers (for debugging, etc.).

I'd be curious to read if anybody has that setup and what scale they have.

[+] suralind|2 years ago|reply
Regarding the second part, I totally agree, either use cloud or don't. For some reason, most companies want to be cloud-agnostic and so they stay away from things that are too difficult to migrate between cloud providers.
[+] rs999gti|2 years ago|reply
> In my honest opinion people should be using Cloud Run. It’s by far Google’s best cloud product.

Is this the same thing as running containers in Azure App Services?

[+] solatic|2 years ago|reply
This. Greenfield products should be serverless by default. By the time you have sustained traffic to the point where you can run the numbers and think that you could save money by switching off serverless, that's a Good Problem To Have, one for which you'll have investors giving you money to hire DevOps to take care of the servers.
[+] tptacek|2 years ago|reply
37signals is not like the typical large-scale startup. They have an extremely small team (around 30 people?), and just a couple of products.

Large-scale startups use dynamic-scheduled cloud services in part to reduce coupling between teams. Every service --- and there are dozens --- is scheduled independently, and new teams can get spun up to roll out new services without too much intervention from other teams.

When you've got a couple of products that have been in maintenance mode for 10+ years and just two primary products, both of which are on the same stack, and you can predict your workloads way out into the future (because you charge money for your services, don't do viral go-to-markets, and don't have public services), there simply isn't much of a win to dynamic scheduling. You can, in fact, just have a yaml file somewhere with all your hosts in it, and write some shell-grade tooling to roll new versions of your apps out.
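That "shell-grade tooling" can be sketched in a dozen lines. To be clear, everything below (the hosts-file format, the container name, the docker commands) is invented for illustration, not 37signals' actual setup:

```shell
# Hypothetical statically scheduled rollout: read hosts from a flat
# file, swap the running container on each one in turn.
rollout() {
    image="$1"
    hosts_file="${2:-hosts.txt}"
    while IFS= read -r host; do
        # Skip blank lines and comments in the hosts file.
        case "$host" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry-run mode: report instead of deploying.
            echo "would deploy $image to $host"
        else
            ssh "$host" "docker pull $image && docker rm -f app && docker run -d --name app $image"
        fi
    done < "$hosts_file"
}
```

Rolling restarts, canaries, and rollback are a loop, a `head -1`, and an old tag away; that's the point about not needing a scheduler.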

A lot of the reflexive pushback to not using k8s seemed like it came from people that either didn't understand that 37signals was doing something closer to static scheduling, or that don't understand that most of what makes k8s complicated is dynamic scheduling.

[+] solatic|2 years ago|reply
> Every service --- and there are dozens...

Most startups that I see trying to go with microservices too early do so while keeping a shared database between the services, so they're not really microservices, but a distributed monolith. This turns into a massive, massive pain.

Doing microservices well means building out templates for CI/CD pipelines, templates for observability, figuring out how best to share (or not share) common credentials like third-party API keys, setting up a service mesh, setting up services like Backstage and Buf... which inevitably requires hiring dedicated engineers at enormous cost.

If you can set up a monolith, why would you switch to microservices? So that you can have a smaller container image and faster autoscaling? So the developers on the second and third teams you hire can wait a little less time for CI tests to run?

It's a pretty big mistake to adopt microservices too early.

[+] majewsky|2 years ago|reply
> Large-scale startups use dynamic-scheduled cloud services in part to reduce coupling between teams.

This is the crux. It's Conway's Law in action. The promise of Kubernetes to a large org is that you can split the baremetal and OS layer into one team that manages everything up to Kubernetes, and then that's the common interface where all other teams deploy their applications into. And besides the team separation, the value is that you have a central layer to put automatic policy enforcement, engineering doctrines, and so forth.

[+] anonzzzies|2 years ago|reply
> what makes k8s complicated is dynamic scheduling.

… Which almost no startup or otherwise will ever need. Creating complex stuff for workloads you will never ever have. You hope to have them, but that’s called premature optimisation. And even then you still most likely fall in the bracket of a company that will never need it.

[+] deterministic|2 years ago|reply
I work for a company that routinely deploys very large scale software to airlines/airports/rail companies around the world. Millions of lines of mission-critical server and mobile/browser/desktop client code.

We do it without the cloud, without microservices, without Kubernetes, etc. Just straightforward, good old fashioned client/server monoliths. It’s simple. It works.

The reality is that 99% of people who think they need Kubernetes don’t actually need it. Almost all problems in software development are caused by the developers themselves. Not by the actual business problems.

[+] xyzzy_plugh|2 years ago|reply
Who said anything about a large-scale startup? Kubernetes is approachable all the way down to N=1 employees.

I strongly disagree with your take on static vs dynamic scheduling. Static scheduling ties your hands early. In a mature organization, it is very much an optimization.

Dynamic scheduling forces a cattle-not-pets mentality out of the gate, which is great. It also gives you all the knobs to figure out HA, right-sizing and performance that you'd ever want, for whenever you're ready for them. It's considerably more laborious to rearrange or tune things with static scheduling. I've run the gamut here and Kubernetes is by far the easiest and most approachable way to manage a fleet from 1 to $BIG that I've ever encountered. If you want to treat it like a static scheduler, that is also trivial. It's not like there's some huge cost to doing so. It's basically a NOP.

37signals blew off their foot by doubling down on their sunk costs (read: capex) of metal. They clearly don't want to think about KVM and F5s and SSH keys and all the other odds and ends that are entirely solved away by managed services for reasonable prices.

Which is it? Are they too big for the cloud or too small? You can't have it both ways.

[+] tkiolp4|2 years ago|reply
I think the post title should be called “Production is hard” (as the author talks about later on). Pick up any technology out there: from Python, to C++, to K8s, to Linux… Do the analogous “Hello world” using such technologies and run the program on your laptop. Easy. You congratulate yourself and move on.

Production is another story. Suddenly your program that wasn’t checking for errors, breaks. The memory that you didn’t manage properly becomes now a problem. Your algorithm doesn’t cut it anymore. Etc.

[+] morelisp|2 years ago|reply
> Suddenly your program that wasn’t checking for errors, breaks. The memory that you didn’t manage properly becomes now a problem.

Yeah, nobody deployed anything and ran it for months, even years, before Kubernetes.

[+] papruapap|2 years ago|reply
Well, then you could reduce the whole article to its title, because there isn't anything else in it.
[+] znpy|2 years ago|reply
Quoting Bryan Cantrill: production is war.
[+] bdougherty|2 years ago|reply
Production is only as hard as you make it.
[+] rglover|2 years ago|reply
Kubernetes is hard because it's over-complicated and poorly designed. A lot of people don't want to hear that because it was created by The Almighty Google and people have made oodles of money being k8s gurus.

After wasting two years chasing config files, constant deprecations, and a swamp of third-party dependencies that were supposedly "blessed" (all of which led to unnecessary downtime and stress), I swapped it all out with a HAProxy load balancer server in front of some vanilla instances and a few scripts to handle auto-scaling. Since then: I've had zero downtime and scaling is region-specific and chill (and could work up to an infinite number of instances). It just works.
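The core of a setup like that fits in one screen of config. A minimal sketch of the HAProxy side (backend addresses and health-check path are placeholders; a real config adds TLS, logging, and tuned timeouts):

```
global
    maxconn 4096

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend web
    bind *:80
    default_backend app

backend app
    balance roundrobin
    option httpchk GET /healthz
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check
```

The auto-scaling scripts just add or remove `server` lines and reload HAProxy.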

The punchline: just because it's popular, doesn't mean it's the best way to do it.

[+] tflinton|2 years ago|reply
It’s not overly complicated, it’s just trying to serve everyone’s use cases. I’ve tried deploying to 10k servers with custom scripts in Jenkins, Bamboo, and AWS auto scaling groups, but I’ve found Kubernetes is the only tool that will elegantly handle the problem. You can probably write a script for the happy path, but for a production service I’d bet my money on something that can handle all of the problems that come along with the statistical blow-ups at scale. That said, it can be complete overkill for most systems.
[+] orthecreedence|2 years ago|reply
For a happy medium, check out Nomad. I've been managing our infrastructure on Nomad for years by myself, with upwards of 40 nodes (auto-scaled) and the number of problems we've had can be counted on one hand (and was almost always a simple user error or fixed by upgrading). I spend most of the time I would otherwise spend doing tedious ops shit actually building things.

That said, Nomad and stateful services don't mix. Don't try. I think the same goes for k8s though.

[+] rektide|2 years ago|reply
You can setup a solid k3s cluster in 30 minutes. I'm sorry you had a hard time but just because you didn't succeed at your attempt doesn't mean it actually is super hard.
[+] honkycat|2 years ago|reply
I am consistently confused by all of the talk about how "hard" Kubernetes is.

We spin up EKS. We install the newrelic and datadog log ingestion pods onto it, provided in a nice "helm" format.

We install a few other resources via helm, like external secrets, and external dns, and a few others.

Kubernetes EKS runs like a champ. My company saves 100k/mo by dynamically scaling our cloud services, all of which are running on Kubernetes, to more efficiently use compute classes.

My company has over 50 million unique users monthly. We have massive scale. Kubernetes just works for us and we only have 2 people maintaining it.

What we gain is a unified platform with a consistent API for developing our services. And if we wanted to migrate elsewhere, it is one less thing to worry about.

¯\_(ツ)_/¯

Feels like some kind of hipster instinct to dislike the "cool new thing"... even though k8s has been around for years now and has been battle-tested to the bone.

[+] drdaeman|2 years ago|reply
So, what do you do when one of your pods suddenly cannot connect to another, even though both nodes seem to be passing healthchecks and stuff?

Spinning up a K8s cluster "in the cloud" is easy and everyone jumps on that happy "look how simple it all is" bandwagon, forgetting that it's just the beginning of a very long journey. There are millions of blog articles of varying quality that explain how easy it is, because it's very simple to spam search engines by retelling the story of how to click a couple of buttons or do some basic Terraform/CloudFormation/whatever.

And here's what they don't tell you - maintaining this machinery is still your business, because all you get is a bunch of provisioned node machines and a cookiecutter to spin it up. Plus a bunch of scripts to handle the most common scenarios (scaling, upgrading, some basic troubleshooting, etc). The rest is either on you or on tech support (if you pay extra for it). And if you have a sysadmin-for-hire contract anyway, then it's them who should have an opinion on what's easy and what's hard. Contracting other people is always relatively easy - compared to what they do.

[+] rjh29|2 years ago|reply
Yeah, it's easy for you because it's two people's full-time job to maintain it? Many of us are having to learn it and use it in our spare time, or on top of our other work. We wouldn't necessarily know the best practices, or to use New Relic and Datadog, or what to use for external secrets or external DNS, or how to diagnose and debug the issues which inevitably occur when setting it up.

Now this is true for doing it without k8s too, but somehow there was never a huge set of blog posts about "it's really hard to set up a load balancer and a secrets service and networking", yet there is for k8s, so there must be something intrinsic in either its design or its documentation that is causing that. I think it's probably that k8s is designed for Google-scale deployments, so for most people the initial burst of complexity is a bit overwhelming.

[+] sass_muffin|2 years ago|reply
Same, we use EKS and a very similar setup, our workload has some pretty high throughput and scaling requirements. Works amazing for our team, wouldn't change it for anything else at this point. Very low maintenance effort since AWS manages the K8s infra.
[+] Turbots|2 years ago|reply
Your company "saves" over 100k/month paying WAY too much for EKS, which is extremely expensive.

If you're at any decent scale (looks like you are), then switch to GKE, or switch to on-prem and buy some hardware + a Kubernetes distro like Mirantis/Openshift/Tanzu.

Heck, go run k3s on Hetzner and you won't have that much more work, but save literally millions at the scale you're talking about.

[+] jasoneckert|2 years ago|reply
While I understand where the author is coming from, my opinion of Kubernetes (and production deployment in general) isn't that it is hard per se, but that it involves many components.

I liken it to Lego. Each component separately isn't hard to work with, and once you figure out how to connect it to other components, you can do it 100 times easily. And like Lego, a typical Kubernetes environment may consist of several dozen or several hundred pieces.

So, I wouldn't describe Kubernetes as hard - I would describe it as large (i.e., comprised of multiple interconnected components). And by being large, there is a fair amount of time and effort necessary to learn it and maintain it, which may make it seem hard. But in the end, it's just Lego.

[+] swozey|2 years ago|reply
As an infra person, reading k8s posts on Hacker News has got to be one of the most frustrating and pointless things to do on here. You all just regurgitate the same thing every post. It's even the same people, over and over again.

30% of you are developers who think K8s is the devil and too complex and difficult, 30% of you like it and enjoy using it, and another 20% of you have never touched it but have strong opinions on it.

[+] amir734jj|2 years ago|reply
I would not use k8s unless we were convinced it would benefit us in the long run (think about the constant effort that needs to be put in to keep things running). k8s is not magic. I would just stick with docker-compose or DigitalOcean for a small startup. Or rent a VM on Azure. Or, if you really, really need k8s, use a managed k8s.
[+] brody_hamer|2 years ago|reply
Docker swarm is a great option too, for production environments. It’s like the production-ready big brother to docker-compose (with better health checks and deployment rollout options). And it has much less of a learning curve than k8s.
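The nice part is that the Swarm additions are just a `deploy` section on top of an ordinary compose file. A sketch (image, ports, and scaling numbers are invented) of what `docker stack deploy` would consume:

```yaml
version: "3.8"
services:
  web:
    image: example/web:1.0
    ports:
      - "80:8080"
    healthcheck:
      test: ["CMD", "wget", "-q", "--spider", "http://localhost:8080/healthz"]
      interval: 10s
      retries: 3
    deploy:
      replicas: 3
      update_config:
        # Roll one replica at a time, starting the new one first.
        parallelism: 1
        delay: 10s
        order: start-first
      restart_policy:
        condition: on-failure
```

Plain `docker compose up` ignores the `deploy` keys, so the same file works on a dev laptop.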
[+] scubbo|2 years ago|reply
> [K8s] allows dev teams to not worry about all these things; all they must do is to write a simple YAML file. More importantly, teams no longer need to ask DevOps/infra folks to add DNS entry and create a Load Balancer just to expose a service. They can do it on their own, in a declarative manner, if you have an operator to do it.

Yeah, as opposed to Cloudformation or Terraform, where you...uhhh...

Don't get me wrong, it requires work to set up your corporate infrastructure in your Favourite Cloud Provider(tm) to make those things available for developers to manage. But it takes work in k8s too - even the author says "if you have an operator to do it". Kubernetes is great for what it's great for, but these are terrible arguments in favour of it.

[+] suralind|2 years ago|reply
That's true, but I'd argue that TF is not as powerful as k8s. You could combine it with the auto-scaling services offered by cloud providers, but then you need to understand the plugins, whereas with k8s it's often a single value. For example, you can "add" a Prometheus scraper just by adding labels. You won't have that with TF.
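To illustrate the pattern being described: with the pod-discovery scrape config that many Prometheus setups ship, a pod opts in to scraping via annotations like these (the exact keys are a convention that depends on your Prometheus configuration, not a Kubernetes built-in):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: my-app
      image: example/my-app:1.0
```

No scraper config changes, no plugin wiring; Prometheus's service discovery picks the pod up on its next refresh.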
[+] paxys|2 years ago|reply
Kubernetes has been a total failure at defining a simple "just works" devops workflow, but I don't think that is due to any deficiencies in the product itself. The basic premise behind its common use case – automating away the SRE/ops role at a company – is what is flawed. Companies that blindly make the switch are painfully finding out that the job of their system operator wasn't just to follow instruction checklists but to apply reasoning and logic to solve problems, similar to that of any software engineer. And that's not something you can replace with Kubernetes or any other such tool.

On the other hand, there's still a lot of value in having a standard configuration and operating language for a large distributed system. It doesn't have to be easy to understand or use. Even if you still have to hire the same number of SREs, you can at least filter on Kubernetes experience rather than having them onboard to your custom stack. And on the other side, your ops skills and years of experience are now going to be a lot more transferrable if you want to move on from the company.

[+] rektide|2 years ago|reply
Kubernetes has been a masterwork ultra flexible but consistent underlay for building simple "just works" platforms.

It would be trash if it tried to be the answer; it'd be a good fit for no one. That's an unbelievable anti-goal.

But there are dozens of really good CI/CD & GitOps systems that work very well with it and make all kinds of sense. Install a k3s cluster in hour one, set up GitOps in hour two.

Tentative agreement in where we both land. You need to make intelligent choices. You need technical understanding & problem solving. Kube isn't really much better or worse, but it at least sets a shape & form where there are common patterns no matter what systems & platforms a particular company happens to be running on kube, no matter which concern you're dealing with.

[+] wodenokoto|2 years ago|reply
Luckily its been a few years since I had to work directly with Kubernetes. But ...

> Forget the hype, there’s a reason why Kubernetes is being adopted by so many companies.

I've never worked with it because it was the right solution, but only because some senior engineer or management bought into the hype.

> It allows dev teams to not worry about all these things; all they must do is to write a simple YAML file.

I've never found these yaml files simple.

[+] gloosx|2 years ago|reply
I think setting up something similar without k8s is like, 100 times harder? I was never deeply into DevOps but a single short video and few doc pages told me how to bring up highly available, load-balanced 2-node cluster, and how to rollout new services and versions in minutes with zero downtime. I also can precisely control it, monitor all the logs and resources without leaving my working terminal for a minute. I would never be able to set-up an infra like this without kube in a timespan of one day with little prior DevOps knowledge. The complexity beast it tames into structure is just mind-blowing, and it's a virtue that it came out just being a bit "hard".
[+] ianbutler|2 years ago|reply
I'm working on making it easier, or at least providing the tools to make working with it easier!

https://github.com/TorbFoundry/torb

"Torb is a tool for quickly setting up best practice development infrastructure on Kubernetes along with development stacks that have reasonably sane defaults. Instead of taking a couple hours to get a project started and then a week to get your infrastructure correct, do all of that in a couple minutes."

[+] menacingly|2 years ago|reply
Perhaps put more simply: operating in production has a lot of intrinsic complexity that will probably surface in the tooling, and if you constantly reinvent to "fix" the complexity you'll eventually end up putting it back.

That's how you end up with the modern javascript tooling hellscape where it looks like no one was around to tell a bright young developer "no"

[+] nathants|2 years ago|reply
the good news is, for the 95% of projects that can tolerate it, aws the good parts are actually both simple and easy[1].

it’s hard to find things you can’t build on s3, dynamo, lambda, and ec2.

if either compliance or a 5% project demands it, complicated solutions should be explored.

1. https://github.com/nathants/libaws