
Ask HN: Have You Left Kubernetes?

317 points| strzibny | 3 years ago

If so, what did you replace it with?

313 comments

[+] derefr|3 years ago|reply
We started with a full-stack k8s approach (on GKE); left (switching to plain GCE VMs); then came back much more conservatively, just using GKE for the stateless business-layer while keeping stateful components on dedicated VMs. Much lower total maintenance burden.

(Hard-won bit of experience: k8s and Redis really don't like each other if (1) Redis is configured to load from disk, and (2) the memory limit for the Redis container is somewhat tightly bounded. At least from the k8s controller's perspective, Redis apparently uses ~400% of its steady-state memory while reading the AOF tail of an RDB file — getting the container stuck in an OOM-kill loop until you come along and temporarily lift its memory limit.)
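
A workaround sketch, assuming the usual Deployment/StatefulSet container spec (the numbers are illustrative, not tuned): size the limit against load-time memory rather than steady-state memory.

```yaml
# Hypothetical fragment: leave ~4x headroom above Redis's steady-state
# working set so replaying the AOF/RDB on startup doesn't trip the OOM killer.
containers:
  - name: redis
    image: redis:7
    resources:
      requests:
        memory: "2Gi"   # steady-state working set
      limits:
        memory: "8Gi"   # headroom for the load-from-disk spike
```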

However, we're considering switching back to k8s for stateful components, with a different approach: allocating single-node node-pools with taints that map 1:1 to each stateful component, effectively making these more like "k8s-managed VMs" than "k8s-managed containers." The point would be to get away from the need to manage the VMs ourselves, giving them over to GKE, while still retaining the assumptions of VM isolation (e.g. not having/needing memory limits, because the single pod is the only tenant of the VM anyway.)
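
A sketch of what that pod spec might look like (the pool name, taint, and component are hypothetical): taint the single-node pool so nothing else schedules there, tolerate the taint from the one pod, and drop the memory limit since the pod is the node's only tenant.

```yaml
# Pod spec fragment for a "k8s-managed VM"
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: redis-pool   # 1:1 pool per component
  tolerations:
    - key: dedicated
      operator: Equal
      value: redis
      effect: NoSchedule
  containers:
    - name: redis
      image: redis:7
      # no resources.limits: VM-style isolation, the node is all yours
```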

[+] rekrsiv|3 years ago|reply
I've yet to encounter a non-smelly k8s deployment that was started before everyone knew how it works or why it works.

On the other hand, once everyone on the team has experience building such a system from scratch, then deploying k8s and using it somehow becomes straightforward.

It's almost as if we need to learn how a tool works before being able to use it effectively.

Anyway, here's what we (actually didn't) replace it with:

  - Don't let your devs learn about k8s on the job.
  - Let them run side-projects on your internal cluster.
  - Give them a small allowance to run their stuff on your network and learn how to do that safely.
  - Give your devs time to code review each other's internally-hosted side-projects-that-use-k8s.
  - Reap the benefits of a team that has learnt the ins and outs of k8s without messing up your products.
[+] dosethree|3 years ago|reply
Maybe it's just my team, but devs don't need to know k8s. It certainly doesn't hurt, but they should be able to write code and get their jobs done without knowing much about k8s at all. Basic shit like how to get logs, but that's a given for all platforms.
[+] meldyr|3 years ago|reply
What do you mean by side projects? Are they paid?

If you want your Devs to learn kubernetes you should pay them for doing it.

If you can't, hire a contractor with the expertise you need.

[+] manquer|3 years ago|reply
You can provision vclusters to give each dev (or even each team) their own space to play with the environment without it being a problem.

Cattle not pets after all.

[+] iasay|3 years ago|reply
Not yet. We are still deluding ourselves that the 3x cost increment and insane complexity increase we can barely manage to keep spinning is actually a business benefit.

Note: this isn't everyone's end game but I suspect it's realistic for a lot of people.

I would like to go back to cleanly divided, well-architected IaaS and Ansible. It was fast, extremely reliable, cheaper to run, had a much lower cognitive load and a million fewer footguns. Possibly more important: not everything can be wedged into containers cleanly, despite the promises.

[+] krmboya|3 years ago|reply
Also a big fan of sticking to Ansible and plain VMs, at least for most cases I've encountered. To me, a VM in the cloud already feels like a container, and you can use the cloud provider's APIs to scale virtual instances up and down as needed.
[+] gautamdivgi|3 years ago|reply
Unless you have massive scale, VMs are your best option. If you need VM configuration on startup (elastic scaling), you may need to maintain your own image. SaltStack and/or Fabric are good alternatives to Ansible.

You could look at containerization without K8S (podman or docker) especially if you use python and don’t want to mess with the Linux native python installation.

[+] nimbius|3 years ago|reply
You might consider migrating to systemd-controlled, rootless, dockerless podman. Helm even has a plugin for podman.
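
For the curious, a minimal Quadlet sketch of the rootless, systemd-controlled setup (Podman >= 4.4; the image and port are placeholders):

```ini
# ~/.config/containers/systemd/myapp.container
# Picked up by the user's systemd instance; start with:
#   systemctl --user daemon-reload && systemctl --user start myapp
[Container]
Image=docker.io/library/nginx:alpine
PublishPort=8080:80

[Service]
Restart=always

[Install]
WantedBy=default.target
```
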
[+] paulgb|3 years ago|reply
We did. Our use case is spinning up containers on demand in response to user actions, giving them ephemeral, internet-routable hostnames, and shutting them down when all inbound connections have dropped. Because users are waiting to interact with these containers, we found the start times with Kubernetes too slow and its architecture a bad fit.

We ended up writing our own control plane that uses NATS as a message bus. We are in the process of open sourcing it here: https://github.com/drifting-in-space/spawner

[+] cbanek|3 years ago|reply
> we found the start times with Kubernetes too slow

Just curious if you could elaborate here? I work with k8s on docker, and we're also going to be spinning up ephemeral containers (and most of the other things you say) with jupyter notebooks. We're all in on k8s, but since you might be ahead of me, just wondering what hurdles you have faced?

Our big problem was fetching containers took too long since we have kitchen sink containers that are like 10 GB (!) each. They seem to spin up pretty fast though if the image is already pulled. I've worked on a service that lives in the k8s cluster to pull images to make sure they are fresh (https://github.com/lsst-sqre/cachemachine) but curious if you are talking about that or the networking?

From what it looks like in your repo it might be that you need to do session timing (like ms) response time from a browser?

[+] baryphonic|3 years ago|reply
Wow, this is excellent! At a previous job, we had been using k8s + knative to spin up containers on demand, and likewise were unhappy with the delays. Spawner seems excellent.

One question: have you had to do any custom container builds on demand, and if so, have you had to deal with large "kitchen sink" containers (e.g. a Python base image with a few larger packages installed from PyPI, plus some system packages like Postgres client)? We would run up against extremely long build image times using tools like kaniko, and caching would typically have only a limited benefit.

I was experimenting using Nix to maybe solve some of these problems, but never got far enough to run a speed test, and then left the job before finishing. But it seems to me some sort of algorithm like Nixery uses (https://nixery.dev) to generate cacheable layers with completely repeatable builds and nothing extraneous would help.

Maybe that's not a problem you had to solve, but if it is, I'd love your thoughts.

[+] ehutch79|3 years ago|reply
It's always been my understanding that with things like k8s and other orchestration stuff, you're supposed to spin up before you need the capacity? You set a threshold, like 75% capacity, and if you're over that for a bit, you spin up a new container(s) to get you back to under effectively 75% capacity.

Is that not how this works?
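
In Kubernetes terms, I'd have expected that to be a HorizontalPodAutoscaler, something like this (a sketch; the deployment name is made up):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale out above 75% of requested CPU
```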

[+] samsquire|3 years ago|reply
This is really awesome. Thank you for sharing this.

One of my ideas lately has been to upgrade a FaaS function to a full-on server after a set amount of traffic. Said differently: spin up a dedicated server that serves the same app as callable functions (à la scalable RPC), upgrading to a dedicated instance composed of those functions. The best of both worlds.

Combine the scale-to-zero of serverless with the scalability and capacity of a dedicated server.

[+] no_circuit|3 years ago|reply
Kind of curious what made it too slow for your use case? I'm guessing you did not want users to wait for something like kube-dns to update, or for the workload scheduler? Of course things like spinning up a Pod can be slow. Or non-Kubernetes things, like DNS ACME challenges, could affect it.

But on the other hand, I can't quite figure out why something would prevent you yourself from running the service that hosts the VMs that host the on-demand containers on Kubernetes.

[+] wvh|3 years ago|reply
I just wrote a controller that does pretty much that – spawn containers on demand and report back status changes. While this solution does require some knowledge, it so far has been perfectly reliable and reasonably fast. I can fathom the need for processes to spawn and tear down faster in specific use cases than the Kubernetes scheduler would allow for, but for us a few seconds of wait time has been perfectly reasonable.
[+] osigurdson|3 years ago|reply
Is there a fundamental reason why Kubernetes cannot start pods and services fast (outside of pulling images of course!)?
[+] davewritescode|3 years ago|reply
No, in fact we've gone running towards it after some initial success, especially when combined with ArgoCD for CD and Istio as a service mesh. My company has a lot of experience with running applications on VMs and Amazon's ECS. Our VM automation ultimately became expensive to maintain and ECS had its own set of issues I could probably fill up a blog post with.

From the Operations side, Kubernetes is scary. It's easy to screw things up and you can definitely run into problems. I understand why folks who work mostly on that side of the house are put off by the complexity of Kubernetes.

However, from the application side of things, our developers have been THRILLED with Kubernetes. For most developers my company provides a nice paved road experience with minimal customization required. For advanced use cases, we allow developers to use the Kubernetes API (along with ArgoCD + GateKeeper policies) as a break glass type of approach. Istio gives the infra team the ability to easily move services between clusters and make policy changes easily. It also allows us to make use of Knative, although I think the Istio requirement is no longer there.

That said, you should be using managed Kubernetes wherever possible and not running your own clusters. That's where trouble lurks.

[+] clutchdude|3 years ago|reply
ArgoCD was our missing linchpin for getting workloads migrated over and supported.

It makes it that much easier to actually use the cluster rather than mess with endless configuration tooling. Is it the best engineered tool? Probably not. But it's the one that works best for us.

[+] therealdrag0|3 years ago|reply
Same story for us. We've been moving towards k8s and it's been great for app devs. We ran on plain VMs for a decade and it was a good time to switch at 2k employees, maybe 500 devs?
[+] benfrancom|3 years ago|reply
I migrated a company from k8s to ECS/Fargate in 2019. Kubernetes is very flexible, but I opted for simplicity.

The result of the migration was that there is little underlying infrastructure to maintain, and ongoing operational costs were lowered by 50% year over year. The CTO and I liked the setup so much, we started converting another large client of theirs. I followed up with them at the beginning of 2022 to see how things were going, and they still love it. There is so little maintenance, and now they have more time to focus on what they do best–Software!

Other options on the horizon that I'm testing include utilizing AWS Copilot with ECS/Fargate, and/or Copilot with Amazon App Runner.

[+] mr337|3 years ago|reply
I have settled on the ECS camp as well. Took a run at Kubernetes and was blown away by the complexity. With ECS/Fargate I don't spend any time on it. It just works for our setup.

I still wonder from time to time if I am missing something not going Kubernetes.

[+] rootforce|3 years ago|reply
I use AWS Copilot and find it to be really easy to use and helpful. It is still a pretty young project and as such doesn't really handle all the edge cases, but for the things it supports, it makes using ECS even easier than it already is.
[+] cies|3 years ago|reply
Chose Fargate over K8s too. I made the call, so no need for migrations :)
[+] oneplane|3 years ago|reply
We have had a few teams try, but as soon as you go beyond "I want to run some code for a bit", nobody really has anything for you. Instead of trying to re-invent the wheel (service discovery, mutual TLS, cross-provider capabilities) successfully, it went downhill quite fast and they moved back. (this was mostly due to cost as other services can get expensive really quickly, and because of the lack of broadly available knowledge for the custom stuff they had to build)

If a team were to start with no legacy and no complexity and there isn't going to be multi-team/multi-owner/shared-services I could see them using something else. But that applies to anything.

[+] julienchastang|3 years ago|reply
I've been a K8s user for some time, but it does drive me batshit crazy. My main beef with it is that I often cannot discern the logic of how things work. The developer platforms and systems I enjoy working with present you with primitive axioms that you can then bootstrap your knowledge upon to derive more complex ideas (e.g., any decent programming language, or an OS). K8s does not work that way -- at least as far as I can tell. A priori knowledge gains you nothing. When I run into a problem on K8s, I copy/paste the error into a search engine and am presented with a 200-message-long GitHub issue with users presenting their various solutions (how does this command relate to my original problem? who knows?); some work, but most of the time they don't and you are left in a bigger hole than when you started. I end up tearing the whole thing down and starting over, most of the time. That last comment is the biggest "code smell" for me with K8s. When it is easier to just nuke the thing and begin again, there is a problem.
[+] lumost|3 years ago|reply
I've never gotten too deep with K8s. It always came across as incredibly complex to maintain with limited managed service support. Whenever I spoke to engineers pushing it, the problems it solved didn't resonate with me as someone who's spent the last 10 years running hundreds of services across thousands of servers.

These days I'm a huge fan of CDK and Pipelines style deployments. I prefer to treat my compute layer as a swappable component which I'll change as and when I need to. I tend to lean towards serverless offerings which take care of the internal scaling details if I can while still giving me a traditional "instance", and if I can't then I'll go for the next best managed offering.

I've yet to see an example where internal tooling doesn't become a mess over time, and K8S requires a ton of work to keep things sensible.

[+] nailer|3 years ago|reply
Yep CDK and/or Pulumi. It’s very easy to map your own custom concepts and logic to your cloud provider, rather than making a cloud provider on top of the cloud provider you already pay for.
[+] kmac_|3 years ago|reply
I've moved to a company that doesn't use Kubernetes at the moment (and that's a 100% calculated and rational decision). What I see is that a lot of effort is put into providing the functionality that Kubernetes brings. If you're running a bunch of services and want to do that in a stable and secure way, Kubernetes cuts down running costs. It covers so many cross-cutting concerns that reimplementing those capabilities is not possible unless you have heavy $$$ to spend.
[+] Bayart|3 years ago|reply
Nope, I like k8s. What I don't like is people trying to be overly smart with it and leaving a configuration hell of templates, weird network configurations and broken certs behind them. For my personal workloads it's all basic containers with a reverse proxy, though.
[+] llama052|3 years ago|reply
Hell no,

I remember managing hundreds of virtual machines in datacenters & cloud, using Ansible and a myriad of other tooling.

It's nice when you're at a small scale and you don't have a lot of people making changes, but over time as it grows the pain grows with it unless you've enforced a consistent cattle model.

The longer VMs live with custom changes/code and updates over time the more brittle they can become. Part of the cattle model is so that you can recreate/rebuild when changing code so things stay consistent. The drift from infrastructure as code can be scary otherwise.

With the cattle model you need pipelines in place to build new VM images for infrastructure updates (Packer etc.), and multiple APIs to hit (easier in cloud) to upload images and serve them in a non-damaging way (HA deployments/rollouts/dealing with load balancers). It's certainly a non-trivial amount of work.

With Kubernetes, a lot of this tooling comes out of the box. You've got autoscaling, load balancing, health-checks, limits/requests, failure mitigation, service mesh options. On top of that it's served in a strict semi-consistent way. Good luck replicating that with virtual machines without a lot of tooling and effort.

If you can learn the Kubernetes tooling, it can do a lot for you. However, I agree that not all setups need it; a lot of small setups never grow, and that's OK: a few virtual machines aren't that big of a deal.

We still use virtual machines for workloads that aren't container friendly, and to be honest these days I abhor it, even with pipelines in place.

[+] zomglings|3 years ago|reply
Not only have we left Kubernetes, we left Docker.

Replaced with Linux servers and SSH.

Have done a lot of work with k8s in the past. Not the right tool for my startup.

[+] dijit|3 years ago|reply
Went to nomad, which is working better for my workloads.

There's still use-cases where k8s wins; but nomad handles state a bit better and is easier to reason about from scratch.

[+] bluehatbrit|3 years ago|reply
I really like the look of nomad and want to give it a go. The two things holding me back are:

1) I don't really want to manage the installation, but there aren't any(?) cloud hosts for Nomad that I can see.

2) It doesn't seem as widely used, so community support seems thin. There aren't many blog posts about good patterns with it etc., and I'd worry that we'd get stuck and end up reverting back to k8s.

[+] riadsila|3 years ago|reply
Koyeb also moved off Kubernetes and went with Nomad. We started with Kubernetes, thinking it was the right abstraction layer for us to build our platform on, but then quickly ran into major limitations. The big ones: as others have mentioned in this thread, its complexity; security (we wanted to explore using Firecracker on Kubernetes, but it was very experimental at that time); we were not interested in keeping up with its release cycles; global and multi-zone deployments were not as straightforward as we needed; and the overhead (10-25% of RAM) was a cost we were not willing to take (we are around 100MB with our new architecture).

We wrote about our decision to switch here: https://www.koyeb.com/blog/the-koyeb-serverless-engine-from-...

[+] AtNightWeCode|3 years ago|reply
Nomad replaces parts of K8s. It is not a drop-in replacement. If one only wants the container orchestration, that is fine, but then you need Consul for service discovery and so on.
[+] bogomipz|3 years ago|reply
>"There's still use-cases where k8s wins; but nomad handles state a bit better and is easier to reason about from scratch."

Can you elaborate on how Nomad handles state differently than K8S and what makes it better?

[+] superice|3 years ago|reply
Yes! My startup of 5 people did. We started out with a managed Kubernetes cluster on DigitalOcean, but there were a number of reasons that caused us to not be very comfortable with that setup.

   - Taking random .yml configs from The Internet™ to install an Nginx Ingress with automatic LetsEncrypt certs felt not-exactly-great. It's no better than piping curl to bash, except the potential impact is not that your computer is dead, but the entirety of prod goes down.
   - Because of this, upgrades of Kubernetes are a pain. The DigitalOcean admin panel will complain about problems in 'our' configs, that aren't actually OUR configs. We don't know how to fix that, or if ignoring the warnings and upgrading will break our production apps.
   - Upgrades of Kubernetes itself aren't actually zero downtime, and we couldn't figure out how to do that (even after investing a significant amount of research time).  
   - We were using only a tiny subset of the functionality in Kubernetes. Specifically we wanted high-availability application servers (2+ pods in parallel) with zero-downtime deployments, connecting to a DO managed PostgreSQL instance, with a webserver that does SSL-termination in front of it.  
   - Setting up deployments from a GitLab CI/CD pipeline was pretty hard, and it turned out the functionality for managing a Kubernetes cluster from GitLab was not really done with our use case in mind (I think?).  
   - It would be bad enough if DigitalOcean shit the bed, but the biggest problem was that we couldn't reliably recognize if something was a problem caused by us, or by DO. Try explaining that one to your customers.
Summarizing: it was just too complex and fragile, even once you wrap your head around what the hell a Pod, a Deployment, an Ingress and Ingress Controller, and all of the other Kubernetes lingo actually means. I suspect you need a dedicated infra person who knows their stuff to make this work, so it could very well make sense for larger companies, but for our situation it was overkill.
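
For reference, the "tiny subset" above fits in one small manifest (names hypothetical): two replicas, and a rollout that never drops below full capacity.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep both pods serving during deploys
      maxSurge: 1
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
        - name: app
          image: registry.example.com/app:latest
```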

We were not intellectually in control of this setup, and I do not feel comfortable running production workloads (systems used by 20k high-school students, mission-critical applications used by logistical companies) on something we couldn't quite grasp.

We went to a much simpler setup on Fly.io, and have been happy since. It's a shame they seem to be too young of a company to really be super reliable, but I suspect this is only a matter of time. In terms of feature set, it's all we need.

[+] dbingham|3 years ago|reply
For context, I ran a DevOps team for the last 4 years that managed two products on AWS - one on EKS and one on ECS. I also just finished building out more or less that exact infrastructure on DO.

I can pretty confidently say, that's not K8s, that's Digital Ocean. On AWS, we ran the EKS infrastructure (which was not simple) with basically half a dev's time for years. It was only when it started to scale to millions of users that we needed to build a team to support it. It was still a much smaller team than the one that supported the ECS product (two devops).

I was mostly managing and not coding by the time Kubernetes was in our stack, so while I'm very familiar with infrastructure in general (and I know ECS inside and out, unfortunately), I hadn't used Kubernetes directly much before I built this DO infrastructure. But I got it up in a week, and though DO is a nightmare, k8s is an absolute joy as a DevOps. Holy shit it's perfect. It does exactly what it needs to, with exactly the right abstractions, with perfectly reasonable defaults.

The reality is that infrastructure work is just that complicated.

You wouldn't try to have a team of frontend engineers build your REST backend. It's not reasonable to expect JavaScript engineers to know how to build and operate an infrastructure - at least not without dedicating themselves to learning the tooling and space full-time for a while. Think of it from the perspective of a frontend engineer learning Python and Django to build out a REST backend, and then multiply the complexity by 4. That's just infrastructure, regardless of what you're using.

That said, if something like Fly.io can fit your needs, that's great! I haven't used them so I can't speak to them directly, but I know that with Heroku, the trade off was cost and, eventually, being limited in what you could build. Eventually you would need to build something that just couldn't be built with Heroku. A quick glance at Fly, the pricing looks reasonable, but I'm guessing the build limits will still apply.

[+] j16sdiz|3 years ago|reply
Sounds like you never understood what you had deployed. Kubernetes is complex; you need somebody with know-how on your team.

Meanwhile, going fly.io sounds sensible to me.

[+] dimitar|3 years ago|reply
Well all of those issues are fixable, but I think it is a totally valid reason not to use k8s if you don't have a dedicated infra person/team.
[+] tomphoolery|3 years ago|reply
Kinda? We use Cloud Run because for our workloads GKE was a lot more expensive. So far, it's been great. I wouldn't say I've "left" Kubernetes, since from what I understand Cloud Run implements the Knative standard, which is itself built on Kubernetes. But much like was predicted early on, I think Kubernetes is best used as a means of building an infrastructure platform, not as an infrastructure platform in and of itself. You certainly can cobble all this stuff together and build a nice system, but it takes a lot of work, and there's probably a hosting company out there that already does something similar enough that you can adopt it.

With this approach to hosting and deployment, I think Kubernetes' main advantage is that it opens the door to new kinds of infrastructure businesses, not that it makes hosting a website any easier.

[+] seabrookmx|3 years ago|reply
+1 for Cloud Run.

I've tried many of the serverless platforms and maybe it's the types of applications I work on, but I've found most of their limitations (short runtime, limited access to resources on your private network) basically make them useless. The more self-hosted types that don't have these limitations lose out on many of the benefits or are leaky abstractions on k8s.

Cloud Run has all the benefits I want: extremely easy deployment and scaling, as well as the ability to scale to zero if you need it (though generally you don't), while still being able to run basically whatever workload I want. My current employer is mostly a Python shop but we recently deployed a little .NET core service on Cloud Run and it's been awesome.

[+] steren|3 years ago|reply
Note that Cloud Run is not built on Kubernetes, but on Borg. It implements the Knative Serving API spec, mainly for portability reasons with Knative and Kubernetes.

Source: I'm the Cloud Run PM and we have communicated about that publicly in the past.

[+] jdoss|3 years ago|reply
Yep! Well, kinda... I still use it at work but for any of my personal stuff at home or for my side projects I use Fedora CoreOS [1] with Butane YAML [2] which I template with Jinja2. Being able to define a VM with Butane and launch it quickly is pretty great. Nothing I am running requires the benefits that Kubernetes can bring to my workloads and the reduced complexity is a breath of fresh air.

I am slowly moving towards using Hashicorp's Nomad running on Fedora CoreOS using the Podman and QEMU drivers. I rolled out Nomad at work for internal projects and it lets me get things done quickly without living in a total YAML hellscape.

1: https://docs.fedoraproject.org/en-US/fedora-coreos/getting-s...

2: https://coreos.github.io/butane/examples/

[+] lkurusa|3 years ago|reply
We use Nomad from Hashicorp, it's super simple. Never liked the complexity K8s brings along.
[+] 0xbadcafebee|3 years ago|reply
I would love to. But what I hate about K8s is how you can't not use it. It's like Jenkins. A total piece of shit, slow, buggy, insecure, maintenance headache, expensive to maintain, never works the way you want without a ton of work, lots of footguns, bad practice is the default. But try explaining to management how you don't want to use Jenkins and they'll just come back with "but it's free" and "everyone uses it" and "no vendor lock-in". They don't understand that they're asking you to become a Ferrari mechanic when you really need a Ford F-350 pick-up.
[+] majodev|3 years ago|reply
No, and we are happily using it within our overcommitted cluster (a combination of shared and dedicated node pools).

We are a small team of 5 infrastructure engineers and previously managed 200+ libvirt VMs running on bare-metal HA hypervisors in a GlusterFS storage pool (software agency, different customer application services). We started to migrate to GKE in 2017 and finished within a year or so.

I know many associate k8s with a YAML mess, but this is actually our favourite part of it. We are able to describe a whole customer project in this format, and it's not something we have to maintain in-house (unlike Ansible). As long as you don't try to be smart (templating/Helm, operator dependence), it works out pretty well; prefer plain manifests and extend them with your own validation scripts.

Nevertheless, if you have no 24/7 operations, stay the hell away from bare-metal - go managed.