top | item 21023917

(no title)

After having just spend most of the day yesterday trying to nurse a failing Kubernetes cluster back to health (taking down all of our production websites in the process), I’ve come to completely loathe it. It felt more like practicing medicine than engineering: try different actions to suppress symptoms, instead of figuring out a root cause to actually prevent this from happening again. While it is a pretty sweet system if it does run, I would strongly advice against anyone trying to manage their own cluster, as it is simply to complex to debug on your own, and there is preciously little information out there to help you

discuss

Apocalypse_666|6 years ago

I wrote this whole thing below as a reply to someone stating that I should just stop complaining and figure it out, but that comment has since been deleted. Figured sharing my frustration might be cathartic anyways :)

If only I could!! That’s exactly the frustrating part: there seems to be no way of grokking what goes on under the hood, and there are so many different ways of setting up a cluster and very few have any information about them online whatsoever.

As a practical example, what happened yesterday was that all of a sudden my pods could no longer resolve DNS lookups (took a while to figure out that that was what was going on, no fun when all your sites are down and customers are on the phone). Logging into the nodes, we found out about half of them had iptables disabled (but still worked somehow?). You try to figure out what’s going on, but there’s about 12 containers running in tandem to enable networking in the first place (what’s Calico again? KubeDNS? CoreDNS? I set it up a year ago, can’t remember now...) and no avail in Googling, because your setup is unique and nobody else was harebrained enough to set up their own cluster and blog about it. Commence the random sequence of commands I’ll never remember until by some miracle things seems to fix themselves. Now it’s just waiting for this to happen again, and being not one step closer to fixing it

antocv|6 years ago

Well that sounds like a problem of "I run my own cloud". And not a problem of Kubernetes. You dont remember how you set it up, oh well.

If you use a managed kubernetes (not in aws since they suck, eks is not really managed). Like gke or aks, then you skip the whole "there is a problem in my own cloud of my own making".

btw, I also encountered DNS problems in kubernetes, on ACS, it took 5-10 minutes to resolve, and was caused by ACS not having services enabled to restart dns upon reboot, lol.

rixed|6 years ago

Would you be willing to tell what kind of business you are in, where you have few enough customers that they can reach you by phone, but still need such a large number of machines that you need kubernetes ?

cbanek|6 years ago

I love Kubernetes and think it solves my problems very well. Although I think I have problems that generally require it due to scale and failover.

But I have also had a number of DNS problems that we still haven't resolved, and they sometimes go away on their own. Same for IP tables rules issues. This is of course on a hosted kubernetes cluster at a large supercomputing center. (I didn't set it up, I just have to fix it. Ugh.) At Google, it's been great and we've had no networking problems, but they almost certainly run their own overlay network driver.

The various networking solutions you can plug into kubernetes seem pretty spotty, and they are very hard to debug. I still haven't figured it out myself. But I am trying to not throw the baby away with the bathwater. I think the networking (and storage) parts will get better.

unknown|6 years ago

[deleted]

nickthemagicman|6 years ago

I feel for your troubles. Letting you know you can move it into EKS or Google cloud could probably save you a lot of headaches in the long run.

vitalus|6 years ago

Definitely can sympathize with you on this, having spent plenty of time myself fighting some clusters that ended up in a broken state, and trying to get them going again.

I think that this pain is sometimes more severe in the context of automated provisioning tools out there and the trend towards immutable infrastructure - folks tend to not have the know-how to dig in and mutate that state if need be.

It's really important to have a story within teams, though, about either investing in the knowledge needed to make these fixes, or to have the tooling in place to quickly rebuild everything from scratch and cutover to a new, working production cluster in a minimal amount of time.

larntz|6 years ago

I'm just beginning my journey into the vanilla Kubernetes world.

As I build my knowledge I am also building Ansible playbooks and task files. After each iteration I shutdown my cluster. Do an automated rebuild and test. Delete the original cluster and start my next iteration.

I have an admin box with everything I need to persist between builds (Ansible, keys, configuration files, etc) and can deploy whatever size and quantity of workers (VM) needed.

It has been a good process so far. I haven't yet put things in an unrecoverable state, but if that happens I can rebuild the cluster to my most recent save and try again.

I don't see it taking a lot of resources to have a proving ground. I would definitely not feel comfortable going to production without the ability to reproduce the production clusters' exact state.

I anticipate exactly what you describe as a roll back mechanism. At all times I want to be able to automate the deployment of clusters to an exact known state.

I think building a cluster, walking away from it for a year, and then coming back to it for a break fix/update/new deployment is a huge gamble.

cwingrav|6 years ago

This is my thinking too. Build a new cluster and push it all over to a new cluster. If you feel like understanding the old (and can afford it), keep it around and try to figure it out.

Clusters are cattle, not pets.

Apocalypse_666|6 years ago

Very much agree, but never managed to reach this point. One reason is that the amount of hardware needed for this is pretty prohibitive. Second is that configuring a new cluster (last time I did it) was so much work, and I never managed to automate the process, that there was just simply no way I could have created a new cluster in time to get our websites back up

sgt|6 years ago

We need to do self hosted Kubernetes and after evaluating it (not deployed to production yet), we considered the training and costs of running, and came to the conclusion that our needs are met by Nomad [1].

It is also a cluster management tool, but much simpler and can be combined with other tools to make it just as powerful as Kubernetes.

[1] https://www.nomadproject.io

imtringued|6 years ago

That looks very interesting. I like their solution to running stateful applications. They only schedule containers on nodes that have the requested volume. Of course the downside is that you need to manage the volumes manually which is perfectly reasonable because I only need 3 nodes for high availability.

ownagefool|6 years ago

Out of interest, what was wrong with it and how did you fix it?

In 4 years I've never came across a cluster I was unable to fix, nor has it really broken without someone taking an unadvisable action on it. This may simply be because I started early enough that I was forced to manually configure the components and thus understand the underlying system well enough.

Over time I have seen some interesting things though:

- Changing the overlay network on running servers probably the silliest thing I've done. This wasn't on production, but figuring out where all the files are and deleting them was something pretty undocumented.

- A few years back somebody ran a HA cluster without setting it as HA which resulted in occasional races where services keep changing IP addresses. I believe the ability to do this was patched out.

- An upgrade caused a doubling of all pods once. This was back when deployments were alpha/beta and they changed how they were references in the underlying system, causing deployments to forget their replicasets, etc.

Overall though, in 4 years I've spent very little time debugging clusters and more time debugging apps, which is what we want.

keymone|6 years ago

> nor has it really broken without someone taking an unadvisable action on it

You’re basically saying “the tool X is fine, you’re just inexperienced/undisciplined and using it wrong”. Which is fair critique if I was an intern, but I have a decade+ experience in development and operations and I look at kubernetes in disbelief - why should things be that complicated? I get it, everything is pluggable and configurable, but surely this must be balanced out by making it more approachable and convenient?

You can’t sneeze in kubernetes without it requiring you to generate some ssl certs to the point where it’s just cargo-culture without any consideration of purpose and security.

And what’s up with dozens and dozens of bloated yamls and golang files? The fresh 30-odd commits ”official” flink operator is 3 THOUSAND lines of Go and 5 THOUSAND lines of yamls. How is that reasonable? In which universe is that reasonable? all it does is a for-loop that overwrites a bunch of pods to keep their spec in sync with desired config. There’s like 1000:1 boilerplate ratio in kubernetes and it’s considered good somehow?

Sorry for the rant, I’m just angry that we’re six decades into software engineering and the newest hottest project I the newest hottest line of work behaves like everybody should be paid per line of code they produce.

Apocalypse_666|6 years ago

I wrote a long reply to someone else’s question below that should answer your question :)

reacharavindh|6 years ago

Would you say that you would be happier if you put a bunch of websites/web applications behind a load balancer like haproxy - all in a few VMs or even bare-metal servers instead of taking on the complexity of Kubernetes?

No snark or pushing opinion; I’m genuinely wondering how it is from someone who went through this path.

As a sysadmin who cares more about the reliability of services, still managing critical services outside of Kubernetes, I’m wondering what I’m missing out with Kubernetes.

Apocalypse_666|6 years ago

Infinitely happier, because if something goes wrong, you can usually figure out what it was, and fix it or even prevent it from happening again.

Sure, the blue-green automatic deployment in k8s is cool, but a bit of clever Ansible scripting should get you there as well. It might be more busywork, but the amount of time spent nursing my k8s cluster in no way amount to time saving

unknown|6 years ago

[deleted]