After having just spent most of the day yesterday trying to nurse a failing Kubernetes cluster back to health (taking down all of our production websites in the process), I’ve come to completely loathe it. It felt more like practicing medicine than engineering: trying different actions to suppress symptoms, instead of figuring out a root cause to actually prevent this from happening again. While it is a pretty sweet system when it does run, I would strongly advise against anyone trying to manage their own cluster, as it is simply too complex to debug on your own, and there is precious little information out there to help you.
Apocalypse_666|6 years ago
If only I could!! That’s exactly the frustrating part: there seems to be no way of grokking what goes on under the hood, and there are so many different ways of setting up a cluster and very few have any information about them online whatsoever.
As a practical example, what happened yesterday was that all of a sudden my pods could no longer resolve DNS lookups (it took a while to figure out that that was what was going on; no fun when all your sites are down and customers are on the phone). Logging into the nodes, we found out about half of them had iptables disabled (but still worked somehow?). You try to figure out what’s going on, but there are about 12 containers running in tandem to enable networking in the first place (what’s Calico again? KubeDNS? CoreDNS? I set it up a year ago, can’t remember now...) and Googling is of no avail, because your setup is unique and nobody else was harebrained enough to set up their own cluster and blog about it. Commence the random sequence of commands I’ll never remember until by some miracle things seem to fix themselves. Now it’s just waiting for this to happen again, and not being one step closer to fixing it.
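For what it’s worth, here’s the rough checklist I wish I’d had (a sketch, assuming kubectl access and a standard CoreDNS/kube-dns setup; the busybox image and label selectors are typical defaults, not guaranteed to match every cluster):

```shell
# Can a throwaway pod resolve anything at all?
kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never \
  -- nslookup kubernetes.default

# Is the cluster DNS deployment itself alive, and what is it complaining about?
kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50

# kube-proxy writes the iptables rules that make the DNS Service IP reachable
kubectl -n kube-system get pods -l k8s-app=kube-proxy

# On a node: are the service chains even present? Zero here explains a lot.
sudo iptables-save | grep -c KUBE-SERVICES
```

None of this fixes anything by itself, but it at least tells you which of the dozen moving parts to stare at.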
antocv|6 years ago
If you use a managed Kubernetes (not on AWS, since they suck; EKS is not really managed), like GKE or AKS, then you skip the whole "there is a problem in my own cloud of my own making".
btw, I also encountered DNS problems in Kubernetes, on ACS; it took 5-10 minutes to resolve, and was caused by ACS not having the services enabled to restart DNS upon reboot, lol.
rixed|6 years ago
cbanek|6 years ago
But I have also had a number of DNS problems that we still haven't resolved, and they sometimes go away on their own. Same for iptables rules issues. This is of course on a hosted Kubernetes cluster at a large supercomputing center. (I didn't set it up, I just have to fix it. Ugh.) At Google, it's been great and we've had no networking problems, but they almost certainly run their own overlay network driver.
The various networking solutions you can plug into Kubernetes seem pretty spotty, and they are very hard to debug. I still haven't figured it out myself. But I am trying not to throw the baby out with the bathwater. I think the networking (and storage) parts will get better.
unknown|6 years ago
[deleted]
nickthemagicman|6 years ago
vitalus|6 years ago
I think that this pain is sometimes more severe in the context of automated provisioning tools out there and the trend towards immutable infrastructure - folks tend to not have the know-how to dig in and mutate that state if need be.
It's really important to have a story within teams, though, about either investing in the knowledge needed to make these fixes, or to have the tooling in place to quickly rebuild everything from scratch and cutover to a new, working production cluster in a minimal amount of time.
larntz|6 years ago
As I build my knowledge I am also building Ansible playbooks and task files. After each iteration I shut down my cluster, do an automated rebuild and test, then delete the original cluster and start my next iteration.
I have an admin box with everything I need to persist between builds (Ansible, keys, configuration files, etc) and can deploy whatever size and quantity of workers (VM) needed.
It has been a good process so far. I haven't yet put things in an unrecoverable state, but if that happens I can rebuild the cluster to my most recent save and try again.
I don't see it taking a lot of resources to have a proving ground. I would definitely not feel comfortable going to production without the ability to reproduce the production cluster's exact state.
I anticipate using exactly what you describe as a rollback mechanism. At all times I want to be able to automate the deployment of clusters to an exact known state.
I think building a cluster, walking away from it for a year, and then coming back to it for a break fix/update/new deployment is a huge gamble.
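The iterate-rebuild-verify loop above can be sketched as a few playbook runs (names like `build-cluster.yml` and the inventory path are made up for illustration; the smoke test is whatever proves your workloads are actually up):

```shell
#!/bin/sh
set -e  # stop at the first failing step; never tear down on a broken build

# 1. Build a fresh cluster from the saved, versioned state
ansible-playbook -i inventory/staging build-cluster.yml

# 2. Smoke-test the new cluster before trusting it with anything
ansible-playbook -i inventory/staging smoke-test.yml

# 3. Only once it passes, tear down the previous iteration
ansible-playbook -i inventory/staging teardown-old.yml
```

The ordering is the whole point: the old cluster is the rollback target right up until the new one has proven itself.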
cwingrav|6 years ago
Clusters are cattle, not pets.
Apocalypse_666|6 years ago
sgt|6 years ago
Nomad [1] is also a cluster management tool, but much simpler, and it can be combined with other tools to make it just as powerful as Kubernetes.
[1] https://www.nomadproject.io
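To give a feel for the "much simpler" claim, a complete Nomad job file is about this big (purely illustrative; the image, counts, and resource numbers are placeholders, not from this thread):

```hcl
# Minimal Nomad job: one service, two instances, Docker driver
job "web" {
  datacenters = ["dc1"]

  group "frontend" {
    count = 2

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.25"
      }

      resources {
        cpu    = 100 # MHz
        memory = 128 # MB
      }
    }
  }
}
```

Run with `nomad job run web.nomad`; there is no fleet of controllers, CRDs, or cert generation between you and a scheduled container.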
imtringued|6 years ago
ownagefool|6 years ago
In 4 years I've never come across a cluster I was unable to fix, nor has one really broken without someone taking an inadvisable action on it. This may simply be because I started early enough that I was forced to manually configure the components, and thus understand the underlying system well enough.
Over time I have seen some interesting things though:
- Changing the overlay network on running servers was probably the silliest thing I've done. This wasn't on production, but figuring out where all the files are and deleting them was pretty much undocumented.
- A few years back somebody ran an HA cluster without setting it up as HA, which resulted in occasional races where services kept changing IP addresses. I believe the ability to do this was patched out.
- An upgrade once caused a doubling of all pods. This was back when deployments were alpha/beta and they changed how they were referenced in the underlying system, causing deployments to forget their replicasets, etc.
Overall though, in 4 years I've spent very little time debugging clusters and more time debugging apps, which is what we want.
keymone|6 years ago
You’re basically saying “the tool X is fine, you’re just inexperienced/undisciplined and using it wrong”. Which is fair critique if I was an intern, but I have a decade+ experience in development and operations and I look at kubernetes in disbelief - why should things be that complicated? I get it, everything is pluggable and configurable, but surely this must be balanced out by making it more approachable and convenient?
You can’t sneeze in Kubernetes without it requiring you to generate some SSL certs, to the point where it’s just cargo culting without any consideration of purpose or security.
And what’s up with the dozens and dozens of bloated YAMLs and Golang files? The fresh, 30-odd-commit “official” Flink operator is 3 THOUSAND lines of Go and 5 THOUSAND lines of YAML. How is that reasonable? In which universe is that reasonable? All it does is run a for-loop that overwrites a bunch of pods to keep their spec in sync with the desired config. There’s like a 1000:1 boilerplate ratio in Kubernetes and it’s considered good somehow?
Sorry for the rant, I’m just angry that we’re six decades into software engineering and the newest, hottest project in the newest, hottest line of work behaves like everybody should be paid per line of code they produce.
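And to underline the point about the control loop itself being tiny: the core of "keep the live state in sync with the desired config" is caricatured by a few lines of shell (assumes a `desired-state/` directory of manifests; a real operator adds watches, status reporting, and garbage collection on top of this, which is where the thousands of lines go):

```shell
# Poor man's operator: converge the cluster on desired state forever.
# kubectl apply is idempotent, so re-running it IS the reconcile step.
while true; do
  kubectl apply -f desired-state/ || echo "reconcile failed, will retry"
  sleep 30
done
```

Whether the remaining 99% of a real operator is essential complexity or boilerplate is exactly the argument here.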
Apocalypse_666|6 years ago
reacharavindh|6 years ago
No snark or pushing an opinion; I’m genuinely wondering how it is from someone who went down this path.
As a sysadmin who cares more about the reliability of services, still managing critical services outside of Kubernetes, I’m wondering what I’m missing out with Kubernetes.
Apocalypse_666|6 years ago
Sure, the blue-green automatic deployment in k8s is cool, but a bit of clever Ansible scripting should get you there as well. It might be more busywork, but the amount of time spent nursing my k8s cluster in no way amounts to a time saving.
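For the record, the "clever Ansible scripting" version of blue-green is roughly this (a sketch: the playbook names, host groups, and health-check URL are all invented for illustration):

```shell
# Deploy the new release to the idle colour while blue keeps serving traffic
ansible-playbook -i inventory deploy.yml -e target=green

# Health-check the green pool before it sees a single real request
curl -fsS http://green.internal/healthz

# Flip the load balancer to green; blue stays warm for instant rollback
ansible-playbook -i inventory switch-lb.yml -e active=green
```

It is more moving parts to write yourself, but every one of them is a part you already understand when it breaks at 3am.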
unknown|6 years ago
[deleted]