This is a design flaw in Kubernetes, and the article doesn't really explain what's happening. The real problem is that there is no synchronization between the ingress controller (which manages the ingress software's configuration, e.g. nginx, from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resource), and the kubelet (which sends the signals to the container). A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases (and the cases it doesn't will have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod); that often requires shenanigans such as shared emptyDir{} volumes that wait (with fsnotify or similar) for socket files to be closed, to ensure requests are fully completed.
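The preStop-with-sleep pattern described above looks roughly like this as a manifest (a sketch; the pod name, image, and 15-second sleep are illustrative assumptions, not from the comment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 30   # must exceed the preStop sleep
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Give the ingress controller and kube-proxy time to remove
            # this pod from their endpoint lists before SIGTERM arrives.
            command: ["sleep", "15"]
```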
WJW|4 years ago
You can argue whether it would have been preferable to choose C (consistency) over A (availability) instead (or, even better, to make this configurable), but in a distributed system you will always have to trade one of these two off. The hacks with shared emptyDir volumes just move the system back to "Consistency" mode, but in a hacky way.
spoiler|4 years ago
There are still some issues around separation of concerns, e.g.:
Should the Ingress also handle redirects? The ALB ingress controller has its own annotations DSL to support this, and the nginx ingress controller has a completely different annotations DSL for it. I don't think Envoy does, though.
Then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; they could become native networking.k8s.io/v1 extensions in the future that controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).
So, I think there's a lot of potential for things to get better.
A great success example in the ecosystem is cert-manager, that a lot of controllers rely on as a peer dependency in the cluster.
cassianoleal|4 years ago
That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap, and mounted it into the pods so every workload would have the same preStop hook.
It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D
This dealt with pretty much all our 5XXs due to unclean shutdowns.
kodah|4 years ago
System fundamentals are at the heart of that problem: SIGTERM is just that, a signal. An application can choose to acknowledge it and do something, or to catch it and ignore it, and the system has no way of knowing which the application chose.
All that to say, I'm not sure it's as much of a flaw in Kubernetes as much as it's the way systems work and Kubernetes is reflecting that.
lolc|4 years ago
The time at which the signal is sent is entirely under the control of the managing process. It could synchronize with the load balancer before sending pods the TERM signal, and it's unclear to me why this isn't done.