This is a design flaw in Kubernetes, and the article doesn't really explain what's happening. The real problem is that there is no synchronization between the ingress controller (which manages the ingress software's configuration, e.g. nginx, from the Endpoints resources), kube-proxy (which manages iptables rules from the Endpoints resource), and the kubelet (which sends the signals to the container). A preStop hook with a sleep equal to an acceptable timeout will handle 99%+ of cases (and the cases it doesn't will have exceeded your timeout anyhow). Things become more complicated when there are sidecar containers (say an envoy or nginx routing to another container in the same pod); that often requires shenanigans such as shared emptyDir{} volumes that wait (with fsnotify or similar) for socket files to be closed, to ensure requests are fully completed.
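The preStop-with-sleep pattern described above looks roughly like this as a manifest (a sketch; the pod name, image, and 15-second sleep are illustrative assumptions, not from the comment):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  terminationGracePeriodSeconds: 30   # must exceed the preStop sleep
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # Give the ingress controller and kube-proxy time to remove
            # this pod from their endpoint lists before SIGTERM arrives.
            command: ["sleep", "15"]
```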
WJW|4 years ago
You can argue whether it would have been preferable to choose C (consistency) over A (availability) instead (or, even better, to make this configurable), but in a distributed system you will always have to trade one of these two off. The hacks with shared emptyDir volumes just move the system back to "Consistency" mode, but in a hacky way.
spoiler|4 years ago
There are still some issues around separation of concerns, e.g.:
Should the Ingress also handle redirects? The ALB ingress controller has its own annotations DSL to support this, and the nginx ingress controller has a completely different annotations DSL for it. I don't think Envoy does, though.
Then there's the question of supporting CDNs; some controllers support it with annotations and some through `pathType: ImplementationSpecific` and a `backend.resource` CRD (which doesn't have to be a CRD; they could become native networking.k8s.io/v1 extensions in the future that controllers can opt in to support). This becomes great when combined with the operator framework (+ embedded kubebuilder).
So, I think there's a lot of potential for things to get better.
A great success example in the ecosystem is cert-manager, that a lot of controllers rely on as a peer dependency in the cluster.
cassianoleal|4 years ago
That's precisely what we did at one of my previous clients. To increase portability, we wrote the smallest possible sleep equivalent in C, statically linked it, stuck it into a ConfigMap, and mounted it into the pods so every workload would have the same preStop hook.
It was funny to watch when a new starter in the team would find out about that very elegant, stable and useful hack and go "wtf is going on here?" :D
This dealt with pretty much all our 5XXs due to unclean shutdowns.
kodah|4 years ago
System fundamentals are at the heart of that problem: SIGTERM is just that, a signal. An application can choose to acknowledge it and do something, or to catch it and ignore it, and the system has no way of knowing which the application chose.
All that to say, I'm not sure it's as much of a flaw in Kubernetes as much as it's the way systems work and Kubernetes is reflecting that.
lolc|4 years ago
The time at which the signal is sent is entirely under the control of the managing process. It could synchronize with the load balancer before sending pods the TERM signal, and it's unclear to me why this isn't done.