I have read this nearly four times now, and I can't really find any substance -- it appears to be an advertisement for a product wrapped in an unsolved issue.
No mention of lsof, netstat, or tcpdump, the normal tools used for troubleshooting these sorts of problems. Without trying to sound too snarky, I find it highly concerning that the industry is now working with tools like Docker and Kubernetes and we somehow just throw out the fact that these sit on top of Linux.
Not to mention that kubelet's ability to spot one of many tunables reaching its maximum still would not have solved this problem. "Fundamentally, the node was unhealthy" is not a proper answer to the problem -- what was done to resolve the memory issue is. That could be increasing tcp_mem to support the workload, or finding the misbehaving user-space program -- but we have no clue which, because no real tools for troubleshooting this were used.
I mainly write this gripe because this appears to be a problemtisement, or a blogtisement: a "helpful" but not informative blog post that exists simply to advertise your company's service in the final blurb, leaving us with no real solution, resolution, or closure to the mystery of why tcp_mem usage was higher than expected.
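For reference, the node-level check being asked for here is cheap to do. A minimal sketch (Python standing in for the usual shell one-liners, assuming a Linux node with /proc visible -- run it on the node itself, not inside a pod's network namespace) that compares current TCP memory usage against the tcp_mem limits:

```python
#!/usr/bin/env python3
"""Minimal sketch: compare current TCP memory usage against the tcp_mem limits."""

def read_tcp_mem_limits():
    # tcp_mem holds three page counts: low, pressure, high
    with open("/proc/sys/net/ipv4/tcp_mem") as f:
        low, pressure, high = (int(x) for x in f.read().split())
    return low, pressure, high

def read_tcp_mem_in_use():
    # the "mem" field on the TCP: line of sockstat is also measured in pages
    with open("/proc/net/sockstat") as f:
        for line in f:
            if line.startswith("TCP:"):
                fields = line.split()
                return int(fields[fields.index("mem") + 1])
    raise RuntimeError("no TCP line in /proc/net/sockstat")

if __name__ == "__main__":
    low, pressure, high = read_tcp_mem_limits()
    in_use = read_tcp_mem_in_use()
    print(f"tcp_mem (pages): low={low} pressure={pressure} high={high}")
    print(f"currently allocated (pages): {in_use}")
    if in_use >= high:
        print("at/over the hard limit -- new TCP allocations will fail")
    elif in_use >= pressure:
        print("under TCP memory pressure")
```

If the allocated page count sits at or above the third tcp_mem value, you are in exactly the situation described in the post.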
> kubelet's ability to spot one of many tunables reaching its maximum...
Hm... how do you propose Kubernetes / Kubernetes users solve these kinds of problems? It could be a fairly common error that's hard to catch on a system with a large number of nodes, where you're not supposed to actively think about the fact that you have nodes. What's the right tooling / monitoring to have on a system of 20 nodes where one node is basically screwed?
These kinds of things make me think the entire K8s/container abstraction is just broken.
I am not sure how you have read this four times and missed these parts.
> leaving us with no real solution, resolution, or closure to the mystery of why tcp_mem usage was higher than expected
One user-space program was faulty and was not closing TCP sockets.
> what was done to resolve the memory issue is
The faulty program was fixed.
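The article does not show the offending code, so purely as a hypothetical illustration of the bug class -- not closing TCP sockets -- and its fix, with made-up function names:

```python
import socket

# Hypothetical illustration only: each call opens a TCP connection and never
# closes it, so sockets (and kernel TCP buffer memory) accumulate until the
# node runs into the tcp_mem limits.
def leaky_probe(host, port):
    s = socket.create_connection((host, port), timeout=5)
    s.sendall(b"PING\r\n")
    return s.recv(64)          # socket is never closed

# The fix is simply to close the socket when done, e.g. via a context manager.
def fixed_probe(host, port):
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b"PING\r\n")
        return s.recv(64)      # socket closed when the with-block exits
```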
> Without trying to sound too snarky, I find it highly concerning that the industry is now working with tools like Docker and Kubernetes and we somehow just throw out the fact that these sit on top of Linux.
This I agree with, and it was the author's takeaway, which they mention in the article.
Disclaimer: I work at Hasura
Using netstat/lsof/tcpdump from inside the containers did not help, unfortunately. The eventual next step was to check the nodes, and the kernel logs revealed the issue right away.
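For anyone hitting the same wall, that check is easy to script. A minimal sketch (assuming plain `dmesg` is runnable on the node, which may require root depending on kernel.dmesg_restrict; the exact message wording varies across kernel versions) that scans the kernel ring buffer for TCP memory-pressure messages:

```python
import re
import subprocess

# Messages that typically show up when a node is exhausting TCP memory;
# wording differs slightly between kernel versions, so match loosely.
PATTERNS = [
    re.compile(r"out of memory.*tcp_mem", re.IGNORECASE),
    re.compile(r"too many orphaned sockets", re.IGNORECASE),
]

def tcp_pressure_lines():
    out = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines()
            if any(p.search(line) for p in PATTERNS)]

if __name__ == "__main__":
    hits = tcp_pressure_lines()
    print("\n".join(hits) if hits else "no TCP memory-pressure messages found")
```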
It seems that metrics providing visibility into the "network connectivity was flaky" symptom -- looking at response times (particularly the 95th/99th percentile) and digging into the pod, which gives you the node -- would have isolated the problem pretty quickly to a single node. Once a problem is isolated to a node, the first thing to look at would be the node logs. Would that pattern not have worked in this case?
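That pattern is easy to sketch. Assuming you already have per-request latency samples tagged with the node that served them (a hypothetical data shape -- any real metrics pipeline will differ), an outlier p99 points straight at the bad node:

```python
import statistics
from collections import defaultdict

def p99_by_node(samples):
    """samples: iterable of (node_name, latency_seconds) tuples."""
    by_node = defaultdict(list)
    for node, latency in samples:
        by_node[node].append(latency)
    return {
        node: statistics.quantiles(lat, n=100)[98]  # 99th percentile
        for node, lat in by_node.items()
        if len(lat) >= 100                          # need enough samples per node
    }

# Example: one node with flaky connectivity stands out immediately.
if __name__ == "__main__":
    import random
    samples = [(f"node-{i % 20}", random.uniform(0.01, 0.05)) for i in range(20000)]
    samples += [("node-7", random.uniform(1.0, 5.0)) for _ in range(1000)]
    for node, p99 in sorted(p99_by_node(samples).items(), key=lambda kv: -kv[1])[:3]:
        print(f"{node}: p99 = {p99:.3f}s")
```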