I read this and I have to wonder: did anyone ever think it was reasonable that a cluster which apparently needed only 120 GB of memory was consuming 1.2 TB just for logging (or whatever Vector does)?
We're a much smaller-scale company, and the cost we lose on these things is insignificant compared to what's in this story. Yesterday I was improving the process for creating databases in our Azure environment and I stumbled upon a subscription that was running 7 MSSQL servers for 12 databases. These weren't elastic, and each was paying for a license that we don't have to pay, because we qualify for the base cost through our contract with our Microsoft partner. This company has some of the tightest control over its cloud infrastructure of any organisation I've worked with.
This is anecdotal, but if my experiences aren't unique, then there is a widespread lack of reasonableness in DevOps.
Author here: You’d be surprised what you don’t notice given enough nodes and slow enough resource growth over time! Even at its high-water mark, this daemonset was still a small portion of the total resource usage in these clusters.
It probably doesn't help that the first line of treatment for any error is to blindly increase the memory request/limit and claim it's fixed (preferably without looking at the logs even once).
We run on-prem with heavy spikes (our batch workload can easily use the 20 TB of memory in the cluster), and we just don't care much; we add 10% to the hardware request every year. Compared to employing people or paying other vendors (relational databases with many TB-sized tables...), this is just irrelevant.
Sadly, devs are incentivized by that, and the move towards the cloud might turn into a fun story. Given the environment, I hope they scrap the effort sooner rather than later, buy some Oxide systems for the people who need to iterate faster than the usual process of getting a VM allows, and redeploy the 10% of the company occupied with the cloud (mind you: no real workload runs there yet...) to actually improving local processes...
The other way to look at it: why does adding a namespace label cause so much memory overhead in Kubernetes? Shouldn't fixing that (which could be a much bigger design change) benefit the whole Kube community?
Author here: yeah, that's a good point. TBH I was mostly unfamiliar with Vector, so I took the shortest path to the goal, but that could be an interesting follow-up. It does seem like there's a lot of bytes per namespace!
The biggest tool in the performance toolbox is stubbornness. Without it all the mechanical sympathy in the world will go unexploited.
There’s about a factor-of-3 improvement that can be made to most code after the profiler has given up. That probably means there are better profilers that could be written, but in 20 years of having them I’ve only seen 2 that tried. Sadly, I think flame graphs made profiling more accessible to the unmotivated but didn’t actually improve overall results.
I'm a little surprised that it got to the point where pods which should consume a couple MB of RAM were consuming 4 GB before action was taken. But I can also kind of understand it, because the way k8s operators (apps running in k8s that manipulate k8s resources) are meant to run is essentially a loop: list resources, compare them to spec, and make moves to bring the state of the cluster closer to spec. This reconciliation loop is simple to understand (and I think this benefit has led to the creation of a wide array of excellent open-source and proprietary operators that can be added to clusters). But it's also a recipe for cascading explosions in resource usage.
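The loop described above can be sketched as follows. This is a toy, in-memory sketch; a real operator goes through the Kubernetes API via a client library, but the shape is the same:

```python
# Toy reconciliation loop: compare desired spec to observed state
# and emit the actions needed to converge. Names and shapes here
# are illustrative, not the real k8s API.
def reconcile(spec, state):
    """Return the actions that move `state` toward `spec`."""
    actions = []
    for name, desired in spec.items():
        if state.get(name) != desired:
            actions.append(("apply", name, desired))
    for name in state:
        if name not in spec:
            actions.append(("delete", name))
    return actions

# One pass: fix what's missing or wrong, remove what's extra.
actions = reconcile({"web": 3}, {"web": 1, "old-job": 1})
```

Each pass re-derives the full action set from a fresh listing of resources, which is why many operators watching a large cluster can multiply into serious load.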
These kinds of resource explosions are something I see all the time in k8s clusters. The general advice is to keep pressure off the k8s API, and the consequence is that one must be very minimal and tactical with the operators one installs, and then spend many hours of work fine-tuning each operator to run efficiently (e.g. Grafana, whose default Helm settings do not use the recommended log indexing algorithm, and which needs to be tweaked to get an appropriate set of read vs. write pods for your situation).
Again, I recognize there is a tradeoff here: the simplicity and openness of the k8s API is what has led to a flourishing of new operators, which really has allowed one to run "their own cloud". But there is definitely a cost. I don't know what the solution is, and I'm curious to hear from people who have other views of it, or who use alternatives to k8s that offer a different set of tradeoffs.
> are meant to run is essentially a loop of listing resources, comparing to spec, and making moves to try and bring the state of the cluster closer to spec.
Keys require O(log n) space per key, or O(n log n) for the entire data set, simply to avoid key collisions. But human-friendly key spaces grow much, much faster, and I don’t think many people have looked too hard at that.
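The O(log n) floor is easy to check: with an alphabet of k symbols, distinguishing n keys needs at least ceil(log_k n) symbols per key. A small sketch (a hypothetical helper, using integer arithmetic to avoid float-rounding errors):

```python
def min_key_len(n_keys, alphabet_size=62):
    # Smallest length L such that alphabet_size ** L >= n_keys,
    # i.e. ceil(log_alphabet(n_keys)) -- the O(log n) lower bound.
    length = 1
    while alphabet_size ** length < n_keys:
        length += 1
    return length
```

Human-friendly names blow past this floor quickly: a name like `checkout-service-prod-us-east-1` spends far more characters than the handful needed merely to be unique.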
There were recent changes to the Node.js Prometheus client that eliminate tag names from the keys used for storing tag cardinality for metrics. The memory savings weren’t reported, but the CPU savings for recording data points were over 1/3, and about twice that when applied to the aggregation logic.
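I can't speak to the exact prom-client change, but the general trick looks like this: since a metric's label names are fixed at registration time, the per-series key only needs the label values. An illustrative sketch, not the prom-client internals:

```python
class Counter:
    """Sketch of a labeled counter keyed by label values only."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)  # stored once, not per key
        self.series = {}

    def inc(self, *label_values, amount=1):
        # Hot path: the key is just the value tuple -- no label names
        # are hashed or stored per data point.
        self.series[label_values] = self.series.get(label_values, 0) + amount

c = Counter("http_requests_total", ("method", "status"))
c.inc("GET", "200")
c.inc("GET", "200")
```

Shorter keys mean less hashing work per recorded data point, which is consistent with the CPU savings described above.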
Lookups are rarely O(1), even in hash tables.
I wonder if there’s a general solution for keeping names concise without triggering transposition or reading comprehension errors. And what the space complexity is of such an algorithm.
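One partial answer is the approach taken by Crockford's base32: pack more symbols per character to keep names short, but drop the symbols people routinely confuse (I, L, O, U), so compactness doesn't come at the cost of transcription errors. A minimal sketch of the encoding:

```python
# Crockford-style base32 alphabet: digits plus uppercase letters,
# minus the easily-misread I, L, O, U.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def encode(n):
    """Encode a non-negative integer as a Crockford base32 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 32)
        out.append(ALPHABET[r])
    return "".join(reversed(out))
```

Space per key is still O(log n); the win is a better constant with fewer misread characters, which is roughly the tradeoff the question is asking about.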
Aeolun|4 months ago
devjab|4 months ago
bstack|4 months ago
formerly_proven|4 months ago
fock|4 months ago
nitinreddy88|4 months ago
bstack|4 months ago
shanemhansen|4 months ago
hinkley|4 months ago
seg_lol|3 months ago
liampulles|4 months ago
never_inline|4 months ago
Aren't they supposed to use watch/long polling?
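Generally yes: the standard pattern is an initial LIST followed by a WATCH stream of deltas, with the controller folding events into a local cache instead of re-listing everything. A toy sketch of the event-folding half (hypothetical event tuples, not the real client API):

```python
def apply_events(cache, events):
    """Fold watch-style (type, name, object) events into a local cache."""
    for etype, name, obj in events:
        if etype in ("ADDED", "MODIFIED"):
            cache[name] = obj
        elif etype == "DELETED":
            cache.pop(name, None)
    return cache
```

Watching reduces API pressure, but note the cache itself still grows with cluster size, so it doesn't by itself prevent the memory growth discussed above.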
hinkley|4 months ago
vlovich123|4 months ago
timzaman|4 months ago