top | item 41469955


NightMKoder | 1 year ago

If your clojure pods are getting OOMKilled, you have a misconfigured JVM. The code (e.g. eval or not) mostly doesn't matter.

If you have an actual memory leak in a JVM app, what you want to see is a java.lang.OutOfMemoryError. It means the heap is full and has no space for new objects even after a GC run.
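To make that concrete, here's a minimal sketch that deliberately fills the heap: exhaustion inside the JVM surfaces as a catchable java.lang.OutOfMemoryError, thrown only after the GC has already tried and failed to free space. (The class name and 16 MiB chunk size are arbitrary choices for illustration; run with a small heap like -Xmx64m to see it quickly.)

```java
import java.util.ArrayList;
import java.util.List;

public class HeapOom {
    public static void main(String[] args) {
        // Retain every allocation so nothing is collectable.
        List<byte[]> hold = new ArrayList<>();
        try {
            while (true) {
                hold.add(new byte[16 * 1024 * 1024]); // 16 MiB per iteration
            }
        } catch (OutOfMemoryError e) {
            hold = null; // drop the references so printing itself can allocate
            System.out.println("caught: " + e.getClass().getName());
        }
    }
}
```

Contrast this with an OOMKill: there the process dies with no Java-level signal at all, so nothing like this catch block ever runs.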

An OOMKilled means the JVM attempted to allocate memory from the OS but the OS doesn't have any memory available. The kernel then immediately kills the process. The problem is that the JVM at the time thinks that _it should be able to allocate memory_ - i.e. it's not trying to garbage collect old objects - it's just calling malloc for some unrelated reason. It never gets a chance to say "man I should clear up some space cause I'm running out". The JVM doesn't know the cgroup memory limit.

So how do you convince the JVM that it really shouldn't be using that much memory? It's...complicated. The big answer is -Xmx but there's a ton more flags that matter (-Xss, -XX:MaxMetaspaceSize, etc). Folks think that -XX:+UseContainerSupport fixes this whole thing, but it doesn't; there's no magic bullet. See https://ihor-mutel.medium.com/tracking-jvm-memory-issues-on-... for a good discussion.
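One way to see why -Xmx alone isn't enough: the JVM's footprint is spread across many pools, and only the heap is bounded by -Xmx. A sketch using the standard java.lang.management API (pool names like "Metaspace" are HotSpot-specific):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class JvmFootprint {
    public static void main(String[] args) {
        // The heap ceiling as the JVM sees it (-Xmx, or a container-derived default).
        System.out.println("max heap bytes: " + Runtime.getRuntime().maxMemory());
        // Non-heap pools (Metaspace, code cache, ...) also count toward the
        // container's RSS, but -Xmx says nothing about them.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getType() + " / " + pool.getName()
                    + ": used=" + pool.getUsage().getUsed());
        }
    }
}
```

Thread stacks (-Xss) and native allocations don't even show up in these beans, which is part of why tuning total RSS is so fiddly.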


positr0n|1 year ago

> It never gets a chance to say "man I should clear up some space cause I'm running out".

To add to everything you said, depending on the framework you're using, sometimes you don't even want it to do that. The JVM will try increasingly desperate measures: looped GC scans, reference processing, and sleeps with backoffs. With a huge heap, that can easily take hundreds to thousands of milliseconds.

At scale, it's often better to just kill the JVM right away if the heap fills up. That way your open connections don't have all that extra latency added before the clients figure out something went wrong. Even if the JVM could recover this time, usually it will keep limping along and repeating this cycle. Obviously monitor, collect data, and determine the root cause immediately when that happens.
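HotSpot has flags for exactly this fail-fast behavior, passed on the command line (or via an @argfile, where # starts a comment). A sketch; the /dumps path is a placeholder for wherever your pod mounts a volume:

```
# Fail fast instead of limping: exit on the first OutOfMemoryError,
# and capture a heap dump for the post-mortem.
-XX:+ExitOnOutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/dumps
```

There's also -XX:+CrashOnOutOfMemoryError if you want a core dump instead of a clean exit.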

NightMKoder | 1 year ago

Of course you're right, and you really want to avoid getting into GC thrashing. IMO people still miss the old -XX:+UseGCOverheadLimit on the new GCs.

That said, trying to enforce overhead limits with RSS limits also won't end well: Java makes no guarantees about how much heap it keeps allocated but unused. You need something like this: https://github.com/bazelbuild/bazel/blob/10060cd638027975480... - but I have rarely seen something like that in production.
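In the spirit of the linked Bazel watchdog (not its actual code), a Linux-only sketch that reads the process's resident set size from /proc; the 512 MiB budget is a made-up number, and a real watchdog would poll this from a daemon thread and dump diagnostics before exiting:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RssWatch {
    // Parse VmRSS from /proc/self/status (Linux-only), in kB.
    static long rssKb() throws IOException {
        for (String line : Files.readAllLines(Paths.get("/proc/self/status"))) {
            if (line.startsWith("VmRSS:")) {
                return Long.parseLong(line.replaceAll("[^0-9]", ""));
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        long limitKb = 512L * 1024; // hypothetical budget: 512 MiB
        long rss = rssKb();
        System.out.println("VmRSS kB: " + rss);
        // Exiting here ourselves beats waiting for the kernel's OOM killer,
        // which gives the process no chance to log or dump anything.
        if (rss > limitKb) {
            System.out.println("over budget, would self-terminate");
        }
    }
}
```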

pwagland | 1 year ago

This is one of the areas where OpenJ9 does things a lot better than HotSpot. OpenJ9 uses one memory pool for _everything_, while HotSpot has a dozen different memory pools for different purposes. That makes HotSpot much harder to tune in containers.

pjmlp | 1 year ago

It also depends on which JVM version is being used; as a key guideline, use the latest version, or at least the latest LTS.

Folks insisting on using Java 11, or worse, Java 8, in containers are in for a surprise.

This applies to OpenJDK; as a sibling comment points out, there are other JVMs as well.