Dissecting the CPU-memory relationship in garbage collection (OpenJDK 26)

119 points | jonasn | 6 days ago | norlinder.nu

37 comments

jonasn|6 days ago

Hi HN, I'm the author of this post and a JVM engineer working on OpenJDK.

I've spent the last few years researching GC for my PhD and realized that the ecosystem lacked standard tools to quantify GC CPU overhead—especially with modern concurrent collectors where pause times don't tell the whole story.

To fix this blind spot, I built a new telemetry framework into OpenJDK 26. This post walks through the CPU-memory trade-off and shows how to use the new API to measure exactly what your GC is costing you.

I'll be around and am happy to answer any questions about the post or the implementation!

spockz|5 days ago

Thank you for this interface! It will definitely help in tracking down GC related performance issues or in selecting optimal settings.

One thing I still struggle with is seeing how much of a penalty our application threads suffer from other work, say GC. In the blog you mention that GC's impact is not only the CPU doing work like traversing and moving (old/live) objects, but also the cost of thread pauses and other barriers.

How can we detect these? Is there a way we can share the data in some way like with OpenTelemetry?

Currently I do it by running a load on an application and retaining its memory resources until the point where its CPU skyrockets because of the sharply increasing GC cycles, and then comparing the CPU utilisation and the ratio between CPU used and work done.

Edit: it would be interesting to have the GC time spent added to a span. Even though that time is shared across multiple units of work, at least you could use it as a datapoint that the work was (significantly?) delayed by the GC occurring, or by waiting for the required memory to be freed.
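One low-tech approximation of that span idea, sketched with only the long-standing JMX beans (not OpenTelemetry and not the new OpenJDK 26 API; the class and helper names here are made up): snapshot cumulative GC time around a unit of work and record the delta. Caveat: `getCollectionTime()` reports wall-clock pause time per collector, so for concurrent collectors this understates the true CPU cost, which is exactly the blind spot the post's new interface addresses; the delta is also process-wide, so it is shared across concurrent requests.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.Supplier;

public class GcSpanTag {
    // Sum cumulative GC pause time (ms) across all registered collectors.
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if unsupported
            if (t > 0) total += t;
        }
        return total;
    }

    // Run a unit of work and report how much process-wide GC time
    // elapsed while it ran: a coarse "delayed by GC" datapoint that
    // could be attached to a span as an attribute.
    static <T> T withGcDelta(Supplier<T> work, long[] gcDeltaOut) {
        long before = gcMillis();
        T result = work.get();
        gcDeltaOut[0] = gcMillis() - before;
        return result;
    }

    public static void main(String[] args) {
        long[] delta = new long[1];
        int sum = withGcDelta(() -> {
            int s = 0;
            for (int i = 0; i < 100_000; i++) s += new byte[256].length;
            return s;
        }, delta);
        System.out.println("work=" + sum + " gcDeltaMs=" + delta[0]);
    }
}
```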

abbeyj|3 days ago

I'm a bit confused about the colors used in the CPU graphs. In the first graphs it looks like green means that the application is running and red means that the GC is running. But once we get to Figure 4 then red means the GC is running (on the GC threads) or nothing is running (on the Main thread)? If red always means that GC work is being done on that thread then this is inconsistent with the text that says "By distributing reclamation work across both cores..." since we would have three threads running at once. Once you move to the concurrent GC figures you definitely have three things running at once. Unless you're assuming SMT with each core running two threads?

In Figure 3 you somehow have 101% wall time. :)

yunnpp|4 days ago

Hey, noob question, but does OpenJDK look at variable scope and avoid allocating on the heap to begin with if a variable is known to not escape the function's stack frame?

Not strictly related to this post, but I figured it'd be helpful to get an authoritative answer from you on this.
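For context (and not speaking for the author): yes, HotSpot's JIT compilers perform escape analysis, and an allocation that provably never leaves its stack frame can be scalar-replaced rather than heap-allocated. Whether that actually happens depends on inlining and the compilation tier. A sketch of a typical candidate (the names are illustrative); running with `-XX:-DoEscapeAnalysis` lets you compare allocation rates:

```java
public class EscapeDemo {
    record Point(int x, int y) {}

    // 'p' never escapes this method: after inlining, the JIT's escape
    // analysis can scalar-replace it, keeping x and y in registers
    // instead of allocating a Point on the heap.
    static int distSquared(int x, int y) {
        Point p = new Point(x, y);
        return p.x() * p.x() + p.y() * p.y();
    }

    public static void main(String[] args) {
        long sum = 0;
        // Hot loop so the method gets JIT-compiled.
        for (int i = 0; i < 1_000_000; i++) sum += distSquared(i & 7, i & 3);
        System.out.println(sum);
    }
}
```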

latchkey|4 days ago

I built this 15 years ago and it got fairly popular, but is long dead now...

https://github.com/jmxtrans/jmxtrans

Kind of amazing how people are still building telemetry into Java. Great post and great work. Keep it up.

sitta|4 days ago

Great article!

Will the new metric be exposed in JFR recordings as well?

exabrial|4 days ago

I just want to say this is an incredibly detailed, well written, and beautifully illustrated article. Solid work.

cogman10|5 days ago

At my work, one thing that I've often had to explain to devs is that the Parallel collector (and even the serial collector) are not bad just because they are old or simple. They aren't always the right tool, but for us who do a lot of batch data processing, it's the best collector around for that data pipeline.

Devs keep trying to sneak in G1GC or ZGC because they hyper-focus on pause time as the only metric of value. Hopefully this new log:cpu will give us a better tool for measuring GC time and computational costs. And for me, it will make for a better way to argue that "it's ok that the parallel collector had a 10s pause in a 2 hour run".

jonasn|5 days ago

Every GC algorithm in HotSpot is designed with a specific set of trade-offs in mind.

ZGC and G1 are fantastic engineering achievements for applications that require low latency and high responsiveness. However, if you are running a pure batch data pipeline where pause times simply don't matter, Parallel GC remains an incredibly powerful tool and probably the one I would pick for that scenario. By accepting the pauses, you get the benefit of zero concurrent overhead, dedicating 100% of the CPU to your application threads while they are running.

torginus|4 days ago

I think a very serious issue with GC is that:

- The number of edges in the object graph tends to scale superlinearly with heap size, as the number of possible edges in a graph is quadratic in the number of objects.

- Memory bandwidth hasn't been scaling much over the past decade and a half, even relative to memory size. It's also not something people think about, nor is it easy to display in any performance monitoring tool.

Consider: if you had a machine 15 years ago with 4GB of RAM that could be read at 15GB/s, and now you have one with 32GB that can be read at 60GB/s, your bandwidth relative to heap size has halved. Given the quadratic nature of references, the 'amplification factor' (the number of times you have to revisit an already visited block of memory) is higher as well.

This is in addition to the cache thrashing issues mentioned in the post.

If you need to read the whole heap, this sets a lower bound on how much time the GC will take: ~0.25s on the old machine, ~0.5s on the new one.
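The arithmetic behind that bound, sketched out (a full-heap scan cannot finish faster than heap size divided by memory bandwidth; the class name is illustrative):

```java
public class ScanBound {
    // Lower bound on a full-heap scan: heap size / memory bandwidth.
    // Ignores amplification from revisiting blocks, so real scans are slower.
    static double scanSeconds(double heapGiB, double bandwidthGiBps) {
        return heapGiB / bandwidthGiBps;
    }

    public static void main(String[] args) {
        // Old machine: 4GB heap, 15GB/s -> ~0.27s per full scan.
        System.out.printf("old machine: %.2fs%n", scanSeconds(4, 15));
        // New machine: 32GB heap, 60GB/s -> ~0.53s per full scan.
        System.out.printf("new machine: %.2fs%n", scanSeconds(32, 60));
    }
}
```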

Suppose your GC triggers a memory bandwidth issue - how do you even profile for that? This is kind of an invisible resource that just gets used up.

cvoss|5 days ago

> This freed programmers from managing complex lifecycle management.

It also deceived programmers into failing to manage complex lifecycles. Debugging wasted memory consumption is a huge pain.

mattclarkdotnet|4 days ago

Sorry if this is obvious to Java experts, but just as parallel GC is fine for batch workloads, is there a case for explicit GC control for web workloads? For example, a single request to a web server will create a bunch of objects, but when it completes 200ms later they can all be destroyed, so why even run GC during the request thread's execution?

noelwelsh|4 days ago

There are a few ways of looking at this:

- Purely on the JVM, you probably want ZGC (or Shenandoah) because latency is more important than throughput.

- On Erlang / the BEAM VM, each lightweight process gets its own private heap, so GC is a per-process operation. If the request doesn't overflow that heap, GC would never need to run during a request handler, and all memory could be reclaimed when the handler finishes.

- There can still be cases where a request handler allocates memory that is not solely owned by it. E.g. if it causes a new database connection to be allocated in a connection pool, that connection is not owned by the request handler and should not be deallocated when the handler finishes.

- The general idea you're getting at is often called "memory regions": you can point to a scope in the code and say "all the memory can be freed when this scope exits". In this case the scope is the request handler. It's the same idea behind arena or slab memory allocation. There are languages that can encode this, and do safe automatic memory management without GC. Rust is an obvious example, but I don't find it very ergonomic. I think the OxCaml [1] and Scala 3 [2] approaches are better.

[1]: https://oxcaml.org/documentation/stack-allocation/reference/

[2]: https://docs.scala-lang.org/scala3/reference/experimental/cc...
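Java itself ships a limited, explicit form of the region idea for off-heap memory: `java.lang.foreign.Arena` (finalized in JDK 22), where everything allocated from the arena is freed at once when its scope closes. A minimal sketch (the class name here is illustrative):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class RegionSketch {
    // Region-style lifetime for native memory: every segment allocated
    // from the arena is freed in one shot when the try block exits,
    // analogous to freeing all per-request memory when a handler returns.
    static long sumInRegion(int n) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(ValueLayout.JAVA_LONG, n);
            long sum = 0;
            for (int i = 0; i < n; i++) {
                seg.setAtIndex(ValueLayout.JAVA_LONG, i, i);
                sum += seg.getAtIndex(ValueLayout.JAVA_LONG, i);
            }
            return sum;
        } // arena closed here: all its segments are invalidated and released
    }

    public static void main(String[] args) {
        System.out.println(sumInRegion(10));
    }
}
```

This only covers native memory, not ordinary heap objects, so it is a narrower tool than the language-level region systems linked above.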

jacobn|4 days ago

Most web request cases where you care about performance probably have multiple parallel web requests, so there’s no clean separation possible?

firefly2000|5 days ago

Are there plans to elucidate implicit GC costs as well?

jonasn|5 days ago

Great question! I actually just touched on this in another thread that went up right around the same time you asked this. It is clearly the next big frontier!

The short answer is: It's something I'm actively thinking about, but instrumenting micro-level events (like ZGC's load barriers or G1's write barriers) directly inside application threads without destroying throughput (or creating observer effects invalidating the measurements) is incredibly difficult.