top | item 45747173

(no title)

gnurizen | 4 months ago

This profiler is focused on kernel execution but we do scrape high level metrics (https://www.polarsignals.com/blog/posts/2025/06/04/latest-in... which is based on https://github.com/polarsignals/gpu-metrics-agent). What performance counters in particular were you interested in?

discuss

sirhcm|4 months ago

Cache hit rate is probably the most immediately useful. Although given that this is for always-on profiling maybe this project isn't as geared towards optimizing kernels as I originally thought? In theory reading the counters should be low overhead though.

porridgeraisin|4 months ago

It depends on what counter.

[ All from my experience on home GPUs, and in lah with 2 nodes with 2 80GB H100 each. Not extensively benchmarked ]

Events like kernel launch, which this profiler reads right now, is a very small overhead (1-2%). Kernel level metrics like DRAM utilisation, cache hit rate, SM occupancy, etc usually give you a 5-10% overhead. If you want to plot a flame graph at a instruction level (mostly useful for learning purposes) then you go off the rails - even 25% overhead I have seen. And finally full traces add tons of overhead but that's pretty much expected - they anyways produce GBs of profiling data.