I feel like I've seen Cupti have fairly high overhead depending on the cuda version, but I'm not very confident -- did you happen to benchmark different workloads with cupti on/off?
---
If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- cuda context creation would be very useful; I've definitely been surprised by finding processes on the wrong gpu and being easily able to figure out where they came from would be great.
I did a hack by using LD_PRELOAD to subscribe/publish the event, but never really followed through on getting the python stack trace.
This "low-overhead always on GPU profiler" seems really cool and useful, but we're not using Kubernetes for anything, and the instructions for how to use it seems to only include Kubernetes. Is there a way of running this without Kubernetes?
Does the profiler read any of the GPU's performance counters? Would be super cool to have an open source tool that can capture the same data nsight compute does.
gnurizen|4 months ago
knlb|4 months ago
I feel like I've seen Cupti have fairly high overhead depending on the cuda version, but I'm not very confident -- did you happen to benchmark different workloads with cupti on/off?
---
If you're taking feature requests: a way to subscribe to -- and get tracebacks for -- cuda context creation would be very useful; I've definitely been surprised by finding processes on the wrong gpu and being easily able to figure out where they came from would be great.
I did a hack by using LD_PRELOAD to subscribe/publish the event, but never really followed through on getting the python stack trace.
embedding-shape|4 months ago
sirhcm|4 months ago