> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations
SnowflakeOnIce|1 year ago
Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.
On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.
tanelpoder|1 year ago
This reminds me of the Linux/Unix disk busy "%util" metric in tools like sar and iostat. People sometimes interpret 100% util as a physical ceiling for the disk I/O capacity, just like with CPUs ("we need more disks to get disk I/O utilization down!").
It is a correct metric when your block device has a single physical spinning disk that can only accept one request at a time (dispatch queue depth = 1). But the moment you deal with SSDs (capable of highly concurrent NAND I/O), SAN storage block devices striped over many physical disks, or even a single spinning disk that can internally queue and reorder I/Os for more efficient seeking, just hitting 100% util at the host block device level doesn't mean that you've hit some IOPS ceiling.
So, looks like the GPU "SM efficiency" analysis is somewhat like logging in to the storage array itself and checking how busy each physical disk (or at least each disk controller) inside that storage array is.
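The queue-depth point can be sketched with a toy model (all numbers below are illustrative, not from any real device): a device that can service `k` requests concurrently reports 100% busy as soon as one request is in flight, while its actual IOPS ceiling scales with `k`.

```python
# Toy model of iostat-style %util: the metric only measures "was at least
# one request in flight", so it pegs at 100% long before a concurrent
# device (SSD, striped SAN, NCQ-capable disk) hits its IOPS ceiling.
SERVICE_MS = 1.0  # assumed per-request device time, illustrative

def iops_ceiling(concurrency):
    # k requests can complete every SERVICE_MS
    return concurrency * 1000.0 / SERVICE_MS

def percent_util(avg_inflight):
    # fraction of wall-clock time with >= 1 request outstanding
    return 100.0 if avg_inflight >= 1 else avg_inflight * 100.0

for k in (1, 8, 64):  # spinning disk vs SSD-ish vs wide SAN stripe
    print(f"concurrency={k:2d}  %util={percent_util(k):.0f}  "
          f"IOPS ceiling={iops_ceiling(k):8.0f}")
```

At every concurrency level the host-side %util reads 100, while the ceiling differs by 64x — which is exactly the GPU "utilization" vs SM efficiency gap in disk form.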
serial_dev|1 year ago
This sounds like the good old "having high test coverage is bad because I can get to 100% just by calling functions and doing nothing, asserting nothing with them".
100% test coverage doesn't mean your tests are good, but having 50% (or pick your number) means they are bad / not sufficient.
roanakb|1 year ago
Yup, similar to SM efficiency in that sense too. If you aren't seeing >80%, there is certainly time left on the table. But getting a high SM efficiency value doesn't guarantee you're making good use of the hardware as well. (still a better proxy than GPU util though)
antognini|1 year ago
When understanding the performance of your model it's very helpful to look at a roofline plot [1]. The roofline plot shows floating-point performance as a function of arithmetic intensity for the various ops in your model. The plot has two regimes: a memory-bound regime on the left and a compute-bound regime on the right. This can help to identify memory-bound ops that are taking a significant fraction of compute time.
[1]: https://en.wikipedia.org/wiki/Roofline_model
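For intuition, the two regimes fall out of a couple of lines of arithmetic. A sketch using illustrative H100-ish peaks (~989 TFLOP/s dense BF16, ~3.35 TB/s HBM — substitute your own hardware's numbers) and an idealized GEMM traffic model:

```python
PEAK_FLOPS = 989e12   # assumed peak compute, FLOP/s
PEAK_BW    = 3.35e12  # assumed peak memory bandwidth, bytes/s

def gemm_intensity(m, n, k, bytes_per_el=2):
    flops = 2 * m * n * k                        # multiply + add per MAC
    traffic = bytes_per_el * (m*k + k*n + m*n)   # read A and B, write C once
    return flops / traffic                       # FLOP per byte

def attainable(intensity):
    # the roofline itself: min(compute roof, bandwidth roof)
    return min(PEAK_FLOPS, PEAK_BW * intensity)

for shape in [(4096, 4096, 4096), (4096, 4096, 64)]:
    ai = gemm_intensity(*shape)
    regime = "compute" if PEAK_BW * ai >= PEAK_FLOPS else "memory"
    print(f"m,n,k={shape}: {ai:6.1f} FLOP/byte -> {regime}-bound")
```

With these peaks the ridge point sits at 989e12 / 3.35e12 ≈ 295 FLOP/byte; ops to the left of it (like the skinny GEMM above) cannot reach peak FLOP/s no matter how good the kernel is.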
roanakb|1 year ago
Agreed, roofline plots would be quite powerful in this context. From a quick search, it seems like the only way to create a roofline plot for your model is to use Nsight [1]. Would be interested to know if there are any simpler tools, since one of the big benefits of SM efficiency is how easily the metric is accessed.
[1]: https://www.nvidia.com/en-us/on-demand/session/gtcspring21-s...
roanakb|1 year ago
Nice, seems like ML Productivity Goodput is a pretty well thought-out metric for understanding the overall efficiency of your cluster. I'll consider adding this into our cluster management platform. The only potential drawbacks I'd guess are that it's somewhat difficult to compute, since it relies on metrics like MFU, and that it's not something we can observe layer-by-layer to understand inefficient kernels, but I'll take a deeper look. Thanks!
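For reference, MFU itself is cheap to compute if you already log throughput. A sketch using the common ~6 × params × tokens/s estimate for transformer training FLOPs (all numbers below are made up for illustration):

```python
def mfu(params, tokens_per_sec, peak_flops_per_gpu, n_gpus):
    # model FLOPs actually "spent" per second vs. the cluster's paper peak
    model_flops_per_sec = 6 * params * tokens_per_sec
    return model_flops_per_sec / (peak_flops_per_gpu * n_gpus)

# e.g. a 7B-param model at 4e5 tokens/s across 64 GPUs of ~989 TFLOP/s each
print(f"MFU = {mfu(7e9, 4e5, 989e12, 64):.1%}")
```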
sergiotapia|1 year ago
take this example: https://gist.github.com/sergiotapia/efc9b3f7163ba803a260b481... - running a fairly simple model that takes only 70ms per image pair, but because I have 300 images it becomes a big time sink.
by using ThreadPoolExecutor, I cut that down to about 16 seconds. i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS? I haven't been successful at even running the MPS daemon on my linux server yet. very opaque for sure!
Start with Nsight Systems and turn on GPU metrics. It’s super easy and the plots will give you an immediate sense of your utilization, and low-hanging optimization opportunities.
dahart|1 year ago
So using 10-wide parallel processing took your batch from 21 seconds down to 16 seconds, did I do the arithmetic correctly? That suggests the single-threaded version isn’t too bad. I mean a 25% improvement is great and nothing to sneeze at, but batching might only be trimming the gaps in between image pairs, or queueing up your memory copies while the previous inference is running. You can verify this with nsys profiles.
> i wonder if there is a fairly obvious way to truly utilize my beefy L40S GPU! is it MPS?
No idea, it’s not always easy (and generally speaking gets harder and harder as you approach 100%), but first profile to see what your utilization is before going down any big technical route. Maybe with your ThreadPoolExecutor, you’re already getting max utilization and using MPS can’t possibly help.
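To make the distinction concrete, here is a stand-in sketch (the model call is faked with `time.sleep`, and the 2x batching efficiency is an invented assumption, purely to illustrate the shape of the win): per-pair threading mostly overlaps the gaps between submissions, while a single batched call amortizes per-call overhead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def infer(pair):                 # stand-in for one ~70 ms model call
    time.sleep(0.0005)           # scaled down so the demo runs quickly
    return pair[0] + pair[1]

def infer_batch(pairs):          # stand-in for one batched forward pass
    time.sleep(0.0005 * len(pairs) / 2)   # assumed 2x efficiency when batched
    return [a + b for a, b in pairs]

pairs = [(i, i + 1) for i in range(300)]

with ThreadPoolExecutor(max_workers=10) as ex:   # the commenter's approach
    threaded = list(ex.map(infer, pairs))

batched = infer_batch(pairs)     # often the bigger, more inspectable win
assert threaded == batched       # same results either way
```

With a real model this would mean stacking the 300 pairs into one tensor (or a few chunks) and calling the model once per chunk — in an nsys timeline that shows up as fewer, fatter kernels instead of 300 thin ones.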
totally agreed. A lot of what we found during this process is that there's still a lot of alpha in finding the right kernels for the job/model. We're hoping that in the future `torch.compile` will become more mature, because the current docs on performance, at least on the pytorch side, definitely leave us wanting more
DamonsJ|1 year ago
"If we have a CUDA kernel that continuously runs for 10 seconds but only uses 1 SM, on an H100, this would register 100% utilization, but the SM efficiency would be 1 / 132 = 0.7%."
does this situation register 100% utilization?
BTW, the SM OCCUPANCY is also a metric you need to care about if you're concerned about kernel efficiency
roanakb|1 year ago
Yup, you'll see 100% utilization on a kernel over a time period if it's considered active, which includes just having a single thread executing [1]. SM occupancy is great but can be a little difficult to interpret since you're not simply trying to maximize it, unlike SM efficiency.
[1]: https://pytorch.org/blog/pytorch-profiler-1.9-released/#gpu-...
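Spelling out the quoted arithmetic (the 132-SM count is from the article; note that 1/132 is ≈0.76%, which the article rounds to 0.7%):

```python
H100_SMS = 132  # SMs on an H100, per the article

def sm_efficiency(active_sms, total_sms=H100_SMS):
    # fraction of SMs doing work while the GPU reports 100% "utilization"
    return active_sms / total_sms

print(f"1 active SM  -> {sm_efficiency(1):.2%}")    # -> 0.76%
print(f"all 132 SMs  -> {sm_efficiency(132):.0%}")  # -> 100%
```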
If you have a basic understanding of what your kernels are supposed to do, looking at pipe usage and roofline analysis in Nsight Compute is often helpful, since it will show you how hard you’re saturating those.
Power usage is indeed a better representation of GPU utilization during ML training. It has the advantage of combining many important indirect signals that aren’t otherwise visible, and avoids many pitfalls of compute usage, which can go to 100% even in all-reduce deadlocks, among other scenarios.
power is also a good proxy. For example, we've had distributed runs that we monitored on WandB where one of our workers died in the middle and the rest were basically stalling on the dead worker. On WandB, we were only logging GPU stats on one worker and that one had 100% util but basically no excess power draw compared to having nothing running, which is how I found out something was stalling. Restarting fixed it and got the power draw up to normal, but even with high power draw, we were still having some sections of code with low SM efficiency (~20%) for that training.
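One low-effort way to watch for that stall signature is to log both columns from nvidia-smi's CSV query mode (`nvidia-smi --query-gpu=utilization.gpu,power.draw --format=csv,noheader`) and flag the mismatch. A sketch with made-up sample output — the 150 W "near idle" threshold is an assumption you'd tune per GPU:

```python
SAMPLE = """\
100 %, 83.12 W
100 %, 412.50 W
"""  # illustrative, not captured from a real machine

def parse_rows(text):
    rows = []
    for line in text.strip().splitlines():
        util_s, power_s = (field.strip() for field in line.split(","))
        rows.append((float(util_s.rstrip(" %")), float(power_s.rstrip(" W"))))
    return rows

for util, power in parse_rows(SAMPLE):
    # 100% util at near-idle power draw is the stalled-worker signature above
    flag = "  <- possibly stalled" if util >= 99 and power < 150 else ""
    print(f"util={util:.0f}%  power={power:.0f}W{flag}")
```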
We ran into a similar problem with CPU utilization at my job. Created an alert for when our systems hit 90% CPU util, and ended up with a ton of noise. We realized that for some of our workloads, this was normal and expected.
As someone who is familiar with using nvidia-smi to track util, what are some commands people use to track SM efficiency? The end of the article had some references, but no explicit examples of what to use.
AeZ1E|1 year ago
gpu utilization is not everything, people! mfus are where it's at. time to recalibrate those expectations and tap into the true potential of your gpus. brace yourselves, the real efficiency is yet to come!
jorvi|1 year ago
Some of us like having more than 2 hours of battery life, and not scalding our skin in the process of using our devices.
roanakb|1 year ago
1. Profile your model with PyTorch Profiler
2. Export metrics with Nvidia DCGM
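To make step 2 concrete: DCGM exposes SM activity as profiling field 1002 (DCGM_FI_PROF_SM_ACTIVE, with 1003 being SM occupancy), e.g. sampled via `dcgmi dmon -e 1002,1003`. A sketch of aggregating such samples into an SM-efficiency-style number — the readings below are invented:

```python
# (sm_active, sm_occupancy) per sampling interval, as fractions of peak;
# these values are made up for illustration.
samples = [
    (0.98, 0.41),
    (0.21, 0.09),   # a low-SM-activity stretch worth profiling further
    (0.95, 0.38),
]

mean_sm_active = sum(a for a, _ in samples) / len(samples)
worst = min(samples, key=lambda s: s[0])
print(f"mean SM activity: {mean_sm_active:.1%}, worst interval: {worst[0]:.0%}")
```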