top | item 41322999

(no title)

sundalia | 1 year ago

Application-specific metrics are the way to go. For ML training this is one example: https://cloud.google.com/blog/products/ai-machine-learning/g...

discuss

order

roanakb|1 year ago

Nice, seems like ML Productivity Goodput is a pretty well thought-out metric to understand the overall efficiency of your cluster. I'll consider adding this into our cluster management platform. Only potential drawbacks I'd guess are it being somewhat difficult to compute since it relies on metrics like MFUs, and not something we can observe layer-by-layer to understand inefficient kernels, but I'll take a deeper look. Thanks!