top | item 46729384


touisteur | 1 month ago

I don't have first-hand knowledge of HBM GPUs, but on the RTX Blackwell 6000 Pro Server, the perf difference between the GPU running free at up to 600W and the same GPU capped at 300W is less than 10% on any workload I could throw at it (including Tensor Core-heavy ones).

That's a very expensive 300W, and I wonder what tradeoff made them go for this, and whether capping here is a way to increase reliability.

Wonder whether there's any writeup on those additional 300 Watts...
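For anyone wanting to reproduce the comparison, power-capping an NVIDIA board for this kind of A/B test is a couple of standard nvidia-smi calls (a sketch; assumes a driver that permits a 300W limit on this board, with the 300W/600W values taken from the comment above):

```shell
# Show the board's supported power-limit range first
nvidia-smi -q -d POWER

# Cap the GPU at 300 W (requires root; persists until changed or reboot)
sudo nvidia-smi -pl 300

# ...run the benchmark of interest here...

# Lift the cap back to the board default (600 W on this SKU)
sudo nvidia-smi -pl 600
```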


zozbot234 | 1 month ago

> whether capping here is a way to increase reliability

Almost certainly so, and you wouldn't even need to halve the wattage; even a smaller drop ought to bring a very clear improvement. The performance profile you mention is something you see all the time on CPUs when pushed to their extremes; it's crazy to see that pro-level GPUs are seemingly being tuned the same way out of the box.
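The CPU analogy can be made quantitative with a standard back-of-envelope DVFS model (a sketch, assuming dynamic power dominates and that voltage scales roughly with frequency near the top of the curve; none of these numbers come from the thread):

```python
# Back-of-envelope: dynamic power ~ C * f * V^2, and near the top of the
# DVFS curve voltage scales roughly with frequency, so P ~ f^3.
def freq_ratio_for_power_ratio(power_ratio: float) -> float:
    """Fraction of clock frequency kept when scaling power by power_ratio."""
    return power_ratio ** (1.0 / 3.0)

r = freq_ratio_for_power_ratio(300 / 600)
print(f"Halving power keeps ~{r:.0%} of the clock")  # ~79%
```

On that model, halving power costs only about 21% of clock even for a purely frequency-bound workload, and real workloads lose less because not every cycle is frequency-bound, which is consistent with the sub-10% figure above.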

storystarling | 1 month ago

It sounds like those workloads are memory bandwidth bound. In my experience with generative models, the compute units end up waiting on VRAM throughput, so throwing more wattage at the cores hits diminishing returns very quickly.
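The memory-bound vs compute-bound distinction here is essentially a roofline comparison: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance (peak FLOP/s over peak bandwidth). A minimal sketch with made-up peak numbers (not the actual specs of any board mentioned here):

```python
# Roofline-style check: compare a kernel's arithmetic intensity
# (FLOPs per byte moved) against the machine balance.
def bound_by(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> str:
    intensity = flops / bytes_moved          # FLOP per byte
    balance = peak_flops / peak_bw           # FLOP per byte at the roofline knee
    return "compute" if intensity >= balance else "memory"

# Illustrative (assumed) peaks, not real RTX 6000 Pro specs:
PEAK_FLOPS = 1e15   # 1 PFLOP/s tensor throughput
PEAK_BW = 1.5e12    # 1.5 TB/s memory bandwidth

# A large GEMM reuses each byte many times -> compute bound
print(bound_by(1e12, 1e8, PEAK_FLOPS, PEAK_BW))   # compute
# An elementwise op touches each byte about once -> memory bound
print(bound_by(1e9, 1e9, PEAK_FLOPS, PEAK_BW))    # memory
```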

zozbot234 | 1 month ago

If they were memory bandwidth bound wouldn't that in itself push the wattage and thermals down comparatively, even on a "pegged to 100%" workload? That's the very clear pattern on CPU at least.

touisteur | 1 month ago

I thought so too, but no: these are iterative small-matrix-multiplication kernels on the Tensor Cores, or pure (generative) compute with an ultra-late reduction and an ultra-small working set. Nsight Compute says everything stays in L1 or the small register file, no spilling, and that I'm compute bound with good ILP. I can't find a way to get more than 10% out of the 300W difference. Hence asking whether anyone did better, how, and how reliable the HW stays.
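For anyone wanting to run the same check, Nsight Compute's "speed of light" throughput metrics give a quick compute- vs memory-bound verdict (a sketch; `./my_kernel_app` is a placeholder for your binary):

```shell
# Compare SM vs DRAM utilization as a percentage of peak; whichever is
# higher indicates the likely bound (both low suggests latency-bound).
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed ./my_kernel_app
```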