top | item 46725306

touisteur | 1 month ago

And NVIDIA supposedly has the exact know-how for reliability, as their Jetson 'industrial' parts are qualified for 10-15 years at maximum temperature. Of course, Jetson sits at a different point on the flops-and-watts curve.

Just wondering whether reliability increases if you slow down your use of GPUs a bit - e.g. pausing more often and not chasing every bubble and nvlink-all-reduce optimization.
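The intuition that running cooler buys lifetime is usually modeled with the Arrhenius equation. A minimal sketch, assuming an illustrative activation energy of 0.7 eV (real failure mechanisms span roughly 0.3-1.2 eV, so treat the numbers as order-of-magnitude only):

```python
import math

def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor between two junction temperatures.

    Returns how many times faster wear-out proceeds at t_stress_c than
    at t_use_c. ea_ev is an assumed, illustrative activation energy.
    """
    k_b = 8.617e-5  # Boltzmann constant in eV/K
    t_use = t_use_c + 273.15      # convert Celsius to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / k_b) * (1.0 / t_use - 1.0 / t_stress))

# e.g. holding the junction at 85 C instead of 105 C
af = arrhenius_af(85.0, 105.0)
```

With these assumed parameters the 20 C reduction yields roughly a 3x acceleration factor, i.e. a correspondingly longer expected lifetime at the cooler operating point.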


dsrtslnd23 | 1 month ago

Jetson uses LPDDR though. H100 failures seem driven by HBM heat sensitivity and the 700W+ power envelope. That is a completely different thermal-density regime, I guess.

zozbot234 | 1 month ago

Reliability also depends strongly on current density and applied voltage, perhaps even more than on thermal density itself. So "slowing down" your average GPU use in a long-term sustainable way ought to improve those reliability figures via multiple mechanisms. Jetsons are great for very small-scale, self-contained tasks (including on a performance-per-watt basis), but their limits are just as obvious, especially next to the recently announced advances w.r.t. clustering the big server GPUs at rack and perhaps multi-rack scale.
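The current-density dependence is typically captured by Black's equation for electromigration, MTTF ∝ J^(-n) · exp(Ea/kT). A minimal sketch, with n and Ea set to assumed illustrative values (both are fitted per process/metal stack in practice):

```python
import math

def black_mttf_ratio(j1: float, j2: float, t1_c: float, t2_c: float,
                     n: float = 2.0, ea_ev: float = 0.9) -> float:
    """Ratio MTTF(condition 1) / MTTF(condition 2) under Black's equation.

    j1, j2 are current densities (any consistent unit); t1_c, t2_c are
    junction temperatures in Celsius. n and ea_ev are assumed,
    illustrative constants, not measured values for any real GPU.
    """
    k_b = 8.617e-5  # Boltzmann constant in eV/K
    t1, t2 = t1_c + 273.15, t2_c + 273.15
    # MTTF scales as J^-n times the Arrhenius temperature term
    return (j2 / j1) ** n * math.exp((ea_ev / k_b) * (1.0 / t1 - 1.0 / t2))

# halving current density at the same 100 C junction temperature
ratio = black_mttf_ratio(0.5, 1.0, 100.0, 100.0)
```

With n = 2, halving current density alone quadruples the modeled MTTF, which is why downclocking/undervolting helps through more than just the temperature term.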