thundergolfer | 1 month ago
Component                      Type      MTBF (yrs)   AFR
─────────────────────────────────────────────────────────
SSD                            Hardware  ~100         ~1%
RAM uncorrectable error        Hardware  ~75          ~1-4%
NVIDIA A100 critical error†    Hardware  0.18 (65d)   -
NVIDIA H100 critical error†    Hardware  0.15 (50d)   -
† “Critical error” refers to an NVIDIA Xid or sXid error which is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, as another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
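As a rough sanity check of those numbers: under a simple exponential failure-time assumption (my assumption, not from the source data), AFR = 1 - exp(-1/MTBF). A quick sketch:

    import math

    def afr_from_mtbf(mtbf_years):
        # P(failure within one year) for exponentially distributed time-to-failure
        return 1.0 - math.exp(-1.0 / mtbf_years)

    for name, mtbf in [("SSD", 100), ("RAM", 75), ("A100", 0.18), ("H100", 0.15)]:
        print(f"{name:5s} MTBF {mtbf:6.2f} yrs -> AFR {afr_from_mtbf(mtbf):.1%}")

That gives ~1.0% for the SSD, ~1.3% for RAM, and >99% for the A100/H100 rows, which is presumably why the AFR column is blank for them.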
YetAnotherNick | 1 month ago
However, I think a lot of it is driver or other software issues. I remember switching from the PyTorch Docker image to NVIDIA's NGC images, and reliability increased very noticeably. Do you have the data broken down by popular Docker images?
salynchnew | 1 month ago
GPUs--they're just like us!
layoric | 1 month ago
Does this mean that even a model which fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
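For concreteness, the kind of recovery process I have in mind is roughly the sketch below (single-GPU PyTorch; the checkpoint path, model, and save interval are made up):

    import os
    import torch

    CKPT = "checkpoint.pt"  # illustrative path

    def save_ckpt(model, opt, step):
        # Write to a temp file and rename so a crash mid-write can't corrupt it.
        tmp = CKPT + ".tmp"
        torch.save({"model": model.state_dict(),
                    "opt": opt.state_dict(),
                    "step": step}, tmp)
        os.replace(tmp, CKPT)

    def load_ckpt(model, opt):
        # Resume from the last checkpoint if one exists, otherwise start at step 0.
        if not os.path.exists(CKPT):
            return 0
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        opt.load_state_dict(state["opt"])
        return state["step"]

    model = torch.nn.Linear(1024, 1024).cuda()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    start = load_ckpt(model, opt)

    for step in range(start, 100_000):
        x = torch.randn(32, 1024, device="cuda")
        loss = model(x).pow(2).mean()  # dummy objective for illustration
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step + 1) % 1000 == 0:
            save_ckpt(model, opt, step + 1)

Wrap the launch in something that restarts it after a crash (systemd, a shell loop, or the scheduler's retry policy) and the run survives the occasional reset, losing at most the work since the last checkpoint.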