top | item 46724942

thundergolfer | 1 month ago

Author here. That 1:50-100 ratio looks roughly right based on my research, but my numbers have GPUs faring even worse.

  Component                      Type       MTBF (yrs)  AFR
  ─────────────────────────────────────────────────────────
  SSD                            Hardware   ~100        ~1%
  RAM uncorrectable error        Hardware   ~75         ~1-4%
  NVIDIA A100 critical error†    Hardware   0.18 (65d)  -
  NVIDIA H100 critical error†    Hardware   0.15 (50d)  -

  † “Critical error” refers to an NVIDIA Xid or sXid error which is not recoverable, requiring an application and GPU reset.

Only a minority of GPU 'failures' appear to be permanent hardware problems, such as row remapping errors. A lot seem to be, like another comment says, a consequence of operating too close to the operational limit, tipping over it, and then requiring a power cycle.
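As a sanity check, the MTBF and AFR columns are consistent if you assume an exponential failure model (my assumption here, not something stated in the table): AFR = 1 - exp(-1/MTBF_years). A quick check in Python:

```python
import math

def afr_from_mtbf(mtbf_years: float) -> float:
    """Annualized failure rate, assuming exponentially distributed failures."""
    return 1.0 - math.exp(-1.0 / mtbf_years)

# MTBF figures from the table above
for name, mtbf in [("SSD", 100), ("RAM", 75), ("A100", 0.18), ("H100", 0.15)]:
    print(f"{name:5s}  MTBF {mtbf:6.2f} yr  ->  AFR {afr_from_mtbf(mtbf):.1%}")
```

For the SSD row this gives ~1%, matching the table; for the GPUs the implied annualized rate is >99%, i.e. under this model essentially every card hits at least one critical error per year.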

YetAnotherNick|1 month ago

So I ran 16x A100s in GCP for training workloads, and it was hard to keep them running for more than a few days, so that matches your numbers.

However, I think a lot of it is driver or other software issues. I remember switching from the PyTorch Docker image to NVIDIA's NGC images and reliability increased very noticeably. Do you have data broken down by popular Docker images?

salynchnew|1 month ago

> operating too close to the operational limit, tipping over it, and then requiring a power cycle.

GPUs--they're just like us!

layoric|1 month ago

I'm quite surprised the A100 is not much better, since the power levels for the Ampere cards are, I believe, a lot lower.

Does this mean that even a model which fits on a single server and trains for a few weeks will absolutely need a recovery process? Interested in people's experiences around this.
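Given a ~50-65 day MTBF, any multi-week run effectively needs checkpoint/restore. A minimal sketch of the pattern in plain Python (names like `CKPT` are hypothetical; a real training job would serialize model/optimizer state with something like torch.save instead of JSON, but the atomic-rename structure is the same):

```python
import json
import os

CKPT = "ckpt.json"  # hypothetical checkpoint path

def save_checkpoint(path, step, state):
    # Write to a temp file, then rename over the old checkpoint, so a crash
    # mid-write never leaves a corrupt/partial checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        return ckpt["step"], ckpt["state"]
    return 0, {"loss": None}

step, state = load_checkpoint(CKPT)
while step < 100:
    state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
    step += 1
    if step % 10 == 0:                # checkpoint every N steps
        save_checkpoint(CKPT, step, state)
```

If the node dies anywhere in the loop, rerunning the script picks up from the last multiple-of-10 step instead of from zero.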

formerly_proven|1 month ago

GPU servers have always had crap reliability compared to a normal server (and sticking eight GPUs on a baseboard complicates things further). As I understand it (not my domain), this unreliability, combined with MPI's lack of widespread checkpointing and fault-tolerance support, is one of the motivating factors for why ML toolkits eschew MPI (besides accelerator-to-accelerator communication being an afterthought there).

shrubble|1 month ago

If you rebooted every server after 35 days, would that get rid of many of the problems?

direwolf20|1 month ago

It's an average time to failure, not a guarantee. Failures occur randomly.
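If failures really are exponentially distributed (an assumption on my part; real Xid errors may well cluster with load, temperature, or wear), then a scheduled reboot can't help, because the exponential distribution is memoryless: a GPU that has run 35 days is statistically identical to a freshly booted one. A quick simulation:

```python
import random

random.seed(0)
MTBF_DAYS = 50.0   # roughly the H100 critical-error figure from the table
WINDOW = 35.0      # the proposed reboot interval
N = 100_000

# Draw failure times from an exponential distribution with the given MTBF.
samples = [random.expovariate(1 / MTBF_DAYS) for _ in range(N)]

# P(failure within the first 35-day window) for freshly booted GPUs:
p_first = sum(t < WINDOW for t in samples) / N

# P(failure within the *next* 35 days) among GPUs that survived the first
# window, i.e. the population a scheduled reboot would be resetting:
survivors = [t - WINDOW for t in samples if t >= WINDOW]
p_second = sum(t < WINDOW for t in survivors) / len(survivors)

# Both come out near 1 - exp(-35/50) ~ 0.50: surviving 35 days tells you
# nothing, so preemptive reboots buy nothing under this model.
print(p_first, p_second)
```

Where a reboot schedule could help is with failure modes this model doesn't capture, e.g. slow resource leaks or accumulated driver state, which may be what people are actually seeing.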

jvalencia|1 month ago

I'm curious if running them at slightly lower voltage would fix it or if it's a software thing.