top | item 36136141


ioedward | 2 years ago

Nvidia's enterprise GPUs are surprisingly unreliable. Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days. I didn't have any insight on whether it was a hardware or software failure.


csdvrx|2 years ago

> Working on a 128 GPU A100 cluster on AWS, 1 would fail every few days

Define "fail".

> I didn't have any insight on whether it was a hardware or software failure.

Have scripts check nvidia-smi for ECC errors and dmesg for devices dropping off the PCI bus.

For the latter, just perform a device reset (a power toggle of the device and a rescan of the bus is often enough to be back online within 5 seconds). For the former, replace the card.
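A rough sketch of what such a check script could look like, in Python. The nvidia-smi query field is a real one, but the dmesg pattern ("fallen off the bus", as commonly seen with NVRM Xid 79 messages) is an assumption about what your driver logs, and the function names are made up for illustration. Treat it as a starting point, not a tested monitoring tool.

```python
import re
import subprocess

# nvidia-smi query for per-GPU uncorrected ECC errors since the last driver load.
# GPUs without ECC support (or with ECC disabled) report "[N/A]" instead of a number.
ECC_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]

# Assumption: the kernel log reports a lost device with this wording.
# Adjust the pattern to whatever your driver version actually prints.
BUS_DROP = re.compile(r"fallen off the bus", re.IGNORECASE)


def parse_ecc(csv_text):
    """Return {gpu_index: count} for GPUs with a nonzero uncorrected ECC count."""
    bad = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        index, count = fields
        if index.isdigit() and count.isdigit() and int(count) > 0:
            bad[int(index)] = int(count)
    return bad


def find_bus_drops(dmesg_text):
    """Return the kernel log lines that match the bus-drop pattern."""
    return [line for line in dmesg_text.splitlines() if BUS_DROP.search(line)]


def check_host():
    """Run both checks on the local machine and print what was found."""
    ecc = subprocess.run(ECC_QUERY, capture_output=True, text=True, check=True).stdout
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    for gpu, count in parse_ecc(ecc).items():
        print(f"GPU {gpu}: {count} uncorrected ECC errors, candidate for replacement")
    for line in find_bus_drops(log):
        print(f"bus drop, candidate for device reset: {line}")
```

Calling check_host() from cron or a node health check on each GPU host would at least separate the "replace the card" cases from the "reset the device" cases. Reading dmesg may require root, depending on the kernel's dmesg_restrict setting.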

robotresearcher|2 years ago

What does an AWS user do with this advice?