
tsurba | 9 months ago

1.5-2 years ago I did some training for an ML paper on 4 AMD MI250x GPUs (each is essentially two GPUs, so eight in total, each with 64 GB of VRAM) on LUMI.

My JAX models and the baseline PyTorch models were quite easy to set up there, and in practice there was no noticeable performance difference from 8x A100s (which I used for prototyping on our university cluster).

Of course it’s just a random anecdote, but I don’t think Nvidia is actually that far ahead.

pama | 9 months ago

I have nothing against random anecdotes per se, but a lot of academic code does not correctly optimize computation on GPU hardware. If you can estimate by pen and paper how many FLOPs/s your code was achieving, based on the main operations it had to do, and compare that number to the theoretical bfloat16 peak of the NVIDIA GPUs (about 2.6 * 10^15 FLOPs/s for 8 A100s, IIRC), you can see a bit better how close your code was to optimal. I have seen low-effort performance scaling reach less than 1% of these theoretical numbers, and people were super happy because it was sufficiently fast anyway (which is fine) and the GPU showed as utilized all the time (but with only 1% of the available ALUs doing anything useful at any moment).
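The pen-and-paper estimate described above can be sketched in a few lines. This is a hedged illustration, not anything from the thread: it assumes a dense transformer (so the common 6 * N * D FLOPs rule of thumb applies) and made-up throughput numbers; the only figure taken from the comment is the ~312 TFLOP/s bf16 peak per A100.

```python
def transformer_train_flops_per_s(n_params: float, tokens_per_s: float) -> float:
    """Approximate training FLOPs/s via the common 6 * N * D rule of thumb
    (forward + backward pass for a dense transformer)."""
    return 6.0 * n_params * tokens_per_s

def mfu(achieved_flops_per_s: float, peak_flops_per_s: float) -> float:
    """Model FLOPs utilization: fraction of theoretical peak actually used."""
    return achieved_flops_per_s / peak_flops_per_s

# Illustrative numbers (assumptions): a 1B-parameter model training at
# 100k tokens/s on 8x A100, each with ~312 TFLOP/s bf16 peak (~2.5e15 total).
achieved = transformer_train_flops_per_s(1e9, 100_000)  # ~6e14 FLOPs/s
peak = 8 * 312e12
print(f"MFU: {mfu(achieved, peak):.1%}")  # → MFU: 24.0%
```

Anything well below a few percent here is the "GPU shows 100% utilized but the ALUs are idle" situation the comment describes: the utilization counter only says a kernel was resident, not that the math units were busy.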

nickpsecurity | 9 months ago

That is totally true. I'll add one thing to remember for those discussions: compare the cost to what they'd have to use without a GPU.

They'd probably have to spend $5k-20k on a multicore or NUMA-style box to get huge gains on multithreaded code. They also loose the cool factor of saying they're using a RTX. Maybe grant money if it's tied to GPU use. Between the three, it might make sense, even financial sense, to get a sub-$2000 GPU to accelerate academic code that barely uses the GPU.

I'm just brainstorming here, though.