top | item 41806180


electronbeam | 1 year ago

The real money is in renting infiniband clusters, not individual gpus/machines

If you look at Lambda's one-click clusters, they state $4.49/H100/hr.


latchkey|1 year ago

I'm in the business of mi300x. This comment nails it.

In general, the $2 GPUs come with a catch: a PE-backed venture losing money, long contracts, huge minimum quantities, PCIe-only interconnect, slow (<400G) networking, or some other limitation, like unreliable uptime from a bitcoin miner that decided to pivot into the GPU space and has zero experience running these more complicated systems.

Basically, if you decide to build and risk your business on these sorts of providers, you "get what you pay for".

jsheard|1 year ago

> slow (<400G) networking

We're not getting Folding@Home style distributed training any time soon, are we.

marcyb5st|1 year ago

I agree with you, but as the article mentioned, if you need to finetune a small/medium model you really don't need clusters. Getting a whole server with 8/16x H100s is more than enough. And I also agree with the article when it states that most companies today are finetuning some version of llama or other open-weights models.

pico_creator|1 year ago

Exactly, the article covers how the market is segmenting by GPU cluster size.

Is the cluster big enough for foundation model training from scratch? Then it commands ~$3+/hr. Otherwise the price drops hard.

Problem is "big enough" is a moving goalpost now: what was big becomes small.

swyx|1 year ago

so why not buy up all the little h100 deployments and pool enough of them together for a cluster? seems like a decent rollup strategy?

of course it would still cost a lot to do... but if the difference is $2/hr vs $4.49/hr then there's some size where it makes sense
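A rough sketch of that break-even logic, using the two hourly rates quoted in the thread; the networking capex figure is a made-up placeholder just to show the shape of the calculation, not a real quote:

```python
# Back-of-envelope rollup economics.
# Rates are from the thread; capex is an illustrative assumption.

cheap_rate = 2.00    # $/H100/hr for standalone, poorly-networked boxes
cluster_rate = 4.49  # $/H100/hr for Lambda's one-click InfiniBand clusters

# Extra revenue per GPU-hour if you can sell cluster-grade capacity
spread = cluster_rate - cheap_rate

# Assumed one-time cost to wire one GPU into an InfiniBand fabric
# (NICs, switch ports, cabling) -- hypothetical number for illustration.
network_capex_per_gpu = 3000.0

hours_to_break_even = network_capex_per_gpu / spread
print(f"spread: ${spread:.2f}/H100/hr")
print(f"break-even: {hours_to_break_even:.0f} hours "
      f"(~{hours_to_break_even / 24:.0f} days of full utilization)")
```

Under these assumptions the networking spend pays back in well under two months of full utilization, which is why the comment's "some size where it makes sense" intuition holds; the real question is whether the cheap standalone boxes are physically co-located enough to fabric together at all.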