We're Cutting L40S Prices in Half

59 points | LukeLambert | 1 year ago | fly.io

30 comments

zackangelo|1 year ago

The L40S has 48GB of VRAM; curious how they're able to run Llama 3.1 70B on it. The weights alone would exceed that. Maybe they mean quantized/fp8?
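The weight math here is easy to check on a napkin. A rough sketch in Python, counting only weight memory (KV cache and activations would add more on top):

```python
GIB = 1024**3

def weight_gib(params: float, bytes_per_param: float) -> float:
    """GiB needed just to hold the model weights (no KV cache, no activations)."""
    return params * bytes_per_param / GIB

# Llama 3.1 70B at common precisions vs. the L40S's 48 GiB
for name, bpp in [("fp16", 2), ("fp8", 1), ("q4", 0.5)]:
    gib = weight_gib(70e9, bpp)
    print(f"{name}: {gib:6.1f} GiB ({'fits' if gib <= 48 else 'exceeds'} one L40S)")
```

At fp16 the weights alone are ~130 GiB; even fp8 (~65 GiB) exceeds a single card, so a 4-bit quant or multiple L40S cards would be needed.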

I just had to implement GPU clustering in my inference stack to support Llama 3.1 70B, and even then I needed 2x A100 80GB SXMs.

I was initially running my inference servers on fly.io because they were so easy to get started with. But I eventually moved elsewhere because the prices were so high. I pointed out to someone there who e-mailed me that it was really expensive vs. others, and they basically just waved me away.

For reference, you can get an A100 SXM 80GB spot instance on google cloud right now for $2.04/hr ($5.07 regular).

tptacek|1 year ago

Our standard A100 SXM 80GB price is $3.50/hr, for what it's worth.

nknealk|1 year ago

> You can run DOOM Eternal, building the Stadia that Google couldn’t pull off, because the L40S hasn’t forgotten that it’s a graphics GPU.

Savage.

I wonder if we’ll see a resurgence of cloud game streaming

synicalx|1 year ago

I feel like GeForce Now has been growing quite steadily. They onboard new titles every week, including some big ones like World of Warcraft recently, and they've stood up streaming DCs in quite a lot of places now.

duxup|1 year ago

GeForce Now is pretty great.

0max|1 year ago

Is the service that PlayStation Now uses publicly known? That's the only streaming service I've used so far.

doublepg23|1 year ago

MS seems to be continuing to push xCloud with Game Pass. It has been useful when playing games with my friends a few times.

deepsquirrelnet|1 year ago

I hadn’t even heard of the L40S until I started renting one to get more memory for small training jobs. I didn’t benchmark it, but it seemed to be pretty fast for a PCIe card.

Amazon’s g6 instances are L4-based with 24GB of VRAM, half the capacity of the L40S, with SageMaker on-demand prices at about this rate. Vast.ai is cheaper, though it's a bit more like bidding and availability varies.

CGamesPlay|1 year ago

> You can run Llama 3.1 70B — the big Llama — for LLM jobs.

That's the medium Llama. Does anyone know if an L40S would run the 405B version?

xena|1 year ago

Hi, I'm the person who wrote that sizing comment in the draft for this article. I've been trying for a while and have been unsuccessful at getting 405B running on any of the GPU machines. I suspect I'd need a raw 8xA100 node to do it at Q4. I doubt there is any reasonable combination of L40S cards that can do it on fly.io; it's just too big. I suspect that in time the 70B model will be brought up to be roughly equivalent, but realistically it's already at the GPT-4 threshold as is. I've found that 70B is more than sufficient in practice.
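Napkin math for the 8xA100-at-Q4 estimate, assuming roughly 0.5 bytes per parameter for a 4-bit quant and ignoring KV cache overhead:

```python
GIB = 1024**3

def weight_gib(params: float, bytes_per_param: float) -> float:
    """GiB of weight memory only (no KV cache, no activations)."""
    return params * bytes_per_param / GIB

q4_405b = weight_gib(405e9, 0.5)  # ~189 GiB of weights at 4-bit
print(f"405B @ Q4 weights: {q4_405b:.0f} GiB")
print(f"8x A100 80GB node: {8 * 80} GB total, leaving headroom for KV cache")
print(f"One 48 GiB L40S: would need {q4_405b / 48:.1f}+ cards for weights alone")
```

So even at Q4 the weights span roughly four L40S cards before accounting for KV cache, which is why an 8xA100 node is the more realistic target.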

tazu|1 year ago

Prices lowered to $1.25/hr... still 2X vast.ai prices.

tptacek|1 year ago

There are definitely GPU providers where you can buy cheaper L40S hours than us. I'm not entirely sure what their system architectures are, or whether they're just buying in absolutely spectacular volume, because we are cutting pretty close to the bone with our pricing.

One cost factor we have that other providers might not have (I'd love to know): we have to dedicate individual racked physical hosts to each group of GPUs we deploy, because we don't (/can't, depending on how you think about systems security) allow GPU-enabled workloads to share hardware with non-GPU-enabled workloads, and we don't allow anyone to share kernels.

But like we said in the post: we're still figuring this stuff out. What we know is: at the same price level, we're consistently sold out of A10 inventory.

rdedev|1 year ago

I don't know what platform vast.ai uses, but what I've noticed is that CPU compute is pretty slow on those machines. Specifically, the tokenization stage was unusually slow for no apparent reason. I had to give that up and use Google Cloud for my research project.

layoric|1 year ago

Not as fast as the L40S, but Runpod.io has the A40 48GB at a $0.28/hr spot price, so if it's mainly VRAM you need, that's a much cheaper option. Vast.ai has it at the same price as well.

tptacek|1 year ago

Runpod is definitely cheaper than we are! We are not the cheapest GPU/hour you can get on any hardware iteration. That's not what we're about, and it is 100% legit to point out that there are workloads that make more sense on other platforms. It would be very weird if that wasn't the case.

blindriver|1 year ago

Suddenly cutting prices in half shows that the business model is in dire straits.

tptacek|1 year ago

What it shows is that we're sold out of one part but not the next part up. We're not cutting all our prices in half. We'd just rather source more L40S's than A10's, for what I think are pretty obvious reasons.

This all happened because we were having internal meetings about trying to find A10s to rack, and Kurt stopped and said "wtf are we doing".

If it'll make you feel better, we'll continue to charge you the previous list price for L40S GPU hours.

gedw99|1 year ago

They buy them at ~$12K, so they pay them off in about a year.

Nice business to be in, I guess.
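Taking the comment's assumed ~$12K purchase price and the new $1.25/hr rate at face value, the payback math works out roughly like this:

```python
card_cost = 12_000  # assumed purchase price per L40S, per the comment
rate = 1.25         # $/hr list price after the cut

hours = card_cost / rate
print(f"{hours:.0f} hours -> {hours / 24:.0f} days "
      f"-> {hours / (24 * 365):.1f} years at 100% utilization")
```

That's about 9,600 billable hours, or a bit over a year at full utilization; real payback would be longer once power, hosting, and idle time are factored in.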