top | item 45644776

kilotaras | 4 months ago

Alibaba Cloud claims to reduce the number of Nvidia GPUs used for serving unpopular models by 82% (emphasis mine)

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.
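A quick sanity check that the two figures match the headline claim:

```python
# Reduction implied by going from 1192 GPUs to 213.
before, after = 1192, 213
reduction = 1 - after / before
print(f"{reduction:.1%}")  # 82.1%
```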

bee_rider|4 months ago

I’m slightly confused as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)

miki123211|4 months ago

Loading a model takes at least a few seconds, usually more, depending on model size, disk / network speed and a bunch of other factors.

If you're using an efficient inference engine like vLLM, you're also adding compilation into the mix, and not all of that is fully cached yet.

If that kind of latency isn't acceptable to you, you have to keep the models loaded.

This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.
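A back-of-the-envelope sketch of the load-latency point above. The model size and bandwidth are hypothetical numbers, not figures from the thread or the paper:

```python
def load_seconds(model_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream model weights into VRAM, ignoring any
    engine startup or compilation cost on top."""
    return model_gb / bandwidth_gb_s

# A 14 GB checkpoint (roughly a 7B model in fp16) over a 2 GB/s NVMe read path:
print(load_seconds(14, 2))  # 7.0 seconds of pure weight transfer
```

Even before compilation, that's enough latency that you can't afford a cold load on every request.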

svachalek|4 months ago

Models take a lot of VRAM which is tightly coupled to the GPU so yeah, it's basically sitting there with the model waiting for use. I'm sure they probably do idle out but a few minutes of idle time is a lot of waste--possibly the full 82% mentioned. In this case they optimized by letting the GPUs load multiple models and sharing the load out by token.

smallnix|4 months ago

> I guess I’d assumed this sort of thing would be allocated dynamically

At the scale of a hyperscaler I think Alibaba is the one that would be doing that. AWS, Azure and I assume Alibaba do lease/rent data centers, but someone has to own the servers / GPU racks. I know there are specialized companies like nscale (and more further down the chain) in the mix, but I always assumed they only lease out fixed capacity.

yorwba|4 months ago

The paper is about techniques to do that dynamic allocation to maximize utilization without incurring unacceptable latencies. If you let a GPU sit idle for several minutes after serving a single request, you're setting money on fire, so they reuse it for a different model as soon as possible, starting even before the first request is finished. But if you don't have a dedicated GPU for a model, are you going to wait for a multi-gigabyte weight transfer before each request? So instead they dedicate a GPU (or two: one for prefill, one for decode) to a group of models whose requests are processed in an interleaved fashion, scheduled such that they stay within the latency budget.
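A toy sketch of that interleaving idea, not the paper's actual scheduler (and with no latency-budget accounting): each model in the group gets one decode step per round instead of one model monopolizing the GPU.

```python
from collections import deque

def token_interleave(requests):
    """Round-robin token-level scheduling sketch. Each entry is
    (model_name, tokens_remaining); returns the order in which
    models receive decode steps, one token at a time."""
    q = deque(requests)
    schedule = []
    while q:
        model, remaining = q.popleft()
        schedule.append(model)           # one decode step for this model
        if remaining > 1:
            q.append((model, remaining - 1))
    return schedule

# Two models sharing one GPU: decode steps alternate rather than
# model A finishing entirely before model B starts.
print(token_interleave([("A", 2), ("B", 3)]))  # ['A', 'B', 'A', 'B', 'B']
```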

citizenpaul|4 months ago

>Do the GPUs just sit there with the models on them when the models are not in use

I've assumed that as well. It makes sense to me since loading up a model locally takes a while. I wonder if there is some sort of better way I'm not in the know about. That, or I'm too GPU-poor to know about it.

make3|4 months ago

the models are huge, so no single (latest-gen) GPU can fit one.

It's likely that these are small unpopular (non-flagship) models, or that they only pack e.g. one layer of each model.

hinkley|4 months ago

So 82% of 17.7%?

14.5% is worth a raise at least. But it’s still misleading.

abejfehr|4 months ago

I don't think that's what this is saying, isn't it that 100 - ~82 = 17.7% ?

yorwba|4 months ago

Not really, Figure 1(a) of the paper says that the 17.7% are relative to a total of 30k GPUs (i.e. 5310 GPUs for handling those 1.35% of requests) and the reduction is measured in a smaller beta deployment with only 47 different models (vs. the 733 "cold" models overall.) Naïve extrapolation by model count suggests they would need 3321 GPUs to serve all cold models, a 37.5% reduction to before. (Or 6.6% reduction of the full 30k-GPU cluster.)
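The extrapolation above can be reproduced from the cited numbers (30k GPUs total, 17.7% on cold models, 213 GPUs for the 47-model beta, 733 cold models overall):

```python
cold_gpus = 30_000 * 0.177      # ≈ 5310 GPUs serving cold models today
extrapolated = 213 * 733 / 47   # ≈ 3322 GPUs to serve all cold models

print(round(1 - extrapolated / cold_gpus, 3))         # 0.374, i.e. ~37.5% fewer
print(round((cold_gpus - extrapolated) / 30_000, 3))  # 0.066 of the full cluster
```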

somerandomdude2|4 months ago

Really:

"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."

Which, if you scale it, matches the GP's statement.

xor1101|4 months ago

Doesn't sound right.

MangoCoffee|4 months ago

In the past, software and computer engineers would tackle problems head-on, designing algorithms and finding creative solutions.

Thanks to the US restrictions on the Chinese semiconductor industry, Chinese engineers are being forced to innovate and find their own ways to overcome challenges, like the old-school engineers of what Silicon Valley used to be.

_heimdall|4 months ago

If you're one who sees progress as an end goal unto itself, what you describe is a good thing. When one party is attempting novel solutions to outcompete the competition, we'll get to whatever the next change is faster.

That said, I'm not sure what the US policies specifically have to do with this. Countries are always in competition with one another, and if one industry or technology is considered a national security threat they will guard it.