kilotaras | 4 months ago
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1192 GPUs they now use 213 for serving those requests.
bee_rider|4 months ago
I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?
(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)
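The idle-timeout idea above can be sketched as a simple eviction loop. This is a toy illustration, not how any real scheduler works; the `ModelSlot` class, the 120-second threshold, and the `evict_idle` helper are all hypothetical:

```python
import time

IDLE_TIMEOUT_S = 120  # hypothetical "couple of minutes" threshold

class ModelSlot:
    """Tracks when a model loaded on a GPU last served a request."""
    def __init__(self, name):
        self.name = name
        self.last_used = time.monotonic()

    def touch(self):
        """Call on every request served by this model."""
        self.last_used = time.monotonic()

    def idle_for(self):
        return time.monotonic() - self.last_used

def evict_idle(slots):
    """Return only the slots still within the idle timeout;
    a real system would free GPU memory for the rest."""
    keep = []
    for slot in slots:
        if slot.idle_for() > IDLE_TIMEOUT_S:
            pass  # unload slot.name, free the GPU here
        else:
            keep.append(slot)
    return keep
```

The catch, as the reply below notes, is that re-loading (and possibly re-compiling) a freed model can take long enough that eviction hurts tail latency.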
miki123211|4 months ago
If you're using an efficient inference engine like vLLM, you're adding compilation into the mix, and not all of that is fully cached yet.
If that kind of latency isn't acceptable to you, you have to keep the models loaded.
This (along with batching) is why large local models are a dumb and wasteful idea if you're not serving them at enterprise scale.
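The batching point can be made concrete with toy arithmetic. Assuming (my assumption, not a measurement) that a decode step is memory-bandwidth bound, streaming the weights dominates the step time regardless of batch size, so a batched server amortizes that fixed cost across many requests while a single local user pays it in full:

```python
# Illustrative numbers only, not measurements from any real GPU.
weight_read_ms = 20.0        # hypothetical time to stream weights once per step
per_token_compute_ms = 0.5   # hypothetical marginal compute per request in the batch

def step_time_ms(batch_size):
    """One decode step: fixed weight-read cost plus per-request compute."""
    return weight_read_ms + per_token_compute_ms * batch_size

def per_request_ms(batch_size):
    """Cost attributed to each request in the batch."""
    return step_time_ms(batch_size) / batch_size
```

With these made-up numbers, a batch of 1 costs 20.5 ms per request while a batch of 32 costs about 1.1 ms per request, which is the efficiency gap the comment is pointing at.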
smallnix|4 months ago
At the scale of a hyperscaler I think Alibaba is the one that would be doing that. AWS, Azure and I assume Alibaba do lease/rent data centers, but someone has to own the servers / GPU racks. I know there are specialized companies like nscale (and more further down the chain) in the mix, but I always assumed they only lease out fixed capacity.
citizenpaul|4 months ago
I've assumed that as well. It makes sense to me, since loading up a model locally takes a while. I wonder if there's some better way I'm not in the know about. That, or I'm too GPU-poor to know about it.
make3|4 months ago
It's likely that these are small, unpopular (non-flagship) models, or that they only pack e.g. one layer of each model.
hinkley|4 months ago
14.5% is worth a raise at least. But it’s still misleading.
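One plausible reading of the 14.5% figure (my guess; the comment doesn't show its work): the 1,192 GPUs were 17.7% of the fleet, so trimming them to 213 frees 17.7% × (1 − 213/1192) of the whole fleet:

```python
pool_share = 0.177          # fraction of all GPUs serving the cold models
before, after = 1192, 213   # H20 counts quoted upthread

# Fleet-wide fraction of GPUs freed by shrinking just that pool.
fleet_savings = pool_share * (1 - after / before)
print(round(fleet_savings * 100, 1))  # → 14.5
```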
somerandomdude2|4 months ago
"A paper presented at SOSP 2025 details how token-level scheduling helped one GPU serve multiple LLMs, reducing demand from 1,192 to 213 H20s."
Which, if you scale it, matches the GP's statement.
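The paper's actual mechanism isn't described in the thread; as a rough sketch under my own assumptions, "token-level scheduling" means one GPU interleaves single decode steps from several models' request queues instead of dedicating a whole GPU per model. A toy round-robin version:

```python
from collections import deque

def token_level_schedule(queues, steps):
    """Toy scheduler: interleave one decode step at a time across
    models sharing a GPU (round-robin; the real paper's policy is
    surely more sophisticated).

    queues: {model_name: deque of (request_id, tokens_remaining)}
    Returns the order of (model, request_id) decode steps executed.
    """
    trace = []
    order = deque(queues.keys())
    while steps > 0 and any(queues.values()):
        model = order[0]
        order.rotate(-1)          # next model gets the next step
        q = queues[model]
        if not q:
            continue              # this model has no pending work
        req_id, remaining = q.popleft()
        trace.append((model, req_id))
        if remaining > 1:
            q.append((req_id, remaining - 1))  # request needs more tokens
        steps -= 1
    return trace
```

Because decode steps are short, a mostly-idle model only consumes GPU time when it actually has a token to emit, which is how a small pool can absorb many low-traffic models.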
MangoCoffee|4 months ago
Thanks to the US restrictions on the Chinese semiconductor industry, Chinese engineers are being forced to innovate and find their own ways to overcome challenges, like the old-school engineers of what Silicon Valley used to be.
_heimdall|4 months ago
That said, I'm not sure what the US policies specifically have to do with this. Countries are always in competition with one another, and if one industry or technology is considered a national security threat they will guard it.