
Alibaba Cloud says it cut Nvidia AI GPU use by 82% with new pooling system

523 points | hd4 | 4 months ago | tomshardware.com

Paper: https://dl.acm.org/doi/10.1145/3731569.3764815

315 comments


kilotaras|4 months ago

Alibaba Cloud claims to reduce Nvidia GPUs used for serving unpopular models by 82% (emphasis mine):

> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found

Instead of 1192 GPUs they now use 213 for serving those requests.
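For what it's worth, the headline figure checks out against those numbers (a quick sanity check, not from the paper):

```python
# 1192 GPUs before pooling, 213 after (numbers from the comment above).
before, after = 1192, 213
reduction = 1 - after / before
print(f"{reduction:.1%}")  # ≈ 82.1%
```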

bee_rider|4 months ago

I’m slightly confused as to how all this works. Do the GPUs just sit there with the models on them when the models are not in use?

I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?

(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)

hinkley|4 months ago

So 82% of 17.7%?

14.5% is worth a raise at least. But it’s still misleading.

yorwba|4 months ago

Not really, Figure 1(a) of the paper says that the 17.7% are relative to a total of 30k GPUs (i.e. 5310 GPUs for handling those 1.35% of requests) and the reduction is measured in a smaller beta deployment with only 47 different models (vs. the 733 "cold" models overall.) Naïve extrapolation by model count suggests they would need 3321 GPUs to serve all cold models, a 37.5% reduction to before. (Or 6.6% reduction of the full 30k-GPU cluster.)
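The extrapolation above can be reproduced in a few lines (numbers taken from the comment's reading of Figure 1(a); this is a back-of-envelope sketch, not the paper's own calculation):

```python
# 17.7% of a 30k-GPU cluster serves "cold" models; the beta deployment
# uses 213 GPUs for 47 of the 733 cold models.
total_gpus = 30_000
cold_gpus = total_gpus * 0.177                  # 5310 GPUs for cold models today
beta_gpus, beta_models, all_cold = 213, 47, 733

# Naive linear extrapolation by model count.
extrapolated = int(beta_gpus * all_cold / beta_models)
print(extrapolated)                                             # 3321
print(round((1 - extrapolated / cold_gpus) * 100, 1))           # 37.5 (% reduction vs. cold fleet)
print(round((cold_gpus - extrapolated) / total_gpus * 100, 1))  # 6.6 (% of full cluster)
```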

xor1101|4 months ago

Doesn't sound right.

MangoCoffee|4 months ago

In the past, software and computer engineers would tackle problems head-on, designing algorithms and finding creative solutions.

Thanks to the US restrictions on the Chinese semiconductor industry, Chinese engineers are being forced to innovate and find their own ways to overcome challenges, like the old-school engineers (what Silicon Valley used to be).

djoldman|4 months ago

Key paragraph:

> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.

make3|4 months ago

these other models are likely much smaller

hunglee2|4 months ago

The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but may backfire in the sense that it forces China to innovate in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances. So we may in the end have reason to thank the US for their civilisational gatekeeping.

dlisboa|4 months ago

History has shown that withholding technology from China does not significantly stop them and they'll achieve it (or better) in a small number of years.

In many senses there's hubris in the western* view of China's accomplishments: most of what western companies have created has had significant contributions from Chinese scientists or manufacturing, without which those companies would have nothing. If you look at the names of AI researchers there's a strong pattern, even if some are currently plying their trade in the west.

---

* I hate the term "western" because some "westerners" use it to separate what they think is "civilized" from "uncivilized"; hence for them LATAM is not "western" even though everything about LATAM countries is western.

notepad0x90|4 months ago

I think anti-immigrant rhetoric will have the most impact against the US. A lot of the people innovating on this stuff are being maligned and leaving in droves.

Aside from geography, attracting talent from all over the world is the one edge the US has as a nation over countries like China. But now the US is trying to be xenophobic like China and restrict tech imports/exports like China, while competing against a country with 10x the population and without similar levels of internal strife and fissures.

The world, even Europe is looking for a new country to take on a leader/superpower role. China isn't there yet, but it might get there in a few years after their next-gen fighter jets and catching up to ASML.

But China's greatest weakness is its lack of ambition and its focus on regional matters like Taiwan and the South China Sea, instead of winning over Western Europe and India.

reliabilityguy|4 months ago

Tbh this whole situation reminds me of how Japan excelled at making a lot more with a lot less after WW2, e.g., fuel-efficient engines, light cars, etc. These constraints were not present in the US (and to some extent in Europe), and resulted in US cars being completely uncompetitive in non-US markets.

segmondy|4 months ago

may backfire? it's a bit too late for that.

Go back to 2024: western labs were crushing it.

It's now 2025, and from China we have DeepSeek, Qwen, Kimi, GLM, Ernie and many more capable models keeping up with western labs. There are actually now more Chinese labs releasing SOTA models than western labs.

rzerowan|4 months ago

Fingers crossed for convergence rather than divergence in the technical standards. Although the way things are going, it looks like the two stacks will diverge sooner rather than later, with the US+ banning the use of CHN models while simultaneously banning the export of its quasi-open models. We may very well end up in a situation like the old PAL vs NTSC video standard, where PAL (EU/Asia/Africa) and NTSC (Americas/Japan) gradually converged with the adoption of digital formats. Instead here we would have a divergence based on geopolitical considerations.

myth_drannon|4 months ago

China's innovation relies on stolen western IP; without it, China is nothing. Also, USSR/Russia is no longer a scientific powerhouse that can supply China with military innovation. A dictatorship combined with cheap labour 100% guarantees that the country's innovation is stunted, no matter what Chinese propaganda claims.

archerx|4 months ago

I want China to release GPUs with a ton of VRAM, 128-256 GB. It doesn't matter if they are half as fast as Nvidia's, because having a big model at a reasonable speed is better than not being able to run it at all. AMD could have done this and had a massive impact on Nvidia's market share, but they chose not to, because reasons.

sspiff|4 months ago

There are signs that China is not open-sourcing its SOTA models anymore. Both Huawei and Qwen (Qwen-Max, WAN 2.5) have launched flagship models which are yet to be open-sourced.

narrator|4 months ago

Peaceful competition is a good thing. It's better than a unified one world government throttling everybody.

belter|4 months ago

China is a nation of engineers... The US has been relying on H-1B immigrants. Science is under attack. The truth is the US already lost: https://youtu.be/whVlI6H4d-4

knowitnone3|4 months ago

It's much easier to copy what others are doing instead of spending the time and money for research and engineering. It's also much easier if you steal the tech. I could never have invented a bicycle but I can sure make a copy of one.

downrightmike|4 months ago

That's how it usually goes, fully expected

coliveira|4 months ago

You mean, thank the US for their FAILED "civilizational" gate keeping.

amelius|4 months ago

Another outcome may be that we now have to learn Chinese to understand their datasheets ...

IT4MD|4 months ago

I believe this is a Pollyanna take on AI. There is nothing about humans that tells us humans will bring AI to fruition for the benefit of other humans, and there is a mountain of evidence showing how it will be used to abuse humans instead... for profits/power/whatever horse shit the masters of the universe have decided upon.

braza|4 months ago

Does someone know if there's some equivalent of those engineering/research blogs for Chinese companies?

I used to follow the ones from Western companies, but honestly, after some point in time, I would like to see some cases from what I consider a good benchmark for everyone who does not work in FAANG in terms of engineering.

supriyo-biswas|4 months ago

The company blogs of Chinese companies will often do articles like this[1] talking about a new innovation or optimization that they did, but these are often just mixed in with marketing articles too.

I would also assume there's a lot of content in the native Chinese forums, which unfortunately, as an English-speaking person, I wouldn't be able to easily refer to :(

[1] https://www.alibabacloud.com/blog/how-does-alibaba-ensure-th...

ddelnano|4 months ago

Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like nixl, uccl p2p, etc.?

The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they’ve reimplemented a layer that’s increasingly being standardized by frameworks like nixl and uccl.

> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.

checker659|4 months ago

They are working with tiny models. Not sure how well it'd scale to bigger models (if at all).

CaptainOfCoit|4 months ago

They're all LLMs, so no, not tiny, but not exactly huge either:

> Our current deployment runs in a cross-region cluster comprising 213 H20 GPUs, serving twenty-eight 1.8–7B models (TP=1) and nineteen 32–72B models (TP=4).

jeffybefffy519|4 months ago

I still think Nvidia has the most to lose in the AI race; optimizations like this will continue, coupled with better ASICs.

ibejoeb|4 months ago

Sounds like this virtual GPU is a separate scheduler. I wonder what kind of latency is introduced by marshaling all that data around.

catigula|4 months ago

Sounds like they stopped doing something stupid.

shoeb00m|4 months ago

Would this make cloud providers running low volume fine-tuned models more economically viable?

lnxg33k1|4 months ago

Lots of shareholders here, move along, there is nothing to read

throwaway48476|4 months ago

It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.

mighmi|4 months ago

To what extent is this practice applicable to other loads?

wsfung2008|4 months ago

This is for platforms that serve many different models, most of which have very low usage. e.g. huggingface, civitai

wslh|4 months ago

How feasible is it that, on a horizon of 5 years, new optimized "equations" will cut the need for more GPUs?

nickysielicki|4 months ago

> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.

I mean, it really shouldn't take tens of seconds for those initialization(s) to occur. There's no good fundamental reason that it should take that long. It's just bloat.

t0lo|4 months ago

Is this another nail in the gpu/ai stock market bubble coffin?

muddi900|4 months ago

[deleted]

dotnet00|4 months ago

This is such a popular coping tactic from Americans when it comes to facing actual competition, especially from China. Everything they do must either be a lie or just stolen American technology, as if there's something inherently special about Americans that no one else has.

throawayonthe|4 months ago

scmp is kinda the opposite of state media lol

larus_canus|4 months ago

Ah, yes, the American media environment, which is internationally famous for not lying.