I’m slightly confused as to how all this works. Do the GPUs just sit there with the models loaded when the models are not in use?
I guess I’d assumed this sort of thing would be allocated dynamically. Of course, there’s a benefit to minimizing the number of times you load a model. But surely if a GPU+model is idle for more than a couple minutes it could be freed?
(I’m not an AI guy, though—actually I’m used to asking SLURM for new nodes with every run I do!)
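A minimal sketch of the idle-timeout eviction policy the comment is asking about (all names here are hypothetical, not from the paper; the catch is that reloading an evicted model later costs the tens of seconds of load time discussed elsewhere in this thread):

```python
import time

IDLE_TIMEOUT_S = 120  # free a GPU+model pair after ~2 minutes idle

class ModelSlot:
    """One model loaded on one GPU (hypothetical bookkeeping object)."""
    def __init__(self, name):
        self.name = name
        self.last_used = time.monotonic()

    def touch(self):
        """Call on every request routed to this model."""
        self.last_used = time.monotonic()

def evict_idle(slots):
    """Split slots into (kept, freed); freed slots could release their GPU."""
    now = time.monotonic()
    kept, freed = [], []
    for slot in slots:
        target = freed if now - slot.last_used > IDLE_TIMEOUT_S else kept
        target.append(slot)
    return kept, freed
```

A real scheduler would weigh eviction against the reload cost rather than use a fixed timeout, which is presumably why it isn't as simple as "free after a couple of minutes".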
Not really. Figure 1(a) of the paper says the 17.7% is relative to a total of 30k GPUs (i.e. 5310 GPUs for handling those 1.35% of requests), and the reduction is measured in a smaller beta deployment with only 47 different models (vs. the 733 "cold" models overall). Naïve extrapolation by model count suggests they would need 3321 GPUs to serve all cold models, a 37.5% reduction compared to before (or a 6.6% reduction of the full 30k-GPU cluster).
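As a sanity check, the arithmetic above works out like this (all figures are taken from the thread; the linear per-model scaling is the naïve assumption):

```python
# Naïve extrapolation of the beta deployment's GPU count to all cold models.
total_gpus = 30_000
cold_share = 0.177  # 17.7% of GPUs serve the cold models today
gpus_today = round(total_gpus * cold_share)  # 5310 GPUs

beta_gpus = 213        # GPUs in the beta deployment
beta_models = 47       # models served in the beta
all_cold_models = 733  # cold models overall

# Assume GPU need scales linearly with model count (the naïve part).
gpus_needed = int(beta_gpus / beta_models * all_cold_models)  # 3321

reduction_vs_before = 1 - gpus_needed / gpus_today              # ≈ 37.5%
reduction_of_cluster = (gpus_today - gpus_needed) / total_gpus  # ≈ 6.6%
print(gpus_needed, f"{reduction_vs_before:.1%}", f"{reduction_of_cluster:.1%}")
```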
In the past, software and computer engineers would tackle problems head-on, designing algorithms and finding creative solutions.
Thanks to the US restrictions on the Chinese semiconductor industry, Chinese engineers are being forced to innovate and find their own ways to overcome challenges, like the old-school engineers of what Silicon Valley used to be.
> However, a small handful of models such as Alibaba’s Qwen and DeepSeek are most popular for inference, with most other models only sporadically called upon. This leads to resource inefficiency, with 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found.
The US attempt to slow down China's technological development succeeds in preventing China from directly following the same path, but may backfire in the sense that it forces China to innovate in a different direction. The overall outcome for us all may be increased efficiency as a result of this forced innovation, especially if Chinese companies continue to open-source their advances, so we may in the end have reason to thank the US for their civilisational gatekeeping.
History has shown that withholding technology from China does not significantly stop them and they'll achieve it (or better) in a small number of years.
In many senses there's hubris in the western* view of China's accomplishments: most of what western companies have created has had significant contributions from Chinese scientists or manufacturing, without which those companies would have nothing. If you look at the names of AI researchers there's a strong pattern, even if some are currently plying their trade in the west.
---
* I hate the term "western" because some "westerners" use it to separate what they think is "civilized" from "uncivilized"; hence for them LATAM is not "western" even though everything about LATAM countries is western.
I think anti-immigrant rhetoric will have the most impact against the US. A lot of the people innovating on this stuff are being maligned and leaving in droves.
Aside from geography, attracting talent from all over the world is the one edge the US has as a nation over countries like China. But now the US is trying to be xenophobic like China and restrict tech import/export like China, while competing against 10x the population and without similar levels of internal strife and fissures.
The world, even Europe, is looking for a new country to take on a leader/superpower role. China isn't there yet, but it might get there in a few years, after their next-gen fighter jets and catching up to ASML.
But China's greatest weakness is their lack of ambition and their focus on regional matters like Taiwan and the South China Sea, instead of winning over Western Europe and India.
Tbh this whole situation reminds me of how Japan excelled at making a lot more with a lot less after WW2, e.g., fuel-efficient engines, light cars, etc. These constraints were not present in the US (and to some extent in Europe), and the result was US cars being completely uncompetitive in non-US markets.
Fingers crossed for convergence rather than divergence in the technical standards. Although the way things are going, it looks like the two stacks will diverge sooner rather than later, with the US+ banning the use of CHN models while simultaneously banning the export of its quasi-open models.
We may very well end up in a situation like the old PAL vs NTSC video standards, where PAL (EU/Asia/Africa) and NTSC (Americas/Japan) gradually converged with the adoption of digital formats. Instead, here there would be a divergence based on geopolitical considerations.
China's innovation relies on stolen western IP; without it, China is nothing. Also, USSR/Russia is no longer a scientific powerhouse that can supply China with military innovation. A dictatorship combined with cheap labour 100% guarantees that the country's innovation is stunted, no matter what the Chinese propaganda claims.
I want China to release GPUs with a ton of VRAM, 128 GB–256 GB. It doesn't matter if they are half as fast as Nvidia's, because having a big model at a reasonable speed is better than not being able to run it at all. AMD could have done this and had a massive impact on Nvidia's market share, but they chose not to, because reasons.
There are signs that China is not open-sourcing their SOTA models anymore. Both Huawei and Qwen (Qwen-Max, WAN 2.5) have launched flagship models which are yet to be open-sourced.
China is a nation of engineers... The US has been relying on H-1B immigrants. Science is under attack. The truth is the US already lost: https://youtu.be/whVlI6H4d-4
It's much easier to copy what others are doing instead of spending the time and money for research and engineering. It's also much easier if you steal the tech. I could never have invented a bicycle but I can sure make a copy of one.
I believe this is a Pollyanna take on AI. There is nothing about humans that tells us humans will bring AI to fruition for the other humans, and a mountain of evidence showing how it will be used to abuse humans instead... for profits/power/whatever horseshit the masters of the universe have decided upon.
Does someone know if there's some equivalent of those engineering/research blogs for Chinese companies?
I used to follow the ones from Western companies, but honestly, after some point I would like to see some cases from what I consider a good benchmark for everyone who does not work in FAANG in terms of engineering.
The company blogs of Chinese companies will often do articles like this[1] talking about a new innovation or optimization that they did, but this will be often just mixed in with marketing articles too.
I would also assume there's a lot of content in the native Chinese forums, which unfortunately, as an English-speaking person, I wouldn't be able to easily refer to :(
Does anyone know how their KV cache sync mechanism compares to newer P2P communication layers like nixl, uccl p2p, etc.?
The authors mention that NCCL and Ray initialization were too slow (see quote below), but from the description it sounds like they’ve reimplemented a layer that’s increasingly being standardized by frameworks like nixl and uccl.
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
It's easy enough for a well-resourced entity to take a pre-trained model and deploy it on new hardware to save on the NVDA tax. It's far less likely for research and model training to happen outside the mature NVDA ecosystem.
> Distributed executor: Inference engines support model parallelism via distributed executors (e.g., Ray [32] and NCCL [9]), whose initialization takes tens of seconds.
I mean, it really shouldn't take tens of seconds for those initialization(s) to occur. There's no good fundamental reason that it should take that long. It's just bloat.
This is such a popular coping tactic from Americans when it comes to facing actual competition, especially from China. Everything they do must either be a lie or just stolen American technology, as if there's something inherently special about Americans that no one else has.
kilotaras|4 months ago
> 17.7 per cent of GPUs allocated to serve only 1.35 per cent of requests in Alibaba Cloud’s marketplace, the researchers found
Instead of 1192 GPUs they now use 213 for serving those requests.
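In relative terms, the before/after counts quoted above work out to roughly an 82% reduction in GPUs for those requests:

```python
before, after = 1192, 213  # GPU counts quoted in the comment
reduction = 1 - after / before
print(f"{reduction:.1%}")  # ≈ 82.1% fewer GPUs
```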
hinkley|4 months ago
14.5% is worth a raise at least. But it’s still misleading.
majke|4 months ago
paper https://dl.acm.org/doi/10.1145/3731569.3764815
lesuorac|4 months ago
China has an import ban on chips [1], so it's irrelevant what the US does.
[1]: https://www.cnbc.com/2025/09/17/nvidia-ceo-disappointed-afte...
segmondy|4 months ago
Go back to 2024 and western labs were crushing it.
It's now 2025, and from China we have DeepSeek, Qwen, Kimi, GLM, ERNIE and many more capable models keeping up with western labs. There are actually now more Chinese labs releasing SOTA models than western labs.
[1] https://www.alibabacloud.com/blog/how-does-alibaba-ensure-th...
CaptainOfCoit|4 months ago
> Our current deployment runs in a cross-region cluster comprising 213 H20 GPUs, serving twenty-eight 1.8–7B models (TP=1) and nineteen 32–72B models (TP=4).
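Back-of-the-envelope accounting for the quoted deployment (assuming the usual meaning of TP, one GPU per tensor-parallel shard; the replica math is my own, not from the paper):

```python
small_models, small_tp = 28, 1   # 1.8–7B models at TP=1
large_models, large_tp = 19, 4   # 32–72B models at TP=4
total_gpus = 213

# GPUs needed to hold exactly one replica of every model
one_replica = small_models * small_tp + large_models * large_tp
print(one_replica)               # 104
print(total_gpus / one_replica)  # ≈ 2.05, i.e. ~2x headroom for replicas/spares
```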
Yoric|4 months ago
So, definitely not state media, probably not lying on the fundamentals. Of course, still presumably viewed favorably by the CCP, I'd imagine.