bick_nyers | 9 months ago
A system with, say, 192GB of VRAM and the rest standard system memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run Deepseek at 4-bit quite quickly because of the power-law-type usage of the experts.
If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
This would be an easier job for pruning, but I still think enthusiast systems are going to trend over the next couple of years in a way that makes these types of software optimizations useful on a much larger scale.
There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be saturated during tensor parallelism) that gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so something else is limiting performance.
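Rough napkin math for the bandwidth-bound decode rate (the bandwidth figure, the ~37B active parameters, and the 4-bit weight size are my assumptions, not measurements from that Reddit post):

    # Memory-bandwidth-bound decode estimate for a MoE model on 3090s.
    GPU_BW_GBPS = 936          # RTX 3090 VRAM bandwidth, GB/s
    GPU_VRAM_GB = 24
    NUM_GPUS = 16

    ACTIVE_PARAMS = 37e9       # assumed active params per token in the MoE
    BYTES_PER_PARAM = 0.5      # assumed 4-bit quantization

    # Scans per second over one card's VRAM (the "39 times per second" figure):
    print(GPU_BW_GBPS / GPU_VRAM_GB)                       # ~39 scans/s

    # Bytes of weights each token has to stream through:
    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM

    # Even one card's bandwidth over that working set is far above 7 tok/s:
    print(GPU_BW_GBPS * 1e9 / bytes_per_token)             # ~50 tok/s

    # Ideal rate if the active experts were read in parallel across all 16 cards
    # (ignores interconnect, KV cache, kernel overheads, etc.):
    print(NUM_GPUS * GPU_BW_GBPS * 1e9 / bytes_per_token)  # several hundred tok/s

So on paper the ceiling is orders of magnitude above 7 tokens/s, which is why I suspect the interconnect or the software path rather than raw VRAM bandwidth.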
MoonGhost | 9 months ago
That's about 5 kW of power.
> that gets 7 token/s in llama.cpp
Just looking at the electricity bill, it's cheaper to use the API of any of the major providers.
> If you aren't prompting Deepseek in Chinese, a lot of the experts don't activate.
That's interesting. It means the model could be cut down and those tokens routed to the next-closest expert, just in case they come up.
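Something like this minimal sketch is what I have in mind: mask out the router logits of the pruned experts so top-k picks the next-closest surviving ones (my own illustration, not DeepSeek's actual gating code):

    import torch

    def route_with_pruning(router_logits: torch.Tensor,
                           pruned: torch.Tensor,
                           top_k: int = 2):
        """router_logits: [tokens, num_experts]; pruned: bool mask [num_experts]."""
        # Pruned experts can never win: push their scores to -inf.
        logits = router_logits.masked_fill(pruned, float("-inf"))
        weights, experts = torch.topk(logits, top_k, dim=-1)   # next-closest experts
        weights = torch.softmax(weights, dim=-1)               # renormalize the gates
        return weights, experts

    # Toy usage: 4 experts, expert 3 pruned; tokens that preferred it get rerouted.
    logits = torch.randn(5, 4)
    pruned = torch.tensor([False, False, False, True])
    gates, chosen = route_with_pruning(logits, pruned)
    assert not (chosen == 3).any()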
bick_nyers | 9 months ago
latchkey | 9 months ago
ElectricalUnion | 9 months ago
In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency, NUMA-like "Core Partitioned X-celerator" mode.
So you kinda still need to care what-loads-where.
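In practice, caring what-loads-where ends up looking something like this sketch: in the partitioned mode each partition shows up as its own device index to the framework, so you assign layers to partitions explicitly. The round-robin split and the PyTorch-on-ROCm device handling here are my assumptions, not documented AMD behavior:

    import torch

    def place_layers(layers, num_partitions: int):
        """Spread layers across partitions, one device index per partition."""
        placed = []
        for i, layer in enumerate(layers):
            dev = torch.device(f"cuda:{i % num_partitions}")  # ROCm exposes partitions via CUDA-style indices
            placed.append(layer.to(dev))
        return placed

    if torch.cuda.is_available():
        parts = torch.cuda.device_count()   # in partitioned mode this counts partitions, not packages
        layers = [torch.nn.Linear(4096, 4096) for _ in range(8)]
        layers = place_layers(layers, parts)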