top | item 46802094

(no title)

Given that it's a 400B-parameter model, but it's a sparse MoE model with 13B active parameters per token, would it run well on an NVIDIA DGX Spark with 128 GB of unified RAM, or do you practically need to hold the full model in RAM even with sparse MoE?

discuss

timschmidt|1 month ago

Even with MoE, holding the model in RAM while individual experts are evaluated in VRAM is a bit of a compromise. Experts can be swapped in and out of VRAM for each token. So RAM <-> VRAM bandwidth becomes important. With a model larger than RAM, that bandwidth bottleneck gets pushed to the SSD interface. At least it's read-only, and not read-write, but even the fastest of SSDs will be significantly slower than RAM.

That said, there are folks out there doing it. https://github.com/lyogavin/airllm is one example.

radarsat1|1 month ago

> Experts can be swapped in and out of VRAM for each token.

I've often wondered how much it happens in practice. What does the per-token distribution of expert selection actually look like during inference? For example does it act like uniform random variable, or does it stick with the same 2 or 3 experts for 10 tokens in a row? I haven't been able to find much info on this.

Obviously it depends on what model you are talking about, so some kind of survey would be interesting. I'm sure this must but something that the big inference labs are knowledgeable about.

Although, I guess if you are batching things, then even if a subset of experts is selected for a single query, maybe over the batch it appears completely random, that would destroy any efficiency gains. Perhaps it's possible to intelligently batch queries that are "similar" somehow? It's quite an interesting research problem when you think about it.

Come to think of it, how does it work then for the "prompt ingestion" stage, where it likely runs all experts in parallel to generate the KV cache? I guess that would destroy any efficiency gains due to MoE too, so the prompt ingestion and AR generation stages will have quite different execution profiles.

Gracana|1 month ago

I thought paging was so inefficient that it wasn't worth doing vs using CPU inference for the parts of the model that are in system memory. Maybe if you have a good GPU and a turtle of a CPU, but still somehow have the memory bandwidth to make shuffling data in and out of the GPU worthwhile? I'm curious to know who is doing this and why.

nick49488171|1 month ago

With a non-sequential generative approach perhaps the RAM cache misses could be grouped together and swapped on a when available/when needed prioritized bases.

antirez|1 month ago

Can run with mmap() but it is slower. 4-bit quantized there is a decent ratio between the model size and the RAM, with a fast SSD one could try to see how it works. However when a model is 4-bit quantized there is often the doubt that it is not better than an 8-bit quantized model of 200B parameters, it depends on the model, on the use case, ... Unfortunately the street for local inference of SOTA model is being stopped by the RAM prices and the GPU request of the companies, leaving us with little. Probably today the best bet is to buy Mac Studio systems and then run distributed inference (MLX supports this for instance), or a 512 GB Mac Studio M4 that costs, like 13k$.

vardump|1 month ago

I think 512 GB Mac Studio was M3 Ultra.

Anyways, isn't a new Mac Studio due in a few months? It should be significantly faster as well.

I just hope RAM prices don't ruin this...

notpublic|1 month ago

Talking about RAM prices, you can still get a framework Max+ 395 with 128GB RAM for ~$2,459 USD. They have not increased the price for it yet.

https://frame.work/products/desktop-diy-amd-aimax300/configu...

jychang|1 month ago

No.

128GB vram gets you enough space for 256B sized models. But 400B is too big for the DGX Spark, unless you connect 2 of them together and use tensor parallel.