oxcidized|3 months ago
Honestly curious where you got this number, unless you're talking about extremely small quants. Even a Q4 GGUF quant is ~130GB. Am I missing a relatively cheap way to run models this large well?
I suppose you might be referring to a Mac Studio, but (I don't have one, so I can't speak first-hand) there seems to be some debate over whether they run models "well".
simonw|3 months ago
An M3 Ultra with 256GB of RAM is $5599. That should be just about enough to fit MiniMax M2 at 8-bit with MLX: https://huggingface.co/mlx-community/MiniMax-M2-8bit
Or run a smaller quant to leave more memory for other apps!
Here are performance numbers for the 4-bit MLX one: https://x.com/ivanfioravanti/status/1983590151910781298 - 30+ tokens per second.
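For anyone wanting to sanity-check the sizes mentioned above, here is a rough back-of-the-envelope sketch. It assumes MiniMax M2 has about 230B total parameters and that 4-bit and 8-bit quants cost roughly 4.5 and 8.5 bits per weight once scales are included; none of those figures come from this thread, so treat them as assumptions. It also counts weights only, not the KV cache or the OS.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    # params (billions) * bits per weight / 8 bits per byte = gigabytes
    return params_billions * bits_per_weight / 8


# Assumption: ~230B total parameters (not stated in the thread).
PARAMS_B = 230

# Assumption: effective bits per weight, including quantization scales.
for label, bits in [("4-bit", 4.5), ("8-bit", 8.5)]:
    print(f"{label}: ~{weights_gb(PARAMS_B, bits):.0f} GB of weights")
```

Under those assumptions the 4-bit estimate lands close to the ~130GB quoted upthread, and the 8-bit estimate comes in under 256GB with little room to spare, which is in line with "just about enough".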
zht|3 months ago
30 tokens per second looks good until you have to wait minutes for the first token.
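To make the time-to-first-token point concrete, here is a minimal sketch of the arithmetic. The 30 tokens per second decode speed is the figure quoted above; the prompt length and the prefill (prompt processing) speed are hypothetical placeholders, not measurements of any particular machine.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time to first token, total time) in seconds."""
    # The whole prompt has to be processed before the first token appears.
    ttft = prompt_tokens / prefill_tps
    # After that, output tokens arrive at the decode speed.
    total = ttft + output_tokens / decode_tps
    return ttft, total


# Hypothetical: a 30k-token prompt processed at 250 tokens/s (assumed, not measured).
ttft, total = response_time_s(prompt_tokens=30_000, output_tokens=500,
                              prefill_tps=250.0, decode_tps=30.0)
print(f"time to first token: {ttft:.0f}s, total: {total:.0f}s")
```

With a long prompt, the wait before the first token can dominate the total, no matter how good the decode speed looks on its own.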