The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use with two Mac Studios connected over thunderbolt 5 RMDA.
That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
People are running the previous Kimi K2 on 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s. Its still premature, but not a completely crazy proposition for the near future, giving the rate of progress.
If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.
I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input.
• 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s
• 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.
Barathkanna|1 month ago
bertili|1 month ago
zozbot234|1 month ago
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
YetAnotherNick|1 month ago
omneity|1 month ago
embedding-shape|1 month ago
zozbot234|1 month ago
Barathkanna|1 month ago
These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
zozbot234|1 month ago