That won’t realistically work for this model. Even with only ~32B active parameters, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of gigabytes to over a terabyte of weights kept resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and interconnect bandwidth becomes the bottleneck immediately. You could load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload from private coding models.
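Rough numbers, if it helps — a quick back-of-envelope sketch where the per-expert size and link speeds are ballpark assumptions, not measured figures:

    # Back-of-envelope: weight residency and interconnect math for a ~1T-param MoE.
    # All shapes and rates below are rough assumptions for illustration.
    TOTAL_PARAMS = 1e12  # assumed total parameter count

    for name, bytes_per_param in [("fp16", 2), ("fp8", 1), ("int4", 0.5)]:
        total_gb = TOTAL_PARAMS * bytes_per_param / 1e9
        print(f"{name}: full expert set ~{total_gb:,.0f} GB resident")

    TB4_GBPS = 40 / 8    # Thunderbolt 4: ~40 Gbit/s -> ~5 GB/s
    NVLINK_GBPS = 900    # NVLink 4 (H100 class): ~900 GB/s aggregate
    expert_gb = 2        # assumed ~2B params per expert at fp8

    print(f"Thunderbolt 4: ~{expert_gb / TB4_GBPS:.1f} s to move one expert")
    print(f"NVLink:        ~{expert_gb / NVLINK_GBPS * 1000:.1f} ms to move one expert")

Even at int4 you need ~500 GB resident, and shuffling a single expert over Thunderbolt costs on the order of half a second versus milliseconds over NVLink — per routing decision, that gap is fatal.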
NitpickLawyer|1 month ago
Keep in mind that most people posting speed benchmarks run them with essentially zero context. Those speeds will not hold at 32/64/128k context lengths.
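The main culprit is the KV cache, which grows linearly with context and has to be read on every decode step. Ballpark sizing, assuming a typical large-model shape (layer count, head dims, and fp16 cache are guesses here, not any real config):

    # Rough KV-cache sizing: memory (and bandwidth per token) grows linearly with context.
    n_layers = 60        # assumed layer count
    n_kv_heads = 8       # assumes grouped-query attention
    head_dim = 128
    bytes_per_elem = 2   # fp16 cache

    per_token = n_layers * 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    for ctx in (32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens: ~{ctx * per_token / 1e9:.1f} GB KV cache")

That’s roughly 8/16/32 GB of extra cache to stream through on every token under these assumptions, on top of the weights — which is exactly why the 0-context tokens/sec numbers don’t survive long prompts.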
zozbot234|1 month ago
Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.