top | item 38706701


indexerror | 2 years ago

Q4 comes out to ~26GB, but Apple doesn't let you load it on a 32GB Mac because they cap the max usable unified memory at ~21GB (`device.recommendedMaxWorkingSetSize`) [1]. So for Q4 Mixtral MoE you'd need a 64GB Mac machine, unfortunately.

Unless you use this hack [2].

[1] https://developer.apple.com/forums/thread/732035

[2] https://github.com/ggerganov/llama.cpp/discussions/2182#disc...
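For reference, the hack in [2] boils down to a sysctl. A minimal sketch of checking the current GPU wired-memory limit (the `iogpu.wired_limit_mb` key is per the linked llama.cpp discussion and applies to recent macOS versions; on older releases the key name may differ):

```shell
# Read the current GPU wired-memory limit in MB.
# A value of 0 means the system's default cap
# (roughly recommendedMaxWorkingSetSize) is in effect.
sysctl iogpu.wired_limit_mb
```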

endymi0n|2 years ago

There’s a brand new hybrid quantization for Mixtral out that uses 4b for shared neurons and 2b for experts, which does not bleed much perplexity, but fits it into a 32G machine. Haven’t had it in hand yet and no link here on mobile, but can’t wait to try.

astrange|2 years ago

sysctls aren't exactly a hack; they're there so you can change them.

As for why it's not the default, it's mostly because wiring all your memory will crash the computer pretty fast.
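Concretely, raising the limit looks like this (a sketch based on the linked llama.cpp discussion; the 27000 MB value is illustrative, the setting needs root, and it resets on reboot):

```shell
# Raise the GPU wired-memory limit to ~27GB on a 32GB machine.
# Leave headroom for the OS: wiring nearly all memory can hang
# or crash the machine, which is why this isn't the default.
sudo sysctl iogpu.wired_limit_mb=27000
```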