This particular one may not work on M chips, but the model itself does. I just tested a different sized version of the same model in LM Studio on a Macbook Pro, 64GB M2 Max with 12 cores, just to see.
Prompt: Create a solar system simulation in a single self-contained HTML file.
qwen3-next-80b (MLX format, 44.86 GB), 4bit 42.56 tok/sec , 2523 tokens, 12.79s to first token
- note: looked like ass, simulation broken, didn't work at all.
Then as a comparison for a model with a similar size, I tried GLM.
GLM-4-32B-0414-8bit (MLX format, 36.66 GB), 9.31 tok/sec, 2936 tokens, 4.77s to first token
- note: looked fantastic for a first try, everything worked as expected.
Not a fair comparison 4bit vs 8bit but some data. The tok/sec on Mac is pretty good depending on the models you use.
I haven't tested on Apple machines yet, but gpt-oss and qwen3-next should work I assume. Llama3 versions use cuda specific loading logic for speed boost, so it won't work for sure
Not an expert in machine learning, but AFAIK diffusion models use a completely different architecture, therefore you can't use the same code to run optimized versions of both. But maybe the core ideas can be adapted to diffusion somehow.
1tok/2s is the best I got on my PC, thanks to MoE architecture of qwen3-next-80B. gpt-oss-20B is slower because I load all single layer experts to GPU and unpack weights (4bit -> bf16) each time. While with qwen3-next I load only active experts (normally 150 out of 512). Probably I could do the same with gpt-oss.
There's one more exciting thing about Qwen3-next (except, efficient MoE architecture and fast linear attention) - MTP (Multi token prediction). It is the additional layer that allows generating more tokens without the need to go through the model again. I'm trying to make it work, but unsuccesfully yet. Maybe someone could help me with it - https://github.com/Mega4alik/ollm/blob/dev/src/ollm/qwen3_ne... (dev brunch). Take a look
CPU is much slower than GPU. You can actually use both by offloading some layers to CPU as o.offload_layers_to_cpu(layers_num=12). It is faster to load from RAM than from SSD.
cahaya|5 months ago
poorman|5 months ago
tripplyons|5 months ago
mlx_lm.server --model mlx-community/Qwen3-Next-80B-A3B-Instruct-4bit --trust-remote-code --port 4444
I'm not sure if there is support for Qwen3-Next in any releases yet, but when I set up the python environment I had to install mlx_lm from source.
mhuffman|5 months ago
Prompt: Create a solar system simulation in a single self-contained HTML file.
qwen3-next-80b (MLX format, 44.86 GB), 4bit 42.56 tok/sec , 2523 tokens, 12.79s to first token
- note: looked like ass, simulation broken, didn't work at all.
Then as a comparison for a model with a similar size, I tried GLM.
GLM-4-32B-0414-8bit (MLX format, 36.66 GB), 9.31 tok/sec, 2936 tokens, 4.77s to first token
- note: looked fantastic for a first try, everything worked as expected.
Not a fair comparison 4bit vs 8bit but some data. The tok/sec on Mac is pretty good depending on the models you use.
jasonjmcghee|5 months ago
And it'll run at like 40t/s depending on which one you have
anuarsh|5 months ago
ydlr|5 months ago
anuarsh|5 months ago
addandsubtract|5 months ago
GTP|5 months ago
anuarsh|5 months ago
mendeza|5 months ago
anuarsh|5 months ago
anuarsh|5 months ago
aappleby|5 months ago
anuarsh|5 months ago