top | item 47162364

vLLM-mlx – 65 tok/s LLM inference on Mac with tool calling and prompt caching

3 points | raullen | 4 days ago | github.com

1 comment


raullen|4 days ago

I've been working on a fork of vllm-mlx (OpenAI-compatible LLM server for Apple Silicon) to make it actually usable for coding agents. The upstream project is great but was missing production-grade tool calling, reasoning separation, and multi-turn performance.

  What I added (37 commits):

  - Tool calling that works — streaming + non-streaming, supports MiniMax and Hermes/Qwen3 formats. 4/4 accuracy on structured function calling benchmarks.
  - Reasoning separation — MiniMax-M2.5 mixes reasoning into its output with no tags. Built a heuristic parser that cleanly separates reasoning from content (0% leak rate, was 60% with the generic parser).
  - Prompt cache for SimpleEngine — persistent KV cache across requests. On 33K-token coding agent contexts: TTFT goes from 28s to 0.3s on a cache hit. This is the single biggest improvement for multi-turn use.
  - 1500+ tests — parsers, engine, server, tool calling. The upstream had minimal test coverage.
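  For anyone wiring this into an agent: since the server is OpenAI-compatible, tool calling should look like the standard chat-completions tools/tool_calls schema. A rough sketch of the round trip (the function name, parameters, and response payload below are made up for illustration, not taken from the repo):

  ```python
  import json

  # Illustrative tool definition in the OpenAI chat-completions "tools" schema.
  # The function name and parameters here are hypothetical.
  tools = [{
      "type": "function",
      "function": {
          "name": "read_file",
          "description": "Read a file from the workspace",
          "parameters": {
              "type": "object",
              "properties": {"path": {"type": "string"}},
              "required": ["path"],
          },
      },
  }]

  # A non-streaming assistant turn, shaped the way an OpenAI-compatible server
  # returns it after the tool-call parser extracts the call from model output.
  response_message = {
      "role": "assistant",
      "content": None,
      "tool_calls": [{
          "id": "call_0",
          "type": "function",
          "function": {
              "name": "read_file",
              "arguments": json.dumps({"path": "src/main.py"}),
          },
      }],
  }

  # Client side: pull out the structured call. Per the OpenAI schema,
  # "arguments" is a JSON string, so it needs one more decode step.
  call = response_message["tool_calls"][0]["function"]
  args = json.loads(call["arguments"])
  print(call["name"], args["path"])
  ```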

  Benchmarks (Mac Studio M3 Ultra, 256GB):

  Qwen3-Coder-Next-6bit (80B MoE, 3B active):
  - Decode: 65 tok/s
  - Prefill: 1090-1440 tok/s
  - TTFT (cache hit, 33K context): 0.3s

  MiniMax-M2.5-4bit (229B MoE):
  - Decode: 33-38 tok/s
  - Deep reasoning with tool calling
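  The Qwen3 numbers are internally consistent, which is a good sign: prefilling a 33K-token context at the quoted prefill speeds should take roughly the cold-start TTFT, and the cache hit skips almost all of it. Quick sanity check (my arithmetic, not from the post):

  ```python
  # Sanity-check the quoted Qwen3-Coder-Next numbers.
  context_tokens = 33_000

  # Expected cold-start TTFT at the quoted prefill throughput range.
  cold_ttft_fast = context_tokens / 1440  # fastest quoted prefill, tok/s
  cold_ttft_slow = context_tokens / 1090  # slowest quoted prefill, tok/s
  print(f"expected cold TTFT: {cold_ttft_fast:.1f}-{cold_ttft_slow:.1f}s")
  # ~22.9-30.3s, which brackets the measured 28s cold TTFT.

  # Measured cold vs cache-hit TTFT.
  speedup = 28 / 0.3
  print(f"cache-hit TTFT speedup: ~{speedup:.0f}x")
  ```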

  I built this to run OpenClaw locally on my Mac instead of paying for cloud APIs. Qwen3-Coder-Next at 65 tok/s with tool calling is genuinely usable — not a toy demo.

  Quick start:

  pip install git+https://github.com/raullenchai/vllm-mlx.git
  python -m vllm_mlx.server \
    --model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
    --tool-call-parser hermes --port 8000

  GitHub: https://github.com/raullenchai/vllm-mlx