rahimnathwani | 17 hours ago
First, make sure enough memory is allocated to the GPU:
sudo sysctl -w iogpu.wired_limit_mb=24000
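The 24000 figure depends on how much RAM the machine has. As a rough sketch, the limit can be derived from total memory minus some headroom for macOS itself. The `wired_limit_mb` helper and the 8192 MB default reserve are assumptions for illustration, not values from this thread, and a value set with `sysctl -w` normally does not survive a reboot.

```shell
# Sketch: pick a wired limit that leaves some RAM for the OS.
# The 8192 MB default reserve is an assumed value; adjust to taste.
wired_limit_mb() {
  total_mb=$1
  reserve_mb=${2:-8192}
  echo $(( total_mb - reserve_mb ))
}

# On macOS, hw.memsize reports physical RAM in bytes.
if [ "$(uname)" = "Darwin" ]; then
  total_mb=$(( $(sysctl -n hw.memsize) / 1024 / 1024 ))
  echo "current limit:   $(sysctl -n iogpu.wired_limit_mb) MB"
  echo "suggested limit: $(wired_limit_mb "$total_mb") MB"
fi
```

On a 32 GB machine this suggests 24576, close to the 24000 used above.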
Then run llama.cpp but reduce RAM needs by limiting the context window and turning off vision support. (And turn off reasoning for now as it's not needed for simple queries.)
llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
--jinja \
--no-mmproj \
--no-warmup \
-np 1 \
-c 8192 \
-b 512 \
--chat-template-kwargs '{"enable_thinking": false}'
You can also enable/disable thinking on a per-request basis:
curl 'http://localhost:8080/v1/chat/completions' \
--data-raw '{"messages":[{"role":"user","content":"hello"}],"stream":false,"return_progress":false,"reasoning_format":"auto","temperature":0.8,"max_tokens":-1,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"xtc_probability":0,"xtc_threshold":0.1,"typ_p":1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"chat_template_kwargs": { "enable_thinking": true }}'|jq .
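Most of the fields in that request are sampler settings, so a shorter request that varies only `chat_template_kwargs` may be easier to experiment with. This is a minimal sketch: `chat_body` is a hypothetical helper, it assumes the server falls back to its own defaults for any omitted sampler fields, and it does not escape quotes in the prompt.

```shell
# Build a minimal request body; only enable_thinking changes between calls.
# Note: the prompt is inserted as-is, so it must not contain double quotes
# or backslashes.
chat_body() {
  prompt=$1
  thinking=$2   # true or false
  printf '{"messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}' \
    "$prompt" "$thinking"
}

# Usage (assumes llama-server is listening on localhost:8080 as above):
# curl -s 'http://localhost:8080/v1/chat/completions' \
#   -H 'Content-Type: application/json' \
#   --data-raw "$(chat_body hello false)" | jq -r '.choices[0].message.content'
```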
If anyone has any better suggestions, please comment :)
suprjami | 4 hours ago
Many user benchmarks report up to 30% lower memory usage and up to 50% higher token generation speed with MLX:
https://reddit.com/r/LocalLLaMA/comments/1fz6z79/lm_studio_s...
As the post says, LM Studio has an MLX backend which makes it easy to use.
If you still want to stick with llama-server and GGUF, look at llama-swap, which lets you run a single frontend that lists the available models and dynamically starts a llama-server process with the right model:
https://github.com/mostlygeek/llama-swap
(actually you could run any OpenAI-compatible server process with llama-swap)
rahimnathwani | 4 hours ago
Regarding MLX, I haven't tried it with this model. Does it work with unsloth dynamic quantization? I looked at mlx-community and found this one, but I'm not sure how it was quantized. The weights are about the same size as unsloth's 4-bit XL model: https://huggingface.co/mlx-community/Qwen3.5-35B-A3B-4bit/tr...