For anyone else trying to run this on a Mac with 32GB unified RAM, this is what worked for me:
First, raise the limit on how much memory the GPU is allowed to wire (24000 MB of the 32 GB here):

  sudo sysctl -w iogpu.wired_limit_mb=24000
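The 24000 figure leaves roughly 8 GB for macOS itself. A minimal sketch of that arithmetic (the 8192 MB headroom is my assumption, not a fixed rule; on a Mac you'd get total_mb from `sysctl -n hw.memsize`):

```shell
# total RAM in MB for a 32 GB Mac; on macOS: total_mb=$(( $(sysctl -n hw.memsize) / 1048576 ))
total_mb=32768
# leave ~8 GB wired-memory headroom for the OS (assumed margin)
limit_mb=$(( total_mb - 8192 ))
echo "iogpu.wired_limit_mb=${limit_mb}"
```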
Then run llama.cpp but reduce RAM needs by limiting the context window and turning off vision support. (And turn off reasoning for now as it's not needed for simple queries.)
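Concretely, that can look something like the sketch below. MODEL.gguf is a placeholder, and the flag names are from recent llama.cpp builds; double-check against `llama-server --help` for your version:

  # -c 8192               limit the context window (smaller KV cache)
  # --no-mmproj           don't load the vision projector
  # --reasoning-budget 0  disable thinking (for models that support it)
  llama-server -m MODEL.gguf -c 8192 --no-mmproj --reasoning-budget 0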
You can also enable/disable thinking on a per-request basis. If anyone has any better suggestions, please comment :)

rahimnathwani|1 day ago
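The per-request thinking toggle mentioned above can be sketched as a chat-completions request like the one below. This assumes llama-server was started with --jinja and a model whose chat template honors enable_thinking (e.g. Qwen3-style templates); the port and payload are illustrative:

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "messages": [{"role": "user", "content": "Hello"}],
          "chat_template_kwargs": {"enable_thinking": false}
        }'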