Dorialexander|2 years ago
Hi. TheBloke has quantized the model: https://huggingface.co/TheBloke/MonadGPT-GGUF You may be able to run the Q3 or Q4 variant, although in my experience quantization quality takes a hit on "weirder" data (which is the case here).

SushiHippie|2 years ago
As the model is very small, you should be able to run any quantization level on an M-series MacBook with at least 16GB of RAM. The best one speed/quality-wise will probably be Q6_K: it differs little from Q8 in quality but is definitely faster. Haven't tried this one specifically, but I always run 7B-parameter models on an M2 Pro with Q6_K or Q4_K_M (depending on how fast I want it). See also this table in the readme, which states that Q8 only needs ~10GB of RAM: https://huggingface.co/TheBloke/MonadGPT-GGUF?text=Hey+my+na...

schmeichel|2 years ago
https://github.com/ggerganov/llama.cpp/ https://huggingface.co/TheBloke/MonadGPT-GGUF
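As a rough sanity check on the ~10GB figure, GGUF file size scales with bits per weight, and RAM use is roughly the file size plus runtime overhead. A minimal sketch: the bits-per-weight averages below are approximate figures for llama.cpp quant formats, and the fixed 2GB overhead is an assumed ballpark, not a number taken from the model card.

```python
# Rough estimator for GGUF quant file size and RAM use.
# Bits-per-weight values are approximate averages for llama.cpp quant
# formats; the fixed overhead for context/buffers is an assumption.

BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}

def file_size_gb(n_params: float, quant: str) -> float:
    """Approximate GGUF file size in GB for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

def ram_needed_gb(n_params: float, quant: str, overhead_gb: float = 2.0) -> float:
    """File size plus an assumed fixed overhead for context and buffers."""
    return file_size_gb(n_params, quant) + overhead_gb

if __name__ == "__main__":
    # Mistral-7B-class models (the MonadGPT base) have ~7.24B parameters.
    for q in BITS_PER_WEIGHT:
        print(f"{q:7s} ~{file_size_gb(7.24e9, q):.1f} GB file, "
              f"~{ram_needed_gb(7.24e9, q):.1f} GB RAM")
```

With these numbers Q8_0 comes out to roughly a 7.7GB file and ~10GB of RAM, consistent with the readme table, and every level fits comfortably in 16GB.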