top | item 46818395

(no title)

chid | 1 month ago

Given the high bar of entry 160VRAM GPU - is there anything practical one can use this for?

discuss

omneity|1 month ago

The model being 32B could run in <20GB VRAM with Q4 quantization (minimal loss of quality), or 80GB unquantized at full fidelity. The quoted 160GB is for their recommended evaluation settings.

There's a few pre-quantized options[0] or you can quantize it yourself with llama.cpp[1]. You can run the resulting gguf with llama.cpp `llama-cli` or `llama-server`, with LM Studio or with Ollama.

0: https://huggingface.co/models?search=cwm%20q4%20gguf

1: https://huggingface.co/spaces/ggml-org/gguf-my-repo

chid|1 month ago

I see, still a fair more VRAM than I have access to. Thanks for sharing that information.