
fbhabbed | 1 year ago

https://www.reddit.com/r/localllama is your go-to place if you want a community of like minded people interested exactly in this.

TL;DR: there are many ways to go about it.

Quick start?

Clone the llama.cpp repo, or download the .exe or Linux binary from the "Releases" section on GitHub. If you care about security, do this in a virtual machine (unless you plan to only use unquantised safetensors).
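If you'd rather build from source than grab a release binary, the usual steps look something like this. This is my own sketch, not from the comment above; it assumes you have git, make, and a C/C++ toolchain installed, and the flags may differ between llama.cpp releases:

```shell
# Build llama.cpp from source (sketch; check the repo README for current flags)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make                    # CPU-only build, produces ./main
# make LLAMA_CUBLAS=1   # alternatively: build with CUDA support for GPU offloading
```

After that, `./main --help` lists all the sampling and offloading options used below.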

Example syntax: ./llama.cpp/main -i -ins --color -c 0 --split-mode layer --keep -1 --top-k 40 --top-p 0.9 --min-p 0.02 --temp 2.0 --repeat-penalty 1.1 -n -1 --multiline-input -ngl 3 -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf

In this example, I'm running Mixtral at Q8 quantisation with 3 layers offloaded to the GPU, for about 45GB of RAM and 7GB of VRAM usage. To make sense of quants, the general rule is: pick the largest quant you can fit in your RAM.
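To put rough numbers on that rule: the RAM a GGUF model needs is approximately its file size, which you can estimate from parameter count times bits per weight. The figures below are my own ballpark assumptions (not from the comment above): Mixtral 8x7B has roughly 46.7B parameters, and Q8_0 works out to roughly 8.5 effective bits per weight once scales are included.

```shell
# Back-of-the-envelope GGUF size estimate (assumed figures, see lead-in):
#   size in GB ~= billions of params * bits per weight / 8
# You want at least that much free RAM, plus headroom for the KV cache.
params=46.7   # billions of parameters (approximate)
bpw=8.5       # effective bits per weight for Q8_0 (approximate)
awk -v p="$params" -v b="$bpw" 'BEGIN { printf "%.1f GB\n", p * b / 8 }'
```

That lands near 50GB, which matches the ~45GB RAM plus ~7GB VRAM split above once 3 layers are offloaded.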

If you go look at TheBloke's models, they all have a handy model card stating how much RAM each quantisation uses.

I tend to use GGUF versions, which run on the CPU but can have some layers offloaded to the GPU.
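A simple way to pick the -ngl value is trial and error: raise it until your VRAM is full, keeping the fastest setting. This loop is a hypothetical sketch of mine (it assumes you've already built ./main and downloaded the model file):

```shell
# Sketch: try increasing GPU offload and compare the timing lines llama.cpp
# prints at the end of each run. Assumes ./main and the model file exist.
for ngl in 0 4 8 12; do
  echo "=== -ngl $ngl ==="
  ./main -m mixtral-8x7b-instruct-v0.1.Q8_0.gguf -ngl "$ngl" -n 32 -p "Hi" 2>&1 \
    | grep "eval time"
done
```

If a run crashes with an out-of-memory error, back off to the previous -ngl value.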

I definitely recommend reading the https://github.com/ggerganov/llama.cpp documentation.
