With ollama you can offload some layers to the CPU if the model doesn't fit in VRAM. This costs some performance, of course, but it's much better than the alternative (everything on the CPU).
I'm doing that with a 12 GB card; ollama supports it out of the box.
For some reason it only uses around 7 GB of VRAM, probably due to how the layers are scheduled. Maybe I could tweak something there, but I didn't bother just for testing.
Obviously, performance depends on CPU, GPU, and RAM, but on my machine (3060 + i5-13500) it's around 2 t/s.
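For anyone who does want to tweak the split: a minimal sketch, assuming a recent ollama where `num_gpu` (the number of layers kept in VRAM) is the knob; the base model and layer count here are illustrative, not from the comment above:

```
# Modelfile — num_gpu caps how many layers ollama loads onto the GPU.
# 28 is an illustrative value; raise it until VRAM fills up.
FROM llama3:8b
PARAMETER num_gpu 28
```

Then `ollama create llama3-partial -f Modelfile` and run that model as usual; the same option can also be set per request via the API's `options.num_gpu`, or interactively with `/set parameter num_gpu 28` inside `ollama run`.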
senko|10 months ago
dockerd|10 months ago