top | item 38699596

ghoomketu | 2 years ago

I recently downloaded Ollama on my Linux machine, and even with a 3060 12 GB GPU and 24 GB of RAM I'm unable to run Mistral or Dolphin and always get an out-of-memory error. So it's amazing that these companies are able to scale this so well, handling thousands of requests per minute.

I wish they would do a behind-the-scenes on how much money, time, and optimisation goes into making this all work.

Also, big fan of Anyscale. Their pricing is phenomenal for running models like Mixtral. Not sure how they're so affordable.

M4v3R|2 years ago

You need to pick the correct model size and quantization for the amount of GPU RAM you have. For any given model, don't download the default file; instead, go to the Tags section on Ollama's page and pick a quantization whose size in GB is at most 2/3 of your available GPU RAM, and it should work. For example, in your case Mistral-7B q4_0 and even q8_0 should work perfectly.
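A rough back-of-envelope for that rule of thumb (a sketch of my own, not anything from Ollama; the bits-per-weight figures are approximations for llama.cpp quant formats, and real GGUF files also carry some fp16 tensors and metadata):

```python
# Rough estimate of quantized model size vs. a GPU RAM budget.
# Bits-per-weight values are approximate for llama.cpp quant formats.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q5_K_M": 5.5, "q8_0": 8.5, "f16": 16.0}

def model_size_gb(n_params_billion: float, quant: str) -> float:
    """Approximate in-memory size of the weights in GB."""
    return n_params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

def fits(n_params_billion: float, quant: str, gpu_ram_gb: float) -> bool:
    # The 2/3 rule: leave ~1/3 of RAM free for KV cache and activations.
    return model_size_gb(n_params_billion, quant) <= gpu_ram_gb * 2 / 3

# Mistral-7B has ~7.24B parameters; a 3060 has 12 GB of VRAM.
print(round(model_size_gb(7.24, "q4_0"), 1))  # ~4.1 GB
print(fits(7.24, "q4_0", 12.0))               # True
print(fits(7.24, "q8_0", 12.0))               # True, just under the 8 GB budget
```

That spare third is what the KV cache and activations eat, which is the intuition behind not filling VRAM with weights alone.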

swyx|2 years ago

What's the intuition for 2/3 of RAM?

ilaksh|2 years ago

Try https://github.com/ggerganov/llama.cpp

It builds very quickly with make. But if it's slow when you try it, make sure to enable the CUDA-related build flags and then rebuild.

A key parameter is the one that tells it how many layers to offload to the GPU; I think it's `-ngl`.

Also, download the 4-bit GGUF from HuggingFace and try that. It uses much less memory.
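One crude way to pick a value for `-ngl` (my own sketch, not llama.cpp's logic) is to assume layers are roughly equal in size and offload as many as fit in free VRAM after reserving some headroom:

```python
import math

def n_gpu_layers(model_gb: float, n_layers: int, free_vram_gb: float) -> int:
    """Crude -ngl estimate: treat all layers as equally sized and
    reserve ~1.5 GB of VRAM for the KV cache and scratch buffers."""
    per_layer_gb = model_gb / n_layers
    budget_gb = max(free_vram_gb - 1.5, 0.0)
    return min(n_layers, math.floor(budget_gb / per_layer_gb))

# Mistral-7B q4_0 is ~4.1 GB spread over 32 transformer layers.
print(n_gpu_layers(4.1, 32, 12.0))  # 32: the whole model fits on a 3060
print(n_gpu_layers(4.1, 32, 4.0))   # 19: partial offload on a smaller card
```

In practice you'd nudge the number down if you hit OOM, or up if `nvidia-smi` shows VRAM to spare.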

avereveard|2 years ago

With llama.cpp and a 12 GB 3060 you can fit an entire Mistral model at Q5_K_M in VRAM with the full 32k context. I recommend openhermes-2.5-mistral-7b-16k with USER:/ASSISTANT: instructions; it's working surprisingly well for content production (let's say everything except logic and math, but those aren't the strong suit of 7B models in general).

mgreg|2 years ago

Some details that might interest you from SemiAnalysis [1], published just yesterday. There's quite a bit that goes into optimizing inference, with lots of dials to turn. One thing that does seem to have a large impact is batch size, which is a benefit of scale.

1. https://www.semianalysis.com/p/inference-race-to-the-bottom-...

TheMatten|2 years ago

I can run a (quantized) Mistral-7B reasonably well on a 16 GB machine without a GPU, using Ollama. Are you sure it isn't a configuration error or a bug?

ilaksh|2 years ago

How many tokens per second, and what are the specs of the machine? My attempts at CPU-only inference have been really slow.
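For CPU-only decoding, a common rule of thumb (an upper bound, not a benchmark) is that single-stream generation is memory-bandwidth-bound: each generated token streams the full weight file through memory once, so tokens/s ≈ bandwidth / model size. The hardware numbers below are illustrative assumptions:

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound for single-stream decoding: every token reads
    all of the weights from memory once."""
    return bandwidth_gb_s / model_gb

# Illustrative: dual-channel DDR4-3200 (~50 GB/s) vs. a 3060 (~360 GB/s),
# with a ~4.1 GB q4_0 Mistral-7B.
print(round(max_tokens_per_sec(50, 4.1)))   # ~12 tok/s ceiling on CPU
print(round(max_tokens_per_sec(360, 4.1)))  # ~88 tok/s ceiling on GPU
```

That bandwidth gap, more than raw FLOPS, is why CPU-only generation feels an order of magnitude slower.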

ignoramous|2 years ago

> optimisation is done to make this all work

Obviously still a nascent area, but https://lmsys.org/blog does a good job of diving into the engineering challenges behind running these LLMs.

(I'm sure there are others)

idonotknowwhy|2 years ago

You can run a 7B Q4 model in your 12 GB of VRAM, no problem.