
mchiang | 4 months ago

Sorry, I don't use 4chan, so I don't know what's said there.

May I ask what system you are using where you are getting memory estimations wrong? This is an area Ollama has been working on and has improved quite a bit.

The latest version of Ollama is 0.12.5, with a pre-release of 0.12.6 available.

0.7.1 is 28 versions behind.


thot_experiment | 4 months ago

I recently tested every version from 0.7 to 0.11.1 trying to run q5 mistral-3.1 on a system with 48GB of available vram across 2 GPUs. Everything past 0.7.0 gave me OOM or other errors. Now that I've migrated back to llama.cpp I'm not particularly interested in fucking around with ollama again.

As for 4chan, they've hated Ollama for a long time because it was built on top of llama.cpp and then didn't contribute upstream or give credit to the original project.

mchiang | 4 months ago

Ah! This must have been downloaded from elsewhere and not from Ollama? So sorry about this.

To help future optimizations for given quantizations, we have been trying to limit the quantizations to ones that fit for the majority of users.

In the case of mistral-small3.1, Ollama supports ~4-bit (q4_K_M), ~8-bit (q8_0), and fp16.

https://ollama.com/library/mistral-small3.1/tags
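For reference, a specific quantization is selected with Ollama's `model:tag` syntax; a minimal sketch, assuming the q8_0 tag name shown matches what the tags page above actually lists (check the page for the exact tag strings):

```shell
# Default tag pulls the ~4-bit build (q4_K_M for this model):
ollama pull mistral-small3.1

# Pull a specific quantization by its tag from the library's tags page
# (tag name here is illustrative -- verify it on the page before pulling):
ollama pull mistral-small3.1:24b-instruct-2503-q8_0
```

Quantizations outside the listed tags (e.g. a q5 GGUF) would have to come from elsewhere, which is the mismatch being discussed above.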

I'm hopeful that in the future, more and more model providers will help optimize for given model quantizations - 4-bit (e.g. NVFP4, MXFP4), 8-bit, and a 'full' model.