Not sure if this helps, but this is from tinkering with Mistral 7B on both my M1 Pro (10-core, 16 GB RAM) and WSL 2 w/ CUDA (Acer Predator 17: i7-7700HQ, GTX 1070 Mobile, 16 GB DRAM, 8 GB VRAM).
- Got 15-18 tokens/sec on WSL 2, slightly higher on the M1. At a rough 0.75 words per token, that works out to about 11-13 words per second. Both were using the GPU. Haven't tried CPU on the M1, but on WSL 2 it was low single digits, far too slow for anything productive.
- Used Mistral 7B via llamafile's cross-platform APE executable.
- For local use I found that increasing the context size increases RAM usage a lot, but it's fast enough. I'm considering adding another 16x1 or 8x2. Tinkering with building a RAG over some of my documents using vector stores and chaining multiple calls now; rough sketch below.
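Here's a rough sketch of the kind of chain I mean, in Python. Assumptions on my part: the llamafile server is running locally (by default it exposes an OpenAI-compatible endpoint on port 8080), and TF-IDF is standing in for a proper vector store / embedding model just to keep the example self-contained:

    import requests
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus standing in for "my documents"
    docs = [
        "Mistral 7B ran at roughly 15-18 tokens/sec on both machines.",
        "Increasing the context size drives up RAM usage significantly.",
        "llamafile ships the model and server as one cross-platform APE binary.",
    ]

    def retrieve(query, docs, k=2):
        # TF-IDF + cosine similarity as a stand-in for a real vector store
        vec = TfidfVectorizer().fit(docs + [query])
        scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
        return [docs[i] for i in scores.argsort()[::-1][:k]]

    def ask(prompt):
        # llamafile serves an OpenAI-compatible API on localhost:8080 by default
        r = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "mistral-7b",  # name is ignored by a single-model server
                  "messages": [{"role": "user", "content": prompt}]},
        )
        return r.json()["choices"][0]["message"]["content"]

    # Chain: retrieve context, then generate an answer grounded in it
    query = "How does context size affect memory?"
    context = "\n".join(retrieve(query, docs))
    print(ask(f"Using only this context:\n{context}\n\nAnswer: {query}"))

Chaining a second call (e.g., summarize first, then answer) works the same way: you just feed one completion into the next prompt.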
spxneo|1 year ago
coming from ChatGPT-4 it was a huge breath of fresh air to not deal with the judeo-christian biased censorship.
i think this is the ideal localllama setup--uncensored, unbiased, unlimited (except by hardware) LLM+RAG
prosunpraiser|1 year ago
Tried open-webui yesterday with Ollama for spinning up some of these. It’s pretty good.
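If you'd rather script against it instead of (or alongside) the open-webui frontend, Ollama's REST API is easy to hit from Python. This assumes the default port 11434 and that you've already pulled the model (e.g. `ollama pull mistral`):

    import requests

    # Ollama listens on localhost:11434 by default;
    # stream=False returns a single JSON object instead of a stream of chunks
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Explain RAG in one sentence.", "stream": False},
    )
    print(r.json()["response"])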