eole666 | 1 year ago

Looks nice! But some information about hardware requirements is often missing in this kind of project:

- how much RAM is needed

- what CPU you need for decent performance

- can it run on a GPU? And if it does, how much VRAM do you need, and does it work only on Nvidia?

prosunpraiser | 1 year ago

Not sure if this helps, but this is from tinkering with Mistral 7B on both my M1 Pro (10-core, 16 GB RAM) and WSL 2 w/ CUDA (Acer Predator 17, i7-7700HQ, GTX 1070 Mobile, 16 GB RAM, 8 GB VRAM):

- Got 15-18 tokens/sec on WSL 2, slightly higher on the M1. Think of that as roughly 10-15 words per second. Both were using the GPU. Haven't tried CPU on the M1, but on WSL 2 it was low single digits, too slow for anything productive.

- Used Mistral 7B via the llamafile cross-platform APE executable.

- For local use I found that increasing the context size increased RAM usage a lot (see the sketch below for why), but it's fast enough. I am considering adding another 16 GB (1x16 or 2x8).

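A rough sketch of why context size drives RAM so hard: every token in the context keeps key/value tensors for every layer. The numbers below assume Mistral 7B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dim 128) and an fp16 cache; quantized KV caches and runtime overhead will shift them.

    # Rough KV-cache estimate, assuming Mistral 7B's published architecture:
    # 32 layers, 8 KV heads (grouped-query attention), head dim 128, fp16 cache.
    N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 32, 8, 128, 2

    def kv_cache_gib(context_tokens: int) -> float:
        # 2x for keys and values, per layer, per KV head, per head dimension.
        per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
        return per_token * context_tokens / 2**30

    for ctx in (2048, 8192, 32768):
        print(f"{ctx:>6}-token context -> {kv_cache_gib(ctx):.2f} GiB KV cache")

That is on top of the weights themselves (a 4-bit quant of a 7B model is roughly 4 GB on its own), which is why 16 GB starts to feel tight at long contexts.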
Now tinkering with building a RAG over some of my documents, using vector stores and chaining multiple calls (rough sketch of the idea below).

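For anyone curious what that looks like, here is a minimal sketch of the pattern: naive word-overlap retrieval over document chunks, with the top matches stuffed into a prompt sent to a locally running llamafile. The localhost:8080 endpoint assumes llamafile's llama.cpp-style OpenAI-compatible server; a real setup would swap the toy retrieval for an embedding-based vector store.

    import json
    import urllib.request

    # Toy corpus standing in for real documents; chunks would normally come
    # from splitting files and would live in a proper vector store.
    CHUNKS = [
        "Dot requires CUDA toolkit 12.2 and an Nvidia GPU for acceleration.",
        "Mistral 7B runs at 15-18 tokens/sec on a GTX 1070 Mobile under WSL 2.",
        "Increasing the context size raises RAM usage because of the KV cache.",
    ]

    def retrieve(question: str, k: int = 2) -> list[str]:
        """Naive retrieval: rank chunks by word overlap with the question."""
        q = set(question.lower().split())
        return sorted(CHUNKS, key=lambda c: -len(q & set(c.lower().split())))[:k]

    def ask(question: str) -> str:
        context = "\n".join(retrieve(question))
        payload = {
            "model": "local",  # llamafile serves whichever model it was started with
            "messages": [
                {"role": "system", "content": "Answer using this context:\n" + context},
                {"role": "user", "content": question},
            ],
        }
        req = urllib.request.Request(
            "http://localhost:8080/v1/chat/completions",  # assumed local server address
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    print(ask("How fast is Mistral 7B on a GTX 1070?"))

Chaining multiple calls is then just a matter of feeding one response into the next prompt.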
spxneo | 1 year ago

How does the 7B match up to Mixtral 8x7B?

Coming from ChatGPT-4, it was a huge breath of fresh air to not deal with the Judeo-Christian-biased censorship.

I think this is the ideal LocalLLaMA setup: an uncensored, unbiased, unlimited (only by hardware) LLM+RAG.

alexpinel | 1 year ago

Right now the minimum amount of RAM I would recommend is 16 GB. It can probably run with less, but that would require a few changes here and there, and they might reduce performance. I would also strongly recommend using a GPU over the CPU; in my experience it can make the LLM run twice as fast, if not more. Only Nvidia GPUs are supported for now, and CUDA toolkit 12.2 is required to run Dot.
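
As a quick sanity check against those requirements, a sketch like the one below reports the installed CUDA toolkit version and per-GPU memory. It shells out to nvcc and nvidia-smi (both assumed to be on PATH), and the parsing leans on nvcc's usual "release X.Y" output line.

    import shutil
    import subprocess

    def cuda_toolkit_version() -> str | None:
        """Return the CUDA toolkit release (e.g. '12.2') reported by nvcc."""
        if shutil.which("nvcc") is None:
            return None
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
        # nvcc prints a line like: "Cuda compilation tools, release 12.2, V12.2.140"
        words = out.replace(",", "").split()
        return words[words.index("release") + 1] if "release" in words else None

    def total_vram() -> list[str]:
        """Per-GPU total memory from nvidia-smi, e.g. ['8192 MiB']."""
        if shutil.which("nvidia-smi") is None:
            return []
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout
        return [line.strip() for line in out.splitlines() if line.strip()]

    print("CUDA toolkit:", cuda_toolkit_version() or "not found")
    print("GPU memory:", total_vram() or "no Nvidia GPU detected")

On the GTX 1070 setup mentioned upthread, this should report something like '8192 MiB' for the GPU memory line.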