Not sure if this helps, but this is from tinkering with Mistral 7B on both my M1 Pro (10-core, 16 GB RAM) and WSL 2 w/ CUDA (Acer Predator 17: i7-7700HQ, GTX 1070 Mobile, 16 GB DRAM, 8 GB VRAM).
- Got 15-18 tokens/sec on WSL 2, slightly higher on the M1. At a rough 0.75 words per token, that works out to about 11-13 words per second. Both were using the GPU. Haven't tried CPU on the M1, but on WSL 2 it was low single digits, far too slow for anything productive.
- Used Mistral 7B via llamafile's cross-platform APE executable.
- For local use I found that increasing the context size increases RAM usage a lot, but it's fast enough. I'm considering adding another 16x1 or 8x2. Tinkering with building a RAG over some of my documents using vector stores and chaining multiple calls now; rough sketch below.
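Here's a rough sketch of the kind of chain I mean, in Python. Assumptions on my part: the llamafile server is running locally (by default it exposes an OpenAI-compatible endpoint on port 8080), and TF-IDF is standing in for a proper vector store / embedding model just to keep the example self-contained:

    import requests
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy corpus standing in for "my documents"
    docs = [
        "Mistral 7B ran at roughly 15-18 tokens/sec on both machines.",
        "Increasing the context size drives up RAM usage significantly.",
        "llamafile ships the model and server as one cross-platform APE binary.",
    ]

    def retrieve(query, docs, k=2):
        # TF-IDF + cosine similarity as a stand-in for a real vector store
        vec = TfidfVectorizer().fit(docs + [query])
        scores = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
        return [docs[i] for i in scores.argsort()[::-1][:k]]

    def ask(prompt):
        # llamafile serves an OpenAI-compatible API on localhost:8080 by default
        r = requests.post(
            "http://localhost:8080/v1/chat/completions",
            json={"model": "mistral-7b",  # name is ignored by a single-model server
                  "messages": [{"role": "user", "content": prompt}]},
        )
        return r.json()["choices"][0]["message"]["content"]

    # Chain: retrieve context, then generate an answer grounded in it
    query = "How does context size affect memory?"
    context = "\n".join(retrieve(query, docs))
    print(ask(f"Using only this context:\n{context}\n\nAnswer: {query}"))

Chaining a second call (e.g., summarize first, then answer) works the same way: you just feed one completion into the next prompt.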
spxneo|1 year ago
coming from ChatGPT-4 it was a huge breath of fresh air to not deal with the judeo-christian biased censorship.
i think this is the ideal localllama setup--uncensored, unbiased, unlimited (except by hardware) LLM+RAG
prosunpraiser|1 year ago
Tried open-webui yesterday with Ollama for spinning up some of these. It’s pretty good.
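If you'd rather script against it instead of (or alongside) the open-webui frontend, Ollama's REST API is easy to hit from Python. This assumes the default port 11434 and that you've already pulled the model (e.g. `ollama pull mistral`):

    import requests

    # Ollama listens on localhost:11434 by default;
    # stream=False returns a single JSON object instead of a stream of chunks
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": "Explain RAG in one sentence.", "stream": False},
    )
    print(r.json()["response"])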