> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.
Then this is great.
If your goal is
> Run and explore Llama models locally with minimal dependencies on CPU
then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
Ollama (also wrapping llama.cpp) has GPU support; unless you're really in love with the idea of bundling weights into the inference executable, it's probably a better choice for most people.
A great place to start is with the LLaMA 3.2 q6 llamafile I posted a few days ago. https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct-llamafi... We have a new CLI chatbot interface that's really fun to use. Syntax highlighting and all. You can also use GPU by passing the -ngl 999 flag.
> then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.
First time that I've had an "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!
With the same mindset, but without even PyTorch as dependency there's a straightforward CPU implementation of llama/gemma in Rust: https://github.com/samuel-vitorino/lm.rs/
It's impressive to realize how little code is needed to run these models at all.
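It really is a small amount of code once the weights are loaded. As a rough illustration (not taken from lm.rs or any of the linked repos), the outer generation loop of greedy decoding is only a few lines once you have a forward function — here `logits` is a stand-in stub, not a real model:

```python
import numpy as np

def logits(tokens):
    # Stand-in for a real transformer forward pass: returns one score
    # per vocabulary entry. In lm.rs or llama.cpp, this is where all
    # the matmuls over the weights actually happen.
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(32)  # toy vocabulary of 32 tokens

def generate(prompt, n_new, eos=0):
    tokens = list(prompt)
    for _ in range(n_new):
        next_tok = int(np.argmax(logits(tokens)))  # greedy decoding
        if next_tok == eos:
            break
        tokens.append(next_tok)
    return tokens

print(generate([1, 2, 3], 5))
```

Everything hard lives inside the forward pass; the sampling loop around it stays this simple even in production engines.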
PyTorch has a native LLM solution: torchchat. It supports all the Llama models and runs on CPU, MPS, and CUDA.
https://github.com/pytorch/torchchat
I'm getting 4.5 tokens a second with Llama 3.1 8B at full precision, CPU only, on my M1.
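That figure is close to what a memory-bandwidth-bound estimate predicts, assuming "full precision" here means 16-bit weights and ~68 GB/s of M1 memory bandwidth (both are my assumptions, not from the comment): generating each token has to stream all the weights through memory once.

```python
# Back-of-envelope: token generation is memory-bandwidth bound.
params = 8e9            # Llama 3.1 8B parameter count
bytes_per_param = 2     # assuming bf16 "full precision"
bandwidth = 68e9        # assumed M1 memory bandwidth, bytes/s

bytes_per_token = params * bytes_per_param   # weights read once per token
tokens_per_sec = bandwidth / bytes_per_token
print(f"{tokens_per_sec:.1f} tokens/sec upper bound")  # ~4.2
```

So 4.5 tok/s is roughly the ceiling you'd expect for this hardware at that precision.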
Seems like torchchat is exactly what the author was looking for.
> And the 8B model typically gets killed by the OS for using too much memory.
Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
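The memory math behind that suggestion is straightforward (illustrative numbers; real 4-bit schemes add some per-group scale/zero-point overhead):

```python
params = 8e9  # Llama 3.1 8B parameter count

fp16_gb = params * 2 / 1e9    # 16 GB of weights: gets the process killed on small machines
int8_gb = params * 1 / 1e9    # 8 GB
int4_gb = params * 0.5 / 1e9  # ~4 GB before quantization metadata overhead

print(fp16_gb, int8_gb, int4_gb)  # 16.0 8.0 4.0
```

Dropping to 4-bit is what makes an 8B model fit comfortably alongside the OS on a 16 GB (or even 8 GB) machine.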
FYI, this just imports the Llama reference implementation and patches the device.
There are more robust implementations out there.