
Run Llama locally with only PyTorch on CPU

168 points | anordin95 | 1 year ago | github.com

34 comments


yjftsjthsd-h|1 year ago

If your goal is

> I want to peel back the layers of the onion and other gluey-mess to gain insight into these models.

Then this is great.

If your goal is

> Run and explore Llama models locally with minimal dependencies on CPU

then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.

hedgehog|1 year ago

Ollama (also wrapping llama.cpp) has GPU support; unless you're really in love with the idea of bundling weights into the inference executable, it's probably a better choice for most people.

seu|1 year ago

> then I recommend https://github.com/Mozilla-Ocho/llamafile which ships as a single file with no dependencies and runs on CPU with great performance. Like, such great performance that I've mostly given up on GPU for LLMs. It was a game changer.

First time that I have a "it just works" experience with LLMs on my computer. Amazing. Thanks for the recommendation!

rmbyrro|1 year ago

Do you have a ballpark idea of how much RAM would be necessary to run llama 3.1 8b and 70b on 8-quant?
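A rough back-of-the-envelope (my own sketch, not from the thread): at 8-bit quantization each parameter takes about one byte, so parameter count in billions maps roughly to gigabytes of weights, plus some padding for the KV cache, activations, and runtime overhead. The 20% overhead factor below is an assumed figure, not a measured one:

```python
def est_ram_gb(params_billion: float, bits_per_param: int = 8,
               overhead: float = 1.2) -> float:
    """Rough RAM estimate: parameters * bytes/param, padded ~20%
    for KV cache, activations, and runtime overhead (assumed)."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

# Llama 3.1 at 8-bit quantization:
print(f"8B:  ~{est_ram_gb(8):.0f} GB")   # roughly 10 GB
print(f"70B: ~{est_ram_gb(70):.0f} GB")  # roughly 84 GB
```

So the 8B model is plausible on a 16 GB machine, while 70B at 8-bit wants something on the order of 80+ GB.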

anordin95|1 year ago

Thanks for the suggestion. I've added a link to llamafile in the repo's README. Though, my focus was on exploring the model itself.

yumraj|1 year ago

Can it use the GPU if available, say on Apple silicon Macs?

bagels|1 year ago

How great is the performance? Tokens/s?

AlfredBarnes|1 year ago

Thanks for posting this!

littlestymaar|1 year ago

With the same mindset, but without even PyTorch as dependency there's a straightforward CPU implementation of llama/gemma in Rust: https://github.com/samuel-vitorino/lm.rs/

It's impressive to realize how little code is needed to run these models at all.

Ship_Star_1010|1 year ago

PyTorch has a native LLM solution: https://github.com/pytorch/torchchat. It supports all the Llama models, on CPU, MPS, and CUDA. I'm getting 4.5 tokens a second with 3.1 8B at full precision, CPU-only, on my M1.

ajaksalad|1 year ago

> I was a bit surprised Meta didn't publish an example way to simply invoke one of these LLM's with only torch (or some minimal set of dependencies)

Seems like torchchat is exactly what the author was looking for.

> And the 8B model typically gets killed by the OS for using too much memory.

Torchchat also provides some quantization options so you can reduce the model size to fit into memory.
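Concretely (my own arithmetic sketch, not torchchat's API): quantizing weights from 16-bit floats down to 8 or 4 bits shrinks the weight footprint proportionally, which is what lets an 8B model fit on a machine where the full-precision process gets killed:

```python
def weights_gb(params_billion: float, bits: int) -> float:
    """Size of the weights alone; ignores KV cache and activations."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"8B model at {bits:>2}-bit: {weights_gb(8, bits):.0f} GB of weights")
# 16-bit: 16 GB, 8-bit: 8 GB, 4-bit: 4 GB
```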

tcdent|1 year ago

> from llama_models.llama3.reference_impl.model import Transformer

This just imports the Llama reference implementation and patches the device, FYI.

There are more robust implementations out there.

anordin95|1 year ago

Peel back the layers of the onion and other gluey-mess to gain insight into these models.