
Gemma.cpp: lightweight, standalone C++ inference engine for Gemma models

422 points | mfiguiere | 2 years ago | github.com

130 comments

[+] austinvhuang|2 years ago|reply
Hi, one of the authors, Austin, here. Happy to answer any questions as best I can.

To get a few common questions out of the way:

- This is separate / independent of llama.cpp / ggml. I'm a big fan of that project and it was an inspiration (we say as much in the README). I've been a big advocate of gguf + llama.cpp support for gemma and am happy for people to use that.

- How is it different from inference runtime X? gemma.cpp is a direct implementation of Gemma; in its current form it's aimed at experimentation + research and portability + easy modifiability rather than a general-purpose deployment framework.

- This initial implementation is CPU SIMD centric. We're exploring options for portable GPU support, but the cool thing is it will build and run in a lot of environments where you might not expect an LLM to run, so long as you have the memory to load the model.

- I'll let other colleagues answer questions about the Gemma model itself; this is a C++ implementation of the model, relatively independent of the model training process.

- Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.

- I wrote a twitter thread on this project here: https://twitter.com/austinvhuang/status/1760375890448429459

[+] leminimal|2 years ago|reply
Kudos on your release! I know this was just made available but

- Somewhere in the README, consider adding the need for a `-DWEIGHT_TYPE=hwy::bfloat16_t` flag for non-sfp. Maybe around step 3.

- The README should explicitly say somewhere that there's no GPU support (at the moment).

- "Failed to read cache gating_ein_0 (error 294)" is pretty obscure. I think even "(error at line number 294)" would be a big improvement when it fails to FindKey.

- There's something odd about the 2b vs 7b model. The 2b will claim it's trained by Google but the 7b won't. Were these trained on the same data?

- Are the .sbs weights the same weights as the GGUF? I'm getting different answers compared to llama.cpp. Do you know of a good way to compare the two? Any way to make both deterministic? Or even dump probability distributions on the first (or any) token to compare?
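For the last question, if both engines can be made to dump raw first-token logits (neither has a built-in flag for this that I know of, so assume you've patched each to write one float per line), comparing the two distributions is straightforward; a sketch:

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): 0 when identical, grows as the engines disagree
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def compare_logit_dumps(path_a, path_b):
    # path_a / path_b: text files with one logit per line, same vocab order
    with open(path_a) as fa, open(path_b) as fb:
        la = [float(x) for x in fa]
        lb = [float(x) for x in fb]
    assert len(la) == len(lb), "vocab sizes differ -- different tokenizers?"
    return kl_divergence(softmax(la), softmax(lb))
```

A KL near zero means the weights and tokenization are effectively equivalent; a persistent gap usually points at tokenizer or precision differences rather than different weights.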

[+] beoberha|2 years ago|reply
> Although this is from Google, we're a very small team that wanted such a codebase to exist. We have lots of plans to use it ourselves and we hope other people like it and find it useful.

This is really cool, Austin. Kudos to your team!

[+] rgbrgb|2 years ago|reply
Thanks for releasing this! What is your use case for this rather than llama.cpp? For the on-device AI stuff I mostly do, llama.cpp is better because of GPU/metal offloading.
[+] dankle|2 years ago|reply
What's the reason to not integrate with llama.cpp instead of a separate app? In what ways is this better than llama.cpp?
[+] moffkalast|2 years ago|reply
Cool, any plans on adding K quants, an API server and/or a python wrapper? I really doubt most people want to use it as a cpp dependency and run models at FP16.
[+] verticalscaler|2 years ago|reply
Hi Austin, what say you about how the Gemma rollout was handled, issues raised, and atmosphere around the office? :)
[+] zoogeny|2 years ago|reply
I know a lot of people chide Google for being behind OpenAI in their commercial offerings. We also dunk on them for the over-protective nature of their fine-tuning.

But Google is scarily capable on the LLM front and we shouldn't count them out. OpenAI might have the advantage of being quick to move, but when the juggernaut gets past its resting inertia and starts to gain momentum it is going to leave an impression.

That became clear to me after watching the recent Jeff Dean video [1] which was posted a few days ago. The depth of institutional knowledge that is going to be unlocked inside Google is actually frightening for me to consider.

I hope the continued competition on the open source front, which we can really thank Facebook and Llama for, keeps these behemoths sharing. As OpenAI moves further from its original mission into capitalizing on its technological lead, we have to remember why the original vision they had is important.

So thank you, Google, for this.

1. https://www.youtube.com/watch?v=oSCRZkSQ1CE&ab_channel=RiceK...

[+] swozey|2 years ago|reply
The velocity of the LLM open source ecosystem is absolutely insane.

I just got into hobby projects with diffusion a week ago and I'm seeing non-stop releases. It's hard to keep up. It's a firehose of information, acronyms, code etc.

It's been a great python refresher.

[+] austinvhuang|2 years ago|reply
Don't be discouraged, you don't have to follow everything.

In fact it's probably better to dive deep into one hobby project like you're doing than constantly context switch with every little news item that comes up.

While working on gemma.cpp there were definitely a lot of "gee, I wish I could clone myself and work on that other thing too" moments.

[+] throwaway19423|2 years ago|reply
Can any kind soul explain the difference between GGUF, GGML and all the other model packaging I am seeing these days? I was used to pth and the thing TF uses. Is this all to support inference or quantization? Who manages these formats, or are they brewing organically?
[+] austinvhuang|2 years ago|reply
I think it's mostly an organic process arising from the ecosystem.

My personal way of understanding it is this - the original sin of model weight format complexity is that NNs are both data and computation.

Representing the computation as data is the hard part and that's where the simplicity falls apart. Do you embed the compute graph? If so, what do you do about different frameworks supporting overlapping but distinct operations. Do you need the artifact to make training reproducible? Well that's an even more complex computation that you have to serialize as data. And so on..
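To illustrate the "weights are the easy part" point: a toy weights-only container (purely illustrative, not any real format) fits in a few lines, because it's just an index plus raw bytes. Everything this sketch leaves out -- which ops consume which tensors, and in what order -- is the compute graph, and that's where formats diverge:

```python
import json
import struct

def pack_weights(tensors):
    # tensors: {name: (shape_tuple, flat_list_of_floats)}
    # weights-only serialization: a JSON index plus raw float32 bytes
    header = {n: {"shape": list(s), "len": len(v)} for n, (s, v) in tensors.items()}
    hjson = json.dumps(header).encode("utf-8")
    body = b"".join(struct.pack(f"<{len(v)}f", *v) for _, v in tensors.values())
    return struct.pack("<Q", len(hjson)) + hjson + body

def unpack_weights(blob):
    (hlen,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + hlen])
    out, off = {}, 8 + hlen
    for name, meta in header.items():
        vals = list(struct.unpack_from(f"<{meta['len']}f", blob, off))
        out[name] = (tuple(meta["shape"]), vals)
        off += 4 * meta["len"]
    return out
```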

[+] moffkalast|2 years ago|reply
It's all mostly just inference, though some train LoRAs directly on quantized models too.

GGML and GGUF are the same thing; GGUF is the new version that adds more data about the model so it's easy to support multiple architectures, and it also includes prompt templates. These can run CPU-only, or be partially or fully offloaded to a GPU. With K quants, you can get anywhere from a 2-bit to an 8-bit GGUF.

GPTQ was the GPU-only optimized quantization method that was superseded by AWQ, which is roughly 2x faster and now by EXL2 which is even better. These are usually only 4 bit.

Safetensors and pytorch bin files are raw float16 model files; these are only really used for continued fine-tuning.
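For intuition about what those bit counts mean, here is a minimal sketch of block-wise symmetric quantization -- the general idea behind Q8-style quants, not llama.cpp's exact scheme: store one float scale per block plus a small integer per weight.

```python
def quantize_q8(block):
    # symmetric 8-bit quantization of one block of floats:
    # one float scale per block, one int in [-127, 127] per value
    amax = max(abs(x) for x in block)
    scale = amax / 127 if amax > 0 else 1.0
    return scale, [round(x / scale) for x in block]

def dequantize_q8(scale, quants):
    # reconstruction error is bounded by about half a quantization step
    return [scale * q for q in quants]
```

Lower bit widths shrink the integers (and pack several per byte) at the cost of coarser rounding; the cleverness in K-quants, GPTQ, AWQ, and EXL2 is mostly in choosing scales and deciding which weights deserve more bits.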

[+] liuliu|2 years ago|reply
pth can include Python code (PyTorch code) for inference. TF includes the complete static graph.

GGUF is just weights; safetensors is the same thing. GGUF doesn't need a JSON decoder for the format, while safetensors does.

I personally think having a JSON decoder is not a big deal, and it makes the format more amenable to change, given GGUF evolves too.
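For context, the JSON decoder in question handles a very small surface: a safetensors file is an 8-byte little-endian header length, then a JSON header mapping tensor names to dtype/shape/byte offsets, then raw tensor data. A minimal sketch:

```python
import json
import struct

def parse_safetensors_header(blob):
    # layout: u64 little-endian header length, JSON header, raw tensor bytes
    (hlen,) = struct.unpack_from("<Q", blob, 0)
    return json.loads(blob[8:8 + hlen].decode("utf-8"))

# build a tiny in-memory example with one fp32 tensor of two values
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
header_bytes = json.dumps(header).encode("utf-8")
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + struct.pack("<2f", 1.0, 2.0)
```

GGUF makes the opposite trade: a binary key-value header, so no JSON parsing, at the cost of not being hand-editable.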

[+] xrd|2 years ago|reply
I was discussing LLMs with a non-technical person on the plane yesterday. I was explaining why LLMs aren't good at math. And he responded, no, chatgpt is great at multivariate regression, etc.

I'm using LLMs locally almost always and eschewing API backed LLMs like chatgpt. So I'm not very familiar with plugins, and I'm assuming chatgpt plugs into a backend when it detects a math problem. So it isn't the LLM doing the math but to the user it appears to be.

Does anyone here know which LLM projects like llama.cpp or gemma.cpp support a plugin model?

I'm interested in adding to the dungeons and dragons system I built using llama.cpp. Because it doesn't do math well, the combat mode is terrible. But I was writing my own layer to break out when combat mode occurs, and I'm wondering if there is a better way with some kind of plugin approach.
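One lightweight approach that doesn't need a full plugin system: prompt the model to emit a marker for any roll (the `{{roll:...}}` convention below is made up for this sketch) and resolve the arithmetic outside the model:

```python
import random
import re

DICE_RE = re.compile(r"(\d+)d(\d+)([+-]\d+)?")

def roll(expr, rng):
    # evaluate a dice expression like "2d6+3" using the given RNG
    m = DICE_RE.fullmatch(expr)
    if not m:
        raise ValueError(f"not a dice expression: {expr}")
    n, sides, mod = int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)
    return sum(rng.randint(1, sides) for _ in range(n)) + mod

def resolve_combat(llm_output, seed=None):
    # replace every {{roll:XdY+Z}} marker with a real number, so the
    # arithmetic never depends on the LLM itself
    rng = random.Random(seed)
    return re.sub(r"\{\{roll:([^}]+)\}\}",
                  lambda m: str(roll(m.group(1), rng)), llm_output)
```

Seeding the RNG also gets you the determinism that pure LLM sampling lacks.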

[+] staticman2|2 years ago|reply
Sillytavern is a front end for local and cloud models. They have a simple scripting language and there's been some experiments with adding game functionality with it:

https://chub.ai/characters/creamsan/team-neko-e4f1b2f8

This one says it uses javascript as well:

https://chub.ai/characters/creamsan/tessa-c4b917f9

These are the only two listed as SFW. There's some others if you hit the nsfw toggle and search for the scripted tag. I don't know if this is the right approach, but you could also write a module for Sillytavern Extras.

[+] next_xibalba|2 years ago|reply
Is this neutered in the way Gemini is (i.e. is the "censorship" built in) or is that a "feature" of the Gemini application?
[+] ComputerGuru|2 years ago|reply
It depends on the model you load/use, the team released both censored and "PT" versions.
[+] jonpo|2 years ago|reply
These models (Gemma) are very difficult to jailbreak.
[+] ManasRao|2 years ago|reply
Hello Austin,

I would like to inquire about the accessibility of the code and weights for the Gemma Model. Is this information publicly available?

[+] austinvhuang|2 years ago|reply
Hi, haven't followed this thread for a while so just happened to see this now.

I'm assuming you mean in other languages/implementations? (since the gemma.cpp repo linked above has code + links for gemma.cpp specific weights)

If so, you can find the weights here https://www.kaggle.com/models/google/gemma - each of the "model variations" (flax, jax, pytorch, keras, etc.) has a download for the weights and links to its code.

If you're comfortable with flax, that's DM's own reference implementation: https://github.com/google-deepmind/gemma

[+] olegbask|2 years ago|reply
It would be amazing to add support for M1 aka Metal: I was able to run Q8 version with llama.cpp and it's blazingly fast. The problem: I don't know how much accuracy it loses and https://huggingface.co/google/gemma-2b-it/tree/main takes too much memory which results in OOMs.

Do you have any estimates on getting Metal support similar to how llama.cpp works?

Why are `.gguf` files so giant compared to `.sbs`? Is it just because they use fp32?
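On the size question, a quick back-of-the-envelope helps: file size is roughly parameters times bits per weight, so an fp32 export is about 4x an 8-bit one (real files add tokenizer and metadata overhead on top):

```python
def model_file_gb(n_params, bits_per_weight, overhead_gb=0.0):
    # rough size estimate: parameters * bits / 8, ignoring tokenizer/metadata
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

# a 2B-parameter model at different precisions
fp32 = model_file_gb(2e9, 32)    # ~8 GB
bf16 = model_file_gb(2e9, 16)    # ~4 GB
q8   = model_file_gb(2e9, 8.5)   # ~2.1 GB; quant formats carry per-block scales
```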

[+] namtranase|2 years ago|reply
Thanks to the team for the awesome repo. I have been exploring gemma.cpp and running it from the first day, and it is smooth in my view. So I hope gemma.cpp will continue to add cool features (something like k-quants, server,...) so it can serve more widely. Actually, I have developed a Python wrapper for it: https://github.com/namtranase/gemma-cpp-python The purpose is to make it easy to use and to pick up every new technique from the gemma.cpp team.
[+] austinvhuang|2 years ago|reply
Nice, this is really cool to see! There were other threads that expressed interest in something like this.
[+] a1o|2 years ago|reply
If I want to put a Gemma model in a minimalist command line interface, build it to a standalone exe file that runs offline, what is the size of my final executable? I am interested in how small can the size of something like this be and it still be functional.
[+] brucethemoose2|2 years ago|reply
...Also, we have eval'd Gemma 7B internally in a deterministic, zero-temperature test, and its error rate is like double that of Mistral Instruct 0.2. It's well below most other 7Bs.

Was not very impressed with the chat either.

So maybe this is neat for embedded projects, but if it's Gemma only, that would be quite a sticking point for me.

[+] Vetch|2 years ago|reply
Was it via gemma.cpp or some other library? I've seen a few people note that Gemma performance via gemma.cpp is much better than via llama.cpp; is it possible that the non-Google implementations are still not quite right?
[+] Havoc|2 years ago|reply
That does seem to be the consensus, unfortunately. It would have been better for everyone if Google's foray into open models a la FB had made a splash.
[+] trisfromgoogle|2 years ago|reply
Any chance you can share more details on your measurement setup and eval protocols? You're likely seeing some config snafus, which we're trying to track down.
[+] brokensegue|2 years ago|reply
does anyone have stats on cpu only inference speed with this?
[+] austinvhuang|2 years ago|reply
any particular hardware folks are most interested in?
[+] dontupvoteme|2 years ago|reply
At the risk of being snarky, it's interesting that llama.cpp was a 'grassroots' effort originating from a Bulgarian hacker, and Google now launches a corporatized effort inspired by it.

I wonder if there's some analogies to the 80s or 90s in here.

[+] trisfromgoogle|2 years ago|reply
To be clear, this is not directly comparable to llama.cpp -- Gemma models work on llama.cpp and we encourage people who love llama.cpp to use them there. We also launched with Ollama.

Gemma.cpp is a highly optimized and lightweight system. The performance is pretty incredible on CPU, give it a try =)

[+] alekandreev|2 years ago|reply
As a fellow Bulgarian from the 80s and 90s myself, and now a part of the Gemma team, I’d say Austin, Jan, and team very much live up to the ethos of hackers I'd meet on BBSes back then. :)

They are driven entirely by their own curiosity and a desire to push computers to the limit. Combined with their admirable low-level programming skills, you get a very solid, fun codebase that they are sharing with the world.

[+] natch|2 years ago|reply
Apart from the fact that they are different things, since they came out of the same organization I think it’s fair to ask:

Do these models have the same kind of odd behavior as Gemini?

[+] brucethemoose2|2 years ago|reply
Not to be confused with llama.cpp and the GGML library, which is a separate project (and almost immediately worked with Gemma).
[+] throwaway19423|2 years ago|reply
I am confused how all these things are able to interoperate. Are the creators of these models following the same IO for their models? Won't the tokenizer or token embedder be different? I am genuinely confused by how the same code works for so many different models.
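Roughly, yes: the newer formats (GGUF in particular) embed the architecture name, hyperparameters, and the tokenizer's vocabulary as metadata, and the runtime dispatches on that, so one loader serves many models. A toy sketch of the pattern (all names made up):

```python
# one loader serving many architectures by dispatching on file metadata
ARCH_BUILDERS = {}

def register(arch):
    def decorator(builder):
        ARCH_BUILDERS[arch] = builder
        return builder
    return decorator

@register("llama")
def build_llama(meta):
    return f"llama model, vocab={meta['vocab_size']}"

@register("gemma")
def build_gemma(meta):
    return f"gemma model, vocab={meta['vocab_size']}"

def load_model(meta):
    # the file tells the runtime which architecture (and tokenizer vocab)
    # to instantiate; unknown architectures fail loudly
    if meta["architecture"] not in ARCH_BUILDERS:
        raise ValueError(f"unsupported architecture: {meta['architecture']}")
    return ARCH_BUILDERS[meta["architecture"]](meta)
```

The tokenizer question resolves the same way: the vocab (and merge rules, for BPE) ship inside the file, so the runtime doesn't hard-code any of it.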
[+] jebarker|2 years ago|reply
I doubt there'd be confusion as the names are totally different