
alekandreev | 11 months ago

Greetings from the Gemma team! We just got Gemma 3 out of the oven and are super excited to show it to you! Please drop any questions here and we'll answer ASAP.

(Opinions our own and not of Google DeepMind.)

PS we are hiring: https://boards.greenhouse.io/deepmind/jobs/6590957

heinrichf|11 months ago

I'm comparing Gemma3 12B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).

- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval

- MistralSmall3 24B: ~500 t/s on prompt eval; 10 t/s on eval

Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?
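(For anyone trying to reproduce throughput numbers like these: Ollama's /api/generate response reports its timing fields in nanoseconds, so tokens-per-second falls out directly. A minimal sketch, with made-up sample values standing in for a real server response:)

```python
# Sketch: deriving prefill and decode tokens/s from the timing fields that
# Ollama's /api/generate endpoint returns. The numbers below are invented
# sample values, not real measurements.
resp = {
    "prompt_eval_count": 512,               # tokens processed during prefill
    "prompt_eval_duration": 5_120_000_000,  # 5.12 s, in nanoseconds
    "eval_count": 300,                      # tokens generated
    "eval_duration": 20_000_000_000,        # 20 s, in nanoseconds
}

prefill_tps = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
decode_tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)

print(f"prefill: {prefill_tps:.1f} t/s, decode: {decode_tps:.1f} t/s")
```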

alekandreev|11 months ago

Thank you for the report! We are working with the Ollama team directly and will look into it.

remuskaos|11 months ago

At what context sizes? I've just run the same prompt and query on my RTX3080 with openwebui as frontend.

When I set the context size to 2048 (openwebui's default), the inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of VRAM and ollama crashes for larger context sizes.

Still, I find that thoroughly odd. Using the larger context size (4096), the GPU usage is only 50% as seen in nvtop. I have no idea why.

magicalhippo|11 months ago

Thanks, been using Gemma 2 a lot at home as it still holds up very well and the 9B version runs great on my 2080Ti. Strong prompt adherence coupled with overall capability makes it very useful. Looking forward to trying Gemma 3.

I have some dumb questions though, might as well ask. How do you decide on the model sizes? And how do you train them? Independently or are they related somehow?

alekandreev|11 months ago

Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops, 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best.

The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down to different model sizes.
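(The width-to-depth heuristic described above can be sketched as a quick check; the widths and layer counts below are illustrative placeholders, not official Gemma 3 dimensions:)

```python
# Sketch of the ~90 width/depth sizing heuristic mentioned above.
# These configs are made-up placeholders, not real Gemma 3 hyperparameters.
configs = {
    "small":  {"width": 2560, "depth": 30},
    "medium": {"width": 3840, "depth": 44},
    "large":  {"width": 5376, "depth": 60},
}

def width_depth_ratio(cfg):
    """Ratio of model width (d_model) to depth (number of layers)."""
    return cfg["width"] / cfg["depth"]

for name, cfg in configs.items():
    print(f"{name}: width/depth = {width_depth_ratio(cfg):.1f}")
```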

miki123211|11 months ago

How good is Gemma at structured output generation, JSON schema compliance and tool use? Particularly the smaller versions, particularly in foreign languages?

We will run our internal evals on it for sure, but just wanted to ask whether that's even a use case that the team considered and trained for.

canyon289|11 months ago

Hey, I'm from the Gemma team. There are a couple of angles to your question.

We do care about prompted instructions, like JSON schema compliance, and it is something we eval for and encourage you to try. Here's an example from Gemma2 to guide folks looking to do what it sounds like you're interested in.

https://www.youtube.com/watch?v=YxhzozLH1Dk

Multilinguality was a big focus in Gemma3. Give it a try!

And for structured output Gemma works well with many structured output libraries, for example the one built into Ollama

https://github.com/ollama/ollama/blob/main/docs/api.md#struc...

In short you should have all the functionality you need!
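(Following the Ollama docs linked above, structured output works by passing a JSON schema in the request's format field. A minimal sketch of such a request body; the schema and prompt are made-up examples:)

```python
import json

# Sketch of a structured-output request body for Ollama's /api/chat endpoint,
# per the docs linked above. The schema and prompt are invented examples.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "capital", "languages"],
}

payload = {
    "model": "gemma3:4b",   # any Gemma 3 tag served by Ollama
    "messages": [{"role": "user", "content": "Tell me about Canada."}],
    "format": schema,       # constrains the reply to this JSON schema
    "stream": False,
}

# This is what you would POST to http://localhost:11434/api/chat
print(json.dumps(payload, indent=2))
```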

seektable|11 months ago

Just tried gemma3:4b for structured output and it fails with a strange error (Ollama is on the latest version):

Ollama error: POST predict: Post "http://127.0.0.1:49675/completion": read tcp 127.0.0.1:49677->127.0.0.1:49675: wsarecv: An existing connection was forcibly closed by the remote host.

Not sure this is Ollama or gemma3:4b problem. At the same time, gemma3:12b works fine for the same API request (100% identical, only difference is model id).

swyx|11 months ago

will there ever be a Gemma 3 Thinking? how copyable is the Flash Thinking approach to the Gemma series?

alekandreev|11 months ago

That's a very interesting area, but nothing we can announce today.

mdp2021|11 months ago

Thank you!

Question: your model supports 140 languages. Given that you are focusing on compactness and efficiency, would you not see gains from also developing models restricted to a small set of languages (e.g. the four "western" languages with the largest cultural production and a shared alphabet, or a similar set)?

Edit: of course the multilingual capability can be welcome. On the other hand, there are evident cases in which efficiency is paramount. We can wonder about the tradeoff: how much efficiency is sacrificed for features?

alekandreev|11 months ago

That's an idea we've thought about. However, we think the open source community has already created a very impressive set of language or region-specific finetunes [1] [2]. Also there is a lot of cultural and nuance context in every language that we don't have the capacity to cover sufficiently. So for v3 we focused on creating the best foundational multilingual model.

[1] https://huggingface.co/aiplanet/buddhi-indic

[2] https://ai.google.dev/gemma/gemmaverse/sealion

sidkshatriya|11 months ago

As per the technical report, every 5 layers there is a global attention layer. During training, the global attention layers can have a context length of up to 128k (though I understand it is usually 32k).

Q. When you are training with a context length of 128k, is the attention in the global layers dense or sparse ?

If dense, would the attention memory requirement here be O(n^2), where n is 128k, for each global layer?

alekandreev|11 months ago

We never train at 128k, only 32k, changing the scaling factor at the end.

We wanted the long context recipe to be friendly for finetuning, and training at 128k is a bit of a pain, so we don't do it. For inference, we see that RAM usage at 128k with the 5/1 ratio is close to RAM usage for a fully-global-layer model at 32k.

Individual attention layers are always dense.
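(A back-of-envelope sketch of why the 5/1 interleaving keeps 128k-context KV-cache memory close to a fully-global model at 32k. The layer count, head dimensions, and the sliding-window size below are illustrative assumptions, not official Gemma 3 numbers:)

```python
# Back-of-envelope KV-cache estimate for interleaved local/global attention.
# All hyperparameters below are made-up placeholders for illustration.
def kv_cache_bytes(context, n_layers, global_every=6, window=1024,
                   n_kv_heads=8, head_dim=128, bytes_per=2):
    """Total KV-cache bytes: global layers cache the full context,
    sliding-window (local) layers cache at most `window` tokens."""
    total = 0
    for layer in range(n_layers):
        is_global = (layer % global_every == global_every - 1)
        tokens = context if is_global else min(context, window)
        total += 2 * tokens * n_kv_heads * head_dim * bytes_per  # K and V
    return total

# 1 global layer per 6 (i.e. 5 local : 1 global) at 128k context ...
interleaved_128k = kv_cache_bytes(128_000, n_layers=48)
# ... vs. every layer global at 32k context
all_global_32k = kv_cache_bytes(32_000, n_layers=48, global_every=1)

print(f"interleaved @128k: {interleaved_128k / 2**30:.1f} GiB, "
      f"all-global @32k: {all_global_32k / 2**30:.1f} GiB")
```

Under these toy numbers the interleaved 128k cache is actually a bit smaller than the fully-global 32k one, which matches the "close to" claim above.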

moffkalast|11 months ago

What's the official take on the system prompt? The technical report doesn't mention it, but the official QAT GGUFs include some form of prepending it to the first user message. Has it been trained with any <start_of_turn>system turns with tool calls and such?

alekandreev|11 months ago

We recommend using <start_of_turn>user for the system prompt as well.
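(In practice that means folding the system text into the first user turn of Gemma's published chat template. A minimal sketch; the prompt strings are made-up examples:)

```python
# Sketch: fold the system prompt into the first user turn, as recommended
# above. The <start_of_turn>/<end_of_turn> markers are Gemma's published
# turn delimiters; the prompt text is an invented example.
def format_gemma_prompt(system, user):
    first_turn = f"{system}\n\n{user}" if system else user
    return (
        f"<start_of_turn>user\n{first_turn}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

prompt = format_gemma_prompt("You are a terse assistant.", "Why is the sky blue?")
print(prompt)
```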

Herring|11 months ago

Excellent work. What optimizer did you use? I assume AdamW? I didn't see it listed.

saagarjha|11 months ago

Google is using Greenhouse for ATS now?