I have tried a lot of local models. I have 656GB of them on my computer, so I have experience with a diverse array of LLMs. Gemma has been nothing to write home about and has been disappointing every single time I have used it.
Models that are worth writing home about are:
EXAONE-3.5-7.8B-Instruct - It was excellent at taking podcast transcriptions and generating show notes and summaries.
Rocinante-12B-v2i - Fun for stories and D&D
Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
OpenThinker-7B - Good and fast reasoning
The DeepSeek distills - Able to handle more complex tasks while still being fast
DeepHermes-3-Llama-3-8B - A really good vLLM
Medical-Llama3-v2 - Very interesting but be careful
Plus more, but not Gemma.
From the limited testing I've done, Gemma 3 27B appears to be an incredibly strong model. But I'm not seeing the same performance in Ollama as I'm seeing on aistudio.google.com. So, I'd recommend trying it from the source before you draw any conclusions.
One of the downsides of open models is that there are a gazillion little parameters at inference time (sampling strategy, prompt template, etc.) that can easily impair a model's performance. It takes some time for the community to iron out the wrinkles.
The Gemma 2 Instruct models are quite good (9 & 27B) for writing. The 27B is good at following instructions. I also like DeepSeek R1 Distill Llama 70B.
The Gemma 3 Instruct 4B model that was released today matches the output of the larger models for some of the stuff I am trying.
Recently, I compared 13 different online and local LLMs in a test where they tried to recreate Saki's "The Open Window" from a prompt.[1] Claude wins hands down IMO, but the other models are not bad.
[1] Variations on a Theme of Saki (https://gist.github.com/s-i-e-v-e/b4d696bfb08488aeb893cce3a4...)
You should try Mistral Small 24B. It's been my daily companion for a while and has continued to impress me. I've heard good things about QwQ 32B, which just came out, too.
Nice, I think you're nailing the important thing -- which is "what exactly are they good FOR?"
I see a lot of talk about good and not good here, but (and a question for everyone) what are people using the non-local big boys for that the locals CAN'T do? I mean, IRL tasks?
I have had nothing but good results using the Qwen2.5 and Hermes3 models. The response times and token generation speeds have been pretty fantastic compared against other models I've tried, too.
Could you talk a little more about your D&D usage? This has turned into one of my primary use cases for ChatGPT, cooking up encounters or NPCs with a certain flavour if I don't have time to think something up myself. I've also been working on hooking up to the D&D Beyond API so you can get everything into homebrew monsters and encounters.
Do you mostly stick with smaller models? I'm pretty surprised at how good the smaller models can be at times now. A year ago they were nearly useless. I also kind of like that the hallucinations are more obvious sometimes. Or at least it seems like they are.
Ah, OpenThinker-7B. A diverse variety of LLM from the OpenThoughts team. Light and airy, suitable for everyday usage and not too heavy on the CPU. A new world LLM for the discerning user.
Let us know when you've evaluated Gemma 3. Just as with the switch between ChatGPT 3.5 and ChatGPT 4, old versions don't tell you much about the current version.
What hardware are you running those on? Is it still prohibitively expensive to self-host a model that gives decent outputs? (Sorry, my last experience with Llama a while back was underwhelming.)
The recommended settings according to the Gemma team are:
temperature = 0.95
top_p = 0.95
top_k = 64
Also beware of double BOS tokens! You can run my uploaded GGUFs with the recommended chat template and settings via `ollama run hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M`.
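If it helps anyone wire this up, here is a minimal sketch of passing those settings through Ollama's HTTP API (assuming a local server on the default port; the model tag is just whichever Gemma 3 build you pulled):

```python
# A rough sketch: apply the Gemma team's recommended sampling settings
# when talking to a local Ollama server. The model tag is an example.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default Ollama port
MODEL = "hf.co/unsloth/gemma-3-27b-it-GGUF:Q4_K_M"  # or e.g. "gemma3:27b"

def ask(prompt: str) -> str:
    resp = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.95,  # recommended Gemma settings
            "top_p": 0.95,
            "top_k": 64,
        },
    }, timeout=600)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(ask("Can you create the game tetris in python?"))
```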
Daniel, as always, thanks for these. I had good results with your Q4_K_M quant on mac / llama.cpp. However, on Linux/A100/ollama, there is something very wrong with your 27b Q8_0 quant: Python code has indentation errors, missing close parens, quite a lot that's bad. I ran both with your suggested command lines, but of course it could have been some mistake I made. I'm testing the bf16 on the A100 now to make sure it's not a hardware issue, but my gut is there's a model or ollama sampling problem here.
Thanks for this, but I'm still unable to reproduce the results from Google AI studio.
I tried your version and when I ask it to create a tetris game in python, the resulting file has syntax errors. I see strange things like a space in the middle of a variable name/reference or weird spacing in the code output.
Small models should be trained on specific problems in specific languages, and should be built one upon another, the way containers work. I see a future where a factory or home has a local AI server with many highly specific models, continuously trained by super-large LLMs on the web and connected via a network to all instruments and computers to basically control the whole factory. I also see a future where all machinery comes with an AI-readable language for its own functioning: an HTTP-like protocol for two-way communication between a machine and an AI. Lots of possibilities.
After reading the technical report, make the effort to download the model and run it against a few prompts. In 5 minutes you'll understand how broken LLM benchmarking is.
That's why I like giving it a real-world test. For example, take a podcast transcription and ask it to make show notes and a summary. With a temperature of 0, different models will tackle the problem in different ways, and you can infer whether they really understood the transcript. Usually the transcripts I give it come from about an hour of audio of two or more people talking.
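For anyone who wants to try the same kind of test, a rough sketch of the idea (the model tags and transcript path are placeholders; it assumes a local Ollama server):

```python
# Sketch of the "show notes" test: feed the same podcast transcript to a few
# local models at temperature 0 and compare the summaries by hand.
import requests

MODELS = ["gemma3:27b", "mistral-small:24b", "qwen2.5:14b"]  # examples only

with open("podcast_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

PROMPT = (
    "Here is a podcast transcript. Write concise show notes and a summary:\n\n"
    + transcript
)

for model in MODELS:
    r = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "stream": False,
        # Temperature 0 for repeatability; an hour of audio needs a large context.
        "options": {"temperature": 0, "num_ctx": 32768},
    }, timeout=1200)
    r.raise_for_status()
    print(f"===== {model} =====")
    print(r.json()["message"]["content"])
```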
No mention of how well it's claimed to perform with tool calling?
The Gemma series of models has historically been pretty poor when it comes to coding and tool calling - two things that are very important to agentic systems, so it will be interesting to see how 3 does in this regard.
Not sure if anyone else experiences this, but ollama downloads start off strong and then the last few MBs take forever.
Finally just finished downloading (gemma3:27b). Requires the latest version of Ollama to use, but now working, getting about 21 tok/s on my local 2x A4000.
From my few test prompts it looks like a quality model; going to run more tests to compare it against mistral-small:24b and see if it's going to become my new local model.
There are some fixes coming to uniformly speed up pulls. We've been testing that out but there are a lot of moving pieces with the new engine so it's not here quite yet.
It might not be downloading but converting the model. Or, if it's already downloading a properly formatted model file, deduping it on disk, which I hear it does. This also makes its model files on disk useless for other frontends.
I experienced this just now. The download slowed down to approx 500kB/s for the last 1% or so. When this happens, you can Ctrl+C to cancel and then start the download again. It will continue from where it left off, but at regular (fast) download speed.
The claim of “strongest” (what does that even mean?) seems moot. I don't think a multimodal model is the way to go for single home GPUs.
I would much rather have specific tailored models to use in different scenarios, that could be loaded into the GPU when needed. It’s a waste of parameters to have half of the VRAM loaded with parts of the model targeting image generation when all I want to do is write code.
My usual non-scientific benchmark is asking it to implement the game Tetris in python, and then iterating with the LLM to fix/tweak it.
My prompt to Gemma 27b (q4) on open webui + ollama: "Can you create the game tetris in python?"
It immediately starts writing code. After the code is finished, I notice something very strange: it starts a paragraph like this:
"
Key improvements and explanations:
Clearer Code Structure: The code is now organized into a Tetris class, making it much more maintainable and readable. This is essential for any non-trivial game.
"
Followed by a bunch of fixes/improvements, as if this was not the first iteration of the script.
I also notice a very obvious error: In the `if __name__ == '__main__':` block, it tries to instantiate a `Tetris` class, when the name of the class it created was "TetrisGame".
Nevertheless, I try to run it and paste the `NameError: name 'Tetris' is not defined` error along with the stack trace specifying the line. Gemma then gives me this response:
"The error message "NameError: name 'Tetris' is not defined" means that the Python interpreter cannot find a class or function named Tetris. This usually happens when:"
Then it continues with a generic explanation of how to fix this error in arbitrary programs. It seems like it completely ignored the code it just wrote.
I ran the same prompt on Google AI Studio and it had the same behavior of talking about improvements as if the code it wrote was not the first version.
Other than that, the experience was completely different:
- The game worked on first try
- I iterated with the model making enhancements. The first version worked but didn't show scores, levels or next piece, so I asked it to implement those features. It then produced a new version which almost worked: The only problem was that levels were increasing whenever a piece fell, and I didn't notice any increase in falling speed.
- So I reported the problems with level tracking and falling speed and it produced a new version which crashed immediately. I pasted the error and it was able to fix it in the next version
- I kept iterating with the model, fixing issues until it finally produced a perfectly working tetris game which I played and eventually lost due to high falling speed.
- As a final request, I asked it to port the latest working version of the game to JS/HTML with the implementation self-contained in a single file. It produced a broken implementation, but I was able to fix it after tweaking it a little bit.
Gemma 3 27b on Google AI studio is easily one of the best LLMs I've used for coding.
Unfortunately I can't seem to reproduce the same results in ollama/open webui, even when running the full fp16 version.
These bar charts are getting more disingenuous every day. This one makes it seem like Gemma3 ranks as nr. 2 on the arena just behind the full DeepSeek R1. But they just cut out everything that ranks higher. In reality, R1 currently ranks as nr. 6 in terms of Elo. It's still impressive for such a small model to compete with much bigger models, but at this point you can't trust any publication by anyone who has any skin in model development.
The chart isn't claiming to be an overview of the best ranking models - it's an evaluation of this particular model, which wouldn't be helped at all by having loads more unrelated models in the chart, even if that would have helped you avoid misunderstanding the point of the chart.
The most disturbing thing is that in the chart it ranks higher than V3. Test a few prompts against DeepSeek V3 and Gemma 3: they are at two totally different levels. One is a SOTA model; the other is a small LLM that can perhaps be useful for certain vertical tasks.
The Open LLM Leaderboard [0] is probably a good way to compare open-weights models on many different benchmarks. I wish they also included some closed-source ones, just to see the relative ranking of the best open-weights models against closed-source ones. They haven't updated it for Gemma 3 yet, though.
[0] https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...
Gemma 2 27B at 4 bits would be a drooling idiot anyway; even going down to 8 bits seems to significantly lobotomize it. Qwens are surprisingly resistant to quantization compared to most, so they'll pull ahead on coherence alone for the same amount of VRAM.
We'll see if the quantization-aware versions are any better this time around, but I doubt any inference framework will even support them. Gemma.cpp never got a standard-compatible server API so people could actually use it, and as a result it got absolutely zero adoption.
They've had years to provide the needed memory but can't/won't.
The future of local LLMs is APUs such as Apple M series and AMD Strix Halo.
Within 12 months everyone will have relegated discrete GPUs to the AI dustbin and be running 128GB to 512GB of delicious local RAM with vastly more RAM than any discrete GPU could dream of.
That seems a tad dramatic. GPUs became widespread because of gaming, not AI. That the overlapping market would somehow all magically have >$3,000 _and_ decide to switch to a non-standard, non-CUDA hardware solution in just 12 months is absurd.
Ollama silently (!!!) drops messages if the context window is exceeded (instead of, you know, just erroring? who in the world made this decision).
The workaround until now was to (not use ollama or) make sure to only send a single message. But now they seem to silently truncate single messages as well, instead of erroring! (this explains the sibling comment where a user could not reproduce the results locally).
Use LM Studio, llama.cpp, openrouter or anything else, but stay away from ollama!
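If you do stick with Ollama anyway, a crude way to at least catch the truncation (a sketch, assuming a local server on the default port; the ~4-chars-per-token estimate and the 0.5 threshold are arbitrary heuristics): set num_ctx explicitly and compare the prompt_eval_count Ollama reports against a rough estimate of what you sent.

```python
# Sketch: catch Ollama's silent context truncation instead of trusting the output.
import requests

def chat_checked(model: str, prompt: str, num_ctx: int = 8192) -> str:
    r = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        # Set the window explicitly instead of relying on the default.
        "options": {"num_ctx": num_ctx},
    }, timeout=600)
    r.raise_for_status()
    data = r.json()
    est_tokens = len(prompt) // 4                   # rough estimate of prompt size
    seen_tokens = data.get("prompt_eval_count", 0)  # tokens Ollama actually evaluated
    if seen_tokens < est_tokens * 0.5:
        raise RuntimeError(
            f"Prompt likely truncated: ~{est_tokens} tokens sent, "
            f"only {seen_tokens} evaluated (num_ctx={num_ctx})."
        )
    return data["message"]["content"]
```

Note that prompt caching can also lower prompt_eval_count on repeated calls, so treat this as a smoke test rather than a guarantee.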
"AI company" makes this an unreasonable wide question but I'll assume you mean of the big players in this ecosystem. I miss later models from Grok and xai, which don't seem to care about sharing models either
BTW mistral-small:24b is also worth mentioning (IMO the best local model), and phi4:14b is also pretty strong for its size.
mistral-small was my previous local go-to model; testing now to see if gemma3 can replace it.
> Qwen2.5-Coder-14B-Instruct - Good for simple coding tasks
> OpenThinker-7B - Good and fast reasoning
Any chance you could be more specific, i.e. give an example of a concrete coding task or reasoning problem you used them for?
Also lobotomized LLMs ("abliterated") can be a lot of fun.
Do you have any recommendations for a "general AI assistant" model, not focused on a specific task, but more a jack-of-all-trades?
IME Qwen2.5-3B-Instruct (or even 1.5B) has been quite remarkable, but I haven't done much heavy testing.
>>> who is president
The বর্তমানpresident of the United States is Джо Байден (JoeBiden).
By default, Ollama uses a context window size of 2048 tokens.
Suddenly, after reasoning models, it looks like OSS models have lost their charm.