spankalee | 10 days ago
I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.
It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.
Within VS Code Copilot, Claude has a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text editing tools. In Copilot, it won't stop and ask clarifying questions, though in Gemini CLI it will.
So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.
For as much as I hear that Google's pulling ahead, from a practical POV it seems to me that Anthropic is. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.
bluegatty|10 days ago
I think it speaks to the broader notion of AGI as well.
Claude is definitely trained on the process of coding, not just the code; that much is clear.
Codex has the same limitation but not quite as bad.
This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.
Anthropic is winning at coding and related tasks because they're focused on that. Google is probably oriented towards a more general solution, and so it's stuck in 'jack of all trades, master of none' mode.
rhubarbtree|10 days ago
But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.
That said, Gemini is noticeably worse than ChatGPT (it’s quite erratic), and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.
So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.
datahack|10 days ago
It is the company’s constant kryptonite.
They seem to be, from my third-party perspective, repeating the same ol’, same ol’ pattern. It is the “wave lesson” all over again.
Anthropic meanwhile is giving people what they want. They are really listening. And it’s working.
spankalee|10 days ago
This definitely feels like it.
It's hard to really judge, but Gemini feels like it might actually write better code; the _process_ is so bad, though, that it doesn't matter. At first I thought it was bad integration by GitHub Copilot, but I see it elsewhere now.
andai|10 days ago
I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.
I think that's partly the personality and tone, but I think it goes deeper than that.
(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)
teaearlgraycold|10 days ago
Nuance like this is why I don’t trust quantitative benchmarks.
esoterae|9 days ago
Jack of all trades, master of none, is oftentimes better than master of one.
karmasimida|10 days ago
OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly
onlyrealcuzzo|10 days ago
Most of Gemini's users are Search converts doing extended-Search-like behaviors.
Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.
alphabetting|10 days ago
For example, the APEX-Agents benchmark for long-time-horizon investment banking, consulting, and legal work:
1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%
unknown|10 days ago
[deleted]
ant6n|10 days ago
* randomly fails reading PDFs, but lies about it and just makes shit up if it can't read a file, so you're constantly second guessing whether the context is bullshit
* will forget all context, especially when you stop a reply (never stop a reply, it will destroy your context).
* will forget previous context randomly, meaning you have to start everything over again
* turning deep research on and off doesn't really work. Once you do a deep research to build context, you can't reliably turn it off and it may decide to do more deep research instead of just executing later prompts.
* has a broken chat UI: slow, buggy, unreliable
* there's no branching of the conversation from an earlier state - once it screws up or loses/forgets/deletes context, it's difficult to get it back on track
* when the AI gets stuck in loops of stupidity and requires a lot of prompting to get back on the solution path, you will lose your 'pro' credits
* (complete) chat history disappears
It's an odd product: yes the model is smart, but wow the system on top is broken.
raducu|10 days ago
Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.
Not because the LLM lies, but because humans do that too -- when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they actually figured it out.
raducu|10 days ago
Yes, Gemini loops, but I've found it's almost always just a matter of interrupting and telling it to continue.
Claude is very good until it tries something 2-3 times and can't figure it out, and then it tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask) OR introduces hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.
RachelF|10 days ago
- it is "lazy": I keep having to tell it to finish or continue; it wants to stop the task early.
- it hallucinates: I have arguments with it about it making up API functions for well-known libraries, functions which just do not exist.
avereveard|10 days ago
Sometimes you can save so much time asking Claude, Codex, and GLM "hey, what do you think of this problem" and getting a sense of whether they would implement it right or not.
Gemini never stops; instead it goes and fixes whatever you throw at it, even if asked not to. You are constantly rolling the dice, but with Gemini each roll is 5 to 10 minutes long and pollutes the work area.
It's the model I most rarely use even though, having a large Google Photos tier, I get it basically for free between Antigravity, gemini-cli and Jules.
For all its faults, Anthropic discovered pretty early, with Claude 2, that intelligence and benchmarks don't matter if the user can't steer the thing.
port11|9 days ago
3 Flash usually doesn't get into any loops, but then again, it’s also not really following prompts properly. I’ve tried all manner of harnesses around what it shouldn’t do, but it often ignores some instructions. It also doesn’t follow design specs at all, it will output React code that is 70% like what it was asked to do.
My experience with Stitch is the same. Gemini has nice free-use tiers, but it wastes a lot of my time on reprompting.
Alex-Programs|9 days ago
If I were to build something for Gemini models I'd plan around ingesting a bunch of context then oneshotting it.
stephen_cagle|10 days ago
tl;dr: it is great at search, not so much at action.
neves|10 days ago
It's not very complex, but a great time saver
menaerus|10 days ago
And yet it happily told me exactly what I wanted it to tell me - rewrite the goddamn thing using (C++) expression templates. And voila, it took "it" 10 minutes to spit out high-quality code that works.
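For readers who haven't met them, the expression-template technique referenced here can be sketched in a few lines. This is a generic toy illustration of the idea (operator overloads build a lazy expression tree, and a single fused loop runs at assignment, avoiding temporaries), not the code the model produced; all type names are made up:

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Toy expression-template sketch: '+' builds a lightweight expression
// object instead of a temporary vector; the single fused loop runs
// only when the expression is assigned into a Vec.
struct Vec;
template <typename L, typename R> struct Add;

// Trait restricting the overloads below to our expression types.
template <typename T> struct is_expr : std::false_type {};
template <> struct is_expr<Vec> : std::true_type {};
template <typename L, typename R>
struct is_expr<Add<L, R>> : std::true_type {};

template <typename L, typename R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

struct Vec {
    std::vector<double> data;
    Vec(std::size_t n, double v) : data(n, v) {}

    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Evaluate any expression tree element-wise in one loop.
    template <typename E,
              typename = std::enable_if_t<is_expr<E>::value>>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

// Only combines expression types, so it can't hijack unrelated '+'.
template <typename L, typename R,
          typename = std::enable_if_t<is_expr<L>::value && is_expr<R>::value>>
Add<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With this, `out = a + b + c;` allocates no intermediate vectors; the whole right-hand side collapses into one loop at the assignment.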
My biggest gripe with Gemini for now is that Antigravity seems to be written by the model, and I am experiencing more hiccups than I would like; sometimes it just gets stuck.
ubercore|9 days ago
What's Conway's Law for LLM models going to be called?
thot_experiment|10 days ago
It's been pretty good for conversations to help me think through architectural decisions though!
knollimar|10 days ago
I'm fully immersed
SkyPuncher|10 days ago
They'd do well to make a "gemini-flash-lite-for-tools" that their pro model calls whenever it needs to do something simple.
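A minimal sketch of what that split could look like: a classifier decides whether a step is a trivial tool call (hand it to a cheap, fast model) or real reasoning (keep it on the pro model). The model names, the `route` function, and the keyword heuristic here are all made up for illustration; nothing below is a real Gemini API:

```cpp
#include <string>

// Hypothetical two-tier routing: cheap model for rote tool calls,
// pro model for everything that needs actual reasoning.
enum class Model { FlashLiteForTools, Pro };

// Stand-in for a real learned or heuristic classifier; here we just
// check whether the task starts with a known trivial tool command.
inline Model route(const std::string& task) {
    const bool trivial_tool_call =
        task.rfind("read_file", 0) == 0 ||
        task.rfind("list_dir", 0) == 0 ||
        task.rfind("apply_patch", 0) == 0;
    return trivial_tool_call ? Model::FlashLiteForTools : Model::Pro;
}
```

The design point is that the dispatch happens outside the big model, so the pro model's context never pays for the boring round trips.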
WhitneyLand|10 days ago
Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.
slopinthebag|10 days ago
My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?
jbellis|10 days ago
https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/
hopefully 3.1 is better.
nicce|10 days ago
Maybe it is just a genius business strategy.
motoboi|10 days ago
One thousand people using an agent over a month will generate like 30-60k good examples of tool use and nudge the model into good editing.
The only explanation I have is that Google is actually using something else internally.
varispeed|10 days ago
I wonder if there is some form of cheating. Many times I've found that after a while Gemini becomes like a Markov chain, suddenly spouting nonsense on repeat and no longer reacting to user input.
lal77|9 days ago
[deleted]