top | item 47076453


spankalee | 10 days ago

I hope this works better than 3.0 Pro

I'm a former Googler and know some people near the team, so I mildly root for them to at least do well, but Gemini is consistently the most frustrating model I've used for development.

It's stunningly good at reasoning, design, and generating the raw code, but it just falls over a lot when actually trying to get things done, especially compared to Claude Opus.

Within VS Code Copilot, Claude will have a good mix of thinking streams and responses to the user. Gemini will almost completely use thinking tokens, and then just do something but not tell you what it did. If you don't look at the thinking tokens you can't tell what happened, but the thinking token stream is crap. It's all "I'm now completely immersed in the problem...". Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress. It's bad at using tools and tries to edit files in weird ways instead of using the provided text-editing tools. In Copilot, it won't stop and ask clarifying questions, though in Gemini CLI it will.

So I've tried to adopt a plan-in-Gemini, execute-in-Claude approach, but while I'm doing that I might as well just stay in Claude. The experience is just so much better.

For as much as I hear that Google's pulling ahead, Anthropic seems to be ahead to me, from a practical POV. I hope Googlers on Gemini are actually trying these things out in real projects, not just one-shotting a game and calling it a win.



bluegatty|10 days ago

Yes, this is very true and it speaks strongly to this wayward notion of 'models' - it depends so much on the tuning, the harness, the tools.

I think it speaks to the broader notion of AGI as well.

Claude is definitively trained on the process of coding not just the code, that much is clear.

Codex has the same limitation but not quite as bad.

This may be a result of Anthropic using 'user cues' with respect to what are good completions and not, and feeding that into the tuning, among other things.

Anthropic is winning coding and related tasks because they're focused on that, Google is probably oriented towards a more general solution, and so, it's stuck in 'jack of all trades master of none' mode.

rhubarbtree|10 days ago

Google are stuck because they have to compete with OpenAI. If they don’t, they face an existential threat to their advertising business.

But then they leave the door open for Anthropic on coding, enterprise and agentic workflows. Sensibly, that’s what they seem to be doing.

That said Gemini is noticeably worse than ChatGPT (it’s quite erratic) and Anthropic’s work on coding / reasoning seems to be filtering back to its chatbot.

So right now it feels like Anthropic is doing great, OpenAI is slowing but has significant mindshare, and Google are in there competing but their game plan seems a bit of a mess.

datahack|10 days ago

I know this is only a partial answer, but I feel like Google is once again trying to build a product based on internal priorities, existing-business protectionism, and internal business goals, rather than building a product that actively listens to real user feedback as the primary priority.

It is the company’s constant kryptonite.

They seem to be, from my third-party perspective, repeating the same ol', same ol' pattern. It is the "Wave lesson" all over again.

Anthropic meanwhile is giving people what they want. They are really listening. And it’s working.

spankalee|10 days ago

> Claude is definitively trained on the process of coding not just the code

This definitely feels like it.

It's hard to really judge; Gemini feels like it might actually write better code, but the _process_ is so bad that it doesn't matter. At first I thought it was bad integration by GitHub Copilot, but I see it elsewhere now.

andai|10 days ago

Tell me more about Codex. I'm trying to understand it better.

I have a pretty crude mental model for this stuff but Opus feels more like a guy to me, while Codex feels like a machine.

I think that's partly the personality and tone, but I think it goes deeper than that.

(Or maybe the language and tone shapes the behavior, because of how LLMs work? It sounds ridiculous but I told Claude to believe in itself and suddenly it was able to solve problems it wouldn't even attempt before...)

teaearlgraycold|10 days ago

> Claude is definitively trained on the process of coding not just the code, that much is clear.

Nuance like this is why I don’t trust quantitative benchmarks.

esoterae|9 days ago

The full aphorism is:

Jack of all trades, master of none, is oftentimes better than master of one.

karmasimida|10 days ago

Gemini just doesn’t do even mildly well in agentic stuff and I don’t know why.

OpenAI has mostly caught up with Claude in agentic stuff, but Google needs to be there and be there quickly

onlyrealcuzzo|10 days ago

Because Search is not agentic.

Most of Gemini's users are Search converts doing extended-Search-like behaviors.

Agentic workflows are a VERY small percentage of all LLM usage at the moment. As that market becomes more important, Google will pour more resources into it.

alphabetting|10 days ago

the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big from 3.0 to 3.1.

For example the APEX-Agents benchmark for long time horizon investment banking, consulting and legal work:

1. Gemini 3.1 Pro - 33.2%
2. Opus 4.6 - 29.8%
3. GPT 5.2 Codex - 27.6%
4. Gemini Flash 3.0 - 24.0%
5. GPT 5.2 - 23.0%
6. Gemini 3.0 Pro - 18.0%

swftarrow|10 days ago

I suspect a large part of Google's lag is due to being overly focused on integrating Gemini with their existing product and app lines.

hintymad|10 days ago

My guess is that the Gemini team didn't focus on large-scale RL training for agentic workloads, and they are trying to catch up with 3.1.

gavmor|10 days ago

I've had plenty of success with skills juggling various entities via CLI.

renegade-otter|10 days ago

It's like anything Google - they do the cool part and then lose interest with the last 10%. Writing code is easy, building products that print money is hard.

ionwake|10 days ago

Can you explain what you mean by it being bad at agentic stuff?

ant6n|10 days ago

Google's chat system on top of the model is also consistently the most frustrating. I use Gemini for non-coding tasks, so I need to feed it a bunch of context (documents) to do my tasks, which can be pretty cumbersome. Gemini

* randomly fails reading PDFs, but lies about it and just makes shit up if it can't read a file, so you're constantly second guessing whether the context is bullshit

* will forget all context, especially when you stop a reply (never stop a reply, it will destroy your context).

* will forget previous context randomly, meaning you have to start everything over again

* turning deep research on and off doesn't really work. Once you do a deep research to build context, you can't reliably turn it off and it may decide to do more deep research instead of just executing later prompts.

* has a broken chat UI: slow, buggy, unreliable

* there's no branching of the conversation from an earlier state - once it screws up or loses/forgets/deletes context, it's difficult to get it back on track

* when the AI gets stuck in loops of stupidity and requires a lot of prompting to get back on the solution path, you will lose your 'pro' credits

* (complete) chat history disappears

It's an odd product: yes the model is smart, but wow the system on top is broken.

s3p|10 days ago

Don't get me started on the thinking tokens. Since 2.5P the thinking has been insane: "I'm diving into the problem", "I'm fully immersed", or "I'm meticulously crafting the answer".

ceroxylon|10 days ago

I once saw "now that I've slept on it" in Gemini's CoT... baffling.

dist-epoch|10 days ago

That's not the real thinking, it's a super summarized view of it.

foz|10 days ago

This is part of the reason I don't like to use it. I feel it's hiding things from me, compared to other models that very clearly share what they are thinking.

raducu|10 days ago

> Don't get me started on the thinking tokens.

Claude provides nicer explanations, but when it comes to CoT tokens or just prompting the LLM to explain -- I'm very skeptical of the truthfulness of it.

Not because the LLM lies, but because humans do that too: when asked how they figured something out, they'll provide a reasonable-sounding chain of thought, but it's not how they actually figured it out.

fl0ki|10 days ago

"I'm now completely immersed in the problem" is my new catchphrase, thanks for sharing.

raducu|10 days ago

> Gemini also frequently gets twisted around, stuck in loops, and unable to make forward progress.

Yes, gemini loops but I've found almost always it's just a matter of interrupting and telling it to continue.

Claude is very good until it tries something 2-3 times and can't figure it out; then it tries to trick you by changing your tests instead of your code (if you explicitly tell it not to, maybe it will decide to ask), OR it introduces hyper-fine-tuned IFs to fit your tests, EVEN if you tell it NOT to.

RachelF|10 days ago

I haven't used 3.1 yet, but 3.0 Pro has been frustrating for two reasons:

- it is "lazy": I keep having to tell it to finish, or continue, it wants to stop the task early.

- it hallucinates: I have arguments with it about making up API functions to well known libraries which just do not exist.

avereveard|10 days ago

Yeah, Gemini 3.0 is unusable to me. To an extent all models do things right or wrong, but Gemini just refuses to elaborate.

Sometimes you can save so much time asking Claude, Codex, and GLM "hey, what do you think of this problem" and getting a sense of whether they would implement it right or not.

Gemini never stops; instead it goes and fixes whatever you throw at it, even if asked not to. You are constantly rolling the dice, but with Gemini each roll is 5 to 10 minutes long and pollutes the work area.

It's the model I most rarely use even though, having a large Google Photos tier, I get it basically for free across Antigravity, gemini-cli, and Jules.

For all its faults, Anthropic discovered pretty early, with Claude 2, that intelligence and benchmarks don't matter if the user can't steer the thing.

ojr|10 days ago

I primarily use Gemini 3 Flash with a GUI coding agent I made myself, and it's been able to successfully one-shot almost any task I throw at it. Why would I ever use a more expensive, slower reasoning model? I am impressed with the library knowledge Gemini has; I don't use any skills or MCP, and it's able to implement functions to perfection. No one crawls more data than Google, and their model reflects that in my experience.

port11|9 days ago

My experience with Antigravity was that 3 Pro can reason about how to get out of Gemini's typical loops, but won't actually manage it (it gets stuck).

3 Flash usually doesn't get into any loops, but then again, it's also not really following prompts properly. I've tried all manner of harnesses around what it shouldn't do, but it often ignores some instructions. It also doesn't follow design specs at all; it will output React code that is 70% like what it was asked to do.

My experience with Stitch is the same. Gemini has nice free-use tiers, but it wastes a lot of my time with reprompting.

Alex-Programs|9 days ago

I'm curious, what's the agent like?

If I were to build something for Gemini models I'd plan around ingesting a bunch of context then oneshotting it.

Oras|10 days ago

Glad I’m not the only one who experienced this. I have a paid antigravity subscription and most of the time I use Claude models due to the exact issues you have pointed out.

stephen_cagle|10 days ago

I also worked at Google (on the original Gemini, when it was still Bard internally) and my experience largely mirrors this. My finding is that Gemini is pretty great for factual information, and it is also the only one that can reliably (even with the video camera) take a picture of a bird and tell me what the bird is. But it is just pretty bad as a model to help with development; myself and everyone I know use Claude. The benchmarks are always really close, but my experience is that they do not translate to real-world (mostly coding) tasks.

tldr; It is great at search, not so much action.

neves|10 days ago

Gemini integrating with Google software gives me the best feature of all LLMs. When I receive an invite for an event, I screenshot it, share it with the Gemini app and say: add to my Calendar.

It's not very complex, but a great time saver

PratMish|9 days ago

Gemini is pretty hit-or-miss with tool calls. Even when I explicitly ask for a code block, it tends to break the formatting and spill the text everywhere.

menaerus|10 days ago

I don't know ... as of now I am literally instructing it to solve the chained expression computation problem which incurs a lot of temporary variables, of which some can be elided by the compiler and some cannot. Think linear algebra expressions which yield a lot of intermediate computations for which you don't want to create a temporary. This is production code and not an easy problem.

And yet it happily told me exactly what I wanted it to tell me: rewrite the goddamn thing using (C++) expression templates. And voila, it took "it" 10 minutes to spit out high-quality code that works.
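For readers unfamiliar with the technique, the expression-template rewrite mentioned above can be sketched roughly like this (all names here are illustrative, not the commenter's actual production code): each `operator+` returns a lightweight node holding references to its operands, and the whole chain is evaluated element-by-element in a single pass on assignment, so no intermediate vectors are materialized.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 3;

// Expression node: holds references to operands, computes element i on demand.
template <typename L, typename R>
struct Add {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    // Chaining builds a deeper tree of nodes, still no temporaries.
    template <typename R2>
    Add<Add, R2> operator+(const R2& rhs) const { return {*this, rhs}; }
};

struct Vec {
    std::array<double, N> data{};
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    template <typename R>
    Add<Vec, R> operator+(const R& rhs) const { return {*this, rhs}; }
    // Assignment walks the expression tree once, writing directly into data.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < N; ++i) data[i] = e[i];
        return *this;
    }
};
```

With this, `out = a + b + c;` builds an `Add<Add<Vec, Vec>, Vec>` node and evaluates it in one loop, instead of allocating a temporary `Vec` for each `+`.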

My biggest gripe for now with Gemini is that Antigravity seems to be written by the model and I am experiencing more hiccups than I would like to, sometimes it's just stuck.

ubercore|9 days ago

Apologies for the low-effort comment, but your description of Gemini kind of reminds me of my impression of Google's approach to products too. There's often brilliance there, confounded by sometimes muddled approaches.

What's Conway's Law for LLM models going to be called?

thot_experiment|10 days ago

It's actually staggering to me how badly Gemini has been working with my current project, which involves a lot of color space math. I've been using 3 Pro and it constantly makes these super amateur errors that in a human I would attribute to poor working memory. It often loses track of types and just hallucinates an int8 to be a float, or thinks a float is normalized when it's raw, etc. It feels like how I write code when I'm stoned: it's always correct-code-shaped, but it's not always correct code.
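The raw-vs-normalized mix-up described above is a classic bug class in color code. One sketch of the kind of type-level guard that catches it at compile time (hypothetical wrapper names, not the commenter's project) is to wrap the two representations in distinct types so they can't be silently interchanged:

```cpp
#include <cstdint>

// Distinct wrapper types: a raw 0..255 channel and a normalized 0.0..1.0
// channel can no longer be passed where the other is expected.
struct Raw8 { std::uint8_t v; };  // raw channel value, 0..255
struct Norm { float v; };         // normalized channel value, 0.0..1.0

inline Norm to_norm(Raw8 r) { return {r.v / 255.0f}; }

inline Raw8 to_raw(Norm n) {
    // Clamp before scaling so out-of-range floats don't wrap around
    // in the uint8_t cast; round to nearest by adding 0.5.
    float c = n.v < 0.0f ? 0.0f : (n.v > 1.0f ? 1.0f : n.v);
    return {static_cast<std::uint8_t>(c * 255.0f + 0.5f)};
}
```

With this, mistaking a raw value for a normalized one becomes a type error rather than a silent miscalculation, which is exactly the slip the model kept making.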

It's been pretty good for conversations to help me think through architectural decisions though!

boppo1|10 days ago

I'm interested in color space math, is your project public?

tom_m|10 days ago

3.0 Pro is fantastic. Can't wait for 3.1. And no, I'm not solely a user of Gemini; I also love Opus. I just end up using 3.0 Pro more.

knollimar|10 days ago

Is the thinking token stream obfuscated?

Im fully immersed

orbital-decay|10 days ago

It's just a summary generated by a really tiny model. I guess it's also an ad-hoc way to obfuscate it, yes. In particular, they're sometimes hiding prompt injections they're dynamically adding. The actual CoT is hidden and entirely different from that summary. It's not very useful for you as a user, though (neither is the summary).

SkyPuncher|10 days ago

I've had a similar experience. Gemini is superb at incredibly hard stuff, but falls apart on some of the most basic things (like tool calling).

They'd do well to make a "gemini-flash-lite-for-tools" that their pro model calls whenever it needs to do something simple.

acters|10 days ago

I have personally seen a rise in LLMs being too lazy to investigate or figure things out on their own; they just jump to conclusions and hope you tell them extra information, even when it's something they could do themselves.

fwipsy|9 days ago

I assumed the "thinking" output from Gemini was the result of a smaller model summarizing, because it contains no actual reasoning. Perhaps they did this to prevent competitors from training off it?

WhitneyLand|10 days ago

Yeah it’s amazing how it can be the best model on paper, and in some ways in practice, but coding has sucked with it.

Makes you wonder though how much of the difference is the model itself vs Claude Code being a superior agent.

slopinthebag|10 days ago

Hmm, interesting..

My workflow is to basically use it to explain new concepts, generate code snippets inline or fill out function bodies, etc. Not really generating code autonomously in a loop. Do you think it would excel at this?

mikestorrent|10 days ago

I think that you should really try to get whatever agent you can to work on that kind of thing for you - guide it with the creation of testing frameworks and code coverage, focus more on the test cases with your human intellect, and let it work to pass them.

scotty79|10 days ago

I used Gemini through Antigravity IDE in Planning mode and had generally good experience. It was pretty capable, but I don't really read chat history, I don't trust it. I just look at the diffs.

Bnjoroge|10 days ago

Agree, even through Gemini CLI, Gemini 3 has just been underwhelming. You can clearly tell the agentic harness/capability wasn't native to the model at all, just patched onto it.

jpcompartir|10 days ago

Yep, Gemini is virtually unusable compared to Anthropic models. I get it for free with work and use maybe once a week, if that. They really need to fix the instruction following.

agentifysh|10 days ago

Relieved to read this from an ex-Googler. At least we are not the crazy ones we are made out to be whenever we point out issues with Gemini.

jbellis|10 days ago

yeah, g3p is as smart as or smarter than the other flagships, but it's just not reliable enough, it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

https://blog.brokk.ai/gemini-3-pro-preview-not-quite-baked/

hopefully 3.1 is better.

nicce|10 days ago

> it will go into "thinking loops" and burn 10s of 1000s of tokens repeating itself.

Maybe it is just a genius business strategy.

motoboi|10 days ago

gemini-cli being such crap tells me that Google is not dogfooding it, because how else would they not have the RL trajectories to get a decent agent?

One thousand people using an agent over a month will generate like 30-60k good examples of tool use and nudge the model into good editing.

The only explanation I have is that Google is actually using something else internally.

klooney|10 days ago

Claude probably

jatins|10 days ago

Yep, great models to use in gemini.google.com but outside of that it somehow becomes dumb (especially for coding)

mrnobody_67|10 days ago

I was burning $10-$20 per hour, $1.50 - $3.00 per prompt with Gemini 3 in Openclaw... it was insanely inefficient.

zobzu|9 days ago

Same here (ex-G and all that jazz). But in practice it means I use Gemini for a lot of stuff, just not code. Claude won't try to one-shot complex stuff that Gemini will, but Claude will reliably produce what you expect.

ckdot|9 days ago

Gemini 3.1 is surprisingly bad at coding, especially if you consider that they built an IDE (Antigravity) around it. I let it carefully develop a plan according to very specific instructions. The outcome was terrible: AGENTS.md ignored, a syntax error in XML (closing tag missing), inconsistent namings, misinterpreted console outputs that were quite clear ("You forgot to add some attribute foobar"). I'm quite disappointed.

varispeed|10 days ago

> stuck in loops

I wonder if there is some form of cheating. Many times I've found that after a while Gemini suddenly becomes like a Markov chain, spouting nonsense on repeat, and doesn't react to user input anymore.

fragmede|10 days ago

Small local models will get into that loop. Fascinating that Gemini, running on bigger hardware and with many teams of people trying to sell it as a product, also runs into that issue.

lal77|9 days ago

[deleted]