Impressive seeing Google notch up another ~25 ELO on lmarena, on top of the previous #1, which was also Gemini!
That being said, I'm starting to doubt the leaderboards as an accurate representation of model ability. While I do think Gemini is a good model, having used both Gemini and Claude Opus 4 extensively in the last couple of weeks I think Opus is in another league entirely. I've been dealing with a number of gnarly TypeScript issues, and after a bit Gemini would spin in circles or actually (I've never seen this before!) give up and say it can't do it. Opus solved the same problems with no sweat. I know that that's a fairly isolated anecdote and not necessarily fully indicative of overall performance, but my experience with Gemini is that it would really want to kludge on code in order to make things work, where I found Opus would tend to find cleaner approaches to the problem. Additionally, Opus just seemed to have a greater imagination? Or perhaps it has been tailored to work better in agentic scenarios? I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off Playwright script, which I found particularly remarkable. My experience with Gemini is that it tries to solve bugs by reading the code really really hard, which is naturally more limited.
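(For the curious: the kind of one-off script described above is only a few lines of Playwright. A rough sketch, with a hypothetical URL and selector, not what the model actually wrote:)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # hypothetical dev server
    page.click("#submit")               # the interaction being debugged
    print(page.content())               # dump the resulting DOM for inspection
    browser.close()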
Again, I think Gemini is a great model, I'm very impressed with what Google has put out, and until 4.0 came out I would have said it was the best.
o3 is still my favorite over even Opus 4 in most cases. I've spent hundreds of dollars on AI code gen tools in the last month alone and my ranking is:
1. o3 - it's just really damn good at nuance, getting to the core of the goal, and writing the closest thing to quality production-level code. The only negatives are its knowledge cutoff and cost, especially given its love of tools. That's not usually a big deal for the Rails projects I work on, but sometimes it is.
2. Opus 4 via Claude Code - also really good and is my daily driver because o3 is so expensive. I will often have Opus 4 come up with the plan and first pass and then let o3 critique and make a list of feedback to make it really good.
3. Gemini 2.5 Pro - haven't tested this latest release but this was my prior #2 before last week. Now I'd say it's tied or slightly better than Sonnet 4. Depends on the situation.
4. Sonnet 4 via Claude Code - it's not bad but needs a lot of coaching and oversight to produce really good code. It will definitely produce a lot of code if you just let it go do its thing, but it won't be quality, concise, and thoughtful code without more specific prompting and revisions.
I'm also extremely picky and a bit OCD with code quality and organization in projects down to little details with naming, reusability, etc. I accept only 33% of suggested code based on my Cursor stats from last month. I will often revert and go back to refine the prompt before accepting and going down a less than optimal path.
What I like about Gemini is the search function, which is very, very good compared to others. I was blown away when I asked it to compose an email for a company that was sending spam to our domain. It literally searched and found not only the abuse email of the hosting company but all the info about the domain and the host (MX servers, IP owners, datacenters, etc.). Also, if you want to convert a research paper into a podcast, it did that instantly for me, and it's fun to listen to.
I’ve been giving the same tasks to claude 4 and gemini 2.5 this week and gemini provided correct solutions and claude didn’t. These weren’t hard tasks either, they were e.g. comparing sql queries before/after rewrite - Gemini found legitimate issues where claude said all is ok.
I haven't tried all of the favorites, just what is available with Jetbrains AI, but I can say that Gemini 2.5 is very good with Go. I guess that makes sense in a way.
I think the only way to be particularly impressed with new leading models lately is to hold the opinion all of the benchmarks are inaccurate and/or irrelevant and it's vibes/anecdotes where the model is really light years ahead. Otherwise you look at the numbers on e.g. lmarena and see it's claiming a ~16% preference win rate for gpt-3.5-turbo from November of 2023 over this new world-leading model from Google.
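To make that concrete: the Elo-to-win-rate conversion is the standard formula; a quick sketch (the ~290-point gap is illustrative, not the exact leaderboard numbers):

def win_prob(r_a: float, r_b: float) -> float:
    # Elo expected score: P(A beats B) = 1 / (1 + 10**((r_b - r_a) / 400))
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# A ~290-point Elo deficit still leaves the weaker model winning ~16%
# of head-to-head preference votes:
print(round(win_prob(1200, 1490), 3))  # -> 0.159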
for bulk data extraction on personal real-life data, I experienced that even gpt-4o-mini outperforms the latest gemini models in both quality and cost. i would use reasoning models, but their json schema support is different from the non-reasoning models', as in: they can not deal with union types for optional fields when using strict schemas... anyway.
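For context, this is the pattern being complained about: with strict structured-output schemas, every property typically has to appear in "required", so the only way to mark a field optional is a union type with null. A sketch of the convention (not any one vendor's exact API):

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": ["string", "null"]},  # "optional" only via union type
    },
    "required": ["name", "email"],  # email must still be listed as required
    "additionalProperties": False,
}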
idk what's the hype about gemini, it's really not that good imho
I just realized that Opus 4 is the first model that produced "beautiful" code for me. Code that is simple, easy to read, not polluted with comments, no unnecessary crap, just pretty, clean and functional. I had my first "wow" moment with it in a while.
That being said it occasionally does something absolutely stupid. Like completely dumb. And when I ask it "why did you do this stupid thing", it replies "oh yeah, you're right, this is super wrong, here is an actual working, smart solution" (proceeds to create brilliant code)
I'd start to worry about OpenAI, from a valuation standpoint. The company has some serious competition now and is arguably no longer the leader.
It's going to be interesting to see how easily they can raise more money. Their valuation is already in the $300B range. How much larger can it get, given their relatively paltry revenue at the moment and rising costs for hardware and electricity?
If the next generation of LLMs needs new data sources, then Facebook and Google seem well positioned there. OpenAI, on the other hand, seems likely to lose that race for proprietary data sets since, unlike those other two, they don't have another business that generates such data.
When they were the leader in both research and in user facing applications they certainly deserved their lofty valuation.
What is new money coming into OpenAI getting now?
At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Or at an extremely lofty P/E ratio of, say, 100, that would be $3B in annual earnings that analysts would have to expect to double each year for the next 10-ish years, a la AMZN in the 2000s, to justify this valuation.
They seem to have boxed themselves into a corner where it will be painful to go public, assuming they can ever figure out the nonprofit/profit issue their company has.
Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.
There is some serious confusion about the strength of OpenAIs position.
"chatgpt" is a verb. People have no idea what claude or gemini are, and they will not be interested in it, unless something absolutely fantastic happens. Being a little better will do absolutely nothing to convince normal people to change product (the little moat that ChatGPT has simply by virtue of chat history is probably enough from a convenience standpoint, add memories and no super obvious path to export/import either and you are done here).
All that OpenAI would have to do, to easily be worth their valuation eventually, is to optimize and not become offensively bad to their, what, 500 million active users. And, if we assume the current paradigm that everyone is working with is here to stay, why would they? Instead of leading (as they have done so far, for the most part), they can at any point simply do what others have resorted to successfully and copy with a slight delay. People won't care.
> At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Oops, I think you may have flipped the numerator and the denominator there, if I'm understanding you. A valuation of $300B, if 2x sales, would imply $150B in sales. Probably your point still stands.
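Spelling out the corrected arithmetic (a sketch, keeping the parent's assumed 2x price-to-sales multiple):

valuation = 300e9  # assumed $300B valuation
ps_multiple = 2.0  # assumed 2x price-to-sales multiple
implied_sales = valuation / ps_multiple
print(f"${implied_sales / 1e9:.0f}B")  # -> $150B in sales, not $600B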
Even if they're winning the AI race, their search business is still going to be cannibalized, and it's unclear if they'll be able to extract any economic rents from AI thanks to market competition. Of course they have no choice but to compete, but they probably would have preferred the pre-AI status quo of unquestioned monopoly and eyeballs on ads.
I think it’s too early to say they are not the leader given they have o3 pro and GPT 5 coming out within the next month or two. Only if those are not impressive would I start to consider that they have lost their edge.
Although it does feel likely that at minimum, they are neck and neck with Google and others.
>At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
What? Apple has revenue of $400B and a market cap of $3T.
> At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Even Google doesn't have $600B revenue. Sorry, it sounds like numbers pulled from someone's rear.
> At even a $300B valuation a typical wall street analysts would want to value them at 2x sales which would mean they'd expect OpenAI to have $600B in annual sales to account for this valuation when they go public.
Lmfao, where did you get this from? Microsoft has less than half of that revenue and is valued at more than 10x OpenAI.
Revenue is not the metric by which these companies are valued...
I was tempted by the ratings and immediately paid for a subscription to Gemini 2.5.
Half an hour later, I canceled the subscription and got a refund.
This is the laziest and stupidest LLM.
What it was supposed to do, it told me to do on my own. And when analyzing simple, short documents, it pulled in completely unrelated documents from the Internet.
Even local LLMs (3B) were not so stupid and lazy.
As if 3 different preview versions of the same model were not confusing enough, the last two dates are 05-06 and 06-05. They could have held off for a day :)
Since those days are ambiguous anyway, they would have had to hold off until the 13th.
In Canada, a third of the dates we see are British, and another third are American, so it’s really confusing. Thankfully y-m-d is now a legal format and seems to be gaining ground.
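The ambiguity is easy to demonstrate with a throwaway snippet; both parses succeed for any day of 12 or below, hence "hold off until the 13th":

from datetime import datetime

s = "05-06"
print(datetime.strptime(s, "%m-%d"))  # month-day reading: May 6
print(datetime.strptime(s, "%d-%m"))  # day-month reading: June 5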
I have two issues with Gemini that I don't experience with Claude: 1. It RENAMES VARIABLE NAMES even in places where I don't tell it to make changes (I pass them just as context), and 2. it sometimes drops closing square brackets.
Sure I'm a lazy bum, I call the variable "json" instead of "jsonStringForX", but it's contextual (within a closure or function), and I appreciate the feedback, but it makes reviewing the changes difficult (too much noise).
I have a very clear example of Gemini getting it wrong:
For code like this, it keeps changing processing_class=tokenizer to tokenizer=tokenizer, even though the parameter was renamed, and even after adding the all-caps comment.
# Set up the SFTTrainer
print("Setting up SFTTrainer...")
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    args=sft_config,
    processing_class=tokenizer,  # DO NOT CHANGE. THIS IS NOW THE CORRECT PROPERTY NAME
)
print("SFTTrainer ready.")
I haven't tried with this latest version, but the 05-06 pro still did it wrong.
I find o1-pro, which nobody ever mentions, is in the top spot along with Gemini. But Gemini is an absolute mess to work with because it constantly adds tons of comments and changes unrelated code.
It is worth it sometimes, but usually I use it to explore ideas and then have o1-pro spit out a perfect solution, ready to diff, test, and merge.
I've noticed that ChatGPT will 100% ignore certain instructions, and I wonder if it's just an LLM thing. For example, I can scream and yell in caps at ChatGPT to not use em or en dashes, and if anything that makes it use them even more. I've literally never once gotten it to successfully avoid them, even when it ignored the instruction the first time and my follow-up was "output the same thing again but NO EM or EN DASHES!"
I haven't tested this thoroughly; it's just my anecdotal experience over a dozen or so attempts.
AI Studio uses your API account behind the scenes, and it is subject to normal API limits. When you sign up for AI Studio, it creates a Google Cloud free-tier project with a "gen-lang-client-" prefix behind the scenes. You can link a billing account at the bottom of the "Get an API key" page.
Also note that AI studio via default free tier API access doesn't seem to fall within "commercial use" in Google's terms of service, which would mean that your prompts can be reviewed by humans and used for training. All info AFAIK.
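If you do link a key, hitting the same models directly looks roughly like this (a sketch assuming the google-genai Python client; the model id is the 06-05 preview discussed in this thread):

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # key from the AI Studio page
resp = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",
    contents="Hello",
)
print(resp.text)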
I found all the previous Gemini models somewhat inferior even compared to Claude 3.7 Sonnet (and much worse than 4) as my coding assistants. I'm keeping an open mind but also not rushing to try this one until some evaluations roll in. I'm actually baffled that the internet at large seems to be very pumped about Gemini but it's not reflective of my personal experience. Not to be that tinfoil hat guy but I smell at least a bit of astroturf activity around Gemini.
> I'm actually baffled that the internet at large seems to be very pumped about Gemini but it's not reflective of my personal experience. Not to be that tinfoil hat guy but I smell at least a bit of astroturf activity around Gemini.
I haven't used Claude, but Gemini has always returned better answers to general questions relative to ChatGPT or Copilot. My impression, which could be wrong, is that Gemini is better in situations that are a substitute for search. How do I do this on the command line, tell me about this product, etc. all give better results, sometimes much better, on Gemini.
I think it's just very dependent on what you're doing. Claude 3.5/3.7 Sonnet (thinking or not) were just absolutely terrible at almost anything I asked of it (C/C++/Make/CMake). Like constantly giving wrong facts, generating code that could never work, hallucinating syntax and APIs, thinking about something then concluding the opposite, etc. Gemini 2.5-pro and o3 (even old o1-preview, o1-mini) were miles better. I haven't used Claude 4 yet.
But everyone is using them for different things and it doesn't always generalize. Maybe Claude was great at typescript or ruby or something else I don't do. But for some of us, it definitely was not astroturf for Gemini. My whole team was talking about how much better it was.
I'm switching a lot between Sonnet and Gemini in Aider. For some reason, some of my coding problems only one of the models is capable of solving, and I don't see any pattern that could tell me upfront which one I should use for a specific need.
> I found all the previous Gemini models somewhat inferior even compared to Claude 3.7 Sonnet (and much worse than 4) as my coding assistants.
What are your usecases? Really not my experience, Claude disappoints in Data Science and complex ETL requests in python. O3 on the other hand really is phenomenal.
I think they are fairly interchangeable.
In Roo Code, Claude uses the tools better, but I prefer Gemini's coding style and brevity (except for comments; it loves to write comments).
Sometimes I mix and match if one fails or pursues a path I don't like.
My experience has been that Gemini's code (and even conversation) is a little bit uglier in general - but that the code tends to solve the issue you asked with fewer hallucinations.
I can't speak to it now - have mostly been using Claude Code w/ Opus 4 recently.
As a lawyer, Claude 4 is the best writer, and usually, but not always, the leader in legal reasoning. That said, o3 often grinds out the best response, and Gemini seems to be the most exhaustive researcher.
I mean, they're cheaper models, they aren't as much of a pain about rate limiting as Claude was, and they have a pretty solid deep research offering without restrictive usage limits. IDK how it is for long-running agentic stuff; I'd be surprised if it was anywhere near the other models. But for a general ChatGPT competitor, it doesn't matter if it's not as good as Opus 4 if it's way cheaper and won't use up your usage limit.
They can't because if someone has built something around that version they don't want to replace that model with a new model that could provide different results.
Does 82.2 correspond to the "Percent correct" of the other models?
Not sure if OpenAI has updated o3, but it looks like "pure" o3 (high) has a score of 79.6% in the linked table, and the "o3 (high) + gpt-4.1" combo has the highest score of 82.7%.
The previous Gemini 2.5 Pro Preview 05-06 (yea, not current 06-05!) was at 76.9%.
That looks like a pretty nice bump!
But either way, these Aider benchmarks seem to be the most useful/trustworthy benchmarks currently, and really the only ones I'm paying attention to.
It improves on the Extended NYT Connections benchmark compared to both Gemini 2.5 Pro Exp (03-25) and Gemini 2.5 Pro Preview (05-06), scoring 58.7. The decline observed between 03-25 and 05-06 has been reversed - https://github.com/lechmazur/nyt-connections/.
Almost all of those benchmarks are coding related. It looks like SWE-Bench is the only one where Claude is higher. Hard to say which benchmark is most representative of actual work. The community seems to like Aider Polyglot from what I've seen
I just checked, and it looks like the limit for Jules has been bumped from 5 free daily tasks to 60. Not sure it uses the latest model, but I would assume it does.
because the plethora of models and versions is getting ridiculous, and for anyone who's not following LLM news daily, you have no clue what to use. There was never a "Google Search 2.6.4 04-13". You just went to google.com and searched.
I found Gemini 2.5 Pro highly useful for text summaries, and even reasoning in long conversations... UP TO the last 2 weeks or month. Recently, it seems to totally forget what I'm talking about after 4-5 messages of a paragraph of text each. We're not talking huge amounts of context, but conversational braindeadness. Between ChatGPT's sycophancy and Gemini's forgetfulness and poor attention, I'm just sticking with whatever local model du jour fits my needs and whatever crap my company is paying for today. It's super annoying; hopefully Gemini gets its memory back!
I believe it's intentionally nerfed if you use it through the app. Once you use Gemini for a long time you realize they have a number of dark patterns to deter heavy users but maintain the experience for light users. These dark patterns are:
- "Something went wrong error" after too many prompts in a day. This was an undocumented rate limit because it never occurs earlier in the day and will immediately disappear if you subscribe for and use a new paid account, but it won't disappear if you make a new free account, and the error going away is strictly tied to how long you wait. Users complained about this for over a year. Of course they lied about the real reasons for this error, and it was never fixed until a few days ago when they rug pulled paying users by introducing actual documented tight rate limits.
- "You've been signed out" error if the model has exceeded its output token budget (or runtime duration) for a single inference, so you can't do things like what Anthropic recommends where you coax the model to think longer.
- I have less definitive evidence for this but I would not be surprised if they programmatically nerf the reasoning effort parameter for multiturn conversations. I have no other explanation for why the chain of thought fails to generate for small context multiturn chats but will consistently generate for ultra long context singleturn chats.
I noticed that same behavior across older Gemini models. I built a chatbot at work around 1.5 Flash, and one day it suddenly started behaving like that. It was perfect before, but afterwards it always greeted the user as if it were their first chat, despite me sending the history. And I didn't find any changelog regarding that at the time.
After that I moved to OpenAI; Gemini models just seem unreliable in that regard.
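For reference, "sending the history" with the classic google.generativeai client looked roughly like this (a sketch; the work chatbot's exact wiring is obviously not shown here):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")
chat = model.start_chat(history=[
    {"role": "user", "parts": ["My name is Ana."]},
    {"role": "model", "parts": ["Nice to meet you, Ana!"]},
])
# With history supplied, the model should not greet the user as if new:
print(chat.send_message("What's my name?").text)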
Gemini is a good and fast model, but I think the style of code it writes is... amateur / inexperienced. It doesn't make a lot of mistakes typical of an LLM, but rather chooses approaches that are typical of someone who just learned programming. I have to always nudge it to avoid verbosity, keep structure less repetitive, optimize async code, etc. With claude, I rarely have this problem -- it feels more like working with a more experienced developer.
As a Windsurf user I was happy with Claude 3.7 but then switched to Google Gemini 2.5 when Claude started glitching on a particularly large file. It's a bummer that 3.7 has gone from Windsurf - I considered cancelling my Windsurf subscription, but decided not to because it is still good value for money.
Man, if the benchmarks are to be believed, this is a lifeline for Windsurf as Anthropic becomes less and less friendly.
However, in my personal experience Sonnet 3.x has still been king so far. Will be interesting to watch this unfold. At this point, it's still looking grim for Windsurf.
The truth is that Gemini 2.5 06-05 is a fraud at coding. Before, out of 10 pieces of code it wrote, 1 or 2 might not work, meaning they had errors. Now, out of 10, 9 or 10 are wrong. Why does it have so many errors???
Sundar tweeted a lion, so it's probably Goldmane. Kingfall is probably their Deep Think model, and they might wait for o3-pro to drop so they can swing back.
Interesting, I just learned about matharena.ai. Google cherry-picks one result where they're the best here, but in the overall results it's still o3 and o4-mini-high that are in the lead.
So there's both a 05-06 model and a 06-05 model, and the launch page for 06-05 has some graphs with benchmarks for the 05-06 model but without the 06-05 model?
It depends on where and how you use it. I only use the Gemini Pro model in AI Studio and set the temperature to 0.05, or 0.1; in rare cases I bump it to 0.3 if I need some frontend creativity. It still isn't impressive; I see that Claude is still far better, o4-mini-high too. When it comes to o3, I despise it despite its very high benchmark rankings, and the best version of it is only available through the API.
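Pinning the same low temperature through the API instead of the AI Studio sliders is one field in the config (a sketch with the google-genai client; the model id is the preview in question):

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
resp = client.models.generate_content(
    model="gemini-2.5-pro-preview-06-05",
    contents="Refactor this function to be async.",
    config=types.GenerateContentConfig(temperature=0.05),
)
print(resp.text)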
Right now, the Claude Code tooling and ChatGPT Codex are far better than anything else I have seen for massive code development. Is there a better option out there with Gemini at the heart of it? I noticed the command-line Codex might support it.
Amateur question, how are people using this for coding?
Direct chat and copy pasting code? Seems clunky.
Or manually switching in Cursor? Although that's extra cost and not required for a lot of tasks where Cursor tab is faster and good enough, so you'd need to opt in on demand.
Cline + OpenRouter in VSCode?
Something else?
> I will often have Opus 4 come up with the plan and first pass and then let o3 critique
Same here with o3 and Sonnet (I haven't tested 4.0 enough yet to have an opinion). I feel we need better parallel-evaluation support, where you could run all the top models on the same task and decide which one produced the best solution.
> I'm starting to doubt the leaderboards as an accurate representation of model ability
Goodhart's law applies here just like everywhere else, and much more so given how much money these companies are dumping into making these models.
> I saw it do things like dump the DOM and inspect it for issues after a particular interaction by writing a one-off Playwright script
No way, is there any way to see the dialog or recreate this scenario!?
> it replies "oh yeah, you're right, this is super wrong, here is an actual working, smart solution"
I do not understand how these machines work.
> Congrats to Google here, they have done great work and look like they'll be one of the biggest winners of the AI race.
I agree that Google is well-positioned, but the mindshare/product advantage OpenAI has gives them a stupendous amount of leeway.
> the last two dates are 05-06 and 06-05
They are clearly trolling OpenAI's 4o and o4 models.
"# Added this function" "# Changed this to fix the issue"
No, I know, I was there! This is what commit messages for, not comments that are only relevant in one PR.
I'm thinking of cancelling my ChatGPT subscription because I keep hitting rate limits.
Meanwhile I have yet to hit any rate limit with Gemini/AI Studio.
Still actually falling behind the official scores for o3 high. https://aider.chat/docs/leaderboards/
> It improves on the Extended NYT Connections benchmark [...] scoring 58.7.
This table seems to indicate it's markedly worse? https://blog.google/products/gemini/gemini-2-5-pro-latest-pr...
> I can scream and yell in caps at ChatGPT to not use em or en dashes
Isn't this what you can do with system instructions?
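And if those don't stick either, a deterministic post-processing pass always works; a trivial sketch:

def strip_dashes(text: str) -> str:
    # Replace em (U+2014) and en (U+2013) dashes, whatever the model emits
    return text.replace("\u2014", " - ").replace("\u2013", "-")

print(strip_dashes("Models love dashes\u2014sometimes too much."))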
> It's a bummer that 3.7 has gone from Windsurf
Are you talking about Sonnet 4, which never came to Windsurf because Anthropic does not want to support OpenAI?
I've been preferring to use Copilot agent mode with Sonnet 4, but it asks you to intervene a lot.