top | item 45610602

ofek | 4 months ago

The sentiment in this thread surprises me a great deal. For me, Gemini 2.5 Pro is markedly worse than GPT-5 Thinking on every axis: hallucinations, rigid insistence on its own correctness, and sycophancy. Claude Opus used to be marginally better, but now Claude Sonnet 4.5 is far better, although not quite on par with GPT-5 Thinking.

I frequently ask all three the same question side by side, and the only situation in which I sometimes prefer Gemini 2.5 Pro is everyday lifestyle stuff, like explaining item descriptions on DoorDash that aren't in English.

edit: It's more of a system prompt issue but I despise the verbosity of Gemini 2.5 Pro's responses.

Diggsey|4 months ago

I've found Gemini to be much better at completing tasks and following instructions. For example, let's say I want to extract all the questions from a Word document and output them as a CSV.

If I ask ChatGPT to do this, it will do one of two things:

1) Extract the first ~10-20 questions perfectly, and then either just give up, or else hallucinate a bunch of stuff.

2) Write code that tries to use regex to extract the questions, which then fails because the questions are too free-form to be reliably matched by a regex.

If I ask Gemini to do the same thing, it will just do it and output a perfectly formed and most importantly complete CSV.
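To illustrate failure mode (2) above, here is a minimal sketch of the kind of regex-based extractor being described. Everything in it is an assumption made for illustration: the heuristic (a question is a line ending in a question mark), the names `QUESTION_RE`, `extract_questions` and `to_csv`, and the sample text, which stands in for text already pulled out of the document. It is not the code ChatGPT actually produced.

```python
import csv
import io
import re

# Hypothetical heuristic: a "question" is a line that ends with a question
# mark, optionally preceded by a number such as "1." or "2)".
QUESTION_RE = re.compile(r"^\s*(?:\d+[.)]\s*)?(.+\?)\s*$")


def extract_questions(text: str) -> list[str]:
    """Return the lines of `text` that match the question-mark heuristic."""
    found = []
    for line in text.splitlines():
        match = QUESTION_RE.match(line)
        if match:
            found.append(match.group(1).strip())
    return found


def to_csv(questions: list[str]) -> str:
    """Serialize the questions as a one-column CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["question"])
    for question in questions:
        writer.writerow([question])
    return buffer.getvalue()


# Made-up sample text standing in for the document's contents.
SAMPLE = """\
1. What is your name?
2) Describe your previous role.
Explain why you are applying
3. How many years of experience do you have?
"""

if __name__ == "__main__":
    # Only the two lines ending in "?" are found; the two free-form
    # prompts ("Describe ...", "Explain ...") are silently dropped.
    print(to_csv(extract_questions(SAMPLE)))
```

The pattern is only as good as its definition of a question, so prompts phrased as instructions fall through, and the resulting CSV is well formed but incomplete.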

cageface|4 months ago

For writing code, at least, this has been exactly my experience. GPT-5 is the best but slow. Sonnet 4.5 is a few notches below but significantly faster and good enough for a lot of things. I have yet to get a single useful result from Gemini.

ofek|4 months ago

Here's an example of Gemini 2.5 Pro hallucinating, which happens so often that I don't trust it: https://gemini.google.com/share/99a1be550763

coffeeaddict1|4 months ago

Yep, I agree. GPT-5 Thinking is by far the best reasoning model IME. Gemini 2.5 Pro is worse at pretty much everything.

CSMastermind|4 months ago

This has been pretty much exactly my experience.

arresin|4 months ago

My honest belief is that they’re bots. I also find 2.5 worse.