top | item 44838420

(no title)

paulhodge | 6 months ago

Lots of signs point to a conclusion that the Opus and Sonnet models are fundamentally better at coding, tool usage, and general problem solving across long contexts. There is some kind of secret sauce in the way they train the models. Dario has mentioned in interviews that this strength is one of the company's closely guarded secrets.

And I don't think we have a great eval benchmark that exactly measures this capability yet. SWE Bench seems to be pretty good, but there's already a lot of anecdotal comments that Claude is still better at coding than GPT 5, despite having similar scores on SWE Bench.

discuss

order

CuriouslyC|6 months ago

I've been testing AI as a beta reader for >100k novels, and I can tell you with 100% certainty that Claude gets confused about things across long contexts much sooner than either O3/GPT5 or Gemini 2.5. In my experience Gemini 2.5 and O3/GPT5 run neck and neck until around 80-100k tokens, then Gemini 2.5 starts to pull ahead and by 150k tokens it's absolutely dominant. Claude is respectable but clearly in third place.

https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/o... https://longbench2.github.io/

itsafarqueue|6 months ago

Really useful comment thanks. Reminder that LLMs aren’t just for coding.

libraryofbabel|6 months ago

Yeah, agree that the benchmarks don't really seem to reflect the community consensus. I wonder if part of it is the better symbiosis between the agent (Claude Code) and the Opus and Sonnet models it uses, which supposedly are fine-tuned on Claude Code tool calls? But agree, there is probably some additional secret sauce in the training, perhaps to do with RL on multi-step problems...

pcwelder|6 months ago

I get similar accuracy to claude code using claude desktop app with a file+bash mcp (different tools same performance).

My guess for why GPT5 scores more on benchmarks is that they evaluate on well defined tasks with all instructions given at the start.

Real life is multi turn. Multiple set of prompts to adhere to. This is where Claude is likely better.