top | item 44834564

(no title)

Gemini 2.5 Pro is severely kneecapped in this evaluation. Limit of 4096 thinking tokens is way too low; I bet o3 is generating significantly more.

discuss

energy123|6 months ago

For o3, I set reasoning_effort "high" and it's usually 1000-2000 reasoning tokens for routine coding questions.

I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.