

I'm confused as to how you haven't found R1 to be much better. My experience has been exactly like the OP's.


What type of prompts were you feeding it? My limited understanding is that reasoning models will outperform non-reasoning LLMs like GPT-4/Claude at certain tasks but not others. On prompts whose answers are fuzzier and less deterministic (e.g., soft sciences), reasoning models will underperform, because their training revolves around RL with rewards.