top | item 42514826

(no title)

deyiao | 1 year ago

The benchmark results seem unrealistically good, but I'm not sure from which angles I should challenge them.

I think they're real. The model is performing better than claude-3-5-sonnet-20241022 on the claude leaderboard: