mmiyer | 1 year ago

I guess it's because it has the highest instruction-following score of all models, 20 points higher than Opus, which compensates for shortcomings elsewhere (e.g. in language) but wouldn't necessarily translate to human evaluation of usefulness.

simonw|1 year ago

Wow, yeah I think you're right - 3.3 somehow gets top position on the entire leaderboard for that category. I bet that skews the average up a lot.
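
A rough sketch of the skew effect, in case it helps. The numbers below are made up purely for illustration (they are not from the leaderboard), the model and category names are placeholders, and it assumes the overall score is a plain unweighted mean of the category scores, which may not be how the leaderboard actually computes it:

```python
from statistics import mean

# Hypothetical per-category scores; invented for illustration only,
# NOT taken from any real leaderboard.
scores = {
    "model_a": {"reasoning": 60.0, "coding": 58.0, "language": 50.0,
                "instruction_following": 90.0},
    "model_b": {"reasoning": 64.0, "coding": 62.0, "language": 60.0,
                "instruction_following": 70.0},
}

def average(category_scores, exclude=()):
    """Unweighted mean over categories, optionally leaving some out."""
    return mean(v for k, v in category_scores.items() if k not in exclude)

for name, cats in scores.items():
    overall = average(cats)
    without_if = average(cats, exclude=("instruction_following",))
    print(f"{name}: overall={overall:.1f}, without IF={without_if:.1f}")
```

With these invented numbers, model_a edges out model_b on the overall mean even though it trails in every other category, because a 20 point lead in one of four categories is worth 5 points on the average.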