nopinsight | 24 days ago

From Claude 4.6 Thinking:

OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.

Scores on Verified tend to run higher, so they're not directly comparable.
