
nopinsight | 23 days ago

Some of Opus 4.6's standout results for me:

* GDPVal Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals. (https://openai.com/index/gdpval/) Based on Elo math, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.

* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, so the model may still lag humans slightly, but it is only a matter of time before it catches up or surpasses them.

* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).
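
The Elo estimate in the first bullet can be sanity-checked with a short script. This is a back-of-the-envelope sketch under two assumptions of mine (not stated in the sources): that GDPVal Elo follows the standard logistic Elo formula, and that the reported win-or-tie rate maps directly onto the Elo expected score.

```python
import math

def elo_expected(r_a, r_b):
    # Standard logistic Elo expected score for player A vs. player B
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# Reported GDPVal Elo scores (from the comment above)
opus_elo = 1606
gpt_elo = 1462

# GPT-5.2's reported win-or-tie rate against human professionals
gpt_vs_human = 0.709

# Invert the Elo formula to get the implied human-professional rating
human_elo = gpt_elo - 400 * math.log10(gpt_vs_human / (1 - gpt_vs_human))

# Expected win-or-tie rate of Opus 4.6 against that implied rating
opus_vs_human = elo_expected(opus_elo, human_elo)
print(round(opus_vs_human, 2))  # ~0.85
```

This lands at roughly 85%, the low end of the 85–88% range claimed above; the exact figure depends on how ties are modeled, which neither source specifies.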

Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.


rishabhaiover | 23 days ago

Agreed. These metrics, plus my personal use, point to reliably intelligent behavior under sustained usage. Going forward, if context windows keep growing and token prices keep falling, I have a hard time seeing why your argument would be wrong.