nopinsight | 23 days ago
* GDPval Elo: 1606 vs. GPT-5.2's 1462. OpenAI reported that GPT-5.2 has a 70.9% win-or-tie rate against human professionals (https://openai.com/index/gdpval/). Treating that rate as an Elo expected score and adding the 144-point gap between the two models, we can estimate Opus 4.6's win-or-tie rate against human pros at 85–88%.
* OSWorld: 72.7%, matching human performance at ~72.4% (https://os-world.github.io/). Since the human subjects were CS students and professionals, they were likely at least as competent as the average knowledge worker. The original OSWorld benchmark is somewhat noisy, but even if the model remains somewhat inferior to humans, it is only a matter of time before it catches up with or surpasses them.
* BrowseComp: At 84%, it is approaching human intersubject agreement of ~86% (https://openai.com/index/browsecomp/).
Taken together, this suggests that digital knowledge work will be transformed quite soon, possibly drastically if agent reliability improves beyond a certain threshold.
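For anyone who wants to check the Elo estimate in the first bullet, here is a rough sketch of the arithmetic. It assumes the leaderboard uses the standard logistic Elo model with a 400-point scale, and that a win-or-tie rate can be read as an Elo expected score; both are my assumptions, not something the benchmark pages confirm. The function and variable names are made up for illustration.

```python
import math

def elo_gap_from_score(p: float) -> float:
    """Rating gap implied by an expected score p (standard 400-point logistic Elo)."""
    return 400.0 * math.log10(p / (1.0 - p))

def expected_score(gap: float) -> float:
    """Expected score for the side that is `gap` rating points ahead."""
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

# Figures quoted in the comment above.
GPT52_ELO = 1462
OPUS46_ELO = 1606
GPT52_VS_HUMAN = 0.709  # assumption: win-or-tie rate treated as an expected score

# Back out an implied "human professional" rating from GPT-5.2's result.
human_elo = GPT52_ELO - elo_gap_from_score(GPT52_VS_HUMAN)

# Apply the same model to Opus 4.6's rating.
opus_vs_human = expected_score(OPUS46_ELO - human_elo)

print(f"implied human Elo: {human_elo:.0f}")
print(f"estimated Opus 4.6 win-or-tie rate: {opus_vs_human:.1%}")
```

Under these assumptions the result comes out at roughly 85%, near the low end of the range quoted above. Counting ties differently, or using a different rating scale, would shift the number.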
rishabhaiover | 23 days ago