Apparently OpenAI's Deep Research has already solved about a quarter of this benchmark, roughly a month after it came out. I imagine it still makes baffling mistakes, though.

"Humanity's Laster Exam" coming up when?