HALtheWise | 1 year ago

The name is very intentional: this isn't "AI's Last Evaluation", it's "Humanity's Last Exam". There will absolutely be further tests for evaluating the power of AIs, but the intent of this benchmark is that any more difficult benchmark will either be:

- Not an "exam" composed of closed-form questions with single, objectively correct answers, or

- Not made up of questions that humans, or humanity collectively, are capable of answering.

For example, a future evaluation for an LLM could consist of playing chess really well, proving the Riemann Hypothesis, or curing some disease, but those aren't tasks you would ever put on an exam for a student.

krackers | 1 year ago

Isn't FrontierMath a better "last exam"? Looking through a few of the questions, they seem less reasoning-based and more fact-based. There's no way one could answer "How many paired tendons are supported by this sesamoid bone [bilaterally paired oval bone of hummingbirds]" without either having a physical model to dissect or just regurgitating the info found somewhere authoritative. It seems like the only reason a lot of the questions can't be solved yet is that the knowledge is specialized enough that it simply isn't found on the web; you'd have to phone up the one guy who worked on it.