These are still questions that I'd assume are semantically similar to the ones you can find in exam prep material all over the internet. My point is that exams are a crutch we use to determine how well a person has studied a subject — a crutch we rely on because we seem to lack better measuring instruments. It's entirely possible to ace an exam while being terrible at actually applying the subject or working in it. I'd argue, therefore, that measuring how well LLMs perform on exams designed for humans is simply a more elaborate Turing test, with all of its shortcomings.