alextheparrot | 7 months ago
I don’t think we should assume answering a test would be easy for a Scantron machine just because it is very good at grading them, either.
tempfile | 7 months ago
There is no evidence that an LLM can reliably evaluate the semantic content of a sentence, even in cases where we all agree that the semantic content exists. The thread we are participating in demonstrates a particularly egregious failure, and there is no good reason to believe subtler failures won't remain even if we happen to patch this one. Even if LLMs were reliable evaluators, you can't evaluate a system with itself - that is basic science.