(no title)
krohling | 2 years ago
I was curious how this was measured since benchmarking accuracy for LLMs is tough. Found this in the paper: "This classification accuracy was benchmarked by manually analyzing over 400 papers across a range of representative searches, and comparing the human evaluation to the language model’s judgment"
I'm skeptical that their dataset of 400 papers with 3 classification labels (highly relevant, closely related, or ignorable) is large enough to represent the diversity of queries they're going to get from users. To be clear, I don't think this undermines (haha) the value of what they've built; it's still very cool.
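For a rough sense of what n=400 buys you, here's a quick back-of-the-envelope (plain Python; the 90% agreement rate is a made-up placeholder for illustration, not a number from the paper):

    import math

    def wilson_interval(p_hat, n, z=1.96):
        """95% Wilson score interval for a binomial proportion."""
        denom = 1 + z**2 / n
        center = (p_hat + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical: the model agrees with human labels on 90% of 400 papers.
    lo, hi = wilson_interval(0.90, 400)
    print(f"95% CI: {lo:.3f} to {hi:.3f}")  # ~0.867 to 0.926

So the overall accuracy would be pinned down to within a few points, which is fine. The catch is that the number you actually care about is accuracy per query type, and once you slice 400 papers across even a handful of query categories, the per-slice n (and with it the confidence) drops off fast.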