(no title)
krohling | 2 years ago
I was curious how this was measured since benchmarking accuracy for LLMs is tough. Found this in the paper: "This classification accuracy was benchmarked by manually analyzing over 400 papers across a range of representative searches, and comparing the human evaluation to the language model’s judgment"
I'm skeptical that their dataset of 400 papers with 3 classification labels (highly relevant, closely related, or ignorable) is large enough to represent the diversity of queries they're going to get from users. To be clear, I don't think this undermines (haha) the value of what they've built; it's still very cool.
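For a rough sense of what n=400 buys you, here's a quick back-of-the-envelope (plain Python; the 90% agreement rate is a made-up placeholder for illustration, not a number from the paper):

    import math

    def wilson_interval(p_hat, n, z=1.96):
        """95% Wilson score interval for a binomial proportion."""
        denom = 1 + z**2 / n
        center = (p_hat + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Hypothetical: the model agrees with human labels on 90% of 400 papers.
    lo, hi = wilson_interval(0.90, 400)
    print(f"95% CI: {lo:.3f} to {hi:.3f}")  # ~0.867 to 0.926

So the overall accuracy would be pinned down to within a few points, which is fine. The catch is that the number you actually care about is accuracy per query type, and once you slice 400 papers across even a handful of query categories, the per-slice n (and with it the confidence) drops off fast.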