top | item 22409033

craffel|6 years ago

Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!

modeless|6 years ago

I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.

lsb|6 years ago

Does that mean that answer grading would become like comparing summaries of a given text?

dmit|6 years ago

I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?

craffel|6 years ago

It's slightly more than that -- it also involves lowercasing and removing articles before testing for string equality.
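The normalization described (lowercasing and article removal before string equality) resembles the standard SQuAD-style "exact match" metric. A minimal sketch of that style of scorer, assuming the usual normalization steps (the exact pipeline in any given benchmark may differ):

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and the articles a/an/the,
    and collapse whitespace -- SQuAD-style normalization."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    # "Exact match" is plain string equality after normalization,
    # so == really is most of the secret sauce.
    return normalize_answer(prediction) == normalize_answer(ground_truth)

print(exact_match("The Eiffel Tower", "eiffel tower"))      # → True
print(exact_match("Eiffel Tower, Paris", "eiffel tower"))   # → False
```

This illustrates the brittleness upthread: a prediction that adds any extra (even correct) detail scores zero.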

svnpenn|6 years ago

Why are you replying to every single comment?

schoen|6 years ago

I think craffel (probably "Colin Raffel, Senior Research Scientist, Google Research") was directly involved in this research!