hrgiger|4 years ago
Is there any reproducible measurement for benchmarking an NLP dataset/application? For example, the paper mentions:
'Comparing T0 and GPT-3’s robustness Because Brown et al. (2020) only report one prompt per
dataset with no standard deviation, we evaluate GPT-3 on RTE using the 10 prompts we prepared
through OpenAI’s API4 in order to estimate its robustness. Note that one of our templates is identical
to Brown et al. (2020, p. 59)’s reported prompt; this prompt scores 58.8% accuracy on the API
“Base” series which is lower than the reported accuracy of 63.5% from Brown et al. (2020). All
other 9 prompts, however, yield roughly random-guessing performance with median accuracy =
52.96% and interquartile range = 1.28%. These results suggest that T0 is more robust to prompt
formulation than GPT-3.'
srush|4 years ago
Yes, there are many reproducible measures for benchmarking NLP datasets. We use several of them in the paper.
The issue here is that we were not completely sure of the process OpenAI used in their paper. They report the prompt but not the process of finding it. Since their model and process are proprietary, it is hard for us to do an apples-to-apples comparison. This small experiment, though, indicates that GPT-3 is likely not very robust to prompt wording.
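As a minimal sketch of the kind of robustness summary the quoted passage reports (median accuracy and interquartile range across prompt formulations), here is how those statistics can be computed from per-prompt accuracies. The function name and the accuracy values are made up for illustration; only the statistics themselves correspond to what the quote describes:

```python
from statistics import quantiles

def prompt_robustness(accuracies):
    """Summarize accuracy across prompt formulations with robust
    statistics: median, interquartile range (IQR), and total spread.
    A small IQR means performance is stable across prompt wordings."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, med, q3 = quantiles(accuracies, n=4)
    return {
        "median": med,
        "iqr": q3 - q1,
        "spread": max(accuracies) - min(accuracies),
    }

# Hypothetical accuracies (%) for 10 prompt variants on one dataset;
# one outlier prompt performs well, the rest hover near chance.
accs = [52.3, 53.0, 52.7, 58.8, 52.1, 53.4, 52.9, 53.1, 52.5, 53.2]
summary = prompt_robustness(accs)
```

A model that is robust to prompt wording would show a small IQR even when the best single prompt scores much higher, which is exactly the comparison the quoted experiment makes between T0 and GPT-3.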