hrgiger|4 years ago
Is there any reproducible measurement for benchmarking an NLP dataset/application? For example, the paper mentions:
'Comparing T0 and GPT-3’s robustness Because Brown et al. (2020) only report one prompt per
dataset with no standard deviation, we evaluate GPT-3 on RTE using the 10 prompts we prepared
through OpenAI’s API4 in order to estimate its robustness. Note that one of our templates is identical
to Brown et al. (2020, p. 59)’s reported prompt; this prompt scores 58.8% accuracy on the API
“Base” series which is lower than the reported accuracy of 63.5% from Brown et al. (2020). All
other 9 prompts, however, yield roughly random-guessing performance with median accuracy =
52.96% and interquartile range = 1.28%. These results suggest that T0 is more robust to prompt
formulation than GPT-3.'
srush|4 years ago
Yes, there are many reproducible measures for benchmarking NLP datasets. We use several of them in the paper.
The issue here is that we were not completely sure of the process OpenAI used in their paper. They report the prompt but not the process of finding it. Since their model and process are proprietary, it is hard for us to do an apples-to-apples comparison. This small experiment, though, indicates that GPT-3 is likely not very robust to prompt wording.
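As a minimal sketch of the kind of robustness summary the quoted passage reports (median accuracy and interquartile range across prompt formulations), here is how those statistics can be computed from per-prompt accuracies. The function name and the accuracy values are made up for illustration; only the statistics themselves correspond to what the quote describes:

```python
from statistics import quantiles

def prompt_robustness(accuracies):
    """Summarize accuracy across prompt formulations with robust
    statistics: median, interquartile range (IQR), and total spread.
    A small IQR means performance is stable across prompt wordings."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, med, q3 = quantiles(accuracies, n=4)
    return {
        "median": med,
        "iqr": q3 - q1,
        "spread": max(accuracies) - min(accuracies),
    }

# Hypothetical accuracies (%) for 10 prompt variants on one dataset;
# one outlier prompt performs well, the rest hover near chance.
accs = [52.3, 53.0, 52.7, 58.8, 52.1, 53.4, 52.9, 53.1, 52.5, 53.2]
summary = prompt_robustness(accs)
```

A model that is robust to prompt wording would show a small IQR even when the best single prompt scores much higher, which is exactly the comparison the quoted experiment makes between T0 and GPT-3.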