LogicalRisk | 2 years ago
Perhaps by "random sample problems" you mean that the study is not representative of all of humanity? If so, we can still take the paper as evaluating these 415 English-speaking humans against the two models. If, as you say, the workers are actually just using LLMs, that implies there is some LLM available to the average MTurk worker that outperforms both GPT-4 and GPT-4V. That seems *extremely* unlikely, to say the least.
There is no need for any complex statistical analysis here, since the question is simply a comparison of scores on a test: a simple difference in means. Arguably, the one place that could benefit from additional statistical procedures is weighting the sample to be representative of a target population, but that would only matter for generalizing beyond this sample; it in no way affects the results of the study at hand.
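To make the distinction concrete, here is a rough sketch of the two calculations. The scores and weights are made up purely for illustration and are not from the paper; the function names are hypothetical too. The first number is the plain difference in means on the sample; the second is what you would compute only if you wanted to reweight toward some target population.

```python
from statistics import mean


def difference_in_means(human_scores, model_score):
    """Mean human score minus the model's score on the same test."""
    return mean(human_scores) - model_score


def weighted_mean(scores, weights):
    """Mean of scores where each participant counts in proportion to its weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Made-up illustrative data, NOT results from the study.
human_scores = [0.9, 0.8, 0.7, 0.6]
model_score = 0.5
# Hypothetical survey weights for a target population.
weights = [1.0, 1.0, 2.0, 2.0]

# The comparison the study actually needs: humans in the sample vs. the model.
gap = difference_in_means(human_scores, model_score)

# Only relevant when generalizing beyond the sample.
weighted_gap = weighted_mean(human_scores, weights) - model_score

print(f"sample gap: {gap:.3f}")
print(f"weighted gap: {weighted_gap:.3f}")
```

Weighting changes the estimate for the target population, but the unweighted gap for the people who actually took the test is what it is.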