
LogicalRisk | 2 years ago

From reading the paper, and the original paper that the MTurk/Prolific data are drawn from, this is a convenience sample of 415 humans across two platforms. Each worker received a random sample of the ConceptARC problems, and the average proportion correct is taken as the "Human" benchmark.
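
To make that aggregation concrete, here is a minimal sketch of the design in Python. The sample size is from the paper; the problem-pool size, problems per worker, and per-problem accuracies are all hypothetical placeholders, not the study's actual numbers:

    import numpy as np

    rng = np.random.default_rng(0)

    n_workers = 415   # convenience sample size reported in the paper
    n_problems = 480  # hypothetical size of the ConceptARC problem pool
    per_worker = 16   # hypothetical number of problems shown per worker

    # Hypothetical per-problem failure probability, standing in for difficulty.
    difficulty = rng.uniform(0.1, 0.5, size=n_problems)

    # Each worker attempts a random subset of problems; their score is the
    # fraction of their assigned problems they solve.
    worker_scores = []
    for _ in range(n_workers):
        assigned = rng.choice(n_problems, size=per_worker, replace=False)
        correct = rng.random(per_worker) > difficulty[assigned]
        worker_scores.append(correct.mean())

    # The "Human" benchmark is simply the mean score across workers.
    human_benchmark = float(np.mean(worker_scores))
    print(f"Human benchmark: {human_benchmark:.3f}")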

Perhaps by "random sample problems" you mean that the study is not representative of all of humanity? If so, we can still take the paper as evaluating these 415 English-speaking humans against the two models. If, as you say, the workers are actually just using LLMs, then this implies there is some LLM that the average MTurk worker has access to that outperforms GPT-4 and GPT-4V. That seems *extremely* unlikely, to say the least.

There is no need for any complex statistical analysis here, since the question is simply a comparison of scores on a test: a plain difference in means. Arguably, the one place that could benefit from an additional statistical procedure is weighting the sample to be representative of a target population, but that in no way affects the results of the study at hand.
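
Concretely, the whole comparison fits in a few lines. The scores below are made up, and the Welch's t-test and post-stratification weights are my own illustrations of what "additional procedures" could look like, not anything from the paper:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Hypothetical per-worker and per-run scores (proportion correct).
    human_scores = rng.normal(0.88, 0.08, size=415).clip(0, 1)
    model_scores = rng.normal(0.65, 0.05, size=30).clip(0, 1)

    # The comparison is just a difference in means...
    diff = human_scores.mean() - model_scores.mean()

    # ...with, at most, a Welch's t-test to quantify the uncertainty.
    t, p = stats.ttest_ind(human_scores, model_scores, equal_var=False)
    print(f"difference in means: {diff:.3f} (t = {t:.2f}, p = {p:.2g})")

    # If representativeness mattered, reweighting the sample toward a target
    # population is just a weighted mean over workers.
    weights = rng.uniform(0.5, 2.0, size=415)  # hypothetical weights
    weighted_human = np.average(human_scores, weights=weights)
    print(f"weighted human mean: {weighted_human:.3f}")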
