Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5: a one-sided binomial test gives a p-value of about 0.00008, well beyond any standard significance threshold I've ever heard of. I don't know how many ratings they collected, but 1000 doesn't seem crazy at all. That assumes, of course, that raters are blind to which model is which and that the order of the two responses is randomized with every rating -- or is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?
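For concreteness, here's a minimal sketch of that test in Python (the 560/1000 counts are hypothetical; the real rating totals aren't stated):

    # Hypothetical counts: 560 of 1000 blind ratings preferred model A.
    from scipy.stats import binomtest

    result = binomtest(k=560, n=1000, p=0.5, alternative="greater")
    print(result.pvalue)  # ~8.4e-5, far below the usual 0.05 threshold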
godelski|1 year ago
Note that the Non-STEM result is still within the error bars, and the STEM result is less than 2 sigma from chance, so our confidence still shouldn't be that high.
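As a rough sketch of what the sigma count means here (a hypothetical helper, not their actual analysis), the distance of an observed win rate from 0.5, measured in standard errors, is:

    import math

    # How many standard errors an observed win rate sits from the 50/50 null.
    def sigmas_from_chance(wins: int, n: int) -> float:
        se = math.sqrt(0.25 / n)  # binomial standard error under p = 0.5
        return (wins / n - 0.5) / se

    # e.g. sigmas_from_chance(560, 1000) -> ~3.79; a smaller per-category
    # sample, or a rate closer to 0.5, can easily fall under the 2-sigma bar.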
n2d4|1 year ago
If 10% of people just click based on how fast the response was, because they don't want to read both outputs, then your p-value for the hypothesis you actually care about (that raters genuinely prefer the responses) will be atrocious, no matter how large the sample is.
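A toy simulation of that confound (all numbers assumed): raters with zero genuine content preference, plus 10% who always pick the faster model, still produce an overwhelmingly "significant" win rate:

    import random
    from scipy.stats import binomtest

    random.seed(0)
    n = 100_000
    wins_a = 0
    for _ in range(n):
        if random.random() < 0.10:   # speed-clicker: always picks the faster model A
            wins_a += 1
        elif random.random() < 0.5:  # everyone else: genuinely indifferent coin flip
            wins_a += 1

    print(wins_a / n)                            # ~0.55
    print(binomtest(wins_a, n=n, p=0.5).pvalue)  # astronomically small

The test correctly rejects "clicks split 50/50", but that was never the hypothesis of interest.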
throwaway287391|1 year ago
johnmaguire|1 year ago
Couldn't this be considered a form of preference?
Whether it's the type of preference OpenAI was testing for, or the type of preference you care about, is another matter.