top | item 42891882

(no title)

That would be 12%, why would you assume that is eaten by statistical noise?

discuss

The OPs comment is probably a testament of that. With such a poorly designed A/B test I doubt this has a p-value of < 0.10.

throwaway287391|1 year ago

Erm, why not? A 0.56 result with n=1000 ratings is statistically significantly better than 0.5 with a p-value of 0.00001864, well beyond any standard statistical significance threshold I've ever heard of. I don't know how many ratings they collected but 1000 doesn't seem crazy at all. Assuming of course that raters are blind to which model is which and the order of the 2 responses is randomized with every rating -- or, is that what you meant by "poorly designed"? If so, where do they indicate they failed to randomize/blind the raters?

aqme28|1 year ago

They even include error bars. It doesn't seem to be statistical noise, but it's still not great.