item 7146956


204NoContent|12 years ago

Weird... I always thought that if you are running a split test in parallel (all branches at the same time), then you can figure out the number of samples needed to compare the branches with statistical confidence. I mean, it makes sense to me. As the number of samples in each branch increases, the sampling distribution of the conversion rate shifts from binomial to approximately Gaussian by the central limit theorem, and that happens around 1000 samples with a reasonable conversion rate. Then you're just comparing Gaussians, centered on the mean conversion rate, with a width that shrinks like 1/sqrt(n). Taking the difference of the two Gaussians gives you the "chance to be different". Standard practice is to wait until one branch has a 95% chance to be better and then declare it the winner. This guards against false positives, which is usually what you are concerned about. False negatives don't matter that much when it comes to things like picking a name.
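A minimal sketch of the comparison described above, under the normal approximation. The function name and inputs are my own choices; the "chance to be better" is just the standard normal CDF of the standardized difference of the two estimated rates:

```python
import math

def prob_b_beats_a(conv_a, n_a, conv_b, n_b):
    """Normal approximation to P(rate_B > rate_A) for two binomial branches.

    Reasonable once each branch has enough samples for the central limit
    theorem to kick in (roughly the ~1000 samples mentioned above).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # The difference of two independent Gaussians is Gaussian with the
    # variances summed; each rate estimate has variance p(1-p)/n.
    var = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    z = (p_b - p_a) / math.sqrt(var)
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

With 50/1000 conversions on A and 70/1000 on B this gives roughly a 97% chance that B is better, so the 95% rule above would declare B the winner.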

discuss

order

sp332|12 years ago

204NoContent|12 years ago

Thanks for the link to the blog post. It raised an important point worthy of inspection. I ran some numbers and "peeking" after the first 1000 trials does change the outcome. The chance that the outcome will reverse from declaring branch A the winner with 95% confidence to declaring branch B the winner with 95% confidence is rather small, less than 10%. However, if you lower your requirements to 80% confidence then the chance of the winner swapping increases to over 50%! For reference, I used the Wilson approximation for binomial distributions. I'm sure the Wald approximation fares worse.
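For reference, the Wilson interval mentioned above looks like this; it is a sketch with my own function name, but the formula itself is the standard Wilson score interval, which behaves better than the Wald interval at small n or extreme rates:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion at confidence
    level given by z (z=1.96 for 95%).

    Unlike the Wald interval, its center is pulled toward 1/2 and it
    never extends outside [0, 1], which matters when peeking early at
    low conversion rates.
    """
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half
```

For 50 conversions out of 1000 trials this gives an interval of roughly (0.038, 0.065) around the observed 5% rate.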

tlarkworthy|12 years ago

Interesting. This is a weakness of significance testing, in particular of its fixed-sample parametric model. Using Bayesian inference you would be able to look early without messing up your results.
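A minimal sketch of the Bayesian version, assuming uniform Beta(1, 1) priors on each branch's conversion rate (the function name and the Monte Carlo approach are my choices, not anything from the thread). The posterior for a branch with s conversions in n trials is Beta(s + 1, n - s + 1), and P(rate_B > rate_A) can be estimated by sampling both posteriors:

```python
import random

def prob_b_beats_a_bayes(succ_a, n_a, succ_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1)
    priors on both conversion rates.

    The posterior probability is a valid summary whenever it is
    computed, which is the sense in which looking early doesn't
    invalidate the analysis the way repeated significance tests do.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each branch: Beta(successes + 1, failures + 1).
        a = rng.betavariate(succ_a + 1, n_a - succ_a + 1)
        b = rng.betavariate(succ_b + 1, n_b - succ_b + 1)
        wins += b > a
    return wins / draws
```

For 50/1000 vs 70/1000 this agrees closely with the normal approximation (about a 97% chance that B is better), but it stays sensible at sample sizes far too small for the Gaussian picture.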