dargani123's comments

dargani123 | 11 years ago | on: Optimizely Statistics Engine

Hey guys,

Addressing a few comments right here. I think the industry deserves a lot of credit for its efforts to help those wanting to run A/B tests. Many people were aware of these issues, and many actually tried to fix them (us included). There are many blog posts in the community about why continuous monitoring is dangerous, why you should use a sample size calculator, how to properly set a Minimum Detectable Effect, etc. We were part of this group (and definitely not the first), as we published a sample size calculator and spent a lot of time working with our clients on running tests with a safe testing procedure.
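For readers following along, a fixed-horizon sample size calculator of the kind mentioned above is typically just the standard two-proportion z-test formula. Here is a minimal sketch; the function name and defaults are my own, and this is illustrative of the general technique, not Optimizely's exact implementation:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Visitors needed per variation for a two-proportion z-test.

    baseline_rate: control conversion rate (e.g. 0.10 for 10%)
    relative_mde:  minimum detectable effect, relative (e.g. 0.05 for +5%)
    alpha:         two-sided significance level
    power:         desired statistical power (1 - beta)
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)
```

Note how sensitive the answer is to the MDE: detecting a 5% relative lift on a 10% baseline requires tens of thousands of visitors per variation, which is exactly why committing to an effect size up front is such a burden in practice.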

However, after doing this and looking more closely to quantify the effect of these efforts, we saw an opportunity for a simpler solution that could help even more people. Sequential testing was that solution, and it has had success in other applications. We wanted to bring sequential testing to A/B testing and take the hard work out of doing it correctly. Specifically, we have built on the groundwork laid in the 1950s and 1960s by providing the always-valid notion of p-value that customers are looking for.
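The 1950s-60s groundwork referenced here is essentially Wald's Sequential Probability Ratio Test (SPRT), which lets you peek after every observation while still bounding error rates. A minimal sketch for Bernoulli outcomes follows; this is the classic textbook procedure, not Stats Engine itself, which uses a different always-valid construction:

```python
from math import log

def sprt(observations, p0, p1, alpha=0.05, beta=0.2):
    """Wald's SPRT for Bernoulli data: H0: p = p0 vs H1: p = p1 (p1 > p0).

    Returns ("accept_h1" | "accept_h0" | "continue", observations_used).
    """
    upper = log((1 - beta) / alpha)  # crossing above accepts H1
    lower = log(beta / (1 - alpha))  # crossing below accepts H0
    llr = 0.0                        # running log-likelihood ratio
    for i, x in enumerate(observations, 1):
        # Update the log-likelihood ratio with each conversion (1) or miss (0).
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", i
        if llr <= lower:
            return "accept_h0", i
    return "continue", len(observations)
```

The appeal is that you can check the boundary after every visitor with no peeking penalty; the drawback, as noted below, is that you must commit to specific effect sizes (p0, p1) up front.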

While traditional sequential testing combats the continuous monitoring problem well, it requires an intimate understanding of the method that can pose cognitive hurdles for those not well-versed in statistics. You have to either know your target effect size or have in mind a maximum allowable number of visitors, and understand how changes in these will affect the run time of your test. What's more, it is not straightforward to translate results into standard measures of significance such as p-values. This is actually where the biggest research contribution of Stats Engine comes in. We allow you to run a test, detect a range of effect sizes, and provide an always-valid FDR-adjusted p-value, as opposed to a set of stopping rules that bounds Type I error at, say, 5%. The error rates are valid no matter how the user chooses to interact with the A/B test. Also, FDR control itself has only been around for the last 20-25 years.
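For context on the FDR adjustment mentioned above: in the classic batch setting, FDR control is done with the Benjamini-Hochberg procedure, which converts a set of p-values (one per goal/variation pair) into FDR-adjusted values. A minimal sketch of that standard adjustment is below; Stats Engine's always-valid, sequential variant is more involved, so treat this only as an illustration of the batch technique:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg FDR-adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end  # 1-based rank of this p-value
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Declaring winners wherever the adjusted value falls below 0.05 bounds the expected fraction of false discoveries, rather than the per-test Type I error, which is the more useful guarantee when a dashboard shows many goals and variations at once.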

Our biggest industry contribution is probably much simpler: moving a lot of the market to sequential testing more generally. We are happy to be in a position to help build on this research and bring it to practical applications.

dargani123 | 11 years ago | on: Optimizely Statistics Engine

Thanks for your comment. This is Darwish, the Product Manager working on Stats Engine. You are correct: "classic statistics" is the method we used in the past. It is also what is most commonly used in industry (the main reason we started with this method). This was not an easy project for us to take on, but after talking to customers and looking at our historical experiment data, it was clear how important this problem was to solve, and that's why we spent a lot of resources on fixing it.

Just for those following along on this comment: it's not that "classic statistics" on their own are incorrect, but rather that the misuse of these statistics can be costly. When used "incorrectly" (not using a sample size calculator, running many goals and variations at a time, etc.), you can meaningfully increase your chance of making a bad business decision or commit yourself to unnecessarily large sample sizes.

Using statistics correctly is an industry-wide problem that many have tried to solve with education (i.e., giving statistics crash courses). We hope that our solution shows how important we think it is that statistics drive day-to-day decisions in organizations, and that there are different ways (change the math, not the customer) to get customers to this point. Many companies have data science teams and in-house statisticians that are very aware of these problems, but many don't, and that's really where we wanted to help out. You can read more about why we thought this was a serious problem here: http://blog.optimizely.com/2015/01/20/statistics-for-the-int...