First, you should really move away from frequentist statistical testing and use Bayesian statistics instead. It is ideal for occasions like this, where you want to adjust your beliefs about which UX is best based on empirical data. As you collect data, you increase confidence in your decision rather than trying to meet an arbitrary criterion such as a specific p-value.
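To make the "confidence grows as data accumulates" point concrete, here is a minimal sketch: under a Beta-Binomial model with a uniform prior, the posterior over a conversion rate narrows as observations accumulate. The function and the counts below are purely illustrative, not from the article.

```python
import math

def beta_posterior_sd(successes, failures, prior_a=1.0, prior_b=1.0):
    """Standard deviation of the Beta(prior_a + successes, prior_b + failures)
    posterior over a conversion rate (uniform Beta(1, 1) prior by default)."""
    a = prior_a + successes
    b = prior_b + failures
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Same observed 10% conversion rate, with increasing amounts of data --
# the posterior uncertainty shrinks roughly with sqrt(n):
print(beta_posterior_sd(10, 90))      # ~0.031
print(beta_posterior_sd(100, 900))    # ~0.0095
print(beta_posterior_sd(1000, 9000))  # ~0.0030
```

The closed form works because the Beta prior is conjugate to the Binomial likelihood, so no sampling machinery is needed for this simple case.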
Second, the “run-in-parallel” approach has a well-defined name in experimental design: a factorial design. The diagram shown is an example of a full factorial design, in which each level of each factor is combined with each level of every other factor. The advantage of such a design is that interactions between factors can be tested as well. If there are good reasons to believe there are no interactions between the different factors, you could use a partial factorial design instead, which has the advantage of requiring fewer total combinations of levels while still enabling estimation of the effects of individual factors.
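A minimal sketch of what a full factorial layout looks like in code; the factor names and levels below are hypothetical, purely for illustration:

```python
from itertools import product

# Hypothetical factors for a UX experiment, two levels each:
factors = {
    "headline": ["A", "B"],
    "button":   ["green", "blue"],
    "layout":   ["one-col", "two-col"],
}

# Full factorial: every combination of every level (2 * 2 * 2 = 8 cells).
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]
print(len(cells))  # 8
```

A partial (fractional) factorial design would run only a chosen subset of these 8 cells, trading away the ability to estimate some interactions in exchange for fewer combinations.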
Disagree on using Bayesian statistics. Frequentist statistics are perfect for A/B testing.
There are so many strong biases people have about different parts of UI/UX. One of the significant benefits of A/B testing is that it lets you move ahead as a team and make decisions even when there are strongly differing opinions on your team. In these cases you can just "A/B test" and let the data decide.
But if you are using Bayesian approaches, you'll just shift those internal arguments to what the prior should be, and it will be harder to get alignment based on the data.
Building your own Bayesian model with something like pymc3 is also a very reasonable approach to take with small data, or with data that has too much variance to detect effects in a timely manner. It also forces you to think about the underlying distributions that generate your data, which is an exercise that can yield interesting insights in itself.
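As a lighter-weight alternative to a full pymc3 model, here is a sketch of the same idea using conjugate Beta posteriors and Monte Carlo sampling to estimate P(variant beats control). The conversion counts are made up for illustration:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1 + conversions, 1 + non-conversions) posteriors (uniform priors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws

# Hypothetical data: 100/1000 conversions on control, 120/1000 on variant.
print(prob_b_beats_a(100, 1000, 120, 1000))  # roughly 0.92
```

The output reads directly as "there is about a 92% chance the variant is better", which is often an easier statement to align a team around than a p-value.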
Yes: you could use Bayesian priors and a custom model to give yourself more confidence from less data. But...
Don't: for most businesses that are so early they can't get enough users to hit stat-sig, you're likely to be better off leveraging your engineering efforts towards making the product better instead of building custom statistical models. This is nerd-sniping-adjacent (https://xkcd.com/356/), a common trap engineers can fall into: it's more fun to solve the novel technical problem than the actual business problem.
Though: there are a small set of companies with large scale but small data, for whom the custom stats approaches _do_ make sense. When I was at Opendoor, even though we had billions of dollars of GMV, we only bought a few thousand homes a month, so the Data Science folks used fun statistical approaches like Pair Matching (https://www.rockstepsolutions.com/blog/pair-matching/) and CUPED (now available off the shelf - https://www.geteppo.com/features/cuped) to squeeze a bit more signal from less data.
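A rough sketch of the CUPED idea itself (not Opendoor's or Eppo's actual implementation): regress the experiment metric on a pre-experiment covariate and subtract the predicted component, which shrinks variance without shifting the mean. The toy data here is synthetic:

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED: reduce the variance of metric y using a pre-experiment
    covariate x (e.g. each user's pre-period spend).
    theta = cov(x, y) / var(x); adjusted y_i = y_i - theta * (x_i - mean(x))."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

# Toy data where the experiment metric is strongly predicted by
# each user's pre-period value:
rng = random.Random(1)
pre = [rng.gauss(100, 20) for _ in range(500)]
post = [p + rng.gauss(5, 5) for p in pre]

adj = cuped_adjust(post, pre)
print(variance(post), variance(adj))  # adjusted variance is much smaller
```

Because the adjustment only subtracts a mean-centered term, the average of the metric is unchanged; you just need far fewer samples to detect the same effect.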
> Gut Check: Especially if you’re off by quite a bit, this is a chance to take a step back and ask whether the company has reached growth scale or not. It could be that there are plenty of obvious 0-1 tactics left. Not everything has to be an experiment.
This is a key point, imo. I have a sneaking suspicion that a lot of companies are running "growth teams" that don't have the scale where it actually makes sense to do so.
Everything has to be a test early on, but not every test has to rely on random-split statistical significance to make a decision. “Would you pay $20 for this?” is a classic way to judge whether your service has product-market fit, and it’s not about sample sizes, at least not initially.
Some growth teams try more exploratory approaches, using simpler methods to find something that resonates. Others rely on A/B tests. Different profiles, but both are “Growth teams”.
There's an argument to be made that, so long as your testing fully encompasses all visitors to your site, you aren't sampling the population, you're fully observing it, and statistical significance is irrelevant.
Sites are always gaining new visitors and losing old ones, and the visitors they have observed return irregularly (or frequently, or somewhere in between). So it’s not realistic to assume a given sample of visitors is the population.
That argument misses that you are using past users’ behaviour as representative of future users’ preferences. You are not sampling marbles in a jar; you are making a lot of assumptions, notably about continuity.
“Using modern experiment frameworks, all 3 of ideas can be safely tested at once, using parallel A/B tests (see chart).”
Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments, so your significance calculation is now off. Second, xkcd 882.
> Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments
I get that a bunch at some of my clients. It's a common misconception. Let's say experiment B is 10% better than control but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.
If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.
NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, as I called out in the guide. But better to spend a half day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
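A quick simulation (synthetic users and made-up rates) of the point above: because each user is randomized into each experiment independently, B's lift lands evenly on both of C's branches and doesn't bias C's comparison.

```python
import random

rng = random.Random(42)

# 100k simulated users, each independently randomized into experiment B
# and experiment C. B's variant lifts conversion by 10%; C's does nothing.
users = []
for _ in range(100_000):
    in_b = rng.random() < 0.5
    in_c = rng.random() < 0.5
    rate = 0.10 * (1.10 if in_b else 1.0)
    users.append((in_b, in_c, rng.random() < rate))

def conv(rows):
    return sum(r[2] for r in rows) / len(rows)

c_on = [u for u in users if u[1]]
c_off = [u for u in users if not u[1]]
# B's winners are split evenly across C's branches, so C still reads flat:
print(round(conv(c_on), 3), round(conv(c_off), 3))  # both ~0.105
```

Both of C's branches sit near the blended rate (0.105, i.e. the baseline plus half of B's lift), exactly the pre/post effect described above, while the within-experiment comparison for C stays unbiased.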
> Second, xkcd 882

I think you're referencing p-hacking (https://xkcd.com/882/), right? That is a valid concern to be vigilant for. In this case, xkcd is calling out the "find a subgroup that happens to be positive" hack (also here: https://xkcd.com/1478/). However, here we're (a) testing 3 different ideas and (b) only testing each of them once, on the entire population. No p-hacking here (as far as I can tell; happy to learn otherwise), but good that you're keeping an eye out for it.
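For intuition on the xkcd 882 trap itself, a one-liner: the chance of at least one false positive grows quickly with the number of subgroups tested at a fixed significance level.

```python
def any_false_positive(k, alpha=0.05):
    """Probability of at least one false positive across k independent
    tests at significance level alpha (the "jelly bean" trap)."""
    return 1 - (1 - alpha) ** k

print(any_false_positive(1))   # 0.05
print(any_false_positive(20))  # ~0.64 -- the comic's 20 jelly bean colors
```

Testing 3 ideas once each keeps this inflation small; slicing one experiment into 20 post-hoc subgroups does not.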
> But if you are using Bayesian approaches you'll transition those internal arguments to what the prior should be

The frequentist/Bayesian debate is not one I understand well enough to opine on. Do you have any reading you'd recommend for this topic?
> it's more fun to solve the novel technical problem than the actual business problem

I always say that in my profession I will fit models for free; it’s having to clean the data and “finish” a project that I demand payment for.