item 36354280

Annoying A/B testing mistakes

292 points | Twixes | 2 years ago | posthog.com

149 comments

[+] alsiola|2 years ago|reply
On point 7 (Testing an unclear hypothesis): while agreeing with the overall point, I strongly disagree with the examples.

> Bad Hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.

This is succinct and clear, and it is obvious what the variable and the measure will be.

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page. This will then lead to more purchases.

> User research showed that users are unsure of how to proceed to the checkout page.

Not a hypothesis, but a problem statement. Cut the fluff.

> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.

This is now two hypotheses.

> This will then lead to more purchases.

Sorry I meant three hypotheses.

[+] mgsouth|2 years ago|reply
* Turns out, folks are seeing the "buy" button just fine. They just aren't smitten with the product. Making "buy" more attention-grabbing gets them to the decision point sooner, so they close the window.

* Turns out, folks see the "buy". Many don't understand why they would want it. Some of those are converted after noticing and reading an explanatory blurb in the lower right. A more prominent "buy" button distracts from that, leading to more "no".

* For some reason, a flashing puke-green "buy" button is less noticeable, as evidenced by users closing the window at a much higher rate.

Including untestable reasoning in a chain of hypotheses leads to false confirmation of your clever hunches.

[+] travisjungroth|2 years ago|reply
The biggest issue with those three hypotheses is that one of them, noticing the button, almost certainly isn't being tested. But how the test goes will inform how people think about that hypothesis.
[+] kevinwang|2 years ago|reply
It is surely helpful to have a "mechanism of action" so that you're not just blindly A/B testing and falling victim to coincidences like https://xkcd.com/882/.

Not sure if people do this, but with a mechanism of action in place you can state a prior belief and turn your AB testing results into actual posteriors instead of frequentist metrics like p-values which are kind of useless.
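
A minimal sketch of that idea, assuming independent Beta(1, 1) priors on each arm's conversion rate and made-up conversion counts; the function name is mine:

```python
# Turning A/B counts into a posterior statement instead of a p-value.
# With a Beta(1, 1) prior, the conjugate posterior for an arm is
# Beta(1 + conversions, 1 + failures).
import random

def posterior_prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A | data)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws
```

With, say, 100/1000 conversions on A and 150/1000 on B, this returns a probability close to 1, which reads more directly than a p-value.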

[+] ssharp|2 years ago|reply
I don't think these examples are bad. From a clarity standpoint, where you have multiple people looking at your experiments, the first one is quite bad and the second one is much more informative.

Requiring a user problem, proposed solution, and expected outcome for any test is also good discipline.

Maybe it's just getting into pedantry over the word "hypothesis", and you would expect the other information elsewhere in the test plan?

[+] thingification|2 years ago|reply
As kevinwang has pointed out in slightly different terms: the hypothesis that seems woolly to you seems sharply pointed to others (and vice versa) because explanationless hypotheses ("changing the colour of the button will help") are easily variable (as are the colours of the xkcd jelly beans), while hypotheses that are tied strongly to an explanation are not. You can test an explanationless hypothesis, but that doesn't get you very far, at least in understanding.

As usual here I'm channeling David Deutsch's language and ideas on this, I think mostly from The Beginning of Infinity, which he delightfully and memorably explains using a different context here: https://vid.puffyan.us/watch?v=folTvNDL08A (the yt link if you're impatient: https://youtu.be/folTvNDL08A - the part I'm talking about starts at about 9:36, but it's a very tight talk and you should start from the beginning).

Incidentally, TED head Chris Anderson said one of these Deutsch talks - not sure if this one or the earlier one - was his all-time favourite.

plagiarist:

> That doesn't test noticing the button, that tests clicking the button. If the color changes it is possible that fewer people notice it but are more likely to click in a way that increases total traffic.

"Critical rationalists" would first of all say: it does test noticing the button, since tests are a shot at refuting the theory, here by showing no effect. But also, and less commonly understood: even if there is no change in your A/B test - an apparently successful refutation of the "people will click more because they'll notice the colour" theory - experimental tests are fallible too, just like everything else.

[+] kimukasetsu|2 years ago|reply
The biggest mistake engineers make is in determining sample sizes. It is not trivial to determine the sample size for a trial without prior knowledge of effect sizes. Instead of waiting for a fixed sample size, I would recommend using a sequential testing framework: set a stopping condition and perform a test on each new batch of sample units.

This is called optional stopping, and it is not possible with a classic t-test, since its Type I and II error guarantees only hold at a predetermined sample size. However, other tests make it possible: see safe anytime-valid statistics [1, 2] or, simply, Bayesian testing [3, 4].

[1] https://arxiv.org/abs/2210.01948

[2] https://arxiv.org/abs/2011.03567

[3] https://pubmed.ncbi.nlm.nih.gov/24659049/

[4] http://doingbayesiandataanalysis.blogspot.com/2013/11/option...
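
Not the safe anytime-valid machinery of [1, 2], but a minimal classic instance of "set a stopping condition and test each batch" is Wald's sequential probability ratio test for a Bernoulli metric; the candidate rates p0, p1 and the error levels below are illustrative:

```python
# Wald's SPRT: accumulate a log-likelihood ratio per observation and stop
# as soon as it crosses a boundary. Error control holds at whatever n you stop.
import math

def sprt(observations, p0=0.10, p1=0.15, alpha=0.05, beta=0.20):
    """Process 0/1 observations one at a time; stop when a boundary is crossed."""
    upper = math.log((1 - beta) / alpha)   # cross -> accept H1 (p = p1)
    lower = math.log(beta / (1 - alpha))   # cross -> accept H0 (p = p0)
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", len(observations)
```

The appeal is visible in the return value: a lopsided stream of observations can end the test after a handful of units instead of a fixed horizon.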

[+] travisjungroth|2 years ago|reply
People often don’t determine sample sizes at all! And doing power calculations without an idea of effect size isn’t just hard but impossible. It’s one of the inputs to the formula. But at least it’s fast so you can sort of guess and check.
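
The guess-and-check version of that power calculation, using the standard normal-approximation formula for two proportions (the baseline rate, lift, alpha, and power here are example inputs):

```python
# Sample size per arm for a two-proportion z-test, normal approximation.
from statistics import NormalDist
import math

def samples_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance quantile
    z_b = NormalDist().inv_cdf(power)           # power quantile
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return math.ceil((z_a + z_b) ** 2 * variance / (p_base - p_variant) ** 2)
```

The minimum detectable effect (p_variant - p_base) is an input, which is the point above: halve the effect you want to detect and the required sample size roughly quadruples.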

Anytime-valid inference helps with this situation, but it doesn't solve it. If you're trying to detect a small effect, it would be nicer to figure out up front that you need a million samples, rather than learning it because your test, at 1,000 samples a day, took three years.

Still, anytime is way better than fixed IMO. Fixed almost never really exists. Every A/B testing platform I’ve seen allows peeking.

I work with the author of the second paper you listed. The math looks advanced, but it’s very easy to implement.

[+] hackernewds|2 years ago|reply
The biggest mistake is engineers owning experimentation. It should be owned by data scientists.

I realize that's a luxury, but I also see this trend at blue-chip companies.

[+] mtlmtlmtlmtl|2 years ago|reply
Surprised no one said this yet, so I'll bite the bullet.

I don't think A/B testing is a good idea at all for the long term.

Seems like a recipe for having your software slowly evolve into a giant heap of dark patterns. When a metric becomes a target, it ceases to be a good metric.

[+] withinboredom|2 years ago|reply
More or less, it tells you the "cost" of removing an accidental dark pattern. For example, we had three paid plans and a free plan. The button for the free plan was under the plans, front and center ... unless you had a screen/resolution that most of our non-devs/designers had.

So, at users' most common resolution, the free-plan button sat just below the fold.

This was an accident, though some of our users called us out for it -- suggesting we'd removed the free plan altogether.

So, we a/b tested moving the button to the top.

Moving it up would REALLY have hurt the bottom line, and the result explained some of the growth we'd experienced. Removing the "dark pattern" would have meant laying off some people.

I think you can guess which option was chosen -- and it's still in place.

[+] cantSpellSober|2 years ago|reply
Good multivariate testing and (statistically significant) data doesn't do that. It shows you lots of ways to improve your UX, and whether your guesses at improving UX actually work. Example from TFA:

> more people signed up using Google and Github, overall sign-ups didn't increase, and nor did activation

Less friction on login for the user, 0 gains in conversions, they shipped it anyway. That's not a dark pattern.

If you're intentionally trying to make dark patterns it will help with that too I guess; the same way a hammer can build a house, or tear it down, depending on use.

[+] activiation|2 years ago|reply
> Seems like a recipe for having your software slowly evolve into a giant heap of dark patterns.

Just don't test for dark patterns?

[+] hackernewds|2 years ago|reply
Let's ship the project of those that bang the table, and confirm our biases instead.
[+] matheusmoreira|2 years ago|reply
I don't think it should even be legal. Why do these corporations think they can perform human experimentation on unwitting subjects for profit?
[+] withinboredom|2 years ago|reply
I built an internal a/b testing platform with a team of 3-5 over the years. It needed to handle extreme load (hundreds of millions of participants in some cases). Our team also had a sister team responsible for teaching/educating teams about how to do proper a/b testing -- they also reviewed implementations/results on-demand.

Most of the a/b tests they reviewed (note the survivorship bias here, they were reviewed because they were surprising results) were incorrectly implemented and had to be redone. Most companies I worked at before or since did NOT have a team like this, and blindly trusted the results without hunting for biases, incorrect implementations, bugs, or other issues.

[+] indymike|2 years ago|reply
> It needed to handle extreme load (hundreds of millions of participants in some cases).

I can see extreme loads being valuable for an A/B test of a pipeline change or something that needs that load... but for the kinds of A/B testing UX and marketing do, leaning on statistical significance seems to be the smart move. There is a point beyond which a larger sample is only trivially more accurate than a smaller one.

https://en.wikipedia.org/wiki/Sample_size_determination

[+] srveale|2 years ago|reply
Do you know if there were common mistakes for the incorrect implementations? Were they simple mistakes or more because someone misunderstood a nuance of stats?
[+] rockostrich|2 years ago|reply
Same experience here for the most part. We're working on migrating away from an internal tool which has a lot of problems: flags can change in the middle of user sessions, limited targeting criteria, changes to flags require changes to code, no distinction between feature flags and experiments, experiments often target populations that vary greatly, experiments are "running" for months and in some cases years...

Our approach to fixing these problems starts with having a golden path for running an experiment which essentially fits the OP. It's still going to take some work to educate everyone but the whole "golden path" culture makes it easier.

[+] Sohcahtoa82|2 years ago|reply
The one mistake I assume happens too much is trying to measure "engagement".

Imagine a website is testing a redesign, and they want to decide if people like it by measuring how long they spend on the site to see if it's more "engaging". But the new site makes information harder to find, so they spend more time on the site browsing and trying to find what they're looking for.

Management goes, "Oh, users are delighted with the new site! Look how much time they spend on it!" not realizing how frustrated the users are.

[+] throwaway084t95|2 years ago|reply
That's not Simpson's paradox. Simpson's paradox is when the aggregate winner differs from the winner in every element of a partition, not just some of them.
[+] jameshart|2 years ago|reply
Yes, I don't think it's possible to observe Simpson's paradox in a simple conversion test either.

Simpson’s paradox is about spurious correlations between variables - conversion analysis is pure Bayesian probability.

It shouldn’t be possible to have a group as a whole increase its probability to convert, while having every subgroup decrease its probability to convert - the aggregate has to be an average of the subgroup changes.

[+] robertlacok|2 years ago|reply
Exactly.

On that topic – what do you do when you observe that in your test results? What's the right way to interpret the data?

[+] Lior539|2 years ago|reply
I'm the author of this blog. Thank you for calling this out! I'll update the example to fix this :)
[+] hammock|2 years ago|reply
What it is is confounding
[+] londons_explore|2 years ago|reply
I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.

At first, it would pick say a 50/50 split. Then as data rolls in that shows group A is more likely to convert, shift more users over to group A. Keep a few users on B to keep gathering data. Eventually, when enough data has come in, it might turn out that flow A doesn't work at all for users in France - so the ideal would be for most users in France to end up in group B, whereas the rest of the world is in group A.

I want the framework to do all this behind the scenes - and preferably with statistical rigorousness. And then to tell me which groups have diminished to near zero (allowing me to remove the associated code).
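
A sketch of the allocation part of this via Thompson sampling for a Bernoulli bandit (the class and arm names are mine, and this only covers routing; the "statistical rigorousness" part -- valid inference from adaptively collected data -- needs machinery beyond this):

```python
# Thompson sampling: keep a Beta posterior per arm, sample a plausible
# conversion rate for each, and route the user to the arm with the best draw.
# Traffic shifts toward the winner automatically while losers keep a trickle.
import random

rng = random.Random(42)

class ThompsonSampler:
    def __init__(self, arms):
        # Beta(1, 1) prior on each arm's conversion rate: [successes+1, failures+1]
        self.state = {arm: [1, 1] for arm in arms}

    def choose(self):
        draws = {a: rng.betavariate(s[0], s[1]) for a, s in self.state.items()}
        return max(draws, key=draws.get)

    def record(self, arm, converted):
        s = self.state[arm]
        s[0 if converted else 1] += 1
```

Per-segment splits like the France example would mean keeping one such state per segment, at the cost of splitting your data that many more ways.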

[+] 2rsf|2 years ago|reply
Another challenge, related more to implementation than theory, is having too many experiments running in parallel.

As a company grows there will be multiple experiments running in parallel executed by different teams. The underlying assumption is that they are independent, but it is not necessarily true or at least not entirely correct. For example a graphics change on the main page together with a change in the login logic.

Obviously this can be solved by communication, for example by documenting running experiments, but like many other aspects of A/B testing there is a lot of guesswork and gut feeling involved.

[+] jedberg|2 years ago|reply
The biggest mistake engineers make about A/B testing is not recognizing local maxima. Your test may be super successful, but there may be an even better solution that's significantly different than what you've arrived at.

It's important to not only A/B test minor changes, but occasionally throw in some major changes to see if it moves the same metric, possibly even more than your existing success.

[+] rmetzler|2 years ago|reply
If I read the first mistake correctly, getFeatureFlag() has the side effect of counting how often it was called, and that count is used to calculate the outcome of the experiment? Wow. I don't know what to say...
[+] dbroockman|2 years ago|reply
Another one: don’t program your own AB testing framework! Every time I’ve seen engineers try to build this on their own, it fails an AA test (where both versions are the same so there should be no difference). Common reasons are overly complicated randomization schemes (keep it simple!) and differences in load times between test and control.
[+] alberth|2 years ago|reply
Enough traffic.

Isn’t the biggest problem with A/B testing that very few web sites even have enough traffic to properly measure statistical differences?

Essentially, that makes A/B testing useless for 99.9% of websites.

[+] masswerk|2 years ago|reply
On point 7:

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it (…)

Mind that you first have to show that this premise is actually true. Your user research is probably exploratory, qualitative data based on a small sample. At this point, it's rather an assumption. You have to transform and test it (by quantitative means) for validity and significance. Only then can you proceed to the button hypothesis. Otherwise, you are still testing multiple things at once, based on an unclear hypothesis, while merely assuming that part of this hypothesis is actually valid.

[+] mabbo|2 years ago|reply
> The solution is to use an A/B test running time calculator to determine if you have the required statistical power to run your experiment and for how long you should run your experiment.

Wouldn't it be better to have an A/B testing system that just counts how many users have been in each assignment group and end when you have the required statistical power?

Time just seems like a stand-in for "that should be enough", when in reality the number of users exposed might differ from your expectations.

[+] aliceryhl|2 years ago|reply
Running the experiment until you have a specific pre-determined number of observations is okay.

However, the deceptively similar scheme of running it until the results are statistically significant is not okay!
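
A quick simulation of why: in an A/A setup (no true difference), checking a two-proportion z-test after every batch and stopping at the first "significant" result inflates the false-positive rate well past the nominal 5%. Batch sizes and counts here are illustrative:

```python
# Peeking after every batch in A/A experiments: the false-positive rate
# climbs with each extra look, even though each individual test is at 5%.
import math
import random

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def peeking_false_positive_rate(runs=500, batches=20, batch_size=200,
                                true_rate=0.05, seed=1):
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        ca = cb = na = nb = 0
        for _ in range(batches):
            ca += sum(rng.random() < true_rate for _ in range(batch_size))
            cb += sum(rng.random() < true_rate for _ in range(batch_size))
            na += batch_size
            nb += batch_size
            if z_test_p(ca, na, cb, nb) < 0.05:   # the peek
                false_positives += 1
                break
    return false_positives / runs
```

With 20 looks, the realized false-positive rate lands well above the 5% each individual test promises.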

[+] iudqnolq|2 years ago|reply
Point one seems to be an API naming issue. I would not anticipate getFeatureFlag to increment a hit counter. Seems like it should be called something like participateInFlagTest or whatever. Or maybe it should take a (key, arbitraryId) instead of just (key), use the hash of the id to determine if the flag is set, and idempotently register a hit for the id.
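
A sketch of that (key, arbitraryId) idea, with a hypothetical function name: hash the pair so assignment is deterministic and stateless, separate from any hit counting.

```python
# Deterministic variant assignment from a hash of (experiment key, user id).
# The same user always lands in the same bucket for a given experiment.
import hashlib

def assign_variant(experiment_key, user_id, variants=("control", "test")):
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because assignment depends only on the inputs, exposure logging can be a separate, idempotent call, avoiding the side-effect surprise above.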
[+] realjohng|2 years ago|reply
Thanks for posting this. It's to the point and easy to understand. And much needed -- most companies seem to do testing without teaching the intricacies involved.
[+] drpixie|2 years ago|reply
> Relying too much on A/B tests for decision-making

Need I say more? Or just keep tweaking your website until it becomes a mindless, grey sludge.

[+] franze|2 years ago|reply
plus, mind the Honeymoon Effect:

something new performs better just because it's new.

if you have a platform with lots of returning users, this one will hit you again and again.

so even if you have a winner after the test and make the change permanent, revisit it 2 months later and see if you are really better off.

in sum, all the changes from a/b tests have a high chance of adding up to a merely average platform.