Most Winning A/B Test Results are Illusory [pdf]

[+] gkoberger|12 years ago|reply

At my first webdev internship, my only job was to report to the "Head of Analytics" (a young liberal arts guy). All I did all day was make the tweaks he told me to do. It was stuff like "make this button red, green or blue", or "try these three different phrasings".

We got no more than 100 hits a day, with no more than 2-3 conversions a day, and he would run these tests for, like, 2 days.

I hated it, and the website looked horrible because everything was competing with each other and just used whatever random color won.

[+] CoffeeDregs|12 years ago|reply

I've seen that, too. One of my clients redid their marketing site 3x in one year, each time claiming incredible improvements. The incredible improvements turned out to be local hill climbing, while the entire site's performance languished... 3-4 years ago there were a ton of blog posts about how a green button produced incredible sales when compared to a red button. And so everyone switched to green buttons...

By contrast, I've evolved multiple websites through incremental, globally measured optimizations. It's a lot of fun and it requires you to really understand your user (I've called AB testing+analytics "a conversation between you and your users"). But, as you point out, it can be tough to get statistically significant data on changes to a small site. That's why I usually focused on big effects (e.g. 25%), rather than on the blog posts about "OMG! +2.76% change in sales!". That's also why I did a lot of "historical testing", under the assumption that week-to-week changes in normalized stats would be swamped by my tests.

[+] AJ007|12 years ago|reply

There are a lot of published "case studies" in the internet marketing field that consist of a few hundred views and a handful of conversions. It is even more embarrassing considering you often need 100,000+ unique visitors and thousands of conversions to find real winners... and you still have to deal with the reality that a real 'winner' may result in a drop-off of sales (in lead generation), an increase in chargebacks (if your conversion was a sale), etc. This accounts for a sliver of the regression to the mean mentioned in the whitepaper.

Tests have value, but just making your site/app very simple and completely non-confusing to the viewer can do something years of split tests will not.

I suggest running tests and monitoring metrics as you implement design changes, not so much as a magic eight ball, but to ensure you avoid truly catastrophic UI fuck ups.

[+] ernopp|12 years ago|reply

Full disclosure: I work for Qubit who published this white paper.

I see a lot of this kind of testing going on in the industry and it's frustrating. A/B testing can be a massive tool for your business if it's done right but obviously if you only wait for 2-3 conversions, you're not learning much... "Good" to hear that other people feel the same way!

[+] mathewsimonton|12 years ago|reply

I'm someone currently specializing in analytics as a digital marketer at work (and learning R and a bit of Python in my spare time for greater and swifter data analysis!) Similar to your former superior, I'm also coming out of a liberal arts background. I just want to make it clear that someone like me, despite their background, agrees with you that the person you were reporting to was foolish to even bother A/B testing such minor elements at 100 hits/day.

Sadly, many foolish "SEOs"/"digital marketers"/"growth hackers" have this same mentality that such subtle changes--despite low traffic--still offer meaningful information to digest and further analyze. But hey, they gotta keep their boss/clients on-board for the thrill and payment, right? For everyone out there, remember that outside the highest echelon of traffic levels, this testing is often performed by marketers with BAs in business administration, marketing, or liberal arts degrees like me. They are often not the statisticians referenced in this document. And sadly they may likely be people unlike me, unwilling to stretch out into a programming language for data analysis, who may have never cracked open a book on statistics. But frankly they have other things to worry about--like staying in your budget and overall digital branding and marketing strategy. Their budget and time are likely better applied outside of A/B testing.

If you have a mathematics background, reach out to your marketing department. If you consider yourself a math-wiz, reach out to the "growth hacker" or "SEO" a few feet away. They deal with the stuff you don't want to deal with. You deal with the stuff they don't want to deal with. Help each other out and engage in a conversation to better help your business. At least your superiors would appreciate it.

Personally, when it comes to landing pages, I test much more dramatic shifts--significant changes to the entire design or to the header imagery along with call-to-action. I don't buy into the testing of slight adjustments to things like font size or button color (and especially when there is such low volume). That said, I've never worked with hundreds of thousands of visitors per month on a site, where anyone would imagine smaller changes for testing can make a bit more sense to look into.

gkoberger, I'm sorry you hated your first webdev internship. I would have hated it too.

On a side note (making specific reference to the document instead of the comment!), I really enjoyed point #3. This speaks very much to the often short-lived A/B testing of low-volume AdWords text ads. The data is often ALL over the place despite the (otherwise) "professional" use of the platform.

[+] loceng|12 years ago|reply

Ahh glorious...

[+] ronaldx|12 years ago|reply

I love the concept of A/A testing here, illustrating that you get apparent results even when you compare something to itself.

I can't imagine how A/B tests are a productive use of time for any site with less than a million users.

There are so many more useful things you could be doing to create value. If you're running a startup you should rather have some confidence in your own decisions.

[+] RyJones|12 years ago|reply

When ExP was a thing at Microsoft, we always ran an A/A test before we did experiments. We'd also do an A/A/B test to make sure the actual experiments were working.

http://www.exp-platform.com/Pages/default.aspx

[+] tel|12 years ago|reply

A/A testing should be used to get accurate estimates for within-sample variance. If you run an A/A/B test then you can calibrate the A/B component to be sensitive w.r.t. the tolerances of real data.

And then yeah, I'm sure a lot of successful A/B tests will get washed.
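
To illustrate the A/A point, here is a minimal simulation sketch (not from the paper or from ExP). It assumes a 5% true conversion rate in both arms, a pooled two-proportion z-test, and a 5% significance level; the function names and all parameter values are made up for illustration. Both arms are identical, yet roughly one comparison in twenty should still look "significant".

```python
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions at all: nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_false_positive_rate(rate=0.05, visitors=1000, trials=1000,
                           alpha=0.05, seed=1):
    """Share of A/A tests (identical arms) that come out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Both arms draw from the SAME conversion rate (assumed value).
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if two_proportion_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(f"A/A tests flagged as significant: {aa_false_positive_rate():.1%}")
```

Running a real A/A test first, as described above, is the empirical version of this check: if the flagged share is far from the nominal significance level, something in the assignment or measurement is off.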

[+] Homunculiheaded|12 years ago|reply

confidence in your own decisions can also be referred to as a Bayesian prior ;)

I've treated the A/B tests I've run pretty much as a case of Bayesian parameter estimation (where the true conversion of A and of B are your parameter). You then get nice beta distributions you can sample from, as well as use the prior to constrain expectations of improvement and also reduce the effects of early flukes in your sampling.
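
A rough sketch of what this kind of Beta-posterior estimate could look like (this is an assumed reading of the approach, not the commenter's actual code). Each arm's conversion rate gets a Beta prior, the observed counts update it, and sampling from both posteriors estimates the probability that B beats A. The function name, the prior parameters, and the example counts are all hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta posteriors.

    prior_alpha / prior_beta encode the prior belief about the conversion
    rate; Beta(1, 1) is flat, larger values pull both arms towards
    prior_alpha / (prior_alpha + prior_beta).
    """
    rng = random.Random(seed)
    a_post = (prior_alpha + conv_a, prior_beta + n_a - conv_a)
    b_post = (prior_alpha + conv_b, prior_beta + n_b - conv_b)
    wins = 0
    for _ in range(draws):
        if rng.betavariate(*b_post) > rng.betavariate(*a_post):
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Hypothetical early data: 3/100 vs 6/100 conversions.
    flat = prob_b_beats_a(3, 100, 6, 100)
    # Informative prior centred on a 3% rate (assumed for illustration).
    informed = prob_b_beats_a(3, 100, 6, 100, prior_alpha=30, prior_beta=970)
    print(f"flat prior:     P(B > A) = {flat:.2f}")
    print(f"informed prior: P(B > A) = {informed:.2f}")
```

With the informative prior the early 3-vs-6 fluke moves the estimate much less than with the flat prior, which is the "reduce the effects of early flukes" behaviour described above.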

[+] ernopp|12 years ago|reply

Full disclosure: I work for Qubit who published this white paper.

Just wanted to add that if you have less than a million users you can A/B test for upper funnel goals, effectively measuring if changes improve engagement. Obviously then you have the problem of working out if the engagement translates into more sales but perhaps you're willing to wait longer to find out if a test that improves engagement leads to more revenue in the long run.

[+] darkxanthos|12 years ago|reply

I do this professionally as my sole job. This is one of the very few papers I've read that seem completely legit to me. I especially love their point on necessary sample sizes to get to a 90% power.

[+] ep103|12 years ago|reply

How do you calculate the correct sample size for a test, to achieve the correct "power"?
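
One common answer, sketched here under assumptions: the standard normal-approximation formula for comparing two proportions with a two-sided test. The function name and the example inputs (2% baseline, 10% relative uplift) are illustrative and not taken from the paper.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(base_rate, relative_uplift, alpha=0.05, power=0.9):
    """Visitors needed in EACH arm to detect the given relative uplift.

    Uses the usual normal approximation for a two-sided test comparing
    two proportions with equal group sizes.
    """
    p1 = base_rate
    p2 = base_rate * (1 + relative_uplift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    print(sample_size_per_arm(0.02, 0.10))             # small effect
    print(sample_size_per_arm(0.02, 0.25))             # big effect
    print(sample_size_per_arm(0.02, 0.10, power=0.8))  # lower power
```

With a 2% baseline and a 10% relative uplift this comes out at a little over 100,000 visitors per arm at 90% power, which is why small sites struggle to detect small effects. Larger effects or lower power requirements shrink the number quickly.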

[+] krallja|12 years ago|reply

Why is 90% power the magic number? What's wrong with 89.99%? Or 99.99%?

[+] pak|12 years ago|reply

This article's title echoes a paper which continues to influence the medical research and bioinformatics community, "Why Most Published Research Findings Are False" by JPA Ioannidis.

http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fj...

While the OP's article targets some low-hanging fruit, like halting criteria, multiple hypotheses, etc. which should be familiar to anyone serious about bioinformatics and statistics, Ioannidis takes these things a little farther and comes up with a number of corollaries that apply equally well to A/B testing.

After all, the randomized controlled trials that the FDA uses to approve new drugs are essentially identical to what would be called an A/B test on Hacker News.

[+] hvass|12 years ago|reply

I strongly recommend using Evan Miller's free A/B testing tools to avoid those issues!

Use them to really know if conversion rate is significantly different, whether the mean value of two groups is significantly different and how to calculate sample size:

http://www.evanmiller.org/ab-testing/

[+] napoleond|12 years ago|reply

This is awesome, thanks for the link! (And the visualizations help a ton, especially for the t-test... it's been a while since I took any stats courses and the terminology always puts me off a bit but the graphs make sense.)

[+] lingben|12 years ago|reply

thanks but what does "expected conversion rate" mean exactly? it isn't defined and I couldn't find that term anywhere else on the site. EDIT: ah, ok, got it. but why is their default expected conversion rate set so high? sheeesh

most people have conversion rates between 1-3%

[+] tristanz|12 years ago|reply

Putting aside bandits and all that, it seems like the first step should be to set up a hierarchical prior which performs shrinkage. Multiple comparisons and stopping issues are largely due to using frequentist tests rather than a simple probabilistic model and inference that conditions on the observed data.

Gelman et al, "Why we (usually) don't have to worry about multiple comparisons" http://arxiv.org/abs/0907.2478

[+] gabemart|12 years ago|reply

  > We know that, occasionally, a test will generate a
  > false positive due to random chance - we can’t avoid that.
  > By convention we normally fix this probability at 5%. You
  > might have heard this called the significance probability
  > or p-value.

  > If we use a p-value cutoff of 5% we also expect to see 5
  > false positives.

Am I reading this incorrectly, or is the author describing p-values incorrectly?

A p-value is the chance a result at least as strong as the observed result would occur if the null hypothesis is true. You can't "fix" this probability at 5%. You can say "results with a p-value below 5% are good candidates for further testing". The fact that p-values of 0.05 and below are often considered significant in academia tells you nothing about the probability of a false positive occurring in an arbitrary test.

[+] martingoodson|12 years ago|reply

Author of the paper here. You're right, this is incorrect. I corrected this in the final copy but an earlier draft seems to have been put on the website. There are a few other errors too. I am describing the 'significance level' here, not the 'p-value', as you say.

[+] ronaldx|12 years ago|reply

Yes, there's perhaps a small error, although it might be that he's rounded up in his favour.

In his described scenario there are 90 cases where the null hypothesis is true (he states as a premise: "10 out of our 100 variants will be truly effective").

So strictly, we expect to see 5% of 90 = an average of 4.5 false positives (he says 5 false positives).

[Edited to add: False positive rate is measured as a conditional probability https://en.wikipedia.org/wiki/False_positive#False_positive_...]
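
The arithmetic in this subthread, written out as a small sketch. The inputs are the paper's scenario as quoted in the thread (100 tests, 10 truly effective, 80% power, 5% significance level); the function name is made up.

```python
def expected_winners(tests=100, truly_effective=10, power=0.8, alpha=0.05):
    """Expected true positives, false positives, and share of false winners."""
    # Power applies only to variants that really are effective.
    true_positives = truly_effective * power
    # The significance level applies only to variants with no real effect.
    false_positives = (tests - truly_effective) * alpha
    winners = true_positives + false_positives
    return true_positives, false_positives, false_positives / winners


if __name__ == "__main__":
    tp, fp, false_share = expected_winners()
    print(f"true positives:  {tp}")
    print(f"false positives: {fp}")
    print(f"share of winners that are false: {false_share:.0%}")
```

Under these assumptions there are 8 true positives and 4.5 false positives on average, so about 36% of the declared winners are not real. The paper's "8 + 5 = 13" rounds the 4.5 up.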

[+] paraschopra|12 years ago|reply

The article is spot on. We at http://visualwebsiteoptimizer.com/ know that there are some biases (particularly related to 'Multiple comparisons' and 'Multiple seeing of data') that lead to results that seem better than they actually are. Though the current results are not wrong. They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).

Of course, these are very important issues for A/B testing vendors like us to understand and fix, since users mostly rely on our calculations to base their decisions. You will see us working towards taking care of such issues.

[+] martingoodson|12 years ago|reply

I'm afraid that's not quite right. A simple Python simulation will show you that a variant with -5% (i.e. NEGATIVE) uplift will still give a positive result around 10% of the time if you perform early stopping of the test.
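
A sketch of what such a simulation might look like; this is not the author's code. The baseline rate, sample sizes, peeking schedule, and function names are all assumptions. It compares "stop the first time B looks significantly better" against "only look once at the end", on the same simulated traffic.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def b_looks_significantly_better(conv_a, conv_b, n):
    """Pooled z-test with n visitors per arm; True if B 'wins'."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = (2 * pooled * (1 - pooled) / n) ** 0.5
    return se > 0 and (conv_b - conv_a) / n / se > Z_CRIT


def simulate(rate_a=0.05, uplift=-0.05, n_max=4000, check_every=200,
             trials=300, seed=7):
    """Return (win rate with peeking, win rate with one final look)."""
    rng = random.Random(seed)
    rate_b = rate_a * (1 + uplift)  # B is truly WORSE than A
    peeking_wins = fixed_wins = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        won_at_a_peek = False
        for n in range(1, n_max + 1):
            conv_a += rng.random() < rate_a
            conv_b += rng.random() < rate_b
            # Early stopping: declare B the winner at the first peek
            # where it looks significant. n_max is a multiple of
            # check_every, so the final look is also one of the peeks.
            if n % check_every == 0 and not won_at_a_peek:
                won_at_a_peek = b_looks_significantly_better(conv_a, conv_b, n)
        peeking_wins += won_at_a_peek
        fixed_wins += b_looks_significantly_better(conv_a, conv_b, n_max)
    return peeking_wins / trials, fixed_wins / trials


if __name__ == "__main__":
    peeking, fixed = simulate()
    print(f"B declared winner with peeking:    {peeking:.1%}")
    print(f"B declared winner, one final look: {fixed:.1%}")
```

The exact percentages depend on the assumed traffic and how often you peek, but the peeking rate can never be lower than the single-look rate, and it is usually several times higher.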

[+] IanCal|12 years ago|reply

> They are directionally correct, and with most A/B tests even if 95% confidence is really a true confidence of 90% or less, the business will still do better implementing the variation (v/s not doing anything).

What? That's not right at all! A confidence measure is how much you can trust that there's actually a difference. You can't say it'll improve things if your confidence is lower than your original threshold!

In addition to this, every time you change something you:

1) Might introduce bugs

2) Spend money

3) Spend time you could be spending adding a new feature or getting a new customer

[+] moapi|12 years ago|reply

Good article in general, I have a small question:

"Let’s imagine we perform 100 tests on a website and, by running each test for 2 months, we have a large enough sample to achieve 80% power. 10 out of our 100 variants will be truly effective and we expect to detect 80%, or 8, of these true effects. If we use a p-value cutoff of 5% we also expect to see 5 false positives. So, on average, we will see 8+5 = 13 winning results from 100 A/B tests."

If we expect 10 truly effective tests and 5 false positives, we'd have 15 tests that rejected the null hypothesis of h_0=h_test. Taking power into account, shouldn't we see 15*0.8 = 12 winning results? I.e. wouldn't one of the false positives also lack sufficient power?

[+] ernopp|12 years ago|reply

Full disclosure: I work for Qubit who published this white paper.

Maybe the confusion here is in tests which have a "true" effect and an "observed" effect. If an experiment has a true effect, then you have some chance to observe it, which is the power.

But false positives have by definition already been observed as winners (that's what false positives are), so there's no need to apply the factor of 0.8 to them.

[+] dbroockman|12 years ago|reply

The "regression to the mean" and "novelty" effects are getting at two different things (both true, both important).

1. Underpowered tests are likely to exaggerate differences, since E(abs(truth - result)) increases as the sample size shrinks.

2. The much bigger problem I've seen a lot: when users see a new layout they aren't accustomed to they often respond better, but when they get used to it, they can begin responding worse than with the old design. Two ways to deal with this are long term testing (let people get used to it) and testing on new users. Or, embrace the novelty effect and just keep changing shit up to keep users guessing - this seems to be FB's solution.
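
Point 1 can be illustrated with a small simulation, sketched here under assumed numbers (5% vs 5.5% true conversion rates, 500 visitors per arm, so the test is badly underpowered). It keeps only the runs where B comes out as a significant winner and compares their average observed difference with the true one. The function name and parameters are hypothetical.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold


def winners_curse(rate_a=0.05, rate_b=0.055, visitors=500,
                  trials=2000, seed=3):
    """Return (true difference, mean observed difference among winners,
    number of winners) for an underpowered test."""
    rng = random.Random(seed)
    observed = []
    for _ in range(trials):
        conv_a = sum(rng.random() < rate_a for _ in range(visitors))
        conv_b = sum(rng.random() < rate_b for _ in range(visitors))
        diff = (conv_b - conv_a) / visitors
        pooled = (conv_a + conv_b) / (2 * visitors)
        se = (2 * pooled * (1 - pooled) / visitors) ** 0.5
        # Keep only runs where B is declared a significant winner.
        if se > 0 and diff / se > Z_CRIT:
            observed.append(diff)
    mean_winner = sum(observed) / len(observed) if observed else None
    return rate_b - rate_a, mean_winner, len(observed)


if __name__ == "__main__":
    true_diff, mean_winner, count = winners_curse()
    print(f"true difference:               {true_diff:.4f}")
    print(f"mean difference among winners: {mean_winner:.4f} ({count} winners)")
```

Because a small sample can only reach significance when the observed gap is large, the winners that survive overstate the real effect, and a follow-up test will tend to show a smaller one.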

[+] stevoski|12 years ago|reply

Great read.

What bothers me about A/B tests is when people say, e.g. "there was a 7% improvement" without telling us the sample size or error margin. I'd rather hear: "On a sample size of 1,000 unique visits, the improvement rate was 7% +/- 4%."
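
A minimal sketch of how such a margin could be reported, assuming a simple normal-approximation (Wald) interval for the difference between two conversion rates. The counts in the example are hypothetical and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def uplift_with_margin(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Return (difference in conversion rate, margin of error)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return p_b - p_a, z * se


if __name__ == "__main__":
    # Hypothetical: 100/1000 conversions for A, 107/1000 for B.
    diff, margin = uplift_with_margin(100, 1000, 107, 1000)
    print(f"difference: {diff:+.3f} +/- {margin:.3f}")
```

In this made-up example the margin is several times larger than the difference itself, so quoting the improvement without it would be misleading.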

[+] ameister14|12 years ago|reply

I really liked this; it's condescending, but in a good natured sort of way. It's as if the author was trying to explain really basic statistics to a marketer, then realized that the marketer had NO idea what he was talking about.

So you get statements like "This is a well-known phenomenon, called ‘regression to the mean’ by statisticians. Again, this is common knowledge among statisticians but does not seem to be more widely known."

I thought that was hilarious.

[+] IanOzsvald|12 years ago|reply

Martin gave this paper as a talk at our PyData London conference this weekend (thanks Martin!), videos will be linked once we have them. He shares hard-won lessons and good advice. Here's my write-up: http://ianozsvald.com/2014/02/24/pydatalondon-2014/

[+] mildtrepidation|12 years ago|reply

Would be interested to see patio11's feedback on this one.

[+] patio11|12 years ago|reply

Correct on the math, to the limit of my understanding of it and quick glance.

I am agnostic about whether most A/B testing practitioners administer their tests correctly -- of the universe of companies I've seen, far and away the most common error regarding A/B testing is "We don't A/B test.", which remains an error even after you read this article.

The novelty effect they talk about, which the article says is probably simple reversion to the mean, is -- in my opinion -- likely a true observation of the state of the world. You can watch your conversion-rate-over-time for many offers, many designs, many products, etc, and they often start out quite high and taper off, both in circumstances where there is obvious alternate causality and in circumstances where there isn't. By comparison, I have not often participated in tests where conversion rates started out abnormally low and reverted to the mean, which we'd expect exactly as often as "started out high" if that was indeed what we were seeing.

I believe so strongly in the novelty effect that I have written proposals to profitably exploit it by scalably manufacturing novelty. Sadly, none of them are public. It's on my to-do list for one of these months but a lot of things are on my to-do list for one of these months.

If you run many tests, which as time approaches infinity you darn better, your odds of seeing a false positive approach one. Contra the article, you gladly accept this as a cost of doing business, because you know to a statistical certainty that you've seen many, many more true positives.

That about sums it up. If you have any particular questions, happy to answer them. My takeaway is "Good article. Please don't use it to justify a decision to not test."
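
The "odds of seeing a false positive approach one" remark follows from a one-line formula, sketched here under the assumption that the tests are independent and none of the variants has a real effect. The function name is made up.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """P(at least one false positive) across independent null tests."""
    return 1 - (1 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 10, 50, 100):
        p = prob_at_least_one_false_positive(n)
        print(f"{n:>3} tests: {p:.1%}")
```

At a 5% significance level the chance of at least one false positive is already above 99% by 100 tests, so the question is not whether you will see one but how many true positives come along with it.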

[+] beambot|12 years ago|reply

Related... someone should write a good article about estimating customer acquisition costs (CAC, or ROI if you prefer) based on conversion rates of ads.

It drives me batty when people tell me their "average" conversion rate is 1% after running a $25 ad campaign with so few clicks. It seems like too many folks are just oblivious to sample size, confidence interval, and power calculations -- something that could be solved with a quick Wikipedia search [1].

[1] https://en.wikipedia.org/wiki/Sample_size_determination
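
To put a number on the small-campaign problem, here is a sketch using the Wilson score interval for a single conversion rate. The choice of interval and the example counts (1 conversion in 100 clicks) are assumptions for illustration, and the function name is made up.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(conversions, clicks, confidence=0.95):
    """Wilson score confidence interval for a conversion rate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = conversions / clicks
    denom = 1 + z ** 2 / clicks
    centre = (p_hat + z ** 2 / (2 * clicks)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / clicks
                          + z ** 2 / (4 * clicks ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    for conversions, clicks in ((1, 100), (100, 10_000)):
        low, high = wilson_interval(conversions, clicks)
        print(f"{conversions}/{clicks}: {low:.2%} to {high:.2%}")
```

For 1 conversion in 100 clicks the interval runs from well under 1% to over 5%, so an "average conversion rate of 1%" from a campaign that size says very little. The same 1% measured over 10,000 clicks gives a far tighter range.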

[+] gatehouse|12 years ago|reply

Regarding the final bullet point of doing a second validation, the sample size should be bigger right? Because of the tendency for winners to coincide with +ve random effects, you will choose a larger experiment size and expect to see a lesser result.

[+] 27182818284|12 years ago|reply

Visibility on this is set to "Private". Is it really supposed to be linked publicly on HN? I was about to Tweet a link to it and then I felt dirty, like maybe the author wanted to send the link to just a select group.

[+] rubiquity|12 years ago|reply

Coming from a poker background, where sample size trumps everything, I've LOL'ed at every person that has ever whipped out an A/B test on me.

[+] StavrosK|12 years ago|reply

This doesn't follow. What if their sample size was 100,000 conversions?

[+] lingben|12 years ago|reply

compare and contrast this whitepaper with arguably one of the most common optimization apps out there:

https://help.optimizely.com/hc/en-us/articles/200133789-How-...

[+] coderdude|12 years ago|reply

In my experience it can't be overstated how important it is to wait until you have a large sample size to decide whether a variation is the winner. Nearly all of the A/B tests I run start out looking like a variation is the clear, landslide winner (sometimes showing 100%+ improvement over the original) only to eventually end up regressing toward the mean. I can't get a clear idea of the winner of a test until I've shown the variation(s) to 10s of thousands of visitors and received a few thousand conversions. I've also learned that it's important to only perform tests on new visitors when possible. That means tests need to run longer to get the appropriate sample size. If you're testing over a few hundred conversions and performing tests on new and returning visitors then you're probably getting skewed results. Again, that's just in my experience so far. YMMV. One thing to consider with a test is that the variations may be too subtle to have a significant, positive impact on conversion.

83 comments