
20 lines of code that beat A/B testing (2012)

545 points| _kush | 10 years ago |stevehanov.ca | reply

157 comments

[+] orasis|10 years ago|reply
Here's what everyone is missing. Don't use bandits to A/B test UI elements, use them to optimize your content / mobile game levels.

My app, 7 Second Meditation, is solid 5 stars, 100+ reviews because I use bandits to optimize my content.

By having the system automatically separate the wheat from the chaff, I am free to just spew out content regardless of its quality. This allows me to let go of perfectionism and just create.

There is an interesting study featured in "Thinking Fast and Slow" where they had two groups in a pottery class. The first group would have the entirety of their grade based on the creativity of a single piece they submit. The second group was graded on only the total number of pounds of clay they threw.

The second group crushed the first group in terms of creativity.

[+] kens|10 years ago|reply
I tried to find the original source of the quality-vs-quantity pottery class story a while back. I think it originates in the book "Art and Fear" but in that book it reads like a parable rather than a factual event. I'm highly suspicious of whether this event actually happened. Anyone have solid evidence?
[+] mrspeaker|10 years ago|reply
I don't understand what you mean by "use them to optimize your content" - how are you doing that with your app? Are you serving different messages to different groups of people? How are you grouping/testing/rating them?
[+] wlesieutre|10 years ago|reply
> The first group would have the entirety of their grade based on the creativity of a single piece they submit. The second group was graded on only the total number of pounds of clay they threw.

I feel like that works partly because an important part of practice is the feedback loop between continually practicing and having a sense of whether you did well or not.

Your strategy of not evaluating your own work sounds a bit like mushing clay into shapes with a blindfold on and then tossing it in the kiln before you even check whether or not it's shaped like a pot. The users can sort through them later!

If the end goal is just ending up with a volume of work that's been culled down to the better ones, I guess you still get that. But it's inherently different from the Thinking Fast and Slow example where they're in a class and the goal is to learn and get better, rather than see who's made the nicest pot by the end of the semester.

[+] hinkley|10 years ago|reply
You might want to be careful with your conclusions.

We don't know from this study whether the second group was more or less likely to get trapped in the Expert Beginner phase of development.

You definitely don't get anywhere without practice, but you are likely to get nowhere fast without theory.

[+] spyder|10 years ago|reply
solid 5 stars, 100+ reviews because I use bandits to optimize my content.

It's a great app, but were the ratings lower in the beginning, before the optimization? How do you know the optimization helped the ratings? I'm asking because it seems the app could have good ratings regardless of the content optimization, because it's a "feel good" app. Are there counterexamples of other meditation apps where the UI is good but the reviews are bad because of low-quality content?

[+] fapjacks|10 years ago|reply
There were two epiphanies hidden in this comment. HN is good for one or two every once in a while, but for me personally, this struck gold. Thanks!!
[+] tswartz|10 years ago|reply
Good point! What do you use to do in-app A/B testing?
[+] searine|10 years ago|reply
>I am free to just spew out content regardless of its quality.

Oh, that's a great goal to have...

[+] spiderfarmer|10 years ago|reply
I did a lot of A/B testing, but I think the examples that are used in a lot of articles about A/B testing are weird.

For small changes like changing the color / appearance of a button, the difference in conversion rate is not measurable. Maybe if you can test with traffic in the range of >100K unique visitors (from the same sources), you can say with confidence which button performed better.

But how many websites / apps really have >100K uniques? If you have a long running test, just to gather enough traffic, chances are some other factors have changed as well, like the weather, weekdays / weekends, time of month, etc.

And if you have <100K uniques, does the increase in conversion pay for the amount of time you have invested in setting up the test?

In my experience, only when you test completely different pages will you see significant differences.
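That >100K intuition can be sanity-checked with the standard two-proportion sample-size formula. A sketch, assuming a 2% baseline conversion rate and a hoped-for 10% relative lift (both invented numbers):

```python
# Rough per-arm sample size for a two-proportion z-test.
# The 2% baseline and 10% relative lift below are illustrative assumptions.
from statistics import NormalDist

def sample_size_per_arm(p_base, p_test, alpha=0.05, power=0.80):
    """Normal-approximation sample size per variant."""
    z = NormalDist()
    z_a = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_b = z.inv_cdf(power)           # desired power
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return (z_a + z_b) ** 2 * variance / (p_base - p_test) ** 2

# 2% baseline vs. 2.2% variant: roughly 80k visitors *per arm*,
# i.e. >160k uniques total for a single button test.
n = sample_size_per_arm(0.02, 0.022)
```

At these rates the required traffic lands right around the ">100K uniques" the comment guesses at, which is why small-change tests on low-traffic sites rarely resolve.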

[+] AJ007|10 years ago|reply
You are right, and the guy that responded with a test of just under 150 conversions is a great example of exactly what people get wrong.

I have been doing design optimization for nearly a decade now, and virtually every example I've seen made public is incredibly poor. Sometimes they do not even include any numbers, but just say: look, first I had a 1% conversion rate and now I have 1.2%! I wish I had a good optimization case study to point to, but I don't recall the last time I saw one.

Without giving anything away, I have done tests this year which required over a million impressions to find out in a statistically significant way not to make any changes.

My general theory is that a never optimized design with a confusing UI has a lot of low hanging fruit. You start cleaning up the bad elements and conversion rates can double or triple. Even if the sample size is less than ideal, these really big pops will be apparent.

Design knowledge is better now than it was in 2003. Mobile forces designers to use one column and leave out a lot of crap. There are a lot of good examples that get copied and good out of the box software. That means when you start optimizing, the low hanging fruit is gone and you need really big sample sizes. Once the low hanging fruit is gone, often those big samples just tell you not to make any changes.

Thinking about free-to-play mobile games recently (an area I have no experience in.) If 1% or fewer of users are converting you really do need a huge install base to beat your competitors at optimization. You need millions of users just to get to 20,000 or 30,000 paying players to test behavioral changes on. That means there actually is some staying power for the winners, at least for a while.

[+] darkxanthos|10 years ago|reply
My professional advice to people who make these unmeasurable changes is: Then don't do them. The worst part about unmeasurable changes is you can't verify if it has a negative effect either and you're essentially saying "I'm focusing on making this change when it will have negligible impact on the business."

You have better things to do.

[+] occamrazor|10 years ago|reply
People read about Google testing different shades of blue on their home page and want to do the same on their website.

They tend to forget that Google has a billion unique users per day, while their website has a few hundred.

[+] mangeletti|10 years ago|reply
> the difference in conversion rate is not measurable

Wut?

Here's the results of me changing an "add to cart" button from a branded looking maroon button to a simple yellow (Amazon style) button: http://cl.ly/0d440I3T333m

That's a 26% increase in sales from changing a button's color.

If you've got a good eye for usability, your intuition is going to lead to a lot of fun and great results with A/B testing. If not, you'll futz with things that make no difference most of the time and eventually give up. Experimentation is not about testing random permutations (unless you've got an infinite amount of time). It's about coming up with a reasonable hypothesis and then testing it. I study a web page until I come to a conclusion in my mind that something could really use some improvement. Then, even though I'm sure of it in my mind, I test it, because "sure of it in my mind" is wrong about half the time.

One note to those who use Optimizely: you need to wait at least twice as long as Optimizely thinks, because statistical significance can bounce around with too little traffic. Optimizely thought the above test was done at about 2000 views, which was far too few results to be conclusive, with only ~20 sales.
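The "~20 sales" point is easy to make concrete with a normal-approximation confidence interval. A sketch (not Optimizely's actual stats engine; the 20-sales-in-2000-views numbers are illustrative, roughly matching those above):

```python
# How loose is a conversion-rate estimate at ~20 sales in ~2000 views?
from math import sqrt

def approx_ci(conversions, visitors, z=1.96):
    """Normal-approximation 95% confidence interval for a conversion rate."""
    p = conversions / visitors
    half = z * sqrt(p * (1 - p) / visitors)
    return p - half, p + half

lo, hi = approx_ci(20, 2000)   # point estimate 1%, but the interval is wide
```

The interval runs from roughly 0.56% to 1.44%, i.e. about plus or minus 44% of the estimate itself, so a declared "winner" at this volume can easily be noise.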

[+] teekert|10 years ago|reply
" If you have a long running test, just to gather enough traffic, changes are some other factors have changed as well, like the weather, weekdays / weekends, time of month, etc."

I'd serve them at the same time, randomize which clients see which one.

[+] butler14|10 years ago|reply
seasonality doesn't really come into it with A/B split testing - if it changes for one group it changes for both.
[+] sixtypoundhound|10 years ago|reply
Novices also tend to gravitate towards "end-game" business metrics which have a lot more inherent variation than simple operational indicators.

For example - optimizing a content site for AdSense; many folks would gravitate to AdSense $$ as the target metric, which is admittedly an intuitive solution (since that's how you're ultimately getting paid).

But if you think about it....

AdSense Revenue = (1 - Bounce Rate) x (Pages / Visit) x (% ads clicked) x CPC

Bounce rate is a binomial probability with a relatively high success probability p (15%+), so you can get statistically solid reads on results with a relatively small sample.

Pages / Visit is basically the aggregate of a Markov chain (1 - exit probability); also relatively stable.

% ads clicked - a binomial probability with a low success probability p; large samples become important

$ CPC - so the ugly thing here is there's a huge range in the value of a click... often as low as $.05 for a casual mobile phone click or $30 for a well qualified financial or legal click (think retargeting, with multiple bidders). And you're usually dealing with a small sample of clicks (since the average % CTR is very low). So HUGE natural variation in results. Oh, and Google likes to penalty price sites with a large rapid increase in click-through-rate (for a few days), so your short term CPC may not resemble what you would earn in steady-state.

So while it may make ECONOMIC sense to use $ RPM as the test metric, you've injected tremendous variation into the test. You can accurately read bounce rate, page activity, and % click-through on a much smaller sample and feel comfortable making a move if you're confident nothing major has changed in terms of the ad quality (and CPC value) you will get.
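The point about which factors need big samples can be made concrete: the relative standard error of a binomial rate estimate is sqrt((1 - p) / (p * n)), so low-probability events like ad clicks are read far less precisely than bounce rate at the same traffic. A sketch with invented traffic numbers:

```python
# Compare measurement precision for a high-p metric (bounce rate)
# vs. a low-p metric (ad CTR) at the same sample size.
from math import sqrt

def relative_se(p, n):
    """Relative standard error of a binomial rate estimate."""
    return sqrt((1 - p) / (p * n))

n = 5000                              # visitors in the test (illustrative)
rse_bounce = relative_se(0.15, n)     # ~15% bounce rate: read to within a few %
rse_ctr = relative_se(0.005, n)       # ~0.5% ad CTR: ~20% relative error
```

Same 5000 visitors, but the bounce-rate read is several times tighter than the CTR read, before CPC variance is even considered.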

[+] ReadingInBed|10 years ago|reply
I thought this was a pretty good follow up to show the strengths and weaknesses of this approach: https://vwo.com/blog/multi-armed-bandit-algorithm/. Personally I think this approach makes a lot more sense than a/b testing especially when often people hand off the methodology to a 3rd party without knowing exactly how they work.
[+] aidanf|10 years ago|reply
Here are 2 good articles that follow up on the arguments presented by VWO in that article.

From the first link below: "They do make a compelling case that A/B testing is superior to one particular not very good bandit algorithm, because that particular algorithm does not take into account statistical significance.

However, there are bandit algorithms that account for statistical significance."

* https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...

* https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm...
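One such family is UCB1, which builds a confidence term directly into the selection rule instead of exploring blindly. A minimal sketch (not any particular library's API; the arm bookkeeping is illustrative):

```python
# UCB1: pick the arm with the highest upper confidence bound, so
# under-sampled arms get an exploration bonus that shrinks as data accrues.
from math import log, sqrt

def ucb1_choose(counts, values, t):
    """counts[a] = pulls of arm a, values[a] = its mean reward, t = total pulls."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                      # try every arm at least once
    return max(range(len(counts)),
               key=lambda a: values[a] + sqrt(2 * log(t) / counts[a]))

def ucb1_update(counts, values, arm, reward):
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean
```

Unlike the fixed 10%/90% rule, the bonus term is exactly a confidence-interval width, which is what lets these variants reason about significance.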

[+] raverbashing|10 years ago|reply
The points raised are valid; whether they matter is a different beast.

Even in the tests shown, the conversion rate was higher for the MABA algorithms than for simple A/B testing. "Oh, but you get higher statistical significance!" Thanks, but that doesn't pay my bills; conversion pays.

[+] 3dfan|10 years ago|reply

    10% of the time, we choose a lever at random. The
    other 90% of the time, we choose the lever that has
    the highest expectation of rewards. 
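For reference, the selection rule quoted above is epsilon-greedy with epsilon = 0.1. A minimal sketch (the arm bookkeeping is illustrative; in a real system the per-arm stats would persist across requests):

```python
# Epsilon-greedy, as described in the article: 10% of the time pick a
# random lever, otherwise pick the lever with the best observed rate.
import random

def choose_arm(counts, rewards, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(counts))          # explore
    rates = [r / c if c else 0.0 for r, c in zip(rewards, counts)]
    return max(range(len(counts)), key=rates.__getitem__)  # exploit

def record(counts, rewards, arm, converted):
    counts[arm] += 1
    rewards[arm] += 1 if converted else 0
```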
There is a problem with strategies that change the distribution over time: Other factors change over time too.

For example, let's say over time the percentage of your traffic that comes from search engines increases. And this traffic converts better than your other traffic. And let's say at the same time, your yellow button gets picked more often than your green button.

This will make it look like the yellow button performs better than it actually does, because it got more views during a time when the traffic was better.

This can drive your website in the wrong direction. If the yellow button performs better at first just by chance then it will be displayed more and more. If at the same time the quality of your traffic improves, that makes it look like the yellow button is better. While in reality it might be worse.

In the end, the results of these kinds of adaptive strategies are almost impossible to interpret.

[+] kevin_nisbet|10 years ago|reply
I don't think this is the case, if I understand this algorithm correctly. Say yellow is a 50% success rate, and green is 65% after the behavior change but 30% before it.

By sending 90% of traffic toward yellow, its ratio will normalize toward 50% once it has enough traffic. By sending 10% of traffic randomly, eventually the green option will reach 51% and start taking the majority of traffic, which will then cause it to normalize at its 65% and be shown to the majority of users.

I think the problem might be that if you run this with a sufficiently high volume or for a long period of time, a behaviour change will take a long time to learn. Or if two options aren't actually different, it may continually flip back and forth between them.

Also, to me, the concept of A/B testing certain things may have an undesired consequence. For example, I order from Amazon every day, but today the buy button is blue: what does that actually mean? And I go back to the site later and it's yellow again. There are still many people who get confused by seemingly innocuous changes in the way their computer interacts with them.
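The normalization argument above can be checked with a quick simulation of the epsilon-greedy rule from the article (the rates, the 10% exploration share, the switch point, and the seed are all illustrative assumptions):

```python
# Simulate the scenario above: green's true rate jumps 0.30 -> 0.65
# at t = 10,000 while yellow stays at 0.50 throughout.
import random

random.seed(7)
counts, wins = [0, 0], [0, 0]          # arm 0 = yellow, arm 1 = green

def converted(arm, t):
    rate = 0.50 if arm == 0 else (0.30 if t < 10_000 else 0.65)
    return random.random() < rate

for t in range(210_000):
    if random.random() < 0.10:                     # 10% explore
        arm = random.randrange(2)
    else:                                          # 90% exploit
        rates = [w / c if c else 0.0 for w, c in zip(wins, counts)]
        arm = max(range(2), key=rates.__getitem__)
    counts[arm] += 1
    wins[arm] += converted(arm, t)
```

With these numbers green does flip to the majority arm well before the run ends, though, as the comment notes, the switch takes thousands of impressions after the behavior change, because green's full-history average has to climb past yellow's on 10% exploration traffic alone.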

[+] metafunctor|10 years ago|reply
This is why you should segment traffic and run separate tests for each segment, whether you're using A/B testing or a multi-armed bandit algorithm.
[+] zeckalpha|10 years ago|reply
It's possible to weight conversions by frecency to account for this, rather than using frequency alone.
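One way to read "weight conversions by frecency" is an exponentially decayed running rate, so recent conversions count for more than old ones. A sketch; the decay factor is an invented illustration:

```python
# Recency-weighted conversion rate: each step, old evidence is
# multiplied by `decay`, so the estimate tracks the current behavior.
def decayed_rate(outcomes, decay=0.99):
    num = den = 0.0
    for converted in outcomes:       # stream of 0/1 conversions
        num = decay * num + converted
        den = decay * den + 1.0
    return num / den

stream = [0] * 1000 + [1] * 1000     # behavior changes halfway through
recent = decayed_rate(stream)        # tracks the new ~100% rate
overall = sum(stream) / len(stream)  # plain frequency: stuck at 0.5
```

The plain average never forgets the stale first half, while the decayed estimate does, which is exactly the non-stationarity problem the parent comments describe.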
[+] ryporter|10 years ago|reply
This is a good overview of the multi-armed bandit problem [1], but the author is far too dismissive of A/B testing.

First of all, the suggested approach isn't always practical. Imagine that you are testing an overhaul of your website. Do you want daily individual visitors to keep flipping back and forth as the probabilities change? I'm not sure if the author is really suggesting his approach would be a better way to run drug trials, but that's clearly ridiculous. You have to recruit a set of people to participate in the study, and then you obviously can't change what drug you're giving them during the course of the experiment!

Second, it ignores the time value of completing an experiment earlier. In the exploration/exploitation tradeoff, sometimes short-term exploitation isn't nearly as valuable as wrapping up an experiment so that your team can move on to new experiments (e.g., shutting down the old website in the previous example). If a company expects to have a long lifetime, then over a time frame measured in weeks, exploration will likely be relatively far more valuable.

[1] https://en.wikipedia.org/wiki/Multi-armed_bandit

[+] PaulHoule|10 years ago|reply
It is really funny how communities don't talk.

For instance, A/B testing with a 50-50 split has been baked into "business rules" frameworks for about as long as the multi-armed bandit has been around, but nobody in that community has ever heard of the multi-armed bandit. Meanwhile, machine learning people are celebrating the performance of NLP systems they build that are far worse than the rule-based systems people were using in industry and government 15 years ago.

[+] wodenokoto|10 years ago|reply
Which NLP systems are far worse than which rule-based systems?

The statement is odd for two reasons. One is that plenty of NLP is rule based; the other is that NLP isn't a form of A/B testing, which is the overall topic here.

[+] visarga|10 years ago|reply
> in the meantime, machine learning people are celebrating about the performance of NLP systems they build that are far worse than rule-based systems people were using in industry and government 15 years ago

That has not been my experience. I have been researching chat bots, and I can hardly find one implemented with machine learning; almost all are rule-based instead. I was quite disappointed. ML for NLP is just gearing up.

[+] chias|10 years ago|reply
I like the premise of this a lot, but it seems to me that the setting the author chose (some UI element of a website) is one of the worst possible settings for this: what matters a whole lot more than whether your button is red or green or blue is some modicum of consistency.

If you're constantly changing the button color, size, location, whatever... that is an awful experience in and of itself, is it not? If the Amazon "buy now" button changed size / shape / position / color every time I went to buy something, I would get frustrated with it pretty quickly.

[+] blakeyrat|10 years ago|reply
One aspect of testing they leave unsaid is that you identify your users (by cookie, most commonly) to make sure each user always gets the same experience. That's why your numbers are all based on unique users, not merely visits.

Their experience will still change once their cookie expires, but that amount of time is completely under your control.
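A common way to implement this is to derive the variant deterministically from the user id and experiment name (e.g. hashing the value stored in the cookie), so the same user always lands in the same bucket without any server-side state. A sketch; the function and names are illustrative, not any particular tool's API:

```python
# Sticky assignment: hash (experiment, user) to a stable variant.
import hashlib

def assign_variant(user_id, experiment, variants=("A", "B")):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Because the hash includes the experiment name, the same user can land in different arms of different experiments while staying consistent within each one.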

[+] tswartz|10 years ago|reply
If you cookie the users and make sure the experience they see is persistent, that solves most of this problem. But if you run a lot of separate tests, then it's hard to avoid this.
[+] hoddez|10 years ago|reply
There is at least one caveat with multi-armed bandit testing. It assumes that the site/app remains constant over the entire experiment. This is often not the case, or not feasible, especially for websites with large teams deploying constantly.

When your site is constantly changing in other ways, dynamically changing odds can cause a skew because you could give more of A than B during a dependent change, so you have to normalize for that somehow. A/B testing doesn't have this issue because the odds are constant over time.

[+] conductrics|10 years ago|reply
When thinking about which type of approach is best, first think about the nature of the problem. Is it a real optimization problem, IOW are you more concerned with learning an optimal controller for your marketing application? If so then ask:

1) Perishability: is the problem/information perishable? Perishable: picking headlines for news articles. Not perishable: a site redesign. If perishable, a bandit might give you real returns.

2) Complexity: are you using covariates (contextual bandits, reinforcement learning with function approximation)? If you are, then you might want your targeting model to serve up the best predicted options in subspaces (frequent user types) where it has more experience, and to explore more in less frequently visited areas (less common user types).

3) Scale/automation: you have tons of transactional decision problems, and it just doesn't scale to have people running many A/B tests.

Often it is a mix - you might use a bandit approach with your predictive targeting, but you also should A/B tests the impact of your targeting model approach vs a current default and/or a random draw. see slides 59-65: http://www.slideshare.net/mgershoff/predictive-analytics-bro...

For a quick bandit overview check out: http://www.slideshare.net/mgershoff/conductrics-bandit-basic...

[+] thecopy|10 years ago|reply
Offtopic, but why do I have to enable Javascript to even see anything?
[+] kqr|10 years ago|reply
That is really weird. Technically the content is there all along (so it's not loaded in by JavaScript) but you still have to have JavaScript enabled for it to render. Who designed that!?

Edit: hahaha what. It appears the content is laid out with JavaScript. So basically they're using JavaScript as a more dynamic CSS. Let that sink in. They're using JavaScript as CSS.

It sorta-kinda makes sense for the fancy stream of comments but still... why is it a requirement!?

[+] mrob|10 years ago|reply
In Firefox, "View", "Page Style", "No Style". I assume other browsers let you do the same thing. This works for many pages that fail to render without JavaScript.
[+] cm2187|10 years ago|reply
That's actually not off-topic. I wouldn't take web programming advice from someone who thinks displaying a blog article requires javascript.
[+] coldtea|10 years ago|reply
The same reason you have to have electricity to watch TV.

It's a required part of modern web sites -- and it doesn't matter whether it's "really needed" for any particular site or not.

[+] llull|10 years ago|reply
Bandits are great, but applying the theory correctly can be difficult (and if it is accidentally misapplied, one's results can easily become pathologically bad). For instance, the standard stochastic setup requires that learning instances be presented in an iid manner. This may not be true for website visitors, for example different behaviour at different times of day (browsing vs. executing) or timezone-driven differences in cultural response. There is never a simple, magic solution for these things.
[+] saturdayplace|10 years ago|reply
So I googled A/B testing vs Multi-Armed Bandit, and ran into an article that's a useful and informative response to the OP: https://vwo.com/blog/multi-armed-bandit-algorithm/

edit: Ah, 'ReadingInBed beat me to it. tl;dr: Bandit approaches might not _always_ be the best, and they tend to take longer to reach a statistically significant result.

[+] tmaly|10 years ago|reply
I remember reading this post back in 2012. I ended up getting a copy of the Bandit Algorithms book by John Myles White.

It's short, but it covers all the major variations.

[+] jedberg|10 years ago|reply
Just a warning, this isn't a magic bullet to replace all A/B testing. This is great for code that has instant feedback and/or the user will only see once, but for things where the feedback loop is longer or the change is more obvious or longer lasting (like a totally different UI experience), it doesn't work so well.

For example, if your metric of success is that someone retains their monthly membership to your site, it will take a month before you start getting any data at all. At that point, in theory almost all of your users should already be allocated to a test because hopefully they visited (and used) their monthly subscription at least once. So it would be a really bad experience to suddenly reallocate them to another test each month.

[+] mabbo|10 years ago|reply
This presumes a few things about the decision being tested, many of which aren't always true.

I ran a few basic A/B tests on some handscanner software used in large warehouses. The basic premise is that the user is being directed where to go and what items to collect. The customer wanted to know whether changes to the font size and colour of certain text would improve overall user efficiency. But the caveat was that we had to present a consistent experience to the user: you can't change the font size every 10 seconds, or it will definitely degrade the user experience!

My point is that it sounds as though the multi-armed bandit will probably work great provided the test is short and simple, and the choice can be re-made often without impacting users.

[+] zeckalpha|10 years ago|reply
It's possible to assign a user to a category and use the same algorithm, rebalancing users between the categories as needed once a week.
[+] scottlocklin|10 years ago|reply
Reinforcement approaches are certainly interesting, but one of the things missing here (and in most A/B stuff) is statistical significance and experimental power. If you have enough data, there are hand-wavy arguments that this will eventually be right, but in the meanwhile, if there is some opportunity cost (say, imagine this is a trading algo trying to profit from the bid/ask spread), you've screwed yourself out of some unknown amount of profit. There are actually ways of hanging a confidence interval on this approach which virtually nobody outside the signal processing and information theory communities knows about. Kind of a shame.