top | item 45432460

hervature | 5 months ago

Not the OP. I think what they are driving at is that if knowledge is discovered during exploration in cohort A, cohort B can exploit it. Then, the whole A/B test breaks down to which cohort got to benefit more from the bandit learnings.
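A rough simulation of that failure mode (hypothetical arms and reward rates, epsilon-greedy with an assumed decaying exploration schedule): one shared bandit serves both cohorts, cohort A's traffic arrives first, and cohort B cashes in on A's exploration:

```python
import random

random.seed(0)

# Hypothetical two-arm setup; the true success rates are unknown to the bandit.
TRUE_RATES = {"arm1": 0.1, "arm2": 0.9}
counts = {a: 0 for a in TRUE_RATES}  # pulls per arm (shared across cohorts)
wins = {a: 0 for a in TRUE_RATES}    # successes per arm (shared across cohorts)

def pull():
    """Epsilon-greedy with epsilon decaying as total traffic accumulates."""
    total = sum(counts.values())
    epsilon = 100 / (100 + total)  # assumed decay schedule
    if random.random() < epsilon:
        arm = random.choice(list(TRUE_RATES))  # explore
    else:
        # exploit: pick the arm with the best empirical success rate so far
        arm = max(counts, key=lambda a: wins[a] / counts[a] if counts[a] else 0.0)
    reward = 1 if random.random() < TRUE_RATES[arm] else 0
    counts[arm] += 1
    wins[arm] += reward
    return reward

# Cohort A's traffic lands while the bandit is still exploring heavily ...
a_rewards = [pull() for _ in range(1000)]
# ... cohort B's traffic lands later and mostly exploits what A paid to learn.
b_rewards = [pull() for _ in range(1000)]

print("cohort A mean reward:", sum(a_rewards) / 1000)
print("cohort B mean reward:", sum(b_rewards) / 1000)
```

Cohort B's mean reward comes out higher even though both cohorts were served by the identical system; the apparent "lift" is an artifact of which cohort paid the exploration cost.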

rented_mule | 5 months ago

Yes, this is exactly the kind of scenario I was alluding to.

For example, cohorts with very light traffic are likely to get an undue benefit: much of the exploration may already have been done before the smaller cohort ever needs to select an arm, so the arms it sees are closer to convergence.

Another example is when outcomes differ wildly between cohorts. More of the exploration will be done in cohorts with more traffic, leading the bandit's optimization to fit large cohorts better than lower-traffic ones.

Even if you do manage to make things independent, you have to wait for the bandit to converge before you know what converged results look like. That requires being able to measure convergence, which isn't always easy, especially if you didn't know to plan for it when designing the system.
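One simple way to make convergence measurable (a sketch, not what any particular system used) is to declare the bandit converged once the leading arm's lower confidence bound clears every other arm's upper bound, using a normal approximation to the binomial:

```python
import math

def converged(counts, wins, z=1.96):
    """Return True once the leading arm's lower confidence bound exceeds
    every other arm's upper bound (normal approximation, ~95% at z=1.96)."""
    def bounds(arm):
        n = counts[arm]
        p = wins[arm] / n if n else 0.5          # no data: assume 0.5
        half = z * math.sqrt(p * (1 - p) / n) if n else 1.0
        return p - half, p + half
    # Leader = arm with the highest lower bound.
    arms = sorted(counts, key=lambda a: bounds(a)[0], reverse=True)
    leader, rest = arms[0], arms[1:]
    return all(bounds(leader)[0] > bounds(a)[1] for a in rest)

# With lots of data and well-separated rates, the intervals separate:
print(converged({"x": 1000, "y": 1000}, {"x": 700, "y": 300}))
# With little data, the intervals still overlap:
print(converged({"x": 10, "y": 10}, {"x": 6, "y": 5}))
```

The point is less the specific statistic than that "are we converged yet?" has to be an explicit, testable question before the bandit's results can feed an experiment readout.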

Despite all of these problems, we kept bandits, and even expanded their application, at least for the 10 years I was still around. They are incredibly powerful. But there was a lot of "I wish those damned bandits didn't work so well!"

For anyone who is not aware: A/B tests assume cohorts behave independently of each other. The less true that is, the less reliable the results are. This was even worse for us in parts of our system where there were no bandits, but there were direct interactions between individuals.