top | item 42689221

(no title)

taion | 1 year ago

The problem with this approach is that it requires the system doing randomization to be aware of the rewards. That doesn't make a lot of sense architecturally – the rewards you care about often relate to how the user engages with your product, and you would generally expect those to be collected via some offline analytics system that is disjoint from your online serving system.

Additionally, doing randomization on a per-request basis heavily limits the kinds of user behaviors you can observe. Often you want to consistently assign the same user to the same condition to observe long-term changes in user behavior.

This approach is pretty clever on paper but it's a poor fit for how experimentation works in practice and from a system design POV.

discuss

hruk|1 year ago

I don't know, all of these are pretty surmountable. We've done dynamic pricing with contextual multi-armed bandits, in which each context gets a single decision per time block and gross profit is summed up at the end of each block and used to reward the agent.

That being said, I agree that MABs are poor for experimentation (they produce biased estimates that depend on somewhat hard-to-quantify properties of your policy). But they're not for experimentation! They're for optimizing a target metric.

matusp|1 year ago

Surmountable, yes, but in practice it is often just too much hassle. If you are doing tons of these tests you can probably afford to invest in the infrastructure for this, but otherwise AB is just so much easier to deploy that it does not really matter to you that you will have a slightly ineffective algo out there for a few days. The interpretation of the results is also easier as you don't have to worry about time sensitivity of the collected data.

hinkley|1 year ago

You do know Amazon got sued and lost for showing different prices to different users? That kind of price discrimination is illegal in the US. Related to actual discrimination.

I think Uber gets away with it because it’s time and location based, not person based. Of course if someone starts pointing out that segregation by neighborhoods is still a thing, they might lose their shiny toys.

taion|1 year ago

You can do that, but now you have a runtime dependency on your analytics system, right? This can be reasonable for a one-off experimentation system but it's not likely you'll be able to do all of your experimentation this way.

jacob019|1 year ago

Hey, I'd love to hear more about dynamic pricing with contextual multi-armed bandits. If you're willing to share your experience, you can find my email on my profile.

ivalm|1 year ago

You can assign multiarm bandit trials on a lazy per user basis.

So first time user touches feature A they are assigned to some trial arm T_A and then all subsequent interactions keep them in that trial arm until the trial finishes.

kridsdale1|1 year ago

The systems I’ve use pre-allocate users effectively randomly an arm by hashing their user id or equivalent.