item 34742869

dylanpyle | 3 years ago

I've seen a similar mistake in a rushed "feature flagging"/phased rollout system.

User IDs were random UUIDs. Let's say we want to release a feature to ~33% of users; we take the first 2 characters of the UUID, giving us 256 possible buckets, then say that everyone in the first 1/3 of that range gets the feature. So, 00XXX...-55XXX... IDs get it, and 56-FF do not. This works fine.

However, if we then release another feature to 10% of users - everyone from 00-1A gets it, 1B-FF do not. That first set now has both features, and 56-FF have none. It turns out you can't draw meaningful conclusions when some users get every new feature and some get none at all.
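The nesting effect can be seen in a short sketch (hypothetical Python; `bucket`, `has_feature_a`, and `has_feature_b` are illustrative names, using the 33% and 10% cutoffs described above):

```python
def bucket(user_id: str) -> int:
    # First two hex characters of the UUID -> 256 buckets (0x00-0xFF).
    return int(user_id.replace("-", "")[:2], 16)

def has_feature_a(user_id: str) -> bool:
    return bucket(user_id) <= 0x55  # ~33% rollout: buckets 00-55

def has_feature_b(user_id: str) -> bool:
    return bucket(user_id) <= 0x1A  # ~10% rollout: buckets 00-1A

# Every user who has feature B necessarily has feature A too, and
# users in buckets 56-FF never see any feature at all.
```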

tmoertel|3 years ago

One easy way to avoid this problem is to give each feature its own independent space of “dice rolls” by hashing the user ids with feature-specific constants before interpreting them as dice rolls:

    feature1_enabled = hash(user_id + "-feature1") / HASH_MAX < 0.3  # 30% of users
    feature2_enabled = hash(user_id + "-feature2") / HASH_MAX < 0.1  # 10% of users
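A runnable version of that sketch (assumptions: SHA-256 as the hash and the top 32 bits as the roll; note Python's built-in `hash()` is not stable across processes, so a real implementation needs a deterministic hash like this):

```python
import hashlib

HASH_MAX = 2 ** 32

def roll(user_id: str, feature: str) -> float:
    # The feature-specific suffix gives each flag its own independent dice roll.
    digest = hashlib.sha256(f"{user_id}-{feature}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / HASH_MAX

def feature_enabled(user_id: str, feature: str, fraction: float) -> bool:
    return roll(user_id, feature) < fraction
```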

phreeza|3 years ago

If you suspect some flag effects interact with each other (e.g. one flag increases button size by 10%, and the other decreases it by 10%), you can go one step further: define feature groups, hash by user_id + group_id, and then assign non-overlapping ranges to the flags within a group.
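A sketch of the group idea (hypothetical names; "button-size-group" and the 10%/10% split are illustrative): interacting flags share one group-level roll, and each flag claims a disjoint slice of it, so no user ends up with both.

```python
import hashlib

def group_roll(user_id: str, group_id: str) -> float:
    # One roll per (user, group); all flags in the group share it.
    digest = hashlib.sha256(f"{user_id}-{group_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") / 2 ** 32

def button_variant(user_id: str) -> str:
    r = group_roll(user_id, "button-size-group")
    if r < 0.10:
        return "bigger"   # flag A: +10% button size
    if r < 0.20:
        return "smaller"  # flag B: -10% button size
    return "control"
```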

xmonkee|3 years ago

Huh, thanks for this. This will live in my head from now on. Basically, in this situation, hash on (user_id, feature_name), not just the user_id.

yiuywieyrw|3 years ago

A better way is to assign an internal salt to the feature at creation time and use that; that way you are not dependent on something external that the user (the creator of the feature flag) could change. I bear the scars of this design mistake from when I worked for a company that provided feature flagging. It was not my initial mistake, but I drew the short straw trying to work around it.
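One way to sketch the salt idea (an assumption, not the company's actual implementation: a random salt generated once at flag creation and stored with the flag, so renaming the flag later doesn't reshuffle who is enabled):

```python
import hashlib
import secrets

class FeatureFlag:
    def __init__(self, fraction: float):
        # Salt is fixed at creation time and stored with the flag;
        # it never changes, even if the flag's name does.
        self.salt = secrets.token_hex(16)
        self.fraction = fraction

    def enabled_for(self, user_id: str) -> bool:
        digest = hashlib.sha256(f"{self.salt}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:4], "big") / 2 ** 32 < self.fraction
```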

dabeeeenster|3 years ago

Haha, funnily enough that is our (Flagsmith's) exact algorithm!

dekhn|3 years ago

Is there a more general term for tuple hashing? I.e., is there math and theory around composing hashes from concatenated (or otherwise combined) typed values?

shoo|3 years ago

> you can't draw meaningful conclusions when some users get every new feature and some get none at all.

Those poor cursed users. Reminiscent of a joke that used to grace Paul Mineiro's blog (machinedlearnings.com) a few years ago:

> An old curse says: "may you always be assigned to a test bucket."

Eduard|3 years ago

> That first set now has both features, and 56-FF have none. It turns out you can't draw meaningful conclusions when some users get every new feature and some get none at all.

00-1A: have feature flags A and B

1B-55: have feature flag A only

56-FF: have no feature flag

So the actual gotcha here is that there is no cohort for "feature flag B only", right?

This setup can actually be desirable if feature B depends on feature A.

duskwuff|3 years ago

> So the actual gotcha here is that there is no cohort for "feature flag B only", right?

And that, as you add more tests, user 00 will always get the test treatment for every test. If you're running a lot of tests which introduce experimental features or changes to workflows, user 00 is probably going to find the site a lot more chaotic and hard to understand than user FF, and that will skew your test results.

dylanpyle|3 years ago

> So the actual gotcha here is that there is no cohort for "feature flag B only", right?

Yep, exactly - by "that first set" I meant the 00-1A group, could have been clearer. Whatever the smallest rollout bucket is, that group is guaranteed to have every single feature.

This was quite a while ago, but I think the actual case where we noticed this involved several features each released to 50% of the userbase - so every single user unintentionally had either all of them or none at all.