top | item 42901415

(no title)

thraxil | 1 year ago

I would add two things:

It's often important that flag changes be atomic. Having subsequent requests get different flag values because they got routed to different backend nodes while a change is rolling out could cause some nasty bugs. A big part of the value of feature flags is to help avoid those kind of problems with rolling out config changes; if your flags implementation suffers from the same problem, it's not very useful.

Second, config changes are notorious as the cause of incidents. It's hard to "unit test" config changes to the production environment the same way you can with application code. Having people editing a config every time they want to change a flag setting (we're a tiny company and we change our flags multiple times per day) seems like a recipe for disaster.

discuss

withinboredom|1 year ago

Making changes atomic is literally impossible, it's easier to just assume they won't be than chasing down something computer science tells us is impossible. I assume you are saying "every node sees the same change at the same time" when you say "atomic."

As for unit testing flags, you better unit test them! Just mock out your feature flag provider/whatever and test your feature in isolation; like everything else.

thraxil|1 year ago

It seems like you've kind of missed both of my points.

If you're doing canary deploys to a fleet of 2000 nodes, it might take hours for the config to make it to all of them (I've seen systems where a fleet upgrade can take a week to make it all the way out). If your feature flags are configured that way, there's a long time that the state of a flag will be in that in-between state. We put feature flags in the database not config/environment so that we can turn a feature on or off more or less atomically. Ie, an admin goes into the management interface, flips a flag from off to on and then every single request that the system serves after that reflects that state. As long as you're using a database that supports transactions, you absolutely can have a clear point in time that delineates before/after that change. Rolling out a config change to a large fleet, you don't get that.

On the second point, what I'm saying is that (talk to your friendly local SRE if you don't believe me), a large percentage of production incidents in large systems are because of configuration changes, not application changes. This is because those things are significantly harder to really test than application code. Eg, if someone sets an environment variable for the production environment like `REDIS_IP=10.0.0.13` how do you know that's the correct IP address in that environment? You can add a ton of linting, you can do reviews, etc, but ultimately, it's a common vector for mistakes and it's one of the hardest areas to completely prevent human error from creating a disaster. One of the best strategies we have is to structure the system so you don't have to make manual environment/config changes that often. If you implement your feature flag system with environment variables/config, you'll be massively increasing the frequency that people are editing and changing that part of the system, which increases the chances of somebody making a typo, forgetting to close a quote, missing a trailing comma in a json file, etc.

Where I work we make production config changes maybe once a week or so and it's done by people who know the infrastructure very well, there's a bunch of linting and validation, and the change is rolled out with a canary system. In contrast, feature flags are in the database and we have a nice, very safe custom UI so folks on the Product and Support teams can manage the flags themselves, turning them on/off for different customers without having to go through an engineer; they might toggle flags a dozen times a day.