thraxil | 1 year ago

It seems like you've kind of missed both of my points.

If you're doing canary deploys to a fleet of 2000 nodes, it might take hours for the config to make it to all of them (I've seen systems where a fleet upgrade can take a week to make it all the way out). If your feature flags are configured that way, a flag spends a long time in an in-between state, on for some nodes and off for others. We put feature flags in the database, not in config/environment, so that we can turn a feature on or off more or less atomically. I.e., an admin goes into the management interface, flips a flag from off to on, and then every single request that the system serves after that reflects that state. As long as you're using a database that supports transactions, you absolutely can have a clear point in time that delineates before/after that change. Rolling out a config change to a large fleet, you don't get that.
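To make the "clear point in time" concrete, here's a minimal sketch of a database-backed flag, assuming a single shared database that every request handler reads from. SQLite stands in for whatever transactional database you actually run, and the table and function names (`feature_flags`, `set_flag`, `is_enabled`) are made up for illustration, not taken from any particular system.

```python
import sqlite3


def open_flag_store(path=":memory:"):
    """Open the (hypothetical) flag store and create the table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_flags ("
        " name TEXT PRIMARY KEY,"
        " enabled INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def set_flag(conn, name, enabled):
    """Flip a flag inside one transaction.

    Using the connection as a context manager commits on success and
    rolls back on an exception, so readers see the old value or the
    new value, never a half-applied change.
    """
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO feature_flags (name, enabled) VALUES (?, ?)",
            (name, int(enabled)),
        )


def is_enabled(conn, name):
    """Look up the flag at request time; unknown flags count as off."""
    row = conn.execute(
        "SELECT enabled FROM feature_flags WHERE name = ?", (name,)
    ).fetchone()
    return bool(row[0]) if row else False
```

Every request that calls `is_enabled` after the `set_flag` transaction commits sees the new state, with no per-node rollout in between. A real system would likely add caching with a short TTL, which trades a little of that atomicity for fewer database reads.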

On the second point, what I'm saying is that a large percentage of production incidents in large systems are caused by configuration changes, not application changes (talk to your friendly local SRE if you don't believe me). This is because config is significantly harder to really test than application code. E.g., if someone sets an environment variable for the production environment like `REDIS_IP=10.0.0.13`, how do you know that's the correct IP address in that environment? You can add a ton of linting, you can do reviews, etc., but ultimately it's a common vector for mistakes, and it's one of the hardest areas in which to completely prevent human error from creating a disaster. One of the best strategies we have is to structure the system so you don't have to make manual environment/config changes that often. If you implement your feature flag system with environment variables/config, you'll be massively increasing how often people edit that part of the system, which increases the chances of somebody making a typo, forgetting to close a quote, missing a trailing comma in a JSON file, etc.
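Here's a rough sketch of why linting only gets you part of the way with something like `REDIS_IP`. The `lint_redis_ip` helper is hypothetical; it uses Python's standard `ipaddress` module to check that the value is well-formed, which is about all a linter can know without talking to the real environment.

```python
import ipaddress


def lint_redis_ip(env):
    """Return a list of problems found in the REDIS_IP setting.

    This only checks that the value parses as an IP address. It cannot
    tell whether that address is the right Redis host for this environment.
    """
    value = env.get("REDIS_IP")
    if value is None:
        return ["REDIS_IP is not set"]
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return [f"REDIS_IP={value!r} is not a valid IP address"]
    return []


# A malformed value is caught...
print(lint_redis_ip({"REDIS_IP": "10.0.0.1e"}))
# ...but a well-formed typo (13 transposed to 31) passes the lint cleanly.
print(lint_redis_ip({"REDIS_IP": "10.0.0.31"}))
```

The second case is the dangerous one: it's syntactically fine, so it sails through validation and only shows up as a problem once it's live.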

Where I work, we make production config changes maybe once a week or so. They're done by people who know the infrastructure very well, there's a bunch of linting and validation, and each change is rolled out with a canary system. In contrast, feature flags are in the database and we have a nice, very safe custom UI so folks on the Product and Support teams can manage the flags themselves, turning them on/off for different customers without having to go through an engineer; they might toggle flags a dozen times a day.
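For the per-customer part, a minimal sketch of what the UI might write to, assuming one row per (customer, flag) pair. This isn't our actual schema; `customer_flags`, `set_customer_flag`, and `flag_for` are hypothetical names, and SQLite is again just a stand-in.

```python
import sqlite3


def open_customer_flag_store(path=":memory:"):
    """Create the (hypothetical) per-customer flag table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_flags ("
        " customer_id TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " enabled INTEGER NOT NULL,"
        " PRIMARY KEY (customer_id, name))"
    )
    conn.commit()
    return conn


def set_customer_flag(conn, customer_id, name, enabled):
    """What the admin UI would call: one small transactional write."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT OR REPLACE INTO customer_flags (customer_id, name, enabled)"
            " VALUES (?, ?, ?)",
            (customer_id, name, int(enabled)),
        )


def flag_for(conn, customer_id, name, default=False):
    """Resolve a flag for one customer, falling back to a default."""
    row = conn.execute(
        "SELECT enabled FROM customer_flags WHERE customer_id = ? AND name = ?",
        (customer_id, name),
    ).fetchone()
    return bool(row[0]) if row else default
```

The point is that a toggle is a constrained write to one row, which is a lot harder to get catastrophically wrong than hand-editing a config file, so it's reasonable to hand it to non-engineers.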
