item 42900678


rav | 1 year ago

I don't understand what a dedicated "completely open source offering" provides or what your "five nines feature flag system" provides. If you're running on a simple system architecture, then you can sync some text files around, and if you have a more scalable distributed architecture, then you're probably already handling some kind of slowly-changing, centrally-managed system state at runtime (e.g. authentication/authorization, or in-app news updates, ...) where you can easily add another slowly-changing, centrally-managed bit of data to be synchronised. How do you measure the nines on a feature flag system, if you're not just listing the nines on your system as a whole?


foobazgt | 1 year ago

> If you're running on a simple system architecture,

His point was that even a feature flag system in a complex environment with substantial functional and system requirements is worth building vs. buying. If your needs are simpler, then this is even more true!

I'm having a hard time making sense out of the rest of your comment, but in larger businesses the kinds of things you're dealing with are:

- low latency / staleness: You flip a flag, and you'll want to see the results "immediately", across all of the services in all of your datacenters. Think on the order of one second vs., say, 60s.

- scalability: Every service in your entire business will want to check many feature flags on every single request. For a naive architecture this would trivially turn into ungodly QPS. Even if you took a simple caching approach (say cache and flush on the staleness window), you could be talking hundreds of thousands of QPS across all of your services. You'll probably want some combination of pull and push. You'll also need the service to be able to opt into the specific sets of flags that it cares about. Some services will need to be more promiscuous and won't know exactly which flags they need to know in advance.

- high availability: You want to use these flags everywhere, including your highest availability services. The best architecture for this is that there's not a hard dependency on a live service.

- supports complex rules: Many flags will have fairly complicated rules requiring local context from the currently executing service call. Something like: "If this customer's preferred language code is ja-JP, and they're using one of the following devices (Samsung Android blah, iPhone blargh), and they're running versions 1.1-1.4 of our app, then disable this feature". You don't want to duplicate this logic in every individual service, and you don't want to make an outgoing service call (remember, H/A), so you'll be shipping these rules down to the microservices, and you'll need a rules engine that they can execute locally.

- supports per-customer overrides: You'll often want to manually flip flags for specific customers regardless of the rules you have in place. These exclusion lists can get "large" when your customer base is very large, e.g. thousands of manual overrides for every single flag.

- access controls: You'll want to dictate who can modify these flags. For example, some eng teams will want to allow their PMs to flip certain flags, while others will want certain flags hands off.

- auditing: When something goes wrong, you'll want to know who changed which flags and why.

- tracking/reporting: You'll want to see which feature flags are being actively used so you can help teams track down "dead" feature flags.

This list isn't exhaustive (just what I could remember off the top of my head), but you can start to see why they're an endeavor in and of themselves and why products like LaunchDarkly exist.
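To make the local-evaluation points above concrete, here's a toy sketch (all names hypothetical, not any real SDK's API): rules are shipped down and evaluated against local request context, per-customer overrides win over rules, and a TTL'd snapshot keeps evaluation off the network on the request path, with the last good snapshot retained if a refresh fails.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Rule:
    """One targeting rule; all non-empty conditions must match the
    local request context (like the ja-JP / device / app-version
    example above)."""
    value: bool
    languages: set = field(default_factory=set)
    devices: set = field(default_factory=set)
    min_version: tuple = (0,)
    max_version: tuple = (999,)

    def matches(self, ctx: dict) -> bool:
        if self.languages and ctx.get("language") not in self.languages:
            return False
        if self.devices and ctx.get("device") not in self.devices:
            return False
        return self.min_version <= ctx.get("app_version", (0,)) <= self.max_version

@dataclass
class Flag:
    default: bool
    rules: list = field(default_factory=list)
    overrides: dict = field(default_factory=dict)  # customer_id -> bool

class FlagClient:
    """Evaluates flags against a locally cached snapshot, so there is
    no outgoing service call on the request path. `fetch` stands in
    for whatever pull/push transport delivers snapshots centrally."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._snapshot = {}
        self._fetched_at = float("-inf")

    def _refresh_if_stale(self):
        if self._clock() - self._fetched_at >= self._ttl:
            try:
                self._snapshot = self._fetch()
                self._fetched_at = self._clock()
            except Exception:
                pass  # H/A: keep serving the last good snapshot

    def is_enabled(self, name: str, ctx: dict) -> bool:
        self._refresh_if_stale()
        flag = self._snapshot.get(name)
        if flag is None:
            return False  # fail closed for unknown flags
        # Manual per-customer overrides win over targeting rules.
        if ctx.get("customer_id") in flag.overrides:
            return flag.overrides[ctx["customer_id"]]
        for rule in flag.rules:
            if rule.matches(ctx):
                return rule.value
        return flag.default
```

The worked example from the bullet above ("ja-JP, one of these devices, app 1.1-1.4, then disable") would become `Rule(value=False, languages={"ja-JP"}, devices={...}, min_version=(1, 1), max_version=(1, 4))`. A real system would also need the opt-in flag subscriptions and push path described above.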

echelon | 1 year ago

> if you're not just listing the nines on your system as a whole

At scale the nines of your feature flagging system become the nines of your company.

We have a massive distributed systems architecture handling billions in daily payment volume, and flags are critical infra.

Teams use flags for different things. Feature rollout, beta test groups, migration/backfill states, or even critical control plane gates. The more central a team's services are as common platform infrastructure, the more important it is that they handle their flags appropriately, as the blast radius of outages can spiral outwards.

Teams have to be able to competently handle their own flags. You can't be sure what downstream teams are doing: if they're being safe, practicing good flag hygiene, failing closed/open, keeping sane defaults up to date, etc.

Mistakes with flags can cause undefined downstream behavior: sometimes state corruption (e.g. with complicated multi-stage migrations), or even thundering herds that take down systems all at once. You hope that teams take measures to prevent this, but you also have to help protect them from themselves.
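One common guard against the all-at-once problem is to ramp flags by stable hashing rather than flipping them globally. A minimal sketch (function name is made up):

```python
import hashlib

def in_rollout(flag_name: str, unit_id: str, percent: float) -> bool:
    """Stable percentage rollout: each (flag, unit) pair hashes to a
    fixed bucket in [0, 100), so ramping 1% -> 5% -> 25% only ever
    adds units. Nobody flaps in and out between checks, and a ramp
    never hits the whole fleet/customer base at once."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64 * 100
    return bucket < percent
```

Hashing on `flag_name` as well as the unit ID means different flags ramp over different subsets, so one unlucky customer isn't first in line for every rollout.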

> slowly-changing, centrally-managed system state at runtime

With flags being so essential, we have to be able to service them with near-perfect uptime. We must be able to handle application / cluster restart and make sure that downstream services come back up with the correct flag states for every app that uses flags. In the case of rolling restarts with a feature flag outage, the entire infrastructure could go hard down if you can't do this robustly. You're never given the luxury of knowing when the need might arise, so you have to engineer for resiliency.

An app can't start serving traffic with the wrong flags, or things could go wrong. So it's a hard critical dependency to make sure you're always available.
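One way to get that property is to persist the last-known-good flag state locally and gate startup on it. A rough sketch, assuming a JSON file at a made-up path (real systems might use a sidecar, a relay proxy, or baked config instead):

```python
import json
import os
import tempfile

SNAPSHOT_PATH = "/var/run/myapp/flags.json"  # hypothetical location

def save_snapshot(flags: dict, path: str = SNAPSHOT_PATH) -> None:
    """Atomically persist the last-known-good flag state so a rolling
    restart during a flag-service outage doesn't come back up with
    the wrong flags."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump(flags, f)
    os.replace(tmp, path)  # atomic: readers never see a partial file

def load_flags_or_refuse(fetch, path: str = SNAPSHOT_PATH) -> dict:
    """Startup order: prefer a live fetch and re-persist it; fall back
    to the on-disk snapshot; refuse to serve traffic if neither is
    available, since starting with wrong flags is worse than staying
    down."""
    try:
        flags = fetch()
        save_snapshot(flags, path)
        return flags
    except Exception:
        pass
    try:
        with open(path) as f:
            return json.load(f)
    except OSError:
        raise RuntimeError("no flag state available; refusing to serve")
```

The key design choice is the last branch: a service that can't establish its flag state fails its readiness check rather than guessing, which is exactly the "hard critical dependency" trade-off described above.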

Feature flags sit so close to your overall infrastructure shape that it's really not a great idea to outsource them. When you have traffic routing and service discovery listening to flags, do you really want LaunchDarkly managing that?