

You're right: for intra-cluster calls, where failures are scoped to the node itself and the infra around it, per-instance breakers are what you want. I wouldn't suggest centralizing those. I might be wrong, but in most of these scenarios there is no fallback anyway (except maybe Redis?).

Openfuse is aimed at the other case: shared external dependencies where 15 services all call the same dependency and each one is independently discovering the same outage at different times. That case has different failure modes and different coordination needs, and with purely local breakers you have no way to manually intervene or even just see what's open. Think of your house: every appliance has its own protection system, but you still need a distribution board.

You can also put it between your service/monolith and your own other services. For example, if a recommendations engine or a loyalty system in e-commerce or POS software goes down, all hot-path flows from the other services will simply bypass their calls to it. So by "external" I mean another service, whether it's yours or a vendor's.

On the feature flag point: that's interesting, because you're essentially describing the pain of building circuit breaker behavior on top of feature flag infrastructure. The "switching back" problem you mention is exactly what the half-open state solves: controlled probe requests that test recovery automatically and restore traffic gradually, without someone manually flipping a flag and hoping. That's the gap between "we can turn things off" and "the system recovers on its own." But yeah, you could call Openfuse feature flags for resilience. As I said, it's a fusebox for your microservices.
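To make the half-open idea concrete, here is a minimal single-threaded sketch of a breaker with closed, open, and half-open states. It is not Openfuse's implementation or API: the class name, `failure_threshold`, `reset_timeout`, and the injectable `clock` are all assumptions made for illustration. It also simplifies recovery to a single probe that fully closes the breaker on success, rather than ramping traffic back gradually.

```python
import time


class CircuitBreaker:
    """Illustrative breaker; names and defaults are hypothetical."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can fake time
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()        # still cooling down: skip the dependency
            self.state = "half_open"     # cool-down elapsed: let one probe through
        try:
            result = fn()
        except Exception:
            self._on_failure()
            return fallback()
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens immediately; otherwise wait for the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # A successful call (including a probe) closes the breaker and resets the count.
        self.state = "closed"
        self.failures = 0
```

Usage would look like `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical call to the dependency and the lambda is the fallback. A production breaker would also need locking and a limit on concurrent probes.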

Curious how you handle the recovery side: is it the feature flag provider itself, or have you built something around it and stored the state in your own database?


nzach|10 days ago

> where 15 services all call the same dependency and each one is independently discovering the same outage at different times

I don't really see what problem this solves. If you have proper timeouts and circuit breakers in your service this shouldn't really matter. This solution will save a few hundred requests, but I don't think that makes much difference. If this is a pain point, it's easier to adjust the circuit-breaker settings (lower the error-rate threshold, increase the window, ...) than to introduce a whole new level of complexity.

> Curious how you handle the recovery side

We have a feature flag provider built in-house, but it doesn't support this use case, so what we did was create a flag where we put the % value we want to bring back and handle the logic inside the service. Example: if we want to bring back 6.25% (1/16) of our users, we switch back every user whose account-id ends in 'a'. For 12.5% (2/16), we switch back users whose account-id ends in either 'a' or 'b'. This is a pretty hacky solution, but it solves our problem when we need to transition from our fallback to our main flow.
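A rough sketch of that last-character bucketing, under some loud assumptions: account IDs end in a hexadecimal character that is roughly uniformly distributed, the flag value is expressed as a number of sixteenths, and the bucket order after 'a' and 'b' is arbitrary. The function and constant names are hypothetical and are not taken from the in-house system described above.

```python
# Assumed bucket order: 'a' first, then 'b', matching the example above.
# The order of the remaining characters is an arbitrary choice for this sketch.
HEX_DIGITS = "abcdef0123456789"


def use_main_flow(account_id: str, enabled_sixteenths: int) -> bool:
    """Return True if this account should go back to the main flow.

    enabled_sixteenths is the flag value: 1 means 6.25% of users,
    2 means 12.5%, and 16 means everyone.
    """
    if not 0 <= enabled_sixteenths <= 16:
        raise ValueError("enabled_sixteenths must be between 0 and 16")
    enabled = set(HEX_DIGITS[:enabled_sixteenths])
    # An empty ID yields "" here, which is never in the set, so it stays on the fallback.
    return account_id[-1:].lower() in enabled
```

With the flag at 1, only IDs ending in 'a' return to the main flow; raising it to 2 adds IDs ending in 'b' while keeping the 'a' users where they already are, so no one flips back and forth as the percentage grows.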

rodrigorcs|10 days ago

> I don't really see what problem this solves. If you have proper timeouts and circuit breakers in your service this shouldn't really matter.

Each service discovering the outage on its own is not really the main problem my proposal addresses. The point is that with purely local breakers, we lack observability and have no way to act on them.

> what we did was create a flag where we put the % value we want to bring back

Oh I see, that is indeed a good problem to solve. Openfuse does not do that kind of gradual recovery, but it would be possible to add.

If Openfuse had that feature and you could self-host it, would you give it a try? Not trying to sell you anything, just gathering feedback so I can learn from the discussion.

By the way, if you don't mind, how often do you have to run that type of recovery?
