

I always thought the companies I worked for would implement chaos testing shortly after this talk/blog was released. However, only last year did we do anything even approaching chaos testing. I think this goes to show that the adage “the future is already here, just unevenly distributed” carries some truth in some contexts!

I think the companies I worked for were prioritizing no-issue deployments (built from a series of documented and undocumented manual processes!) over making services resilient through chaos testing. As a younger dev this priority struck me as heresy (come on guys, follow the herd!); as a more mature dev I understand that time and effort are scarce resources and the daily toil tax needs to be paid to make forward progress… it’s tough living in a non-ideal world!


oooyay|1 month ago

Chaos testing rarely uncovers anything significant or actionable beyond what you can suss out yourself with a thorough review, but it has the added potential for customer harm if you don't have all your ducks in a row. It also neatly requires, as a prerequisite, that you have your ducks in a row.

I think that's why most companies don't do it. It's a lot of tedium, and the main benefit was actually getting your ducks in a row.

closeparen|1 month ago

I think it is more of a social technology for keeping your ducks in a row. Developers won’t be able to gamble that something “never happens” if we induce it weekly.

GauntletWizard|1 month ago

Much of the value from chaos testing can be gotten much more simply with good rolling CI. Many of the problems that chaos engineering solved are now considered table stakes, directly implemented in our frameworks and tested well by that same CI.

A significant problem with early 'Web Scale' deployments was out-of-date or stale configuration values. You would specify that your application connects to backend1.example.com for payments and backend2.example.com for search. A common bug in early libraries was that the connection was established once at startup, and then never again. When the backend1 service was long lived, this just worked for months or years at a time - TCP is very reliable, especially if you have sane values on keepalives and retries. Chaos Monkey helped find this class of bug. A more advanced but quite similar class of bug: you configured a DNS name, which was evaluated once at startup and, again, never updated. Your server for backend1 had a stable address for years at a time, but suddenly you needed to fail over to your backup or move it to new hardware. At the time of Chaos Monkey, I had people fight me on this - they believed that doing a DNS lookup every five minutes for your important backends was unacceptable overhead.
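To make the "resolved once at startup" bug concrete, here is a minimal Python sketch of the fix: re-resolve the name once a TTL expires instead of caching it forever. The class name `RefreshingResolver`, the helper `system_resolve`, and the injectable `resolve`/`clock` parameters are made up for illustration; the 300-second default just mirrors the five-minute lookup mentioned above. It's a sketch under those assumptions, not how any particular library does it.

```python
import socket
import time
from typing import Callable, List, Optional


def system_resolve(host: str, port: int) -> List[str]:
    """Resolve host through the OS resolver and return unique IP strings."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for info in infos:
        ip = info[4][0]  # sockaddr is the 5th field; its first item is the IP
        if ip not in addresses:
            addresses.append(ip)
    return addresses


class RefreshingResolver:
    """Hypothetical helper: caches a lookup, but only for ttl_seconds."""

    def __init__(
        self,
        host: str,
        port: int,
        ttl_seconds: float = 300.0,  # assumption: the five-minute interval
        resolve: Callable[[str, int], List[str]] = system_resolve,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._host = host
        self._port = port
        self._ttl = ttl_seconds
        self._resolve = resolve
        self._clock = clock
        self._cached: Optional[List[str]] = None
        self._fetched_at = 0.0

    def addresses(self) -> List[str]:
        """Return cached addresses, re-resolving once the TTL has expired."""
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            try:
                self._cached = self._resolve(self._host, self._port)
                self._fetched_at = now
            except OSError:
                # Lookup failed: keep serving stale addresses if we have any,
                # and leave the timestamp alone so the next call retries.
                if self._cached is None:
                    raise
        return self._cached
```

The buggy version is the same thing with an infinite TTL. You'd call `addresses()` each time you open a new connection, so a failover to a new IP gets picked up within one TTL rather than at the next restart.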

The other part is - modern deployment strategies make these old problems untenable to begin with. If you're deploying on Kubernetes, you don't have an option here - your pods are getting rebuilt with new IP addresses regularly. If you're connecting to a service IP, then that IP is explicitly a LB - it is defined as stable. These concepts are not complex, but they are edge boundaries, and we have better and more explicit contracts because we've realized the need and you "just do" deploy this way now.

Those are just Chaos Monkey problems, though - Latency Monkey is huge, but solves a much less common problem. Conformity Monkey is mostly solved by compliance tools; you don't build it, you buy it. Doctor Monkey is just healthchecks - K8s (and other deployment frameworks) have those built in.

In short, Chaos Monkey isn't necessary because we've injected the chaos and learned to control most of what that was doing, and people have adopted the other tools - They're just not standalone, they're built in.

bpt3|1 month ago

It's a great way of thinking about resiliency and fault tolerance, but it's also definitely on the very mature end of the systems engineering spectrum.

If you know things will break when you start making non-deterministic configuration changes, you aren't ready for chaos engineering. Most companies never get out of this state.

closeparen|1 month ago

Having a few fault injection scenarios is baby steps. Next would be Jepsen-style testing, and most mature would be formal verification.
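For anyone wondering what a "baby steps" fault injection scenario can look like, here is a rough Python sketch: wrap a dependency call so it fails at a configurable rate, then check that the caller's retry logic copes. The names `InjectedFault`, `with_fault_injection`, and `call_with_retries` are invented for this example and don't come from any particular chaos tool.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised instead of making the real call, to simulate a failed dependency."""


def with_fault_injection(
    call: Callable[[], T], failure_rate: float, rng: random.Random
) -> Callable[[], T]:
    """Wrap call so that roughly failure_rate of invocations raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0.0 and 1.0")

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("injected dependency failure")
        return call()

    return wrapped


def call_with_retries(call: Callable[[], T], attempts: int) -> T:
    """Try call up to attempts times; re-raise the last injected fault if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[InjectedFault] = None
    for _ in range(attempts):
        try:
            return call()
        except InjectedFault as exc:
            last_error = exc
    assert last_error is not None  # the loop ran at least once and never returned
    raise last_error
```

Passing in a seeded `random.Random` keeps a scenario reproducible, so a failure found in CI can be replayed locally with the same seed.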