top | item 31208524

(no title)

rystsov | 3 years ago

It wasn't a big surprise for us. Redpanda is a complex distributed system with multiple components even at the core level: consensus, idmepotency, transactions so we were ready that something might be off (but we were pleased to find that all the safety issues were with the things which were behind the feature flags at the time).

Also we have internal chaos test and by the time partnership with Kyle started we already identified half of the consistency issues and sent PRs with fixes. The issues got in the report because by the time we started the changes weren't released yet. But it is acknowledged in the report

> The Redpanda team already had an extensive test suite— including fault injection—prior to our collaboration. Their work found several serious issues including duplicate writes (#3039), inconsistent offsets (#3003), and aborted reads/circular information flow (#3036) before Jepsen encountered them

We missed other issues because haven't exercised some scenario. As soon as Kyle found the issues we were able to reproduce them with the in-house chaos tests and fix. This dual testing (jepsen + existing chaos harness) approach was very beneficial. We were able to check the results and give feedback to Kyle if he found a real thing or if it looks more like an expected behavior.

We fixed all the consistency (safety) issues, but there are several unresolved availability dips. We'll stick with Jepsen (the framework) until we're sure we fixed then too. But then we probably rely just on the in house tests.

Clojure is very powerful language and I was truly amazed how fast Kyle for able to adjust his tests to new information but we don't have clojure expertise and even simple tasks take time. So it's probably wiser to use what we already know even it it a bit more verbose.

discuss

No comments yet.