ewencp | 8 years ago

It's not Jepsen, but we actually do a fair amount of system and integration testing, some of which kills nodes (a random node, the controller, etc.) and validates that data is delivered correctly. There is ongoing work to add other fault injection tests: https://issues.apache.org/jira/browse/KAFKA-5476
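
To give a rough idea of the shape of a "kill a node, then validate delivery" test, here is a toy sketch. It is not the actual Kafka test code: `ToyCluster`, `run_kill_test`, and the replicate-to-every-live-node rule are all hypothetical stand-ins that run in memory, assumed only for illustration.

```python
import random


class ToyCluster:
    """Hypothetical in-memory stand-in for a replicated log (not Kafka's API)."""

    def __init__(self, num_nodes=3):
        self.logs = {node: [] for node in range(num_nodes)}
        self.alive = set(self.logs)

    def produce(self, msg):
        # Assumption: a message is acked only once every live replica has it.
        if not self.alive:
            return False
        for node in self.alive:
            self.logs[node].append(msg)
        return True

    def kill(self, node):
        # Fault injection: the node stops receiving writes and serving reads.
        self.alive.discard(node)

    def consume_all(self):
        # Read the full log back from any surviving replica.
        node = next(iter(self.alive))
        return list(self.logs[node])


def run_kill_test(num_messages=100, seed=0):
    rng = random.Random(seed)
    cluster = ToyCluster(num_nodes=3)
    acked = []
    kill_at = rng.randrange(num_messages)
    for i in range(num_messages):
        if i == kill_at:
            # Kill one randomly chosen live node partway through the run.
            cluster.kill(rng.choice(sorted(cluster.alive)))
        if cluster.produce(i):
            acked.append(i)
    consumed = cluster.consume_all()
    # Validation step: every acked message must still be readable.
    missing = [m for m in acked if m not in consumed]
    return acked, consumed, missing
```

The part that matters is the final check: anything the producer was told was acked must show up on the consumer side after the fault, otherwise the test has found a data loss bug.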

This has already caught quite a few bugs, including one that led to a change to the replication protocol which is included in this 0.11.0.0 release. See the community KIP here: https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+...

One cool thing that happened recently is that these tests were modified to make the client implementation pluggable (https://github.com/apache/kafka/pull/2048). Confluent uses this functionality to test all of its clients (librdkafka, confluent-kafka-python, confluent-kafka-go, confluent-kafka-dotnet) in addition to the Java clients. This not only makes us confident in these clients from their first release, but has also found dozens of bugs in both the clients and the broker implementation itself. Getting automated testing across many clients has really stepped up the quality and robustness of both existing and new features.
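
As a loose illustration of what "pluggable client" means here, the sketch below runs one delivery check against two interchangeable producer implementations. The names (`ProducerUnderTest`, `InMemoryProducer`, `BatchingProducer`, `delivery_test`) are hypothetical and are not taken from the linked pull request; the real tests drive actual client processes against real brokers.

```python
from abc import ABC, abstractmethod


class ProducerUnderTest(ABC):
    """Hypothetical interface that every plugged-in client must implement."""

    @abstractmethod
    def send(self, value): ...

    @abstractmethod
    def close(self): ...

    @abstractmethod
    def acked(self): ...


class InMemoryProducer(ProducerUnderTest):
    """Writes each message to the shared log immediately."""

    def __init__(self, log):
        self.log = log
        self._acked = []

    def send(self, value):
        self.log.append(value)
        self._acked.append(value)

    def close(self):
        pass  # nothing is buffered, so there is nothing to flush

    def acked(self):
        return list(self._acked)


class BatchingProducer(ProducerUnderTest):
    """Buffers messages and writes them to the shared log in batches."""

    def __init__(self, log, batch_size=10):
        self.log = log
        self.batch_size = batch_size
        self._pending = []
        self._acked = []

    def send(self, value):
        self._pending.append(value)
        if len(self._pending) >= self.batch_size:
            self._flush()

    def _flush(self):
        self.log.extend(self._pending)
        self._acked.extend(self._pending)
        self._pending.clear()

    def close(self):
        self._flush()  # deliver whatever is still buffered

    def acked(self):
        return list(self._acked)


def delivery_test(make_producer, num_messages=25):
    # The test body never names a concrete client; it only uses the interface.
    log = []
    producer = make_producer(log)
    for i in range(num_messages):
        producer.send(i)
    producer.close()
    acked = producer.acked()
    return len(acked) == num_messages and all(m in log for m in acked)
```

Something like `delivery_test(InMemoryProducer)` and `delivery_test(BatchingProducer)` then exercises both clients with the same assertions, which is the property that lets one test suite cover several client libraries.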

If you're interested, the tests themselves are here: https://github.com/apache/kafka/tree/trunk/tests

noslowerdna | 8 years ago

For what it's worth, the Jira for KIP-101 was created in January 2014. This has been a known potential Kafka data loss scenario for quite a while; it just took some time (and evidently the findings of these new stress tests) for it to be prioritized as a serious problem that needed to be fixed.