top | item 11261703

Announcing Kafka Streams: Stream Processing Made Simple

49 points| nehanarkhede | 10 years ago |confluent.io

3 comments

ameyamk|10 years ago

This looks great. Some open questions -

1. How does it scale horizontally? if I am consuming from say topic with 100 partitions - do I have to deploy this on library on multiple nodes? or expectation is we need to embed this inside spark streaming/ storm/ samza/ any other real time processing framework?

2. Is RocksDB instance local to your consumer/ thread/ node? or can state be stored across nodes? What happens when joins are needed across the partitions?

I have been doing real time processing for a long time now - and this does capture most of the typical things you end up doing with kafka with spark/ storm etc- so this does look a significant step forward.

I am just trying to understand where it exactly fits.

miguno|10 years ago

Disclaimer: I work for Confluent (www.confluent.io). That said, let me clarify that Kafka Streams is a component of the Apache Kafka open source project. There's nothing proprietary about it. The reason the links below point to docs.confluent.io is because the Apache Kafka docs for Kafka Streams are not ready yet; but they will when Kafka 0.10 is released, which will be the first official Kafka release that includes Kafka Streams.

> 1. How does it scale horizontally?

See http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h....

> Do I have to deploy this on library on multiple nodes?

Kafka Streams really is a normal Java library. You include it in "your" Java application just like you'd include a library such as Apache Commons, Google Guava, or JUnit. There's no "deploying" of this library -- i.e. no need to pre-ship it to machines; also, you don't install a cluster or anything like that for Kafka Streams. It's a purely client-side library.

Interestingly, many folks I have talked to seem to be confused initially about the "library" aspect because I suppose, in the big data world, one has simply become used to (and pigeonholed by?) frameworks like Spark, Storm, etc. that you must deploy and operate separately (and which then dictate how they want you to write your own apps against them).

Does that make any sense?

> 2. Is RocksDB instance local to your consumer/ thread/ node?

Yes, it is local. But local state stores (RocksDB is but one choice for implementing these) are also replicated for fault tolerance. Details at http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h... and http://docs.confluent.io/2.1.0-alpha1/streams/architecture.h....

Regarding joins: We have documented some information at http://docs.confluent.io/2.1.0-alpha1/streams/developer-guid....

I hope this helps! If these pointers above aren't sufficient, please drop us a note to dev@kafka.apache.org (http://kafka.apache.org/contact.html). We really appreciate any such feedback (code, docs, whatever) to ensure people have an easy and fun time to get started with Kafka as well as its upcoming Kafka Streams library.

rollulus|10 years ago

Great job, Confluent! I'm really impressed by the helpful documentation and blog posts!