item 26112273

max_streese | 5 years ago

Hi, not sure if I am just completely off here, but I am wondering how this relates or compares to processing things with Kafka and Kafka Streams.

If I am reading things correctly, the Kafka equivalent of the workflow in the article would be: have your producer produce into some topic via the default partitioner (a hash of the key you are interested in). Your consumer would then just read it, and your data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees). It would also be co-partitioned correctly if you need to read in some other topic with the same number of partitions and the same logical keys, produced via the same algorithm. No?
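To make the co-partitioning point concrete, a toy stand-in (this is not Kafka's actual murmur2-based partitioner, just any deterministic hash modulo the partition count; topic and key names are made up):

```python
# Sketch: a deterministic hash of the key, mod the partition count,
# means the same key always lands in the same partition number. Two
# topics with equal partition counts therefore line up key-for-key.
def partition_for(key: str, num_partitions: int) -> int:
    # hypothetical stand-in for Kafka's hash(key) % numPartitions
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 4
orders = {p: [] for p in range(NUM_PARTITIONS)}    # stand-in topic A
payments = {p: [] for p in range(NUM_PARTITIONS)}  # stand-in topic B

for key in ["user-1", "user-2", "user-3"]:
    orders[partition_for(key, NUM_PARTITIONS)].append(key)
    payments[partition_for(key, NUM_PARTITIONS)].append(key)

# Each key ends up in the same partition of both "topics".
```

If the partition counts differed, the modulo would send the same key to different partition numbers, which is why Kafka Streams requires equal partition counts for joins.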

jsjsbdkj | 5 years ago

This is the most basic pattern for distributed joins: you hash on the join key in both tables and shuffle rows so that equal hash values land on the same node. In some systems, like Redshift, you can designate the distribution key so that "related" records are already co-located on a single shard.
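A toy sketch of that hash-shuffle join, with plain Python dicts standing in for a distributed runtime (table contents and the hash function are hypothetical):

```python
from collections import defaultdict

NUM_SHARDS = 4

def shard_of(join_key) -> int:
    # deterministic hash stand-in; the key point is that BOTH tables
    # are routed with the same function, so matching keys co-locate
    return sum(str(join_key).encode()) % NUM_SHARDS

users = [(1, "alice"), (2, "bob"), (5, "carol")]
orders = [(1, "book"), (5, "pen"), (2, "mug"), (1, "lamp")]

# "Shuffle": route each row of both tables to a shard by the join key.
shards = defaultdict(lambda: ([], []))
for uid, name in users:
    shards[shard_of(uid)][0].append((uid, name))
for uid, item in orders:
    shards[shard_of(uid)][1].append((uid, item))

# Each shard can now join locally -- no cross-shard lookups needed.
joined = []
for left, right in shards.values():
    names = dict(left)
    joined += [(uid, names[uid], item) for uid, item in right if uid in names]
```

A designated distribution key (the Redshift case) just means the shuffle already happened at write time, so only the local-join loop runs at query time.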

> our data would already be sorted for the given keys (because within a partition Kafka has sorting guarantees)

It's been a while since I used Kafka, but I don't remember "sorting guarantees". Consumers see events "in order" based on when they were produced, because each partition is a queue.

max_streese | 5 years ago

Yes, I guess my point is: when you use Kafka together with Kafka Streams and produce things partitioned the way you need them for consumption, you do not need to do any shuffling when you want to join, because the data is already partitioned correctly.
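A sketch of that no-shuffle case, assuming both topics were written with the same partitioner and partition count (in-memory dicts stand in for topic partitions; names are made up):

```python
def partition_of(key: str, n: int) -> int:
    # same deterministic routing on the write path for both topics
    return sum(key.encode()) % n

N = 3
clicks = [("u1", "home"), ("u2", "cart"), ("u1", "buy")]
profiles = [("u1", "DE"), ("u2", "US")]

topic_a = {p: [] for p in range(N)}
topic_b = {p: [] for p in range(N)}
for k, v in clicks:
    topic_a[partition_of(k, N)].append((k, v))
for k, v in profiles:
    topic_b[partition_of(k, N)].append((k, v))

# One "stream task" per partition joins locally, never touching the
# other partitions -- the shuffle already happened at produce time.
enriched = []
for p in range(N):
    table = dict(topic_b[p])
    enriched += [(k, page, table[k]) for k, page in topic_a[p] if k in table]
```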