trengrj | 5 years ago

With Pulsar vs Kafka, I don't see a strong functional argument for either one, since they have so much in common (distributed log, Java-based, avoid copying memory, use ZooKeeper). Because Kafka is better supported and better known, it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

I see the same with Spark vs Flink in that similarities outweigh differences. I wonder if this is some sort of emergent pattern in open source software.

majidazimi|5 years ago

There are real differences between them. Here are some painful aspects of Kafka:

1. A single partition is stored on one node (with replicas on other nodes). Because of this, introducing new nodes takes a very long time when partitions are large: a partition can only be replicated from one node (the partition leader). In Pulsar, each segment of a partition is stored on a different BookKeeper node.

2. Because of 1, if two consumers read parts of a partition that are far from each other, they compete for disk bandwidth. In Kafka, a consumer cannot read from a replica node. If a topic is really popular and many consumers read from it (from different parts of the file, which makes the OS page cache useless), the total consumption rate is limited to the disk bandwidth of a single node. In Pulsar, each consumer can read from different brokers, so catch-up consumers won't thrash streaming consumers.
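To put rough numbers on points 1 and 2, here is a back-of-the-envelope sketch in Python. The function names and the bandwidth figures are made up for illustration, and it assumes cold reads that miss the page cache and data spread evenly across nodes. It is not a benchmark of either system.

```python
def copy_hours(size_mb, source_mb_per_s, num_sources=1, sink_mb_per_s=float("inf")):
    """Hours to copy size_mb when reading from num_sources nodes in parallel.

    The receiving node's own bandwidth (sink_mb_per_s) caps the total rate.
    """
    effective_mb_per_s = min(source_mb_per_s * num_sources, sink_mb_per_s)
    return size_mb / effective_mb_per_s / 3600.0


def per_consumer_mb_per_s(disk_mb_per_s, nodes_serving, consumers):
    """Cold-read bandwidth each consumer gets if load is spread evenly."""
    return disk_mb_per_s * nodes_serving / consumers


# Hypothetical numbers: a 360,000 MB partition, 100 MB/s of disk read per node.
single_leader = copy_hours(360_000, 100)  # whole partition from one leader
striped = copy_hours(360_000, 100, num_sources=4, sink_mb_per_s=1000)

# Ten catch-up consumers reading cold data served by 1 node vs. 4 nodes.
one_node = per_consumer_mb_per_s(100, nodes_serving=1, consumers=10)
four_nodes = per_consumer_mb_per_s(100, nodes_serving=4, consumers=10)

print(single_leader, striped, one_node, four_nodes)
```

The point is only that both the copy time and the per-consumer read rate scale with the number of nodes that hold pieces of the partition, until the receiving side becomes the bottleneck.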

These are not problems that can be fixed easily. Additionally, in the realm of streaming, the difference between Flink and Spark is night and day. The low watermark feature that Flink offers makes them behave fundamentally differently.
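For anyone unfamiliar with the watermark idea, here is a minimal Python sketch of it, assuming integer event timestamps and a fixed bound on out-of-orderness. The class and method names are hypothetical and this is not Flink's API. It only shows the mechanism: the watermark trails the largest event time seen so far, and a window fires once the watermark passes its end.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per tumbling event-time window, firing on a watermark."""

    def __init__(self, window_size, max_out_of_orderness):
        self.window_size = window_size
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = None
        self.counts = defaultdict(int)  # window start -> event count
        self.emitted = []               # (window start, count), in firing order

    def watermark(self):
        # "No event older than this is expected any more."
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.max_out_of_orderness

    def on_event(self, event_time):
        start = event_time - event_time % self.window_size
        if start + self.window_size <= self.watermark():
            return False  # late: its window already fired; dropped in this sketch
        self.counts[start] += 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        self._fire_ready_windows()
        return True

    def _fire_ready_windows(self):
        wm = self.watermark()
        for start in sorted(self.counts):  # sorted() copies the keys first
            if start + self.window_size <= wm:
                self.emitted.append((start, self.counts.pop(start)))


counter = TumblingWindowCounter(window_size=10, max_out_of_orderness=5)
accepted = [counter.on_event(t) for t in [1, 4, 12, 3, 16, 2, 25]]
print(counter.emitted, accepted)
```

In this run the out-of-order event at time 3 still lands in the first window because the watermark has not passed that window's end yet, while the event at time 2 arrives after the window fired and is treated as late.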

toomanybits|5 years ago

1. is true, but if you want that data to move to a new node, it still needs to be replicated. Kafka's approach is to use tiered storage (which I believe is close to completion).

2. Kafka can read from a replica node. It's relatively new but it's there.
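As far as I know this is the follower-fetching feature from KIP-392, available from Kafka 2.4. A rough sketch of the settings involved, written as Python dicts that render to `.properties` lines. The rack name, bootstrap address and group id are placeholders, and `to_properties` is just a helper made up for this example.

```python
# Broker side (server.properties): give each broker a rack label and enable
# a rack-aware replica selector so followers may serve fetch requests.
broker_props = {
    "broker.rack": "rack-a",  # placeholder rack / availability-zone id
    "replica.selector.class":
        "org.apache.kafka.common.replica.RackAwareReplicaSelector",
}

# Consumer side: declare which rack the client is in, so the broker can
# point it at an in-sync replica in the same rack.
consumer_props = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "group.id": "example-group",            # placeholder group id
    "client.rack": "rack-a",
}


def to_properties(props):
    """Render a dict as key=value lines, one per setting."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


print(to_properties(broker_props))
print(to_properties(consumer_props))
```

Without `client.rack` set on the consumer, reads keep going to the partition leader as before.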

qaq|5 years ago

Pulsar is better for very large scale deployments, provided you have people to manage it.

z9e|5 years ago

Kafka is handling very large scale deployments just fine atm in all the big tech co's.

The only thing I can see that can make this true is Pulsar seems to have better elastic scalability. But it seems to score less on everything else. It has a much more complex storage system that ends up not matching Kafka's high-end throughput at large scale.

From what I recall, Twitter ended up abandoning BookKeeper due to storage scale concerns. Related: https://blog.twitter.com/engineering/en_us/topics/insights/2...

leafboi|5 years ago

>it seems Pulsar needs to be an order of magnitude more performant to capture developer mindshare.

Just to add to this, ease of use/setup is also a huge factor. There are technologies I can just spin up with zero knowledge and learn as I go. That is a huge factor in adoption, especially with Golang and Node.js.