majidazimi | 5 years ago

That's true, but the limitation is not fully resolved: to increase the consumption rate, we still need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper, so adding more brokers is trivial.

kevstev|5 years ago

How does Pulsar get around the fact that when you add a new broker, data needs to be moved over before that broker can start serving it? This seems like a basic law-of-physics type limitation to me.

addisonj|5 years ago

Hey, I work on Pulsar, will try and answer this :)

Topics (actually groups of topics, called bundles) are what gets assigned to brokers. Topic assignment is dynamic, so when a new broker is added, the system will try to shed load from the busiest brokers to even it out across the cluster.

But unlike Kafka, when a topic is assigned to a broker, there isn't much state to move: the broker mostly just takes ownership of the metadata and opens a new "ledger" (a chunk of the topic's data over a time window; only one ledger is ever open at a time). When it needs to serve older data, it pulls it from the BookKeeper nodes holding previous ledgers, so redistributing load is pretty quick. It also doesn't eagerly warm a cache.
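To make the "not much state to move" point concrete, here is a minimal sketch in Python (hypothetical names, not Pulsar's actual API) of why reassigning a topic to a new broker is cheap: only the open ledger changes hands as metadata, while closed ledgers stay put on the BookKeeper bookies.

```python
# Toy model of ledger ownership, illustrating the comment above.
# Assumption: names like Topic and assign_to_new_broker are invented
# for illustration; they are not Pulsar classes.

class Topic:
    def __init__(self, name):
        self.name = name
        self.closed_ledgers = []   # immutable chunks stored on bookies
        self.open_ledger = []      # only one ledger is open at a time

def assign_to_new_broker(topic):
    """Reassignment: close the current ledger and open a fresh one.
    No historical data is copied -- closed ledgers stay on the bookies."""
    if topic.open_ledger:
        topic.closed_ledgers.append(topic.open_ledger)
    topic.open_ledger = []         # the new owning broker writes here
    return topic

t = Topic("orders")
t.open_ledger = ["msg1", "msg2"]
assign_to_new_broker(t)
assert t.closed_ledgers == [["msg1", "msg2"]]   # sealed, still on bookies
assert t.open_ledger == []                      # new broker starts fresh
```

The design choice this models is the separation of serving (brokers) from storage (bookies): because a ledger is sealed rather than migrated, broker handover is a metadata operation, not a data copy.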

Now, as far as the cache goes, that is primarily for "tailing reads": as writes occur, clients that are close to the tip of the recent data get it straight from the broker, with no need to pull it from BookKeeper. This is one of the key parts of how Pulsar's multiple tiers of storage give it such consistently good latency.
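The two read paths described above can be sketched as follows (a simplified model, not Pulsar's actual classes or cache policy): tailing consumers hit the broker's in-memory cache, while consumers further behind fall through to the storage tier.

```python
# Simplified two-tier read path. Assumption: the cache contents and
# entry-id scheme here are invented for illustration.

BROKER_CACHE = {10: "e10", 11: "e11", 12: "e12"}   # recent entries only

def read_entry(entry_id, fetch_from_bookies):
    if entry_id in BROKER_CACHE:          # tailing read: no bookie round trip
        return BROKER_CACHE[entry_id]
    return fetch_from_bookies(entry_id)   # catch-up read: go to storage tier

bookie_calls = []
def from_bookies(i):
    bookie_calls.append(i)
    return f"e{i}"

assert read_entry(11, from_bookies) == "e11"
assert bookie_calls == []                 # served entirely from broker cache
assert read_entry(3, from_bookies) == "e3"
assert bookie_calls == [3]                # older entry required a bookie fetch
```

This is why a freshly assigned broker can serve traffic immediately: tailing readers repopulate the cache naturally as new writes arrive, and anything older is fetched on demand.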

In short, a broker serves three purposes:

1. Handling writes

2. Serving tailing reads (clients consuming right near the tip of the topic) from its in-memory cache

3. Serving catch-up reads by fetching older ledgers from BookKeeper

miguno|5 years ago

(copying this text from another comment of mine elsewhere)

Well, the Pulsar broker is (kinda) stateless, because brokers are essentially a caching layer in front of BookKeeper. But where's your data actually stored, then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a BookKeeper node requires the same redistribution of data as in Kafka's case. (Additionally, BookKeeper needs a separate data-recovery daemon to be run and operated: https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)

So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite the identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, as if you compared memcached (Pulsar broker) with Postgres (Kafka broker).

majidazimi|5 years ago

Network is faster than disk. Once the data is cached, you are only bound by network I/O for subsequent reads.