That's true, but the limitation is still not fully resolved. To increase the consumption rate, we need to add replicas. In Pulsar, brokers are merely cache nodes over BookKeeper, so adding more brokers is trivial.
How does Pulsar get around the fact that when you add a new broker, data needs to be moved over before that broker can start serving it? This seems like a basic law-of-physics type of limitation to me.
Hey, I work on Pulsar, will try and answer this :)
Topics (actually groups of topics, called bundles) are what get assigned to brokers. Topic assignment is dynamic, so when a new broker is added, the system will try to shed load from the busiest brokers to even it out across the system.
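As a rough illustration of the shedding described above, here is a toy Python sketch. It is not Pulsar's actual load manager: the `rebalance` function, the broker names, and the per-bundle load numbers are all made up for the example. It simply moves the lightest bundle from the busiest broker to the least loaded one for as long as that narrows the gap between them.

```python
def broker_load(assignments, bundle_load):
    """Total load per broker, given a mapping of broker -> set of bundles."""
    return {
        broker: sum(bundle_load[b] for b in bundles)
        for broker, bundles in assignments.items()
    }


def rebalance(assignments, bundle_load):
    """Toy load shedding (hypothetical, not Pulsar's algorithm).

    Mutates `assignments` in place and returns the list of moves made as
    (bundle, from_broker, to_broker) tuples.
    """
    moves = []
    while True:
        load = broker_load(assignments, bundle_load)
        busiest = max(load, key=load.get)
        idlest = min(load, key=load.get)
        if not assignments[busiest]:
            break
        # Pick the lightest bundle; break ties by name so runs are repeatable.
        bundle = min(assignments[busiest], key=lambda b: (bundle_load[b], b))
        gap = load[busiest] - load[idlest]
        # Moving a bundle only helps if it is smaller than the gap.
        if bundle_load[bundle] >= gap:
            break
        assignments[busiest].remove(bundle)
        assignments[idlest].add(bundle)
        moves.append((bundle, busiest, idlest))
    return moves


# A freshly added, empty broker ("broker-3") picks up bundles from the busiest one.
assignments = {
    "broker-1": {"b0", "b1", "b2", "b3"},
    "broker-2": {"b4", "b5"},
    "broker-3": set(),
}
bundle_load = {f"b{i}": 10 for i in range(6)}
print(rebalance(assignments, bundle_load))
```

Note that only bundle ownership changes hands here; no topic data appears anywhere in the model, which is the point being made in the comment.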
But unlike Kafka, when a topic is assigned to a broker, it doesn't have much state to move. Mostly it just gets metadata added to it and opens a new "ledger" (which is just a chunk of the topic's data over a time window; only one ledger is ever open at once). When it needs to serve data, it pulls that from BookKeeper nodes from previous ledgers, so the process of redistributing load is pretty quick. It also doesn't eagerly pull data into a cache.
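To make the "not much state to move" point concrete, here is a minimal toy model in Python. The class and function names (`BookKeeper`, `TopicMetadata`, `assign_topic`, and so on) are invented for this sketch and are not Pulsar or BookKeeper APIs. Reassigning a topic only updates the owner and opens a new ledger; earlier ledgers stay where they are and are read in place.

```python
class BookKeeper:
    """Stand-in for the bookie cluster: maps ledger id -> list of entries."""

    def __init__(self):
        self.ledgers = {}
        self._next_id = 0

    def create_ledger(self):
        ledger_id = self._next_id
        self._next_id += 1
        self.ledgers[ledger_id] = []
        return ledger_id


class TopicMetadata:
    """The only thing that changes on reassignment: owner and ledger list."""

    def __init__(self):
        self.owner = None
        self.ledger_ids = []


def assign_topic(meta, broker, bk):
    """Hand the topic to `broker` by updating metadata and opening a new ledger.

    No existing entries are copied anywhere.
    """
    meta.owner = broker
    meta.ledger_ids.append(bk.create_ledger())


def write(meta, bk, entry):
    """Append to the single open ledger (always the most recent one)."""
    bk.ledgers[meta.ledger_ids[-1]].append(entry)


def read_all(meta, bk):
    """Serve reads from whichever ledgers hold the data, old or new."""
    return [e for lid in meta.ledger_ids for e in bk.ledgers[lid]]


bk = BookKeeper()
meta = TopicMetadata()
assign_topic(meta, "broker-1", bk)
write(meta, bk, "a")
write(meta, bk, "b")
assign_topic(meta, "broker-2", bk)  # topic moves; ledger 0 is untouched
write(meta, bk, "c")
print(meta.owner, meta.ledger_ids, read_all(meta, bk))
```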
Now, as far as the cache goes, that is primarily for "tailing reads", meaning that as writes occur, clients that are close to the tip of the recent data will just get it from the broker, without needing to pull it from BookKeeper. This is one of the key parts of how Pulsar has multiple tiers of storage that help it have such good, consistent latency.
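A small sketch of the tailing-read idea, again as a toy model rather than Pulsar's real broker cache. The `Broker` class, its `cache_size` parameter, and the plain list standing in for BookKeeper are assumptions made for illustration. Recent writes sit in a bounded in-memory cache, and only reads that fall behind the cached window go to storage.

```python
from collections import OrderedDict


class Broker:
    """Toy broker: every write goes to storage and into a small recent-entry cache."""

    def __init__(self, bookkeeper, cache_size):
        self.bookkeeper = bookkeeper      # list standing in for durable storage
        self.cache = OrderedDict()        # offset -> entry, newest writes only
        self.cache_size = cache_size
        self.bookkeeper_reads = 0         # counts reads that missed the cache

    def write(self, entry):
        offset = len(self.bookkeeper)
        self.bookkeeper.append(entry)
        self.cache[offset] = entry
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the oldest cached entry
        return offset

    def read(self, offset):
        if offset in self.cache:
            return self.cache[offset]       # tailing read: served from memory
        self.bookkeeper_reads += 1
        return self.bookkeeper[offset]      # catch-up read: served from storage


broker = Broker(bookkeeper=[], cache_size=2)
for i in range(5):
    broker.write(f"e{i}")

print(broker.read(4), broker.bookkeeper_reads)  # consumer at the tip
print(broker.read(0), broker.bookkeeper_reads)  # consumer far behind
```

In this model a catch-up read does not populate the cache, which mirrors the "doesn't eagerly pull" behaviour described earlier in the comment.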
Beyond processing writes, the biggest thing brokers do is handle "tailing reads", i.e., clients consuming right near the tip of the topic; this is the cache referred to. That means that when a new broker is three purposes:
(copying this text from another comment of mine elsewhere)
Well, the Pulsar broker is (kinda) stateless, because it is essentially a caching layer in front of BookKeeper. But where's your data actually stored then? In BookKeeper bookies, which are stateful. Killing and replacing/restarting a BookKeeper node requires the same redistribution of data as required in Kafka’s case. (Additionally, BookKeeper needs a separate data recovery daemon to be run and operated, https://bookkeeper.apache.org/archives/docs/r4.4.0/bookieRec...)
So the comparison of 'Pulsar broker' vs. 'Kafka broker' is very misleading because, despite identical names, the respective brokers provide very different functionality. It's an apples-to-oranges comparison, like if you'd compare memcached (Pulsar broker) vs. Postgres (Kafka broker).
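To illustrate the point about replacing a bookie, here is a hedged toy sketch of what "redistribution of data" means in the stateful layer. The `rereplicate` function and the bookie names are hypothetical and do not reflect BookKeeper's actual recovery logic. Every ledger that had a copy on the failed node must be copied to some other node before it is fully replicated again.

```python
def rereplicate(ledger_replicas, failed, live_bookies):
    """Toy recovery planner (hypothetical, not BookKeeper's algorithm).

    ledger_replicas: dict of ledger id -> set of bookies holding a copy.
    Mutates the replica sets in place and returns the copies that had to be
    made, as (ledger_id, target_bookie) tuples.
    """
    copies = []
    for ledger_id, replicas in sorted(ledger_replicas.items()):
        if failed not in replicas:
            continue  # this ledger lost nothing
        replicas.discard(failed)
        # Any live bookie that does not already hold this ledger will do.
        candidates = sorted(b for b in live_bookies if b not in replicas)
        if not candidates:
            continue  # nowhere to copy to; ledger stays under-replicated
        target = candidates[0]
        replicas.add(target)
        copies.append((ledger_id, target))
    return copies


ledger_replicas = {
    1: {"bk1", "bk2"},
    2: {"bk2", "bk3"},
    3: {"bk1", "bk3"},
}
print(rereplicate(ledger_replicas, failed="bk2", live_bookies=["bk1", "bk3", "bk4"]))
```

Unlike the broker reassignment case, each tuple returned here stands for real data that has to cross the network.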
kevstev|5 years ago
addisonj|5 years ago
1. Handling writes
miguno|5 years ago
majidazimi|5 years ago