
Apache Pulsar is an open-source distributed pub-sub messaging system

587 points | Bender | 6 years ago | pulsar.apache.org

237 comments

[+] addisonj|6 years ago|reply
I just finished rolling out Pulsar to 8 AWS regions with geo-replication. Message rates are currently at about 50k msgs/sec, but we're still in the process of migrating many more applications. We run on top of Kubernetes (EKS).

It took about 5 months for our implementation, with a chunk of that work spent figuring out how to integrate our internal auth, as well as using HashiCorp Vault as a clean, automated way to get auth tokens for an AWS IAM role.

Overall, we are very pleased and the rest of the engineering org is very excited about it and planning to migrate most of our SQS and Kinesis apps.

Ask me anything in the thread and I'll try to answer questions. At some point we'll do a blog post on our experience.

[+] mattboyle|6 years ago|reply
We tried to adopt this but found the documentation very lacking, along with a severe lack of quality client libraries for our language of choice (Go). The "official" one had race conditions in the code as well as "TODO"s for key pieces littered throughout. There is another from Comcast, which is abandoned. We had a serious discussion about picking up ownership of the library or writing our own, but as a small startup we didn't feel we could do that and still develop the product. I'll continue to keep an eye on Pulsar, but for now Kafka is the clear go-to IMO. It's well documented, has great SaaS offerings (Confluent), and there are tons of books and training courses for it.
[+] ckdarby|6 years ago|reply
> found the documentation very lacking

Really? It is one of the few open source projects that we've felt has had modern documentation. How long ago was this?

> As a small startup

You'll spend more time and money on Kafka's OpEx than on picking up the client library for Pulsar.

[+] bsaul|6 years ago|reply
Sidenote question:

Are we heading toward a split between Apache/Java/ZooKeeper stacks on one side and Go/etcd stacks on the other? I've seen an issue related to that question on Pulsar, and it got me investigating the distributed KV part of the stack.

Looking at some benchmarks, it seems etcd is much more performant than ZooKeeper, and to some people, operating two stacks seems like too high an operational maintenance cost. Is that a valid concern?

Also, I've seen that Kafka is working on removing its dependency on ZooKeeper; is Pulsar going to take the same road?

[+] alfalfasprout|6 years ago|reply
I keep seeing new message queue solutions pop up over the years and it's just been my impression at least that this is one area where silicon valley really is way behind the trading industry.

Reliable pub/sub that supports message rates over 100k/sec (even up into the millions) has been available for a while now, and with a great deal of efficiency (e.g. the Aeron project). The incredible amount of effort to support complex partitioning, extreme fault tolerance (instead of more clever recovery logic), etc. adds a lot of overhead, to the point where "low latency" means overhead on the order of 5ms instead of the microseconds or even nanoseconds expected in trading.

Worse, many startups try to adopt these technologies when their message rates are minuscule. To give you some context, even two beefy machines with an older message queue solution like ZeroMQ can sustain throughput in excess of what most companies produce.

This is not to discredit the authors of Pulsar or Kafka at all... but it's a concerning trend that easy-to-use horizontally scalable message queues are being deployed everywhere, similar to how everyone was running Hadoop a few years back even when the data fit in memory.

[+] zackmorris|6 years ago|reply
This looks promising. Is there such thing as a generalized SQL query engine that runs over any key-value store that provides certain minimal core operations?

For example, say you have a KV Store with basic mathematical Set operations like GET, SET, UNION, INTERSECT, EXCEPT, etc. The Engine would parse the SQL and then call the low-level KV Store Set operations, returning the result or updating KV pairs. This explains how Join relates to Set operations:

https://blog.jooq.org/2015/10/06/you-probably-dont-use-sql-i...
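A toy sketch of what such a layering could look like: a hypothetical `SetEngine` lowers SQL-style set operations onto a minimal KV store that only exposes get/set plus a key listing. All class and method names here are invented for illustration, not any real engine's API:

```python
class KVStore:
    """Minimal in-memory KV backend standing in for Redis etc."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)
    def keys(self):
        return set(self.data)

class SetEngine:
    """Evaluates set expressions over key subsets of a KVStore."""
    def __init__(self, store):
        self.store = store
    def select(self, predicate):
        # The "WHERE" clause: filter keys by a predicate on the value.
        return {k for k in self.store.keys() if predicate(self.store.get(k))}
    def union(self, a, b):
        return a | b
    def intersect(self, a, b):
        return a & b
    def except_(self, a, b):
        return a - b

store = KVStore()
for row_id, name in [(123, "Bob"), (345, "Jane"), (234, "Zack")]:
    store.set(row_id, {"name": name})

eng = SetEngine(store)
bobs = eng.select(lambda v: v["name"] == "Bob")
janes = eng.select(lambda v: v["name"] == "Jane")
print(sorted(eng.union(bobs, janes)))   # [123, 345]
```

A real query engine would of course parse SQL and plan before calling these primitives; the point is only that UNION/INTERSECT/EXCEPT map directly onto set operations over key subsets.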

Another thing I'd like is if KV stores exposed a general purpose functional programming language (maybe a LISP or a minimal stack-based language like PostScript) for running the same SQL Set operations without ugly syntax. I don't know the exact name for this. But if we had that, then we could build our own distributed databases, similar to Firebase but with a SQL interface as well, from KV stores like Pulsar. I'm thinking something similar to RethinkDB but with a more distinct/open separation of layers.

The hard part would be around transactions and row locking. A slightly related question is if anyone has ever made a lock-free KV store with Set operations using something like atomic compare-and-swap (CAS) operations. There might be a way to leave requests "open" until the CAS has been fully committed. Not sure if this applies to ledger/log based databases since the transaction might already be deterministic as long as the servers have exact copies of the same query log.
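As a toy illustration of the CAS idea: each key carries a version, and a write only commits if the version is unchanged, so writers retry instead of holding row locks. The `CASStore` class is hypothetical, and the internal lock merely models an atomic CAS primitive (so this sketch is not itself lock-free):

```python
import threading

class CASStore:
    def __init__(self):
        self._data = {}                 # key -> (version, value)
        self._lock = threading.Lock()   # models the atomic CAS primitive

    def get(self, key):
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        """Atomically write only if no one else committed in between."""
        with self._lock:
            version, _ = self._data.get(key, (0, None))
            if version != expected_version:
                return False            # lost the race; caller must retry
            self._data[key] = (version + 1, new_value)
            return True

def increment(store, key):
    # Retry loop: the request stays "open" until the CAS commits.
    while True:
        version, value = store.get(key)
        if store.compare_and_set(key, version, (value or 0) + 1):
            return

store = CASStore()
for _ in range(100):
    increment(store, "counter")
print(store.get("counter"))  # (100, 100)
```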

Edit: I wrote this thinking of something like Redis, but maybe Pulsar is only the message component and not a store. So the layering might look like: [Pulsar][KV Store (like Redis)][minimal Set operations][SQL query engine].

[+] manigandham|6 years ago|reply
Spark [1], Presto [2], and Drill [3] can all do that with connectors to different data sources and varying support for advanced SQL.

Pulsar has support for Presto: https://pulsar.apache.org/docs/en/sql-overview/

Pulsar isn't a KV store though, it's a distributed log/messaging system that supports a "key" for each message that can be used when scanning or compacting a stream. GET and SET aren't individual operations but rather scans through a stream or publishing a new message.

If you just want to have a SQL interface to KV stores or messaging systems that support a message key then Apache Calcite [4] can be used as a query parser and planner. There are examples of it being used for Kafka [5].

1. https://spark.apache.org/

2. https://prestodb.io/

3. https://drill.apache.org/

4. https://calcite.apache.org/

5. https://github.com/rayokota/kareldb

[+] atombender|6 years ago|reply
One of the challenges with layering SQL on top of a KV store is query performance.

The most obvious way to model a secondary index on top of a pure KV store is to map indexed values to keys. For example, given the (rowID, name) tuples (123, "Bob"), (345, "Jane"), (234, "Zack"), you can store these as keys:

  name:Bob:123
  name:Jane:345
  name:Zack:234
At this point you don't need or even want values, so this is effectively a sorted set.

Now you can easily find the rowID of Jane by doing a key scan for "name:Jane:", which should be efficient in a KV store that supports key range scans. You can do prefix searches this way ("name:Jane" finds all names starting with "Jane"), as well as ordinal constraints ("age > 32"), which requires that the age index be encoded to something like:

  age:Bob:\x00\x00\x00\x20:123
To perform a union ("name = 'Bob' OR name = 'Jane'"), you simply do multiple range scans, performing a merge-sort-style union as you go. To perform an intersection ("name = 'Bob' AND age > 10"), you find the starting point for each term and use that as the key range, then do the merge sort.
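A minimal sketch of that index encoding and the range-scan lookups, using a sorted Python list plus `bisect` as a stand-in for a KV store with ordered key scans (the merge step is approximated here with set operations):

```python
import bisect

# Index keys encode "name:<value>:<rowID>", kept in sorted order.
index = sorted([
    "name:Bob:123",
    "name:Jane:345",
    "name:Zack:234",
])

def range_scan(keys, prefix):
    """Return all keys starting with `prefix` via two binary searches."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return keys[lo:hi]

def row_ids(keys):
    """Strip the rowID suffix off each index key."""
    return {k.rsplit(":", 1)[1] for k in keys}

# Point lookup: name = 'Jane'
print(row_ids(range_scan(index, "name:Jane:")))        # {'345'}

# Union: name = 'Bob' OR name = 'Jane' -> two scans, merged
union = row_ids(range_scan(index, "name:Bob:")) | \
        row_ids(range_scan(index, "name:Jane:"))
print(sorted(union))                                    # ['123', '345']
```

A real implementation would use an order-preserving byte encoding (as in the age example above) rather than string concatenation, so that numeric comparisons sort correctly.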

This is what TiDB and FoundationDB's Record Layer do; both have a strict separation between the stateless database layer and the stateful KV layer.

The performance bottleneck will be the network layer. Your range scans will stream a lot of data from the KV store to the SQL layer, and you'll potentially read a lot of data that is then discarded by higher-level query operators. This is why TiKV has "coprocessor" logic in the KV store that knows how to do things like filtering; when TiDB plans your query, it pushes some query operators down to TiKV itself for performance.

Unfortunately, this is not possible with FoundationDB. This is why FoundationDB's authors recommend you co-locate FDB with your application on the same machine. But since FDB key ranges are distributed, there's no way to actually bring the query code close to the data (as far as I know!).

I'm sure you could do something similar with Redis and Lua scripting, i.e. building query operators as Lua scripts that work on sorted sets. I wouldn't trust Redis as a primary data store, but it can be a fast secondary index.

[+] PaulHoule|6 years ago|reply
E.g., how about a complex event processing engine? Something like that will do a lot of the above, but the inference database stays manageable since old data falls out of the windows.
[+] samzer|6 years ago|reply
For a moment I thought Bajaj and TVS came together.
[+] lonesword|6 years ago|reply
Captain here.

TVS and Bajaj are major motorbike manufacturers in India, and TVS had a model named "Apache" and Bajaj had a model named "Pulsar".

Flies away

[+] mbostleman|6 years ago|reply
This might be entirely off topic, but I'm having issues with RabbitMQ whereby durability suffers because messages are sent to remote hosts, exposing them to both network and remote-host availability. On a previous platform I used an MSMQ-based system which didn't have this problem, since it uses a local store-and-forward service: all sends go to localhost and are not affected by the network or receiver availability. The MSMQ system was my first and only experience with messaging up to now, so I was surprised that any system would not work that way. How is this dealt with in other systems? Is it just a feature that exists or not, and you decide if it's important? And, just to shoehorn this back on topic, does Pulsar use a local service?
[+] manigandham|6 years ago|reply
That's an inherent issue with distributed solutions and is impossible to solve. The only way to deal with it is using various techniques like acknowledgements, retries, local storage, idempotency, etc. MSMQ handles that stuff behind the scenes but the problem itself will always exist if there's a network boundary.

These other systems are designed to be remote with a network interface. You can use the client drivers to handle acknowledgements/retries/local-buffering in your own app or use something like Logstash [1], FluentD [2], or Vector [3] for message forwarding if you want a local agent to send to. You might have to wire up several connectors since none of them forward directly to Pulsar today.
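The store-and-forward pattern itself is simple to sketch: the application always enqueues to a local buffer, and a forwarder drains it toward the remote broker with retries. Everything here (`StoreAndForward`, `send_remote`) is hypothetical illustration, not any real client's API, and a production version would persist the buffer to disk:

```python
from collections import deque

class StoreAndForward:
    def __init__(self, send_remote, max_attempts=3):
        self.buffer = deque()        # would be disk-backed in a real system
        self.send_remote = send_remote
        self.max_attempts = max_attempts

    def publish(self, message):
        # Local enqueue never touches the network, so it can't fail
        # due to remote-host or network unavailability.
        self.buffer.append(message)

    def drain(self):
        """Forward buffered messages, retrying each a few times."""
        delivered, failed = [], []
        while self.buffer:
            msg = self.buffer.popleft()
            for _attempt in range(self.max_attempts):
                if self.send_remote(msg):   # True == broker acked
                    delivered.append(msg)
                    break
            else:
                failed.append(msg)          # give up after max_attempts
        return delivered, failed

# Flaky remote: fails on the first attempt for each message, then acks.
attempts = {}
def flaky_send(msg):
    attempts[msg] = attempts.get(msg, 0) + 1
    return attempts[msg] > 1

saf = StoreAndForward(flaky_send)
for m in ["a", "b", "c"]:
    saf.publish(m)
print(saf.drain())  # (['a', 'b', 'c'], [])
```

This is essentially what MSMQ's local service, or agents like Logstash/FluentD/Vector, do on your behalf.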

Also RabbitMQ is absolute crap. There are better options for every scenario so I advise using something else like Redis, NATS, Kafka, or Pulsar.

1. https://www.elastic.co/products/logstash

2. https://www.fluentd.org/

3. https://vector.dev/

[+] bauerd|6 years ago|reply
You're free to have queue and workers run on the same machine, just bind to loopback. As soon as you deal with more than one machine, which is required in HA scenarios, you deal with a networked (distributed) system. I might not have understood your question correctly though ...

Edit: Maybe you're looking for acks/confirms? https://www.rabbitmq.com/confirms.html

[+] NicoJuicy|6 years ago|reply
Look up : outbox pattern
[+] firstposterone|6 years ago|reply
Splunk just acquired Streamlio and most of the core devs got sucked up. While Pulsar is a great product, are you not concerned that these guys are now getting paid big money to do something else?
[+] matteomerli|6 years ago|reply
spoiler: we're still working on Pulsar
[+] zhaijia|6 years ago|reply
There is another company StreamNative, founded by core Pulsar/BookKeeper devs 1 year ago. :)
[+] eerrt|6 years ago|reply
How does this compare to Redis Pub-Sub or RabbitMQ?
[+] sz4kerto|6 years ago|reply
Very different. Pulsar is primarily a Kafka competitor.

- It is much more performant than RabbitMQ.

- It's a commit log as well, not just a pub-sub system, i.e. it is a good candidate as the storage backend for event sourcing.

- It supports geo-distributed and tiered storage (e.g. some data on NVMe drives, some in cold storage).

- It's persistent, not (primarily) in-memory.

.. and so on.

[+] Joeri|6 years ago|reply
It's closer to Redis Streams, except that, like Kafka, you can scale topics beyond the limits of a single server because they can be distributed. You couldn't run the Twitter firehose over Redis Streams, but you can run it over Pulsar or Kafka, given enough hardware.
[+] qaq|6 years ago|reply
This scales well to multi-datacenter deployments. It has strong multi-tenancy support; if memory serves, Yahoo is running a single cluster for all of their properties.
[+] xrd|6 years ago|reply
Or Firebase?
[+] reggieband|6 years ago|reply
I'm still on the fence with these distributed log/queue hybrids. From a theoretical perspective they seem excellent. I just have this nagging suspicion that architectures based on these systems will harbor some even worse problem. This kind of ambivalence is something I find myself battling more and more as my career goes on. Most of the time the hype around new design/development patterns leads to a worse situation; very rarely does it lead to a significant improvement. I dislike that my first impression looking at a system like this is risk aversion.
[+] throwawaysea|6 years ago|reply
A lot is said or referenced in this conversation about why people chose Pulsar over Kafka. I'm not an expert in this area but are there use cases where Kafka is still better?
[+] EdwardDiego|6 years ago|reply
As someone with a few years of Kafka and its ecosystem under my belt, but no experience using Pulsar in anger, the areas where I can see Pulsar being behind are mainly ancillary, and the community will likely catch up given a year or two.

Kafka Streams - Pulsar Functions don't intend (by the looks of it) to provide all of the functionality available in Kafka Streams, e.g. joining streams, treating a stream as a table (a log with a key can be treated as a table), or aggregating over a window. They seem more focused on doing X or Y to a given record. That said, you don't need Kafka Streams for that; other streaming products like Spark Streaming can do it too (although last I checked, Spark Structured Streaming still had some limitations compared to Kafka Streams: it can't do multiple aggregations on a stream, etc.). A use case where I love Kafka Streams' KTables is enriching/denormalising transaction records on their way through the pipeline.
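That stream-table join pattern can be sketched without any streaming library: a keyed changelog is materialized into a table (last write per key wins), and each transaction is enriched with the latest value for its key. The field names (`customer_id` etc.) are made up for illustration:

```python
def materialize(changelog):
    """Fold a keyed changelog into a table: last write per key wins."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table

def enrich(transactions, table):
    """Join each transaction against the table by customer_id."""
    for txn in transactions:
        customer = table.get(txn["customer_id"], {})
        yield {**txn, "customer_name": customer.get("name", "unknown")}

customers = [("c1", {"name": "Acme"}), ("c2", {"name": "Globex"}),
             ("c1", {"name": "Acme Corp"})]   # the later update wins
txns = [{"customer_id": "c1", "amount": 10},
        {"customer_id": "c2", "amount": 20}]

table = materialize(customers)
for row in enrich(txns, table):
    print(row)
# {'customer_id': 'c1', 'amount': 10, 'customer_name': 'Acme Corp'}
# {'customer_id': 'c2', 'amount': 20, 'customer_name': 'Globex'}
```

What Kafka Streams adds on top of this sketch is making the table fault-tolerant, partitioned, and continuously updated as the changelog topic grows.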

Kafka Connect - Pulsar IO will get there with time, but currently KC has a lot more connectors. For example, Pulsar IO's Debezium source is nice (Debezium was built to use Kafka Connect, but can run embedded), but you may not want to publish an entire change-event stream onto a topic; you might just want a view of a given database table available, so KC's JDBC connector is a lot more flexible in that regard, and Pulsar IO currently doesn't have a JDBC source. It also looks like its JDBC sink only supports MySQL and SQLite (according to the docs), while KC's JDBC sink supports a wider range of DBs and can take advantage of functionality like Postgres 9.5+'s UPSERT. Likewise, there are no S3 sources or sinks; the tiered storage Pulsar offers is really nice, but you may only want to persist a particular topic to S3.

KSQL - KSQL lets your BAs etc. write SQL queries for streams. That said, I do like Pulsar SQL's ability to query your stored log data. When I've needed to do this with Kafka, I've had to consume the data with a Spark job, which adds overhead for troubleshooting.

So yeah, that's the main areas I can see, but it's really a function of time until Pulsar or community code develops similar features.

The only other major difference I can see is that at the current time, it's comprised of three distributed systems (Pulsar brokers, Zookeeper, Bookkeeper) which is one more distributed system to maintain with all the fun that entails.

That said, I'll be keeping my eye on this, and trialling it when I get some spare time, as I've found that people will inevitably use Kafka like a messaging queue, and that is a bit clunky. Plus I'm a little over having people ask me how many partitions they need :D

[+] dikei|6 years ago|reply
From reading their documents, I really like the design of Pulsar. However, Kafka has been working so well for us and has much better integration with other components of our stack (Flink, Spark, NiFi, etc) that there's no compelling reason to switch.

I think Pulsar should really focus on the integration with the rest of the Apache stack if they want to gain traction.

[+] eatonphil|6 years ago|reply
Most of the comments are just pro-Pulsar but what's the architectural trade-off? (Non-architectural trade-off is that Pulsar is a new system to learn for folks familiar with maintaining and using Kafka.)
[+] manigandham|6 years ago|reply
Pulsar is better designed than Kafka in every way with the main trade-off being more moving pieces. That's why the recommended deployment is Kubernetes which can manage all that complexity for you.

Pulsar also lags behind in the size of its community and ecosystem, where Kafka has much more available.

[+] qeternity|6 years ago|reply
Someone linked some benchmarks here (on mobile and can't find them) showing that single-node Kafka outperforms Pulsar, but as soon as you start scaling, Pulsar pulls ahead. I'm not familiar enough with the nitty-gritty to comment beyond that.
[+] bovermyer|6 years ago|reply
How does this compare with NATS?
[+] sqreept|6 years ago|reply
NATS is a simpler PUB/SUB system that delivers in the UNIX spirit of small composable parts. Apache Pulsar or Apache Kafka deliver the banana, the ape holding it and the rest of the jungle.
[+] manigandham|6 years ago|reply
NATS is ephemeral pub/sub only. There is no persistence or replay; it focuses on high performance and messaging patterns like request/reply.

Kafka and Pulsar persist every message and different consumers can replay the stream from their own positions. Pulsar also supports ephemeral pub/sub like NATS with a lot more advanced features.

NATS does have the NATS Streaming project for persistence and replay, but it has scalability issues. They're working on a new project called JetStream to replace it in the future.

[+] barbarbar|6 years ago|reply
How does it compare to Kafka?
[+] manigandham|6 years ago|reply
Separates storage from brokers for better scaling and performance. Millions of topics without a problem and built-in tenant/namespace/topic hierarchy. Kubernetes-native. Per-message acknowledgement instead of just an offset. Ephemeral pub/sub or persistent data. Built-in functions/lambda platform. Long-term/tiered storage into S3/object storage. Geo-replication across clusters.
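The per-message acknowledgement point can be made concrete with a toy comparison (hypothetical classes, not either system's real client API): an offset-based consumer can only mark a contiguous prefix of the log as done, while per-message acknowledgement can leave arbitrary messages in flight:

```python
class OffsetConsumer:
    """Kafka-style: one committed offset; everything before it is done."""
    def __init__(self):
        self.committed = 0
    def ack_up_to(self, offset):
        self.committed = max(self.committed, offset + 1)
    def pending(self, log_len):
        return list(range(self.committed, log_len))

class PerMessageConsumer:
    """Pulsar-style: each message is acknowledged individually."""
    def __init__(self):
        self.acked = set()
    def ack(self, msg_id):
        self.acked.add(msg_id)
    def pending(self, log_len):
        return [i for i in range(log_len) if i not in self.acked]

log_len = 5   # messages 0..4

off = OffsetConsumer()
off.ack_up_to(2)             # can only say "0 through 2 are done"
print(off.pending(log_len))  # [3, 4]

per = PerMessageConsumer()
per.ack(0); per.ack(2); per.ack(4)   # messages 1 and 3 still in flight
print(per.pending(log_len))          # [1, 3]
```

The practical consequence: with offsets, one slow message holds back the commit point for everything behind it; with per-message acks, the broker can redeliver just the unacknowledged messages.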
[+] mickster99|6 years ago|reply
Is there a reason you went with Pulsar over Kafka? How is the pulsar community? Where are you turning when you have support issues?