item 14456513


TheHydroImpulse | 8 years ago

Engineer @ Segment

NSQ has served us pretty well, but long-term persistence has been a major concern for us. If any of our NSQ nodes goes down, it's a big problem.

Kafka has been far more complicated to operate in production, and developing against it requires more thought than NSQ (where you can just consume from a topic/channel, ack the message, and be done). On top of that, with NSQ, if you want more capacity you can just scale up your services and be done. With Kafka we had to plan how many partitions we needed, and autoscaling has become a bit trickier.
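To make the partition-planning point concrete: a Kafka consumer group splits partitions among its members, so the partition count caps your parallelism, whereas any new NSQ consumer on a channel immediately shares the load. A minimal sketch of how a keyed producer picks a partition (illustrative only; real clients use murmur2 rather than FNV, but the consequence is the same):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor mimics how a keyed producer assigns a message to a
// partition: hash the key, take it modulo the partition count.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// With 4 partitions, at most 4 consumers in a group do useful
	// work; a 5th consumer just idles. That's why the partition
	// count has to be planned up front, unlike scaling NSQ readers.
	for _, key := range []string{"user-1", "user-2", "user-3"} {
		fmt.Printf("%s -> partition %d\n", key, partitionFor(key, 4))
	}
}
```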

We now have critical services running against Kafka and started moving our whole pipeline to it as well. It's a slow process but we're getting there.

We've had to build some tooling to operate Kafka and ramp up everyone else on how to use it. To be fair, we've also had to build tooling for NSQ, specifically nsq-lookup to allow us to scale up.

We have an nsq-go library that we use in production along with some tooling: https://github.com/segmentio/nsq-go


doh | 8 years ago

Have you ever looked at any proprietary solutions like Google's PubSub? We've been running on PubSub for over a year now, and outside of some unplanned downtime it's scaling very well. But as we're looking to branch out of GCP, we're considering Kafka as an alternative.

Could you comment on particular problems and challenges that you ran into?

For context, we're currently sending around 60k messages/sec, and around 1k of them contain payloads larger than 10kb.

TheHydroImpulse | 8 years ago

The biggest issue with PubSub and Amazon's alternative is cost. Paying a per-message price would be a no-go at our volume.

If you can get away with using PubSub or the like, it would be far easier than managing your own Kafka deployment (correctly).

If data loss is unacceptable, then Kafka is basically the only open-source solution known for not losing data (if operated correctly, of course). NSQ was great but lacked durability and replication. With Kafka we can guarantee that two or more brokers have persisted a message before moving on. With NSQ, if one of our instances died it was a big problem.
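For reference, the "two or more brokers persisted the message" guarantee maps to a couple of standard Kafka settings. This is a sketch, not Segment's actual config; the topic name and counts are illustrative:

```properties
# Producer side: wait until all in-sync replicas acknowledge the write.
acks=all

# Broker/topic side: refuse writes unless at least 2 replicas are in sync.
min.insync.replicas=2

# The topic itself is created with replication factor 3, so one broker
# can be down and writes still satisfy min.insync.replicas=2:
#   kafka-topics.sh --create --topic events \
#     --partitions 12 --replication-factor 3
```

With this combination a produce request only succeeds once at least two brokers have the message, which is the durability property NSQ couldn't give us.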

Managing Kafka in a cloud environment hasn't been easy; it has required a lot of investment, and we have yet to move everything over to it.

molszanski | 8 years ago

Do you have anything to say about nats.io?

doh | 8 years ago

We're using NATS for synchronous communication, sending around 10k messages per second through it.

Must say the stability is great, even with larger payloads (over 10MB in size). We've been running it in production for a couple of weeks now and haven't had any issues. The main limitation is that there's no federation or large-scale clustering: you can have a pretty robust cluster, but each node can only forward a message once, which is limiting.
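For anyone curious what that clustering looks like: a minimal nats-server cluster config is just a list of routes to peer servers (the addresses below are made up). Because servers form a full mesh and a message is forwarded at most one hop between servers, you can't chain clusters together into anything federated:

```conf
# Minimal NATS server cluster config (illustrative addresses).
port: 4222

cluster {
  port: 6222
  routes: [
    nats-route://10.0.0.2:6222
    nats-route://10.0.0.3:6222
  ]
}
```

Every server needs a route to every other (directly or via gossip), which is why the comment above about "each node can only forward once" translates into a practical ceiling on cluster topology.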

ah- | 8 years ago

Out of interest, what kind of tooling did you build for Kafka?

TheHydroImpulse | 8 years ago

We started out deploying our Kafka cluster as a set of N EC2 instances, but we ran into a bunch of issues (rolling the cluster, rolling an instance without moving partitions around, moving partitions around, etc.).

Now we run Kafka through ECS and wrote some tooling to manage rolling the cluster and replacing brokers. krollout(1) (currently private) basically prevents partitions from becoming unavailable while rolling.
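krollout is private, but the primitive Kafka itself exposes for this kind of thing is the partition reassignment tool: you hand it a JSON plan that moves replicas off the broker you're about to roll, apply it, and wait for it to verify before taking the broker down. A hypothetical plan (topic name and broker ids are made up) moving partitions off broker 1:

```json
{
  "version": 1,
  "partitions": [
    {"topic": "events", "partition": 0, "replicas": [2, 3, 4]},
    {"topic": "events", "partition": 1, "replicas": [3, 4, 5]}
  ]
}
```

Applied with `kafka-reassign-partitions.sh --reassignment-json-file plan.json --execute`, then checked with `--verify`; tooling like krollout essentially automates generating and sequencing these plans so no partition drops below its minimum in-sync replica count mid-roll.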

Now that multiple teams are using Kafka, we've started exploring how to scale up. Each team may have different requirements, and isolation can become an issue. More tooling will likely need to be built around this.