Heroku Kafka

276 points| sixwing | 9 years ago |heroku.com

116 comments

mbseid|9 years ago

As a former user of Kafka, this is awesome and it would have been a huge help for our company if this had been available then. I'm glad to hear that a company is offering Kafka as opposed to proprietary alternatives (AWS Kinesis, etc.).

One thing is odd though: there is no mention of disk space at all, only a configuration of retention time. One of Kafka's best features is the use of disk to store large amounts of messages, so you are not RAM bound. Heroku seems to only allow you to set retention times? This could be awesome if they are giving you "unlimited" disk space, but could also be a beta oversight. Interested to see how this progresses.

uhoh-itsmaciek|9 years ago

Hi, I'm Maciek and I work on the Heroku Kafka team. You don't have to think about disk space--it's on us to make sure there's enough to satisfy the retention settings you configure. We're excited to provide another great open-source project as a managed service!

ktamura|9 years ago

Don't forget that Heroku is the original multi-tenant shop. I wouldn't be surprised if a single Kafka instance stores multiple customers' messages and scales elastically as more customers and data are added.

jonahx|9 years ago

> What is Kafka?

> Apache Kafka is a distributed commit log for fast, fault-tolerant communication between producers and consumers using message based topics. Kafka provides the messaging backbone for building a new generation of distributed applications capable of handling billions of events and millions of transactions

Can anyone translate this into meaningful English for me?

superuser2|9 years ago

You send a message (for example some JSON) to a Kafka topic. Any number of clients subscribe to that topic with a specific start time-stamp. Pluck a message off the queue, compute with it, send an acknowledgement. Kafka provides strong assurances that all readers get all the messages and report success (it retries otherwise), even if some participants come and go.

Very useful if, say, you have some real world event and dozens of different micro services need to do something about that event, independently.

You can also just use it for logging.

manigandham|9 years ago

Kafka is a distributed logging system. Write lots of data very fast by using sequential I/O. Consuming apps can then read this just as fast and maintain their own state (of where they last read up to) which allows for multiple fast and simple consumers and an easy way to have a lasting "log" of all the data.

tschellenbach|9 years ago

It's a message queue. You use it for everything you want to do outside of the general request cycle, e.g. making API calls, priming caches, sending emails, etc.

The biggest competitors of Kafka are RabbitMQ and Amazon SQS.

amock|9 years ago

It's a distributed message queue.

franciscop|9 years ago

I love Heroku and everything they are doing, it's doubtless a push forward for the web as a whole. However, the pricing for hobby sites (including SSL) is crazy from a personal point of view so I'm slowly moving my projects out of it [1][2]. I wish they had some kind of "Hobby Bundle".

[1] http://umbrellajs.com/ [2] http://picnicss.com/

redtuesday|9 years ago

Check out Red Hat's OpenShift Online [1] if you haven't already. They offer 3 gears for free, each with 512 MB RAM and 1 GB disk space (install e.g. Postgres into one gear and you have a DB with 1 GB), and you can use Let's Encrypt with their Bronze plan (which is free if you only use the 3 free gears). Depending on what your hobby sites do, this could be enough.

[1] https://www.openshift.com/pricing/

colinbartlett|9 years ago

I am a longtime Heroku customer on an Enterprise service plan at the moment. I brought this up to my sales rep last time he checked in.

I explained how we moved a bunch of smaller sites to S3, reluctantly, because I really like having a unified platform for all our sites. But even though (or perhaps precisely because) we are spending thousands of dollars a month with Heroku, I find the $20/month SSL charge insulting. SSL is not optional anymore.

The good news is, the sales rep said this has come up a lot, they hear us, and to "stay tuned".

sudhirj|9 years ago

Their pricing for hobby sites is $7 + $10 for the DB, which is very comparable with a self-managed IaaS setup like DO or AWS. Personally I think the developer experience is much better on Heroku and quite worth it.

SSL is a pain point, though I do empathize with them - I think they're doing something expensive for that. What I do is to use AWS Cloudfront and ACM for a free cert and site speedup - if they are personal projects the CF bill ought to be in the low few dollars anyway.

mintplant|9 years ago

It looks like those are static sites. Why host them on Heroku? You could stick them on GitHub Pages [1] for free.

[1] https://pages.github.com/

njudah|9 years ago

There will be news on this front soon; stay tuned.

mtw|9 years ago

I host most of my sites on GitHub Pages (free) and Ruby / Go / Elixir / Node.js projects on DigitalOcean. I also host various logging/analytics/mailing services on DigitalOcean without paying for SaaS services. The total is very reasonable.

desireco42|9 years ago

And what awesome projects those are. Thank you!

mateuszf|9 years ago

If you have multiple applications, though, it's possible to use one SSL terminator per domain.

tlrobinson|9 years ago

One option for HTTPS is to just stick CloudFlare in front of it.

cachemiss|9 years ago

Kudos to Heroku. As someone who has had to make Kafka into a managed service, I know what a pain it is (I'm not a Kafka fan for a lot of reasons) to administer in a cloud environment.

sethammons|9 years ago

Would love to hear what you don't care for in Kafka and what alternative solution(s) you prefer.

ChartsNGraffs|9 years ago

For anyone wanting to play with Kafka, Spotify's Kafka container was an invaluable resource for getting me up and running with Kafka. All the Zookeeper dependencies are taken care of allowing you to just start playing with Kafka right away. https://github.com/spotify/docker-kafka https://hub.docker.com/r/spotify/kafka/

Jarmo|9 years ago

I never tried Spotify's container. I tried wurstmeister's and was able to run it on a single server for testing purposes, but kept running into issues while clustering across different servers. I decided to use Ambari and have it do all the work for me instead.

manigandham|9 years ago

This will be interesting to try out. I've used all the major cloud event/logging systems (Kinesis, Azure EventHubs, etc) and so far Google PubSub is the best in features and performance.

The only downside with Google PubSub can be latency (which I'm working on fixing by building a gRPC driver), but Kafka has proven to be too complicated to maintain in-house. If Heroku can provide the speed without the ops overhead, it'll be some good competition to Google's option.

Also want to note that Jay Kreps who helped build Kafka at LinkedIn is now behind http://www.confluent.io/ which is like a better/enterprise version of Kafka.

alexatkeplar|9 years ago

Not sure why you are comparing Google Cloud Pub/Sub to Kinesis - the former is a MQ system, not a distributed commit log.

When creating a Kinesis consumer, I can specify whether I want to start reading a stream from a) TRIM_HORIZON (the earliest events in the stream which haven't yet expired, aka been "trimmed"), b) LATEST, which is the Cloud Pub/Sub capability, c) AT_SEQUENCE_NUMBER {x}, which means from the event in the stream with the given offset ID, d) AFTER_SEQUENCE_NUMBER {x}, which is the event immediately after c), or e) AT_TIMESTAMP, to read records from an arbitrary point in time.

A Kinesis stream (like a Kafka topic) is a very special form of database - it exists independently of any consumers. By contrast, with Google Cloud Pub/Sub [1]:

> When you create a subscription, the system establishes a sync point. That is, your subscriber is guaranteed to receive any message published after this point.

[1] https://cloud.google.com/pubsub/subscriber

So the stream is not a first class entity in Cloud Pub/Sub - it's just a consumer-tied message queue.
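To make the five starting positions concrete, here is a minimal sketch in Python. It is a toy model of a stream as a sorted list, not the AWS SDK: the `start_index` function and the `records` layout are invented for illustration, and only the iterator type names come from the comment above.

```python
from bisect import bisect_left, bisect_right


def start_index(records, iterator_type, sequence_number=None, timestamp=None):
    """Return the list index a new reader would start from.

    records: retained (untrimmed) entries as (sequence_number, timestamp, data)
    tuples, sorted by sequence number with non-decreasing timestamps.
    This is a hypothetical helper, not a real Kinesis API.
    """
    seqs = [r[0] for r in records]
    if iterator_type == "TRIM_HORIZON":
        return 0  # oldest record that has not been trimmed yet
    if iterator_type == "LATEST":
        return len(records)  # only records appended from now on
    if iterator_type == "AT_SEQUENCE_NUMBER":
        return bisect_left(seqs, sequence_number)  # the record itself
    if iterator_type == "AFTER_SEQUENCE_NUMBER":
        return bisect_right(seqs, sequence_number)  # the record right after it
    if iterator_type == "AT_TIMESTAMP":
        times = [r[1] for r in records]
        return bisect_left(times, timestamp)  # first record at or after the time
    raise ValueError(f"unknown iterator type: {iterator_type}")


records = [(100, 10.0, "a"), (101, 11.0, "b"), (102, 12.0, "c"), (103, 13.0, "d")]
replay_all = records[start_index(records, "TRIM_HORIZON"):]
from_102 = records[start_index(records, "AT_SEQUENCE_NUMBER", sequence_number=102):]
```

The point of the sketch is that the retained records exist whether or not anyone is reading, so any of these positions can be resolved for a consumer created later. A subscription that only sees messages published after its sync point corresponds to the LATEST branch alone.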

rtehfm|9 years ago

What are your thoughts on Kafka vs Flume?

andreasklinger|9 years ago

For those wondering (all imo and only best guess)

The biggest advantage of Kafka is that all of the Heroku marketplace all of a sudden becomes "plug and play".

Essentially it's the "backend data" equivalent of what Segment does for "frontend data".

Example: What's the benefit of having a graphDB service in the marketplace if most people don't want to / can't invest engineering in keeping the data in (realtime) sync?

With Kafka they can establish standards that all partners can adapt to; they will simply offer piping of all Heroku Postgres/Redis changes.

hmottestad|9 years ago

Does anyone know if Kafka has improved on their data loss issues since tested by Aphyr? https://aphyr.com/posts/293-jepsen-kafka

A quote from the article: "At the end of the run, Kafka typically acknowledges 98–100% of writes. However, half of those writes (all those made during the partition) are lost."

lars_francke|9 years ago

Yes, the suggestion discussed by Aphyr has been implemented. You can now set a lower bound on the ISR size (min.insync.replicas). Together with request.required.acks=-1 (acks=all in the newer producer) you can wait for a message to be committed to at least min.insync.replicas nodes.

https://issues.apache.org/jira/browse/KAFKA-1555
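A rough way to see why the lower bound matters is the toy simulation below. It is not Kafka code: `produce` and `NotEnoughReplicas` are made-up names, replicas are plain Python lists, and the only things taken from the comment above are the two settings (a minimum in-sync replica count and acks=-1).

```python
class NotEnoughReplicas(Exception):
    """Raised when the in-sync replica set is smaller than the configured minimum."""


def produce(record, in_sync_replicas, min_insync_replicas, acks):
    """Toy write path. in_sync_replicas[0] plays the leader.

    acks=1  -> acknowledge once the leader alone has the record.
    acks=-1 -> acknowledge only after every in-sync replica has it, and
               refuse the write if the ISR is below min_insync_replicas.
    Returns how many replicas hold the record at acknowledgement time.
    """
    if acks == -1 and len(in_sync_replicas) < min_insync_replicas:
        raise NotEnoughReplicas(
            f"ISR size {len(in_sync_replicas)} < minimum {min_insync_replicas}"
        )
    targets = in_sync_replicas if acks == -1 else in_sync_replicas[:1]
    for replica in targets:
        replica.append(record)
    return len(targets)


leader, follower_1, follower_2 = [], [], []

# Healthy cluster: the write lands on all three replicas before the ack.
acked_all = produce("a", [leader, follower_1, follower_2], 2, acks=-1)

# During a partition the ISR shrinks to just the leader.
isr = [leader]
acked_leader_only = produce("b", isr, 2, acks=1)  # acked, but on one node only

try:
    produce("c", isr, 2, acks=-1)
    rejected = False
except NotEnoughReplicas:
    rejected = True  # the producer sees an error instead of a false success
```

In this model, record "b" is the kind of write that can be acknowledged and later lost if the isolated leader fails, while "c" is rejected up front so the producer can retry.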

koolba|9 years ago

I've wondered why there isn't a "big player" in the cloud space for this. Felt like a hole.

My operating theory is that the people who would really make use of something like this have grown beyond managed offerings and would take it in house. For smaller operations Redis is more than enough for pub/sub. Ditto for SQS for externally triggered eventing.

bjt|9 years ago

> For smaller operations Redis is more than enough for pub/sub.

I didn't find that to be so at my last job, one of those smaller operations.

With Redis you're forced to pick between two severely constrained options:

1. Use PUBLISH/SUBSCRIBE. This is nice if you want to have several listeners all receive the same message. But if a listener is down, there's no way for it to recover a message that it missed. If there is no one listening, messages are just dropped.

2. Use LPUSH/BRPOP. This is nice if you want to have several workers all pulling from the same queue, but isn't sufficient if you want to have several queues streaming from the same topic. (E.g. one listener is responsible for syncing to ElasticSearch and another one is syncing to your analytics DB.)
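The difference between the two options can be sketched with a small in-memory model. This does not talk to Redis: `PubSubChannel` and `WorkQueue` are invented classes that only mimic the delivery semantics described above, and `rpop` here returns None on an empty queue instead of blocking the way BRPOP does.

```python
from collections import deque


class PubSubChannel:
    """Fan-out: every current subscriber gets each message; nothing is stored."""

    def __init__(self):
        self.subscribers = []  # each subscriber is a list used as an inbox

    def subscribe(self):
        inbox = []
        self.subscribers.append(inbox)
        return inbox

    def publish(self, message):
        for inbox in self.subscribers:
            inbox.append(message)
        return len(self.subscribers)  # number of receivers, as PUBLISH reports


class WorkQueue:
    """List as a queue: a message is kept until exactly one worker pops it."""

    def __init__(self):
        self.items = deque()

    def lpush(self, message):
        self.items.appendleft(message)

    def rpop(self):
        return self.items.pop() if self.items else None


channel = PubSubChannel()
receivers_m1 = channel.publish("m1")  # nobody is listening, so m1 is dropped
search_inbox = channel.subscribe()
analytics_inbox = channel.subscribe()
receivers_m2 = channel.publish("m2")  # both listeners get their own copy

queue = WorkQueue()
queue.lpush("job1")
queue.lpush("job2")
worker_a = queue.rpop()  # each job goes to one worker only
worker_b = queue.rpop()
leftover = queue.rpop()  # nothing left for a second kind of consumer
```

Neither model gives both properties at once: the channel fans out but forgets, and the queue remembers but hands each message to a single consumer.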

I strongly prefer RabbitMQ. Its model of exchanges and queues supports mixing and matching these semantics much more flexibly.

rhodin|9 years ago

Not a "big player" (yet), but we've been offering Apache Kafka as a Service since June 2015: www.cloudkafka.com.

amock|9 years ago

Depending on what you mean by "this" there are offerings by the big players. Google has Cloud Pub/Sub and AWS has Kinesis in addition to SQS, so two of the big players do have offerings. I'm not familiar enough with Azure to know what it has.

plunchete|9 years ago

Is the pricing public?

neovintage|9 years ago

Not yet. We're working it out during our early access program. We'll be looking for lots of feedback from customers.

nodesocket|9 years ago

Can somebody provide a real-life use case for Kafka? I've seen comparisons with Redis, but what specifically does Kafka solve that Redis cannot?

ChartsNGraffs|9 years ago

I'd say its biggest differentiator from a typical messaging system is the ability to rewind and reconsume messages. It's meant to offload a large volume of data quickly and then retain it for some time so that it can be processed later on. Data is published to topics, and it is entirely feasible to read from one (or more) topics, process that data and then publish the results to a different topic. In comparison to Redis, I would say that while they overlap, they're each better suited for different problems. Redis is blazing fast, but its parallelism/replication story isn't as great as Kafka's. Redis is a lot easier to get running though.

yolesaber|9 years ago

Let's say you have a CMS which pushes content to your site. You also want to make the whole site searchable, so you index your content into (e.g.) Elasticsearch. Kafka is great for this because you can put the content onto Kafka's message queue and then have a service reading from it which then puts it into Elasticsearch. It scales well, too. So let's say your site takes off and you have hundreds of articles published a day (not to mention updates, deletions, etc.): these events can all be sent to Kafka and it will maintain the order as well as still be fast. You can also have many, many services reading (consuming) from it simultaneously and it will handle it nicely.

Basically, if you want to get data from one place to another and care about order, Kafka is a good solution. It acts as a middleman between services.

manigandham|9 years ago

Kafka and Redis are very different things - see this: https://news.ycombinator.com/item?id=11577312

Redis is a database; Kafka is a data logging system built for scale and throughput. Event processing of any kind (stocks, ad impressions, ecommerce purchases) is a great fit. It's also good as a message queue unless you need ultra-low-latency RPC.

jbob2000|9 years ago

The comments in this thread are funny;

Hey, what is Kafka?

"It's a distributed logging system, not a message queue"

Ok, what's the use case?

describes a case where it's used as a message queue

tibbon|9 years ago

Kafka vs Redis. I've only used Redis... what should I know?

manigandham|9 years ago

Redis is an in-memory (with persistence) key-value database that also implements some basic structures like lists, sets and hashes natively.

Kafka is a distributed logging system that can ingest large amounts of data straight to disk, then allows for multiple consumers to read this data through a simple abstraction of topics and partitions. Consumers maintain their own position of where they last read up to (or re-read things if they want) and everything is sequential I/O which creates very high throughput.
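As a rough illustration of "consumers maintain their own position", here is a toy single-partition log in Python. It is an in-memory sketch, not a Kafka client: `TopicLog` and `Consumer` are hypothetical names, and real Kafka adds partitions, replication, retention and on-disk storage on top of this idea.

```python
class TopicLog:
    """Toy stand-in for one partition: an append-only list of records."""

    def __init__(self):
        self._records = []  # reading never removes anything from here

    def append(self, record):
        """Append a record and return its offset (its position in the log)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer tracks its own offset; the log knows nothing about it."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move the read position, for example back to re-read old records."""
        self.offset = offset


log = TopicLog()
for event in ["signup", "purchase", "refund"]:
    log.append(event)

indexer = Consumer(log)
analytics = Consumer(log)

first_pass = indexer.poll()   # the indexer reads everything available
partial = analytics.poll(2)   # analytics is slower and reads two records
indexer.seek(1)               # the indexer rewinds to offset 1
replayed = indexer.poll()     # and re-reads from there
```

Because reading is just slicing from an offset, adding another consumer or replaying old data costs the log nothing extra, which is the property a plain queue does not have.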

mtw|9 years ago

What kind of companies or startups usually use this service?

poooogles|9 years ago

It's pretty big in ad tech, or anywhere that really does lots and lots of centralised logging (Datadog/Loggly both use Kafka).

Lots of places also use it just as a message queue; some, for example, write time series metrics to Kafka for monitoring.

elcct|9 years ago

My impression of Kafka was that this thing is bloated. How does it compare to something like NSQ?

kasey_junk|9 years ago

It's a completely different use case. Many times people call Kafka a "message queue" but it's not. It's a distributed log service. It's possible to build a message queue on top of a distributed log service but there are reasons not to.

It's better to think of Kafka as a database for events, not as a transport mechanism for those events.

As for being bloated, Kafka lives in a very empty space; that is, it supports fully ordered events to all consumers (and it has good HA options). The only other tool that I've come across that gives you the same data guarantees is Kinesis, and it requires AWS.

I've found that yes, Kafka is complex, but it's complex because it's solving a complex problem, not because it's bloated.

That said, if you want a non-ordered message queue, use NSQ instead of Kafka.