top | item 29299625

almeria|4 years ago

Anything more about Kafka you can tell us?

Seriously, the one time I was in a situation where much of the team seemed hellbent on this "just put all in Kafka" idea (without really understanding why, exactly) the arguments they came up with were not too dissimilar from what you've shared with us above. It all seemed to come down to "OMG databases are hard, schemas are hard, our customers don't understand the data they're shoving at us. But Kafka will take care of all of that for us. Because, you know, shiny."

That said I'd still like to have a more ... balanced understanding of why Kafka may not necessarily be The Answer, and/or have more hidden complexity or other negative tradeoffs than we may have bargained for.

zwkrt|4 years ago

My take was humorous but it didn’t hide anything. Kafka was built so that LinkedIn could shove all its real-time click data through a single funnel: terabytes upon terabytes of it. It has since been evangelized and created a cottage industry of Confluent salespeople who will give your manager a course in how to lobby their engineers into using Kafka. Have scaling problems? Kafka. Have business events that need to be ordered? Kafka! Have “changing schemas”? KAFKA!! I’m always suspicious when a company gives a product away for free but then charges $$$ for “support”.

I worked for a high-profile, recently-failed project from a company that rhymes with Brillo, and our data was just beginning to be too big for Google Sheets (!). However, we were also having organizational problems because the higher-ups were seeing the failing project losing money, so they of course decided to hire 100 extra engineers. Our communications (both human and programmatic) were failing, and the Confluent salespeople began circling like buzzards. Of course, by the time it was suggested we use it, the project was already 6 months past the point of no return.

My advice is that if your data fits in a database, use a database. Anyone who says that isn’t scalable should have to tell you the actual reason it doesn’t scale and the number of requests/users/GBs/uptime/ etc that is the bottleneck.

EdwardDiego|4 years ago

To be very clear, Confluent doesn't give _their_ product away for free. Confluent Platform has many features that differ from Apache Kafka and never make it into upstream.

E.g., Confluent Replicator vs. Mirror Maker 2, Confluent Platform's tiered storage has been available for quite some time (right now a bunch of people from AirBnB are doing a stellar job bringing tiered storage to FOSS Kafka, I'm hoping 3.1 or 3.2).

Actually, the easiest way to see the differences is to grep all the subpages of this link for properties that start with "confluent":

https://docs.confluent.io/platform/current/installation/conf...

almeria|4 years ago

Thank you. I am now enlightened.

yongjik|4 years ago

I only had a brief exposure to it, but my impression is that it's sort of a message queue optimized for very large data (TB or more). So, for example, there's no way to easily answer questions like "How many requests did server X generate between 1pm and 2pm and how many of them were served by server Y?" because when your data doesn't fit in a single machine, supporting such queries requires a lot of bookkeeping. If you never need them, you don't want to pay for them.

Of course, when you have a few megabytes of data and you route it through Kafka, all you get is an opaque message queue where you can't see which message went from where to where. Good luck debugging any issues. But, hey, you got to use Kafka.

EdwardDiego|4 years ago

> for example, there's no way to easily answer questions like "How many requests did server X generate between 1pm and 2pm and how many of them were served by server Y?"

There's many ways to answer that using data streamed over Kafka - ingest it into your preferred query engine, go query it.

Kafka is a distributed log, that's it.
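To make the "ingest it, then query it" point concrete, here's a minimal sketch in Python with sqlite3. Everything here is illustrative: the records are hardcoded tuples, where in real life they'd come from a Kafka consumer, and the table and field names are made up.

```python
import sqlite3

# Hypothetical records consumed off a Kafka topic: (timestamp, origin, handler).
records = [
    ("2021-11-20 13:05", "X", "Y"),
    ("2021-11-20 13:45", "X", "Z"),
    ("2021-11-20 14:30", "X", "Y"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE requests (ts TEXT, origin TEXT, handler TEXT)")
db.executemany("INSERT INTO requests VALUES (?, ?, ?)", records)

# "How many requests did server X generate between 1pm and 2pm,
#  and how many of them were served by server Y?"
total, by_y = db.execute(
    "SELECT COUNT(*), SUM(CASE WHEN handler = 'Y' THEN 1 ELSE 0 END) "
    "FROM requests WHERE origin = 'X' "
    "AND ts BETWEEN '2021-11-20 13:00' AND '2021-11-20 14:00'"
).fetchone()
print(total, by_y)  # 2 1
```

The log itself never has to answer the question; any query engine downstream can.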

EamonnMR|4 years ago

It's nice when you have a bunch of discrete events that you want any number of clients to work on without interfering with each other. Think of it as a fire-and-forget pub-sub. You can always have a worker dumping the queue into a database for later if you want to. It's a bit cantankerous but once you get it running it can handle messages at an impressive scale. It isn't a magic bullet to replace databases, and you can actually add schemas to the data you put in it (and it's generally agreed to be a good idea to do so; it's on our wish list because it would save us from duplicating the schema on the consumer and producer ends.)

curryst|4 years ago

I've worked at some places that used Kafka (including LinkedIn), although I have never been responsible for running the platform itself. I'll chip in with what I see as the negatives.

Kafka sits at roughly the same tier as HTTP, but lacks a lot of the conventions we have around HTTP. Those conventions let people build generic tooling for any app that uses HTTP: visibility, metrics, logging, etc. Those are all things you effectively get for free with HTTP in most languages. Afaict, most of that doesn't exist for Kafka in a terribly helpful form. You can absolutely build something that will do distributed tracing for Kafka messages, but I'm not aware of a plug-and-play version like there is for HTTP in most languages.

The fact that Kafka messages are effectively stateless (in the UDP sense, not the application sense) also trips up a lot of people. If you want to publish a message, and you care what happens to that message downstream, things get complicated. I've seen people do RPC over event buses where they actually want a response back, and it became this complicated system of creating new topics so the host that sent the request would get the response back. Again, in HTTP land, you'd just slap a loadbalancer in front of the app and be done. HTTP is stateful, and lends itself to stateful connections.
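To make that concrete, here's a toy in-process sketch of the reply-topic pattern, with `queue.Queue` standing in for topics. Nothing here is a real Kafka API; the point is the correlation-id bookkeeping you have to do yourself, which a load balancer handles for you in HTTP land.

```python
import queue
import uuid

# Toy stand-ins for Kafka topics; all names are illustrative.
request_topic = queue.Queue()
reply_topics = {}  # one ephemeral "reply topic" per in-flight request

def send_rpc(payload):
    corr_id = str(uuid.uuid4())
    reply_topics[corr_id] = queue.Queue()
    request_topic.put({"correlation_id": corr_id, "payload": payload})
    return corr_id

def serve_one():
    msg = request_topic.get()
    # The responder routes the reply back via the correlation id --
    # bookkeeping that HTTP's request/response cycle gives you for free.
    reply_topics[msg["correlation_id"]].put(msg["payload"].upper())

corr_id = send_rpc("ping")
serve_one()
reply = reply_topics.pop(corr_id).get()
print(reply)  # PING
```

Now imagine the sender crashes before reading its reply topic, and you start to see where the complexity comes from.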

Another issue is that when you tell people they can adjust their schema more often, they tend to go nuts. Schemas start changing left and right, and suddenly you need a product to orchestrate these schema changes and ensure you're using the right parser for the right message. Schema validation starts to become a significant hurdle.
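A minimal sketch of what that validation bookkeeping looks like. The "registry" here is just a dict of required fields per version, purely illustrative; real setups use a schema registry product and a serialization format like Avro.

```python
# Toy "schema registry": required fields per schema version (illustrative only).
SCHEMAS = {
    1: {"user_id", "action"},
    2: {"user_id", "action", "source"},  # v2 added a field
}

def validate(message):
    version = message.get("schema_version")
    required = SCHEMAS.get(version)
    if required is None:
        raise ValueError(f"unknown schema version {version!r}")
    missing = required - message.keys()
    if missing:
        raise ValueError(f"v{version} message missing fields: {sorted(missing)}")

validate({"schema_version": 1, "user_id": 7, "action": "click"})  # ok
err = None
try:
    # A producer still emitting the v1 shape but claiming v2:
    validate({"schema_version": 2, "user_id": 7, "action": "click"})
except ValueError as e:
    err = str(e)
print(err)  # v2 message missing fields: ['source']
```

Multiply this by every topic and every team changing schemas independently and you get the hurdle I mean.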

It's also architecturally complicated to replace HTTP. An HTTP app can be just a single daemon, or a few daemons with a load balancer or two in front. Kafka is, at minimum, your app, a Kafka daemon, and a Zookeeper daemon (nb I'm not entirely sure Zookeeper is still required). You also have to deal with eventual consistency, which can make coding and reasoning about bugs dramatically harder than it needs to be. What happens when Kafka double-delivers a message?

My pitch is always that you shouldn't use Kafka unless it becomes architecturally simpler than the alternatives. There are problems for which Kafka is a better solution than HTTP, but they don't start with unstable schemas or databases being difficult. Huge volumes of data is a good reason to me; not being sure what your downstreams might be is another. There are probably more, I'm not an expert.

> our customers don't understand the data they're shoving at us. But Kafka will take care of all of that for us

Kafka isn't going to help with this at all. If your HTTP app can't parse it, neither will your Kafka app. Kafka does have the ability to do replays, but so does shoving the requests into S3 or a database for processing later. I promise you that "SELECT * FROM requests WHERE status='failed'" is drastically simpler than any Kafka alternative. It is neat that Kafka lets you "roll back time" like that, but you have to very carefully consider the prospect of re-processing the messages that already succeeded. It's very easy to get a bug where you end up with double entries in databases or other APIs because you're reprocessing a request.
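To make the double-entry risk concrete, here's a sketch with sqlite3 (made-up table and message shapes) of the fix: make the handler idempotent, keyed on a request id, so replaying the whole topic is safe.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (request_id TEXT PRIMARY KEY, amount REAL)")

def process(msg):
    # INSERT OR IGNORE keys on the primary key, so replaying a message
    # that already succeeded is a no-op instead of a duplicate row.
    db.execute("INSERT OR IGNORE INTO orders VALUES (?, ?)",
               (msg["request_id"], msg["amount"]))

messages = [{"request_id": "r1", "amount": 9.99},
            {"request_id": "r2", "amount": 5.00}]

for msg in messages:      # first pass succeeds
    process(msg)
for msg in messages:      # "roll back time" and replay the whole topic
    process(msg)

count = db.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2, not 4
```

If your downstream is someone else's API rather than your own table, you don't get an "OR IGNORE", and this gets much harder.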

EamonnMR|4 years ago

All very good points. What I like about Kafka is that you can queue up a bunch of messages without needing to be able to handle that load immediately. It lets you build very resistant patterns: if your message-senders overwhelm your message receivers in HTTP you can end up with connection failures, get stuck waiting, etc. In Kafka what happens is you now have a large backlog to work through, but at least your messages are somewhere accessible to you and not dropped on the floor.
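A toy illustration of the difference, with plain Python queues (purely illustrative; real backpressure in HTTP and real Kafka retention are both more nuanced than this):

```python
import queue

burst = [f"msg-{i}" for i in range(100)]

# HTTP-style: the receiver has fixed capacity; excess requests fail fast.
inbox = queue.Queue(maxsize=10)
dropped = 0
for msg in burst:
    try:
        inbox.put_nowait(msg)
    except queue.Full:
        dropped += 1
print(dropped)  # 90 requests hit the equivalent of connection failures

# Log-style: everything is appended; the consumer just has a backlog.
log = []
log.extend(burst)
print(len(log))  # 100, nothing dropped, work through it later
```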

HTTP definitely has the edge when it comes to library support. In fact, Confluent et al offer HTTP endpoints for Kafka so that you don't have to deal with the vagaries of actually connecting to a broker yourself (the default timeout in python for an unresponsive broker is _criminal_ for consumers. You will spend several minutes wondering when the message will arrive.) We use an in-house one. But that introduces HTTP's problems back into the process; you need to worry about overwhelming your endpoint again...

Regarding application patterns, ideally you're writing applications that read data from one topic (or receive messages, parse a file, etc) and write to another topic. Treating it as a request that will somehow be responded to later in time scares me and I wouldn't do it. What if your application needs to be restarted while some things are in-flight?

EdwardDiego|4 years ago

> If you want to publish a message, and you care what happens to that message downstream, things get complicated.

Definitely agree. The basic concept of Kafka is that the publisher doesn't care, so long as data isn't lost. If you need the producer to redo stuff if the consumer failed, then Kafka is the square peg in your round hole.

And yeah, the best use case for Kafka is, IMO, "I have to shift terabytes or more of data daily without risking data loss, and I want to decouple consumers from producers".

Gigachad|4 years ago

Our company is currently looking into Kafka and microservices. The problem we have is that the volume of actions going on has gone past what a single Rails app with SQL Server can handle. When I look into it, it seems like it would mostly be used as some kind of job queue where worker microservices churn through the entries in Kafka to do some kind of data processing without needing SQL.

But then there are blog posts saying Kafka is a terrible job queue, because you can only have one worker per partition and it's hard to add more partitions dynamically.
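As I understand it, the constraint looks like this (toy sketch with a simplified round-robin assignment; Kafka's real assignors are more involved, but the cap is the same):

```python
def assign(partitions, consumers):
    """Toy round-robin partition assignment within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2]

# Three consumers, three partitions: everyone gets work.
print(assign(partitions, ["c1", "c2", "c3"]))

# Five consumers, three partitions: two sit idle, because a partition
# is only ever read by one consumer in the group at a time.
result = assign(partitions, ["c1", "c2", "c3", "c4", "c5"])
idle = [c for c, ps in result.items() if not ps]
print(idle)  # ['c4', 'c5']
```

So your parallelism is capped at the partition count you chose when you created the topic, which is why the blog posts complain.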

almeria|4 years ago

Very helpful, thanks