
What every software engineer should know about Apache Kafka

597 points | aloknnikhil | 5 years ago | michael-noll.com | reply

276 comments

[+] georgewfraser|5 years ago|reply
This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today. Yes, you can turn a stream of events into a table of the present state. However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

Kafka is a message broker. It’s not a database and it’s not close to being a database. This idea of stream-table duality is not nearly as profound or important as it seems at first.

[+] jholman|5 years ago|reply
Recently I watched a 50-engineer startup allocate more than 50% of their engineering time for about two years to trying to cope with the consequences of using Kafka as their database, and eventually try to migrate off of it. The whole time I was wondering "but how could anyone have started down this path?!?"

Apparently the primary reason they went out of business was sales-related, not purely technical, but if they hadn't used Kafka, they could have had 2x the feature velocity, or better yet 2x the runway, which might have let them survive and eventually thrive.

Imagine, thinking you want a message bus as your primary database.

[+] wenc|5 years ago|reply
A quibble I have with the term "stream-table duality" is that it's not true duality.

You can construct a state (table) from a stream, but you cannot do the reverse. You cannot deconstruct a table into its original stream because the WAL / CDC information is lost -- you can only create a new stream from the table. This means you lose all ability to retrieve previous states by replaying the stream. Information is lost.

Duality in math is an overloaded term but it generally means you can go in either direction. This is not true here.
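
To make the asymmetry concrete, here's a toy sketch in plain Python (nothing Kafka-specific): two different streams fold into the same table, so no inverse mapping from tables back to streams can exist.

```python
# Toy illustration: folding a stream of (key, value) events into a table
# with last-write-wins semantics. Two different streams can yield the
# same table, so the table alone cannot be unfolded back into its stream.

def to_table(stream):
    table = {}
    for key, value in stream:
        table[key] = value  # later events overwrite earlier state
    return table

stream_a = [("alice", 1), ("alice", 2), ("bob", 5)]
stream_b = [("bob", 5), ("alice", 2)]

assert to_table(stream_a) == to_table(stream_b) == {"alice": 2, "bob": 5}
# The intermediate state ("alice", 1) is unrecoverable from the table.
```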

[+] Ozzie_osman|5 years ago|reply
> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

No. The notion of a "stream-table duality" is a powerful concept that I've found can change, for the better, the way any engineer thinks about how they store and retrieve data (it's an old idea, rooted in Domain-Driven Design, but for some reason a lot of engineers, myself included, still need to learn, or relearn, it).

The misleading part is the notion that a stream can serve as the primary data persistence abstraction or mechanism in a production system, at least for now. I'd argue Kafka pushes us in a direction that makes progress along that dimension, and you can apply it successfully with a lot of effort. But to match what you can get from a more traditional DBMS? The tech just isn't there (yet).

[+] strictfp|5 years ago|reply
I always found it ironic that you get most of this for free if you design your SQL updates carefully and save/query the transaction log and/or history. A lot of relational DBs have functionality for that.

And if you don't want to use that, there are also products built specifically for this, such as EventStore.

[+] ypcx|5 years ago|reply
I have recently analyzed Kafka as a message bus solution for our team. My take is that Kafka is a very mature and robust message system with clearly defined complexity, and thus, when used correctly, very dependable. The recent versions even have online rebalancing (but don't tell that to the Pulsar people). However, "Kafka Streams" just makes me shiver - I don't understand why someone would masquerade a project directly competing with other processing frameworks (e.g. Apache Flink or Spark) as being part of Kafka itself, when apart from utilizing Kafka's Producer and Consumer interfaces it really has nothing else in common with it. Streams is a giant, immature, complexity-hiding mess which, as a responsible systems designer, you have to run away from really, really fast.
[+] epistasis|5 years ago|reply
The point of that is for people trying to figure out why streams are a useful abstraction at all: what's needed to make them useful is some sort of aggregation, and of course tabular state is a common end point.

The article does not recommend writing this code yourself, it shows how to aggregate data into usable forms.

So I think your concerns may be a bit overblown. If ksqlDB or Kafka Streams, the tools shown in this blog post, were at risk of what you warn about, this comment would be a valid criticism. But it's clear that the article isn't advocating for people to write their own versions of that...

[+] eggsnbacon1|5 years ago|reply
I worked on a project that used CQRS and Event Sourcing years ago. It was an unmitigated disaster, never made it out of prototype phase.

By using Kafka (or anything else) as a "commit log" you've just resigned yourself to building the rest of the database around it. In a real RDBMS the commit log is just a small piece of maintaining consistency.

Every time I've worked on a project where we handle our own "projections" (database tables) we ended up mired deep in the consistency concerns normally handled by our database.

What's so hard? Compaction is an evil problem. We figured we could "handle it later". Well, it turns out that maintaining an uncompactable commit log of every event, even for testing data, consumes many terabytes. This and "replays" sunk the project indirectly.
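
For what it's worth, this is exactly the trade-off behind Kafka's log compaction (`cleanup.policy=compact`): keep only the latest record per key, which caps the log size but throws away the full history that event sourcing needs. A toy sketch of the idea in plain Python, not the actual compaction algorithm:

```python
# Toy sketch of log compaction: retain only the most recent event per key,
# preserving the relative order of the surviving events.

def compact(log):
    last_index = {}
    for i, (key, _) in enumerate(log):
        last_index[key] = i  # remember the final occurrence of each key
    return [event for i, event in enumerate(log) if last_index[event[0]] == i]

log = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("a", 3)]
assert compact(log) == [("b", 1), ("c", 1), ("a", 3)]
# Every earlier value of "a" is gone - fine for a changelog,
# fatal if the whole point was keeping every event.
```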

Replays. The ES crowd pretends this is easy. It's not. If you have 3 months of events to replay every time you make certain schema changes, you are screwed. Even if your application can handle 100x normal load, it's going to take you over a day to restore using a replay. This also happened to us. Every time an instance of anything goes down, it has to replay all the events it cares about. If you have an outage of more than a few instances, the thundering herd will take down your entire system no matter how much capacity you have. With a normal system, the DB always has the latest version of all the "projections" unless you're restoring from a backup, and then it only has to restore a partial commit-log replay, and only once.
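
Some back-of-envelope math on that replay point, with made-up but plausible numbers:

```python
# Back-of-envelope replay math (hypothetical numbers): even at 100x normal
# processing speed, three months of backlog takes most of a day to replay.

normal_rate = 1_000            # events/sec produced in steady state (assumed)
replay_speedup = 100           # replay consumes at 100x the normal rate
backlog_seconds = 90 * 86_400  # ~3 months of events

replay_seconds = backlog_seconds * normal_rate / (normal_rate * replay_speedup)
assert round(replay_seconds / 3600, 1) == 21.6  # roughly a full day offline
```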

Data duplication. Turns out "projecting" everything everywhere is way worse for memory and performance than having the DB that's handling your commit log also store it all in one place. Who knew.

Data corruption. If you get one bad event somewhere in your commit log, you are cooked. This happens all the time from various bugs. We resorted to manually deleting these events, negating all the supposed advantages of an immutable commit log. We would have been fine if we had let the DB handle the commit log events.

Questionable value of replays. You go through a ton of BS to get a system that can rewind to any point in time. When did we use this functionality? Never. We never found a use case. When we abandoned ES, we added Hibernate Auditing to some tables. It's relatively painless, transparent, and handled all of our use cases for replays.

[+] hodgesrm|5 years ago|reply
> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

I disagree. Kafka is a log. That's half of a database. The other half is something that can checkpoint state. In analytics that's often a data warehouse. They are connected by the notion that the data warehouse is a view of the state of the log. This is one of a small number of fundamental concepts that come up again and again in data management.

Since the beginning of the year I've talked to something like a dozen customers of my company who use exactly this pattern to implement large and very successful analytic systems. Practical example: you screw up aggregation in the data warehouse due to a system failure. What to do? Wipe aggregates from the last day, reset topic offsets, and replay from Kafka. This works because the app was designed to use Kafka as a replayable, external log. In analytic applications, the "database" is bigger than any single component.
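
That recovery pattern is easy to sketch. Here's a toy Python version, with a list standing in for the topic and a dict for the warehouse state:

```python
# Toy sketch of checkpoint + replay: the warehouse is a view over the log.
# On failure, wipe the bad aggregates, rewind to the last good offset,
# and re-apply the log from there.

log = [("clicks", 3), ("clicks", 2), ("clicks", 7)]  # (metric, count) events

def apply_events(state, events):
    for metric, count in events:
        state[metric] = state.get(metric, 0) + count
    return state

# Checkpoint taken after offset 1 (first two events already applied).
checkpoint_state, checkpoint_offset = {"clicks": 5}, 2

# Aggregation after the checkpoint was corrupted; rebuild from the log.
recovered = apply_events(dict(checkpoint_state), log[checkpoint_offset:])
assert recovered == {"clicks": 12}
```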

I agree there's a problem with the stream-table duality idea, but it's more that people don't understand the concept that's behind it. The same applies to transactions, another deceptively simple idea that underpins scalable systems. "What is a transaction and why are they useful?" has been my go-to interview question to test developers' technical knowledge for over 25 years. I'm still struck by how many people cannot answer it.

[+] lmm|5 years ago|reply
> However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

It's the same as moving from a framework to a library. At first, the framework seems great, it solves all your problems for you. But then as soon as you want to do something slightly different from the way your framework wanted you to do it, you're stuffed. Much better to use libraries that offer the same functionality that the framework had, but under your own control.

If you use the sledgehammer of a traditional RDBMS you are practically guaranteed to get either lost updates, deadlocks, or both. If you take the time to understand your dataflow and write the stream -> table transformation explicitly, you front-load more of the work, but you get to a database that works properly more quickly.

99% of webapps don't use database transactions properly. Indeed during the LAMP golden age the most popular database didn't even have them. The most well-designed systems I've seen built on traditional databases used them like Kafka - they didn't read-modify-write, they had append-only table structures and transformed one table to compute the next one. At which point what is the database infrastructure buying you?

If you're using a traditional database it's overwhelmingly likely that you're paying a significant performance and complexity cost for a bunch of functionality that you're not using. Given the limited state of true master-master HA in traditional databases, you're probably giving up a bunch of resilience for it as well. If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.

[+] EamonnMR|5 years ago|reply
It's worth noting that most managed Kafka solutions offer a service which will transform a stream into a database table so you can query it later. We've also built our own tooling to do it. We read from the Kafka stream directly when we're building a pipeline between microservices; for almost everything else we query the database. It works pretty well and we haven't found ourselves wishing we could write pseudo-queries on streams yet.
[+] math|5 years ago|reply
At the moment you more or less need to write a DBMS in your app code, but I don't think that's the end state. I think we're seeing the beginnings of something big - it just might not seem like it yet because it's the v1 / nowhere-near-complete version. I think having all your data in a single system (Kafka, ksqlDB, ...) that allows you to work with it in cross-paradigm ways will turn out to be very compelling.
[+] pgwhalen|5 years ago|reply
>You will more or less have to write a full-fledged DBMS in your application code.

Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.). Depending on your domain, this could be a large amount of your data, or not. Just like anything, Kafka and Kafka Streams are tools that can be used and misused.

There is perhaps an exciting future where something like ksqlDB makes even highly relational data sit well in Kafka, but we're definitely not there yet, and I'm not terribly confident about getting there.

[+] mikekchar|5 years ago|reply
I work on a legacy CouchDB application and the same problem exists in that area. CouchDB is actually really amazing and is wonderful for some types of applications. It really sucks if you want a relational DB.

Keep in mind that I don't know anything about Kafka, but it seems like the same kind of thinking. You can think about your data as a log of events. Since you can always build the current state from the log of events, then you might as well just save the log. This works incredibly well when you treat your data as immutable and need to be able to see all of your history, or you want to be able to revert data to previous states. It also solves all sorts of problems when you want to do things in a distributed fashion.

The application which fits this model well, and that almost every developer is familiar with, is source control (and specifically git). If that is the kind of thing that you want to do with your data, then it can be truly wonderful. If it's not what you want to do, then it will be the millstone that drags you underwater to your death.

The legacy application I work on is a sales system and actually it's not a bad area for this kind of data. You enter data. Modifications are essentially diffs. You can go back and forth through the DB to see how things have changed, when they were changed and often why they were changed. In many ways it is ideal for accounting data.

But as you say, if you want a full-fledged DBMS, this should not be your primary data store. It can act as the frontend for the data store -- with an eventually consistent DB that you use for querying (and you really have to be OK with that "eventually consistent" thing). But you don't want to build your entire app around it.

[+] dwenzek|5 years ago|reply
> Kafka is a message broker. It’s not a database and it’s not close to being a database.

Kafka is in some sense a database, even if specialized for logs of events. For a similar discussion, I would refer to Jim Gray's "Queues Are Databases": https://arxiv.org/pdf/cs/0701158.pdf.

[+] dwenzek|5 years ago|reply
I fully agree that this "stream-table duality" is misleading or at least quite vague.

The diagram juxtaposing a chess board (the table) and a sequence of moves (the stream) seems enlightening but actually hides several issues.

* An initial state is required to turn a stream into a table.

* A set of rules is required to reject any move which would break the integrity constraints.

* In the reverse direction, a table value cannot be turned into a stream. We need a history of table values for that.

Hence, the duality, or more accurately the correspondence, is not between stream and table but between constrained-stream and table history.

Even if these issues are not made explicit, the Confluent articles and the Kafka semantics are correct.

* "A table provides mutable data". This highlights that stream-table-duality is relative to a sequence of table states.

* "Another event with key bob arrives" is translated into an update ... to enforce an implicit uniqueness constraint on the key for the tables.

[+] zok3102|5 years ago|reply
Reminds me of that time when database vendors were overreaching to be message queues/brokers - OracleAQ, MSMQ, etc.
[+] manigandham|5 years ago|reply
That's why ksqlDB exists and handles all that for you, turning streams into tables that you can query.
[+] codeisawesome|5 years ago|reply
Allow me to submit that this obsession is... Kafka-esque? :D
[+] pacala|5 years ago|reply
Not super familiar with Kafka, but behind the scenes some RDBMS are log replay engines. Are there technical reasons to not slap a RDBMS frontend on top of a Kafka log and avoid dealing with raw streams in the application code?

Edit: From the sibling comments, the answer is https://ksqldb.io.

[+] opportune|5 years ago|reply
>With Kafka, such a stream may record the history of your business for hundreds of years

Do not do this. Kafka is not a database! Kafka should never be the source of truth for your business. The source of truth should be in whatever consumes data from Kafka downstream when messages are committed as read. Why? Because in your middle layer you can do all the data normalization, sanity checking, processing, and interaction with a REAL database system downstream that can give you things like transactions, ACID, etc.

Of course Confluent wants you to try to use Kafka as a DB, so your usage of it is very high and you pay for the top support package and they have you by the cojones, but that doesn't mean you should do it. You will miss out on all the benefits of using a real database, and for what benefit? Having a simple client API?

[+] sixdimensional|5 years ago|reply
So, I've been having a back and forth with a colleague on this and I'm genuinely interested in why you so strongly suggest this.

For the record, I have good real world experience with all kinds of databases (relational, NoSQL, and even legacy multivalue and hierarchical ones), and I don't see why what you say has to be "always true".

One way of looking at Kafka is that it is an unbundled transaction log, nothing more or less, so it could be used to permanently store and replay transactional activity, if one wishes. Note that even most databases don't store an immutable, permanent transaction log (as such logs often grow to be huge, they are truncated every so often, and tables are used as the current state).

This article by Confluent seems to cover the topic (yes, recognizing it is written by the very vendor you suggest is trying to lock us in): https://www.confluent.io/blog/okay-store-data-apache-kafka/.

Ok, so how about the idea of a persistent, immutable, never-ending transaction log (uhoh, sounds like blockchain now!)? Setting aside Kafka for now, what do you think about the basic design pattern? To me it sounds a bit like it could represent a temporal database in raw transactional log form. Why not?

EDIT: After rereading your comment I see your main concern is using Kafka as a database management system (DBMS). I would agree, that's not what Kafka is for. But, I don't think Confluent intends that use case, do they? I look at it more as an unbundled single component that is very useful by itself, and is part of a more complex data platform/architecture (ex. Lambda or Kappa architecture).

[+] Joeri|5 years ago|reply
In a proper Kafka setting you would use Avro and a schema registry to have defined schemas for your events and the ability to upgrade schemas automatically. For breaking changes you would use Kafka Streams to migrate the old topic into a new topic and then discard the old topic. Kafka has transactions, and with Kafka Streams you can create joined views that include those transactions.
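
That topic-migration step for a breaking change can be sketched like this. Plain Python stands in for a Kafka Streams topology here, and the schemas are invented for illustration:

```python
# Toy sketch of a breaking schema migration: read every event from the old
# topic, rewrite it into the new shape, and emit it to a new topic.
# The field names and records are made up for illustration.

old_topic = [
    {"name": "Ada Lovelace", "age": 36},
    {"name": "Alan Turing", "age": 41},
]

def migrate(event):
    # Breaking change: split the single "name" field into first/last.
    first, last = event["name"].split(" ", 1)
    return {"first_name": first, "last_name": last, "age": event["age"]}

new_topic = [migrate(e) for e in old_topic]
assert new_topic[0] == {"first_name": "Ada", "last_name": "Lovelace", "age": 36}
```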

It’s a different way of working for sure, but for an event-sourced system I think Kafka is an excellent choice. In a SQL database, keeping the complete history of changes for years is much less practical than in Kafka.

What you shouldn’t do is look at Kafka as a database. It’s a message log. You push messages in, you eventually read them back out. The key word is eventually. You can’t read your writes, and that makes building an interactive system on top hard.

[+] Twisell|5 years ago|reply
Just a remark to writers: when you redact an "introduction to smth", please refrain from writing down the name of your product 50 or more times in the first paragraph. It's totally frustrating and made me run away to just look up Wikipedia instead to get a grasp of the general idea.

Example: You've probably heard of smth thing and wonder how smth differ from smth else. In this article about smth we will dive into smth things to discover how smth is well better suited to do things thanks to smth things and other things that are really specific to smth! The power of smth enable things to things in a way things do things that make smth something you need to learn about.

So during this journey about smth will be cut in four part the first being an introduction.

(...) --sudo click the link for first part

This is the first part of a fourth series about smth to learn more about things and things in smth.

Smth is very specific about things, and that's specifically why smth is well suited for things. Smth is a new way to do something that will make you think more about things and things. Let's now dive into smth...

Smth use things as a things to do the smth things with tings on smth.........

--sudo repeat marketing ad nauseam

No matter whether the actual product is pure gold or pure garbage, you probably just lost 50% of the readers at some point.

[+] unphased|5 years ago|reply
I totally agree with what you're saying, but you should also know that "redact" doesn't fit in your sentence.
[+] gnfargbl|5 years ago|reply
Something that I wish I had known about Apache Kafka a year or so ago is that it essentially has no support for long-running tasks, i.e. tasks where longest-possible-worker-execution-time >> longest-tolerable-group-rebalance-time.

After much angst in trying to work around this issue, I finally gave up and switched to Pulsar. Pulsar isn’t without its own issues (mostly around bugs and general maturity) but it handles this particular scenario admirably.
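
Concretely, the bind here is Kafka's `max.poll.interval.ms` (default 5 minutes): if a single task runs longer than that between polls, the consumer is considered failed, is evicted from the group, and triggers a rebalance. You can raise the setting, but then a genuinely stuck worker holds its partitions for that long. A trivial sketch of the constraint, with example numbers:

```python
# The long-running-task constraint in one inequality: a task that outlasts
# max.poll.interval.ms means the consumer misses its poll deadline and is
# evicted from the group, triggering a rebalance. Durations are examples.

MAX_POLL_INTERVAL_MS = 300_000  # Kafka's default: 5 minutes

def survives_poll_deadline(task_duration_ms,
                           max_poll_interval_ms=MAX_POLL_INTERVAL_MS):
    return task_duration_ms < max_poll_interval_ms

assert survives_poll_deadline(60_000)         # a 1-minute task is fine
assert not survives_poll_deadline(3_600_000)  # a 1-hour task gets evicted
```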

[+] ketralnis|5 years ago|reply
It's true, message buses and work queues have different characteristics. It sounds like you want a work queue, not a message bus. I have very successful experience with using rabbitmq for work queueing, but as you mention there are others too.
[+] ckdarby|5 years ago|reply
What every software engineer should know about Kafka, it's dead.

If you're not already technically chained to it and Confluent hasn't already upsold your poor organization, avoid it.

If you want the early flexibility and the rapid PoC, just look at AWS Kinesis/Firehose.

If you're looking at large scale (1+ Gbit ingest, 100k msgs/s, that kind of stuff) then Apache Pulsar is where to go.

[+] Rebelgecko|5 years ago|reply
I'm surprised by the amount of criticism in this thread. I've used Kafka in the past and it definitely got the job done (as a message bus, not using stream processing or the other more whiz-bang features). What do people use instead?
[+] bonquesha99|5 years ago|reply
As a Kafka alternative, has anyone attempted to use PostgreSQL logical replication with table partitioning for async service communication?

Proof of concept (with diagrams in the comments): https://gist.github.com/shuber/8e53d42d0de40e90edaf4fb182b59...

Services would commit messages to their own databases along with the rest of their data (with the same transactional guarantees) and then messages are "realtime" replicated (with all of its features and guarantees) to the receiving service's database where their workers (e.g. https://github.com/que-rb/que, skip locked polling, etc) are waiting to respond by inserting messages into their database to be replicated back.

Throw in a trigger to automatically acknowledge/cleanup/notify messages and I think we've got something that resembles a queue? Maybe make that same trigger match incoming messages against a "routes" table (based on message type, certain JSON schemas in the payload, etc) and write matches to the que-rb jobs table instead for some kind of distributed/replicated work queue hybrid?

I'm looking to poke holes in this concept before sinking any more time into exploring the idea. Any feedback/warnings/concerns would be much appreciated, thanks for your time!

Other discussions:

* https://old.reddit.com/r/PostgreSQL/comments/gkdp6p/logical_...

* https://dba.stackexchange.com/questions/267266/postgresql-lo...

* https://www.postgresql.org/message-id/CAM8f5Mi1Ftj%2B48PZxN1...

[+] skyde|5 years ago|reply
What every engineer should know about Kafka is that it should not be used for anything critical like you would use Cassandra or HBase.

But if you are OK with partitions being unavailable for many hours, or with losing all written data because the cluster did not automatically move a partition to new replicas after 2 of its 3 replicas failed... then it’s a good, scalable (speed-wise) product.

There is also no serious multi-tenant support. So if you need multitenancy, you've got to use Kubernetes, do one cluster per tenant, and automate that yourself.

[+] Traster|5 years ago|reply
There seems to be this common problem among relatively new technologies, that they're not actually aware of what the average person knows about them. So let me be the moron in the room. I work at a company that uses Kafka. What I know so far is that Kafka is broken. It seems to me that this article is more about what every software engineer who plans to re-skill as a kafka engineer should know.
[+] sixhobbits|5 years ago|reply
So much criticism here! I've read a lot about Kafka over the last few years and I wish I had read this article earlier -- even basic questions like "Can Kafka store data persistently?" are not adequately answered in many intros to it.

That said, I do find the tutorial flip-flops a bit in target audience. It's mainly "this is what Kafka is", but sometimes has weird asides like "This is how to optimise Kafka" (redundancy, number of partitions, etc) which are pretty distracting from the more fundamental points.

[+] nemetroid|5 years ago|reply
I read the introduction to the series, and then the introduction to the first part, and I’m still not sure exactly what Kafka is, or why I (as part of ”every software engineer”) need to know anything about it. The title suggests that the article(s) will convey some concepts that are useful in a broad sense, but from a skim, this looks like a lot of details about some database-ish product, which probably are good to know if you use that product, but not so much otherwise.
[+] seemslegit|5 years ago|reply
Hmm, I'm pretty sure that a software engineer developing safety-critical firmware for embedded medical systems does not need to know anything about Apache Kafka. Or a game developer. Or a web frontend developer. Given the title it's surprising how many software engineers can in fact go through life and career without ever knowing anything about Apache Kafka.
[+] vsareto|5 years ago|reply
By now, "What Every Software Engineer Should Know" headlines aren't intended to be serious.
[+] DenisM|5 years ago|reply
One can only make assumptions about the needs of every engineer out of profound ignorance, which accordingly sets expectations for the article contents.

Conveniently so, I might add.

[+] PaulWaldman|5 years ago|reply
Is there a reason Kafka wouldn't choose to leverage an existing mature RDBMS for their table storage instead of rolling their own?
[+] skyde|5 years ago|reply
Anyone know how large-scale chat systems (Facebook Messenger, ...) are implemented? My guess is messages are in a data store like HBase, and a very simple notification system lets users that are online know to fetch new entries.
[+] kevindeasis|5 years ago|reply
anyone wanna share their thoughts about deploying their own messaging system vs using a messaging system provided by their cloud provider?
[+] realtalk_sp|5 years ago|reply
The GCP Pub/Sub API has largely replicated all the features you'd want out of Kafka (including Consumer Groups). The primary consideration at this point is cost. There's an inflection point in size (at some very large message volume) where it makes sense to start running your own Kafka cluster and hire a dedicated person or two to manage it. Most companies will never get anywhere close.

Any project just starting out should use Pub/Sub. One thing I really like is that GCP provides emulators of Pub/Sub et al for local testing. That used to be a bit of an obstacle not too long ago.

In terms of lock-in, I don't see how that applies to an AMQ. The data moving through it should only be transiently persisted, up to a week or two at most in the usual case.

If you want to avoid cloud lock-in, have DB backups, use Postgres/MySQL/etc, containerize your service(s), replicate data in object storage, etc. Common sense stuff, if that's something that's of concern.

Personally, I've seen "vendor lock-in" weaponized as an excuse for a lot of costly NIH bullshit. It's painful to reflect back on a project that could have involved literally a tenth of the time and pain it ended up taking because of that one choice alone.

[+] antoncohen|5 years ago|reply
If you are on GCP I think the choice is simple, use Cloud Pub/Sub. Extremely simple, extremely reliable, extremely performant, fairly inexpensive, multi-region (global). No maintenance, no scaling, almost no tunables, it just works.

Google provides a Pub/Sub emulator for local development.

I don't really buy the vendor lock-in thing for Pub/Sub-like systems. The Cloud Pub/Sub usage pattern is basically the same as Kafka, you can have a library that abstracts away the differences. There are open source libraries that do that[1]. If you ever need to switch cloud providers, or want a messaging system to span cloud providers, you can switch without changing lots of code.

[1] https://github.com/google/go-cloud/tree/master/pubsub

[+] dtech|5 years ago|reply
Distributed messaging is really hard to get right. It'll seem to work fine right up until you get weird bugs and unreliability at the worst moments.

I wouldn't recommend relying primarily on something vendor-specific like Amazon SQS, but there are very good out-of-the-box tools like RabbitMQ or Kafka available.

Writing your own messaging system is like writing your own database, it's the wrong choice 99.9% of the time.

[+] sz4kerto|5 years ago|reply
Testability. The single biggest reason for us not going with the proprietary queue. We can create, reset, and throw away queues as we want during testing, thousands of times a day. Even on a laptop.
[+] oweiler|5 years ago|reply
We use Amazon MSK and are pretty happy with it so far.
[+] analognoise|5 years ago|reply
"Every engineer".

The whole human activity of reducing science to practical art will do just fine without knowing Apache Kafka, thanks.

[+] simonjgreen|5 years ago|reply
"Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should."
[+] cosmotic|5 years ago|reply
Cool click-bait title
[+] badrabbit|5 years ago|reply
Sometimes you should just use Graylog (Kafka + Elastic), especially if you are already comfortable with Elastic. You get to scale, retain, and monitor your data in addition to stream processing. If I had a fairly small Go webapp that needed stream processing, I would just use Graylog instead of trying to use Kafka directly.