
What If We Could Rebuild Kafka from Scratch?

254 points | mpweiher | 10 months ago | morling.dev

220 comments

[+] galeaspablo|10 months ago|reply
Agreed. The head of line problem is worth solving for certain use cases.

But today, all streaming systems (or workarounds) with per-message-key acknowledgements incur O(n^2) costs in either computation, bandwidth, or storage for n messages. This applies to Pulsar, for example, which is often used for this feature.

Now, this degenerate time/space complexity might not show up every day, but when it does, you’re toast, and you have to wait it out.

My colleagues and I have studied this problem in depth for years, and our conclusion is that a fundamental architectural change is needed to support scalable per-message-key acknowledgements. Furthermore, the architecture will fundamentally require a sorted index, meaning that any such queuing/streaming system will process n messages in O(n log n).

We’ve wanted to blog about this for a while but never found the time. I hope this comment helps if you’re thinking of relying on per-message-key acknowledgements: you should expect sporadic outages/delays.

[+] singron|10 months ago|reply
Check out the parallel consumer: https://github.com/confluentinc/parallel-consumer

It processes unrelated keys in parallel within a partition. It has to track which offsets have been processed between the last committed offset of the partition and the tip (i.e. only what's currently processed out of order). When it commits, it saves this state, highly compressed, in the commit metadata.

Most of the time it was only processing a small number of records out of order, so this bookkeeping was insignificant, but if one key got stuck it would scale to at least 100,000 offsets ahead, at which point enough alarms would go off that we would do something. That's definitely a huge improvement over head-of-line blocking.
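
The bookkeeping described here can be sketched in a few lines; this is a hedged illustration of the idea (class and method names are made up), not the parallel consumer's actual data structure:

```python
import heapq

class OutOfOrderOffsetTracker:
    """Track offsets completed out of order past the last committed
    offset, and compute the highest offset that is safe to commit.
    Illustrative sketch only, not the parallel consumer's internals."""

    def __init__(self):
        self.outstanding = []   # min-heap of delivered-but-incomplete offsets
        self.done = set()       # offsets completed out of order
        self.committable = -1   # highest contiguously completed offset

    def deliver(self, offset):
        heapq.heappush(self.outstanding, offset)

    def complete(self, offset):
        self.done.add(offset)
        # Advance the commit point past any contiguous completed prefix.
        while self.outstanding and self.outstanding[0] in self.done:
            self.committable = heapq.heappop(self.outstanding)
            self.done.discard(self.committable)
```

If offsets 0..3 are delivered and only offset 2 completes, nothing is committable yet; once 0 and 1 complete, the commit point jumps straight to 2, and only the small `done` set has to ride along in commit metadata.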

[+] kcexn|10 months ago|reply
> Furthermore, the architecture will fundamentally require a sorted index, meaning that any such a queuing / streaming system will process n messages in O (n log n).

Would using a sorted index have an impact on the measured servicing time of each message? (Not worst-case, something more like average-case.) The Kafka docs make it extremely clear that Kafka relies heavily on the operating system's filesystem cache for performance, and that seeking through events on disk turns out to be very slow compared to just processing events in order.

[+] cogman10|10 months ago|reply
> streaming system will process n messages in O (n log n)

I'm guessing this is mostly about how backed up the stream is. n isn't the total number of messages but rather the current number of unacked messages.

Would a radix structure work better here? If you throw something like a UUIDv7 on the messages and store them in a radix structure, you should be able to get O(n) performance, correct? Or am I not understanding the problem well.
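
For what it's worth, fixed-width time-ordered keys do make each trie insert constant-cost, so n inserts are O(n). A toy sketch (the `uuid7_like` helper is improvised, not the stdlib's UUIDv7, to avoid version assumptions):

```python
import os
import time

def uuid7_like() -> bytes:
    """Improvised UUIDv7-style key: 48-bit ms timestamp + random bits,
    so lexicographic byte order matches arrival order."""
    ms = int(time.time() * 1000)
    return ms.to_bytes(6, "big") + os.urandom(10)

def trie_insert(root: dict, key: bytes, value) -> None:
    # Fixed 16-byte keys: each insert touches exactly 16 nodes,
    # so the cost per message is O(1) and n messages cost O(n).
    node = root
    for b in key:
        node = node.setdefault(b, {})
    node[None] = value  # leaf marker

def trie_items(node: dict, prefix: bytes = b""):
    # Yields entries in key (i.e. time) order; fan-out is bounded
    # by 256 children, so a full traversal is still linear overall.
    if None in node:
        yield prefix, node[None]
    for b in sorted(k for k in node if k is not None):
        yield from trie_items(node[b], prefix + bytes([b]))
```

This only covers ordered storage, though; whether the whole ack path stays O(n) still depends on the acknowledgement bookkeeping, which is the part the parent comment argues is hard.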

[+] vim-guru|10 months ago|reply
https://nats.io is easier to use than Kafka and, I believe, already solves several of the points in this post, like removing partitions, supporting key-based streams, and having flexible topic hierarchies.
[+] boomskats|10 months ago|reply
Also remember that NATS was donated to the CNCF a while back, and as a result people built a huge ecosystem around it. Easy to forget.
[+] mountainriver|10 months ago|reply
I greatly prefer Redis streams. Not all the same features, but if you just need basic streams, Redis has the dead-simple implementation I always wanted.

Not to mention you then also have a KV store. Most problems can be solved with Redis + Postgres.

[+] raverbashing|10 months ago|reply
Honestly, that website has the lowest ratio of information to text I've seen on any website.

I had to really dig (outside of that website) to understand even what NATS is and/or does

It goes too hard on the keyword babbling and too little on the "what does this actually do"

> Services can live anywhere and are easily discoverable - decentralized, zerotrust security

Ok cool, this tells me absolutely nothing. What service? Who to whom? Discovering what?

[+] wvh|10 months ago|reply
I came here to say just that. NATS solves a lot of those challenges: different ways to query and preserve messages, hierarchical data, decent authn/authz options for multi-tenancy, and it's much lighter and easier to set up. It has more of a messaging and K/V-store feel than the log that Kafka is, so while there's some overlap, I don't think they fit the exact same use cases. NATS is fast, but I haven't seen any benchmarks for specifically the bulk write-once append-log situation Kafka is usually used for.

Still, if a hypothetical new Kafka incorporated some of NATS's features, that would be a good thing.

[+] Ozzie_osman|10 months ago|reply
I feel like everyone's journey with Kafka ends up being pretty similar. Initially, you think "oh, an append-only log that can scale, brilliant and simple", then you try it out and realize it is far, far from simple.
[+] munksbeer|10 months ago|reply
I'm not a fan or an anti-fan of Kafka, but I do wonder about the hate it gets.

We use it for streaming tick data, system events, order events, etc, into kdb. We write to kafka and forget. The messages are persisted, and we don't have to worry if kdb has an issue. Out of band consumers read from the topics and persist to kdb.

In several years of doing this we haven't really had any major issues. It does the job we want. Of course, we use the aws managed service, so that simplifies quite a few things.

I read all the hate comments and wonder what we're missing.

[+] carlmr|10 months ago|reply
I'm wondering how much of that is bad developer UX and defaults, and how much of that is inherent complexity in the problem space.

Like the article outlines, partitions are not that useful for most people. Instead of removing them, how about putting them behind a feature flag, i.e. off by default? That would ease 99% of users' problems.

The next point in the article that resonates with me is the lack of proper schema support. That's just bad UX again, not inherent complexity of the problem space.

On the testing side, why do I need to spin up a Kafka testcontainer? Why is there no in-memory Kafka server that I can use for simple testing purposes?
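
Apache Kafka itself ships no such in-memory broker; the usual answers are testcontainers or a JVM embedded broker. But for plain unit tests, even a toy stand-in along these lines (hypothetical API, no partitions, rebalancing, or retention semantics) covers a surprising amount:

```python
from collections import defaultdict

class InMemoryTopic:
    """Toy in-memory stand-in for a Kafka topic, for unit tests only.
    Hypothetical sketch: Kafka provides no equivalent out of the box."""

    def __init__(self):
        self.log = []                    # append-only (key, value) records
        self.offsets = defaultdict(int)  # next offset to read, per group

    def produce(self, key, value):
        self.log.append((key, value))
        return len(self.log) - 1         # offset of the new record

    def poll(self, group, max_records=100):
        start = self.offsets[group]
        batch = self.log[start:start + max_records]
        return list(enumerate(batch, start))  # [(offset, (key, value)), ...]

    def commit(self, group, offset):
        self.offsets[group] = offset + 1
```

Each consumer group just gets its own cursor into the shared log, which is enough to exercise produce/consume/commit logic in application code without a broker.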

[+] vkazanov|10 months ago|reply
Yeah...

It took 4 years to properly integrate Kafka into our pipelines. Everything, and I mean everything, is complicated with it: cluster management, numerous semi-tested configurations, etc.

My final conclusion with it is that the project just doesn't really know what it wants to be. Instead it tries to provide everything for everybody, and ends up being an unbelievably complicated mess.

You know, there are systems that know what they want to be (Amazon S3, Postgres, etc.), and then there are systems that try to eat the world (Kafka, k8s, systemd).

[+] mrweasel|10 months ago|reply
The worst part of Kafka, for me, is managing the cluster. I don't really like the partitioning and the almost hopelessness that ensues when something goes wrong. Recovery is really tricky.

Granted, it doesn't happen often if you plan correctly, but the possibility of the partitioning and replication going wrong makes updates and upgrades nightmare fuel.

[+] hinkley|10 months ago|reply
Once you pick an “as simple as possible, but no simpler” solution, it triggers Dunning-Kruger in a lot of people who think they can one-up you.

There was a time in my early to mid career when I had to defend my designs a lot because people thought my solutions were shallower than they were and didn’t understand that the “quirks” were covering unhappy paths. They were often load bearing, 80/20 artifacts.

[+] kcexn|10 months ago|reply
Having worked with it only a little, on occasion, I found that the problem lies in its atrocious documentation.

I get it, there are lots of knobs and dials to tune the cluster with. But a one-line description of each item is often insufficient to figure out what it actually does. You can eventually get a sense of the problem by spinning up a local environment and going through the items one by one, but that's super time-consuming.

[+] Hamuko|10 months ago|reply
>Initially, you think "oh, an append-only log that can scale, brilliant and simple"

Really? I got scared by Kafka by just reading through the documentation.

[+] nitwit005|10 months ago|reply
> When producing a record to a topic and then using that record for materializing some derived data view on some downstream data store, there’s no way for the producer to know when it will be able to "see" that downstream update. For certain use cases it would be helpful to be able to guarantee that derived data views have been updated when a produce request gets acknowledged, allowing Kafka to act as a log for a true database with strong read-your-own-writes semantics.

Just don't use Kafka.

Write to the downstream datastore directly. Then you know your data is committed and you have a database to query.

[+] Spivak|10 months ago|reply
Once you start asking for querying the log by keys, multi-tenant trees of topics, synchronous-ish commits, and schemas, aren't we just in normal DB territory, where the Kafka log becomes the query log? I think you need to go backwards and ask what feature an RDBMS/NoSQL DB can't provide, and go from there. Because the wishlist is looking like CQRS with a durable front queue whose events are removed once persisted in the backing DB, and clients querying events from the DB.

The backing db in this wishlist would be something in the vein of Aurora to achieve the storage compute split.

[+] peanut-walrus|10 months ago|reply
Object storage for Kafka? Wouldn't this 10x the latency and cost?

I feel like Kafka is a victim of its own success: it's excellent for what it was designed for, but since the design is simple and elegant, people have been using it for all sorts of things it was not designed for. And of course it's not perfect for those use cases.

[+] debadyutirc|10 months ago|reply
This is a question we asked 6 years ago.

What if we wrote it in Rust and leveraged WASM?

We have been at it for the past 6 years. https://github.com/infinyon/fluvio

For the past 2 years we have also been building a Flink alternative using Rust and WASM. https://github.com/infinyon/stateful-dataflow-examples/

[+] Micoloth|10 months ago|reply
Interesting!

How would you say your project compares to Arroyo?

[+] fintler|10 months ago|reply
Keep an eye out for Northguard. It's the name of LinkedIn's rewrite of Kafka that was announced at a stream processing meetup about a week ago.
[+] selkin|10 months ago|reply
It solves some issues, and creates some, since Northguard isn’t compatible with the current Kafka ecosystem.

As such, you can no longer use existing software that is built on Kafka as-is. It may not be a grave concern for LinkedIn, but it could be for others that currently benefit from using the existing Kafka ecosystem.

[+] supermatt|10 months ago|reply
> "Do away with partitions"

> "Key-level streams (... of events)"

When you are leaning on the storage backend for physical partitioning (as per the cloud example, where they would literally partition based on keys), doesn't this effectively just boil down to renaming partitions to keys, and keys to events?

[+] gunnarmorling|10 months ago|reply
That's one way to look at this, yes. The difference being that keys actually have a meaning to clients (as providers of ordering and also a failure domain), whereas partitions in their current form don't.
[+] olavgg|10 months ago|reply
How many of the Apache Kafka issues are addressed by switching to Apache Pulsar?

I skipped learning Kafka and jumped right into Pulsar. It works great for our use case. No complaints. But I wonder why so few use it.

[+] jsumrall|10 months ago|reply
I've been down this path, and if my experience is more common, then it really boils down to the classic "Nobody gets fired for buying IBM", and here IBM -> Confluent.

StreamNative seems like an excellent team, and I hope they succeed. But as another comment has written, something (Pulsar) being better (than Kafka) has to either be adopted from the start or be a big enough improvement to justify switching; as difficult and feature-poor as Kafka is, it still gets the job done.

I can rant longer about this topic, but Pulsar _should_ be more popular; unfortunately, Confluent has dominated here and is rent-seeking this field into the ground.

[+] enether|10 months ago|reply
There’s inherently a lot of path-dependent network effects in open source software.

Something being 10-30% better in certain cases almost never warrants its adoption if on the other side you get much less human expertise, documentation/resources, and battle-tested testimony.

This, imo, is the story of most Kafka competitors

[+] EdwardDiego|10 months ago|reply
Some, but then Pulsar brings its own issues.
[+] vermon|10 months ago|reply
Interesting. If partitioning is not a useful concept in Kafka, what are some better alternatives for controlling consumer concurrency?
[+] galeaspablo|10 months ago|reply
It is useful, but it is not generally applicable.

Given an arbitrary causality graph between n messages, it would be ideal if you could consume your messages in topological order. And that you could do so in O(n log n).

No queuing system in the world does arbitrary causality graphs without O(n^2) costs. I dream of the day where this changes.

And because of this, we’ve adapted our message causality topologies to cope with the consuming mechanisms of Kafka et al.

To make this less abstract, imagine you have two bank accounts, each with its own stream. MoneyOut in Bob’s account should come BEFORE MoneyIn in Alice’s account when he transfers to her, despite the two accounts having different partition keys.
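
A naive way to honor such cross-key dependencies is to tag each event with its causal predecessor and hold it back until the predecessor has been delivered. The sketch below (event shape and field names made up for illustration) does exactly that, and its repeated rescanning of the pending list is precisely the kind of O(n^2) cost described above:

```python
def causal_order(events):
    """Deliver events respecting explicit 'after' links across keys.
    Naive: rescans the pending list each pass, which degenerates to
    O(n^2) for long dependency chains."""
    seen, delivered = set(), []
    pending = list(events)
    while pending:
        ready = [e for e in pending if e["after"] is None or e["after"] in seen]
        if not ready:
            raise RuntimeError("cycle or missing causal predecessor")
        for e in ready:
            delivered.append(e)
            seen.add(e["id"])
        pending = [e for e in pending if e not in ready]
    return delivered

# Bob -> Alice transfer: MoneyOut must precede MoneyIn even though
# the two events live under different keys (hypothetical event shape).
events = [
    {"id": "e2", "key": "alice", "type": "MoneyIn",  "after": "e1"},
    {"id": "e1", "key": "bob",   "type": "MoneyOut", "after": None},
]
```

Even though "alice" and "bob" would land on different partitions, the `after` link forces MoneyOut to be delivered first.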

[+] frklem|10 months ago|reply
"Faced with such a marked defensive negative attitude on the part of a biased culture, men who have knowledge of technical objects and appreciate their significance try to justify their judgment by giving to the technical object the only status that today has any stability apart from that granted to aesthetic objects, the status of something sacred. This, of course, gives rise to an intemperate technicism that is nothing other than idolatry of the machine and, through such idolatry, by way of identification, it leads to a technocratic yearning for unconditional power. The desire for power confirms the machine as a way to supremacy and makes of it the modern philtre (love-potion)." Gilbert Simondon, On the mode of existence of technical objects.

This is exactly what I take away from these kinds of articles: engineering just for the sake of engineering. I am not saying we should not investigate how to improve our engineered artifacts, or that we should not improve them. But I see a generalized lack of reflection on why we should do it, and I think it is related to a detachment from the domains we create software for. The article suggests uses of the technology drawn from such different ways of using it that it loses coherence as a technical item.

[+] redditor98654|10 months ago|reply
I agree on the head-of-line blocking problem and that not everyone needs the per-partition ordering. For that, I have started to use SQS FIFO with the message group key being the logical key for the event/resource. This gives me ordering within a key and no extra ordering across keys, so I don’t have the head-of-line blocking problem.

If I need multiple independent consumers, I instead publish to SNS FIFO and let my consumers create their own SQS FIFO queues subscribed to the topic. Ordering is maintained across SNS and SQS. I also get native DLQ support for poison pills, and an SQS consumer is dead simple to operate vs. a Kafka consumer.

It does not solve all of the mentioned problems, like being able to see what keys are in the queue or look up by a given key, but as a messaging solution that offers ordering per key, this is hard to beat.
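
The per-key ordering hinges on `MessageGroupId`. A hedged sketch of building the boto3 `send_message` call for a FIFO queue (the event shape and helper name are made up for illustration):

```python
import hashlib
import json

def fifo_publish_kwargs(queue_url: str, event: dict) -> dict:
    """Build kwargs for boto3's sqs.send_message on a FIFO queue.
    MessageGroupId carries the per-entity key, so SQS preserves order
    within a key without head-of-line blocking across keys."""
    body = json.dumps(event, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        # Ordering domain: one logical entity per group.
        "MessageGroupId": event["key"],
        # Content-based dedup id (or enable dedup on the queue itself).
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# Usage (requires AWS credentials):
# import boto3
# boto3.client("sqs").send_message(**fifo_publish_kwargs(url, event))
```

SQS delivers messages with the same `MessageGroupId` strictly in order, while different groups proceed independently, which is the property described above.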

[+] renatomefi|10 months ago|reply
How do you handle the DLQ for ordered keys? I assume that if you drop a message, you lose the ordering semantics.
[+] elvircrn|10 months ago|reply
Surprised there's no mention of Redpanda here.
[+] smittywerben|10 months ago|reply
I don't understand how everyone hates Kafka. I use it as a typed write-ahead JSON log with library support for most languages. Yes, the systems I've built with this were overengineered, but they worked and were reliable. I just bought a larger disk instead of dealing with whatever remains of the great battle of the ZooKeeper. I assumed the fact that it has any integration support with standard RDBMSes must be a byproduct of being Java, purely by accident.
[+] selkin|10 months ago|reply
This is a useful thought experiment, but I think the replies suggesting that the conclusion is we should replace Kafka with something new are quiet about what seems obvious to me:

Kafka's biggest strength is the wide and useful ecosystem built on top of it.

It is also a weakness, as we have to keep some (but not all) of the design decisions we wouldn't have made had we started from scratch today. Or we could drop backwards compatibility, at the cost of having to recreate the ecosystem we already have.

[+] mgaunard|10 months ago|reply
I can't count the number of bad message queues and buses I've seen in my career.

While it would be useful to just blame Kafka for being bad technology, it seems many other people get it wrong, too.

[+] mgaunard|10 months ago|reply
s/useful/easy/

(can't edit)

[+] tyingq|10 months ago|reply
He mentions AutoMQ right in the opener. And if I follow the link, they pitch it in a way that sounds very "too good to be true".

Anyone here have some real world experience with it?

[+] bjornsing|10 months ago|reply
> Key-centric access: instead of partition-based access, efficient access and replay of all the messages with one and the same key would be desirable.

I’ve been working on a datastore that’s perfect for this [1], but I’m getting very little traction. Does anyone have any ideas why that is? Is my marketing just bad, or is this feature just not very useful after all?

1. https://www.haystackdb.dev/

[+] galeaspablo|10 months ago|reply
Some input from previously working on a superset of this problem, and from being in a similar position.

Mature projects have too much bureaucracy, and even spending time talking to you is an opportunity cost. So making a case that you're going to solve a problem for them is tough.

New projects (whether at big companies or small companies) have 20 other things to worry about, so the problem isn't big enough.

I wrote about this in our blog if you're curious: https://ambar.cloud/blog/a-new-path-for-ambar

[+] notfed|10 months ago|reply
The website seems very vapid. Extraordinary claims require extraordinary evidence. Personally I see a lack of evidence here (that this vendor-locked product is better than existing freeware) and I'm going to move on.
[+] thesimon|10 months ago|reply
> HaystackDB is accessed through a RESTful HTTPS API. No client library necessary.

That's cool, but I would prefer not to reinvent the wheel. If you have a simple library, that would already be useful.

Some simple code or request examples would be convenient as well. I really don't know how easy or difficult your interface design is. It would be cool to see the API docs.

[+] YetAnotherNick|10 months ago|reply
I wish there were a global file system with node-local disks that had rule-driven affinity of data to nodes. We have two extremes: EFS or S3 Express One Zone, which have no affinity to the processing system, and what Kafka et al. do, with tightly integrated placement logic that makes those systems more complicated.
[+] XorNot|10 months ago|reply
I might be misunderstanding, but isn't this, like, literally what GlusterFS used to do?

Like, I distinctly recall running it at home as a global unioning filesystem where content had to fully fit on the specific device it was targeted at (rather than being striped or whatever).