You haven't seen the worst of it. We had to implement a whole Kafka module for a SCADA system because Target already had unrelated Kafka infrastructure. Instead of a REST API or anything else sane (which was available), ultra-low-volume messaging is now done with JSON objects wrapped in Kafka. Peak incompetence.
Let’s be real: teams come to the infra team asking for a queue system. They give their requirements, and you—like a responsible engineer—suggest a more capable queue to handle their needs more efficiently. But no, they want Kafka. Kafka, Kafka, Kafka. Fine. You (meaning an entire team) set up Kafka clusters across three environments, define SLIs, enforce SLOs, make sure everything is production-grade.
Then you look at the actual traffic: 300kb/s in production. And right next to it? A RabbitMQ instance happily chugging along at 200kb/s.
You sit there, questioning every decision that led you to this moment. But infra isn’t the decision-maker. Sometimes, adding unnecessary complexity just makes everyone happier. And no, it’s not just resume-padding… probably.
That’s almost certainly true, but at least part of the problem (not just Kafka but resume-driven-development tech in general) is that project home pages, comments like this, and “Learn X in 24 hours” books/courses rarely spell out how to determine clearly whether you have an appropriate use case at an appropriate scale. “Use this because all the cool kids are using it” affects non-tech managers and investors just as much as developers with no architectural nous, and everyone with a SQL connection and an API can believe they have “big data” if they don’t have a clear definition of what big data actually is.
Or, as mentioned in the article, you've already got Kafka in place handling a lot of other things but need a small queue as well and were hoping to avoid adding a new technology stack into the mix.
I needed to synchronize some tables between MS SQL Server and PostgreSQL. In the future we will need to add a ClickHouse database to the mix. When I last looked, the recommended way to do this was to use Debezium with Kafka. So that is why we use it. Data volume is low.
If anybody knows of a simpler way to accomplish this, please do let me know.
> Each of these Web workers puts those 4 records onto 4 of the topic’s partitions in a round-robin fashion. And, because they do not coordinate this, they might choose the same 4 partitions, which happen to all land on a single consumer
Then choose a different partitioning strategy. Often key-based partitioning can solve this issue. Worst case, you use a custom partitioning strategy.
Additionally, why can’t you match the number of consumers in the consumer group to the number of partitions?
The KIP mentioned seems interesting, though. The Kafka folks are trying to make a play toward replacing all of the distributed messaging systems out there. But it does seem a bit complex on the consumer side, and there are probably a few footguns here for newbies to Kafka. [1]
[1] https://cwiki.apache.org/confluence/plugins/servlet/mobile?c...
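For illustration, a minimal sketch of such a custom partitioner with the official Java client (the class name and wiring are my own; note that newer Java clients already default to a "sticky" partitioner for null-key records, which batches to one partition at a time and is close to the behavior the article describes):

    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Hypothetical example: pick a uniformly random partition per record, so
    // uncoordinated producers don't all pile onto the same few partitions.
    public class RandomPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

Wired in via props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RandomPartitioner.class.getName()). With very few messages this only balances in expectation, which is the probability caveat raised further down the thread.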
The Kafka protocol is a distributed write-ahead log. If you want a job queue, you need to build something on top of that; it’s a pretty low-level primitive.
Until you hit scale, the database you're already using is fine. If that's Postgres, look up SELECT ... FOR UPDATE SKIP LOCKED. The major convenience here, aside from operational simplicity, is transactional task enqueueing.
For hosted options, SQS or Google Cloud Tasks. Google's approach is push-based (as opposed to pull-based) and is far and away easier to use than any other queueing system.
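For what it's worth, a minimal sketch of the SKIP LOCKED pattern over plain JDBC, assuming a hypothetical jobs(id, status, payload) table:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PgJobQueue {
        // Claim one pending job, process it, and mark it done, all in a single
        // transaction. SKIP LOCKED lets concurrent workers each grab a
        // different row instead of blocking on the same one.
        static void workOnce(Connection conn) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement claim = conn.prepareStatement(
                    "SELECT id, payload FROM jobs WHERE status = 'pending' " +
                    "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED");
                 ResultSet rs = claim.executeQuery()) {
                if (!rs.next()) { conn.rollback(); return; } // queue is empty
                long id = rs.getLong("id");
                process(rs.getString("payload")); // your handler (hypothetical)
                try (PreparedStatement done = conn.prepareStatement(
                        "UPDATE jobs SET status = 'done' WHERE id = ?")) {
                    done.setLong(1, id);
                    done.executeUpdate();
                }
                conn.commit(); // claim and completion become visible atomically
            } catch (SQLException | RuntimeException e) {
                conn.rollback();
                throw e;
            }
        }

        static void process(String payload) { /* do the work */ }
    }

The transactional enqueueing mentioned above is the flip side of the same trick: INSERT the job in the same transaction as the business change that spawned it, and you can never enqueue work for state that didn't commit.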
SQS, Azure Service Bus, RabbitMQ, ActiveMQ, Qpid, etc.: any message broker that provides the competing-consumers pattern. Though I’ll say, having managed many of these message brokers myself, it’s definitely better to pay for a managed service. They’re a nightmare when you start running into problems.
We use RabbitMQ, and workers simply pull whatever is next in the queue after they finish processing their previous jobs. I’ve never witnessed jobs piling up for a single consumer.
Kafka with a different partitioner would have worked fine. The problem was that the web workers loaded up the same partition. Randomising the chosen partition would have removed, or at least alleviated, the stated problem.
Has anyone used Redpanda? I stumbled upon it when researching streaming; it claims to be Kafka-compatible but higher-performance and easier to manage. I haven't tried it myself, but I'm interested to hear if anyone else has experience.
We built an infrastructure with about 6 microservices and Kafka as the main message queue (job queue).
The problem the author describes is 100% real, and if you scale up with enough workers it can turn out really badly.
While that was not the only issue we faced (others were more environment/project-language specific), we got to a point where we decided to switch from Kafka to RabbitMQ.
First time I've heard of KIP-932, and it looks very good. The two biggest issues IMO are finding a good Kafka client in the language you need (even for Ruby this is a challenge) and easy at-least-once workers.
You can over-partition and make at-least-once workers happen (if you have a good Kafka client), or you can use an HTTP gateway and give up safe at-least-once delivery. Hopefully this will make it easier to build an at-least-once-style gateway that's easier to work with across a variety of languages. I know many have tried in the past, but not dropping messages is hard to do right.
For a small-load queueing system, I had great success with Apache ActiveMQ back in the day. I designed and implemented a system with the goal of triggering SMS for paid content. This was in 2012.
Ultimately, the system was fast enough that the telco company emailed us and asked to slow down our requests because their API was not keeping up.
In short: we had two Apache Camel-based apps: one to look at the database for the paid-content schedule and queue up the messages (phone number and content), and another for triggering the telco company's API.
Having never actually used this platform before, does anybody know why they named it Kafka, given all the horrible connotations?
Per Wiktionary, Kafkaesque: [1]
1. "Marked by a senseless, disorienting, often menacing complexity."
2. "Marked by surreal distortion and often a sense of looming danger."
3. "In the manner of something written by Franz Kafka." (like the software language was written by Franz Kafka)
Example: Metamorphosis intro: "One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections. The bedding was hardly able to cover it and seemed ready to slide off any moment. His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked." [2]
[1] Wiktionary, Kafkaesque: https://en.wiktionary.org/wiki/Kafkaesque
[2] Gutenberg, Metamorphosis: https://www.gutenberg.org/cache/epub/5200/pg5200.txt
What that post describes (all work going to one or a few workers) in practice doesn't really happen if you properly randomize the ID of the item/task (e.g. just use a random UUID) when inserting it into Kafka.
With that (and sharding based on that ID/value), all your consumers/workers will get an equal share of messages/tasks.
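To make that concrete, a sketch with the Java producer (broker address, topic name, and payload are placeholders); the random UUID key feeds the default hash partitioner:

    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class RandomKeyProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // A random UUID key makes the default partitioner hash each
                // record independently of which producer sent it.
                producer.send(new ProducerRecord<>("tasks",
                        UUID.randomUUID().toString(), "{\"job\": 42}"));
            }
        }
    }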
Both the post and, seemingly, the general theme of the comments here are trashing the choice of Kafka for low volume.
Interestingly, both ignore other valid reasons/requirements that make Kafka a perfectly good choice despite low volume, e.g.:
- multiple different consumers/workers consuming same messages at their own pace
- needing to rewind/replay messages
- guarantee that all messages related to specific user (think bank transactions in book example of CQRS) will be handled by one pod/consumer, and in consistent order
- needing to chain async processing
And I'm probably forgetting a bunch of other use cases.
And yes, even with good sharding, a mix of small/quick tasks and big/long ones can still lead to non-optimal situations where the small/quick work waits for a bigger task to finish.
However, if you have other valid reasons to use Kafka, and it's just this mix of small and big tasks that's making you hesitant... IMHO it's still worth trying Kafka.
Between using bigger buckets (instead of fetching 1, fetch more items/messages and handle the work asynchronously with threads, etc.) and Kafka automatically redistributing shards/partitions when some workers are slow... you might be surprised that it just works.
And sure, you might need to create more than one topic (e.g. light, medium, heavy) so your light work doesn't need to wait for heavier work.
Finally, I still didn't see anyone mention the actual deal-breakers for Kafka.
Off the top of my head, a big one is that there is no guarantee of an item/message being processed only once, even without you manually rewinding/reprocessing anything.
It's possible (and common) to have situations where a worker picks up a message from Kafka, processes it (writes/materializes/updates something), and, when it's about to commit the Kafka offset (effectively marking the message as really done), discovers that Kafka has already rebalanced the partitions and another pod now owns that particular partition.
So if you can't model the items/messages or the rest of the system in a way that can handle such things (say, with versioning you might be able to just ignore/skip work if you know the underlying materialized data/storage already incorporates it, or maybe the whole thing is fine with INSERT ... ON DUPLICATE KEY UPDATE), then Kafka is probably not the right solution.
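To make the versioning idea concrete, a minimal sketch of an idempotent write using MySQL-style INSERT ... ON DUPLICATE KEY UPDATE (the materialized table and the message fields are hypothetical):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class IdempotentWriter {
        // Applying the same message twice leaves the row unchanged, and the
        // version check keeps an older redelivery from clobbering newer data,
        // so a duplicate after a rebalance is harmless.
        static void apply(Connection conn, String id, long version, String data)
                throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO materialized (id, version, data) VALUES (?, ?, ?) " +
                    "ON DUPLICATE KEY UPDATE " +
                    "data = IF(VALUES(version) > version, VALUES(data), data), " +
                    "version = GREATEST(version, VALUES(version))")) {
                ps.setString(1, id);
                ps.setLong(2, version);
                ps.setString(3, data);
                ps.executeUpdate();
            }
        }
    }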
You say:
> What that post describes (all work going to one or a few workers) in practice doesn't really happen if you properly randomize the ID of the item/task (e.g. just use a random UUID) when inserting it into Kafka.
I would love to be wrong about this, but I don't _think_ this changes things. When you have few enough messages, you can still get unlucky and randomly choose the "wrong" partitions. To me, it's a fundamental probability thing: if you roll the dice enough times, it all evens out (high enough message volume), but this article is about what happens when you _don't_ roll the dice enough times.
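One way to settle it is to simulate the dice rolls; a tiny sketch (partition and message counts picked arbitrarily):

    import java.util.concurrent.ThreadLocalRandom;

    public class PartitionSkew {
        public static void main(String[] args) {
            int partitions = 16;
            int messages = 20; // low volume: skew is likely
            int[] counts = new int[partitions];
            for (int i = 0; i < messages; i++) {
                // a random key is effectively a random partition
                counts[ThreadLocalRandom.current().nextInt(partitions)]++;
            }
            int max = 0, empty = 0;
            for (int c : counts) {
                max = Math.max(max, c);
                if (c == 0) empty++;
            }
            System.out.printf("busiest partition got %d of %d messages; %d partitions got none%n",
                    max, messages, empty);
        }
    }

Runs where the busiest partition gets several times its fair share while some partitions sit empty are routine at 20 messages; at 20,000 the counts flatten out, which is exactly the point about volume.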
The other thing that's a PITA with Kafka is fail/retry.
If you want to continue processing other/newer items/messages (and usually you do), you need to commit the Kafka topic offset, which leaves you to figure out what to do with the failed item/message.
One simple thing is just re-inserting it into the same topic (at the end). If it was a transient error, that could be enough.
Instead of the same topic, you can also insert it into a separate failedX Kafka topic (and have that topic processed by a cron-like scheduled task).
And if you need things like progressive backoff before attempting reprocessing, you likely want to push failed items into something else.
While that could be another task system/setup where you can specify how many reprocessing attempts to make, how much time to wait before the next attempt, etc., often it's enough to have a simple DB table.
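A minimal sketch of the failed-topic variant with the Java client (topic names and the handler are invented; config trimmed to the essentials):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RetryingWorker {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "workers");
            c.put("enable.auto.commit", "false");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                consumer.subscribe(List.of("tasks"));
                while (true) {
                    for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                        try {
                            handle(rec.value()); // your processing (hypothetical)
                        } catch (Exception e) {
                            // Park the failure so the partition keeps moving;
                            // something else drains tasks.failed later.
                            producer.send(new ProducerRecord<>("tasks.failed", rec.key(), rec.value()));
                        }
                    }
                    consumer.commitSync(); // advance past successes and parked failures alike
                }
            }
        }

        static void handle(String payload) { /* do the work */ }
    }

Progressive backoff is where this pattern runs out of road: a Kafka topic has no per-message visibility delay, hence the suggestion to park retries in a DB table instead.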
https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Have I used (not necessarily decided on) Kafka in every single company/project for the last 8-9 years? Yes.
Was it the optimal choice for all of those? No.
Was it downright wrong or just added for weird reasons? Also no, not even a single time - it's just kinda ubiquitous.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...
Especially for low levels of load, that doesn't require that the dispatcher and consumer are written in the same language.
https://docs.nats.io/nats-concepts/overview/compare-nats
We have been using it in this application for half a decade now with no serious issues. I don't understand why it doesn't get more popular attention.
* Redis streams
* Redis lists (this is what Celery uses when Redis backend is configured)
* RabbitMQ
* ZeroMQ
RabbitMQ or AWS SQS are probably good choices.
Also give a shoutout to Beanstalkd (https://beanstalkd.github.io/)
> Note: when Queues for Kafka (KIP-932) becomes a thing, a lot of these concerns go away. I look forward to it!
Seems like a good name for a high-volume distributed log that deletes based on retention, not after consumption.