
Ask HN: Why do message queue-based architectures seem less popular now?

389 points | alexhutcheson | 1 year ago

In the late 2000s and early 2010s, I remember seeing lots of hype around building distributed systems using message queues (e.g. Amazon SQS, RabbitMQ, ZeroMQ, etc.). A lot of companies had blog posts highlighting their use of message queues for asynchronous communication between nodes, and IIRC the official AWS design recommendations at the time pushed SQS pretty heavily.

Now, I almost never see engineering blog posts or HN posts highlighting use of message queues. I see occasional content related to Kafka, but nothing like the hype that message queues used to have.

What changed? Possible theories I'm aware of:

* Redis tackled most of the use-case, plus caching, so it no longer made sense to pay the operational cost of running a separate message broker. Kafka picked up the really high-scale applications.

* Databases (broadly defined) got a lot better at handling high scale, so system designers moved more of the "transient" application state into the main data stores.

* We collectively realized that message queue-based architectures don't work as well as we hoped, so we build most things in other ways now.

* The technology just got mature enough that it's not exciting to write about, but it's still really widely used.

If people have experience designing or implementing greenfield systems based on message queues, I'd be curious to hear about it. I'd also be interested in understanding any war stories or pain points people have had from using message queues in production systems.

364 comments

[+] hn_throwaway_99|1 year ago|reply
I like a lot of the answers, but something else I'd add: lots of "popular" architectures from the late 00s and early 2010s have fallen by the wayside because people realized "You're not Google. Your company will never be Google."

That is, there was a big desire around that time period to "build it how the big successful companies built it." But since then, a lot of us have realized that complexity isn't necessary for 99% of companies. When you couple that with hardware and standard databases getting much better, there are just fewer and fewer companies who need all of these "scalability tricks".

My bar for "Is there a reason we can't just do this all in Postgres?" is much, much higher than it was a decade ago.
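The "just do it in Postgres" bar includes job queues. A minimal sketch of a database-backed queue (table name, payloads, and statuses hypothetical): in Postgres you would claim jobs with `SELECT ... FOR UPDATE SKIP LOCKED`; stdlib sqlite3 is used here only to keep the sketch runnable.

```python
import sqlite3

# A minimal database-backed job queue. In Postgres you'd claim rows with
# SELECT ... FOR UPDATE SKIP LOCKED; sqlite3 here just keeps the sketch
# self-contained. Table and payload names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'queued')"
)
conn.execute("INSERT INTO jobs (payload) VALUES ('send-welcome-email'), ('resize-avatar')")

def claim_next_job(conn):
    """Claim the oldest queued job, or return None if the queue is empty."""
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    # Re-checking status in the WHERE clause guards against another worker
    # claiming the same row between our SELECT and UPDATE.
    cur = conn.execute(
        "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'queued'",
        (row[0],),
    )
    return row if cur.rowcount == 1 else None

job = claim_next_job(conn)
```

For a single modest workload this plus a cron-style worker loop replaces a broker entirely, and the jobs live in the same transactional store as the rest of the data.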

[+] ithkuil|1 year ago|reply
We also have much, much bigger single machines available for reasonable money. So a lot of reasonable workloads can fit on one machine now that used to require a small cluster.
[+] peoplefromibiza|1 year ago|reply
> "You're not Google. Your company will never be Google."

I'm not sure people realize this now more than then. I was there back then and we surely knew we would never be Google hence we didn't need to "scale" the same way they did.

Nowadays every project I start begins with a meeting where a document is presented describing the architecture we are going to implement, using AWS of course, because "auto-scale", right? And 9 times out of 10 it includes CloudFront, which is a CDN, and I don't really understand why this app I am developing, which is basically an API gateway with some customization that made Nginx slightly less than ideal (but still perfect for the job), and that averages 5 rps, needs a CDN in front of it... (or AWS, or auto-scaling, or AWS Lambda, for that matter)

[+] steve1977|1 year ago|reply
> because people realized "You're not Google. Your company will never be Google."

Is that also why almost no one is using microservices and Kubernetes?

[+] nosefrog|1 year ago|reply
To be fair, I worked on multiple projects removing queues at Google, so it's more than just that.
[+] w10-1|1 year ago|reply
> You're not Google. Your company will never be Google

True, but the CTO comes from twitter/meta/google/some open-source big-data project, the director loves databases, etc.

So we have 40-100 people managing queues with events driven from database journals.

Everyone sees how and why it evolved that way. No one has the skill or political capital to change it. And we spend most of our time on maintenance tasked as "upgrades", in a culture with "a strong role for devops".

[+] XCSme|1 year ago|reply
So true, people optimize prematurely for when they'll have 100M monthly users, when they have no product yet and likely won't reach even 100k users for many years (a load that can run on a single $100/month dedicated machine...).
[+] bigiain|1 year ago|reply
My perhaps overly cynical view is that Message Queue architecture and blogging was all about "Resume Driven Development" - where almost everybody doing it was unlikely to ever need to scale past what a simple monolith could support running on a single laptop. All the same people who were building nightmare microservice disasters requiring tens of thousands of dollars a month of AWS services.

These days all those people who prioritise career building technical feats over solving actual business problems in pragmatic ways - they're all hyping and blogging about AI, with similar results for the companies they (allegedly) are working for: https://www.theregister.com/2024/06/12/survey_ai_projects/

[+] abeppu|1 year ago|reply
I'm sure this happens. But ... most websites I load up have like a dozen things trying to gather data, whether for tracking, visitor analytics, observability, etc. Every time I view a page, multiple new unimportant messages are being sent out, and presumably processed asynchronously. Every time I order something, after I get the order confirmation page, I get an email and possibly a text message, both of which should presumably happen asynchronously, and possibly passing through the hands of more than one SaaS product en route. So given what seems to be the large volume of async messages, in varying/spiking volumes, possibly interacting with 3rd party services which will sometimes experience outages ... I gotta expect that a bunch of these systems are solving "actual business problems" of separating out work that can be done later/elsewhere, can fail and be retried without causing disruptions, etc in order to ensure the work that must happen immediately is protected.
[+] robertlagrant|1 year ago|reply
> My perhaps overly cynical view is that Message Queue architecture and blogging was all about "Resume Driven Development" - where almost everybody doing it was unlikely to ever need to scale past what a simple monolith could support running on a single laptop. All the same people who were building nightmare microservice disasters requiring tens of thousands of dollars a month of AWS services.

Yes, that is cynical. People have been building architectures off MQ for a much longer time than microservices have been around. Lots of corporates have used JMS for a long time now.

[+] tuckerconnelly|1 year ago|reply
I can offer one data point. This is from purely startup-based experience (seed to Series A).

A while ago I moved from microservices to monolith because they were too complicated and had a lot of duplicated code. Without microservices there's less need for a message queue.

For async stuff, I used RabbitMQ for one project, but it just felt...old and over-architected? And a lot of the tooling around it (celery) just wasn't as good as the modern stuff built around redis (bullmq).

For multi-step, DAG-style processes, I prefer to KISS and just do that all in a single, large job if I can, or break it into a small number of jobs.

If I REALLY needed a DAG thing, there are tools out there that are specifically built for that (Airflow). But I hear it's difficult to debug issues in them, so I would avoid them at almost any cost.

I have run into scaling issues with redis, because their multi-node architectures are just ridiculously over-complicated, and so I stick with single-node. But sharding by hand is fine for me, and works well.
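"Sharding by hand" over single-node Redis instances can be as small as a stable hash routing each key to one node. A sketch under stated assumptions (plain dicts stand in for Redis clients so it runs without a server; node count and key names are hypothetical):

```python
import zlib

# Hand-rolled sharding: route each key to one of N single-node stores.
# Plain dicts stand in for redis.Redis(...) clients so this runs anywhere.
nodes = [{}, {}, {}]

def node_for(key: str):
    # crc32 is stable across processes and interpreter runs, unlike Python's
    # builtin hash(), which is randomized per process.
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

def put(key, value):
    node_for(key)[key] = value

def get(key):
    return node_for(key).get(key)

put("user:42", "alice")
```

The caveat with modulo sharding is that adding a node remaps most keys; consistent hashing avoids that, but for a fixed small fleet the simple version is often enough.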

[+] democracy|1 year ago|reply
I think this: "* The technology just got mature enough that it's not exciting to write about, but it's still really widely used."

Messaging-based architecture is very popular

[+] casper14|1 year ago|reply
Agreed. It has become a tool just like any other. Just like nobody writes about how they use virtual machines in the cloud anymore.
[+] bwhaley|1 year ago|reply
This is the answer. I'd wager that almost every distributed system that runs at scale uses message queues in some capacity.
[+] spike021|1 year ago|reply
I think that's definitely part of it. Two roles ago my team was invested heavily in SQS and Kinesis. The role before that and my current role are pretty heavy with Kafka still.

I wouldn't call their use super interesting, though.

The last role was simply because the business required as close to real time message processing as possible for billing analytics. But if I tell someone that, it's not incredibly interesting unless I start diving into messages per second and such.

[+] berniedurfee|1 year ago|reply
Yep. It’s a really nice architecture for lots of use cases where it fits.

Every new idea goes through the same cycle of overuse until it finds its niches.

[+] ipsum2|1 year ago|reply
Yeah, this is the most likely reason.

It used to be popular to post about rewriting Angular apps in React. Now everyone just uses React (or they write posts about rewriting React to Vue or whatever the flavor of the month is).

[+] burutthrow1234|1 year ago|reply
I think "message queues" have become pretty commoditized. You can buy Confluent or RedPanda or MSK as a service and never have to administer Kafka yourself.

Change Data Capture (CDC) has also gotten really good and mainstream. It's relatively easy to write your data to a RDBMS and then capture the change data and propagate it to other systems. This pattern means people aren't writing about Kafka, for instance, because the message queue is just the backbone that the CDC system uses to relay messages.

These architectures definitely still exist and they mostly satisfy organizational constraints - if you have a write-once, read-many queue like Kafka you're exposing an API to other parts of the organization. A lot of companies use this pattern to shuffle data between different teams.

A small team owning a lot of microservices feels like resume-driven development. But in companies with 100+ engineers it makes sense.

[+] morbicer|1 year ago|reply
This. You don't need Google scale for Kafka to make sense. Just a few acquisitions and a need to fan out some data to multiple products. For example, you have SCIM hooks that write to Kafka so all parts of the org can consume the updates. Or customer provisioning.
[+] busterarm|1 year ago|reply
Going to give the unpopular answer. Queues, Streams and Pub/Sub are poorly understood concepts by most engineers. They don't know when they need them, don't know how to use them properly and choose to use them for the wrong things. I still work with all of the above (SQS/SNS/RabbitMQ/Kafka/Google Pub/Sub).

I work at a company that only hires the best and brightest engineers from the top 3-4 schools in North America and for almost every engineer here this is their first job.

My engineers have done crazy things like:

- Try to queue up tens of thousands of 100 MB messages in RabbitMQ instantaneously and wonder why it blows up.

- Send significantly oversized messages in RabbitMQ in general despite all of the warnings saying not to do this

- Start new projects in 2024 on the latest RabbitMQ version and try to use classic queues

- Create quorum queues without replication policies or doing anything at all to make them HA.

- Expose clusters on the internet with the admin user being guest/guest.

- The most senior architect in the org declared a new architecture pattern, held an organization-wide meeting and demo to extol the new virtues/pattern of ... sticking messages into a queue and then creating a backchannel so that a second consumer could process those queued messages on demand, out of order (and making it no longer a queue). And nobody except me said "why are you putting messages that you need to process out of order into a queue?"...and the 'pattern' caught on!

- Use Kafka as a basic message queue

- Send data from a central datacenter to globally distributed datacenters with a global lock on the object and all operations on it until each target DC confirms it has received the updated object. Insist that this process is asynchronous, because the data was sent with AJAX requests.

As it turns out, people don't really need to do all that great of a job and we still get by. So tools get misused, overused and underused.

In the places where it's being used well, you probably just don't hear about it.

Edit: I forgot to list something significant. There are over 30 microservices in our org for every 1 engineer. Please kill me. I would literally rather Kurt Cobain myself than work at another organization that has thousands of microservices in a gigantic monorepo.

[+] zug_zug|1 year ago|reply
To second this theory with some real-world data: a few startups ago I worked at a NY Scala shop that used tons of Akka (an event-driven actor/queueing framework for Scala). Why? Because a manager at his prior job "saved the company" when "everything was slow" by doing this, so he mandated it at the new job.

What were we doing that required queueing? Not much; we showed people's 401ks on a website, let them adjust their asset mix, and sent out hundreds of emails per day. As you would expect, people almost never log into their 401k website.

A year or so after working there I realized our servers had been misconfigured all along and basically had 0 concurrency for web-requests (and we hadn't noticed because 2 production servers had always served all the traffic we needed). Eventually we just ripped out the Akka because it was unnecessary and added unnecessary complexity.

In the last month this company raised another funding round with a cash-out option, apparently their value has gone up and they are still doing well!

[+] roncesvalles|1 year ago|reply
Two observations:

1. There doesn't seem to be a design review process before people start implementing these things. Devs should make a design document, host a meeting where everyone reads it, and have a healthy debate before implementing stuff. If you have bright people but not-so-bright outcomes, it's because there is no avenue for those who know their shit to speak up and influence things.

2. I will always rather hire a 2-5 YOE dev from a no-name school over a new grad from a top 5 school. The amount that software engineers learn and grow in the first 5 years of their career is immense and possibly more than the rest of their career combined.

[+] scary-size|1 year ago|reply
That doesn’t sound like hiring only the „brightest“.
[+] SwissCoreWizard|1 year ago|reply
> I work at a company that only hires the best and brightest engineers from the top 3-4 schools in North America and for almost every engineer here this is their first job.

Are you making this up to make your claim sound more extravagant? Even Citadel is not that picky so unless you work at OpenAI/Anthropic I'm calling nonsense.

[+] janstice|1 year ago|reply
To be fair, thousands of microservices in a monorepo sounds better than thousands of microservices, each with its own repository (but sub-repo’ing common libraries so everything jams up in everything else).
[+] EVa5I7bHFq9mnYK|1 year ago|reply
Because JS is single threaded, and an average programmer today is too dumb to learn any language beyond JS, you must build everything into "microservices".
[+] spacebanana7|1 year ago|reply
Raw intelligence is of limited help when working in an area that requires lots of rapidly depreciating, domain-specific knowledge.

Relatively few graduates know their way around the Snowflake API, or the art of making an electron app not perform terribly. Even sending an email on the modern internet can require a lot of intuition and hidden knowledge.

> There's over 30 microservices in our org to every 1 engineer

I wonder if this is a factor in making onboarding of new hires difficult?

[+] nailer|1 year ago|reply
> - Start new projects in 2024 on the latest RabbitMQ version and try to use classic queues

I am out of the loop here. Did Rabbit break classic queues?

[+] bsaul|1 year ago|reply
« Use Kafka as a basic message queue ». Since I'm guilty of that (I use Kafka as the backbone for pretty much any service-to-service communication, under a « job » API), I wonder why you think that's wrong.
[+] angarg12|1 year ago|reply
Queues are a tool in your distributed system toolbox. When it's suitable it works wonderfully (typical caveats apply).

If your perception is indeed correct, I'd attribute it to your 3rd point. People usually write blog posts about new shiny stuff.

I personally use queues in my design all the time, particularly to transfer data between different systems with higher decoupling. The only pain I have ever experienced was when an upstream system backfilled 7 days of data, which clogged our queues with old requests. Running normally it would have taken over 100 hours to process all the data, while massively increasing the latency of fresh data. The solution was to manually purge the queue, and manually backfill the most recent missing data.

Even if you need to be careful around unbounded queue sizes, I still believe they are a great tool.
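One mitigation for the backfill-clogs-the-queue failure mode described above is to have the consumer drop messages older than a freshness cutoff, processing fresh data first and backfilling stale data out of band. A minimal sketch (cutoff, message shape, and return values are hypothetical):

```python
import time

# Drop messages older than a freshness cutoff instead of grinding through a
# days-old backlog at the expense of fresh data. All names are hypothetical.
MAX_AGE_SECONDS = 3600  # anything older than an hour is considered stale

def handle(message, now=None):
    now = time.time() if now is None else now
    if now - message["enqueued_at"] > MAX_AGE_SECONDS:
        return "dropped"    # stale: skip here, backfill out of band if needed
    return "processed"      # fresh: do the real work

fresh = {"enqueued_at": time.time()}
stale = {"enqueued_at": time.time() - 7 * 24 * 3600}  # 7 days old
```

This turns the manual "purge the queue, then backfill" recovery into standing policy, at the cost of deciding up front which messages are safe to skip.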

[+] rossdavidh|1 year ago|reply
Message queues have moved past the "Peak of Inflated Expectations" and the "Trough of Disillusionment" into the "Slope of Enlightenment", perhaps even the "Plateau of Productivity".

https://en.wikipedia.org/wiki/Gartner_hype_cycle

[+] wildzzz|1 year ago|reply
I used zmq to build our application for testing new hardware. Everything comes in via serial every second, so I made a basic front end for that serial bus that sends telemetry out over zmq and waits for any commands coming in, using pub/sub. The front-end piece can sit forever sending out telemetry no one ever hears, or I can hook up a logger, debug terminal, data plotter, or a factory test GUI that runs scripts, or all at once. Dealing with COM ports on Windows is a huge hassle, so zmq lets me abstract those annoyances away as a network socket. Other engineers can develop their own applications custom to their needs, and they have. Our old application tried to shove all of this functionality into one massive Python script, along with trying to update a complicated Tk GUI every second with new telem. The thing was buckling under its own weight and would actually corrupt the serial data coming in if you were running a heavy script in another thread. I know there are ways to turn a serial port into a network socket, but I wanted something that didn't require a server or client to be online for the other to function.
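The fan-out pattern described above is a few lines with pyzmq. A sketch under stated assumptions: the topic name and `inproc://` endpoint are hypothetical (chosen so the example runs in one process); a real setup would bind a `tcp://` endpoint so loggers and GUIs in other processes can attach.

```python
import time
import zmq

# One publisher fans telemetry out to any number of optional subscribers
# (logger, plotter, factory test GUI). Topic and endpoint are hypothetical;
# a real setup would bind e.g. tcp://*:5556 instead of inproc.
ctx = zmq.Context.instance()

pub = ctx.socket(zmq.PUB)
pub.bind("inproc://telemetry")

sub = ctx.socket(zmq.SUB)
sub.connect("inproc://telemetry")
sub.setsockopt(zmq.SUBSCRIBE, b"telemetry")  # subscribe by topic prefix

time.sleep(0.2)  # PUB/SUB "slow joiner": let the subscription register first

# A PUB socket with no matching subscribers silently drops messages, which is
# what lets the serial front end sit forever sending telemetry no one hears.
pub.send_multipart([b"telemetry", b"bus_volts=3.3"])

topic, payload = (None, None)
if sub.poll(2000):  # wait up to 2 s for delivery rather than blocking forever
    topic, payload = sub.recv_multipart()
```

The silent-drop behavior of PUB is the design choice that decouples publisher and subscribers: neither side needs the other to be online.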
[+] pm90|1 year ago|reply
They have become boring, so there are fewer blogs about them.

That's good. The documentation for e.g. RabbitMQ is much better and very helpful. People use it as a workhorse just like they use Postgres/MySQL. There's not much surprising behavior to architect around, etc.

I love boring software.

[+] robertclaus|1 year ago|reply
I find it super interesting that the comments calling out "obviously we all still use message queues and workers, we just don't write about them" are buried half way down the comments section by arguments about Microservices and practical scalability. A junior engineer reading the responses could definitely get the false impression that they shouldn't offload heavy computation from their web servers to workers at all anymore.
[+] therealdrag0|1 year ago|reply
It’s blowing my mind. Queues are the most bog-standard architectural component.
[+] vishnugupta|1 year ago|reply
Speaking from my own experience, message queues haven't disappeared as much as they have been abstracted away. For example, enqueue to SQS + poll became "invoke a serverless process". There is a message queue in there somewhere; it's just not as exposed.

Or take AWS SNS which IMO is one level of abstraction higher than SQS. It became so feature rich that it can practically replace SQS.

What might have disappeared are those use cases which used queues to handle short bursts of peak traffic.

Also, streaming has become very reliable tech, so a class of use cases that used queues as a streaming pipe have migrated to streaming proper.

[+] akira2501|1 year ago|reply
> There is a message queue in there somewhere just that it’s not as exposed.

It's still pretty exposed. You can set redelivery timeouts and connect a dead letter queue to lambda functions. Even just the lambda invoke API is obviously just a funny looking endpoint for adding messages to the queue.

> as much as have been abstracted away

In AWS in particular, into EventBridge, which further extends them with state machines. They've become the very mature cornerstone of many technologies.
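The redelivery-timeout and dead-letter-queue wiring mentioned above lives in a couple of SQS queue attributes. A sketch of their shape (queue names, ARN, and counts are hypothetical; in real use this dict is passed to boto3's `create_queue`/`set_queue_attributes`):

```python
import json

# Hypothetical SQS attributes wiring up redelivery behavior: a visibility
# timeout plus a RedrivePolicy pointing at a dead-letter queue. The ARN and
# numbers are made up for illustration.
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:orders-dlq",
    "maxReceiveCount": "5",  # after 5 failed receives, the message moves to the DLQ
}

attributes = {
    "VisibilityTimeout": "30",  # seconds a consumer has before SQS redelivers
    "RedrivePolicy": json.dumps(redrive_policy),  # note: a JSON *string*, not a dict
}
```

Point the DLQ at a Lambda (or just alarm on its depth) and the "exposed" queue semantics the parent comment describes are all there, even in a serverless setup.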

[+] 1oooqooq|1 year ago|reply
good point. tcp was justified, at some point around its birth, as a queueing component. today nobody dares to think of it as such.
[+] ilaksh|1 year ago|reply
I think it's simple: async runtimes/modules in JavaScript/Node, Python (asyncio), and Rust. Those basically handle message queues for you transparently inside of a single application. You end up writing "async" and "await" all over the place, but that's all you need to do to get your MVP out. And it will work fine until you really become popular. And then that can actually still work without external queues etc. if you can scale horizontally such as giving each tenant their own container and subdomain or something.

There are places where you need a queue just for basic synchronization, but you can use modules that are more convenient than external queues. And you can start testing your program without even doing that.

Actually async is being used a lot with Rust also, which can stretch that out to scale even farther with an individual server.

Without an async runtime or similar, you have to invent an internal async runtime, or use something like queues, because otherwise you are blocked waiting for IO.

You may still eventually end up with queues down the line if you have some large number of users, but that complexity is completely unnecessary for getting a system deployed towards the beginning.
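The in-process queue this comment alludes to is literally in the standard library: `asyncio.Queue` gives the same producer/consumer decoupling as an external broker, with no broker to operate. A minimal sketch (task names and the sentinel convention are hypothetical):

```python
import asyncio

# asyncio.Queue as an in-process stand-in for an external message queue:
# a producer enqueues work, a consumer drains it, both inside one event loop.
async def producer(queue):
    for i in range(3):
        await queue.put(f"task-{i}")  # enqueue without blocking the event loop
    await queue.put(None)             # sentinel: no more work

async def consumer(queue, results):
    while True:
        item = await queue.get()      # suspends here instead of busy-waiting
        if item is None:
            break
        results.append(item.upper())  # stand-in for real processing

async def main():
    queue = asyncio.Queue(maxsize=10)  # bounded, unlike many ad-hoc queues
    results = []
    await asyncio.gather(producer(queue), consumer(queue, results))
    return results

results = asyncio.run(main())
```

The trade-off is exactly the one the comment names: this works until you need durability or horizontal scale, at which point the external queue comes back.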

[+] leftbrainstrain|1 year ago|reply
To back up the story regarding async a bit, at least on the front end ... A long time ago in the 2000s, on front-end systems we'd have a server farm to handle client connections, since we did all rendering on the server at the time. On the heavyweight front end servers, we used threading with one TCP connection assigned to each thread. Threading was also less efficient (in Linux, at least) than it is now, so a large number of clients necessitated a large number of servers. When interfacing with external systems, standard protocols and/or file formats were preferred. Web services of some kind were starting to become popular, usually just when interfacing with external systems, since they used XML(SOAP) at the time and processing XML is computationally expensive. This was before Google V8 was released, so JavaScript was seen as sort of a (slow) toy language to do only minor DOM modifications, not do significant portions of the rendering. The general guidance was that anything like form validation done on the client side in JS was to be done only for slight efficiency gains and all application logic had to be done on the server. The release of NGINX to resolve the C10K problem, Google V8 to make JS run faster, and Node.js to scale front end systems for large numbers of idle TCP connections (C10k) all impacted this paradigm in the 2000s.

Internally, applications often used proprietary communication protocols, especially when interacting with internal queueing systems. For internal systems, businesses prefer data be retained and intact. At the time, clients still sometimes preferred systems be able to participate in distributed two-phase commit (XA), but I think that preference has faded a bit. When writing a program that services queues, you didn't need to worry about having a large number of threads or TCP connections -- you just pulled a request message from the request queue, processed the message, pushed a response onto the response queue, and moved on to the next request message. I'd argue that easing the strong preference for transactional integrity, the removal of the need for internal services to care about the C10k problem (async), and the need to retain developers that want to work with recent "cool" technologies reduced the driver for internal messaging solutions that guarantee durability and integrity of messages.

Also, AWS's certifications try to reflect how their services are used. The AWS Developer - Associate still covers SQS, so people are still using it, even if it isn't cool. At my last job I saw applications using RabbitMQ, too.

[+] memset|1 year ago|reply
It may be that lambdas (cloud functions, etc) have become more popular and supported on other platforms.

When you enqueue something, you eventually need to dequeue and process it. A lambda just does that in a single call. It also removes the need to run or scale a worker.

I think Kafka continues to be popular because it is used as a temporary data store, and there is a large ecosystem around ingesting from streams.

I personally use queues a lot and am building an open source SQS alternative. I wonder if an open source lambda replacement would be useful too. https://github.com/poundifdef/SmoothMQ

[+] weitendorf|1 year ago|reply
This is a big part of it IMO. When your downstream consumers can scale up and down quickly, you don’t necessarily need anything in the middle to smooth out load unless your workloads are especially spiky.

I think this also speaks to a related phenomenon where there are simply more tools and technologies you can buy or run "off the shelf" now. Back in the 2010s everybody was trying to roll their own super complex distributed systems. Nowadays you have a ton of options to pay for more or less polished products to handle that mess for you. No need for engineering meetups and technical blogs about tools that kinda-sorta work if you really know what you're doing - just pay Snowflake or Confluent and work on other problems.

[+] jackbauer24|1 year ago|reply
Regarding this issue, I have some observations of my own. I've noticed that systems based on queues, such as Kafka, AMQP, etc., are still very widespread, for example in vehicle networking, transaction systems, and so on. I recently encountered a customer deploying Kafka on AWS, with monthly spend on Kafka-related compute and storage exceeding $1 million. The cluster scale is huge, containing various system events, logs, etc. I've also seen customers building IoT platforms on Kafka; Kafka has become so central to the IoT platform that any problem can make the entire platform unavailable.

I personally have written over 80% of the code for Apache RocketMQ, and today I have created a new project, AutoMQ (https://github.com/AutoMQ/automq). At the same time, we also see that competition in this field is very fierce: Redpanda, Confluent, WarpStream, StreamNative, etc. are all projects built on the Kafka ecosystem.

Therefore, the architecture based on message queues has not become obsolete. A large part of the business has shifted into a streaming form. I think streaming and MQ are highly related: streaming leans more towards data flow, while MQ leans more towards individual messages.
[+] mrj|1 year ago|reply
People got excited about it as a pattern, but usually apps don't have that many things that really have to go in the background. And once you do, it becomes really hard to ensure transactional safety across that boundary. Usually that's work you want to do in a request in order to return a timely error to the client. So most jobs these days tend to be background things, pre-caching and moving bits around on cdns. But every single one of those comes with a cost and most of us don't really want a mess of background jobs or distributed tasks.

I just added a RabbitMQ-based worker to replace some jobs that Temporal.io was bad at (previous devs threw everything at it, but it's not really suited to high throughput things like email). I'd bet that Temporal took a chunk of the new greenfield apps mindshare though.

[+] liampulles|1 year ago|reply
"The technology just got mature enough that it's not exciting to write about, but it's still really widely used."

My money is on this. I think the simple use case of async communication, with simple pub/sub messaging, is hugely useful and not too hard to use.

We (as a Dev community) have just gotten over event sourcing, complex networks and building for unnecessary scale. I.e. we're past the hype cycle.

My team uses NATS for async pub/sub and synchronous request/response. It's a command-driven model, and we have a huge log table with all the messages we have sent. Schemas and usage of these messages are internal to our team, and messages are discarded from NATS after consumption. We do at-least-once delivery, and message handlers are expected to be idempotent.

We have had one or two issues with misconfiguration in NATS resulting in message replay or missed messages, but largely it has been very successful. And we were a 3 person dev team.

It's the same thing as Kubernetes in my mind - it works well if you keep to the bare essentials and don't try to be clever.
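The at-least-once plus idempotent-handler contract described above usually reduces to deduplicating on a message ID. A sketch under stated assumptions (the in-memory set, message shape, and return values are hypothetical; a real system would persist processed IDs, e.g. in a unique-keyed table, alongside the side effect):

```python
# At-least-once delivery means the broker may redeliver, so the handler
# deduplicates on a message ID. In-memory state is for illustration only;
# real systems persist processed IDs transactionally with the side effect.
processed_ids = set()
effects = []  # stand-in for real side effects (emails sent, charges made)

def handle(message):
    if message["id"] in processed_ids:
        return "duplicate-ignored"     # redelivery: safe no-op
    effects.append(message["body"])    # perform the side effect once
    processed_ids.add(message["id"])
    return "processed"

msg = {"id": "evt-123", "body": "charge-card"}
first = handle(msg)
second = handle(msg)  # simulated redelivery of the same message
```

This is the whole bargain: the broker promises "at least once", the handler promises "at most one effect", and together they get "exactly once" from the business's point of view.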

[+] m1keil|1 year ago|reply
I think it's all of the above.

In large enterprises, there is usually some sort of global message bus on top of Kafka, AWS Kinesis or similar.

In smaller shops, a dedicated message bus is over-engineering and can be avoided by using the DB or something like Redis. It is still a message queue, just without a dedicated platform.

[+] ryapric|1 year ago|reply
I think it's very much your last theory -- used everywhere but not as interesting to tell people about as it might have been a decade ago. Queues are now Boring Technology(tm), and that's a good thing.
[+] ehnto|1 year ago|reply
They aren't a general solution and don't really add much to your average application. But there are still instances where they make a lot of sense.

What I would need to see before bothering with a message queue architecture:

* High concurrency, atomic transactions

* Multiple stages of processing of a message required

* Traceability of process actions required

* Event triggers that will actually be used required

* Horizontal scaling actually the right choice

* Message queues can be the core architecture and not an add on to a Frankenstein API

Probably others, and yes you can achieve all of the above without message queues as the core architecture but the above is when I would think "I wonder if this system should be based on async message queues".

[+] grenbys|1 year ago|reply
My company relies heavily on Amazon SQS for background jobs. We use Redis as well, but it is hard to run at scale; hence anything critical goes to SQS by default. SQS usage is so ubiquitous I can't imagine anyone being interested in writing a blog post or presenting at a conference about it. Once you get used to SQS's specifics (more-than-once delivery, message size limit, client/server tooling, expiration settings, DLQs), I doubt there's anything that can beat it in terms of performance/reliability, unless you have the resources to run Redis/Kafka/etc. yourself. I would recommend searching for talks by Shopify eng folks on their experience, in particular from Kir (e.g. https://kirshatrov.com/posts/state-of-background-jobs).