
To Message Bus or Not: Distributed Systems Design (2017)

290 points | gk1 | 6 years ago | netlify.com

99 comments

[+] cpitman|6 years ago|reply
In enterprise environments, I more often see overuse of message brokers. I am all for targeted use of messaging systems, but what I usually see is an all-or-nothing approach where, as soon as a team starts using messaging, they use it for _all_ interprocess communication.

This has some very painful downsides.

First, traditional message brokers (i.e. queues, topics, persistence, etc.) introduce a potential bottleneck for all communication, and on top of that they are _slow_ compared to just using the network (i.e. HTTP, with no middleman). I've had customers who start breaking apart their monoliths and replacing them with microservices, and since the monolith used messaging they use messaging between microservices as well. Well, the brokers were already scaled to capacity for the old system, so going from "1" monolith service to "10" microservices means deploying 10x more messaging infrastructure. Traditional message brokers also don't scale all that well horizontally, so that 10x is probably more like 20x.

Second, performance tuning becomes harder for people to understand. A core tenet of "lean" is to reduce the number/length of queues in a system to get more predictable latency and to make it easier to diagnose bottlenecks. But everyone always does the opposite with messaging: massive queues everywhere. By the time the alarms go off, multiple queues are backing up and no one is clear on where the actual bottleneck is.

What I would like to see more of is _strategic_ use of messaging. For example, a work queue to scale out processing, but the workers use HTTP or some other synchronous method to call downstream services. Limiting messaging to very few points in the process still gives a lot of the async benefits to the original client, while limiting how often requests hit the broker and making it easier to diagnose performance problems.
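The "work queue at the edge, synchronous calls downstream" shape described above can be sketched in Python. `call_downstream` is a stand-in for an HTTP request to a downstream service; all names here are illustrative, not a real API:

```python
import queue
import threading

def call_downstream(item):
    # Hypothetical synchronous downstream call (e.g. an HTTP request).
    return item * 2

def run_workers(items, n_workers=4):
    """One queue hop at the edge; everything after it is synchronous."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:                  # sentinel: shut down this worker
                work.task_done()
                return
            result = call_downstream(item)    # synchronous call, no second queue
            with lock:
                results.append(result)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for item in items:
        work.put(item)
    for _ in threads:                         # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results
```

Because only the entry point is queued, a backlog can appear in exactly one place, which is the diagnosability argument the comment makes.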

[+] 8fingerlouie|6 years ago|reply
I work for a company that uses message queues for just about everything that moves between (different) hardware, and we've yet to experience problems from a properly configured queue setup.

We've used message queues for 30+ years, and I would say things like reliable messaging/guaranteed delivery, exactly-once delivery, and not having to care about byte order have saved us a lot of trouble over the years.

We move data between mainframes, pc servers, and whatever system the customers use, and once the security setup is in place, the queues deliver data reliably.

Without message queues, each application would have to know the details about the receiver, or each client application would have to know the details about the sender, and the implementation would be very specific.

With message queues, you can replace part of your implementation, i.e. migrate it from the mainframe to a unix daemon, and nobody will notice.

That being said, it does come at a heavy cost, and as I started out by saying, we only use it for cross-platform messaging. Anything done locally communicates via "something else": files, pipes, shared buffers, or similar. Also, anything idempotent, if it uses queues at all, does not use "exactly once" delivery, which has a large performance overhead.

[+] nizmow|6 years ago|reply
An intermediary message broker can give you a HUGE benefit though. A touch more robustness (a service can disappear for a short while and hopefully nobody even notices), you get load balancing "for free" (with carefully designed services), and if you're careful with your framework you can keep failed messages sitting around for you to analyse or even replay later. Most likely with HTTP you're going to need some kind of middle man anyway (load balancers, proxies, etc). Though I agree with your conclusion, and I think it mostly stands for almost all tools. Know your problems and apply the appropriate solutions.
[+] jillesvangurp|6 years ago|reply
Yes. For simple setups, I'm usually biased towards having fewer moving parts. Queues sound nice until you start thinking about the edge cases with respect to errors, lost messages, rolling restarts, etc. In short, in pretty much every system with queues I've ever seen, this was a headache. There are good solutions for this, of course, but they require extra work: checking dead letter queues, devops related to monitoring and managing the queues and dependent systems, etc. I find lots of teams cut corners here and then have to deal with the inevitable fallout of things going wrong.

An alternative that is usually simple to implement is some kind of feed-based system that you poll periodically. This can be dynamic or even static files. Feeds may be cached. Dependent systems can simply poll the feed. If they go down, they catch up when they come back. If the feed system goes down, they catch up when it comes back up. It also scales to large numbers of subscribers. You can do streaming or non-streaming variants of this. You can even poll from a browser.
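The feed idea above can be sketched in a few lines: the producer appends events, and each consumer polls from its own cursor, so catching up after downtime falls out naturally. Class and method names here are illustrative:

```python
class Feed:
    """Append-only event feed; could equally be static files behind a cache."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_since(self, cursor):
        """Return events after `cursor` plus the new cursor position."""
        return self.events[cursor:], len(self.events)

class Consumer:
    """Polls the feed periodically; its cursor survives between polls."""
    def __init__(self, feed):
        self.feed = feed
        self.cursor = 0
        self.seen = []

    def poll(self):
        events, self.cursor = self.feed.read_since(self.cursor)
        self.seen.extend(events)
```

If the consumer is down while events accumulate, its next `poll()` simply returns everything since its last cursor, which is the catch-up behavior the comment describes.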

[+] chapium|6 years ago|reply
> Second, performance tuning becomes harder for people to understand. A core tenet of "lean" is to reduce the number/length of queues in a system to get more predictable latency and to make it easier to diagnose bottlenecks. But everyone always does the opposite with messaging: massive queues everywhere. By the time the alarms go off, multiple queues are backing up and no one is clear on where the actual bottleneck is.

This is avoidable, however. It's trivial to aggregate queue depth across nodes and report it to an alerting service.
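As a sketch of how little code that aggregation takes: sum per-queue depths across broker nodes and flag the queues over a threshold. In practice the per-node depths would come from the broker's management API; here they are passed in as plain dicts:

```python
from collections import Counter

def aggregate(per_node_depths):
    """Sum per-queue depth across broker nodes.

    per_node_depths: iterable of {queue_name: depth} dicts, one per node.
    """
    total = Counter()
    for depths in per_node_depths:
        total.update(depths)
    return dict(total)

def check_backlog(depths, threshold):
    """Return the total backlog and the queues exceeding the threshold."""
    total = sum(depths.values())
    alerts = sorted(q for q, d in depths.items() if d > threshold)
    return total, alerts
```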

[+] lulf|6 years ago|reply
I recommend looking at a standard messaging protocol like AMQP 1.0, which will allow you to implement the request-response pattern in an efficient manner without message brokers: either “direct” client-server as with HTTP, or via an intermediary such as the Apache Qpid Dispatch Router.

With the dispatch router, you can use pattern matching to specify if addresses should be routed to a broker or be routed directly to a particular service.

This way you can get the semantics that best fit your use case.

[+] nine_k|6 years ago|reply
Queues and messages are a great way to structure application logic (see Erlang, Go, etc). But not all message exchange needs to run through the same (heavyweight) central broker.

In an app I recently worked on, we used SNS for the heavy lifting (e.g. it is replicated across regions), then a few smaller local queues, including plain in-process, in-memory queues. It worked well.

Though yes, you want throttling, backpressure, and good instrumentation.

[+] _pmf_|6 years ago|reply
> First, traditional message brokers (ie queues, topics, persistence, etc) introduce a potential bottleneck for all communication, and on top of that they are _slow_ compared to just using the network (ie HTTP, with no middleman).

How is multiplexing supposed to work with plain HTTP without middleman? One of the main uses for message queues is efficient fan-out (m + n connections instead of m x n).
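The connection-count arithmetic behind this point is easy to make concrete: with m producers and n consumers, direct point-to-point fan-out needs every producer talking to every consumer, while a broker in the middle needs only one connection per participant:

```python
def connections(m, n):
    """Connection counts for m producers fanning out to n consumers."""
    return {
        "point_to_point": m * n,  # every producer connects to every consumer
        "via_broker": m + n,      # everyone connects only to the broker
    }
```

For m = n = 10 that is 100 connections versus 20, and the gap widens quadratically as the system grows.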

[+] redact207|6 years ago|reply
This article talks about the benefit of a message bus being pub/sub. This definitely helps to decouple the internals of apps, which in turn makes things easier to maintain.

There are many other benefits to using a message bus, and it's a better fit in general to distributed systems. The hard part is understanding how to structure apps to make them compatible with messaging.

Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week. You can't expect a reply at that point because the service that sent it may no longer be running.

If you're on Node, try https://node-ts.github.io/bus/

It's a service bus that manages the complexity of rabbit/sqs/whatever, so you can build simple message handlers and complex orchestration workflows

[+] ledneb|6 years ago|reply
> Imagine sending out a command to the bus and not knowing when it'll get processed

I would love to hear how others are correlating output with commands in such architectures - especially if they can be displayed to users as a direct result of a command. Always felt like I'm missing a thing or two.

It seems the choices are:

* Manage work across domains (sagas, two phase commit, rpc)

* Loosen requirements (At some point in the future, stuff might happen. It may not be related to your command. Deal with it.)

* Correlation and SLAs (correlate outcomes with commands, have clients wait a fixed period while collecting correlating outcomes)

Is that a fair summary of where we can go? Any recommended reading?
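The third option in the list above (correlation plus an SLA) is commonly done by tagging each command with a correlation id, publishing outcomes that carry the same id, and having the client wait up to a fixed period for a matching outcome. A minimal sketch, with all names illustrative:

```python
import uuid

class OutcomeStore:
    """Collects outcomes keyed by correlation id."""
    def __init__(self):
        self.outcomes = {}

    def record(self, correlation_id, outcome):
        self.outcomes[correlation_id] = outcome

    def wait_for(self, correlation_id, default=None):
        # Real code would block with a timeout (the SLA's "fixed period");
        # here we just look up what has arrived so far.
        return self.outcomes.get(correlation_id, default)

def send_command(bus, payload):
    """Publish a command with a fresh correlation id and return the id."""
    correlation_id = str(uuid.uuid4())
    bus.append({"id": correlation_id, "payload": payload})
    return correlation_id
```

The client keeps the returned id, and the UI can show "pending" until an outcome with that id arrives or the SLA window expires.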

[+] jandrese|6 years ago|reply
The downside of message buses is that they can be really hard to debug for third parties. I run into this with systemd, where a process will hang at startup waiting for a particular signal, and I have to hunt down all of the locations that signal could have been generated from to see which one failed. Much, much harder than the old system, where you would find the line you stalled on and then read up a few lines to see what it was trying to do.
[+] opportune|6 years ago|reply
>Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week. You can't expect a reply at that point because the service that sent it may no longer be running.

That's because this is missing the point of a message bus. Namely, communication with a message bus is meant to be one-way; there is not supposed to be a response. If you want a timely response, you should set up a load-balanced, autoscaling microservice that maybe hooks into some large backend store.

[+] chrisseaton|6 years ago|reply
> Imagine sending out a command to the bus and not knowing when it'll get processed. Perhaps under a second, perhaps next week.

Isn’t that the situation on any network?

[+] protomyth|6 years ago|reply
> Imagine sending out a command to the bus and not knowing when it'll get processed

That's one of the things about tuple spaces[0,1,2] that was described in the book Mirror Worlds[3]. The author gives a lengthy description of a tuple's life. In practice, it really depended on what pattern you used to process the tuples.

0) https://software-carpentry.org/blog/2011/03/tuple-spaces-or-...

1) https://en.wikipedia.org/wiki/Tuple_space

2) http://wiki.c2.com/?TupleSpace

3) https://www.amazon.com/Mirror-Worlds-Software-Universe-Shoeb...

[+] chrischen|6 years ago|reply
How does Bus compare to Celery?
[+] lugg|6 years ago|reply
Whether to use pub/sub or rest is entirely domain dependent.

So, who cares? It's an implementation detail. Just do whatever makes sense. The most sane systems I've seen make use of both.

> In an architecture driven by a message bus it allows more ubiquitous access to data.

Please stop mistaking design details for architecture. Lots of things allow more ubiquitous access to data.

Talking about this stuff in this way is just going to wind up with you replacing MySQL with Kafka and never actually solving any real problems with your contexts/domains.

[+] closeparen|6 years ago|reply
What does architecture mean to you? The set of components and how they’ll interact is what it seems to mean in my world.
[+] hinkley|6 years ago|reply
I've been burned by both but I think I still lean away from event based systems.

REST is more likely to result in a partially ordered sequence of actions from one actor. The user does action A, which causes things B and C to happen. In event-driven systems, C may happen before B, or when the user does multiple things, B2 is now more likely to happen before C1.

IME, fanout problems are much easier to diagnose in sequential (eg, REST-based) systems. If only because event systems historically didn't come with cause-effect auditing facilities or back pressure baked in.

[+] rubyn00bie|6 years ago|reply
Totally agreeing with you.

<rant>

A message bus is for doing things like fanning out fucking messages, not acting as a fucking data store. Situations that need a message bus are so fucking few and far between I can't believe they posted this article.

Message bus is the fastest way to distribute your monolith... I don't know why people are always trying to get rid of REST like it's the problem (it's not; your code/infrastructure is)...

</rant>

[+] scraegg|6 years ago|reply
The first sentence already triggers me.

No, it's not hard. Like most topics in software engineering there are 50+ years of pretty successful development, backed by science, backed by software from the 70s, backed by all seniority levels of engineers.

The problem nowadays is that people don't want to learn the proven stuff. People want to learn the new hip framework. Learning Kubernetes and React.js is much more fun than learning how actual routing in TCP/IP works, right?

The problem is that something can only be hip for 1-5 years. But really stable complex systems like a distributed network can only be developed in a time frame of 5-10 years. Therefore most of the hip stuff is unfinished or already flawed in its basic architecture. And usually if you switch from one hip thing to another, you get the same McDonald's software menus, just with differently flavored ketchup and marketing.

So if you feel something is hard, it might be because you are not dealing with the actual problem. For instance you might think about adopting Kafka, and that's fine. But be aware that email is shipping more messages than Kafka, and it's been doing so for longer than Kafka has existed.

For instance, topologies: there is no "point-to-point". There's star, mesh, bus, etc. See here: https://en.wikipedia.org/wiki/Network_topology

If you don't know your topology it might be star or mesh. But it's still a topology or a mix of multiple topologies.

And if you develop a distributed system you really need to think about how your network should be structured. Then, once you know which topology fits your use case, you can go and figure out how that topology works and what its drawbacks are. Star networks (like k8s), for instance, are easy to set up but have a clear bottleneck in the center. A bus (like Kafka) is like a street: it works fine until it is full, and there are sadly some activity patterns where an overloaded bus will cascade and the overload is visible even weeks later (although you have already reduced the traffic), if you don't reset it completely from time to time.

It's not magic. You can look all of it up on wikipedia if you know the keywords. Also there is not a single "good" solution. It depends always on how well the rest of your system integrates with the pros and cons of your topology choice. And if you use multiple topologies in parallel you have a complexity overhead at some point, which is why working in big corps is usually so slow.

[+] geezerjay|6 years ago|reply
> The problem nowadays is that people don't want to learn the proven stuff. People want to learn the new hip framework. Learning Kubernetes and React.js is much more fun than learning how actual routing in TCP/IP works, right?

That's an awfully short-sighted comparison.

There are far more job offers for deploying and managing systems with Kubernetes and for developing front-ends with React than there are for developing TCP/IP infrastructure. It's fun to earn a living and enjoy the privileges of a high salary, and the odds of getting that by studying solved problems that nowadays just work are not that high.

[+] ww520|6 years ago|reply
Besides the distributed case, a message bus is invaluable in building crash-proof applications. I've used a lightweight message bus within the app itself for better crash recovery on long-running tasks. E.g. you need to generate lots of emails and send them out. I would create a small command object for each email and queue it to the message bus, and let the message handler handle email generation one by one. The lightweight command objects can be queued up quickly and the UI can return to the user right away. The slow email generation can run in the background. In the event of a system crash, the message bus will be restored automatically and continue with the remaining tasks in the queue.
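The crash-recovery pattern described above can be sketched with a durable journal: each command is appended to a file, acknowledged in a second file once handled, and after a restart the unacknowledged commands are simply replayed. The file layout and names here are illustrative, not the commenter's actual implementation:

```python
import json
import os

def enqueue(path, command):
    """Append one small command object (a dict with an 'id') to the journal."""
    with open(path, "a") as f:
        f.write(json.dumps(command) + "\n")

def acknowledge(done_path, command_id):
    """Record that a command has been fully processed."""
    with open(done_path, "a") as f:
        f.write(command_id + "\n")

def pending(path, done_path):
    """Commands enqueued but never acknowledged, i.e. work to resume after a crash."""
    done = set()
    if os.path.exists(done_path):
        with open(done_path) as f:
            done = {line.strip() for line in f}
    commands = []
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                cmd = json.loads(line)
                if cmd["id"] not in done:
                    commands.append(cmd)
    return commands
```

On startup the worker just iterates `pending(...)`, which is what makes the queue "restored automatically" after a crash.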
[+] sk5t|6 years ago|reply
What if your process to generate the initial messages crashes halfway through?
[+] phoe-krk|6 years ago|reply
It's weird for me to not see even a single mention of any BEAM languages, such as Erlang or Elixir. They are naturally distributed and have discovery, networking and messaging built into the virtual machine itself.
[+] hestefisk|6 years ago|reply
My main gripe with service buses is that they can be very hard to deploy and test automatically. At least for traditional ‘middleware’ like WebSphere/MQ, Weblogic etc. It potentially adds another monolith to your architecture, which, whilst fancy, may not be required. Using ZeroMQ or similar ‘lightweight’ tech could be a better choice for small teams as it is easy to integrate into containers and testable.
[+] kbouck|6 years ago|reply
I'm curious to hear others' opinions on using a database as a message queue. One issue I have with most message brokers is that you cannot perform ad-hoc queries or manipulations of the queue.

When you've got a backlog situation, it's nice to be able to answer questions like: how many of these queued messages are for/from customer/partner X?
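A sketch of the database-as-queue idea: with messages in a table, the backlog is queryable with plain SQL. SQLite is used here only to keep the demo self-contained; a production version would typically use something like Postgres, where workers can claim rows safely with SELECT ... FOR UPDATE SKIP LOCKED. Schema and names are illustrative:

```python
import sqlite3

def make_queue(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS messages (
        id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        payload TEXT NOT NULL,
        claimed INTEGER NOT NULL DEFAULT 0)""")

def insert_message(conn, customer, payload):
    conn.execute("INSERT INTO messages (customer, payload) VALUES (?, ?)",
                 (customer, payload))

def backlog_by_customer(conn):
    # The ad-hoc question from the comment: how many queued messages
    # belong to each customer/partner?
    rows = conn.execute("""SELECT customer, COUNT(*) FROM messages
                           WHERE claimed = 0 GROUP BY customer""")
    return dict(rows.fetchall())
```

The same table supports ad-hoc manipulations too (reprioritizing, deleting a customer's stuck messages), which is exactly what most brokers make hard.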

[+] ngrilly|6 years ago|reply
It must be noted that service meshes bring some of the message bus advantages to the point-to-point architecture.
[+] opportune|6 years ago|reply
Somewhat, but it's a different use case. I'd say the main difference is that message buses are better for non-urgent, hopefully-soon-but-definitely-eventually workloads, while creating something like a message bus between services within a mesh will impose more urgency.

Anecdotally, I've heard that extremely chatty services (like something that approximates a message bus) are considered poor mesh design, but I don't really understand why that is the case so long as the service architecture is kept clean.

[+] carc|6 years ago|reply
A couple of big reasons I don't like message buses (but I'm open to hearing why I'm wrong):

- All "requests" will succeed even if malformed

- Couples producers/consumers to the bus (unless you put in the extra work to wrap it in a very simple service)

[+] GordonS|6 years ago|reply
> -All "requests" will succeed even if malformed

Took me a minute to grok this, but I think you mean, message bus clients can send 'any old shit' and the broker will happily queue it?

A lot of the more 'managed' (abstracted) clients in the statically typed world deal with this by structuring queues by the object type/interface. A bad actor could probably circumvent this if they really wanted, but for normal usage it will at least ensure that objects sent are of the expected type.

This means that in real-world usage, this isn't an issue.

[+] opportune|6 years ago|reply
The fix to the first point is client-side filtering. That can be hard to set up if you don't enforce it to begin with. I'm currently dealing with poor/no client-side validation at work, and it does become a complete fucking pain dealing with poor upstream schema hygiene.

I don't see why the second point is a bad thing at all. Message buses are meant to provide a non-urgent abstraction layer that simplifies the relationship between producers and consumers (if you need urgency, don't use a message bus). It simplifies load-balancing the consumer stage and doesn't impose any limits on producers (many of which can be clients released into the wild).

[+] EdwardDiego|6 years ago|reply
> All "requests" will succeed even if malformed

A good pattern is to use a schema to verify you're sending valid data in the producer (at the very least in a unit test, if not at runtime). For example, Confluent's Kafka distribution ships with a schema registry and painless Avro-based Kafka producers.

If overhead in the producer is stopping you from validating each record at creation time, then you can validate downstream, so long as your consumers agree to only consume the validated data.
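A minimal sketch of "validate in the producer": check each record against a schema before it ever reaches the broker. A real setup would use Avro plus a schema registry, as mentioned above; this hand-rolled check, with an illustrative schema, just shows where the validation step sits:

```python
# Illustrative schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "email": str}

def validate(record, schema=SCHEMA):
    """Reject records with missing/extra fields or wrong types."""
    if set(record) != set(schema):
        raise ValueError(f"fields {sorted(record)} != {sorted(schema)}")
    for field, expected in schema.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} is not a {expected.__name__}")
    return record

def produce(queue, record):
    # Malformed records fail here, at creation time, not in a consumer.
    queue.append(validate(record))
```

With validation at this point, a "request" can no longer succeed even if malformed, which addresses the objection quoted above.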

[+] GordonS|6 years ago|reply
> Couples producers/consumers to the bus (unless you put in the extra work to wrap it in a very simple service)

This is an area where adding a simple layer of abstraction is a good thing. Most times, all you want is something like a `Send<T>`/`Send(obj)` method, which really will translate fairly universally across message buses.

I used such a thing recently, switching out RabbitMQ for MQTT - all I really had to do was point to a new implementation of the interface!
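The thin abstraction described here can be sketched as a one-method interface: application code depends only on `send(obj)`, so swapping RabbitMQ for MQTT means swapping the implementation behind it. The class names are illustrative:

```python
from typing import Protocol

class Sender(Protocol):
    """The one-method interface application code depends on."""
    def send(self, obj) -> None: ...

class InMemorySender:
    """Test double; a RabbitMqSender or MqttSender would match the same shape."""
    def __init__(self):
        self.sent = []

    def send(self, obj) -> None:
        self.sent.append(obj)

def notify(sender: Sender, event) -> None:
    # Application code: depends on the interface, not the transport.
    sender.send(event)
```

Switching transports then really is just pointing at a new implementation of the interface, as in the RabbitMQ-to-MQTT swap above.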

[+] ww520|6 years ago|reply
- The consumer needs to validate the incoming requests. Just like any input sources, the incoming data cannot be trusted until validated.

- You need coupling somewhere anyway. Moving the coupling to the bus lets the consumer and producer evolve more freely.

[+] bvrmn|6 years ago|reply
Message bus proponents never build large systems. The only sane way is to be pretty specific about data flow, with a clear mental model shared between developers and ops. A message bus hides producer-consumer relations, and with multiple endpoints it's very hard to reason about the system as a whole.
[+] useful|6 years ago|reply
Some of the most complex systems in the world use a message bus. The car and aviation industries come to mind.

Being able to see a message between systems and have an analyst figure out what's wrong and what it needs to be is an awesome way for a developer to actually code.

[+] redact207|6 years ago|reply
Why do you say that? I'm someone who's worked on a few very large message-based systems in finance for years (100s of devs working together in multiple countries). I found that messaging and workflow orchestration were the things that helped keep things sane.
[+] bunderbunder|6 years ago|reply
The largest system I ever worked on used a message bus for almost everything, and it worked great.

That said, 100% agreed about sanity. There was a lot of time spent on ensuring that all the developers (and there were a lot of them) had a decent shared understanding of everything, and a very deep understanding of anything they worked on or interacted with directly. Meetings to review the current way data flowed and look for ways to clean it up were a regular occurrence, and ops was deeply involved in everything.

The articles that say or imply, "Don't worry about it, because it's easy for everything to talk or listen to everything else!" strike me as having probably been written by someone whose message bus system hasn't been around for all that long.

[+] HillRat|6 years ago|reply
Absolutely not true, I’m afraid. I’ve architected or worked with message bus architectures in large-scale medical payments processing systems, as a central broker for a national-level telecommunications company, multiple Fortune 500 consumer-oriented companies with complex sales workflows, and so on. This doesn’t obviate the need to be clear on producer-consumer relations, as you point out, but that’s a governance issue, not a technical architecture problem, and one that is at least as severe with API-only architectures.
[+] Diederich|6 years ago|reply
I've worked directly with enormous systems that you have likely personally used that were not only highly coherent and well designed, but also message oriented in architecture.
[+] opportune|6 years ago|reply
Are you aware that LinkedIn developed and uses Apache Kafka for critical production services?

I also work on a large production service that uses a message bus. There are certainly amateurish uses/implementations of message buses out there but definitely not all of them