item 20828981

Introduction to Event-Driven Architectures with RabbitMQ

175 points | nicolasjudalet | 6 years ago | blog.theodo.com

53 comments

[+] Kiro | 6 years ago
OT but anyone else hate debugging architectures like this compared to a monolith where you can just follow one big stack trace from start to end?
[+] alexandercrohde | 6 years ago
100%

Every time I work somewhere I have to play shepherd and ask the very basics:

- Who's monitoring queue uptime, setting alerts if it goes down, waking up in the middle of the night to fix and patch it, and setting it up in all test environments?

- Have you thought about all the new problems that might happen: the queue sending to dead endpoints, circular queue problems, the queue being restarted somehow (e.g. on deploys) and losing messages?

- If the app fails post-queue, without surfacing the message to the user, do you have a plan to ensure somebody in engineering sees and fixes that error? And then goes back and remediates the broken request(s)?

- Have you prepared code/logs to do distributed tracing?

- If there's a dispute a week from now whether Joe didn't get an email because of a problem BEFORE or AFTER the queue, will you be able to tell from the logs?

Many powerful engineering abstractions (threads, async, services) require one notch more engineering talent and allow for all sorts of new failure paths. The tradeoff must be taken very seriously. Most places I have worked at adopted complexity too soon.

[+] cloverich | 6 years ago
I think it comes down to taking debugging, monitoring, and alerting seriously. I doubt any team would be better off with a distributed system if they don't spend a serious amount of time setting up logging, monitoring, and tracing, making the system easy to debug, and having automated alerts. It's what separates a great coder from a good engineer, and if you don't have the latter on hand, you should steer far away from building systems like this.
[+] LeonM | 6 years ago
The timing of this article couldn't have been more painful.

I'm currently working on solving a problem with an insanely overcomplicated setup (for the task at hand) that was built by another engineer who has since left the company.

It's a cluster of 3 virtualised machines running Docker Swarm, where a RabbitMQ instance ties 40+ worker pods together. Once every 5 days or so the connection between RabbitMQ and some of the worker containers stops working, causing the worker to crash and the queued message to be lost.

We are talking like 5 layers of virtualization and/or abstraction. It's impossible to debug. I honestly don't know how to explain this to my customer.
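For the "worker crashes and the queue message is lost" failure specifically, the usual RabbitMQ mitigation is manual acknowledgements: the broker redelivers anything that was never acked. Below is a minimal stdlib sketch of that at-least-once loop, with `queue.Queue` standing in for the broker; the names are made up for illustration, and with pika the equivalent would be consuming with `auto_ack=False` and calling `basic_ack` on success.

```python
import queue

broker = queue.Queue()  # stands in for the RabbitMQ queue/channel


def consume(handler):
    """At-least-once delivery: a message leaves the queue only after the
    handler finishes. A worker that dies mid-handler leaves the message
    unacked, so the broker redelivers it instead of losing it."""
    while not broker.empty():
        message = broker.get()
        try:
            handler(message)
        except Exception:
            broker.put(message)  # "nack": requeue for redelivery
            raise
        # reaching this point is the "ack": only now is the message gone
```

Whether redelivery or dead-lettering is the right policy depends on whether the handler is idempotent; at-least-once means duplicates are possible.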

[+] a-priori | 6 years ago
This isn't a sign that one architecture is better than the other. It only indicates that the tooling for debugging monoliths is more mature than it is for distributed architectures (including event-driven ones) when it comes to tracing causality and data flow through the whole system.

The fact that there are great advantages to distributed architectures, yet they're harder to work with, is a signal that there's a market opportunity for better tooling around them.

[+] awinder | 6 years ago
I don't hate debugging in evented/microservice codebases, but it is more complicated to get a good debugging setup in place with distributed tracing. After you do that, though, you're often in a better place with logging, so it's tradeoffs. It's all tradeoffs, in fact: I'll take a monolith where it's appropriate for the space, and I'll take evented/microservice architectures when the monolith pattern is no longer a good investment for the problem space. All these things exist for good reasons; don't be an emotional investor [of your own time/opinions] ;-).
[+] afrodc_ | 6 years ago
From my experience, debugging monoliths is much easier. Microservices are a bit harder (mostly networking). Event-driven architecture with microservices is harder still. My company's brief stint into building one out turned into a disaster of trying to isolate wtf was going on whenever there was some issue.

Now that could have been the engineers'/architects' fault, but unless my team grows, I would hate having to deal with it again.

[+] johnbrodie | 6 years ago
I've found it very helpful in distributed environments to ensure that you give the incoming request an extra "Request ID"-style header. Make sure this header is propagated everywhere and logged everywhere; it makes debugging much, much easier. It still gets hard at scale, which is why tools like New Relic have various "distributed tracing" features now.
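The header-propagation idea above can be sketched with Python's stdlib `contextvars`. The names here (`request_id_var`, `incoming`, `outgoing_headers`) and the `X-Request-ID` header are illustrative, not any particular framework's API:

```python
import contextvars
import logging
import uuid

# Holds the current request's ID for whatever task/thread is handling it.
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestIdFilter(logging.Filter):
    """Stamp the current request ID onto every log record, so logs from
    all services can be grepped/joined by one ID."""
    def filter(self, record):
        record.request_id = request_id_var.get() or "-"
        return True


def incoming(headers: dict) -> str:
    """At the edge: reuse the caller's X-Request-ID, or mint one."""
    rid = headers.get("X-Request-ID") or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid


def outgoing_headers() -> dict:
    """On every downstream call or queue message: forward the same ID."""
    return {"X-Request-ID": request_id_var.get()}
```

The key property is that the ID is minted once at the edge and then only ever forwarded, never regenerated, so one value threads through every service's logs.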
[+] davidw | 6 years ago
I forget the exact quote, but it goes something like "the best distributed system is one you didn't build". Obviously there are reasons to build them that become inevitable once a system needs to grow beyond a certain size, but if you can put that off for a while, it's a win.
[+] rswail | 6 years ago
We just finished implementing an event-sourced/CQRS solution using Kafka.

Yes, you need to have monitoring etc., but the testing and debugging were substantially easier because of how we broke down the "services". For each major entity (or aggregate) there was a service that subscribed to a number of command and event topics and produced output to an event topic for that aggregate.

We had FSMs for each of the aggregates, documenting the effect of each potential command or external event: the change of state and the resulting events and/or commands.

The architectural constraint meant that the infrastructure was the same for each aggregate, and the testing of each was independent and could be mocked easily using topic producers.

So as opposed to a "Big Ball of Mud", we have a monitorable infrastructure (Kafka + alerts/stats sent to statsd, integrated with AWS CloudWatch), and we have individual aggregate processors that respond only to incoming commands or events and have defined outputs for each potential incoming command/event.

Much much easier to design, develop and debug.

But the trick is at the start (like anything else). Analyzing the domain to determine the entities/aggregates, modelling the externally generated commands, modelling the FSMs for each aggregate etc.
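A sketch of what one aggregate's FSM table might look like; the "order" states, commands, and emitted events below are invented for illustration, not taken from the parent's system:

```python
# Hypothetical "order" aggregate.
# (current_state, incoming command/event) -> (next_state, emitted events)
TRANSITIONS = {
    ("new",    "PlaceOrder"):  ("placed",   ["OrderPlaced"]),
    ("placed", "PaymentOk"):   ("paid",     ["OrderPaid"]),
    ("placed", "CancelOrder"): ("canceled", ["OrderCanceled"]),
    ("paid",   "ShipOrder"):   ("shipped",  ["OrderShipped"]),
}


def handle(state: str, message: str):
    """Apply one command/event to the aggregate. Unknown inputs are
    rejected explicitly rather than silently ignored, which is what makes
    the behaviour easy to document, test, and debug."""
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(f"{message!r} is not valid in state {state!r}")
```

Because the whole behaviour is one data table, each aggregate can be tested exhaustively in isolation by enumerating (state, input) pairs, with topic producers mocked away.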

[+] LargeWu | 6 years ago
It can be very hard to know the state of a piece of information within your system unless you fanatically log its state at all times. And even then, you might have to look through lots of distributed logs to find that. Of course, logging the state of the world at every step also creates tons of noise.

At a previous gig, we had a processing workflow where an entity might move through any of 20 processes tied together through queues. It was very, very difficult to track down problems with things dropping out, things getting stuck, etc.

[+] segmondy | 6 years ago
Not really. It's tougher, but if you're bringing queues into your workflow, it should be because you need them: you are having to deal with scaling, resiliency, parallelism, faster response times, etc. The benefit then outweighs the debugging overhead. But at that point you have a distributed system, and if you have a distributed system, you should have centralized logging and distributed tracing too so you can debug more easily. Without those, you are going to go through the pain.
[+] orthecreedence | 6 years ago
To your point...even if you do want eventing, you can do it inside of a monolith and not involve the network at all, which will save a good amount of pain.
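A minimal sketch of that idea: a synchronous in-process event bus, so publishing and handling happen in one process and one stack trace (class and event names are illustrative):

```python
from collections import defaultdict
from typing import Any, Callable

class EventBus:
    """Synchronous in-process pub/sub: no broker, no network hop, and a
    debugger steps straight from publisher into handler."""

    def __init__(self):
        self._handlers: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Any) -> None:
        # Handlers run inline; an exception surfaces in the caller's
        # stack trace instead of disappearing into a remote consumer.
        for handler in self._handlers[event_type]:
            handler(payload)
```

The decoupling (publishers don't know their subscribers) is preserved, so if you later genuinely need a broker, handlers can be moved behind one without rewriting the call sites.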
[+] BubRoss | 6 years ago
I think it depends on the isolation of each part. If the incoming data is fine, then you have the problem narrowed down already. If it isn't, then you have to figure out why it was sent in the first place, probably by detecting something wrong in the sender at run time so you can see what led to the bad message.
[+] shriek | 6 years ago
Of course, but you're often trading debuggability for scalability whenever you pick an evented architecture. I do think tools can help here, but it's just the nature of this architecture: you're reacting to 'events' instead of following a procedural order.
[+] nicolasjudalet | 6 years ago
My experience with debugging with RabbitMQ was not bad, but we had to do a bit of work to ease the process. For example, we configured error queues where processors publish the error log and the input message, which is enough to reproduce the bug and understand what happened.
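A rough sketch of that error-queue pattern, with `queue.Queue` standing in for the RabbitMQ queues and a made-up `process` function. The point is that the error payload carries both the traceback and the original input message, which is enough to replay the exact failing request later:

```python
import json
import queue
import traceback

work_q = queue.Queue()   # stands in for the RabbitMQ work queue
error_q = queue.Queue()  # stands in for a dedicated error queue


def process(message: dict) -> int:
    """Hypothetical processor; rejects obviously bad input."""
    if message.get("amount", 0) < 0:
        raise ValueError("negative amount")
    return message["amount"] * 2


def consume_one():
    message = work_q.get()
    try:
        return process(message)
    except Exception:
        # Publish the error log plus the full input message, so an
        # engineer can both diagnose the bug and replay the request.
        error_q.put(json.dumps({
            "error": traceback.format_exc(),
            "input": message,
        }))
    finally:
        work_q.task_done()
```

In a real deployment the error queue would need its own alerting, otherwise failed messages pile up unseen, which is exactly the "who watches the queue" problem raised upthread.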
[+] callmeal | 6 years ago
Not really. Sometimes it's even easier than debugging monoliths: just log the incoming message and log the outgoing message. Then you can narrow down the problem just as quickly as in a monolith.
[+] jjtheblunt | 6 years ago
Comparably OT, but I bet this is the essence of why biology and cancer are tough: signals sent via cytokines and hormones _are_ messages, IIRC.
[+] bigredhdl | 6 years ago
That is definitely the balancing act with events. They are great in certain spots, but can become a big pain if they are used for everything.
[+] vemv | 6 years ago
I would point out that engineers' feelings ("hating" something) are quite orthogonal to business needs or technical challenges.

Probably every technical decision maker pushing for microservices knows the perils of distributed debugging. They have some weight - just some.

[+] diminoten | 6 years ago
Never been a problem for me, and I absolutely hate the maintenance nightmare that comes with one big repo.

I feel like the more shortsighted/incentivized by sheer work volume a person is, the more they're into monorepos...

[+] danatcofo | 6 years ago
Account has been suspended, link now broken. /shrug
[+] geodel | 6 years ago
You shouldn't let this one event drive whole conversation.
[+] nudpiedo | 6 years ago
maybe their AI powered firewall detected that it got an attack known as HN-DDoS.
[+] longcommonname | 6 years ago
Queue-driven systems really fascinate me. Coming from a chemical engineering background, I can't help but see parallels to fluid dynamics and all the difficult math that comes from its analysis.

I've always wanted to create some type of monitoring system that displays the entire system in that vein and then models it using control theory.

Has anybody seen a project that does this?