Every time I work somewhere I have to play shepherd and ask the very basics:
- Who's monitoring queue uptime, setting alerts for when it goes down, waking up in the middle of the night to fix and patch it, and setting it up in all test environments?
- Have you thought about all the new problems that might happen: queue sending to dead endpoints, circular queue problem, queue being restarted somehow (e.g. deploys) and losing messages?
- If the app fails post-queue, without surfacing the message to the user, do you have a plan to ensure somebody in engineering sees and fixes that error? And then goes back and remediates the broken request(s)?
- Have you prepared code/logs to do distributed tracing?
- If there's a dispute a week from now whether Joe didn't get an email because of a problem BEFORE or AFTER the queue, will you be able to tell from the logs?
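One cheap way to answer that last "before or after the queue" question is to attach a correlation ID to every message and log it on both sides of the queue. A minimal sketch (the names and the list-as-broker are hypothetical stand-ins, not any particular queue's API):

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("email-pipeline")

def enqueue_email(queue, recipient, body):
    """Producer side: attach a correlation ID and log BEFORE the queue."""
    msg = {"id": str(uuid.uuid4()), "recipient": recipient, "body": body}
    log.info("enqueue id=%s recipient=%s", msg["id"], recipient)
    queue.append(json.dumps(msg))  # stand-in for a real broker publish
    return msg["id"]

def consume_email(queue, send):
    """Consumer side: log AFTER the queue, before and after the side effect."""
    msg = json.loads(queue.pop(0))
    log.info("dequeue id=%s", msg["id"])
    send(msg["recipient"], msg["body"])
    log.info("sent id=%s", msg["id"])
    return msg["id"]

# With both log lines in hand, "did Joe's email die before or after the
# queue?" becomes a grep for the correlation ID.
queue = []
sent = []
msg_id = enqueue_email(queue, "joe@example.com", "hello")
consume_email(queue, lambda recipient, body: sent.append(recipient))
```

If the ID shows up in the enqueue log but never in the dequeue log, the message died in or before the queue; if it shows up in both, the problem is downstream.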
Many powerful engineering abstractions (threads, async, services) require a notch more engineering talent and open up all sorts of new failure paths. The tradeoff must be taken very seriously. Most places I have worked at adopted complexity too soon.
I think it comes down to taking debugging, monitoring, and alerting seriously. I doubt any team would be better off with a distributed system if they don't spend a serious amount of time setting up logging, monitoring, and tracing, making the system easy to debug, having automated alerts, etc. It's what separates a great coder from a good engineer, and if you don't have the latter on hand, you should steer far away from building systems like this.
The timing of this article couldn't have been more painful.
I'm currently working on solving a problem with an insanely overcomplicated setup (for the task at hand) that was built by another engineer who has since left the company.
It's a cluster of 3 virtualised machines running Docker Swarm where a RabbitMQ instance ties 40+ worker pods together. Once every 5 days or so the connection between RabbitMQ and (some of) the worker containers stops working, causing the worker to crash and the queue message to be lost.
We are talking about 5 layers of virtualization and/or abstraction. It's impossible to debug. I honestly don't know how to explain this to my customer.
This isn't a sign that one architecture is better than the other. It only indicates that the tools are more mature for debugging monoliths than distributed architectures (including event-driven) for tracing causality and data flow through the whole system.
The fact that there are great advantages to distributed architectures, yet they're harder to work with, is a signal that there's a market opportunity for better tooling around them.
I don't hate debugging in evented/microservice codebases, but it is more complicated to get a good debugging setup in place with distributed tracing. After you do that, though, you're often in a better place with logging, so it's tradeoffs. It's all tradeoffs, in fact: I'll take a monolith where it's appropriate for the space, and I'll take evented/microservice architectures when the monolith pattern is no longer a good investment for the problem space. All these things exist for good reasons; don't be an emotional investor [of your own time/opinions] ;-).
From my experience, debugging monoliths is much easier. Microservices are a bit harder (mostly networking). Event-driven architecture with microservices is harder still. My company's brief stint into building one out turned into a disaster of trying to isolate wtf was going on whenever there was an issue.
Now that could have been the engineers/architects fault, but unless my team grows, I would hate having to deal with it again.
I've found it very helpful in distributed environments to ensure that you are giving the incoming request an extra "Request ID"-style header. Make sure this header is propagated everywhere, and logged everywhere. Makes debugging much, much easier. Still gets hard at scale, which is why tools like New Relic have various "distributed tracing" features now.
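The "Request ID" pattern above is mostly three small pieces: reuse the caller's ID if one arrived, propagate it on every outgoing call, and stamp it on every log line. A sketch under those assumptions (the header name and helper names are illustrative, not any framework's API):

```python
import logging
import uuid

REQUEST_ID_HEADER = "X-Request-ID"

def ensure_request_id(headers):
    """Reuse the caller's request ID if present, otherwise mint one."""
    rid = headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
    headers[REQUEST_ID_HEADER] = rid
    return rid

def outgoing_headers(rid):
    """Propagate the same ID on every downstream call."""
    return {REQUEST_ID_HEADER: rid}

def request_logger(rid):
    """Stamp every log line with the request ID so logs from different
    services can be joined on it later."""
    return logging.LoggerAdapter(logging.getLogger("svc"), {"request_id": rid})

# Service A receives a request with no ID, mints one, then calls service B
# with the same ID so B's logs line up with A's:
incoming = {}
rid = ensure_request_id(incoming)
downstream = outgoing_headers(rid)
```

Put `ensure_request_id` in middleware at the edge and `outgoing_headers` in your HTTP/queue client wrapper, and the propagation happens without anyone having to remember it per-endpoint.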
I forget the exact quote, but it goes something like "the best distributed system is one you didn't build". Obviously there are reasons to build them that become inevitable once a system needs to grow beyond a certain size, but if you can put that off for a while, it's a win.
We just finished implementing an event-sourced/CQRS solution using Kafka.
Yes, you need to have monitoring etc, but the testing and debugging were substantially easier because of how we broke down the "services". For each major entity (or aggregate) there was a service that subscribed to a number of command and event topics and produced output to an event topic for the aggregate.
We had FSMs for each of the aggregates, documenting the effect of each potential command or external event and the change of state and the resulting state changes (events) and/or commands.
The architectural constraint meant that the infrastructure was the same for each aggregate, the testing of each was independent and could be mocked easily using topic producers.
So as opposed to the "Big Ball of Mud", we have a monitorable infrastructure (Kafka plus alerts/stats sent to statsd, integrated with AWS CloudWatch), and individual aggregate processors that only respond to incoming commands or events and have defined outputs for each potential incoming command/event.
Much much easier to design, develop and debug.
But the trick is at the start (like anything else): analyzing the domain to determine the entities/aggregates, modelling the externally generated commands, modelling the FSMs for each aggregate, etc.
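The FSM-per-aggregate idea above can be sketched as a plain transition table: (current state, incoming command/event) maps to (next state, emitted events). The "Order" aggregate and its states here are hypothetical examples, not from the original system:

```python
# Transition table for a hypothetical "Order" aggregate:
# (current_state, input) -> (next_state, events to emit)
ORDER_FSM = {
    ("new", "PlaceOrder"): ("placed", ["OrderPlaced"]),
    ("placed", "PaymentReceived"): ("paid", ["OrderPaid"]),
    ("placed", "CancelOrder"): ("cancelled", ["OrderCancelled"]),
    ("paid", "ShipOrder"): ("shipped", ["OrderShipped"]),
}

def handle(state, message):
    """Apply one command/event to the aggregate. Unknown (state, input)
    pairs are rejected explicitly, which is exactly what makes each
    aggregate's behavior documentable and testable in isolation."""
    try:
        return ORDER_FSM[(state, message)]
    except KeyError:
        raise ValueError(f"illegal {message!r} in state {state!r}")
```

Each service then reduces to a loop: consume a message from its topics, call `handle`, publish the emitted events. Tests only need a fake topic producer feeding inputs and asserting on the outputs.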
It can be very hard to know the state of a piece of information within your system unless you fanatically log its state at all times. And even then, you might have to look through lots of distributed logs to find that. Of course, logging the state of the world at every step also creates tons of noise.
At a previous gig, we had a processing workflow where an entity might move through any of 20 processes tied together through queues. It was very, very difficult to track down problems with things dropping out, things getting stuck, etc.
Not really; it's tougher, but if you're bringing queues into your workflow, it should be because you need them: you're dealing with scaling, resiliency, parallelism, faster response times, etc. The benefit then outweighs the debugging overhead. But at that point you have a distributed system, and if you have a distributed system, you should have centralized logging and distributed tracing too so you can debug more easily. Without those, you are going to go through the pain.
To your point...even if you do want eventing, you can do it inside of a monolith and not involve the network at all, which will save a good amount of pain.
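Eventing inside a monolith can be as small as an in-process pub/sub object; a minimal sketch, assuming synchronous dispatch is acceptable (the class and event names are illustrative):

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process pub/sub: the eventing pattern without the network."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Synchronous dispatch: a failing handler raises right here, in one
        # stack trace, instead of dying silently in a remote worker.
        for handler in self._handlers[event_type]:
            handler(payload)

bus = EventBus()
audit = []
bus.subscribe("user_signed_up", lambda event: audit.append(event["email"]))
bus.publish("user_signed_up", {"email": "joe@example.com"})
```

You keep the decoupling between publisher and subscribers, and if you later genuinely need a broker, the call sites already speak in events.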
I think it depends on the isolation of each part. If the incoming data is fine, then you have the problem narrowed down already. If it isn't, then you have to figure out why it was sent in the first place, probably by detecting something wrong in the sender at run time so you can see what led to the bad message.
Of course, but you're often trading debuggability for scalability whenever you pick an evented architecture. I do think tools can help here, but it's just the nature of this architecture: you're reacting to 'events' instead of following procedural order.
My experience with debugging RabbitMQ was not bad, but we had to do a bit of work to ease the process. For example, we configured error queues where processors publish the error log and input message information, which is enough to reproduce the bug and understand what happened.
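The error-queue idea above is broker-agnostic; a sketch with the broker abstracted to a callable, so the wrapper below is a pattern illustration rather than RabbitMQ-specific code:

```python
import json
import traceback

def with_error_queue(process, publish_error):
    """Wrap a message processor so that any failure publishes the original
    input message plus error details to an error queue - enough context to
    reproduce the bug offline and replay the message after a fix."""
    def wrapped(message):
        try:
            process(message)
        except Exception as exc:
            publish_error(json.dumps({
                "input": message,                  # enough to reproduce
                "error": repr(exc),
                "traceback": traceback.format_exc(),
            }))
    return wrapped

# Demo with a list standing in for the error queue and a flaky processor:
error_queue = []

def flaky(msg):
    if msg == "bad":
        raise ValueError("cannot parse")

handler = with_error_queue(flaky, error_queue.append)
handler("good")
handler("bad")
```

With a real broker, `publish_error` would publish to a dedicated error queue (or a dead-letter exchange would do the routing), but the payload shape - input plus error plus traceback - is what makes the debugging cheap.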
Not really. Sometimes it's even easier than debugging monoliths: just log the incoming message and log the outgoing message, and you can narrow down the problem just as quickly as in a monolith.
Queue-driven systems really fascinate me. Coming from a chemical engineering background, I can't help but see parallels to fluid dynamics and all the difficult math that comes from its analysis.
I've always wanted to create some type of monitoring system that displays the entire system in that vein and then models it using control theory.
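The fluid-dynamics intuition maps directly onto classical queueing theory, which is usually where a control-theory view of a worker pool starts. As one illustration (a textbook M/M/1 model, not anything from the original thread), utilization and Little's law give you the steady-state numbers such a monitor would display:

```python
def mm1_stats(arrival_rate, service_rate):
    """Steady-state metrics for an M/M/1 queue (Poisson arrivals, one
    exponential server). Rates are in messages per second."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: the queue grows without bound")
    rho = arrival_rate / service_rate            # utilization
    avg_in_system = rho / (1 - rho)              # L, average messages in system
    avg_latency = avg_in_system / arrival_rate   # W, via Little's law L = lambda * W
    return {
        "utilization": rho,
        "avg_in_system": avg_in_system,
        "avg_latency": avg_latency,
    }

# 80 messages/sec arriving at a worker that can handle 100/sec:
stats = mm1_stats(80.0, 100.0)
```

Note how nonlinear it is near saturation: at 80% utilization the average latency is already several service times, and it blows up as arrival rate approaches service rate, which is why queue-depth alerts fire "suddenly".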
Probably every technical decision maker pushing for microservices knows the perils of distributed debugging. They have some weight - just some.
I feel like the more shortsighted/incentivized by sheer work volume a person is, the more they're into monorepos...
Has anybody seen a project that does this - a monitoring system that displays a whole queue-driven system and models it with control theory?