Ask HN: Has anyone fully embraced an event-driven architecture?
Is there any company out there that has fully embraced this type of architecture when it comes to microservice communication, handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?
Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south. As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.
Is there any non-sales community of professionals discussing this topic?
Any help would be much appreciated.
[+] [-] jfoutz|4 years ago|reply
Windows 95. Old style gui programming meant sitting in a loop, waiting for the next event, then handling it. You type a letter, there's a case switch, and the next character is rendered on the screen. Being able to copy a file and type at the same time was a big deal. You'd experience the dead letter queue when you moved a window while the OS was handling a device, and the window would sort of smear across the screen when the repaint events were dropped.
Concurrent programming is hard. State isolation from microservices helps a lot, but eventually you'll need to share state, and people try stuff like `add 1 to x`, but that has bugs, so they say `if x == 7 add 1 to x`, but that has bugs, so they say `my vector clock looks like foo. if your vector clock matches, add 1 to x, and give me back your updated vector clock`, but now you've imposed a total order and given up a lot of performance.
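To make that coordination cost concrete, here's a tiny sketch (names and shape are mine, purely illustrative) of the compare-the-clock-then-write update: any writer whose view is stale must re-read and retry, and that retry loop is exactly the serialization that eats your performance.

```python
# Illustrative sketch (names are mine): apply `add 1 to x` only when the
# caller's vector clock matches ours; any stale writer must re-read and retry.

def try_increment(state, expected_clock, node):
    clock = state["clock"]
    if clock != expected_clock:
        return None  # a concurrent write happened; caller re-reads and retries
    state["x"] += 1
    clock[node] = clock.get(node, 0) + 1  # advance our component of the clock
    return dict(clock)

state = {"x": 7, "clock": {"a": 3}}
updated = try_increment(state, {"a": 3}, "a")  # succeeds: clocks match
stale = try_increment(state, {"a": 3}, "a")    # fails: clock has moved on
```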
I'm blind to the actual problem you're facing. My default recommendation is to have a monorepo and break out helpers for expensive tasks, no different than spinning up a thread or process on a big host. Have a plan for building pipelines a->b->c->d. Also have a plan for fan-out: a->b & a->c & a->d.
It has been widely observed that there are no silver bullets. But there are regular bullets, and careful, thoughtful use of them can be a huge win. If you're in that exponential growth phase, it's duct tape and baling wire all the way; get used to everything being broken all the time. If you're not, take your time and plan a few steps ahead. Operationalize a few microservices. Get comfortable with the coordination and monitoring. Learn to recover gracefully, and hopefully find a way to dodge that problem next time around.
Sorry this is hand-wavy. I don't think you're missing anything; it's just hard. If you're stuck because it won't fit on 1X anymore, you've got to find a way to spread out the load.
[+] [-] fxtentacle|4 years ago|reply
The user presses a button which sets a hardware event flag. CPU wakes up from sleep, checks all event flags, handles them, clears the interrupt bits and goes back to sleep.
But using events like this requires a very tight integration between event producer and consumer, so I don't think this will translate well to distributed systems or microservices.
[+] [-] narag|4 years ago|reply
The GUI worked in a single thread, but your whole program didn't need to do that.
> Being able to copy a file and type at the same time was a big deal.
That's not correct. You just needed to create a separate thread for the file operation. For some programmers that was a big deal indeed, and the same could be said of some programming tools. But that wasn't the general case at all.
There were some ugly things, like the way file operations were treated down the OS level, but it wasn't impossible to make your application responsive. It wasn't even difficult... if you knew how to do it.
[+] [-] DelightOne|4 years ago|reply
What does that look like? Do you mean to learn how this is done with the CI of choice, create helper functions, or are there concrete steps that you would recommend? I'm new to this and would appreciate any feedback.
[+] [-] jimmaswell|4 years ago|reply
[+] [-] parentheses|4 years ago|reply
Micro-services is just sharding across a different axis (TM)
[+] [-] blackoil|4 years ago|reply
[+] [-] evanrich|4 years ago|reply
We used Kafka for event-driven microservices quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within a topic, same as you would expect from any other public-facing API. We also didn't allow multiplexing a topic with multiple schemas. This wasn't just because it made my life easier: a large portion of our topics went on to become analytical tables in Hive, and breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old. This puts a lot of onus on the producer, so we tried to build tools to help. We had a central schema registry, paired with the topics each schema applied to, that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice, though, we never got much pushback on the no-breaking-changes rule.
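The gatekeeping described here can be sketched roughly like this (illustrative Python, not Uber's actual registry or tooling): the registry rejects any schema change that existing readers couldn't handle, while additive changes pass.

```python
# Illustrative sketch (not Uber's actual tooling): a registry check that
# rejects breaking changes to a topic's schema before a producer can publish.

def is_backward_compatible(old, new):
    """Old consumers must still be able to read new messages:
    no field may be removed or change type; additions are allowed."""
    for field, ftype in old.items():
        if field not in new:
            return False  # removed field breaks existing readers
        if new[field] != ftype:
            return False  # type change breaks existing readers
    return True

registry = {"rides.completed": {"ride_id": "string", "fare": "double"}}

def register(topic, schema):
    current = registry.get(topic)
    if current is not None and not is_backward_compatible(current, schema):
        raise ValueError(f"breaking change on {topic}: create a new topic instead")
    registry[topic] = schema

# Adding a field is fine; removing or retyping one would raise.
register("rides.completed", {"ride_id": "string", "fare": "double", "tip": "double"})
```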
DLQ practices were decided by each team based on need; there are too many things to consider to make blanket rules. Where in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have overwritten this one? Are you paying for some API that your auto-retry, churning away in your DLQ, is going to run up a huge bill on? Sometimes you may not even want a DLQ; you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.
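Those questions can be turned into a triage step. This sketch is hypothetical (field names and the cost threshold are made up), just to show that DLQ handling is a per-message decision rather than a blanket retry policy:

```python
# Hypothetical DLQ triage sketch: whether to retry a dead-lettered message
# depends on staleness, idempotency, and retry cost -- not a blanket rule.

def triage(msg, handler_idempotent, latest_seen_version, retry_cost_per_call=0.0):
    """Return 'retry', 'drop', or 'alert' for a dead-lettered message."""
    if msg["version"] < latest_seen_version:
        return "drop"   # a newer event already overwrote this state
    if not handler_idempotent:
        return "alert"  # replaying could double-apply; needs a human
    if retry_cost_per_call > 1.00:
        return "alert"  # auto-retry would burn money on a paid API
    return "retry"

assert triage({"version": 3}, True, 5) == "drop"
assert triage({"version": 9}, True, 5) == "retry"
```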
I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.
[+] [-] sideway|4 years ago|reply
Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:
1. Do you know if changes in the org structure (e.g. when Uber was growing fast and, I guess, new teams/products were created and existing teams/products were split) had a significant effect on schemas that had already been published? For example, when a service is split in two and the dataset of the original service is now distributed, what pattern have you seen work sufficiently well for not breaking everyone downstream?
2. Did you have strong guidelines on how to structure events? Were they entity-based, with each message carrying a snapshot of the state of an entity, or action-based, describing the business logic that occurred? Maybe both?
And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.
[+] [-] kieranmaine|4 years ago|reply
[+] [-] gwbas1c|4 years ago|reply
IMO: It's over-complicated. They can ship a change to one microservice while insulating the other services from risk, but that just kicks the can down the road on technical debt.
What happens is that, if a service hasn't shipped in a few release cycles, we often find latent bugs when an update is finally made to it. Typically they are the kind of bugs that could be caught with simple regression testing, but the company put too much effort into dividing its code into silos. (Basically, they spent a lot of time dealing with the boundaries between their microservices instead of just writing clean, testable code with decent regression suites.)
---
IMO: Don't get too hung up on microservices and events. Focus on writing simple, straightforward code that's easily testable. Make sure you have high unit test coverage and a useful regression suite. Only introduce "microservice" boundaries where there are natural divisions. (E.g., one service makes more sense to write in Node.js and another in C#; or one service should run in Azure and another in AWS.)
This, BTW, helped immensely in a previous job. When I worked for Syncplicity, a major Dropbox competitor, we started with a monolithic C# server for most server-side logic, but we had a microservice in AWS to handle uploads and downloads. That split paid off, because we ended up allowing customers to host their own version of the upload/download server. It was a critical differentiator for us in the marketplace.
[+] [-] bob1029|4 years ago|reply
[+] [-] BulgarianIdiot|4 years ago|reply
Anyone calling themselves an architect, or an engineer, or even just a "good developer" would acknowledge that interaction patterns and concepts are contextual, not general idioms at the project or system level.
Speaking of them as "architectures", or embracing them, as in doing everything the one holy way, is just a crutch for people who are confused about what it means to define your system's architecture. And a silver bullet for consultants to sell you books and training.
There is a lot of empty hype and misconception around EDA. For example, "it helps decouple services" gets thrown around, which is nonsense to anyone who can analyze a system and knows what a dependency is. Moving from "I tell you" to "you tell me" is not more decoupled; you just moved the coupling. Likewise, moving from "A tells B" to "everyone tells B and B tells everyone," as in event hubs, is much more coupling; it basically plays the role of a global system variable.
Regarding dead letters, the most trivial answer is to log and notify the stakeholders of unconsumed messages. That's the most general approach. Think of dead-letter messages the same way as exceptions that bubbled to the top of the stack. And when you can handle them more specifically, you do.
[+] [-] rkangel|4 years ago|reply
I touch on it a little bit in another comment: https://news.ycombinator.com/item?id=28047306
[+] [-] monocasa|4 years ago|reply
Basically everything that makes fully event-driven architectures difficult is ameliorated because the only consumers are myself and my wife, and we literally built up the whole system. Something appears to be locked up? There's a system of watchdogs to kill stuff, all hardware has been designed to fail off into manual control, and we can pick the pieces up at our leisure, like when anything else in the house breaks. The last will and testament messages in MQTT are really nice for at least reporting hard failure conditions.
I'll be the first to admit that I would not look forward to productizing it and supporting someone else's house (to the point that I'll probably never do that). It's so easy for messages to make their way into the bit bucket when setting up a new subsystem, and everything is so loosely coupled because of the event system it's almost like it's all "stringly typed". And being both software engineers, we sort of relish in how awful the UI is, even using 98.css.
[+] [-] mattlondon|4 years ago|reply
I spent years fiddling with Zigbee and Z-Wave and other proprietary things, from well-known brands as well as unknown brands from Amazon. Nothing ever worked correctly for long periods of time.
Now I have binned all that other crap and migrated to ESP8266-based devices that can be flashed to Tasmota, and have them all talking via a Raspberry Pi running an MQTT server and OpenHab in Docker containers. It is now rock solid in terms of reliability: the only failures come when OpenHab's cloud integration dies, and then I have a backup local HTTP server with a JavaScript-based client that just sends commands directly over MQTT. MQTT really was the missing link for me.
OpenHab has been a pain to use, but that is a different story. I'd not recommend it and I'd ditch it in a heartbeat, but so far I am in an "ain't broke, don't fix it" position.
[+] [-] Guest42|4 years ago|reply
[+] [-] jekbao|4 years ago|reply
[+] [-] tjungblut|4 years ago|reply
[+] [-] jetti|4 years ago|reply
One of the benefits of this architecture for us is the ability to easily share information between services. We use SNS and SQS for a pub/sub architecture, so if we need to expose more information we can just publish another type of message to the topic, and if we need to consume some information we can just listen to the relevant topic.
There are two big issues that I've run into while at this company. One is that tracking down where events are coming from can be a big pain, especially as we are replacing services while keeping message formats the same. The other is that setting up lower environments (dev, QA, etc.) can be difficult, because you pretty much need the entire ecosystem for the environment to be usable, which requires buy-in from all teams in the organization.
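The fan-out being described can be sketched conceptually like so (a pure-Python stand-in; a real system would use boto3 against actual SNS topics and SQS queues, and the names here are made up):

```python
# Conceptual in-process sketch of SNS/SQS-style pub/sub fan-out.
# Illustrative only: every subscriber gets its own copy of each message,
# so exposing more data is just publishing another message type.
from collections import defaultdict

class Topic:
    def __init__(self):
        self.queues = defaultdict(list)  # subscriber name -> its own queue

    def subscribe(self, name):
        self.queues[name]  # touching the defaultdict creates an empty queue

    def publish(self, message):
        for q in self.queues.values():  # fan out a copy to every subscriber
            q.append(message)

orders = Topic()
orders.subscribe("billing")
orders.subscribe("analytics")
orders.publish({"type": "OrderPlaced", "id": 42})
```

No direct service-to-service calls are needed; a new consumer is just another subscribe.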
[+] [-] jfoutz|4 years ago|reply
One way to ease some of that pain is a standard library that obtains your keys and topic and publishes metrics when things are published and consumed, or at least logs on startup.
It's a pain to get buy-in, and 10x harder to keep it updated. But if you can solve the problems around getting names and secrets and such, folks are usually open to the conversation, at least.
[+] [-] aussieguy1234|4 years ago|reply
[+] [-] sideway|4 years ago|reply
Does moving the DLQ messages back to the normal queue mean that all consumers can deal with out-of-order scenarios?
[+] [-] zamalek|4 years ago|reply
Events are a part of a greater whole. It's a tool that you can use to solve certain data flows, but not all data flows. When you start taking more liberty with the word "eventually," you are almost certainly in a realm where event-driven makes the most sense. CQRS is a pretty good example of using many architectures (including event-driven) under a single greater architectural umbrella, and the thought patterns it introduces you to are incredibly useful. But no architecture is gospel, not even close.
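As a toy illustration of the CQRS shape mentioned here (all names are mine, nothing canonical): commands mutate the write model and emit events, and a projection later folds those events into a read model, which is where the "eventually" lives.

```python
# Minimal illustrative CQRS flavor: commands change the write model and
# append events; a separate projection consumes events into a read model.
events = []     # the event log (the "eventually" part)
balances = {}   # write model, updated synchronously by commands
read_view = {}  # read model, updated only from events

def deposit(account, amount):  # command handler
    balances[account] = balances.get(account, 0) + amount
    events.append(("Deposited", account, amount))

def project():  # projection; in a real system this runs asynchronously
    for kind, account, amount in events:
        if kind == "Deposited":
            read_view[account] = read_view.get(account, 0) + amount
    events.clear()

deposit("alice", 50)
deposit("alice", 25)
project()  # until this runs, read_view lags behind balances
```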
Any "pure" architecture is the tail wagging the dog. The problem comes first, the solution comes second, the architecture comes third.
[+] [-] bob1029|4 years ago|reply
If you want to use event driven + microservices, first make sure microservices make sense. Event driven is just a cool way to tie monstrous collections of services together if you have to go down that road.
If you can build a monolith and satisfy the business objectives, you should almost certainly do it. With a monolith, entire classes of things you would want to discuss on Hacker News (such as event-driven architectures) evaporate into nothingness.
[+] [-] ARandomerDude|4 years ago|reply
In my opinion there's also a time factor involved: for example, you think there's huge potential but you have zero actual clients today. In that case, if you think a monolith will do the job for the first five years of its life, build a monolith.
[+] [-] jf22|4 years ago|reply
It was great for certain use cases, bad for others. The architecture made it so that simple features, like adding a sortable column, took days.
Having to deal with that made it the worst job I've ever had. It would take 700 lines of code, two separate systems, and 70 hours to do tasks that would normally take two hours. I felt a lot of pressure because previously simple tasks would take so long.
[+] [-] rkangel|4 years ago|reply
In well architected Erlang/Elixir, most of your business logic will be written as pure functional code (which is gloriously easy to test), but then it is glued together at the boundary by GenServers (usually). GenServers are an abstraction over the BEAM primitives that makes the 'receive a message, update my state' thing very easy. The simplest handler might look like this:
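```elixir
# (sketch reconstructed from the description that follows, not the
# commenter's exact code): reply with the current value, then bump state
def handle_call(:increment, _from, state) do
  {:reply, state, state + 1}
end
```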
Here state is a simple integer. When we receive the :increment message, we send back the current value and increment our local state. The way all this is wrapped up, the caller has an API that just looks like a function call which returns the value, but the underlying architecture you're working with is all event driven.
[+] [-] zzbzq|4 years ago|reply
Feeding events into a processing queue is also a good way to make systems more responsive for end users, compared to making every operation blocking.
Events are not a good replacement for transactional request/response models (i.e., making an API call). Some people advocate for an "event sourcing" system that creates its own internal domain model from events. I don't think this is a good default; it really comes down to what tools you're using and how you're used to using them. Namely, you can't have a web service that writes to an RDBMS and then immediately writes to RabbitMQ and call it a consistent system, because the write to RabbitMQ could fail and the systems downstream would be permanently wrong. So event sourcing is used to resolve this into a single write into a queue system, which then fans out into the system's own RDBMS and also other systems. However, the more "normal" way would be to write this atomically into the RDBMS and have a second process poll it back out into the queue for downstream systems.
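That "normal" approach is often called the transactional outbox pattern. A minimal sketch using sqlite3 as a stand-in RDBMS (table, column, and event names are all illustrative): the domain write and the event write commit in one transaction, and a separate poller relays unsent events downstream.

```python
# Transactional outbox sketch: one atomic DB transaction covers both the
# domain write and the event write; a poller relays events afterwards.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(total):
    # Both inserts commit together -- no window where the database write
    # succeeded but the queue publish failed.
    with db:
        cur = db.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (f'{{"event":"OrderPlaced","order_id":{cur.lastrowid}}}',))

published = []  # stand-in for RabbitMQ / downstream systems

def relay_outbox():
    # Separate poller: deliver unsent events, then mark them sent.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        published.append(payload)  # at-least-once delivery to downstream
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order(19.99)
relay_outbox()
```

If the relay crashes mid-delivery, rows stay unsent and are retried, so downstream consumers must tolerate duplicates (at-least-once, not exactly-once).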
[+] [-] HALtheWise|4 years ago|reply
It's not an exact match for the scalable-microservices world you're thinking of (for example, typically robots don't need to deal with runtime schema version skew), but could be interesting to learn about anyway.
[+] [-] xet7|4 years ago|reply
[+] [-] altacc|4 years ago|reply
[+] [-] BulgarianIdiot|4 years ago|reply
[+] [-] navd|4 years ago|reply
[+] [-] Frajedo|4 years ago|reply
We have written an article about how we try to fix that; if you are considering such an infrastructure on AWS, feel free to read it:
https://medium.com/ekonoo-tech-finance/centralizing-log-mana...
I believe that this is one of the big tradeoffs that you make when choosing to go for a microservices and event driven infrastructure.
Regarding the original post's question: we are indeed going all-in on an event-driven infra, and so far it has been going not too badly. Happy to answer any extra questions.
[+] [-] yellow_lead|4 years ago|reply
[+] [-] lloydatkinson|4 years ago|reply
[+] [-] Fiahil|4 years ago|reply
In projects where the client already has a mature engineering and data science department, we bring the big guns! The scope is usually much larger, with several workstreams, and involves production-ready deployments. In this situation we might build upon what the client already has (ETLs), or initiate a full event-driven transformation with a "backbone" team responsible for creating a platform, and several use cases building upon it. In the usual scenario, a team would want to start large computations or simulations upon receiving a trigger event from a monitoring system (model drift) or a human operator ("what would be the impact in € of a small decrease in parameter X over the next 7 days of forecasted sales?").
Event-driven systems are much more robust than traditional ETLs with a central data warehouse, but they are also much more complex to understand and operate. In the end, we rarely deploy them because they cost us way too much engineering time compared to the benefits. That's mostly because we spend >70% of our time dealing with "security teams" and "access issues". Seriously.
[+] [-] Licentia|4 years ago|reply
There's a bit of a learning curve, but it's honestly not that hard if people are willing and want to learn. You could level up a moderately experienced team in a matter of weeks to work within a well-defined event-driven microservice architecture. The part that gets tricky and requires experience is carving out the boundaries and messages.
To answer the question about DLQs: I think this is a valid critique. I've seen many places just set and forget DLQs, and they might as well not have them. For me, I like to start each DLQ with an alert on every message published. Then I manually inspect the message, trace the logs, and figure out what to do from there. Once you have enough data on failure modes and paths to rectify them, you can start automating DLQ processing. In general, though, DLQs should not see much traffic outside of a system going down or poison messages hitting your services (broken schema changes from another service).
[+] [-] thire|4 years ago|reply
Unless scale is an issue, nowadays I always prefer a more stateful approach when possible.
[+] [-] gwbas1c|4 years ago|reply
I was the lead developer for Syncplicity's desktop client. It was a file synchronization product very similar to Dropbox.
When I joined, the desktop client was 100% event-driven. The problem is that some kinds of operations need to be performed synchronously, so "event-driven" tended to obfuscate what needed to happen when. Translation: For your primary use cases, it's much easier to follow code that calls functions in a well-defined order, instead of figuring out who's subscribed to what. Events are great for secondary use cases.
To translate to microservices, for primary use cases, I'd have the service that generates a result call directly into the next service. For secondary use cases, I'd rely on events. Of course, there's tradeoffs everywhere, but you'll find that newcomers are able to more easily navigate your codebase when it's unambiguous what happens next.
[+] [-] giantg2|4 years ago|reply
For us, it's either a function that will retry the messages after some time, or manual intervention.
Our department recently said we need to move to event-driven architecture for one of our processes that currently runs as a batch. They want us to load data into EMR from an S3 bucket populated by Kinesis. Their suggested implementation is simply to run the batch job more frequently instead of once a day... sorry guys, but that's not event driven...
I suggested maybe just setting a trigger on the S3 bucket and hooking it up to Glue, since that would actually be event driven. They said no, because they don't want the load to EMR or Glue to run too frequently. I guess that makes sense (I'm not that familiar with the ETL tech), but it sure doesn't make sense to call it event driven.
[+] [-] dm3|4 years ago|reply
The closest community to what you are asking for here is probably the DDD/CQRS/ES[1] Slack[2]. Google groups are pretty much dead at this point.
[1]: https://github.com/ddd-cqrs-es [2]: https://ddd-cqrs-es.slack.com/