top | item 28034882

Ask HN: Has anyone fully embraced an event-driven architecture?

284 points| sideway | 4 years ago | reply

After reading quite a few books and blog posts on event-driven architectures and comparing the suggested patterns with what I've seen myself in action, I keep wondering:

Is there any company out there that has fully embraced this type of architecture when it comes to microservice communication, handling breaking schema changes or failures in an elegant way, and keeping engineers and other data consumers happy enough?

Every event-driven architectural pattern I've read about can quite easily fall apart and I have yet to find satisfying answers on what to do when things go south. As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one.

Is there any non-sales community of professionals discussing this topic?

Any help would be much appreciated.

168 comments

[+] jfoutz|4 years ago|reply
You're not going to like the answer, but I think it captures some of what you're getting at.

Windows 95. Old style gui programming meant sitting in a loop, waiting for the next event, then handling it. You type a letter, there's a case switch, and the next character is rendered on the screen. Being able to copy a file and type at the same time was a big deal. You'd experience the dead letter queue when you moved a window while the OS was handling a device, and the window would sort of smear across the screen when the repaint events were dropped.

Concurrent programming is hard. State isolation from micro services helps a lot. but eventually you'll need to share state, and people try stuff like `add 1 to x`, but that has bugs, so they say, `if x == 7 add 1 to x` but that has bugs so they say, `my vector clock looks like foo. if your vector clock matches, add 1 to x, and give me back your updated vector clock` but now you've imposed a total order and have given up a lot of performance.
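To make that escalation concrete, here's a toy Python sketch (my own illustration, with a single version counter standing in for the vector clock) of why the conditional write fixes the lost-update bug at the cost of serializing writers:

```python
import threading

class VersionedCell:
    """Toy compare-and-set cell: a single version counter stands in for
    the vector clock, which already imposes a total order on writes."""
    def __init__(self, value=0):
        self._lock = threading.Lock()
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        # The write succeeds only if nobody else wrote since we read.
        with self._lock:
            if self.version != expected_version:
                return False  # caller must re-read and retry
            self.value = new_value
            self.version += 1
            return True

def add_one(cell):
    # 'add 1 to x' done safely: read, attempt, retry on conflict.
    while True:
        value, version = cell.read()
        if cell.compare_and_set(version, value + 1):
            return

cell = VersionedCell()
threads = [threading.Thread(target=add_one, args=(cell,)) for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)  # 8: no lost updates
```

Every writer that loses the race has to re-read and retry, which is exactly the imposed total order (and lost performance) described above.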

I'm blind to the actual problem you're facing. My default recommendation is to have a monorepo and break out helpers for expensive tasks, no different from spinning up a thread or process on a big host. Have a plan for building pipelines a->b->c->d. Also have a plan for fan-out: a->b & a->c & a->d

It has been widely observed that there are no silver bullets. But there are regular bullets. Careful and thoughtful use can be a huge win. If you're in that exponential growth phase, it's duct tape and baling wire all the way; get used to everything being broken all the time. If you're not, take your time and plan out a few steps ahead. Operationalize a few microservices. Get comfortable with the coordination and monitoring. Learn to recover gracefully, and hopefully find a way to dodge that problem next time around.

Sorry this is hand-wavy. I don't think you're missing anything; it's just hard. If you're stuck because it won't fit on 1X anymore, you've got to find a way to spread out the load.

[+] fxtentacle|4 years ago|reply
I fully agree with this; it's also still quite common in the embedded world.

The user presses a button which sets a hardware event flag. CPU wakes up from sleep, checks all event flags, handles them, clears the interrupt bits and goes back to sleep.
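A rough sketch of that loop (in Python for illustration; real firmware would be C, with actual interrupt handlers and a sleep instruction):

```python
# Toy model of the embedded pattern above: hardware (here, a stub) sets
# bits in an event-flag register; the main loop wakes, handles every set
# flag, clears the bit, and goes back to sleep.
BUTTON_PRESSED = 0b01
TIMER_EXPIRED  = 0b10

event_flags = 0          # in real firmware, set from interrupt handlers
handled = []             # record of what the loop did, for illustration

def isr_button_press():  # stand-in for the hardware interrupt
    global event_flags
    event_flags |= BUTTON_PRESSED

def wake_and_handle():
    global event_flags
    for bit, name in [(BUTTON_PRESSED, "button"), (TIMER_EXPIRED, "timer")]:
        if event_flags & bit:
            handled.append(name)     # handle the event
            event_flags &= ~bit      # clear the interrupt bit
    # ...then go back to sleep until the next interrupt

isr_button_press()
wake_and_handle()
print(handled, bin(event_flags))  # ['button'] 0b0
```

The tight coupling is visible even here: the loop has to know every flag and its meaning at compile time, which is exactly what doesn't translate to distributed systems.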

But using events like this requires a very tight integration between event producer and consumer, so I don't think this will translate well to distributed systems or microservices.

[+] narag|4 years ago|reply
> Windows 95. Old style gui programming meant sitting in a loop, waiting for the next event, then handling it.

The GUI worked in a single thread, but your whole program didn't need to do that.

> Being able to copy a file and type at the same time was a big deal.

That's not correct. You just needed to create a separate thread for the file operation. For some programmers that was a big deal indeed, and the same could be said of some programming tools. But it wasn't the general case at all.

There were some ugly things, like the way file operations were treated down at the OS level, but it wasn't impossible to make your application responsive. It wasn't even difficult... if you knew how to do it.

[+] DelightOne|4 years ago|reply
> Have a plan for building pipelines a->b->c->d. also have a plan for fan out a->b & a->c & a->d

What does that look like? Do you mean to learn how this is done with the CI tool of choice, or to create helper functions, or are there concrete steps you would recommend? I'm new to this and would appreciate any feedback.

[+] jimmaswell|4 years ago|reply
Aren't most "events" just an abstraction that's actually implemented by polling and/or an event queue at some level anyway, be it a library or the kernel?
[+] parentheses|4 years ago|reply
I mostly agree with this answer. I also want to point out that we use sharding as a technique.

Micro-services is just sharding across a different axis (TM)

[+] blackoil|4 years ago|reply
React also works quite similarly, with the added benefits of async and a single thread.
[+] evanrich|4 years ago|reply
Like others have said, it is just one tool in the tool box.

We used Kafka for event-driven microservices quite a bit at Uber. I led the team that owned schemas on Kafka there for a while. We just did not accept breaking schema changes within a topic, same as you would expect from any other public-facing API. We also didn't allow multiplexing a topic with multiple schemas. This wasn't just because it made my life easier: a large portion of our topics went on to become analytical tables in Hive, and breaking changes would break those tables. If you absolutely have to break the schema, make a new topic with a new consumer and phase out the old one. This puts a lot of onus on the producer, so we tried to build tools to help. We had a central schema registry, with each schema paired to its topics, that showed producers who their consumers were, so if breaking changes absolutely had to happen, they knew who to talk to. In practice, though, we never got much pushback on the no-breaking-changes rule.
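A toy sketch of the kind of compatibility check a registry like that can enforce (my own illustration; the field names are made up, and a real registry would apply Avro/Protobuf compatibility rules rather than plain dicts):

```python
def breaking_changes(old_schema, new_schema):
    """Flag changes that break existing consumers: a field disappearing
    or changing type. Schemas here are dicts of field name -> type name."""
    problems = []
    for field, ftype in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != ftype:
            problems.append(f"type change: {field} {ftype} -> {new_schema[field]}")
    return problems

v1 = {"trip_id": "string", "fare": "double"}
v2 = {"trip_id": "string", "fare": "double", "currency": "string"}  # additive
v3 = {"trip_id": "string", "fare": "string"}                        # breaking

print(breaking_changes(v1, v2))  # []
print(breaking_changes(v1, v3))  # ['type change: fare double -> string']
```

Additive changes pass; removals and type changes get flagged, at which point the registry can point the producer at the affected consumers.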

DLQ practices were decided by teams based on need; there are too many considerations to make blanket rules. Where in your code did it fail? Is this consumer idempotent? Have we consumed a more recent event that would have overwritten this event? Are you paying per call for some API, so that an auto-retry churning away in your DLQ is going to cost you a ton of money? Sometimes you may not even want a DLQ; you want a poison pill. That lets you assess what is happening immediately and not have to worry about replays at all.
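That checklist can be sketched as a tiny triage function (purely illustrative; the message fields and the drop/park/retry outcomes are made up for the example):

```python
def handle_dlq_message(msg, last_applied_version, already_processed_ids):
    """Toy triage for a dead-lettered message, following the checklist
    above. Field names (id, version, calls_paid_api) are hypothetical."""
    if msg["id"] in already_processed_ids:
        return "drop"        # consumer is idempotent and already saw it
    if msg["version"] <= last_applied_version:
        return "drop"        # a newer event already overwrote this state
    if msg["calls_paid_api"]:
        return "park"        # don't auto-retry: each retry costs money
    return "retry"           # safe to put back on the main queue

msg = {"id": "evt-42", "version": 7, "calls_paid_api": False}
print(handle_dlq_message(msg, last_applied_version=9,
                         already_processed_ids=set()))
# drop: state has already moved on past version 7
```

The point is that drop/park/retry depends on per-consumer facts (idempotency, ordering, cost), which is why blanket DLQ rules don't work.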

I hope one of the books you are talking about is Designing Data Intensive Applications, because it is really fantastic. I joke that it is frustrating that so much of what I learned over years on the data team could be written so succinctly in a book.

[+] sideway|4 years ago|reply
Thanks for your detailed answer, really appreciate it.

Two follow up questions if you don't mind me asking, even though I understand you were not on the publishing side:

1. Do you know if changes in the org structure (e.g. when Uber was growing fast and, I guess, new teams/products were created and existing teams/products were split) had a significant effect on schemas that had already been published? For example, when a service is split in two and the dataset of the original service is now distributed, what patterns have you seen work well enough to avoid breaking everyone downstream?

2. Did you have strong guidelines on how to structure events? Were they entity-based, with each message carrying a snapshot of the state of an entity, or action-based, describing the business logic that occurred? Maybe both?

And yes, one of the books I'm talking about is indeed Designing Data Intensive Applications and I fully agree with you that it's a fantastic piece of work.

[+] kieranmaine|4 years ago|reply
When you say you used a central schema registry, did you have a single repo containing all topics and schemas?
[+] gwbas1c|4 years ago|reply
I've been working on a contract for 6 months where the architecture is microservices and queues.

IMO: It's over-complicated. They can ship a change to a microservice while insulating the other services from risk, but that just kicks the technical-debt can down the road.

What happens is that, if a service hasn't shipped in a few release cycles, when an update is made to that service we often find latent bugs. Typically they are the kind of bugs that could be found with simple regression testing, but the company put too much effort into dividing its code into silos. (Basically, they spent a lot of time dealing with the boundaries between their microservices instead of just writing clean, testable code with decent regression suites.)

---

IMO: Don't get too hung up on microservices and events. Focus on writing simple, straightforward code that's easily testable. Make sure you have high unit test coverage and a useful regression suite. Only introduce "microservice" boundaries where there are natural divisions. (E.g., one service makes more sense to write in Node.js and another in C#; or one service should run in Azure and another in AWS.)

This, BTW, helped immensely in a previous job. When I worked for Syncplicity, a major Dropbox competitor, we started with a monolithic C# server for most server-side logic, but we had a microservice in AWS to handle uploads and downloads. This helped immensely, because we ended up allowing customers to host their own version of the upload / download server. It was a critical differentiator for us in the marketplace.

[+] bob1029|4 years ago|reply
We made the microservices mistake circa 2015-2016. Took us 3 years to recover and we lost some customers. Extremely valuable lessons were learned by all involved.
[+] BulgarianIdiot|4 years ago|reply
There's no such architecture, much like there's no "MVC architecture" or "CQRS architecture". These are patterns that should be applied selectively, in the places and at the times where the pros outweigh the cons.

Anyone calling themselves an architect, or an engineer, or even just a "good developer", would acknowledge that interaction patterns and concepts are contextual, not general idioms at the project or system level.

Speaking of them as "architectures", or "embracing" them, as in doing everything the one holy way, is only a crutch for people who are confused about what it means to define their system's architecture, and a silver bullet for consultants to sell you books and training.

There is a lot of empty hype and misconception around EDA. For example, "it helps decouple services" is thrown around, which is nonsense to anyone who can analyze a system and knows what a dependency is. Moving from "I tell you" to "you tell me" is not more decoupled; you just moved the coupling. Likewise, moving from "A tells B" to "everyone tells B and B tells everyone", as in event hubs, is much more coupling: the hub basically plays the role of a global system variable.

Regarding dead letters, the most trivial answer is to log them and notify the stakeholders of unconsumed messages. That's the most general approach. Think of dead-letter messages the same way as exceptions that bubbled to the top of the stack. And when you can handle them more specifically, you do.

[+] rkangel|4 years ago|reply
I suppose it all depends on your definition of the word architecture. Erlang and Elixir are instantiations of the actor model, which is fundamentally an event driven architecture. It is universally the way that state and side effects are handled in those languages.

I touch on it a little bit in another comment: https://news.ycombinator.com/item?id=28047306

[+] monocasa|4 years ago|reply
Not for a company, but I've embraced it pretty hard for my home automation. It's sort of the hammer I hit everything with hard enough until it looks like a nail, by making everything go through the MQTT broker. The website? A static JSON blob describes the interesting MQTT topics, and the page opens an MQTT-over-websocket connection to read/write any state. Zigbee, et al.? Translate to MQTT. Reporting? A daemon that listens to all topics and dumps them into a SQLite database to be queried at my leisure. Events like sprinklers on/off? Python scripts in cron jobs that talk to everything via MQTT.
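The reporting daemon reduces to surprisingly little code. A toy version (illustrative only; a real one would receive messages from an MQTT client such as paho-mqtt subscribed to `#`, rather than direct calls, and the topic names are made up):

```python
import sqlite3

# Toy reporting daemon: every (topic, payload) message is appended to a
# SQLite table to be queried later at leisure.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (ts REAL, topic TEXT, payload TEXT)")

def record(topic, payload, ts):
    # In a real daemon this would be the MQTT on_message callback.
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (ts, topic, payload))

record("home/sprinklers/state", "on", 1000.0)
record("home/sprinklers/state", "off", 1360.0)

rows = db.execute(
    "SELECT topic, COUNT(*) FROM events GROUP BY topic").fetchall()
print(rows)  # [('home/sprinklers/state', 2)]
```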

Basically everything that makes fully event-driven architectures difficult is ameliorated because the only consumers are myself and my wife, and we literally built up the whole system. Something appears to be locked up? There's a system of watchdogs to kill stuff, all hardware has been designed to fail off into manual control, and we can pick up the pieces at our leisure, like when anything else in the house breaks. The last-will-and-testament messages in MQTT are really nice for at least reporting hard failure conditions.

I'll be the first to admit that I would not look forward to productizing it and supporting someone else's house (to the point that I'll probably never do that). It's so easy for messages to make their way into the bit bucket when setting up a new subsystem, and everything is so loosely coupled because of the event system it's almost like it's all "stringly typed". And being both software engineers, we sort of relish in how awful the UI is, even using 98.css.

[+] mattlondon|4 years ago|reply
+1 to MQTT for home automation.

I spent years fiddling with zigbee, z-wave, and other proprietary things from well-known brands as well as unknown brands from Amazon. Nothing ever worked correctly for long periods of time.

Now I have binned all that other crap and migrated to esp8266-based devices that can be flashed with Tasmota, and have them all talking via a Raspberry Pi running an MQTT server and OpenHab in Docker containers. It is now rock solid in terms of reliability; the only failures come when OpenHab's cloud integration dies, and then I have a backup local HTTP server with a JavaScript-based client that just sends commands directly over MQTT. MQTT really was the missing link for me.

OpenHab has been a pain to use, but that is a different story. I'd not recommend it and I'd ditch it in a heartbeat, but so far I am in an "ain't broke, don't fix it" position.

[+] Guest42|4 years ago|reply
Would you mind describing the parts that became automated? I can think of a number of appliances that could potentially be API-driven but haven't researched them yet...
[+] jekbao|4 years ago|reply
Are there communities on Discord/Reddit for what you're doing? I'm interested in dabbling with home automation during my free time.
[+] tjungblut|4 years ago|reply
MQTT at home is a real boon. I have mine chugging away in a raspi in a docker container and that drives everything like the plant watering. I also use it to collect metrics and do alerting with influx+grafana.
[+] jetti|4 years ago|reply
Where I currently work we are all in on event-driven architecture. For our DLQs, we have alerts for when the queue is growing in size or when messages have been in the queue too long. When those alerts come in, we manually move the messages back to the normal queue for reprocessing, and if they get DLQed again after that we look into the reason they are failing.

One of the benefits of this architecture for us is the ability to easily share information between services. We utilize SNS and SQS for a pub/sub architecture so if we need to expose more information we can just publish another type of message to the topic or if we need to consume some information then we can just listen to the relevant topic.
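That pub/sub fan-out can be modeled in a few lines (an in-memory stand-in for SNS topics and SQS queues, purely for illustration; the topic and field names are made up):

```python
from collections import defaultdict, deque

# Toy in-memory model of SNS+SQS: publishing to a topic fans the message
# out to every subscribed queue, so a new consumer just attaches its own
# queue without touching the producer.
topics = defaultdict(list)          # topic name -> list of queues

def subscribe(topic):
    q = deque()
    topics[topic].append(q)
    return q

def publish(topic, message):
    for q in topics[topic]:
        q.append(message)

billing = subscribe("order-events")
analytics = subscribe("order-events")   # added later; producer unchanged
publish("order-events", {"type": "order_placed", "order_id": 1})

print(len(billing), len(analytics))  # 1 1: each consumer got a copy
```

The producer publishes once; attaching the analytics consumer later didn't require touching it. That's the "easily share information" property, and also the "where is this event coming from?" pain, in one mechanism.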

There are two big issues that I've run into while at this company. One is that tracking down where events are coming from can be a big pain, especially as we are replacing services but keeping message formats the same. The other is that setting up lower environments (dev, qa, etc.) can be difficult, because you pretty much need the entire ecosystem for the environment to be usable, which requires buy-in from all teams in the organization.

[+] jfoutz|4 years ago|reply
Do you have any control over the individual services?

One way to ease some of that pain is a standard library that obtains your keys and topics and publishes metrics whenever messages are produced and consumed, or at least logs on startup.

It's a pain to get buy-in, and 10x harder to keep it updated. But if you can solve the problems around getting names and secrets and stuff, folks are usually open to the conversation at least.

[+] aussieguy1234|4 years ago|reply
I guess it's still harder to track down event emitters, but have you tried using bitbucket or GitHub code search to search all of your repos at once?
[+] sideway|4 years ago|reply
Thanks for your answer, it really helps.

Does moving the DLQ messages back to the normal queue mean that all consumers can deal with out-of-order scenarios?

[+] zamalek|4 years ago|reply
"Everything looks like a red thumb when you're holding a golden hammer."

Events are a part of a greater whole. It's a tool that you can use to solve certain data flows, but not all data flows. When you start taking more liberty with the word "eventually," you are almost certainly in a realm where event-driven makes the most sense. CQRS is a pretty good example of using many architectures (including event-driven) under a single greater architectural umbrella, and the thought patterns it introduces you to are incredibly useful. But no architecture is gospel, not even close.

Any "pure" architecture is the tail wagging the dog. The problem comes first, the solution comes second, the architecture comes third.

[+] bob1029|4 years ago|reply
I have worked in places where event-driven architectures are a necessity (we're talking thousands of real-time systems being integrated together).

If you want to use event driven + microservices, first make sure microservices make sense. Event driven is just a cool way to tie monstrous collections of services together if you have to go down that road.

If you can build a monolith and satisfy the business objectives, you should almost certainly do it. With a monolith, entire classes of things you would want to discuss on hackernews (such as event-driven architectures) evaporate into nothingness.

[+] ARandomerDude|4 years ago|reply
This. I have built many monoliths and a few event-driven systems out of necessity.

In my opinion there's also a time factor involved; for example, you think there's huge potential but you have zero actual clients today. In that case, if you think a monolith will do the job for the first 5 years of its life, build a monolith.

[+] jf22|4 years ago|reply
I did at an old company.

It was great for certain use cases, bad for others. The architecture made it so that simple features, like adding a sortable column, took days.

Having to deal with that made it the worst job I've ever had. It would take 700 lines of code involving two separate systems and 70 hours to do tasks that would normally take two hours. I felt a lot of pressure because previously simple tasks would take so long.

[+] rkangel|4 years ago|reply
The closest thing I know of is the Erlang/Elixir approach to program development. The BEAM VM that they're built on is basically an instantiation of the actor programming model: a series of logical processes (services) that only communicate with each other through messages. Any state is held in an actor, and you work with that state in an event-driven way based on the messages you receive. I'll give a little peek at this below, but really you'd need to work with it to see how well it works at application scale.

In well architected Erlang/Elixir, most of your business logic will be written as pure functional code (which is gloriously easy to test), but then it is glued together at the boundary by GenServers (usually). GenServers are an abstraction over the BEAM primitives that makes the 'receive a message, update my state' thing very easy. The simplest handler might look like this:

    def handle_call(:increment, _from, state) do
      {:reply, state, state + 1}
    end
Here `state` is a simple integer. When we receive the :increment message, we send back the current value and increment our local state. The way all this is wrapped up, the caller has an API that just looks like a function call that returns the value, but the underlying architecture you're working with is all event-driven.
[+] zzbzq|4 years ago|reply
Events are a part of any good service-oriented architecture. They can replace patterns that involve batch-ETLing large amounts of data from system to system--events are usually a smoother way of doing the same. They're usually more resource-efficient and responsive than a poll & cache approach. They can also create a more consistent way to broadcast data, avoiding CAP problems from trying to do multiple writes to different systems, and preventing systems from devolving into anti-patterns where a system from a business domain gets misused as a message bus for another system.

Feeding events into a processing queue is also a good way to make systems more responsive for end users, compared to making every operation blocking.

Events are not a good replacement for transactional request/response models (i.e., making an API call). Some people advocate "event sourcing", where a system creates its own internal domain model from events. I don't think this is a good default; it really comes down to what tools you're using and how you're used to using them. Namely, you can't have a web service that writes to an RDBMS and then immediately writes to RabbitMQ and call it a consistent system, because the write to RabbitMQ could fail and the systems downstream would be permanently wrong. So event sourcing is used to resolve this into a single write into a queue system, which then forks out into the system's own RDBMS and also into other systems. However, the more "normal" way would be to just write this atomically into the RDBMS and have a second process poll it back out into the queue for downstream systems.
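That "normal way" is usually called a transactional outbox. A minimal sketch, with SQLite standing in for the RDBMS and a plain list for the queue (illustrative only; table and event names are made up):

```python
import sqlite3

# Transactional outbox: the business write and the outgoing event commit
# in ONE database transaction; a separate relay process polls the outbox
# into the queue, so the queue can never permanently miss a DB write.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY, event TEXT,"
           " sent INTEGER DEFAULT 0)")

def place_order(total):
    with db:  # single atomic transaction: both rows or neither
        db.execute("INSERT INTO orders (total) VALUES (?)", (total,))
        db.execute("INSERT INTO outbox (event) VALUES (?)",
                   (f"order_placed:{total}",))

def relay(publish):
    # Separate poller: read unsent events, publish, mark as sent.
    rows = db.execute("SELECT id, event FROM outbox WHERE sent = 0").fetchall()
    for row_id, event in rows:
        publish(event)
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

queue = []
place_order(9.99)
relay(queue.append)
print(queue)  # ['order_placed:9.99']
```

Because the order row and the outbox row commit together, the relay can crash and retry without ever losing the event (at the cost of possible duplicates, so downstream consumers should be idempotent).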

[+] HALtheWise|4 years ago|reply
By and large, all of industry and academia working on modern robotics systems have converged on using event-driven publish/subscribe message buses for basically everything. For example, a camera driver will produce a stream of "image" events that trigger other code across the system, all the way until a stream of "motor command" events comes out the other side. This model is really valuable because it works so well with logging and replay workflows, and because it makes mocking and replacing different parts of the system really easy (up to and including mocking reality with a simulator). ROS is the major open source framework used in academia, and industry is split between using ROS and building proprietary internal protocols with similar functionality.

It's not an exact match for the scalable-microservices world you're thinking of (for example, typically robots don't need to deal with runtime schema version skew), but could be interesting to learn about anyway.

[+] xet7|4 years ago|reply
A monolith is easier to handle. With microservices, any network connection could break, and you need a lot more code to handle all that complexity and orchestration.
[+] altacc|4 years ago|reply
This depends on the size and complexity of your monolith and how your development teams work. Essentially it's a trade-off over which becomes the most work and causes the most issues.
[+] BulgarianIdiot|4 years ago|reply
Monoliths vs. services is like biologists arguing about cells vs. organs.
[+] navd|4 years ago|reply
Not sure if there are any communities. My general advice is to invest as much as possible in a good logging solution, traceability, and general things that make debugging easier. Come up with a way to replay events easily. You'll thank yourself every day a bug or issue pops up.
[+] Frajedo|4 years ago|reply
I absolutely agree with you. We are currently implementing a fully serverless, event-driven infrastructure, and thinking about logging, traceability, and debugging across 200+ Lambda functions has been quite a pain. This is even more important as we manage financial flows.

We have written an article about how we try to fix that, if you are considering such an infrastructure on AWS feel free to read it:

https://medium.com/ekonoo-tech-finance/centralizing-log-mana...

I believe that this is one of the big tradeoffs that you make when choosing to go for a microservices and event driven infrastructure.

Regarding the original post question, we are indeed going all in on an event-driven infra, and so far it has been going not too bad, happy to answer any extra questions.

[+] yellow_lead|4 years ago|reply
This 100x. Tracing, replay, etc. are all invaluable, even in non-event-driven systems.
[+] lloydatkinson|4 years ago|reply
I'm not sure you've understood EDA if you're suggesting that good logging is an alternative.
[+] Fiahil|4 years ago|reply
I work for a big AI consultancy. Most of the time we build ETLs for the data-engineering side, in a client driven capacity-building effort. We do this because our focus is on Data Science, not data engineering, and we often work in situations where the client doesn't have an existing data science platform. It's simpler to build, to handover and later to maintain.

In projects where the client already has a mature engineering and data science department, we bring the big guns! The scope is usually much larger, with several workstreams, and involves production-ready deployments. In this situation we might build upon what the client already has (ETLs), or initiate a full event-driven transformation, with a "backbone" team responsible for creating a platform and several use cases building upon it. In the usual scenario, a team would want to start large computations or simulations upon receiving a trigger event from a monitoring system (model drift) or a human operator ("what would be the impact in € of a small decrease in parameter X over the next 7 days of forecasted sales?").

Event-driven systems are much more robust than traditional ETLs with a central data warehouse, but they are also much more complex to understand and operate. In the end, we rarely deploy them because they cost us way too much engineering time compared to the benefits. That's mostly because we spend >70% of our time dealing with "security teams" and "access issues". Seriously.

[+] Licentia|4 years ago|reply
Yes. It's my preferred architecture for any non-trivial system. The single biggest downside is that it's really hard to find people with experience building event-driven systems.

There's a bit of a learning curve, but it's honestly not that hard if people are willing and wanting to learn. You could level up a moderately experienced team in a matter of weeks to work within a well-defined event-driven microservice architecture. The part that gets tricky and requires experience is carving out the boundaries and messages.

To answer the question about DLQs, I think this is a valid critique. I've seen many places just set and forget DLQs, and they might as well not have them. For me, I like to start each DLQ with an alert on every message published, then manually inspect the message, trace the logs, and figure out what to do from there. Once you have enough data on failure modes and paths to rectify them, you can start automating DLQ processing. In general, though, DLQs should not see much traffic outside of a system going down or poison messages hitting your services (broken schema changes from another service).

[+] thire|4 years ago|reply
I have experience with a product using an event-based architecture at large scale, and to be honest, it was a pain to work with. For example, traceability, or troubleshooting in general, was very hard, since events would spawn more events etc., making things much harder to track than expected.

Unless scale is an issue, nowadays I always prefer a more stateful approach when possible.

[+] gwbas1c|4 years ago|reply
(Kind of related)

I was the lead developer for Syncplicity's desktop client. It was a file synchronization product very similar to Dropbox.

When I joined, the desktop client was 100% event-driven. The problem is that some kinds of operations need to be performed synchronously, so "event-driven" tended to obfuscate what needed to happen when. Translation: for your primary use cases, it's much easier to follow code that calls functions in a well-defined order than to figure out who's subscribed to what. Events are great for secondary use cases.

To translate to microservices, for primary use cases, I'd have the service that generates a result call directly into the next service. For secondary use cases, I'd rely on events. Of course, there's tradeoffs everywhere, but you'll find that newcomers are able to more easily navigate your codebase when it's unambiguous what happens next.

[+] giantg2|4 years ago|reply
"As a trivial example, everybody talks about dead-letter queues but nobody really explains how to handle messages that end up in one."

For us, it's either a function that will retry the messages after some time, or manual intervention.

Our department recently said we need to move to event-driven architecture for one of our processes that currently runs as a batch. They want us to load data into EMR from an S3 bucket populated by Kinesis. Their suggested implementation is simply to run the batch job more frequently instead of once a day... sorry guys, but that's not event driven...

I suggested maybe just setting a trigger on the S3 bucket and hooking it up to Glue, since that would actually be event driven. They said 'no' because they don't want the load to EMR or Glue to run too frequently. I guess that makes sense (I'm not that familiar with the ETL tech), but it sure doesn't make sense to call it event driven.