If anyone here wants to do this but doesn't want to implement it all themselves, this "field" is called Durable Execution. Frameworks such as Temporal, Restate and DBOS do a lot of the heavy lifting for the idempotency, exactly-once, and recovery-to-a-known-state logic here.
Second this. It's only been a few months since I started deploying Temporal at work, and there's no way that I would try implementing all this in-house.
> In APIs, passively safe means failures (crashes, timeouts, retries, partial outages) can't produce duplicate work, surprise side effects, or unrecoverable state.
Idempotence of an operation means that performing it a second (or third, etc.) time has no additional effect. The "action" all happens the first time and further attempts do nothing. E.g. switching a light switch on could be seen as "idempotent" in a sense: you can press the bottom edge of the switch again, but it's not going to click again and the light isn't going to become any more on.
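The light-switch distinction looks something like this in code (a toy sketch; the names are mine, not from the thread): setting state is idempotent, toggling it is not.

```python
# Setting state to a value is idempotent: repeating it changes nothing.
# Toggling is not: each call flips the state.

state = {"on": False}

def press_on(s):
    """Idempotent: repeated presses leave the light in the same state."""
    s["on"] = True

def toggle(s):
    """Not idempotent: every call changes the state."""
    s["on"] = not s["on"]

press_on(state)
press_on(state)  # second press has no additional effect
assert state["on"] is True
```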
The concept originates in maths, where it's functions that can be idempotent. The canonical example is projection operators: if you project a vector onto a subspace and then apply that same projection operator again, you get the same vector again. In computing the term is sometimes used fairly loosely/by analogy, as in the light switch example above. Sometimes, though, there is a mathematical function involved that is idempotent in the mathematical sense.
A form of idempotence is implied in "retries ... can't produce duplicate work" in the quote, but it isn't the whole story. Atomicity, for example, is also implied by the whole quote: the idea that an operation always either completes in its entirety or doesn't happen at all. That's independent of idempotence.
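The projection example can be made concrete in a few lines (illustrative only; the function name is mine): applying the projection twice gives exactly what applying it once gives, i.e. f(f(x)) == f(x).

```python
# Mathematical idempotence: f(f(x)) == f(x).
# Projection of a 2D vector onto the x-axis is the canonical example.

def project_onto_x_axis(v):
    """Project a 2D vector (x, y) onto the x-axis, giving (x, 0)."""
    x, y = v
    return (x, 0)

v = (3, 4)
once = project_onto_x_axis(v)
twice = project_onto_x_axis(once)
assert once == twice == (3, 0)  # further applications change nothing
```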
To all the folks saying "durable execution frameworks solve this": you're right, but a lot of what's described in the article isn't quite the same as durable execution à la Temporal. The approach described (transactional outboxes for side-effectful operations, care taken to be idempotent or resumable where possible, and graceful degradation, slowdown, or rate limiting where you can) achieves some of the same properties as a durable execution framework, it's true, but you don't necessarily need to rewrite your code to be fully event-sourced or use a framework to get a lot of those benefits, as the article demonstrates.
Transactional outboxes specifically are one of my favorite patterns: they're not too hard to add and don't require changing many core invariants of your system. If you already use some sort of message bus or queue, making publishes to it transactional under a given RDBMS is often as simple as adding some client-side code and making sure that logical message deduplication is present where appropriate: https://microservices.io/patterns/data/transactional-outbox....
If you use a separate message broker (Kafka, SQS, RabbitMQ) with this pattern, you’ll also need a sweeper cron job to re-dispatch failed publishes from the outbox table(s) as well.
Bonus points if this can be implemented on top of existing trigger-based audit table functionality.
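A minimal sketch of the pattern, using SQLite for brevity (the table and function names are invented for this example): the business write and the outbox row commit in one transaction, and a separate relay (or the sweeper cron mentioned above) later publishes unpublished rows to the broker.

```python
import sqlite3

# Transactional outbox sketch. The order row and its outbox row commit
# atomically; if the transaction fails, neither exists, so no message
# can be published for an order that was never recorded (and vice versa).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY, topic TEXT, payload TEXT, "
    "published INTEGER DEFAULT 0)"
)

def place_order(item):
    with conn:  # one transaction: both rows commit or neither does
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("order_placed", f'{{"order_id": {cur.lastrowid}}}'),
        )

place_order("widget")

# The relay process would poll for unpublished rows, publish each to the
# broker, and then mark it published.
rows = conn.execute(
    "SELECT id, topic, payload FROM outbox WHERE published = 0"
).fetchall()
```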
Durable execution has already been mentioned as the existing solution for this problem, but I would like to call out a specific pattern that DE makes obsolete: the outbox pattern. Imagine just being able to do
send a()
send b()
And know both will be sent at least once, without having to introduce an outbox and re-architect your code to use a message relay. We can nitpick the details, but being able to "just write normal code" and get strong guarantees is, imo, real progress.
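The core mechanism behind that guarantee can be sketched in a few lines. This is a toy journal, not any real framework's API (the class and names are invented): each completed step is recorded durably, so after a crash and restart, already-completed steps are skipped and the remaining ones run, giving at-least-once execution of each step.

```python
import json
import os
import tempfile

# Toy durable-execution sketch: a journal of completed step names.
# On restart, steps found in the journal are skipped rather than re-run.

class Workflow:
    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.done = set()
        if os.path.exists(journal_path):
            with open(journal_path) as f:
                self.done = set(json.load(f))

    def step(self, name, fn):
        if name in self.done:      # completed before a crash: skip
            return
        fn()                       # at-least-once: may re-run if we
        self.done.add(name)        # crash between fn() and the save
        with open(self.journal_path, "w") as f:
            json.dump(sorted(self.done), f)

sent = []
journal = os.path.join(tempfile.mkdtemp(), "journal.json")

wf = Workflow(journal)
wf.step("send_a", lambda: sent.append("a"))
wf.step("send_b", lambda: sent.append("b"))

# Simulate a crash and restart: a fresh Workflow over the same journal
# skips both steps, so nothing is sent twice.
wf2 = Workflow(journal)
wf2.step("send_a", lambda: sent.append("a"))
wf2.step("send_b", lambda: sent.append("b"))
```

Real frameworks add much more (retries, timers, queues, versioning), but the "journal what happened, skip it on replay" idea is the heart of it.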
I get that it is particularly valuable in that scenario by treating other services as "external APIs", but monoliths also call external APIs and delegate work to async tasks. The principles discussed here are interesting beyond just microservices, while being lighter and simpler than Durable Execution.
> Didn't we get to the point where we realized that microservices cause too much trouble down the road?
That's a largely ignorant opinion to have. Like any architecture, microservices have clear advantages and tradeoffs. It makes no sense to throw vague blanket statements at an architectural style because you assume it "causes trouble", particularly when you know nothing about the requirements or constraints, and all architectures are far from bulletproof.
It sounds like a good way to make sure you don't overcharge your customers when handling such requests at scale. Failure and duplication will happen, and when you serve enough requests they will happen often enough to occupy engineering with investigation and resolution efforts forwarded from customer support.
Being prepared for these things to happen and having code in place to automatically prevent, recognize and resolve these errors will keep you, the customers and everyone in between sane and happy.
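One common way to "automatically prevent, recognize and resolve" duplicate charges is an idempotency key, sketched below (the function and names are invented for illustration; a real system would persist the key store and scope keys per customer): the client sends the same key on every retry, and the server charges at most once per key, replaying the stored result for duplicates.

```python
# Idempotency-key deduplication sketch for a charge endpoint.
# Retries carry the same key, so the charge happens at most once per key.

processed = {}  # idempotency_key -> stored result (persist this in reality)

def charge(idempotency_key, amount_cents):
    if idempotency_key in processed:
        # Duplicate request (e.g. a network retry): replay the stored
        # result instead of charging again.
        return processed[idempotency_key]
    result = {"charged": amount_cents}  # the actual side effect happens here
    processed[idempotency_key] = result
    return result

first = charge("req-123", 500)
retry = charge("req-123", 500)  # same key: no second charge
assert retry == first and len(processed) == 1
```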
> That sounds like a lot of over engineering and a good way to never complete the project. Perfect is the enemy of good.
Strong disagree. Addressing expectable failure modes is not over engineering. It's engineering. How do you put together a system without actually thinking through how it fails, and how to prevent those failure scenarios from taking down your whole operation?
I have bad news for everyone. Nothing in computing is synchronous. Every time we pretend otherwise and call it something else, we have a potential failure under the right circumstances.
The more your design admits this the safer it will be. There are practical limits to this which you have to determine for yourself.
> I have bad news for everyone. Nothing in computing is synchronous.
I think you need to sit this one out. This sort of vacuous pedantry does no one any good, and ignores that it's perfectly fine to model and treat some calls as synchronous, including plain old HTTP ones. Just because everything is a state machine does not mean you buy yourself anything of value by modeling everything as a state machine.
Yeah no…many things are synchronous. I think you’re tilted about the fact that many things are “synchronous*” and that there can be important nuance hidden within the asterisk, but plenty of stuff is synchronous by default.
TCP connect(2) is synchronous. Making a directory on a local filesystem is synchronous. fsync(2) is synchronous. Committing an RDBMS transaction is synchronous.
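In practice these calls block until the operation has completed or failed, which is the sense of "synchronous" being defended here. A small sketch using the filesystem examples:

```python
import os
import tempfile

# These calls only return once the operation is done (or has failed):
# in that sense they are synchronous, whatever buffering happens below.

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.fsync(fd)            # blocks until the data is handed to the device
os.close(fd)

os.mkdir(path + ".d")   # blocks until the directory exists
```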
fsociety|1 month ago
Now, any system I’ve seen designed around exactly once is complete garbage.
compressedgas|1 month ago
I thought that was what 'idempotent' meant.
locknitpicker|1 month ago
You don't have idempotent crashes.
michalc|1 month ago
I am particularly not a fan of doing unnecessary work/over-engineering, e.g. see https://charemza.name/blog/posts/agile/over-engineering/not-..., but even I think that sometimes things _are_ worth it.