mprime1|1 year ago
I didn’t read past this:
> Systems like Apache ZooKeeper or Postgres (via Advisory Locks) provide the required building blocks for this
Zookeeper is the original sin, convincing a whole generation of programmers that distributed locks are a feasible solution.
This is my biggest pet peeve in distributed systems.
——
And if you don’t believe me, maybe you’ll trust Kyle K. of Jepsen fame:
> However, perfect failure detectors are impossible in asynchronous networks.
Links to: https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/p225-...
p1necone|1 year ago
'Technically' intractable problems are solvable just fine in a way that is almost as useful as solving them completely if you can achieve one of two things:
* Reliably identify when you've encountered an unsolvable case (usefulness of this approach depends on the exact problem you're solving).
or
* Reduce the probability of unsolvable cases/incorrect solutions to a level low enough to not actually happen in practice.
'Technically' GUIDs are impossible, reliable network communication (TCP) is impossible, and O(n^2) algorithms will grow to unusably large running times - but in practice all of these things are used constantly to solve real problems.
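To put a number on the GUID case: the collision probability of random 122-bit IDs (as in UUIDv4) follows the birthday bound. A quick sketch:

```python
import math

def collision_probability(n_ids, bits=122):
    """Birthday bound: P(collision) ~ 1 - exp(-n(n-1) / 2^(bits+1)).
    expm1 keeps precision when the probability is vanishingly small."""
    space = 2 ** bits
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))

# Even a billion random v4 UUIDs give roughly 1e-19 odds of any collision:
# "impossible in theory" becomes "never happens in practice".
print(collision_probability(10**9))
```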
mprime1|1 year ago
I've seen way too many systems where people assume TCP == perfectly reliable and distributed locks == mutual exclusion. Which is of course not the case.
everforward|1 year ago
I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.
Questions that should be simple become complicated. What happens when the distributed locking system is down? What happens when we reboot all the nodes at once? What if they don't come down at exactly the same time and there's leader churn for like 2 minutes? Etc, etc.
Those questions should be fairly simple, but become something where a senior dev is having to trace codepaths and draw on a whiteboard to figure it out. It's not even enough to understand how a single node works in-depth, they have to figure out how this node works but also how this node's state might impact another node's.
All of this is much simpler in leaderless systems (where the leader system is replaced with idempotency or a scheduler or something else).
I very strongly prefer avoiding leader systems; it's a method of last resort when literally nothing else will work. I would much rather scale a SQL database to support the queries for idempotency than deal with a leader system.
I've never seen an idempotent system switch to a leader system, but I've sure seen the reverse a few times.
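The idempotency-over-SQL approach is roughly this (a minimal sketch with sqlite3 standing in for the real database; table and column names are made up):

```python
import sqlite3

# Sketch of idempotency without a leader: the PRIMARY KEY makes the first
# writer win, and a retried/duplicate request becomes a harmless no-op.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed (request_id TEXT PRIMARY KEY, result TEXT)")

def handle(request_id, payload):
    try:
        conn.execute("INSERT INTO processed VALUES (?, ?)",
                     (request_id, payload.upper()))  # the "work" is just upper()
        conn.commit()
    except sqlite3.IntegrityError:
        pass  # duplicate delivery: keep the original result
    return conn.execute("SELECT result FROM processed WHERE request_id = ?",
                        (request_id,)).fetchone()[0]

print(handle("req-1", "hello"))  # HELLO
print(handle("req-1", "other"))  # still HELLO: the retry changed nothing
```

No node needs to know what any other node is doing; the uniqueness constraint is the only coordination point.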
anothername12|1 year ago
> I too hate this. Not just because the edge cases exist, but also because of the related property: it makes the system very hard to reason about.
I think this is a huge problem with the way we’re developing software now. Distributed systems are extremely difficult for a lot of reasons, yet it’s often our first choice when developing even small systems!
At $COMPANY we have hundreds of lambdas, DocumentDB (which, by the way, is hell in case you’re considering it) and other cloud storage and queuing components. On-call and bug hunts are basically quests to find some corner-case race condition/timing problem, read-after-write assumption, etc.
I’m ashamed to say, we have reads wrapped in retry loops everywhere.
The whole thing could have been a Rails app built by a fraction of the team, with a massive increase in reliability, a system that's easier to reason about, and a better time delivering features.
You could say we’re doing it wrong, and you’d probably be partly right for sure, but I’ve done consulting for a decade at dozens of other places and it always seems like this.
mprime1|1 year ago
Atomic Broadcast (via Paxos or RAFT) does not depend on partial synchrony assumptions to maintain its safety properties.
Your internet or intranet networks are definitely asynchronous, and assuming delays are bounded is a recipe for building crappy systems that will inevitably fail on you in hard-to-debug ways.
gunnarmorling|1 year ago
> leader election will only ever be eventually correct... So you’ll always need to be prepared to detect and fence off work done by a previous leader.
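The fence-off step can be sketched with monotonically increasing fencing tokens handed out by the lock/leader service; the storage side rejects anything carrying a stale token (names and numbers here are illustrative, not from any particular lock service):

```python
# The lock service hands out a token that increases on every acquisition.
# Storage remembers the highest token it has seen and fences off older ones,
# so a paused-and-resumed "previous leader" can no longer do damage.
class FencedStore:
    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        if token < self.highest_token:
            return False  # stale leader: fence it off
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(33, "x", "from-old-leader")       # accepted
assert store.write(34, "x", "from-new-leader")       # new leader, accepted
assert not store.write(33, "x", "old-leader-wakes")  # rejected: stale token
```

Note the check lives on the storage side; the lock alone cannot provide this guarantee.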
mprime1|1 year ago
I could not get past the point where you promulgate the idea that ZK can be used to implement locks.
Traditionally a 'lock' guarantees mutual exclusion between threads or processes.
"Distributed locks" are not locks at all. They look the same from API perspective, but they have much weaker properties. They cannot be used to guarantee mutual exclusion.
I think any mention of distributed locks / leader election should come with a giant warning: THESE LOCKS ARE NOT AS STRONG AS THE ONES YOU ARE USED TO. Skipping this warning is doing a disservice to your readers.
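A toy timeline of the violation (all timings and names invented for illustration): client A acquires a lease-based "lock", stalls past the lease TTL (GC pause, VM migration, network hiccup), client B acquires, and A resumes still believing it holds the lock:

```python
import itertools

clock = itertools.count()  # fake time: each call to next() is one tick

class LeaseLock:
    TTL = 3  # lease length in ticks

    def __init__(self):
        self.holder = None
        self.expires_at = -1

    def acquire(self, client, now):
        # The service can only grant based on its own clock; it cannot know
        # whether the previous holder is dead or merely paused.
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = client, now + self.TTL
            return True
        return False

lock = LeaseLock()
assert lock.acquire("A", now=next(clock))   # t=0: A holds the lease until t=3
for _ in range(4):                          # A stalls; the world keeps ticking
    next(clock)
assert lock.acquire("B", now=next(clock))   # t=5: lease expired, B is granted
# A resumes here, still believing it holds the lock: two "owners" at once.
print(lock.holder)  # B
```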
realaleris149|1 year ago
[1] https://en.wikipedia.org/wiki/Two_Generals%27_Problem
mprime1|1 year ago
TCP guarantees order [as long as the connection is active] but it is far from being 'perfectly reliable'.
Example: sender sends, connection drops, sender has no idea whether the receiver received.
In other words, it works until it doesn't. The fact that sometimes it doesn't means it's not perfect.
TCP is a great tool, but it doesn't violate the laws of physics. The Two Generals problem is and always will be impossible.
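A concrete illustration of the sender's blind spot (loopback TCP, purely illustrative): a successful send only means the bytes reached a kernel buffer, not the peer application:

```python
import socket

# A successful send only proves the bytes reached a local kernel buffer.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"attack at dawn")  # returns without error...
send_succeeded = True
conn.close()   # ...but the receiver dies without ever reading a byte,
cli.close()    # and the sender's "success" told it nothing about delivery.
srv.close()
```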
akira2501|1 year ago
That failures are a possibility does not concern me. That the failures are not fully characterized does.
mprime1|1 year ago
Atomic Broadcast guarantees order of delivery. It does not (cannot) guarantee timing of delivery. Which is what people want and expect when using distributed lock / leader election.
tanelpoder|1 year ago
[1] https://en.wikipedia.org/wiki/STONITH
[2] https://docs.oracle.com/en/operating-systems/oracle-linux/8/...
butterisgood|1 year ago
At some point, what separates distributed systems that work well from those that don't is the “art of tuning timeouts and retries”.
Also nothing in production is perfect - so we should consider failures always when writing code in distributed systems and the impacts.
And we will still make mistakes…
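One common piece of that art is capped exponential backoff with jitter; a sketch (parameters are arbitrary, and delays are computed rather than slept so the policy is easy to test):

```python
import random

def backoff_delays(base=0.1, cap=10.0, attempts=6, rng=None):
    """Capped exponential backoff with 'full jitter': each delay is drawn
    uniformly from [0, min(cap, base * 2**i)], which spreads out retry
    storms when many clients fail at the same moment."""
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

delays = backoff_delays(rng=random.Random(42))  # seeded for reproducibility
assert len(delays) == 6
assert all(0 <= d <= 10.0 for d in delays)
```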
orf|1 year ago
For some definition of impossible, given that many systems utilise them effectively. Not all corner cases or theoretical failure modes are relevant to everyone.
someguy4242|1 year ago
I know you linked the first paper, but yeah candidly I don’t wish to read the full paper
Don’t mean this sarcastically whatsoever. Genuine interest.
mprime1|1 year ago
When programmers think of locks, they think of something that can be used to guarantee mutual exclusion.
Distributed locks have edge cases where mutual exclusion is violated.
Implementation does not matter.
e.g. imagine someone shows you a design for a perpetual motion machine. You don't need to know the details to know it doesn't work! It would violate the laws of physics!
Similarly, anyone telling you they created a safe implementation of a distributed lock is claiming their system breaks the laws of information theory.
"Distributed locks" are at best contention-reduction mechanisms. i.e. they can keep multiple processes from piling up and slowing each other down.
Some Paxos variants, for example, use leader election to streamline the protocol and achieve high throughput. But Paxos safety does NOT depend on it. If multiple leaders are active (which will inevitably happen), the protocol still guarantees its safety properties.