
A proposal for more reliable locks using Redis

54 points | janerik | 12 years ago | antirez.com | reply

26 comments

[+] mjb|12 years ago|reply
The classic "Leases: an efficient fault-tolerant mechanism for distributed file cache consistency" (http://portal.acm.org/citation.cfm?id=74870) dating back to 1989 is a good read about these kinds of systems. It makes some interesting observations about the approach, and introduces the need for bounded drift.

I think antirez is saying "skew" here when "drift" would be more appropriate. The safety property appears to refer to the difference in rates between clocks, rather than the difference in absolute values. That's a much more reasonable assumption, and is likely to be true even with very bad clock hardware over short periods of time.

Obviously the bounded drift assumption,

[+] antirez|12 years ago|reply
> The safety property appears to refer to the different in rates between clocks, rather than the difference in absolute values.

Exactly that, thanks for the correction: indeed the word I used is wrong. Even if it was probably understandable from the context that I was referring to drift, I'm going to replace the term.

[+] Glyptodon|12 years ago|reply
Is this new? I feel like using Redis for locks is something that's been going around for a while. I've used Redis to make locks, and also used it to make counting semaphores. It's a fairly interesting use because it's frequently the simplest, lowest-overhead means to solve a problem that's reliable enough without actually being reliable.

The most obvious issue is that if Redis goes down you could end up with problems if the processes using the locks continue, particularly depending on when and what state Redis restarts.

Another is that you have to take an approach of re-checking out your lock so as not to let it expire if you can't guarantee strict time constraints. Once you do this, you run a risk of something not finishing but extending its lock indefinitely.
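The re-checkout pattern described above can be sketched with a token-checked renewal. This is a minimal illustration using an in-memory dict in place of Redis; the function names (`acquire`, `renew`) and the `STORE` structure are hypothetical, not from the post or any library:

```python
import time
import uuid

# In-memory stand-in for a single Redis instance: key -> (token, expiry).
# Illustrative only; a real implementation would use SET NX PX and a
# token-checking Lua script on the server side.
STORE = {}

def acquire(key, ttl):
    """Take the lock if it is free or expired; return a token, else None."""
    now = time.time()
    entry = STORE.get(key)
    if entry is None or entry[1] <= now:
        token = str(uuid.uuid4())
        STORE[key] = (token, now + ttl)
        return token
    return None

def renew(key, token, ttl):
    """Extend the lock only if we still hold it (token matches, not expired)."""
    now = time.time()
    entry = STORE.get(key)
    if entry is not None and entry[0] == token and entry[1] > now:
        STORE[key] = (token, now + ttl)
        return True
    return False
```

The token check is what prevents one client from extending (or releasing) a lock that has since passed to another client. The risk the comment raises remains, though: a stuck worker that keeps calling `renew` holds the lock indefinitely, which is why renewals are often capped at a maximum count.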

A final issue is that you can end up with a situation where there's no guarantee that a waiting task (or whatever you call something that wants a lock or in on a semaphore) will ever run.

I don't really buy those who talk about this being an insane violation of stateless/share-nothing. When I've needed these kinds of primitives it has rarely had to do with the application state itself - for example, I've used the counting semaphores to control how many worker processes can be active. Likewise, I've used the plain locks (and lock-like structures) to do things like ensure atomic/ordered writes for user sessions (I suppose a session is stateful, but it's also not really shared application state).

In any case, there are some issues, but at the cost of a minimal amount of nursing the ease of implementation and integration often makes Redis a go-to choice for these kinds of things, particularly in resource (hardware) constrained environments. On the other hand if you're operating a scale where you've got multiple datacenters and such, it's a different ballgame.

[+] antirez|12 years ago|reply
It is not new at all; that's exactly why I wrote the blog post, since it is an actual use case. But it is a good idea to formalize a canonical model to do it with Redis in a safe way, and to show what the failure modes are when we resort to simpler designs (like failing over to slaves).
[+] ThePhysicist|12 years ago|reply
I wrote a very similar, Redis-based lock in Python a while ago, here it is:

https://gist.github.com/adewes/6103220

It uses Redis pipelines and watchers to make sure that no race conditions between two processes requiring the same lock occur, and uses "expire" keys to avoid deadlocks.
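The WATCH-based pattern the gist relies on can be illustrated without a server. In redis-py, a transaction built in a pipeline after WATCH is aborted if the watched key changed before EXEC; the sketch below simulates only those semantics with a version counter on a dict. The `FakeRedis` class and `try_lock` function are hypothetical stand-ins, not the gist's actual code:

```python
# Simulation of Redis WATCH/MULTI/EXEC optimistic locking on a plain dict.
# The transaction applies only if the watched key's version is unchanged
# since it was read -- the same guarantee WATCH gives a real pipeline.

class FakeRedis:
    def __init__(self):
        self.data = {}
        self.versions = {}  # key -> monotonically increasing version

    def get(self, key):
        return self.data.get(key)

    def version(self, key):
        return self.versions.get(key, 0)

    def check_and_set(self, key, expected_version, value):
        """Write only if the key has not changed since we read it."""
        if self.versions.get(key, 0) != expected_version:
            return False  # another client modified the key: abort
        self.data[key] = value
        self.versions[key] = expected_version + 1
        return True

def try_lock(r, key, owner):
    """Acquire the lock only if it is currently unset."""
    v = r.version(key)          # plays the role of WATCH
    if r.get(key) is not None:  # lock already held
        return False
    return r.check_and_set(key, v, owner)  # plays the role of MULTI ... EXEC
```

If another client writes the key between the read and the conditional write, `check_and_set` refuses and the caller retries, which is exactly how a WATCH conflict surfaces (as a `WatchError`) in redis-py.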

[+] jcampbell1|12 years ago|reply
> Step 2) It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances.

> so ideally the client should try to send the SET commands to the N instances at the same time using multiplexing.

I am confused. Are the locks requested sequentially, or at the same time? It seems like if they are requested sequentially, then the random backoff time would need to be a large multiple of the combined latency.

[+] antirez|12 years ago|reply
> I am confused. Are the locks requested sequentially, or at the same time? It seems like if they are requested sequentially, then the random backoff time would need to be a large multiple of the combined latency.

As stated in the post, it is ideal if, using multiplexing, we send the SET to all the instances at the same time, but this does not change much the relation between the chosen lock validity time and the latency to set the lock: with 5 instances it still requires something in the "few milliseconds" range on average to set the lock with not very optimized clients, so a validity of 1 second is already three orders of magnitude more.
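The bookkeeping antirez describes can be sketched as a pure function: the lock counts as acquired only if a majority of the N instances granted it, and the time spent acquiring (plus a clock-drift allowance) is subtracted from the TTL to get the remaining validity. The per-instance SET calls are deliberately left out; the function name and parameters here are illustrative:

```python
def redlock_validity(n_instances, n_acquired, ttl_ms, elapsed_ms, drift_ms):
    """Return the remaining validity (ms) if acquisition succeeded, else None.

    n_acquired -- how many of the N instances granted the lock
    elapsed_ms -- wall time spent sending the SET commands
    drift_ms   -- allowance for clock drift between client and servers
    """
    quorum = n_instances // 2 + 1
    validity = ttl_ms - elapsed_ms - drift_ms
    if n_acquired >= quorum and validity > 0:
        return validity
    return None
```

This also shows why the few-milliseconds acquisition latency barely matters against a 1-second TTL: `redlock_validity(5, 3, 1000, 5, 10)` leaves 985 ms of validity, while failing the quorum (`n_acquired=2`) or spending nearly the whole TTL acquiring yields no lock at all.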

[+] nkozyra|12 years ago|reply
It seems like a lock should be able to autorelease in a distributed environment if the acquirer is no longer available. Would this not be considered "safe?"

What about broadcasting acquire/release messages?

[+] antirez|12 years ago|reply
The proposed algorithm provides a safety guarantee which is time bound: once the lock is acquired it has a specified validity time, after this time, it is possible for another client to reacquire it.

In practical terms this forces the protected code path to be "real time", that is, guaranteed to terminate (or to abort) within the specified time.
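One way to honor that real-time requirement is to treat the validity time as a deadline and check it before each step of the protected work, aborting rather than running past the window. A minimal sketch, with hypothetical names (`LockExpired`, `run_with_deadline`):

```python
import time

class LockExpired(Exception):
    """Raised when the lock's validity window elapses mid-task."""
    pass

def run_with_deadline(steps, validity_seconds):
    """Run each step only while the lock's validity has not elapsed.

    'steps' is a list of zero-argument callables; we abort (rather than
    keep mutating the shared resource) once the deadline passes.
    """
    deadline = time.monotonic() + validity_seconds
    results = []
    for step in steps:
        if time.monotonic() >= deadline:
            raise LockExpired("validity window elapsed; aborting")
        results.append(step())
    return results
```

The check happens between steps, so a single long-running step can still overrun; truly bounding the work requires each step itself to be short relative to the validity time.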

[+] oivavoi|12 years ago|reply
So do the locks in the distributed version work something like the NRW concept in Riak?
[+] antirez|12 years ago|reply
The algorithm described more closely resembles a CP system (in its need to successfully lock the majority of instances), since it requires the majority of nodes to be available in order to lock a resource. Also, it is not a tunable algorithm: at least the majority of nodes must be locked in order for the safety to be guaranteed.
[+] eddd|12 years ago|reply
I stopped reading after the first paragraph... When will people learn that using locks and shared memory does not work? It is just wrong. Things should be immutable; you should share nothing. And by nothing I mean nothing at all. It is just WRONG.
[+] mjb|12 years ago|reply
I don't think you can simplify things that far. Shared-nothing is certainly a good architectural goal, and there are many types of distributed systems where it is both highly desirable and achievable. On the other hand, there are many distributed systems (and parallel systems) where coordination is strictly necessary.

Think about implementing something like the TPC-C workload on a distributed database. Many parts of that workload can be sharded, with a shared-nothing model. Other parts need some level of coordination, and some need true serializability (which requires significant coordination). The pieces that can be built on shared-nothing primitives should be built on shared-nothing primitives. However, it's not useful to pretend that the pieces that require coordination simply don't exist.

Similarly, immutability is a good goal. An append-only log of immutable data items is often a very good mental model, and frequently a good implementation model, for distributed databases. It isn't universally applicable, though, and may put constraints on the possible supportable operations that add complexity to other parts of the system.

[+] ericingram|12 years ago|reply
Go create such a system, make it easy to use and widely available, and then you might have room to say such things with some credibility.