item 21955234

Mutexes are faster than Spinlocks

373 points | erickt | 6 years ago | matklad.github.io

145 comments


twic|6 years ago

The author has an implicit definition of "faster" which it is important to be aware of.

The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

That is, if you have a worker task which is waiting for a supervisor task to tell it to do something, then to minimise the time between the supervisor giving the order and the worker getting to work, use a spinlock.

For this to really work, you need to implement both tasks as threads pinned to dedicated cores, so they won't be preempted. You will burn a huge amount of CPU doing this, and so it won't be "faster" from a throughput point of view. But the latency will be as low as it's possible to go.
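To make this concrete, here is a minimal test-and-test-and-set spinlock in Rust (an illustrative sketch, not the benchmark's code; thread pinning is omitted since it is OS-specific):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// A pure spinlock: waiters burn CPU instead of sleeping, which minimizes
// wake-up latency at the cost of throughput.
struct SpinMutex<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

// Safety: access to `data` is serialized by `locked`.
unsafe impl<T: Send> Sync for SpinMutex<T> {}

impl<T> SpinMutex<T> {
    fn new(t: T) -> Self {
        SpinMutex { locked: AtomicBool::new(false), data: UnsafeCell::new(t) }
    }

    fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Test-and-test-and-set: spin on a plain load and attempt the swap
        // only when the lock looks free, reducing cache-line traffic.
        while self.locked.swap(true, Ordering::Acquire) {
            while self.locked.load(Ordering::Relaxed) {
                std::hint::spin_loop(); // PAUSE on x86
            }
        }
        let r = f(unsafe { &mut *self.data.get() });
        self.locked.store(false, Ordering::Release);
        r
    }
}

fn spin_demo() -> u64 {
    let m = Arc::new(SpinMutex::new(0u64));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let m = Arc::clone(&m);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    m.with_lock(|n| *n += 1);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    m.with_lock(|n| *n)
}

fn main() {
    assert_eq!(spin_demo(), 40_000);
    println!("final count: 40000");
}
```

Without pinning, any of these spinning threads can be preempted while holding the lock, which is exactly the pathology the article measures.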

simias|6 years ago

>The main use of spinlocks that i'm aware of is minimising latency in inter-processor communication.

The main use of spinlocks that I'm aware of is dealing with interrupt handlers in driver code. In this situation you generally can't go to sleep (sleeping with interrupts disabled is generally a good way of never waking up) so calling "mutex_lock" is simply out of the question. That's probably a niche use but that's literally the only situation I've ever used actual spin locks instead of mutexes, mainly for the reasons outlined by TFA.

gpderetta|6 years ago

Also, if both threads are pinned to separate cores and nothing else is supposed to run on those cores, it is pointless to use anything but spinlocks: there is no other thread that could make better use of the core (and you probably do not want the core to go into a low-power state waiting for an interrupt).

Snelius|6 years ago

Absolutely! Spinlocks are an inter-processor mechanism, while mutexes are an inter-process one. That's the principal difference.

jakemal|6 years ago

What is a use case for optimizing interprocessor latency?

nathanielherman|6 years ago

This experiment is a bit weird. If you look at https://github.com/matklad/lock-bench, this was run on a machine with 8 logical CPUs, but the test is using 32 threads. It's not that surprising that running 4x as many threads as there are CPUs doesn't make sense for spin locks.

I did a quick test on my Mac using 4 threads instead. At "heavy contention" the spin lock is actually 22% faster than parking_lot::Mutex. At "extreme contention", the spin lock is 22% slower than parking_lot::Mutex.

Heavy contention run:

  $ cargo run --release 4 64 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 64 10000 100`
  Options {
      n_threads: 4,
      n_locks: 64,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 2.822382ms   min 1.459601ms   max 3.342966ms  
  parking_lot::Mutex   avg 1.070323ms   min 760.52µs     max 1.212874ms  
  spin::Mutex          avg 879.457µs    min 681.836µs    max 990.38µs    
  AmdSpinlock          avg 915.096µs    min 445.494µs    max 1.003548ms  

  std::sync::Mutex     avg 2.832905ms   min 2.227285ms   max 3.46791ms   
  parking_lot::Mutex   avg 1.059368ms   min 507.346µs    max 1.263203ms  
  spin::Mutex          avg 873.197µs    min 432.016µs    max 1.062487ms  
  AmdSpinlock          avg 916.393µs    min 568.889µs    max 1.024317ms  
Extreme contention run:

  $ cargo run --release 4 2 10000 100
      Finished release [optimized] target(s) in 0.01s
      Running `target/release/lock-bench 4 2 10000 100`
  Options {
      n_threads: 4,
      n_locks: 2,
      n_ops: 10000,
      n_rounds: 100,
  }

  std::sync::Mutex     avg 4.552701ms   min 2.699316ms   max 5.42634ms   
  parking_lot::Mutex   avg 2.802124ms   min 1.398002ms   max 4.798426ms  
  spin::Mutex          avg 3.596568ms   min 1.66903ms    max 4.290803ms  
  AmdSpinlock          avg 3.470115ms   min 1.707714ms   max 4.118536ms  

  std::sync::Mutex     avg 4.486896ms   min 2.536907ms   max 5.821404ms  
  parking_lot::Mutex   avg 2.712171ms   min 1.508037ms   max 5.44592ms   
  spin::Mutex          avg 3.563192ms   min 1.700003ms   max 4.264851ms  
  AmdSpinlock          avg 3.643592ms   min 2.208522ms   max 4.856297ms

hinkley|6 years ago

The top comment opens up the concept of latency versus throughput. My interpretation is that this experiment is demonstrating that optimizing only for latency has consequences elsewhere in the system. Which is not surprising at all, but then again I spend a lot of time explaining unsurprising things.

I remember a sort of sea change in my thinking on technical books during a period where I tended to keep them at work instead of at home. I noticed a curious pattern in which ones were getting borrowed and by whom. Reading material isn't only useful if it has something new to me in it. It's also useful if it presents information I already know and agree with, in a convenient format. Possibly more useful, in fact.

rwem|6 years ago

If you only have 4 threads it is likely that all your CPUs are sharing caches and you won't see the real downside of the spinlock. They don't really fall apart until you have several sockets.

ww520|6 years ago

The modern user-mode pthread mutex is implemented with the futex kernel syscall [1], which is avoided entirely in the uncontended case, so it can be very fast, acquiring the lock entirely in user-mode code. I'm not sure whether Rust's mutex API is a wrapper over the pthread mutex or calls the kernel directly.

Basically the user mode mutex lock is implemented as:

    // In user mode: if the lock word is free (0), CAS it to 1 and we own the lock
    while !atomic_compare_and_swap(&lock, 0, 1)
        // Lock is held: ask the kernel to put us to sleep until the word changes
        futex_wait(&lock, 1)  // syscall; sleeps only while *lock still equals 1
When futex_wait() returns, the flag has most likely been set back to 0 by the other thread's unlock and the kernel has woken my thread, so I can check again. However, another thread can come in and grab the lock in the meantime, so I need to loop back and retry. The atomic CAS operation is what actually acquires the lock.

[1] https://github.com/lattera/glibc/blob/master/nptl/pthread_mu...

Edit: the atomic_compare_and_swap can be just a macro for the assembly CMPXCHG, so it's very fast to acquire the lock if no one else is holding it.

sharken|6 years ago

This is pretty much the conclusion in this game related post on spinlocks: https://probablydance.com/2019/12/30/measuring-mutexes-spinl...

jakswa|6 years ago

Oh wow/yikes, Linus Torvalds commented on that recently: https://www.realworldtech.com/forum/?threadid=189711&curpost...

"So you might want to look into not the standard library implementation, but specific locking implentations for your particular needs. Which is admittedly very very annoying indeed. But don't write your own. Find somebody else that wrote one, and spent the decades actually tuning it and making it work.

Because you should never ever think that you're clever enough to write your own locking routines.. Because the likelihood is that you aren't (and by that "you" I very much include myself - we've tweaked all the in-kernel locking over decades, and gone through the simple test-and-set to ticket locks to cacheline-efficient queuing locks, and even people who know what they are doing tend to get it wrong several times).

There's a reason why you can find decades of academic papers on locking. Really. It's hard."

dcolkitt|6 years ago

> Second, the uncontended case looks like
> parking_lot::Mutex avg 6ms min 4ms max 9ms

This estimate is way too high for the uncontested mutex case. On a modern Linux/Xeon system using GCC, an uncontested mutex lock/unlock is well under 1 microsecond.

I have a lot of experience here from writing low-latency financial systems. The hot path we use is littered with uncontested mutex lock/unlock, and the whole path still runs under 20 microseconds. (With the vast majority of that time unrelated to mutex acquisition.)

The benchmark used in the blog post must be spending the vast majority of its time in some section of code that has nothing to do with lock/unlock.

gpm|6 years ago

You're misreading the benchmark, that's 6ms for 10,000 lock/unlocks per thread, 320,000 lock/unlocks total. In other words 0.6 microseconds per thread per lock.

atq2119|6 years ago

This is the time for whole benchmark run, not for an individual lock/unlock. The article is quite clear on that.

ncmncm|6 years ago

As we say in low-latency finance, "a microsecond is an eternity."

If you have threads interacting, whether via mutexes or spinlocks, you have a high-latency system.

kcolford|6 years ago

This absolutely makes sense in userspace. The most important part of a spinlock in an OS is that you can yield to the scheduler instead of taking up CPU time on the core. But that defeats the purpose of using a spinlock in userspace, because you still have to make a syscall.

temac|6 years ago

A spinlock in the kernel typically only spins. You use it for the cases when you can't schedule... so it had better be held for a short time only.

But the concept of "a short time" does not even exist deterministically in userspace, usually, because a thread can always be preempted. So don't use pure spinlocks in userspace, unless you really, really know what you are doing (and that includes knowing how your kernel works in great detail, in the context of how you use it).

extropy|6 years ago

How about a new opcode: wait until a read of a memory address equals a given value? That would allow implementing a power-efficient spinlock.

Oh there is one already. Meet PAUSE: https://www.felixcloutier.com/x86/pause

Edit: related post from 2018 https://news.ycombinator.com/item?id=17336853
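For illustration: Rust exposes this hint portably as std::hint::spin_loop(), which lowers to PAUSE on x86. A bounded spin-wait might look like this (spin_wait_for is a hypothetical helper, not from the article):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Spin until `flag` holds `expected`, using the CPU's spin-wait hint
// (PAUSE on x86, YIELD on ARM) to save power and free up pipeline
// resources for the sibling hyperthread. Returns true if the value was
// observed within `max_spins` iterations.
fn spin_wait_for(flag: &AtomicU32, expected: u32, max_spins: u32) -> bool {
    for _ in 0..max_spins {
        if flag.load(Ordering::Acquire) == expected {
            return true;
        }
        std::hint::spin_loop();
    }
    false
}

fn main() {
    let flag = AtomicU32::new(0);
    assert!(!spin_wait_for(&flag, 1, 100)); // nobody set it: times out
    flag.store(1, Ordering::Release);
    assert!(spin_wait_for(&flag, 1, 100)); // observed on the first check
}
```

Note that PAUSE only makes spinning cheaper; unlike MONITOR/MWAIT it does not let the core actually sleep.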

jnordwick|6 years ago

MONITOR/MWAIT will get you part of the way: a thread can pause and the CPU can rest until a store to an address range, after which the waiting thread is allowed to continue. You can't wait on a specific value, though.

gok|6 years ago

Note that all of the locks tested here are unfair, which is why they all show very high waiting variance. Until recently many mutex implementations aimed for fairness, which made them much slower than spinlocks in microbenchmarks like this.

jnordwick|6 years ago

This comes around every so often, and it isn't very interesting, in that the best mutexes basically spin once or a few times and then fall back to a sleeping lock. It isn't a true pure-spinlock vs pure-sleeping-lock (mutex/futex) fight.

I think the linux futex can be implemented through the VDSO (can somebody correct me on this?), so that eliminates the worst of the syscall costs.

His benchmark is weird, but maybe I'm reading it wrong:

* Instead of reducing thread count to reduce contention, he appears to increase the number of locks available. This is still a bad scenario for spinlocks, since they will still interact badly with the scheduler (they will use a large amount of CPU time, get evicted from the run queue, and need to be rescheduled).

* Also, I didn't see him pin any of the threads, so all those threads will start sharing some CPUs, since the OS does need to do some work on them too.

* And Rust can be a little hard to read, but it seems he packed his locks onto the same cache line? I don't see any padding in his AmdSpinlock struct. That would be a huge blow for the spinlock because of false sharing: it still gets all the cache coherence traffic.

The worst cases for the spinlock are not understanding scheduling costs and the cache thrashing that can occur.

What they call the AMD spinlock (basically just a regular, sane spinlock that tries to prevent cache thrashing) has its best performance with a low number of threads, assigned to different cores under the same L3 segment.

(Does anybody know if AMD's new microarchitecture went away from the MOESI/directory based cache towards Intel's MESIF/snoop model?)

The MOESI model might have performed better in this regard under the worst-case scenario, since it doesn't need to write the cache line back and can just forward the dirty line around, as it keeps track of who owns it.

And if you run under an MESIF-based cache and you can keep your traffic local to your L3 segment, you are backstopped there and never need to go anywhere else.

A spinlock is a performance optimization and should be treated as one. You need to have intimate knowledge of the architecture you are running under and the resources being used.

(edit: answered my own question; apparently the VDSO is still pretty limited in what it exports, so no, it isn't happening at this time from what I can tell.)

atq2119|6 years ago

The locks should be on separate cachelines, that's what the CachePadded::new is for.
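The padding trick can be sketched by hand with #[repr(align(64))] (crossbeam's CachePadded does roughly this; the 64-byte line size is an assumption that holds on most x86 parts, while some CPUs use 128):

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicBool;

// Force each lock onto its own 64-byte cache line so that contention on
// one lock doesn't drag its neighbours' lines around (false sharing).
#[repr(align(64))]
struct CacheLinePadded<T>(pub T);

fn main() {
    // Unpadded: eight one-byte locks share a single cache line.
    assert_eq!(size_of::<[AtomicBool; 8]>(), 8);
    // Padded: every lock gets a full line to itself.
    assert_eq!(align_of::<CacheLinePadded<AtomicBool>>(), 64);
    assert_eq!(size_of::<[CacheLinePadded<AtomicBool>; 8]>(), 8 * 64);
}
```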

futex cannot be implemented in the VDSO since it needs to call into the scheduler.

Another way to think about this: VDSO is used for syscalls that are (mostly) read-only and can avoid the kernel mode switch on a common fast path. The futex syscall is already the raw interface that is designed on the assumption that the caller only uses it in the slow path of whatever higher-level synchronization primitive they're implementing, so trying to use VDSO tricks to implement futex would be redundant.

gpderetta|6 years ago

> linux futex can be implemented through the VDSO (can somebody correct me on this), so that eliminates the worst of the syscall costs.

The semantics of a futex wait is a request to the kernel to put the thread to sleep until another thread issues a signal for the same "key".

The trick is that the key is the address of a 32 bit memory location (provided to the kernel in both the signal and wait syscalls), which is normally the mutex address.

So a process first tries to acquire the mutex as for a normal spinlock (with a CAS, exchange, xadd, or whatever works for the specific algorithm), attempting to set the lock bit to 1; if that fails (possibly after spin-trying a few times), it invokes futex wait. As you can see, the syscall is only made on the slow path.

On a futex release, the simple solution is to always invoke the futex signal syscall after setting the lock bit back to zero [1]. To fast-path the release, a waiter bit can be set on the acquire path just before calling futex wait; then on the release path, when setting the lock bit to zero, signal is only called if the waiter bit was set (it is cleared together with the lock bit) [2].

As you can see, the futex syscall is already the slow path and never needs to be invoked in the uncontended case. In fact, the kernel doesn't even need to know about the futex until the first contention.

[1] Futexes are edge-triggered, like condition variables, so a signal only wakes threads blocked on a wait that happened-before the signal call. Thus there would be a race condition if a lock release and signal happened between the failed acquire attempt and the wait call; to prevent this, futex wait takes an additional parameter, the expected value of the memory location: the kernel checks the futex address against this value and puts the thread to sleep only if it hasn't changed (the check and the sleep are done atomically).

[2] As there can be multiple waiters, a broadcast is actually needed here, which leads to a thundering herd as all waiters race to acquire the lock. The original futex paper used a wait count instead of just a bit, but there are other options.
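The scheme described above can be sketched in Rust as a three-state lock (0 = unlocked, 1 = locked, 2 = locked with possible waiters). In this portable illustration thread::yield_now() stands in for the futex wait/wake syscalls, so it is not a real futex mutex, but the fast path and the waiter-state bookkeeping are the same:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

struct FutexishMutex {
    state: AtomicU32, // 0 = unlocked, 1 = locked, 2 = locked + maybe waiters
}

impl FutexishMutex {
    fn new() -> Self {
        FutexishMutex { state: AtomicU32::new(0) }
    }

    fn lock(&self) {
        // Fast path: uncontended acquire is a single CAS, no syscall.
        if self
            .state
            .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: advertise that we are waiting (state = 2), then block.
        while self.state.swap(2, Ordering::Acquire) != 0 {
            // futex_wait(&self.state, 2) would go here; the kernel only
            // sleeps if the word still reads 2, closing the lost-wakeup race.
            thread::yield_now();
        }
    }

    fn unlock(&self) {
        // Only a previous state of 2 means someone may be blocked, so only
        // then would futex_wake(&self.state, 1) be needed.
        if self.state.swap(0, Ordering::Release) == 2 {
            // futex_wake would go here.
        }
    }
}

struct Shared {
    m: FutexishMutex,
    val: UnsafeCell<u32>,
}
// Safety: access to `val` is guarded by `m`.
unsafe impl Sync for Shared {}

fn demo() -> u32 {
    let s = Arc::new(Shared { m: FutexishMutex::new(), val: UnsafeCell::new(0) });
    let hs: Vec<_> = (0..4)
        .map(|_| {
            let s = Arc::clone(&s);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    s.m.lock();
                    unsafe { *s.val.get() += 1 }
                    s.m.unlock();
                }
            })
        })
        .collect();
    for h in hs {
        h.join().unwrap();
    }
    s.m.lock();
    let v = unsafe { *s.val.get() };
    s.m.unlock();
    v
}

fn main() {
    assert_eq!(demo(), 40_000);
}
```

Swapping in state 2 on the slow path is conservative (the unlocker may signal when no one is left waiting), which matches the "wait bit" description above.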

hackworks|6 years ago

Not an expert here.

In a spinlock, the lock state is checked in a tight loop by all waiters. This will use some sort of memory fence. AFAIK, memory fences or barriers flush the CPU cache and force the variable (the spinlock state) to be re-read for evaluation. I would expect spinlock overheads to increase with the number of cores.

On NUMA, I think flushing is more expensive. Hence spinlocks have the additional overhead of having to load and evaluate the state on every spin, as opposed to being woken up, as with mutexes (like a callback).

oppositelock|6 years ago

Yes, spin locks require atomic memory operations - as do many other OS primitives. CPUs have dedicated instructions to do this more quickly for locks than generalized coherent memory, for example cmpxchg on x86. There are microarchitectural optimizations to make these things work well.

I've been writing MP code since the early 90's; SPARC, MIPS, x86, PPC, SH-4, Alpha, and all have instructions to make this easier for you.

Spin locks are very useful for locks held for short durations, and only on MP systems, particularly in the kernel, since a spin lock there can deadlock the whole system on a single CPU machine. The very general rule of thumb is that you generally want blocking locks, unless there is a very specific reason to use a spin lock. Those very specific reasons might be waiting briefly on some device in a driver, or running a realtime system where wake latency is critical, etc. Most of the time, spin locks will burn CPU for no good reason, and you still have to implement the non-spinlock case anyway, in case you are running on a 1 CPU system. So, might as well start with blocking locks, and only add spin locks where performance would benefit from it.

rwem|6 years ago

It doesn’t flush the entire cache (that would be a disaster) but it does shoot down the cache line containing the lock in all cores other than the one that acquired the lock.

The real issue with spin locks is fairness. There’s no assurance that any given thread will ever make progress. A thread could starve forever. Production-ready mutexes like absl::Mutex make efforts toward fairness, even if they don’t have hard guarantees.

elteto|6 years ago

> checked in a tight loop by all waiters

This actually does not have to be this way. You could have a linked list of spinlocks, one for each waiter. Each waiter spins on its own, unique spinlock. When the previous waiter is done it unlocks the next spinlock, and so on. The implementation gets a bit complicated in non-GC languages, since there are races between insertion/removal on the linked list. If the number of threads is fixed and long-lived then it becomes easier, since instead of a linked list you can have an array of spinlocks.

Note: on some architectures (x86?) you could possibly do away with atomics, since (I believe) aligned int loads and stores are atomic by default. Not really sure though.
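The array variant described here is essentially an Anderson-style array lock: each waiter takes a ticket and spins on its own cache-line-padded slot, so the spinning traffic stays on a private line and hand-off is FIFO. An illustrative Rust sketch (the slot count must bound the number of concurrent waiters):

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

const SLOTS: usize = 8; // assumption: never more than 8 concurrent waiters

// One flag per cache line, so waiters don't false-share while spinning.
#[repr(align(64))]
struct Slot(AtomicBool);

struct ArrayLock {
    slots: [Slot; SLOTS],
    next_ticket: AtomicUsize,
}

impl ArrayLock {
    fn new() -> Self {
        // Slot 0 starts "open" so the first locker proceeds immediately.
        let slots = std::array::from_fn(|i| Slot(AtomicBool::new(i == 0)));
        ArrayLock { slots, next_ticket: AtomicUsize::new(0) }
    }

    /// Take a ticket and spin on our private slot until it opens.
    fn lock(&self) -> usize {
        let ticket = self.next_ticket.fetch_add(1, Ordering::Relaxed) % SLOTS;
        while !self.slots[ticket].0.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        ticket
    }

    /// Close our slot and open the next waiter's: FIFO hand-off.
    fn unlock(&self, ticket: usize) {
        self.slots[ticket].0.store(false, Ordering::Relaxed);
        self.slots[(ticket + 1) % SLOTS].0.store(true, Ordering::Release);
    }
}

fn main() {
    let l = ArrayLock::new();
    let t0 = l.lock();
    assert_eq!(t0, 0);
    l.unlock(t0);
    let t1 = l.lock(); // next ticket, handed off in order
    assert_eq!(t1, 1);
    l.unlock(t1);
}
```

The linked-list version of the same idea is the MCS lock, which avoids the fixed slot bound at the cost of the insertion/removal races mentioned above.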

lowbloodsugar|6 years ago

TFA makes the point that modern "mutex" implementations actually use spinlocks first and only fall back to heavy kernel mutexes if there is contention. So the title is click-baity: mutexes are slower than spinlocks, and the "faster than spinlocks" mutexes in this article are actually spinlocks that fall back to mutexes.

Then the benchmark uses spinlocks in situations that spinlocks aren't great for. And, surprise, spinlocks are slower than spinlocks-with-mutexes.

Spinlocks are great in situations such as:

  * There are far more resources than threads,
  * The probability of actually having to spin is small,
    ideally if the time spent in the lock is a few instructions
  * When you can't use optimistic concurrency*
* because perhaps the number of memory locations to track is too complicated for my poor brain and I can't be arsed to remember how TLA+ works

There's plenty of linked list implementations, for example, that use optimistic concurrency. At that point you've got yourself a thread-safe message queue and that might be better than mutexes, too.
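Optimistic concurrency in miniature looks like this sketch: read the current value, compute a new one, and CAS it in, retrying if another thread got there first (optimistic_update is a generic helper invented for illustration, not from TFA):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Apply `f` to the cell's value atomically, without any lock: no thread
// ever blocks, so there is no lock holder to get preempted at a bad time.
fn optimistic_update(cell: &AtomicU64, f: impl Fn(u64) -> u64) -> u64 {
    let mut cur = cell.load(Ordering::Relaxed);
    loop {
        let new = f(cur);
        match cell.compare_exchange_weak(cur, new, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return new,
            // Someone raced us; retry starting from the freshly observed value.
            Err(observed) => cur = observed,
        }
    }
}

fn main() {
    let c = Arc::new(AtomicU64::new(0));
    let hs: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&c);
            thread::spawn(move || {
                for _ in 0..10_000 {
                    optimistic_update(&c, |v| v + 1);
                }
            })
        })
        .collect();
    for h in hs {
        h.join().unwrap();
    }
    assert_eq!(c.load(Ordering::Relaxed), 40_000);
}
```

Lock-free structures like the Treiber stack or Michael-Scott queue generalize this single-word retry loop to pointers, which is where the memory-reclamation headaches start.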

jakswa|6 years ago

Coming as a ruby developer who dabbles in async rust in my free time, these posts/threads/links have been my best read of 2020 so far. My CS curriculum barely covered real-world lock usage, much less the ties to linux/windows schedulers + modern CPU caching. Big thanks all around to these internet commentators.

Ericson2314|6 years ago

As the person who added a bunch of functionality to spin-rs, making it roughly match the std API: yes, you should not use spinlocks in scheduled code.

That said, I see why Rust makes things so annoying. I want lots of code to work in no-std, so we have a robust embedded and kernel ecosystem. It would be really nice to abstract over the locking implementation with type parameters, but that requires "higher-kinded types", which Rust doesn't have. The alternative is relying on carefully coordinated Cargo features, which is much more flaky and harder to audit (e.g. with the type system). Given that, I am not sure people over-use spinlocks.

lilyball|6 years ago

I'm really curious how macOS's os_unfair_lock compares here.

LeoNatan25|6 years ago

Is that implementation open source? Don’t remember which dyld contains it.

AnanasAttack|6 years ago

Thread scheduling (or waking up cores) is slow. Because of this, mutexes will look better on dumb benchmarks, as the contending threads keep going to sleep while the single successful owner has practically uncontended access.

AstralStorm|6 years ago

There are various degrees of slow, in addition to kernel being smarter about multiple cores and SMT siblings than your application.

Kernel can run your code on a cooled core, giving it higher clock, for example. Ultimately making it run faster.

Of course this won't show up in a benchmark where all the threads do mostly calculation rather than contention, but that's not the typical case. It mostly shows up in compute workloads such as multithreaded video, where latency does not matter one bit.

Typically you have more of a producer/consumer pattern where consumer sleeps, and it's beneficial to run it on a cold CPU, assuming the kernel woke it up beforehand.

Source: I hit some latency issues with an ancient kernel on a nastily hacked big.LITTLE ARM machine. It liked to overheat cores and put heavy tasks on the overheated ones for alleged power saving. (Whereas running a task quicker saves power.)

bubbleRefuge|6 years ago

Couldn't you eliminate the bad spinlock behavior by coding them to go into an efficient wait if too much spinning is going on?

rwem|6 years ago

That’s what virtually all battle-hardened lock libraries do: spin for a bit (but not too tightly, using a pause in the loop) then fall back to waiter lists and futex.

CodeWriter23|6 years ago

Or use a mutex and let the scheduler implement equitable dispatching.

kccqzy|6 years ago

Yes. And then it becomes an adaptive mutex.

shin_lao|6 years ago

Most mutex implementations spin before truly acquiring the lock.

Also, there are better spinlock implementations, such as speculative spinlocks, and queued locks.
