The basic rule of writing your own cross-thread data structures like mutexes or condition variables is... don't, unless you have a very good reason to. If you're in that rare circumstance where you know the library you're using isn't viable for some reason, then the next best rule is to use your OS's version of a futex as the atomic primitive, since it's going to solve most of the pitfalls for you automatically.
The only time I've manually written my own spin lock was when I had to coordinate between two different threads, one of which was running 16-bit code, so using any library was out of the question, and even relying on syscalls was sketchy because making sure the 16-bit code is in the right state to call a syscall itself is tricky. Although in this case, since I didn't need to care about things like fairness (only two threads are involved), the spinlock core ended up being simple:
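The snippet this comment leads into is not included in this copy of the thread. As a stand-in, here is a minimal sketch of what a spinlock core for two threads could look like when fairness is not a concern. It is not the commenter's code: it assumes C++11 `std::atomic` is available (which may not hold in the 16-bit setup described), and the names `SpinLock` and `two_thread_demo` are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Minimal test-and-set spinlock: no fairness, no backoff, no parking.
class SpinLock {
public:
    void lock() {
        // exchange() returns the previous value; true means the lock was taken.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // Wait on a plain load so we only retry the exchange once the
            // lock looks free.
            while (locked_.load(std::memory_order_relaxed)) {
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical demo: two threads increment a plain int under the lock.
int two_thread_demo(int iterations) {
    SpinLock lock;
    int counter = 0;
    auto work = [&] {
        for (int i = 0; i < iterations; ++i) {
            lock.lock();
            ++counter;
            lock.unlock();
        }
    };
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    return counter;
}
```

With only two threads there is no queue to keep fair, which is why something this small can be enough in that situation.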
As always: use standard libraries first, profile, then write your own if the data indicate that it's necessary. To your point, the standard library probably already uses the OS primitives under the hood, which themselves do a short userspace spin-wait and then fall back to a kernel wait queue on contention. If low latency is a priority, the latter might be unacceptable.
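To illustrate the "short spin, then kernel wait" shape described above, here is a rough sketch layered on top of `std::mutex`. It does not claim to be how any particular standard library or OS implements its mutex; `lock_spin_then_block` and `max_spins` are hypothetical names, and the right spin count is something you would have to measure.

```cpp
#include <mutex>

// Try the lock a bounded number of times, then block inside the mutex.
// Returns true if the lock was obtained while spinning, false if the
// caller had to fall back to the blocking lock() call.
template <class Mutex>
bool lock_spin_then_block(Mutex& m, int max_spins) {
    for (int i = 0; i < max_spins; ++i) {
        if (m.try_lock()) {
            return true;  // got it without blocking
        }
    }
    m.lock();  // may put the thread to sleep until the lock is released
    return false;
}
```

In both cases the caller owns the lock on return and must call `unlock()`; the return value is only there so the spin hit rate can be profiled.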
Another time when writing a quick and dirty spinlock is reasonable is inside a logging library. A logging library would normally use a full-featured mutex, but what if we want the mutex implementation to be able to log? Say the mutex can log that it is non-recursive yet the same thread is acquiring it twice, or that it has detected a deadlock. The solution is to have a special subset of the logging library use a spinlock instead.
Another somewhat known case of a spinlock is in trading, where for latency purposes the OS scheduler is essentially bypassed by core isolation and thread pinning, so there’s nothing better for the CPU to do than spinning.
I wrote my own spin lock library over a decade ago in order to learn about multithreading, concurrency, and how all this stuff works. I learned a lot!
I struggled with this in Wine. "malloc" type memory allocation involves at least two levels of spinlocks. When you do a "realloc", the spinlocks are held during the copying operation. If you use Vec::push in Rust, you do a lot of reallocs.
In a heavily multithreaded program, this can knock performance down by more than two orders of magnitude. It's hard to reproduce this with a simple program; it takes a lot of concurrency to hit futex congestion.
Real Windows, and Linux, don't have this problem. Only Wine's "malloc", which lives in a DLL, does.
Bug reports resulted in finger-pointing and denial.[1] "Unconfirmed", despite showing debugger output.
Nice article! Yes, using spinlocks in normal userspace applications is not recommended.
One area where I found spinlocks to be useful is in multithreaded audio applications. Audio threads are not supposed to be preempted by other user space threads because otherwise they may not complete in time, leading to audio glitches. The threads have a very high priority (or have a special scheduling policy) and may be pinned to different CPU cores.
For example, multiple audio threads might read from the same sample buffer, whose content is occasionally modified. In that case, you could use a reader-writer-spinlock where multiple readers would be able to progress in parallel without blocking each other. Only a writer would block other threads.
What would be the potential problems in that scenario?
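For reference, the reader-writer spinlock described above could be sketched roughly like this. It is a minimal illustration, not a production lock: the class name `RwSpinLock` and its layout (one writer bit plus a reader count in a single 32-bit word) are assumptions made for the example.

```cpp
#include <atomic>
#include <cstdint>

class RwSpinLock {
public:
    // Single attempt; may fail if a writer holds the lock or another
    // reader changed the count at the same moment.
    bool try_lock_shared() {
        std::uint32_t s = state_.load(std::memory_order_relaxed);
        return (s & kWriter) == 0 &&
               state_.compare_exchange_strong(s, s + 1,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }
    void lock_shared() {
        while (!try_lock_shared()) {
        }
    }
    void unlock_shared() { state_.fetch_sub(1, std::memory_order_release); }

    // A writer needs the word to be exactly 0: no readers, no writer.
    bool try_lock() {
        std::uint32_t expected = 0;
        return state_.compare_exchange_strong(expected, kWriter,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed);
    }
    void lock() {
        while (!try_lock()) {
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }

private:
    static constexpr std::uint32_t kWriter = 1u << 31;
    std::atomic<std::uint32_t> state_{0};  // writer bit | reader count
};
```

Two problems are visible even in this sketch: a steady stream of readers can keep the writer waiting indefinitely, and any thread that gets preempted while holding the lock stalls every thread spinning on it.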
I've heard of issues on Arm devices with properly isolated cores (only one thread allowed, interrupts disabled) because they would interact, through such a spinlock, with other threads that were not themselves isolated. The team replaced it all with a futex and it ended up working better in the end.
Sadly this happened while I was on another project so I don't have the details, but this can be problematic in audio too. To avoid the delay of waking up a thread, you can actually wake it a tiny bit early and then spin (not on a lock), since you know work is incoming.
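One way to read "wake a tiny bit early, then spin" is a timer-based wait: sleep until shortly before the moment the work is expected, then busy-wait the remainder. The sketch below assumes that interpretation; `wait_until_precise` and `spin_margin` are hypothetical names, and a real audio thread might spin on a flag or queue instead of the clock.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Sleep until `spin_margin` before the deadline, then busy-wait the rest.
// Returns the time at which the wait actually finished.
Clock::time_point wait_until_precise(Clock::time_point deadline,
                                     Clock::duration spin_margin) {
    if (deadline - Clock::now() > spin_margin) {
        // Coarse part: let the OS put the thread to sleep.
        std::this_thread::sleep_until(deadline - spin_margin);
    }
    // Fine part: burn the last stretch on the CPU to avoid wake-up latency.
    while (Clock::now() < deadline) {
    }
    return Clock::now();
}
```

The margin has to cover the worst wake-up latency you expect from the scheduler, so it is a value to measure on the target system rather than a constant to copy.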
Recently implemented a fixed-size memory pool with spinlocks and now I'm wondering - how would one implement them without a spinlock?
Edit: Maybe I'm confusing terminology. What I'm doing is looping until other threads returned memory, but I'm also doing a short sleep during each loop iteration.
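If the loop is really "wait until some other thread returns a block", one alternative to sleep-polling is to block on a condition variable and let `release` wake a waiter. A minimal sketch, assuming a mutex-protected free list is acceptable for the workload; `BlockingPool` and its members are hypothetical names, and block alignment is only as good as `block_size` makes it.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class BlockingPool {
public:
    BlockingPool(std::size_t block_size, std::size_t block_count)
        : storage_(block_size * block_count) {
        free_.reserve(block_count);
        for (std::size_t i = 0; i < block_count; ++i) {
            free_.push_back(storage_.data() + i * block_size);
        }
    }

    // Blocks (without polling) until a block is available.
    void* acquire() {
        std::unique_lock<std::mutex> guard(mutex_);
        available_.wait(guard, [this] { return !free_.empty(); });
        void* block = free_.back();
        free_.pop_back();
        return block;
    }

    // Returns a block to the pool and wakes one waiting thread, if any.
    void release(void* block) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            free_.push_back(static_cast<unsigned char*>(block));
        }
        available_.notify_one();
    }

    std::size_t free_blocks() {
        std::lock_guard<std::mutex> guard(mutex_);
        return free_.size();
    }

private:
    std::vector<unsigned char> storage_;  // backing memory for all blocks
    std::vector<unsigned char*> free_;    // blocks currently available
    std::mutex mutex_;
    std::condition_variable available_;
};
```

Compared with a short sleep per iteration, a waiter here wakes when a block is actually returned instead of at the next polling interval.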
> Notice that in the Skylake Client microarchitecture the RDTSC instruction counts at the machine’s guaranteed P1 frequency independently of the current processor clock (see the INVARIANT TSC property), and therefore, when running in Intel® Turbo-Boost-enabled mode, the delay will remain constant, but the number of instructions that could have been executed will change.
rdtsc may execute out of order, so sometimes an lfence (previously cpuid) can be used and there is also rdtscp
The issue with that is that a load fence may be very detrimental to perf. It doesn't really matter if rdtsc executes out of order in this code anyway, and there is no need for sync between cores.
My concurrency knowledge is a bit rusty but aren't spinlocks only supposed to be used for very brief waits like in the hundreds of cycles (or situations where you can't block... like internal o/s scheduling structures in SMP setups)? If so how much does all this back off and starvation of higher priority threads even matter? If it is longer then you should use a locking primitive (except for in those low level os structures!) where most of the things discussed are not an issue. Would love to hear the use cases where spin locks are needed in eg user space, I don't doubt they occur.
That's how they are supposed to work indeed! But spin locks aren't the only spin loops you may find; allocators, for example, do spin. Under allocation-heavy code (which you should avoid too, but it happens in real life because of third parties), this can trigger contention, so you need that contention not to be the worst type of contention.
How can you guarantee that the OS doesn't preempt your thread in the middle of the spinlock? Suddenly your 100 cycle spinlock turns into millions or billions of wasted cycles, because the other threads that are trying to acquire the same lock are spinning and didn't bother informing the OS scheduler that they need the thread that is holding the spinlock, which also didn't inform the OS, to finish its business ASAP.
> The code is not thread-safe as, if multiple threads attempt to use this lock, we could read invalid values of isLocked (in theory, and on a CPU where tearing could happen on its word size).
The issue isn’t just tearing but also memory order. On some architectures you can read a valid but out of date value in Thread A after Thread B has updated that value. (Memory order is mentioned later in the article, to be fair.)
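To make the memory-order point concrete, here is a small sketch of the usual release/acquire pairing. The function name `publish_and_read` is hypothetical. The release store on the flag and the acquire load that observes it are what guarantee the reader sees the payload written before the flag; with relaxed ordering on both sides that guarantee goes away and the access to `payload` becomes a data race.

```cpp
#include <atomic>
#include <thread>

int publish_and_read() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the data

    std::thread producer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    // Wait until the flag is observed; acquire pairs with the release above.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    int seen = payload;  // safe: ordered after the producer's write

    producer.join();
    return seen;
}
```

A lock's `lock()` and `unlock()` need the same pairing, which is why tearing alone does not cover the problem.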
i always got the sense that spinlocks were about maximum portability and reliability in the face of unreliable event driven approaches. the dumb inefficient thing that makes the heads of the inexperienced explode, but actually just works and makes the world go 'round.
"Unfair" paragraph is way too short. This is the main problem! The outlier starvation you get from contended spinlocks is extraordinary and, hypothetically, unbounded.
Well, you need to have specified what you actually want. "Fair" sounds like it's just good, but it's expensive, so unless you know that you need it, which probably means knowing why, you probably don't want to pay the price.
Stealing is an example of an unfairness which can significantly improve overall performance.
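For comparison, the classic way to make a spinlock fair is a ticket lock, where threads are served strictly in arrival order. The sketch below is a minimal version under that assumption; `TicketLock` and `ticket_demo` are hypothetical names, and the `yield()` in the wait loop is only there to keep the demo well behaved on machines with few cores.

```cpp
#include <atomic>
#include <thread>
#include <vector>

class TicketLock {
public:
    void lock() {
        // Take the next ticket, then wait until it is being served.
        const unsigned mine = next_.fetch_add(1, std::memory_order_relaxed);
        while (serving_.load(std::memory_order_acquire) != mine) {
            std::this_thread::yield();
        }
    }

    void unlock() {
        // Only the lock holder writes serving_, so load + store is enough.
        serving_.store(serving_.load(std::memory_order_relaxed) + 1,
                       std::memory_order_release);
    }

private:
    std::atomic<unsigned> next_{0};     // next ticket to hand out
    std::atomic<unsigned> serving_{0};  // ticket currently allowed in
};

// Hypothetical demo: several threads increment a plain int under the lock.
int ticket_demo(int threads, int iterations) {
    TicketLock lock;
    int counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter;
}
```

The price of that ordering is strict handoff: a thread that is running and ready cannot take the lock while an earlier ticket holder is descheduled, which is exactly the kind of cost an unfair or stealing lock avoids.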
what is/are the thread synchronization protocol called which is the equivalent to ethernet's CSMA? there's no "carrier sensing", but instead "who won or mistakes were made" sensing. or is that just considered a form of spinlock? (you're not waiting for a lock, you perform your operation then see if it worked; though you could make the operation be "acquire lock" in which case it's a spinlock)
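The "do the operation, then check whether it worked" pattern is usually called optimistic concurrency, and in lock-free code it tends to take the form of a compare-and-swap retry loop. A minimal sketch, with the hypothetical helper `store_max` raising a shared value to at least `candidate`:

```cpp
#include <atomic>

// Raise `target` to at least `candidate` without taking a lock.
// Returns true if this call stored `candidate`, false if the current
// value was already greater than or equal to it.
bool store_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    while (observed < candidate) {
        // Succeeds only if nobody changed the value since we read it.
        // On failure, `observed` is refreshed with the current value and
        // the loop decides again whether an update is still needed.
        if (target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_relaxed)) {
            return true;
        }
    }
    return false;
}
```

The Ethernet analogy extends one step further: under heavy contention such loops are often paired with a randomized or exponential backoff between retries, much like retransmission after a collision.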
You can limit yourself to the performance of a 1 MHz 6502 with no OS if you don't like it. Even MS-DOS on an 8086 with 640K of RAM allows for things that require complexity of this type (not spin locks, but the tricks needed to make "terminate and stay resident" work are evil in a similar way).
It works if there is no scheduler, or you tell the scheduler what you're doing.
Turns out the first scenario is rare outside of embedded or OS development. The second scenario defeats the purpose because you're doing the same thing a mutex would be doing. It's not like mutexes were made slow on purpose to bully people. They're actually pretty fast.
The OS kernel runqueue uses a spinlock to schedule everything. So it works. Should you ever use a spinlock in application code? No. Let the OS handle it via the synchronization primitives in whatever language your app is written in.
Probably not, not without formal verification which is usually lacking.
Everyone's computers hang or get slow some of the time. Probably all of our locks have bugs in them, but good luck getting to the bottom of that; right now the industry is barely capable of picking a sorting algorithm that actually works.
pizlonator|1 month ago
The author should read https://webkit.org/blog/6161/locking-in-webkit/ so that they understand what they are talking about.
WebKit does it right in the sense that:
- It has an optimal amount of spinning
- Threads wait (instead of spinning) if the lock is not available immediately-ish
And we know that the algorithms are optimal based on rigorous experiments.
Lectem|1 month ago
> - It has an optimal amount of spinning
No, it doesn't: it has a fixed number of yields, which have a very different duration on various CPUs.
> Threads wait (instead of spinning) if the lock is not available immediately-ish
They use parking lots, which is one way to do a futex (in fact, WaitOnAddress is implemented similarly). And no, if you read the code, they do spin. Worse, they actually yield the thread before properly parking.
bbri06|1 month ago
jcranmer|1 month ago
fasterik|1 month ago
The following is an interesting talk where the author used a custom spinlock to significantly speed up a real-time physics solver.
Dennis Gustafsson – Parallelizing the physics solver – BSC 2025 https://www.youtube.com/watch?v=Kvsvd67XUKw
kccqzy|1 month ago
squirrellous|1 month ago
wallstop|1 month ago
unknown|1 month ago
[deleted]
Animats|1 month ago
[1] https://bugs.winehq.org/show_bug.cgi?id=54979
rendaw|1 month ago
electroglyph|1 month ago
spacechild1|1 month ago
Lectem|1 month ago
m-schuetz|1 month ago
rdtsc|1 month ago
rdtsc may execute out of order, so sometimes an lfence (previously cpuid) can be used and there is also rdtscp
See https://github.com/torvalds/linux/blob/master/arch/x86/inclu...
And just because rdtsc is constant doesn't mean the processor clock will be constant; that could be fluctuating.
Lectem|1 month ago
horizion2025|1 month ago
Lectem|1 month ago
imtringued|1 month ago
foldr|1 month ago
a-dub|1 month ago
jeffbee|1 month ago
tialaramex|1 month ago
fsckboy|1 month ago
kobebrookskC3|1 month ago
gafferongames|1 month ago
unknown|1 month ago
[deleted]
CamperBob2|1 month ago
bluGill|1 month ago
imtringued|1 month ago
adrr|1 month ago
direwolf20|1 month ago
nh23423fefe|1 month ago
lmm|1 month ago