top | item 34847102

(no title)

drmeister | 3 years ago

The big problem I have with locks is under high use, they slow parallel code down MORE than if you ran it serially. This is because of "false sharing" when one core grabs the lock, it clears the cache line of every other core that will grab the lock, and they have to fetch from the main memory to read the lock. Fetching from main memory is very slow. I use compare-and-swap in those cases, and almost every lock I've used I've converted to compare-and-swap. I'd like to explore "restartable sequences" for parallel data structures (https://tinyurl.com/vzh3523y) but AFAIK this is only available on linux and we develop on MacOS and Linux.

discuss

zznzz|3 years ago

What you describe (multiple cores sharing the same lock) sounds like true sharing, not false sharing. False sharing means cache line contention is when cores are using separate structures (i.e. not actually shared) which happen to be on the same cache line.

Compare and swap still typically requires the cache line to bounce around between cores, so if that’s the primary cost of locks for you it seems like compare and swap doesn’t really fix the problem of frequent synchronization between cores.

drmeister|3 years ago

Ah - thank you. "True sharing" makes way more sense to describe the situation. The general problem with locks I've seen again and again - and maybe CAS doesn't improve it, but I thought it did. It is challenging to profile this stuff carefully. I've been disappointed many times by my attempts at multithreaded programming when locks are involved. I've implemented a multithreaded Common Lisp programming environment (https://github.com/clasp-developers/clasp.git) that interoperates with C++. Two locks that cause me massive headaches are (1) in the unwinder when we do a lot of stack unwinding in multiple threads and (2) in the memory manager when we do a lot of memory allocation in multiple threads. Parallel code runs multiple times slower than serial code. In another situation where we do parallel non-linear optimization without allocating memory, we can get 20 CPU working in parallel with no problems.

sapiogram|3 years ago

If false sharing is truly the bottleneck in your code, can't you just make sure the lock reserves the entire cache line? Each core still needs to fetch from L3 cache to grab the lock itself, but that's roughly the same cost as a compare-and-swap.

hinkley|3 years ago

The concurrent queue code in Java has/had an array the locks were placed in. Now the problem with putting locks in arrays instead of scattering them across the heap is that you get false sharing immediately. So the array is 8 or 16 times as big as it needs to be to get one lock per cache line.

charleslmunger|3 years ago

The @Contended annotation will pad the memory layout of fields to fill a cache line and avoid false sharing.