top | item 47150479

(no title)

whizzter | 4 days ago

I'd be super interested in how this compares between cpu architectures, is there an optimization in Apple silicon that makes this bad while it'd fly on Intel/AMD cpus?

discuss

order

hansvm|4 days ago

I've observed the same behavior on AMD and Intel at $WORK. Our solution (ideal for us, reads happening roughly 1B times more often than writes) was to pessimize writes in favour of reads and add some per-thread state to prevent cache line sharing.

We also tossed in an A/B system, so reads aren't delayed even while writes are happening; they just get stale data (also fine for our purposes).

gpderetta|4 days ago

the behaviour is quite typical for any MESI style cache coherence system (i.e. most if not all of them).

A specific microarchitecture might alleviate this a bit with lower latency cross-core communication, but the solution (using a single naive RW lock to protect the cache) is inherently non-scalable.

PunchyHamster|4 days ago

Read lock requires communication between cores. It just can't scale with CPU count