andikleen2 | 4 years ago
Programming under high cache-line contention is like message passing on a really busy network with many nodes, and anything that cuts down round trips can make a big difference in scalability. Most people who do network programming know these lessons by heart, but they're still poorly understood by people doing shared-memory programming.
So maybe it's simpler, but likely it's slower too.
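The round-trip analogy can be made concrete. As a rough sketch (the 64-byte line size and thread/iteration counts are assumptions, not from the comment), compare a counter that every thread hammers, where each increment drags the same cache line between cores, with per-thread counters padded to separate lines, where each core keeps its line exclusive and the totals are summed once at the end:

```c
#include <pthread.h>
#include <stdatomic.h>

enum { NTHREADS = 4, ITERS = 100000 };

/* Hot case: every increment bounces one shared cache line
 * between cores -- each atomic op costs a coherence round trip. */
static atomic_long shared;

static void *hot(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&shared, 1);
    return NULL;
}

/* Cold case: per-thread counters padded to an assumed 64-byte
 * line, so no two threads ever contend for the same line. */
struct slot { atomic_long v; char pad[64 - sizeof(atomic_long)]; };
static struct slot slots[NTHREADS];

static void *cold(void *arg) {
    atomic_long *v = arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(v, 1);
    return NULL;
}

long count_hot(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, hot, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&shared);
}

long count_cold(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, cold, &slots[i].v);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    long total = 0;
    for (int i = 0; i < NTHREADS; i++)   /* one uncontended pass */
        total += atomic_load(&slots[i].v);
    return total;
}
```

Both versions count to the same total; the padded version just sends far fewer "messages" over the coherence fabric.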
brucehoult | 4 years ago
The paper you're referring to isn't about LL/SC at all. It's about CAS vs fetch-and-add (called AMOADD on RISC-V), showing that F&A scales better than CAS.
LL/SC is mentioned only in that some ISAs (including ARM and RISC-V) use LL/SC to implement CAS. Beyond that, the paper looks at x86 exclusively.
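The structural difference between the two is easy to see in C11 atomics; this is a generic sketch, not code from the paper:

```c
#include <stdatomic.h>

/* Fetch-and-add: a single atomic operation (x86 LOCK XADD,
 * RISC-V AMOADD.W) that always completes -- wait-free, no retry. */
long inc_faa(atomic_long *ctr) {
    return atomic_fetch_add(ctr, 1);   /* returns the old value */
}

/* The same increment built from CAS: under contention the
 * exchange can fail, and the loop must re-read and retry --
 * exactly the extra coherence traffic that hurts scaling. */
long inc_cas(atomic_long *ctr) {
    long old = atomic_load(ctr);
    while (!atomic_compare_exchange_weak(ctr, &old, old + 1)) {
        /* on failure, 'old' is refreshed with the current value */
    }
    return old;
}
```

Uncontended, both behave identically; the difference only shows up when many cores hit the same counter at once.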
Not only the instructions matter here, but also the implementation.
For example, LL/SC can livelock because the LL needs to acquire that cache line for writing, flushing it from other CPUs' caches. If another CPU does the same to you before you reach the SC, you lose out and have to loop and try again.
RISC-V provides a "forward progress guarantee" if there are 16 or fewer instructions between the LL and SC, they are "simple" instructions, and they are all in the same cache line.
One simple way to implement this is to delay responding to a cache eviction request for up to 16 clock cycles if you have an LL pending on that cache line. Then the only way the SC fails (for properly written code) is if you take an interrupt.
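This is visible even from portable C. On LL/SC ISAs, C11's `atomic_compare_exchange_weak` typically lowers to an LR/SC pair, and its permitted "spurious" failure is exactly the SC losing its reservation, which is why it must sit in a loop. A minimal sketch (the flag-setting function is my own illustration, not from the comment):

```c
#include <stdatomic.h>

/* Atomically take *flag from 0 to 1. On RISC-V the weak CAS
 * typically compiles to an LR.W/SC.W pair; a failed SC (the
 * reservation was lost to another core or an interrupt) shows
 * up here as a spurious failure, so we retry. Keeping the body
 * this short mirrors the constrained sequence (a few simple
 * instructions on one cache line) that earns the ISA's
 * forward-progress guarantee. */
int set_flag(atomic_int *flag) {
    int expected = 0;
    while (!atomic_compare_exchange_weak(flag, &expected, 1)) {
        if (expected != 0)
            return 0;   /* someone else already set it */
        /* expected still 0: the failure was spurious, retry */
    }
    return 1;           /* we performed the 0 -> 1 transition */
}
```

On x86, which has native CAS, the weak and strong forms are the same instruction and the spurious-failure path is never taken.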
The design and implementation of the Atomic Memory Operations is also interesting. RISC-V was co-designed with the TileLink bus. TileLink comes in three flavours:
1) TL-UL Uncached Lightweight: simple get/put transactions
2) TL-UH Uncached Heavyweight: adds support for atomic operations
3) TL-C Cached: adds support for coherent cache blocks
With TL-C, atomic memory operations don't have to be executed by fetching the data all the way to the CPU, doing the operation, and writing it back. With, for example, an AMOADD (atomic add), both the address and the data to be added are sent over the TL-C bus, and the add is executed in a shared L2 or L3 cache or even potentially (it's up to the system designer) directly in the L1 cache of another CPU. Only the result of the add is sent back. The latency is potentially the same as a simple memory read.
TL-C is obviously used inside an SoC, but can also be used over a link such as FMC (FPGA Mezzanine Card), PCIe, or 400G Ethernet (e.g. Western Digital's "OmniXtend").
TL-UH would typically be implemented by peripherals so they can implement AMOs directly on their registers.
TileLink:
https://sifive.cdn.prismic.io/sifive%2Fcab05224-2df1-4af8-ad...
Other buses implement similar features, but this is the one I'm most familiar with.