top | item 17751049

mrkgnao | 7 years ago

> Because the two NUMA nodes are ~entirely independent, it's capable of running two independent processes at full speed.

I don't understand. From my (admittedly little better than layperson's) knowledge, I'm guessing the cores of most multicore processors have to compete for memory access...? Is there a good search term I can use to help me understand what's going on here?

Osiris | 7 years ago

There are 2 dies in the 1950X, each with 2 memory channels. Thus it's possible to run a process on one (8-core) die that maxes out the memory bandwidth of its two local DDR4 channels while the other die still has full-bandwidth access to its own DDR4 channels.

Threadripper is able to switch between NUMA (non-uniform memory access) mode and "regular" mode. In NUMA mode, the OS knows that 2 channels are attached to one die and 2 to the other, which allows lower latencies because the OS can allocate RAM that's local to the core the process is running on.
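To make "the OS knows which channels are attached to which die" concrete, here's a small sketch of reading that topology from user space. It assumes Linux's standard sysfs layout under `/sys/devices/system/node`; the node count and CPU lists are machine-specific (a 1950X in NUMA mode would show two nodes, a non-NUMA box typically just `node0`).

```python
# Sketch only: discover NUMA topology via Linux sysfs.
import glob
import os

def numa_nodes():
    """Return {node_id: set_of_cpu_ids} as reported by sysfs."""
    nodes = {}
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
        node_id = int(os.path.basename(path)[4:])  # "node3" -> 3
        with open(os.path.join(path, "cpulist")) as f:
            spec = f.read().strip()  # e.g. "0-7" or "0-7,16-23"
        cpus = set()
        for part in spec.split(","):
            if "-" in part:
                lo, hi = (int(x) for x in part.split("-"))
                cpus.update(range(lo, hi + 1))
            elif part:
                cpus.add(int(part))
        nodes[node_id] = cpus
    return nodes

if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print(f"node{node_id}: {len(cpus)} CPUs -> {sorted(cpus)}")
```

`numactl --hardware` reports the same information (plus per-node memory sizes) from the command line.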

gascan | 7 years ago

As a bonus, if you run explicitly NUMA and the OS/code does a good job, there's little cache-line contention or resource sharing (e.g., of caches) between the dies.

vbezhenar | 7 years ago

Does it really work this way (with automatic memory and core pinning)? Can both Windows and Linux do that?

blattimwind | 7 years ago

NUMA means "(Explicitly) Non-Uniform Memory Access"; this means that some cores have easier (lower latency, higher bandwidth) access to some memory regions than others.

In practice this means that memory controllers are partitioned amongst groups of cores, with some slower and often otherwise busy interconnect between those groups.

The software implication is that if task X uses some bit of memory a lot, then that bit of memory had better be node-local, i.e. cheap to access from the core where task X is running.
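A minimal sketch of nudging memory to be node-local from user space, assuming Linux and its default first-touch allocation policy (pages come from the node of the CPU that first writes them). The single-CPU set `{0}` is a stand-in for a real node's core list, which you'd read from `/sys/devices/system/node/node0/cpulist`:

```python
# Sketch only: pin this process to (assumed) node-0 cores, then
# allocate and first-touch a buffer so its pages land node-local
# under Linux's default first-touch policy.
import os

NODE0_CPUS = {0}  # assumption; a real die would be e.g. {0, ..., 7}

def pin_to_node0():
    os.sched_setaffinity(0, NODE0_CPUS)   # pid 0 = current process
    buf = bytearray(1 << 20)              # allocated...
    buf[::4096] = b"\x01" * len(buf[::4096])  # ...and touched one byte per page
    return os.sched_getaffinity(0)

print(pin_to_node0())  # prints the affinity set we just installed
```

For hard guarantees rather than first-touch heuristics, `numactl --cpunodebind`/`--membind` (or the `mbind`/`set_mempolicy` syscalls via libnuma) bind memory to a node explicitly.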

paulmd | 7 years ago

Threadripper and Epyc are essentially multi-socket-in-a-package. There is an inter-processor link which is analogous to Intel QPI or DMI, it just runs between dies within a single socket instead of dies in separate sockets.

Threadripper and Epyc present themselves as 2 or 4 separate NUMA nodes depending on model. Spreading a single task across multiple NUMA nodes usually hurts performance significantly (often slower than just running it on a single node using fewer threads), but you can run 2/4 separate tasks at pretty much full speed.
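The "separate tasks at pretty much full speed" idea can be sketched as two workers, each confined to its own group of cores. On a real Threadripper in NUMA mode the groups would be the two dies' core lists; this sketch just splits whatever CPUs the machine exposes in half so it runs anywhere:

```python
# Sketch only: two independent workers, each pinned to its own core
# group, so neither steals the other group's memory bandwidth.
import multiprocessing as mp
import os

def worker(cpus):
    os.sched_setaffinity(0, cpus)            # confine worker to its group
    return sum(i * i for i in range(10**5))  # stand-in for real work

def split_cpus():
    """Split available CPUs into two groups (stand-ins for two dies)."""
    cpus = sorted(os.sched_getaffinity(0))
    half = max(1, len(cpus) // 2)
    return [set(cpus[:half]), set(cpus[half:]) or {cpus[-1]}]

if __name__ == "__main__":
    with mp.Pool(processes=2) as pool:
        print(pool.map(worker, split_cpus()))
```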

The new WX processors are a little weird because two of the NUMA nodes have no direct access to RAM at all; they have to ask the other two dies to do the memory access for them and pass the data over.