top | item 32889337

Measuring CPU core-to-core latency

319 points| nviennot | 3 years ago |github.com

90 comments

order

wyldfire|3 years ago

This is a cool project.

It looks kinda like the color scales are normalized to just-this-CPU's latency? It would be neater if the scale represented the same values among CPUs. Or rather, it would be neat if there were an additional view for this data that could make it easier to compare among them.

I think the differences are really interesting to consider. What if the scheduler could consider these designs when weighing how to schedule each task? Either statically or somehow empirically? I think I've seen sysfs info that describes the cache hierarchies, so maybe some of this info is available already. That nest [1] scheduler was recently shared on HN, I suppose it may be taking advantage of some of these properties.

[1] https://dl.acm.org/doi/abs/10.1145/3492321.3519585

chaps|3 years ago

You might just be going down a deep rabbit hole with these questions :)

You can definitely do things like that, including pinning an application to a certain set of cores for performance. When you have NUMA it's pretty much mandatory for low latency applications, since each CPU gets its own memory slots, significantly lowering latency cost. It's physically closer, for one.

It gets really fun when you get into network traffic, since you can bind a NIC to specific cores alongside kernel bypass stuffs and interrupt handling. Sometimes it even makes sense to have one NIC per CPU!

It's a lot of setup work though. Have a look into CPU shielding as well. The Linux scheduler cares more about fairness than it does latency, so you have to be very intentional in your setup.

foota|3 years ago

The phrase for this is NUMA, and NUMA scheduling is the term for scheduling that accounts for it. Mostly, it avoids moving threads between different NUMA domains.

rigtorp|3 years ago

I have something similar but in C++: https://github.com/rigtorp/c2clat

virgulino|3 years ago

I went to your homepage. Your gif "Programming in C++" made me really laugh, thanks for that! 8-)

https://rigtorp.se/

amock|3 years ago

This works much better on my POWER9 system. I have two CPUs in my machine with the same core count but different cache configurations according to hwloc and this showed that they really are different.

habibur|3 years ago

You need to add a Makefile there. It's just two lines.

hot_gril|3 years ago

I was wondering what real-life situations this benchmark matters the most in, then I remembered... A few years ago I was working on a uni research project trying to eek out the most performance possible in an x86 software-defined EPC, basically the gateway that sits between the LTE cell tower intranet and the rest of the Internet. The important part for me to optimize was the control plane, which handles handshakes between end users and the gateway (imagine everyone spam-toggling airplane mode when their LTE drops). Cache coherence latency was a bottleneck. The control plane I developed had diminishing returns in throughput up to like 8 cores on a 12-core CPU in our dual-socket test machine. Beyond that, adding cores actually slowed it down significantly*. Not a single-threaded task but not embarrassingly parallel either. The data plane was more parallel, and it ran on a separate NUMA node. Splitting either across NUMA nodes destroyed the performance.

* which in hindsight sounds like TurboBoost was enabled, but I vaguely remember it being disabled in tests

nviennot|3 years ago

So what did you do to get the most performance out of the system? pin the threads to specific CPUs?

jtorsella|3 years ago

If anyone is interested, here are the results on my M1 Pro running Asahi Linux:

Min: 48.3 Max: 175.0 Mean: 133.0

I’ll try to copy the exact results once I have a browser on Asahi, but the general pattern is most pairs have >150ns and a few (0-1; 2-3,4,5; 3-4,5; 4-5; 6-7,8,9; 7-8,9; 8-9) are faster at about 50ns.

Edit: The results from c2clat (a little slower, but the format is nicer) are below.

CPU 0 1 2 3 4 5 6 7 8 9

   0    0   59  231  205  206  206  208  219  210  210

   1   59    0  205  215  207  207  209  209  210  210

   2  231  205    0   40   42   43  180  222  224  213

   3  205  215   40    0   43   43  212  222  213  213

   4  206  207   42   43    0   44  182  227  217  217

   5  206  207   43   43   44    0  215  215  217  217

   6  208  209  180  212  182  215    0   40   43   45

   7  219  209  222  222  227  215   40    0   43   43

   8  210  210  224  213  217  217   43   43    0   44

   9  210  210  213  213  217  217   45   43   44    0

mrtksn|3 years ago

M1 on Macbook air on macOS Ventura beta 7:

min/max/mean - 34.6ns/57.7ns/39.1ns

       0       1       2       3       4       5       6       7   
  0
  1   58±6 
  2   48±6    46±6 
  3   39±6    37±6    36±6 
  4   41±6    38±6    42±6    38±6 
  5   35±6    37±6    37±6    36±6    36±6 
  6   38±6    39±6    36±6    40±6    40±6    38±6 
  7   39±6    38±6    36±6    37±6    38±6    36±6    38±6

nviennot|3 years ago

It'd be great if you could provide the output.csv actually :D

I think there's a CLI tool for pastebin

tbrock|3 years ago

Would love to see what the results are for an m2.

snvzz|3 years ago

>This software is licensed under the MIT license

Maybe consider including a MIT license in the repository.

Legally, that's a bit more sane than having a line in the readme.

In practice, github will recognize your license file and show the license in the indexes an d in the right column of your repository's main page.

nviennot|3 years ago

Okay, I've added a file

dan-robertson|3 years ago

It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’ but there is surely some architectural reason for these clusters. Is it just physical distance on the chip or is there some other design constraint?

I find it pretty interesting where the interface that cpu makers present (eg a bunch of equal cores) breaks down.

Veliladon|3 years ago

Rings are great for latency on low core count situations. The LCC Intel chips all have a massive 512-bit ring bus (2x256-bit in each direction) internally which delivers crazy fast core to core latency. However, this quickly starts to break down under higher core counts. Intel gets around this to some extent with its P and E cores with 4 E cores occupying the same slot on the ring bus as a P core. However, once you start getting over 14-16 slots on the ring bus it starts to get overwhelmed.

So when you switch to mesh buses the interconnect takes up way more space. So one has to compromise between bus width and the amount of area one is using for the interconnects. Typically this means running reduced width buses around the mesh which limits core to core bandwidth. Not so much a big deal if you're running a server, more a problem though if you're trying to run interactively with a user. Unless of course you're Apple and just devote a truckload of die space to dump a fucking mammoth amount of interconnect between your dies.

There's also ancillary concerns as well like fabrication yield. For instance AMD runs chiplets probably because they can mix and match yields and they naturally segment the market. Get a CCX with 3 working cores? Pair it with another and you have a 6C/12T CPU. Get a CCX with 2 working cores? Pair it with another and you get a 4C/8T. Intel either gets a working die or they don't.

The problem here is the interconnect between the CCXs is relatively slow. Dog slow compared to the ring bus. Even running the Infinity Fabric's fclock at 1.8GHz only nets you 57.6GB/sec between CCXs and five times the latency of the ring bus. When you look at a Ryzen 3300 (2x2 CCX) and a Ryzen 3300X (1x4 CCX) the difference in performance is non-trivial and that's the Infinity Fabric dragging performance down. In comparison an Intel core's L3 cache on a 3GHz ring bus (i.e. non-turbo) pulls down at 96GB/sec. Sure you're still ultimately limited by DRAM but if stuff is staying in LLC it's a hell of a performance boost. In Zen 3 AMD even went to 8 core CCXs which gave the whole thing a huge performance boost. Part of that was because the smaller lithography gave them more area to play with so they could fit everything plus the interconnects onto the chiplet size they needed.

So yeah, I hope that little greatly oversimplified, surface level look was helpful.

bitcharmer|3 years ago

Most of this cross-core overhead diversity is gone on skylake and newer chips because Intel moved from a ring topology to mesh design for their l3 caches.

xani_|3 years ago

Just look at the processor architecture diagram.

But TL;DR modern big processors are not one big piece of silicon but basically "SMP in a box", a bunch of smaller chiplets interconnected with eachother. That helps with yield ("bad" chiplet costs you just 8 cores, not whole 16/24/48/64 core chip). Those also usually come with their own memory controllers.

And so you basically have NUMA on a single processor with all of the optimization challenges for it

ip26|3 years ago

Some of it is simple distance. Some of it is architectural choices because of the distance. A sharing domain that spans a large distance performs poorly because of the latency. Therefore domains are kept modest, but the consequence is crossing domains has an extra penalty.

ozcanay|3 years ago

I am currently working on my master's degree on computer science and studying on this exact topic.

In order to measure core-to-core latency, we should also learn how the cache coherence works on Intel. I am currently experimenting with microbenchmarks on Skylake microarchitecture. Due to the scalability issues with ring interconnect on CPU dies in previous models, Intel opted for 2D mesh interconnect microarchitecture in recent years. In this microarchitecture, CPU die is split into tiles each accommodating cores, caches, CHA, snoop filter etc. I want to emphasize the role of CHA here. Each CHA is responsible for managing coherence of a portion of the addresses. If a core tries to fetch a variable that is not in its L1D or L2 cache, the CHA managing the coherence of the address of the variable being fetched will be queried to learn whereabouts of the variable. If the data is on the die, the core currently owning the variable will be told to forward that variable to the requesting core. So, even though the cores that communicate with each other are physically contiguous, the location of the CHA that manages the coherence of the variable they will pass back and forth also is important due to cache coherence mechanism.

Related links:

https://gac.udc.es/~gabriel/files/DAC19-preprint.pdf

https://par.nsf.gov/servlets/purl/10278043

scrubs|3 years ago

Please see my reply/response ref: "ffwd: delegation is (much) faster than you think" in this HN post. It's a scheme to leverage high core-to-core bandwidth to make coordinating shared data structures easier. I'd definitely be interested in your 2 cents.

moep0|3 years ago

Why does CPU=8 in Intel Core i9-12900K have fast access to all other cores? It is interesting.

nviennot|3 years ago

It would be nice to have another bench from the Alder Lake, so we can validate the results

ajolly|3 years ago

I suspect what you are seeing is the preferred cores. With TVB/OCTVB (thermal velocity boost/overclocking thermal velocity boost) those can boost higher than the other cores.

From my 12900ks (its a boosted 12900k) Core 6-7 is rated for 5.5ghz.

Freq MHz DTS C VID mv PC Eff Fr UCcode 0x1f VR Volt Limit 2500 ---------------------------------------------------- 800 24 1307 52 517 Uncore 3600 IA AC LL 0.5999 800 20 1282 52 47 Power 23.465 IA DC LL 1.0996 800 20 1314 52 70 Current Limit 0.0000 SA AC LL 0.0000 800 20 1308 52 69 iccmax 1023 SA DC LL 0.0000 800 20 1303 52 88 PL1 32760 iccmax dis True 800 18 1293 52 69 PL2 32760 TAU 33 800 20 1286 55 39 Memory 5200 PPP_OVR True 800 22 730 55 37 EE_Turbo_Dis True RTH_Dis True 800 22 1377 40 60 Dis_Ring_EE False HWGuidedSch True 800 22 1172 40 73 IA_CEP_Dis True Dynamic_Mem True Full_Range_Multi False SA_Freq_OVR False TSC_Dis_HW False Banding_Ratio 0 PVD_Ratio_thresh 0 SA_CEP_Dis True FLL OC Mode 3

---------------------------------------------------------------------------------------------------------------------------------

Core Voltage Adapt 0 Ring Voltage Adapt 0 Core PLL V 900 Core Voltage Offset 0 Ring Voltage Offset 0 Ring PLL V 900 L2 Voltage Adapt 0 L2 Voltage Offset 0 AVX512 Offset 0 SA Voltage Offset 0 Ring VID 0 MC PLL V 900 AVX Offset 0 AVX2 V Guardband 0 AVX512 V Guardband 0 SA Voltage Manual 1150

---------------------------------------------------------------------------------------------------------------------------------

Turbo Ratio Limit 55, 55, 52, 52, 52, 52, 52, 52 Core OCMB Max Ratio 0 Ring Min Ratio 8 Turbo Limit Cores 1, 2, 3, 4, 5, 6, 7, 8 Ring OCMB Max Ratio 0 Ring Max Ratio 47 Atom Ratio Limit 40, 40, 40, 40, 40, 40, 40, 40 Atom Limit Cores 255, 255, 255, 255, 255, 255, 255, 255Atom OCMB Max Ratio 0 OS Max Ratio 34 HWP Min Ratio 43 HWP Max Ratio 255 Max Possible Core 40 Max Possible Ring 47 UCLK 2600

Num cores: 24 Using RDTSC to measure time: false Num round trips per samples: 5000 Num samples: 300 Showing latency=round-trip-time/2 in nanoseconds:

       0       1       2       3       4       5       6       7       8       9      10      11      12      13      14      15      16      17      18      19      20      21      22      23
  0
  1    0±0
  2   37±6    36±6
  3   31±6    37±6     5±5
  4   26±6    26±6    26±6    26±6
  5   26±6    31±6    26±6    31±6     5±5
  6   31±6    31±6    36±6    31±6    26±6    21±6
  7   31±6    31±6    31±6    31±6    26±6    26±6     5±5
  8   31±6    31±6    31±6    31±6    26±6    26±6    31±6    26±6
  9   36±6    31±6    31±6    31±6    26±6    26±6    26±6    26±6     5±5
 10   31±6    36±6    31±6    26±6    26±6    21±6    26±6    31±6    26±6    31±6
 11   31±6    36±6    31±6    36±6    26±6    26±6    31±6    31±6    36±6    31±6     5±5
 12   31±6    36±6    31±6    42±6    26±6    31±6    31±6    37±6    31±6    37±6    31±6    36±6
 13   36±6    37±6    42±6    36±6    26±6    26±6    31±6    31±6    31±6    31±6    31±6    31±6     5±5
 14   31±6    36±6    37±6    36±6    31±6    26±6    36±6    31±6    36±6    31±6    36±6    31±6    31±6    36±6
 15   31±6    31±6    31±6    31±6    26±6    26±6    36±6    31±6    31±6    31±6    31±6    31±6    31±6    31±6     0±0
 16   42±6    42±6    42±6    42±6    31±6    31±6    37±6    36±6    36±6    42±6    37±6    36±6    36±6    36±6    37±6    37±6
 17   42±6    42±6    36±6    42±6    31±6    36±6    37±6    36±6    36±6    37±6    36±6    37±6    36±6    36±6    36±6    37±6    47±6
 18   42±6    42±6    42±6    42±6    31±6    31±6    37±6    36±6    37±6    36±6    42±6    36±6    37±6    36±6    36±6    36±6    47±6    52±6
 19   42±6    42±6    42±6    37±6    31±6    31±6    36±6    37±6    36±6    36±6    37±6    37±6    36±6    36±6    42±6    37±6    52±6    47±6    47±6
 20   42±6    42±6    42±6    36±6    37±6    31±6    37±6    36±6    37±6    36±6    36±6    42±6    37±6    36±6    36±6    42±6    42±6    42±6    42±6    42±6
 21   42±6    37±6    42±6    42±6    31±6    31±6    37±6    36±6    42±6    36±6    37±6    36±6    37±6    37±6    36±6    36±6    42±6    42±6    42±6    42±6    47±6
 22   42±6    42±6    42±6    36±6    36±6    31±6    36±6    37±6    36±6    37±6    36±6    36±6    36±6    37±6    36±6    42±6    42±6    42±6    42±6    42±6    47±6    52±6
 23   36±6    42±6    42±6    42±6    31±6    31±6    37±6    36±6    37±6    42±6    36±6    36±6    37±6    42±6    36±6    36±6    42±6    42±6    42±6    42±6    52±6    47±6    47±6
Min latency: 0.0ns ±0.0 cores: (1,0) Max latency: 52.3ns ±16.3 cores: (23,20) Mean latency: 34.6ns

alberth|3 years ago

Could it be because it's the newest CPU included in the test suite.

Possibility related to what fab generation the CPU is on vs. other chips tested?

EDIT: why the downvotes?

bhedgeoser|3 years ago

On a 5950x, the latencies for core 0 are very high if SMT is enabled, I wonder why that is?

         0       1  
    0
    1   26±0
    2   26±0    17±0
    3   27±0    17±0
    4   32±0    17±0
    5   29±0    19±0
    6   32±0    18±0
    7   31±0    17±0
    8  138±1    81±0
    9  138±1    83±0
   10  139±1    80±0
   11  136±1    84±0
   12  134±1    83±0
   13  137±1    80±0
   14  136±1    84±0
   15  139±1    84±0
   16   16±0    16±0
   17   28±0     8±0
   18   33±0    17±0
   19   29±0    16±0
   20   28±0    17±0
   21   29±0    19±0
   22   32±0    18±0
   23   31±0    17±0
   24  137±1    81±0
   25  140±1    79±0
   26  143±1    80±0
   27  138±1    82±0
   28  139±1    82±0
   29  139±1    81±0
   30  142±1    82±0
   31  142±1    84±0

bullen|3 years ago

I ran .c2clat on a Raspberry 4:

  CPU    0    1    2    3
    0    0   77   77   77
    1   77    0   77   77
    2   77   77    0   77
    3   77   77   77    0
And Raspberry 2:

  CPU    0    1    2    3
    0    0   71   71   71
    1   71    0   71   71
    2   71   71    0   71
    3   71   71   71    0

jeffbee|3 years ago

Fails to build from source with Rust 1.59 so I tried the C++ `c2clat` from elsewhere in the thread. Quite interesting on Alder Lake, because the quartet of Atom cores has uniform latency (they share an L2 cache and other resources) while the core-to-core latency of the Core side of the CPU varies. Note that the way these are logically numbers is 0,1 are SMT threads of the first core and so forth through 14-15. 16-19 are Atom cores with 1 thread each.

  CPU   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
   0    0   12   60   44   60   44   60   43   50   47   56   48   58   49   60   50   79   79   78   79
   1   12    0   45   45   44   44   60   43   51   49   55   47   57   49   56   51   76   76   76   76
   2   60   45    0   13   42   43   53   43   48   37   52   41   53   42   53   42   72   72   72   72
   3   44   45   13    0   42   43   53   42   47   37   51   40   53   41   53   42   72   72   72   72
   4   60   44   42   42    0   13   56   43   49   52   54   41   56   42   42   41   75   75   74   75
   5   44   44   43   43   13    0   56   43   51   54   55   41   56   42   56   42   77   77   77   77
   6   60   60   53   53   56   56    0   13   49   54   56   41   57   42   57   42   78   78   78   78
   7   43   43   43   42   43   43   13    0   46   47   54   41   41   41   55   41   72   71   71   71
   8   50   51   48   47   49   51   49   46    0   12   51   51   54   56   55   56   75   75   75   75
   9   47   49   37   37   52   54   54   47   12    0   49   53   54   56   55   54   74   69   67   68
  10   56   55   52   51   54   55   56   54   51   49    0   13   53   58   56   59   75   75   76   75
  11   48   47   41   40   41   41   41   41   51   53   13    0   51   52   55   59   75   75   75   75
  12   58   57   53   53   56   56   57   41   54   54   53   51    0   13   55   60   77   77   77   77
  13   49   49   42   41   42   42   42   41   56   56   58   52   13    0   55   54   77   77   77   77
  14   60   56   53   53   42   56   57   55   55   55   56   55   55   55    0   12   74   70   78   78
  15   50   51   42   42   41   42   42   41   56   54   59   59   60   54   12    0   75   74   74   77
  16   79   76   72   72   75   77   78   72   75   74   75   75   77   77   74   75    0   55   55   55
  17   79   76   72   72   75   77   78   71   75   69   75   75   77   77   70   74   55    0   55   55
  18   78   76   72   72   74   77   78   71   75   67   76   75   77   77   78   74   55   55    0   55
  19   79   76   72   72   75   77   78   71   75   68   75   75   77   77   78   77   55   55   55    0

Veliladon|3 years ago

It's probably the ring bus topology. As you pointed out E cores are directly connected to L2 and only one stop away from each other on the ring bus. P cores can be up to 6 stops away from each other because there's 4 on each side, an E core cluster on each side, and a system agent (PCIe/DRAM) or iGPU stop in between each side.

vladvasiliu|3 years ago

This is interesting. I'm getting much worse results on an i7-1165G7 than the ones published:

    Num cores: 8
    Using RDTSC to measure time: true
    Num round trips per samples: 5000
    Num samples: 300
    Showing latency=round-trip-time/2 in nanoseconds:

       0       1       2       3       4       5       6       7
  0
  1   70±1
  2   53±1    42±0
  3   73±5   134±5    80±1
  4   16±0    49±1    56±1    46±1
  5   63±4    28±1   128±5    67±1    66±1
  6   56±1    49±1    10±0    81±4   124±4    72±1
  7   57±1    57±1    45±1    10±0    63±4   130±5    87±1

    Min  latency: 10.1ns ±0.2 cores: (6,2)
    Max  latency: 134.1ns ±5.3 cores: (3,1)
    Mean latency: 64.7ns

consp|3 years ago

Might too-aggressive power saving be an issue? The spikes (100+ns) are all hyper-threads. edit: No, 4 to 6 is not a HT.

sgtnoodle|3 years ago

I've been doing some latency measurements like this, but between two processes using unix domain sockets. I'm measuring more on the order of 50uS on average, when using FIFO RT scheduling. I suspect the kernel is either letting processes linger for a little bit, or perhaps the "idle" threads tend to call into the kernel and let it do some non-preemptable book keeping.

If I crank up the amount of traffic going through the sockets, the average latency drops, presumably due to the processes being able to batch together multiple packets rather than having to block on each one.

mayapugai|3 years ago

This is a fascinating insight into a subsystem which we take from granted and naively assume is homogeneous. Thank you so much for sharing.

A request to the community - I am particularly interested in the Apple M1 Ultra. Apple made a pretty big fuss about the transparency of their die-to-die interconnect in the M1 Ultra. So, it would be very interesting to see what happens with it - both on Mac OS and (say, Asahi) Linux.

scrubs|3 years ago

This benchmark reminds me of "ffwd: delegation is (much) faster than you think" https://www.seltzer.com/margo/teaching/CS508-generic/papers-....

This paper describes a mechanism for client threads pinned to a distinct cores to delegate a function call to distinguished server thread pinned to its own core all on the same socket.

This has a multitude of applications the most obvious one making a shared data structure MT safe through delegation rather than saddling it with mutexes or other synchronization points especially beneficial with small critical sections.

The paper's abstract concludes claiming "100% [improvement] over the next best solution tested (RCL), and multiple micro-benchmarks show improvements in the 5–10× range."

The code does delegation without CAS, locks, or atomics.

The efficacy of such a scheme rests on two facets, which the paper explains:

* Modern CPUs can move GBs/second between core L2/LLC caches

* The synchronization between requesting clients and responding servers depends on each side spinning on shared memory address looking for bit toggles. Briefly, servers only read client request memory which the client only writes. (Clients each have their own slot). And on the response side client's read the servers shared response memory, which only the server writes. This one-side read, one-side write is supposed to minimize the number of cache invalidations and MESI syncs.

I spent some time testing the author's code and went so far as writing my own version. I was never able to make it work with anywhere near the throughput claimed in the paper. There's also some funny "nop" assembler instructions within the code that I gather is a cheap form of thread yielding.

In fact this relatively simple SPCP MT ring buffer which has but a fraction of the code:

https://rigtorp.se/ringbuffer/

did far, far better.

In my experiments then CPU spun too quickly so that core-to-core bandwidth was quickly squandered before the server could signal response or the client could signal request. I wonder if adding select atomic reads as with the SPSC ring might help.

jesse__|3 years ago

This is absolutely the coolest thing I've seen in a while.

bullen|3 years ago

When would cores talk to cores like this is measuring?

Would two cores reading and writing to the same memory have this contention?

fuy|3 years ago

They mention that they measure "cache coherence protocol" throughput. So yes, this happens when two or more cores need to work on the same memory and they need to synchronize/update what is in their L1 and L2 (and probably L3 as well?) caches.

fideloper|3 years ago

Because I’m ignorant: What are the practical take aways from this?

When is a cpu core sending a message to another core?

SkipperCat|3 years ago

In HFT, we typically pin processes to run on a single isolated core (on a multicore machine). That allows the process to avoid a lot of kernel and other interrupts which could cause the process to not operate in a low latency manner.

If we have two of these processes, each on separate cores, and they occasionally need to talk to each other, then knowing the best choice of process/core location can keep the system operating in the lowest latency setup.

So, an app like this could be very helpful for determining where to place pinned processes onto specific cores.

There's also some common rules-of-thumb such as, don't put pinned processes that need to communicate on cores that are separated by the QPI, that just adds latency. Make sure if you're communicating with a NIC to find out which socket has the shortest path on the PCI bus to that NIC and other fun stuff. I never even thought about NUMA until I started to work with folks in HFT. It really makes you dig into the internals of the hardware to squeeze the most out of it.

crazytalk|3 years ago

It's mentioned in the readme - this is measuring the latency of cache coherence. Depending on architecture, some sets of cores will be organized with shared L2/L3 cache. In order to acquire exclusive access to a cache line (memory range of 64-128ish bytes), caches belonging to other sets of cores need to be waited on to release their own exclusive access, or to be informed they need to invalidate their caches. This is observable as a small number of cycles additional memory access latency that is heavily dependent on hardware cache design, which is what is being measured

Cross-cache communication may simply happen by reading or writing to memory touched by another thread that most recently ran on another core

Check out https://en.wikipedia.org/wiki/MOESI_protocol for starters, although I think modern CPUs implement protocols more advanced than this (I think MOESI is decades old at this point)

aseipp|3 years ago

The example code uses an atomic store instruction in order to write values from threads to a memory location, and then an atomic read to read them. The system guarantees that a read of a previously written location is consistent with a subsequent write, i.e. "you always read the thing you just wrote" (on x86, this guarantee is called "Total Store Ordering.") Reads and writes to a memory location are translated to messages on a memory bus, and that is connected to a memory controller, which the CPUs use to talk to the memory they have available. The memory controller is responsible for ensuring every CPU sees a consistent view of memory according to the respective platform memory ordering rules, and with respect to the incoming read/write requests from various CPUs. (There are also caches between the DRAM and CPU here but they are just another layer in the hierarchy and aren't so material to the high-level view, because you can keep adding layers, and indeed some systems even have L1, L2, L3, and L4 caches!)

A CPU will normally translate atomic instructions like "store this 32-bit value to this address" into special messages on the memory bus. Atomic operations it turns out are already normally implemented in the message protocol between cores and memory fabric, so you just translate the atomic instructions into atomic messages "for free" and let the controller sort it out. But the rules of how instructions flow across the memory bus is complicated because the topology of modern CPUs is complicated. They are divided, partitioned into NUMA domains, have various caches that are shared or not shared between 1-2-or-4-way clusters, et cetera. They must still obey the memory consistency rules defined by the platform, and all the caches and interconnects between them. As a result, there isn't necessarily a uniform measurement of time for any particular write to location X from a core to be visible to another core when it reads X; you have to measure it to see how the system responds, which might include expensive operations like flushing the cache. It turns out two cores that are very far away will just take more time to see a message, since the bus path will likely be longer -- the latency will be higher for a core-to-core memory write where the write will be visible consistently.

So when you're designing high performance algorithms and systems, you want to keep the CPU topology and memory hierarchy in mind. That's the most important takeaway. From that standpoint, these heatmaps are simply useful ways of characterizing the baseline performance of some basic operations between CPUs, so you might get an idea of how topology affects memory latency.

eternalban|3 years ago

Hm. Use icelake, with an aggregator process sitting in core 11 and have all the others run completely on input alone and then report to core 11. (Core 11 from that heatmap appears to be the only cpu with a sweetheart core having low latency to all other cores.) I wonder how hard is to write a re-writer to map an executable to match cpu architecture characteristics. Something like graph transformations to create clusters (of memory addresses) that are then mapped to a core.

bee_rider|3 years ago

Does anyone know what is up with the 8275CL? It looks... almost periodic or something.

monocasa|3 years ago

I think that's one of the Intel chips where the cores are connected on a 2D grid rather than the more hierarchical or single ring interconnects you see in the other systems.

stevefan1999|3 years ago

What about OS scheduling overhead?

wmf|3 years ago

It's an order of magnitude slower.

throaway53dh|3 years ago

What's the diff b/w sockets and cores? Does socket have seperate Ln caches and cores share the ca he?