(no title)
marmaduke | 1 year ago
perhaps also more precisely they also have quite an advantage on anything that needs and plays nicely with caches? when I sliced my problem to maximize cache usage, I saw pretty clear scalability with cores: L1/L2 cache bandwidth is ~30GB/s, so e.g. a 32 core system starts to compete with the big consumer GPUs.
dragontamer|1 year ago
Not only because caching is complex in CPU land, but because GPU has a completely different set of caches (and registers) that depends entirely on architecture.
Case in point: GPUs often have access to 256 architectural registers aka 1024-bytes of register space. Now this depends on how much occupancy your GPU code is targeting (maybe 4-occupancy?), but there's a lot you can do with even 64-registers (aka: 4-occupancy and 256-bytes of register space), and is key to making those blazing fast FP16 matrix-multiplication kernels "for AI" everyone's so hot about right now.
For something like FP16 matrix multiply (a very cache-friendly problem), the entire SIMD (all 32-lanes) of the GPU symmetric multiprocessor work together on the problem. So we're talking about an effective 32kB of __register space__ (let alone cache or other memory in the hierarchy).
Even before FP16 matrix-multiply instructions, this absurd register space advantage is why GPUs were the king of matrix multiplication.
--------
GPUs are worse at larger cache sizes, say 1MB or 2MB. At 1MB+, a modern CPU core's L2 cache can hold all of that (either Intel P-core, AMD Zen5, or even Intel E-core can hold many MB in L2).
GPUs have a secret though: a crossbar at the __shared__ memory level. Rearranging memory and data across your lanes can be done through this crossbar (including many-to-one reductions for atomics, or one-to-many broadcasts in just a single clock tick). So your GPU-lanes have incredible communication available to them (and this crossbar is the key for modern ballot / voting based horizontal compute of modern GPU styles). This is only ~64kB of space and shared between all GPU-lanes but with 1024-lanes supporting communication its an important element of GPU memory.
CPU L3 cache is very nice from a latency perspective, but bandwidth wise L3 cache is on the order of GPU's GDDR6x or HBM. 500GB/s to 2000GB/s, depending on the technology.
Finally, CPU DDR5 RAM may be the slowest in this discussion, but its the biggest. 2TB+ Xeon Servers aren't even that expensive and can be assumed for any serious tech firm these days (ie: all-RAM Databases and whatnot).
---------
So at different sizes and different use-scenarios, GPUs and CPUs will trade places. I'd expect CPUs to win most red-and-black tree races, but GPUs will win matrix multiplication. Both take advantage of "cache" but in very different ways.