fulafel|3 days ago

You're right in terms of fitting your program to memory, so that it can run in the first place.

But in performance work, the speed of RAM relative to computation has dropped to the point that it's common wisdom to treat today's cache as the RAM of old (and today's RAM as the disk of old, and so on).

Software performance work has been all about hitting the cache for a long time. LLMs aren't too amenable to caching, though.

makapuf|3 days ago

AFAIK you can't explicitly allocate cache the way you allocate RAM, however. It's a bit like if you could only work on files, with RAM used as a cache for them. Maybe I'm mistaken? (Edit: typo)

lou1306|3 days ago

You can't explicitly allocate cache, but you can lay things out in memory to minimize cache misses.
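For instance, loop order alone decides whether each cache line you fetch gets fully used. A minimal C sketch of the idea (the 64-byte line size mentioned in the comments is the usual x86 figure, not universal):

```c
#include <stddef.h>

#define N 1024

/* Sum a row-major N x N matrix. Traversing row by row touches memory
   sequentially, so every element of each fetched cache line is used
   before the line is evicted. */
long sum_row_major(long m[N][N]) {
    long s = 0;
    for (size_t i = 0; i < N; i++)       /* cache-friendly order */
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Same result, but each access strides N * sizeof(long) bytes, so with
   a large N nearly every load misses the cache. */
long sum_col_major(long m[N][N]) {
    long s = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

Same data, same answer, very different miss rates; this is the kind of "lay things out (and walk them) in cache order" change that performance work lives on.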

KeplerBoy|3 days ago

You can in CUDA. You can have shared memory which is basically L1 cache you have full control over. It's called shared memory because all threads within a block (which reside on a common SM) have fast access to it. The downside: you now have less regular L1 cache.

seanmcdirmid|3 days ago

LLMs need memory bandwidth to stream lots of data through quickly, not so much caching. That's basically the same way a GPU uses memory anyway.
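To put a rough number on that, here's a back-of-envelope sketch under assumed figures (a 70B-parameter model with 2-byte weights, ~1 TB/s of memory bandwidth, single-stream decoding where every weight is read once per token, KV cache ignored):

```latex
\underbrace{70 \times 10^{9} \times 2\,\text{B}}_{\text{weights}}
  = 140\,\text{GB read per token}
\quad\Rightarrow\quad
\frac{1\,\text{TB/s}}{140\,\text{GB/token}} \approx 7\,\text{tokens/s}
```

At batch size 1 there's far too little arithmetic per byte to hide that, so decoding is bandwidth-bound no matter how good the cache behavior is.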

zozbot234|3 days ago

OTOH, LLM inference tends to have very predictable memory access patterns. So well-placed prefetch instructions that can execute predictable memory fetches in parallel with expensive compute might help CPU performance quite a bit. I assume that this is done already as part of optimized numerical primitives such as GEMM, since that's where most of the gain would be.
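As an illustration of the idea (a made-up sketch, not code from any actual GEMM library): when the access pattern is indirect but computable ahead of time, GCC/Clang's `__builtin_prefetch` can start fetching a future row while the current row's multiply-adds are still in flight. The gathered-row dot product and the `PF_DIST` distance below are hypothetical; the distance would need tuning per machine.

```c
#include <stddef.h>

#define PF_DIST 8  /* how many iterations ahead to prefetch (tune per machine) */

/* Dot product of x against rows of w selected through an index table.
   The address of row r + PF_DIST is computable now, so we issue a
   software prefetch for it and overlap that memory fetch with the
   multiply-adds on the current row. */
double gathered_dot(const double *w, size_t row_len,
                    const size_t *rows, size_t n_rows,
                    const double *x) {
    double acc = 0.0;
    for (size_t r = 0; r < n_rows; r++) {
        if (r + PF_DIST < n_rows)
            /* args: address, rw = 0 (read), locality hint = 1 (low) */
            __builtin_prefetch(&w[rows[r + PF_DIST] * row_len], 0, 1);
        const double *row = &w[rows[r] * row_len];
        for (size_t j = 0; j < row_len; j++)
            acc += row[j] * x[j];
    }
    return acc;
}
```

The prefetch is only a hint, so dropping it never changes the result; it just hides latency when the hardware prefetcher can't follow the indirection.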