item 47103979

DoctorOetker | 8 days ago

Is there a reason GPUs don't use insane "blocks" of SD card slots (for massively parallel IO) so the model weights don't need to pass through a limited PCIe bus?


Neywiny|8 days ago

Yes. Let's do the math. The fastest SD cards can read at around 300 MB/s (https://havecamerawilltravel.com/fastest-sd-cards/). Modern GPUs use 16 lanes of PCIe gen 5, which is 16 × 32 Gb/s = 512 Gb/s = 64 GB/s. Meaning you'd need over 200 of the fastest SD cards. So what you're asking is: is there a reason GPUs don't use 200 SD cards? And I can't think of any way that would work
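The arithmetic above can be sanity-checked in a few lines (the per-card and per-lane figures are the ones assumed in the comment, ignoring PCIe encoding overhead):

```python
# Back-of-envelope check of the numbers above (assumed figures, not a spec).
pcie5_lane_gb_s = 32 / 8            # 32 Gb/s per lane ~ 4 GB/s
gpu_bw_gb_s = 16 * pcie5_lane_gb_s  # x16 link -> 64 GB/s
sd_bw_gb_s = 300 / 1000             # ~300 MB/s per fast UHS-II card

cards_needed = gpu_bw_gb_s / sd_bw_gb_s
print(f"{cards_needed:.0f} cards to match one PCIe gen 5 x16 link")
```

That works out to roughly 213 cards, before any controller or wiring overhead.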

hedgehog|8 days ago

SD is obviously the wrong interface for this but "High Bandwidth Flash" (stacked flash akin to HBM) is in development for exactly this kind of problem. AMD actually made a GPU with onboard flash maybe a decade ago but I think it was a bit early. Today I would love to have a pool of 50GB/s storage attached to the GPU.

Dylan16807|7 days ago

One thing to note, those aren't the fastest SD cards, those are the fastest UHS-II SD cards. The future is SD Express and you can already get microSDs at 900 MB/s.

magicalhippo|8 days ago

Some years ago I realized that if I had oodles of money to spend I would totally get someone to make a PCIe card with like several hundred microSD cards on it.

You can buy vertical microSD connectors, so you can stack quite a lot of them on a PCIe card. Then a beefy FPGA to present it as a NVMe device to the host.

Goal: total capacity, as you can put 1 TB cards in there. And for teh lulz of course.

jiggawatts|8 days ago

The next gen inference chips will use High Bandwidth Flash (HBF) to store model weights.

These are made similarly to HBM but are lower power and much higher capacity. They can also be used for caching to reduce costs when processing long chat sessions.

numpad0|8 days ago

Maybe latency. IIRC flash is a lot laggier than DRAM and SRAM.

DoctorOetker|7 days ago

The random-access memory model is not really representative of ML workloads (both training and inference), where multiplying large tensors results in predictable memory access patterns.
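The point about predictability can be illustrated with a toy tile schedule for a blocked matmul C = A @ B: the order in which weight tiles of B are consumed is fixed before the kernel runs, so a prefetcher could stream each tile from slower storage while the previous one is being multiplied (a hypothetical sketch, not any real kernel's schedule):

```python
def weight_tile_schedule(m_tiles, n_tiles, k_tiles):
    """Yield the (k, n) tile coordinates of B in the order a simple
    blocked matmul consumes them -- fully known ahead of time."""
    order = []
    for m in range(m_tiles):        # rows of output tiles
        for n in range(n_tiles):    # columns of output tiles
            for k in range(k_tiles):  # reduction dimension
                order.append((k, n))
    return order

sched = weight_tile_schedule(2, 2, 2)
# Because sched is computable before the first multiply, tile fetches
# from flash can be issued well ahead of when the compute needs them.
```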