DRAM speed is one thing, but you should also account for the data rate of the PCIe bus (and/or VRAM speed). But yes, holding it "lukewarm" in DRAM rather than on NVMe storage is obviously faster.
Four channels of DDR4-3200 vs two channels of DDR5-6400 (four subchannels) should come out pretty close. I don't see any reason why the DDR4 configuration would be consistently faster; you might have more bank groups on DDR4, but I'm not sure that would outweigh other factors like the topology and bandwidth of the interconnects between the memory controller and the CPU cores.
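To put numbers on "pretty close": a rough sketch of the theoretical peak for both configurations, assuming a 64-bit data bus per channel (for DDR5, that is two 32-bit subchannels per channel). The function name is made up for illustration, and these are peak figures only; real sustained bandwidth depends on the controller and interconnect, as noted above.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int, bus_bits: int = 64) -> float:
    """Theoretical peak in GB/s: transfers per second * bytes per transfer * channels."""
    bytes_per_transfer = bus_bits / 8
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9

# Assumption: 64 data bits per channel; DDR5 splits that into 2 x 32-bit subchannels.
ddr4_quad = peak_bandwidth_gbs(3200, channels=4)
ddr5_dual = peak_bandwidth_gbs(6400, channels=2)

print(f"4 x DDR4-3200: {ddr4_quad:.1f} GB/s")
print(f"2 x DDR5-6400: {ddr5_dual:.1f} GB/s")
```

On paper the two come out the same, so any consistent difference would have to come from something other than raw channel bandwidth.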
someguy2026|8 days ago
tgrowazay|6 days ago
In general, a system's PCIe link usually has more bandwidth than that system's RAM.
For example, a system with DDR4 (27 GB/s) usually has at least PCIe 4.0 (32 GB/s at x16).
But you can create a bottleneck by building a DDR5 (40 GB/s) system with a PCIe 4.0 card.
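A quick way to check which side is the bottleneck, as a sketch. The PCIe figure is derived from the 16 GT/s per-lane rate and 128b/130b encoding of PCIe 4.0; the DRAM figures (27 and 40 GB/s) are simply the ones quoted in this comment, not measurements, and the helper names are made up for illustration.

```python
def pcie_bandwidth_gbs(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Usable one-direction bandwidth in GB/s: line rate * encoding efficiency / 8 bits * lanes."""
    return gt_per_s * encoding / 8 * lanes

def bottleneck(dram_gbs: float, pcie_gbs: float) -> str:
    """Name whichever side has the lower bandwidth."""
    return "PCIe" if pcie_gbs < dram_gbs else "DRAM"

pcie4_x16 = pcie_bandwidth_gbs(16, lanes=16)  # a bit under the nominal 32 GB/s

# DRAM figures are the ones quoted in the comment above (assumed, not measured).
for name, dram_gbs in [("DDR4", 27.0), ("DDR5", 40.0)]:
    print(f"{name} at {dram_gbs} GB/s vs PCIe 4.0 x16 at {pcie4_x16:.1f} GB/s: "
          f"{bottleneck(dram_gbs, pcie4_x16)} is the limit")
```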
xaskasdf|8 days ago
uf00lme|8 days ago
wtallis|8 days ago
vlovich123|8 days ago
zozbot234|8 days ago
tgrowazay|6 days ago
Llama 3.1, however, is not MoE, so all params are active.
MoE is trickier: each token uses only a subset of the params (an “expert”), but you don't know in advance which one, so you either keep them all in memory or wait while the needed expert loads from slower storage, and it may be a different expert for each token.
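A toy illustration of why the whole set ends up needing to be resident. This is not any real model's router: the expert count, the top-k value, and the random scores standing in for a learned gating network are all assumptions made for the sketch.

```python
import random

NUM_EXPERTS = 8  # hypothetical expert count
TOP_K = 2        # hypothetical number of experts activated per token

def route(scores, k=TOP_K):
    """Return the indices of the k experts with the highest router scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

rng = random.Random(0)  # fixed seed so repeated runs match
touched = set()
for step in range(16):
    # Random scores stand in for a learned router's output for this token.
    scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
    chosen = route(scores)
    touched.update(chosen)
    print(f"token {step}: experts {chosen}")

print(f"{len(touched)} of {NUM_EXPERTS} experts touched after 16 tokens")
```

Each token only needs a small slice, but the slice keeps changing, so over a short sequence most of the experts get hit and anything not already in memory stalls on a load.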