rao-v | 8 days ago
I’ve also wondered why the routers aren’t trained to be serially consistent so you can predict layers to swap into VRAM a few layers ahead to maximize available bandwidth.
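Roughly the kind of thing I have in mind, as a toy sketch (all names and shapes here are made up for illustration, not any real model’s API): a small lookahead head that guesses which experts the next few layers will route to, so the copies can start early.

    import torch
    import torch.nn as nn

    class LookaheadRouter(nn.Module):
        # One extra linear head per future layer: trained (hypothetically) so its
        # top-k matches what that layer's real router will pick, which is what
        # would make routing predictable enough to start copying weights early.
        def __init__(self, d_model, num_experts, lookahead=3):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Linear(d_model, num_experts) for _ in range(lookahead)
            )

        def forward(self, hidden, top_k=2):
            # expert ids we expect layers N+1 .. N+lookahead to need
            return [head(hidden).topk(top_k, dim=-1).indices for head in self.heads]

    router = LookaheadRouter(d_model=512, num_experts=64)
    predicted = router(torch.randn(1, 512))  # kick off host->VRAM prefetch for these ids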
reitzensteinm|8 days ago
Unless you're handling that in some kind of fancy way, you'll be holding up the batch while waiting for host memory, which will kill your throughput.
It makes much more sense for non-batched local inference, especially if you can keep the MoE routing stable like you say, but most folks aren't optimising for that.
svnt|8 days ago
1) This is basically the intention of several recent MoE models: keep particular, generally useful experts hot in VRAM.
2) Unless you can swap layers in faster than you consume them, there is no point in predicting layers (what does this even really mean? did you mean predicting experts?).
It seems that at the moment the best you can do is keep the experts and layers more likely to be used for a given query in VRAM and offload the rest, but this is workload-dependent.
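A toy version of (1), just to make it concrete (the cache policy and every name here are invented for the example, not what any engine actually does): pin the most-used experts in VRAM and pull the rest from pinned host memory on demand.

    import torch

    class ExpertCache:
        # experts_cpu: dict expert_id -> CPU weight tensor; vram_budget: how many
        # experts we allow resident on the GPU at once.
        def __init__(self, experts_cpu, vram_budget, device="cuda"):
            self.cpu = {i: t.pin_memory() for i, t in experts_cpu.items()}  # pinned for fast DMA
            self.vram = {}                        # expert_id -> GPU tensor
            self.hits = {i: 0 for i in experts_cpu}
            self.budget = vram_budget
            self.device = device

        def get(self, expert_id):
            self.hits[expert_id] += 1
            if expert_id not in self.vram:
                if len(self.vram) >= self.budget:
                    coldest = min(self.vram, key=self.hits.get)  # evict least-used resident expert
                    del self.vram[coldest]
                self.vram[expert_id] = self.cpu[expert_id].to(self.device, non_blocking=True)
            return self.vram[expert_id]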
rao-v|6 days ago
With good predictability of MoE routing, you might see a world where it's more efficient to spend PCIe bandwidth (slower than RAM->CPU) on loading MoE experts for the next ~3 layers from RAM to VRAM so you are not rooflined by CPU compute.
vLLM / SGLang (AFAIK) just assume you have enough VRAM to fit all the experts (but will page the KV cache to RAM).
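A minimal sketch of what I mean by spending PCIe bandwidth ahead of time (assumes routing for the upcoming layers is already known or predicted; stream handling is simplified and the names are illustrative): issue the host->VRAM copies on a side CUDA stream so the transfers overlap with the current layer's compute.

    import torch

    copy_stream = torch.cuda.Stream()   # side stream so copies overlap with compute

    def prefetch_experts(experts_cpu_pinned, expert_ids, device="cuda"):
        # experts_cpu_pinned: dict expert_id -> pinned CPU tensor. Issuing the
        # copies on copy_stream lets the PCIe transfers run while the default
        # stream is still busy with the current layer's matmuls.
        out = {}
        with torch.cuda.stream(copy_stream):
            for eid in expert_ids:
                out[eid] = experts_cpu_pinned[eid].to(device, non_blocking=True)
        return out

    # before layer N+3 actually uses the prefetched weights:
    #   torch.cuda.current_stream().wait_stream(copy_stream)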