top | item 47106326

(no title)

rl3 | 8 days ago

Nice. I've been looking at doing something similar, more on the order of running a 1T model with less than half the available VRAM.

One workup indicated it was theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.

I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.

Curious if anyone's tried this already.

discuss

xaskasdf|8 days ago

That would be nice to see. Actually I was thinking about getting another 3090 and a mobo upgrade since I'm bottlenecked by pcie3 to tryna run glm 4.7 or 5 at q4_k_m, it should be possible.