sdrinf | 4 days ago
* even if an open-weight model appeared on huggingface today exceeding SOTA, given my extensive experience with a wide variety of model sizes, I would find it highly surprising that "99% of use cases" could be served by a <100B model.
* Meanwhile: I asked Claude to look into consumer GPU VRAM growth rates. Median consumer VRAM went from 1-2GB @ 2015 to ~8GB @ 2026, roughly doubling every 5 years; top-end isn't much better, just ahead 2 cycles.
* Putting aside current RAM sourcing issues, it seems very unlikely even high-end prosumers will routinely have >100GB VRAM (= ability to run a quantized SOTA 100B model) before ~2035-2040.
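As a back-of-envelope check on the claim above (a sketch under the comment's stated assumptions: ~8GB median in 2026, doubling every 5 years, top end ~2 cycles ahead; function names are mine, not from any source):

```python
import math

def vram_gb(year, base_year=2026, base_gb=8, doubling_years=5):
    """Projected median consumer VRAM in GB, doubling every `doubling_years`."""
    return base_gb * 2 ** ((year - base_year) / doubling_years)

def year_to_reach(target_gb, base_year=2026, base_gb=8, doubling_years=5):
    """First year the median projection hits target_gb (fractional)."""
    return base_year + doubling_years * math.log2(target_gb / base_gb)

median_100gb = year_to_reach(100)          # ~2044 for the median card
top_end_100gb = median_100gb - 2 * 5       # top end ~2 cycles ahead: ~2034
```

Under these assumptions the top end crosses 100GB in the mid-2030s and the median card not until the mid-2040s, which is consistent with the ~2035-2040 estimate for prosumers.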
xml|4 days ago
I also believe that it should eventually be possible to train a model with somewhat persistent mixture of experts, so you only have to load different experts every few tokens. This will enable streaming experts from NVMe SSDs, so you can run state of the art models at interactive speeds with very little VRAM as long as they fit on your disk.
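The win from "somewhat persistent" routing is that consecutive tokens mostly reuse the same expert, so disk loads only happen on expert switches. A toy sketch of that caching behavior (all names, sizes, and the dict standing in for NVMe storage are illustrative, not from any real MoE implementation):

```python
from collections import OrderedDict

class ExpertCache:
    """Keep a few experts resident in 'VRAM'; load from storage on a miss."""

    def __init__(self, load_fn, capacity=2):
        self.load_fn = load_fn        # loads one expert's weights from storage
        self.capacity = capacity      # how many experts fit in VRAM at once
        self.cache = OrderedDict()    # expert_id -> weights, in LRU order
        self.disk_loads = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # mark as most recently used
            return self.cache[expert_id]
        self.disk_loads += 1                    # cache miss: hit the SSD
        weights = self.load_fn(expert_id)
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        return weights

# Simulated on-disk experts; persistent routing reuses the same expert
# for runs of consecutive tokens, so most lookups hit the cache.
disk = {i: f"weights-{i}" for i in range(8)}
cache = ExpertCache(disk.__getitem__, capacity=2)
routing = [0, 0, 0, 1, 1, 1, 0, 0, 2, 2]   # expert chosen per token
for expert in routing:
    cache.get(expert)
# only 3 of the 10 lookups had to touch "disk"
```

With per-token routing every token could force a load; with sticky routing the load rate is bounded by the switch rate, which is what makes NVMe-speed streaming plausible.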
athrowaway3z|4 days ago
But on a tangent, why do you believe in mixture of experts?
Everything I know about them makes me believe they're an architectural dead end.
otabdeveloper4|4 days ago
There's easier ways to do that.