(no title)
renonce | 7 months ago
See https://github.com/peteryuqin/Kimi-K2-Mini, a project that keeps only a small portion of the experts and layers while preserving the model's capabilities across multiple domains.
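A minimal sketch of the general idea behind that kind of expert pruning (illustrative only, not the actual Kimi-K2-Mini code): route some calibration tokens through an MoE layer, count how often each expert is selected, then keep only the most-used experts and shrink the router to match. All names here are hypothetical.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, dim)
        scores = self.router(x).softmax(dim=-1)          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out, idx

def prune_experts(layer: MoELayer, calib_tokens: torch.Tensor, keep: int) -> MoELayer:
    """Keep the `keep` most frequently routed experts (keep >= layer.top_k)."""
    with torch.no_grad():
        _, idx = layer(calib_tokens)
        counts = torch.bincount(idx.flatten(), minlength=len(layer.experts))
    keep_ids = counts.topk(keep).indices.sort().values
    pruned = MoELayer(layer.router.in_features, keep, layer.top_k)
    pruned.experts = nn.ModuleList(layer.experts[i] for i in keep_ids.tolist())
    # Shrink the router so it only scores the surviving experts.
    pruned.router.weight.data = layer.router.weight.data[keep_ids]
    pruned.router.bias.data = layer.router.bias.data[keep_ids]
    return pruned

# Usage: prune a 64-expert layer down to 8 experts on random calibration tokens.
layer = MoELayer(dim=32, n_experts=64)
small = prune_experts(layer, torch.randn(1024, 32), keep=8)
```

Whether this preserves multi-domain capability depends entirely on whether routing is domain-skewed to begin with, which is what the next comment gets at.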
viraptor | 7 months ago
What I did find instead is that some MoE models are explicitly domain-routed (MoDEM), but that doesn't apply to DeepSeek, whose experts are just load-balanced evenly, so it's unlikely to apply to Kimi either. On the other hand, https://arxiv.org/html/2505.21079v1 shows modality preferences between experts even under mostly random training, so maybe there's something there.
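For context, "equally load balanced" usually refers to an auxiliary loss of the kind used in Switch Transformer and similar MoE training setups (a sketch, not DeepSeek's exact formulation): it is minimised when every expert receives the same share of tokens, which actively discourages any single expert from "owning" a domain and makes domain-based pruning harder.

```python
import torch

def load_balance_loss(router_probs: torch.Tensor,
                      expert_idx: torch.Tensor,
                      n_experts: int) -> torch.Tensor:
    # router_probs: (tokens, n_experts) softmax outputs of the router
    # expert_idx:   (tokens,) hard top-1 expert assignment per token
    frac_tokens = torch.bincount(expert_idx, minlength=n_experts).float() / expert_idx.numel()
    frac_probs = router_probs.mean(dim=0)
    # Minimised when both distributions are uniform (1/n_experts each),
    # i.e. every expert sees the same share of tokens regardless of domain.
    return n_experts * torch.dot(frac_tokens, frac_probs)
```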