Thanks! So this is something I tried, and qualitatively I didn't see a huge difference. I'd like to swap out my hand-rolled modules for standard PyTorch modules for self-attention etc. and train it on the Wikipedia English split. That's on my to-do list for sure.
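For reference, what I mean by "standard PyTorch modules" is roughly the sketch below; the sizes, pre-norm layout, and mask handling here are just placeholders, not my actual code:

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """Transformer block built from stock PyTorch modules (rough sketch, sizes are placeholders)."""
        def __init__(self, d_model=256, n_heads=4, dropout=0.1):
            super().__init__()
            # nn.MultiheadAttention stands in for the hand-rolled attention module;
            # batch_first=True keeps the (batch, seq, dim) layout.
            self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
            self.ln1 = nn.LayerNorm(d_model)
            self.ln2 = nn.LayerNorm(d_model)
            self.mlp = nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )

        def forward(self, x, attn_mask=None):
            # Pre-norm residual wiring; attn_mask can carry the causal mask.
            h = self.ln1(x)
            a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
            x = x + a
            return x + self.mlp(self.ln2(x))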
I ran some tests. A single dense model of the same total size is better than the MoE. A single expert pulled out of the N is better than a standalone model of the same size (i.e. the size of one expert). Two experts are better than one. That was on a small LLM; not sure if it scales.
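For context, the kind of MoE layer in those tests is along these lines (a toy sketch; the expert count, top-k of 2, and sizes here are made up for illustration):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMoE(nn.Module):
        """Minimal top-k mixture-of-experts feed-forward layer (toy sketch, numbers are made up)."""
        def __init__(self, d_model=256, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                                # x: (batch, seq, d_model)
            logits = self.router(x)                          # (batch, seq, n_experts)
            weights, idx = logits.topk(self.top_k, dim=-1)   # pick top-k experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[..., k] == e                  # tokens routed to expert e in slot k
                    if mask.any():
                        out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
            return out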
avisoori1x|1 year ago
zingelshuher|1 year ago