kcorbitt | 9 months ago

It seems like the speedups here are most useful for small models, since on larger models a smaller fraction of the total time would be spent swapping between kernels? Would be interesting to see at least theoretical results for LLMs in the 14-70B parameter range, which is what most folks deploy in practice.

And of course the effect on throughput at larger batch sizes, which they allude to at the end.

Overall a very interesting result!

ptrj_ | 9 months ago

This could also give a nice speedup for MoE models with 7B-70B total parameters but O(10x) fewer active params, e.g. https://huggingface.co/Qwen/Qwen3-30B-A3B, assuming the expert router can be scheduled effectively within the monolithic mega-kernel.
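
Rough memory-bandwidth math for why such an MoE sits in the same latency regime as a small dense model (all figures below are my own assumptions for illustration, not from the post): at batch size 1 the forward pass is roughly bound by streaming the active weights once per token, so active params are what matter, not total params.

    # Back-of-envelope: per-token latency floor ~= active_params * bytes_per_param / HBM bandwidth.
    # Assumed figures: bf16 weights, ~3.35 TB/s HBM (H100 SXM class).
    BYTES_PER_PARAM = 2
    HBM_BW = 3.35e12  # bytes/s

    for name, active_params in [("Qwen3-30B-A3B (~3B active)", 3e9),
                                ("dense 30B", 30e9)]:
        ms = active_params * BYTES_PER_PARAM / HBM_BW * 1e3
        print(f"{name}: ~{ms:.1f} ms/token lower bound")

    # ~1.8 ms/token for the MoE vs ~17.9 ms/token dense: the MoE is back in the
    # regime where fixed per-pass kernel-launch/sync overhead is a meaningful
    # fraction of the total.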

mmoskal | 9 months ago

They are reducing forward pass time from, say, 1.5ms to 1ms. On a bigger model you would likely reduce from 15ms to 14.2ms or something like that.
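
A quick way to see why the relative win shrinks, plugging in the illustrative numbers from this comment (the assumption being that the absolute per-pass saving from removing launch/sync gaps stays roughly constant as the model grows):

    cases = {
        "small model (quoted above)": (1.5, 1.0),   # ms before / after
        "bigger model (guesstimate)": (15.0, 14.2),
    }
    for name, (before, after) in cases.items():
        print(f"{name}: {before} ms -> {after} ms, "
              f"{before / after:.2f}x faster, "
              f"{(before - after) / before:.0%} of the pass removed")

    # small model:  1.50x faster, 33% of the pass removed
    # bigger model: 1.06x faster,  5% of the pass removed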