bytepoet | 8 months ago
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
zhihaojia | 8 months ago
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
bytepoet | 8 months ago
I look forward to following Mirage's development.
ActorNightly | 8 months ago