bytepoet | 8 months ago
I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.
I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:
FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667
zhihaojia | 8 months ago
Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!
bytepoet | 8 months ago
I look forward to following Mirage's development.
ActorNightly | 8 months ago