Diffusion LMs do seem to get more out of the same data. In a world where we are already training transformer-based LLMs on all available text, diffusion LMs' ability to keep learning from a fixed dataset may let them outperform transformers: https://arxiv.org/abs/2511.03276
nbardy|3 months ago
So it's more about the masked modeling objective than diffusion.
albertzeyer|3 months ago