(no title)
ActivePattern | 4 months ago
- There are no experts. The K outputs approximate random samples from the learned distribution.
- There is no latent diffusion going on; it uses plain convolutions, much like a GAN.
- At inference time, you select the sample index ahead of time, so no computation is discarded.
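The ahead-of-time selection in the last bullet can be sketched as follows. This is a toy illustration, not the paper's actual architecture: the names, shapes, and the `tanh` stem are all assumptions. The point is that if the index is fixed before the forward pass, only the selected head needs to be evaluated, so nothing is computed and then thrown away.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper):
# feature dim, number of candidate outputs, output dim.
rng = np.random.default_rng(0)
D, K, OUT = 16, 8, 4

W_stem = rng.standard_normal((D, D))        # shared "stem" weights (the expensive part)
W_heads = rng.standard_normal((K, D, OUT))  # K lightweight output heads

def sample(x, k):
    """Pick the sample index k ahead of time: only head k is evaluated,
    so no candidate computation is discarded."""
    stem = np.tanh(x @ W_stem)   # expensive shared feature, computed once
    return stem @ W_heads[k]     # evaluate just the selected head

x = rng.standard_normal(D)
out = sample(x, k=3)             # a single candidate, shape (OUT,)
```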
diyer22 | 4 months ago
Supplement for @f_devd:
During training, the K outputs share the stem feature from the preceding NN blocks, so generating all K outputs adds only a small amount of extra computation. After L2-distance sampling, discarding the other K-1 outputs is therefore nearly free; it is not comparable to discarding K-1 MoE experts, which would be very expensive.
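A minimal sketch of this training-time selection, under the same illustrative assumptions (the dimensions, `tanh` stem, and head shapes are mine, not the paper's): one expensive stem is computed once, K cheap heads reuse it, and the L2-closest candidate to the target is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, OUT = 16, 8, 4                        # illustrative sizes (assumptions)

W_stem = rng.standard_normal((D, D))        # shared stem: the expensive part
W_heads = rng.standard_normal((K, D, OUT))  # K cheap heads reusing the stem

def forward(x):
    stem = np.tanh(x @ W_stem)                    # computed once for all K
    return np.einsum('d,kdo->ko', stem, W_heads)  # (K, OUT) candidate outputs

def l2_select(x, target):
    """Keep the candidate closest to the target; discarding the other K-1
    costs almost nothing, since the stem was computed only once."""
    outs = forward(x)
    dists = np.linalg.norm(outs - target, axis=1)
    k = int(np.argmin(dists))
    return k, outs[k]

x = rng.standard_normal(D)
target = rng.standard_normal(OUT)
k, best = l2_select(x, target)
```

The selected head's output would then receive the training gradient, while the unselected heads simply go unused for this sample.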
f_devd | 4 months ago
The ahead-of-time sampling doesn't make much sense to me mechanically, and it isn't really explained in much detail. But I will withhold judgement until future versions, since the FID of this first iteration is still not that great.