espadrine | 4 months ago
Why RVQ though, rather than using the raw VAE embedding?
If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?
vvolhejn | 4 months ago
But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759
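The mean-vs-mode failure is easy to see with a toy example. The sketch below (a hypothetical illustration, not from the blog post) uses a bimodal "continuation" that is either -1 or +1 with equal probability: the MSE-optimal continuous prediction is the mean, near 0, a value that never actually occurs, while a categorical model over quantized codes (the two cluster centers stand in for RVQ codebook entries) keeps mass on both modes.

```python
import numpy as np

# Toy bimodal target: half the continuations are -1.0, half are +1.0.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=10_000)

# A model trained with MSE on the raw continuous value converges to the
# mean of the continuations...
mse_optimal = y.mean()  # near 0.0, which is never a valid continuation

# ...while a categorical model over quantized tokens (2 cluster centers,
# standing in for an RVQ codebook) recovers both modes.
centers = np.array([-1.0, 1.0])
tokens = np.argmin(np.abs(y[:, None] - centers[None, :]), axis=1)
probs = np.bincount(tokens, minlength=2) / len(tokens)

print(mse_optimal)  # the mean, not a mode
print(probs)        # roughly equal mass on the two true modes
```

Sampling from `probs` produces realistic continuations (-1 or +1), whereas the MSE prediction blurs them together.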
At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926
programjames | 4 months ago