espadrine | 4 months ago
Why RVQ though, rather than using the raw VAE embedding?
If I compare rvq-without-quantization-v4.png with rvq-2-level-v4.png, the quality seems oddly similar, but the former takes a 32-sized vector, while the latter takes two 32-sized (one-hot) vectors (2 = number of levels, 32 = number of quantization cluster centers). Isn't that more?
vvolhejn | 4 months ago
But categorical distributions are better for modelling. It's a little difficult to explain here without using diagrams. The intuition is that if you try to have a model predict the next embedding and not the next token, you can't model multimodal distributions - you'll end up predicting the mean of the possible continuations and not the mode, which is not what you want.
Check out Section 5.3 and Figure 6 from PixelRNN, where they discuss this phenomenon: https://arxiv.org/pdf/1601.06759
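The mean-vs-mode failure is easy to see with a toy example. The sketch below (a hypothetical illustration, not from the blog post) uses a bimodal "continuation" that is either -1 or +1 with equal probability: the MSE-optimal continuous prediction is the mean, near 0, a value that never actually occurs, while a categorical model over quantized codes (the two cluster centers stand in for RVQ codebook entries) keeps mass on both modes.

```python
import numpy as np

# Toy bimodal target: half the continuations are -1.0, half are +1.0.
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=10_000)

# A model trained with MSE on the raw continuous value converges to the
# mean of the continuations...
mse_optimal = y.mean()  # near 0.0, which is never a valid continuation

# ...while a categorical model over quantized tokens (2 cluster centers,
# standing in for an RVQ codebook) recovers both modes.
centers = np.array([-1.0, 1.0])
tokens = np.argmin(np.abs(y[:, None] - centers[None, :]), axis=1)
probs = np.bincount(tokens, minlength=2) / len(tokens)

print(mse_optimal)  # the mean, not a mode
print(probs)        # roughly equal mass on the two true modes
```

Sampling from `probs` produces realistic continuations (-1 or +1), whereas the MSE prediction blurs them together.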
At the bottom of the blog, I link two articles that do make continuous embeddings work. One of them is the Kyutai paper Continuous Audio Language Models: https://arxiv.org/abs/2509.06926
programjames | 4 months ago