Advanced voice mode operates on audio tokens directly; it doesn't transcribe them into "text tokens" as an intermediate step, the way the original version of voice mode did.
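The difference between the two pipelines can be sketched as a toy contrast. Everything here is a stand-in for illustration, not OpenAI's actual components; the point is just where prosody gets dropped:

```python
# Toy model: "audio" is a dict carrying both words and prosody (pitch).
# All functions are hypothetical stand-ins, not a real API.

def transcribe(audio):
    return audio["words"]                      # pitch is discarded here

def audio_tokenize(audio):
    return (audio["words"], audio["pitch"])    # pitch survives tokenization

def generate_reply(x):
    return x                                   # echo "model", for illustration

def synthesize(text):
    return {"words": text, "pitch": "flat"}    # TTS must guess prosody

def audio_detokenize(tokens):
    words, pitch = tokens
    return {"words": words, "pitch": pitch}

def original_voice_mode(audio):
    # transcribe to text first: prosody is lost at the transcription step
    return synthesize(generate_reply(transcribe(audio)))

def advanced_voice_mode(audio):
    # stay in (toy) audio tokens end to end: prosody can survive
    return audio_detokenize(generate_reply(audio_tokenize(audio)))

question = {"words": "really?", "pitch": "rising"}
print(original_voice_mode(question)["pitch"])   # pitch lost
print(advanced_voice_mode(question)["pitch"])   # pitch kept
```

In the text-intermediate pipeline the rising intonation never reaches the language model, so the synthesizer can only guess at it; in the end-to-end pipeline it is part of the token stream.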
We don't know whether that's due to inherent limitations of the tokenisation of audio or a byproduct of reinforcement learning. In my own usage, I've noticed a significant degradation in capabilities since advanced voice mode was first released: the model used to be able to sing, whisper, and imitate sounds and tone just fine, but I imagine this wasn't intended and has since been stunted via reinforcement learning.
I don't find the article's argument that this is due to tokenisation convincing.
Right, but either whatever audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.
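One way a tokenizer could fail to encode pitch: if its codebook is coarse, nearby pitches collapse onto the same discrete token and the distinction is unrecoverable. A minimal sketch, with a deliberately tiny made-up codebook (real audio codecs are far richer than this):

```python
import numpy as np

# Hypothetical one-dimensional "pitch codebook": one representative
# frequency (Hz) per token. Values are made up for illustration.
codebook = np.array([110.0, 220.0, 440.0, 880.0])

def tokenize(pitches_hz):
    # map each pitch to the index of the nearest codebook entry
    return np.abs(pitches_hz[:, None] - codebook[None, :]).argmin(axis=1)

def detokenize(tokens):
    return codebook[tokens]

# A (440 Hz) and A# (466.16 Hz) are a semitone apart, but both land on
# the same token, so reconstruction can't tell them apart.
contour = np.array([440.0, 466.16])
tokens = tokenize(contour)
print(tokens)              # same token twice
print(detokenize(tokens))  # the semitone difference is gone
```

If the codebook (or whatever the tokenizer actually quantizes) doesn't resolve pitch finely, no amount of downstream modeling can recover it, which is consistent with the "doesn't seem to encode pitch" branch above.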
cubefox|4 months ago
bigzyg33k|4 months ago
sbrother|4 months ago
oezi|4 months ago
fragmede|4 months ago