
bigzyg33k | 4 months ago

Advanced voice mode operates on audio tokens directly; it doesn't transcribe the audio into text tokens as an intermediate step, the way the original version of voice mode did.
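A rough sketch of the distinction being drawn, with purely illustrative function names (none of this reflects OpenAI's actual implementation): in the cascaded design, prosody is discarded at the transcription step, whereas a native audio-token model could in principle carry it end to end.

```python
def cascaded_voice_mode(audio, asr, llm, tts):
    """Original voice mode: audio -> text -> text -> audio.
    Pitch, whispering, and tone are lost at the ASR step,
    because a transcript only keeps the words."""
    text_in = asr(audio)      # transcription discards non-lexical info
    text_out = llm(text_in)   # the LLM only ever sees text tokens
    return tts(text_out)      # synthesized voice is generic

def native_voice_mode(audio, tokenizer, audio_llm, detokenizer):
    """Advanced voice mode (as described above): the model consumes
    audio tokens directly, so acoustic information can, in principle,
    survive through the whole pipeline."""
    tokens_in = tokenizer(audio)        # discrete audio tokens, not text
    tokens_out = audio_llm(tokens_in)   # model reasons over audio tokens
    return detokenizer(tokens_out)      # decode tokens back to waveform
```

Whether the tokenizer actually preserves pitch and tone is exactly what the rest of this thread is arguing about.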


cubefox|4 months ago

But they behave just like models which use text tokens internally, which is also pointed out at the end of the above article.

bigzyg33k|4 months ago

we don't know if that's due to inherent limitations of the tokenisation of audio, or a byproduct of reinforcement learning. In my own usage, I noticed a significant degradation in capabilities over time from when they initially released advanced voice mode. The model used to be able to sing, whisper, imitate sounds and tone just fine, but I imagine this was not intended and has subsequently been stunted via reinforcement learning.

I don't find the article's argument that this is due to tokenisation convincing.

sbrother|4 months ago

Right, but then either the audio tokenization it's doing doesn't encode pitch, or there's ~nothing in the training set where pitch is relevant.
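For what "encoding pitch" means concretely: fundamental frequency is trivially recoverable from raw audio with something as simple as autocorrelation, so if it's absent from the tokens, that's a design choice of the tokenizer, not a limitation of the signal. A minimal sketch (standard textbook method, nothing to do with any particular model's tokenizer):

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation:
    find the lag at which the signal best matches a shifted copy
    of itself, restricted to plausible voice-pitch periods."""
    sig = signal - signal.mean()
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]  # lags 0..N-1
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / lag

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)   # one second of a 220 Hz sine
print(estimate_pitch(tone, sr))        # should recover roughly 220 Hz
```

The point is only that pitch is cheap to extract; whether a given audio tokenizer chooses to represent it is a separate question.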

oezi|4 months ago

Absolutely correct! My simple test is whether it can tell the American and British English pronunciations of "tomato" and "potato" apart. So far it can't.

fragmede|4 months ago

Which "it" are you referring to? There are models that can.