The main issue I'm facing with real-time responses (speech output) is how to separate non-diegetic outputs (e.g. thinking, structured outputs) from outputs meant to be heard by the end user.
A simple way is to split the model’s output stream before TTS.
Reasoning/structured tokens go into one bucket, actual user-facing text into another, and only the second bucket is synthesized. Most of the "thinking out loud" problems come from feeding the whole stream straight into the audio pipeline.
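For what it's worth, a minimal sketch of that split in Python. It assumes (hypothetically) that reasoning is delimited by <think>...</think> tags and that each tag arrives whole within a single chunk; the real stream format depends on the model/API:

```python
# Route a streamed model output into two channels before TTS:
# "reasoning" (never spoken) and "speech" (sent to synthesis).
# Assumption: reasoning is wrapped in <think>...</think> tags and
# each tag arrives intact in one chunk (not split across chunks).

from typing import Iterable, Iterator, Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"

def split_stream(tokens: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (channel, text) pairs, channel is 'reasoning' or 'speech'."""
    in_think = False
    buffer = ""
    for tok in tokens:
        buffer += tok
        while buffer:
            marker = THINK_CLOSE if in_think else THINK_OPEN
            idx = buffer.find(marker)
            if idx == -1:
                # No tag in sight: flush everything to the current channel.
                yield ("reasoning" if in_think else "speech", buffer)
                buffer = ""
            else:
                # Flush text before the tag, then switch channels.
                yield ("reasoning" if in_think else "speech", buffer[:idx])
                buffer = buffer[idx + len(marker):]
                in_think = not in_think

# Only the "speech" channel would be handed to the TTS engine.
chunks = ["Hello ", "<think>", "plan the reply", "</think>", "there!"]
for channel, text in split_stream(chunks):
    if channel == "speech" and text:
        print("TTS:", text)      # stand-in for a real tts.speak(text) call
    elif text:
        print("log:", text)      # reasoning goes to logs/UI, not audio
```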
There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)