A simple approach is to split the model’s output stream before it reaches TTS.
Reasoning and structured tokens go into one bucket, the actual user-facing text into another, and only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
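As a rough sketch of that split, assuming the model wraps its reasoning in `<think>`...`</think>` tags (that delimiter is an assumption here, as are the names `split_stream` and `speech_only`; a real model may mark reasoning differently), something like this routes a chunked stream into the two buckets and copes with a tag being cut in half across chunks:

```python
from typing import Iterable, Iterator, Tuple

# Assumed delimiters; substitute whatever your model actually emits.
OPEN_TAG = "<think>"
CLOSE_TAG = "</think>"


def split_stream(chunks: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, text) pairs, where bucket is "reasoning" or "speech"."""
    buffer = ""
    in_reasoning = False
    for chunk in chunks:
        buffer += chunk
        while True:
            tag = CLOSE_TAG if in_reasoning else OPEN_TAG
            bucket = "reasoning" if in_reasoning else "speech"
            idx = buffer.find(tag)
            if idx != -1:
                # Complete tag found: flush the text before it, then switch buckets.
                if idx > 0:
                    yield bucket, buffer[:idx]
                buffer = buffer[idx + len(tag):]
                in_reasoning = not in_reasoning
                continue
            # No complete tag: hold back any suffix that could be the start of one.
            keep = 0
            for n in range(min(len(tag) - 1, len(buffer)), 0, -1):
                if tag.startswith(buffer[-n:]):
                    keep = n
                    break
            emit = buffer[:len(buffer) - keep]
            if emit:
                yield bucket, emit
            buffer = buffer[len(buffer) - keep:]
            break
    # Flush whatever is left once the stream ends.
    if buffer:
        yield ("reasoning" if in_reasoning else "speech"), buffer


def speech_only(chunks: Iterable[str]) -> str:
    """Join only the text that should be handed to TTS."""
    return "".join(text for bucket, text in split_stream(chunks) if bucket == "speech")
```

In a live pipeline you would pass each "speech" piece to the synthesizer as it arrives instead of joining at the end, and send the "reasoning" pieces to a log or drop them.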
pugio|2 months ago
artur44|2 months ago