This can still happen even with thinking models as long as the model outputs tokens in a sequence. Only way to fix would be to allow it to restart its response or switch to diffusion.
I think this poster is suggesting that, rather than "thinking" (messages emitted for oneself as audience) as a discrete step taken before "responding", the model should be trained to, during the response, tag certain sections with tokens indicating that the following token-stream until the matching tag is meant to be visibility-hidden from the client.
Less "independent work before coming to the meeting", more "mumbling quietly to oneself at the blackboard."
You could throw the output into a cleansing, "nonthinking" LLM, removing the steering tokens and formatting the response in a more natural way. Diffusion models are otherwise certainly a very interesting field of research.
It's an artifact of post-training approach. Models like kimi k2 and gpt-oss do not utter such phrases and are quite happy to start sentences with "No" or something to the tune of "Wrong".
Diffusion also won't help the way you seem to think it will (that the outputs occur in a sequence is not relevant, what's relevant is the underlying computation class backing each token output, and there, diffusion as typically done does not improve on things. The argument is subtle but the key is that output dimension and iterations in diffusion do not scale arbitrarily large as a result of problem complexity).
derefr|5 months ago
Less "independent work before coming to the meeting", more "mumbling quietly to oneself at the blackboard."
adastra22|5 months ago
poly2it|5 months ago
Vetch|5 months ago
Diffusion also won't help the way you seem to think it will (that the outputs occur in a sequence is not relevant, what's relevant is the underlying computation class backing each token output, and there, diffusion as typically done does not improve on things. The argument is subtle but the key is that output dimension and iterations in diffusion do not scale arbitrarily large as a result of problem complexity).