chaeronanaut | 11 months ago
This is false: reasoning models are rewarded/punished based on performance at verifiable tasks, not human feedback or next-token prediction.
Xelynega | 11 months ago
What does CoT add that enables the reward/punishment?
Jensson | 11 months ago
And you really want to train on tasks with specific answers, since then it's easy to tell whether the AI was right or wrong. So for now, hidden CoT is the only working way to train them for accuracy.
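The training signal being described can be sketched as a simple outcome check (a hypothetical illustration, not any lab's actual pipeline): the chain of thought itself is never graded; only the final answer is compared against a known ground truth, which is what makes the task "verifiable".

```python
def extract_final_answer(model_output: str) -> str:
    """Take the text after the last 'Answer:' marker.
    The marker format here is a hypothetical convention."""
    return model_output.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches, else 0.0.
    Everything before the marker (the CoT) gets no credit either way."""
    return 1.0 if extract_final_answer(model_output) == ground_truth else 0.0

output = "Let me check: 17 * 3 = 51. Answer: 51"
print(verifiable_reward(output, "51"))  # 1.0
```

Because the reward depends only on the checkable answer, a wrong final answer scores 0.0 no matter how plausible the intermediate reasoning looks.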