utdiscant|1 year ago
"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."
Looking at the chain of thought (CoT) in the examples, you can see that the model uses different CoT strategies depending on the problem it is trying to solve.
persedes|1 year ago
nmca|1 year ago
unknown|1 year ago
[deleted]
mountainriver|1 year ago
qudat|1 year ago
Based on some quick searching, it seems they are using RL to give positive or negative feedback on which "paths" to choose when performing CoT.
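To make the "feedback on paths" idea concrete, here is a toy sketch, not o1's actual training setup (which isn't public in this thread). It treats each reasoning strategy as an arm in a contextual bandit and nudges a softmax policy with +1/-1 rewards, REINFORCE style. The strategy names, problem types, and success rates are all made up for illustration.

```python
import math
import random

# Hypothetical strategy labels; nothing here comes from o1 itself.
STRATEGIES = ["decompose", "guess_and_check", "work_backwards"]

# Made-up probability that each strategy solves each problem type.
SUCCESS_RATE = {
    "math": {"decompose": 0.9, "guess_and_check": 0.2, "work_backwards": 0.3},
    "cipher": {"decompose": 0.1, "guess_and_check": 0.9, "work_backwards": 0.3},
}


def softmax(prefs):
    """Turn raw preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=5000, lr=0.1, seed=0):
    """Learn, per problem type, which strategy tends to get rewarded."""
    rng = random.Random(seed)
    tasks = list(SUCCESS_RATE)
    prefs = {task: [0.0] * len(STRATEGIES) for task in tasks}

    for _ in range(steps):
        task = rng.choice(tasks)
        probs = softmax(prefs[task])
        choice = rng.choices(range(len(STRATEGIES)), weights=probs)[0]

        # Positive feedback if the chosen "path" worked, negative otherwise.
        solved = rng.random() < SUCCESS_RATE[task][STRATEGIES[choice]]
        reward = 1.0 if solved else -1.0

        # REINFORCE update: d log pi(choice) / d pref_i = 1[i == choice] - probs[i]
        for i in range(len(STRATEGIES)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            prefs[task][i] += lr * reward * grad

    # Report the most preferred strategy for each problem type.
    return {
        task: STRATEGIES[max(range(len(STRATEGIES)), key=lambda i: p[i])]
        for task, p in prefs.items()
    }


if __name__ == "__main__":
    print(train())
```

With these made-up success rates, the policy should end up preferring a different strategy for each problem type, which is roughly the behavior described upthread. A real system would be scoring whole generated reasoning traces rather than picking from three fixed labels.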
danielmarkbruce|1 year ago
diedyesterday|1 year ago