top | item 41524052

utdiscant | 1 year ago

Feels like a lot of commenters here miss the difference between just doing chain-of-thought prompting, and what is happening here, which is learning a good chain of thought strategy using reinforcement learning.

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."

When looking at the chain-of-thought (CoT) traces in the examples, you can see that the model employs different CoT strategies depending on which problem it is trying to solve.
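As a toy illustration of "learning a strategy with RL" (my own sketch, not OpenAI's actual setup; the strategy names, problem types, and reward values are all invented), you can treat the choice of CoT strategy as a contextual bandit: try strategies, reward the ones that produce correct answers for a given problem type, and shift toward them.

```python
import random

# Toy sketch: epsilon-greedy bandit over invented CoT "strategies".
# The reward stands in for "did the final answer check out".
STRATEGIES = ["decompose", "work_backwards", "case_split"]

# Hypothetical ground truth: which strategy actually suits each problem type.
BEST = {"algebra": "work_backwards", "logic": "case_split"}

def solve(problem_type, strategy):
    # Stand-in for running the model with a chosen strategy:
    # high reward if the strategy fits the problem, low otherwise.
    return 1.0 if strategy == BEST[problem_type] else 0.1

def train(episodes=2000, eps=0.1, lr=0.1, seed=0):
    rng = random.Random(seed)
    # Estimated value of each (problem type, strategy) pair.
    q = {(p, s): 0.0 for p in BEST for s in STRATEGIES}
    for _ in range(episodes):
        p = rng.choice(list(BEST))
        if rng.random() < eps:                     # explore
            s = rng.choice(STRATEGIES)
        else:                                      # exploit current estimate
            s = max(STRATEGIES, key=lambda s: q[(p, s)])
        r = solve(p, s)                            # correctness reward
        q[(p, s)] += lr * (r - q[(p, s)])          # incremental value update
    return q

q = train()
for p in BEST:
    print(p, "->", max(STRATEGIES, key=lambda s: q[(p, s)]))
```

After training, the preferred strategy per problem type matches the one that earns reward, which is the bandit-flavored version of "refining the strategies it uses".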

persedes | 1 year ago

I'd be curious how this compares against "regular" CoT experiments. E.g., were the gpt4o results zero-shot, or was it asked to explain its solution step by step?

nmca | 1 year ago

It was asked to explain step by step.

mountainriver | 1 year ago

It’s basically a scaled Tree of Thoughts
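For reference, the core Tree of Thoughts idea can be sketched in a few lines (a toy beam search of my own, not anything from o1: the `expand`/`score` functions and the digit-sum task are invented for illustration): generate several candidate "thoughts" per step, score each partial path, and keep only the most promising branches.

```python
# Toy sketch of Tree-of-Thoughts-style search: expand candidate thoughts,
# evaluate partial paths, prune to a beam of the best ones, repeat.
def tree_of_thoughts(start, expand, score, beam_width=2, depth=3):
    frontier = [start]
    for _ in range(depth):
        candidates = [t for node in frontier for t in expand(node)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]         # prune weak branches
    return frontier[0]

# Hypothetical demo task: build a sequence of digits whose sum is near 10.
target = 10
expand = lambda seq: [seq + [d] for d in range(1, 6)]
score = lambda seq: -abs(target - sum(seq))        # closer to 10 scores higher

best = tree_of_thoughts([], expand, score, beam_width=3, depth=3)
print(best, sum(best))
```

With a language model, `expand` would be "sample a few next reasoning steps" and `score` would be a learned or prompted evaluator; the search skeleton stays the same.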

qudat | 1 year ago

In the original CoT research paper, they discuss training models on formal languages instead of just natural ones. I'm guessing this is one piece of how the model learns tree-like reasoning.

Based on some quick searching, it seems like they are using RL to provide positive/negative feedback on which "paths" to choose when performing CoT.
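If that guess is right, the simplest version of "feedback on paths" is REINFORCE over branch choices. A minimal sketch (entirely my toy illustration; the two-step/two-branch structure and the reward are invented, and nothing about o1's internals is documented here): reward a full reasoning path only if it reaches the right answer, and nudge up the probability of the choices along rewarded paths.

```python
import math, random

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def train(episodes=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [[0.0, 0.0], [0.0, 0.0]]    # two reasoning steps, two branches each
    for _ in range(episodes):
        # Sample a path through the (toy) reasoning tree.
        path = []
        for step in range(2):
            p = softmax(logits[step])
            path.append(0 if rng.random() < p[0] else 1)
        # Positive feedback only when the path reaches the correct answer.
        reward = 1.0 if path == [1, 0] else 0.0
        # REINFORCE update: push probability toward choices on rewarded paths.
        for step, choice in enumerate(path):
            p = softmax(logits[step])
            for b in range(2):
                grad = (1.0 if b == choice else 0.0) - p[b]
                logits[step][b] += lr * reward * grad
    return logits

logits = train()
learned = [max(range(2), key=lambda b: logits[s][b]) for s in range(2)]
print("learned path:", learned)
```

In a real system the "branches" would be sampled chain-of-thought continuations and the reward would come from answer checking or a reward model, but the credit-assignment shape is the same.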

danielmarkbruce | 1 year ago

This seems most likely, with some special tokens thrown in to kick off different streams of thought.

diedyesterday | 1 year ago

Reminds me of how Google's AlphaGo learned to play the best Go ever seen; this seems like something of a generalization of that.