top | item 42824793

jedharris | 1 year ago

See also independent RL-based reasoning results, fully open source: https://hkust-nlp.notion.site/simplerl-reason

Very small training set!

"we replicate the DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data. We show that long Chain-of-Thought (CoT) and self-reflection can emerge on a 7B model with only 8K MATH examples, and we achieve surprisingly strong results on complex mathematical reasoning. Importantly, we fully open-source our training code and details to the community to inspire more works on reasoning."
