a1j9o94 | 1 year ago

Probably not the whole model, but the first step was "fine-tuning" the base model on ~800 chain-of-thought examples.

Those were probably from OpenAI models. Then they used reinforcement learning to expand the reasoning capabilities.

mkl | 1 year ago

800k. They say those came from earlier versions of their own models, with a lot of bad examples rejected. They don't seem to say which models produced the "thousands of cold-start" examples used earlier in the process, though.