hodapp | 1 year ago
You are right; the advances in DeepSeek-R1 used RL almost solely because of the chain-of-thought sequences they were generating and training it on.