top | item 42992201

(no title)

Chio | 1 year ago

We kind-of have that in DeepSeek-R1-zero [1], but it has problem. From the original authors:

> With RL, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing.

A lot of these we can probably solve, but as other have pointed out we want a model that humans can converse with, not an AI for the purpose of other AI.

That said, it seems like a promising area of research:

> DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community.

[1] https://github.com/deepseek-ai/DeepSeek-R1

discuss

HarHarVeryFunny|1 year ago

Despite the similar "zero" names, DeepSeek-R1 Zero and AlphaGo Zero have nothing in common.

AlphaGo came before AlphaGo Zero; it was trained on human games, then improved further via self-play. The later AlphaGo Zero proved that pre-training on human games was not necessary, and the model could learn from scratch (i.e. from zero) just via self-play.

For DeepSeek-R1, or any reasoning model, training data is necessary, but hard to come by. One of the main contributions of the DeepSeek-R1 paper was describing their "bootstrapping" (my term) process whereby they started with a non-reasoning model, DeepSeek-V3, and used a three step process to generate more and more reasoning data from that (+ a few other sources) until they had enough to train DeepSeek-R1, which they then further improved with RL.

DeepSeek-R1 Zero isn't a self-play version of DeepSeek-R1 - it was just the result of the first (0th) step of this bootstrapping process whereby they used RL to finetune DeepSeek-V3 into the (somewhat of an idiot savant - one trick pony) R1 Zero model that was then capable of generating training data for the next bootstrapping step.

antirez|1 year ago

That's not what happened. R1-Zero is a model per se, released with a different set of weights. Also it's not an intermediate step obtained making R1. In R1, a first SFT was performed before the RL training. While R1-Zero performed ONLY the RL training (on top of the raw V3).

Of course it's hard to argue that R1-Zero and AlphaZero are very similar, since in the case of AlfaZero (I'm referring to the chess model, not Go) only the rules were known to the model, and no human game was shown, while here:

1. The base model is V3, that saw a lot of thigs in pre-training.

2. The RL for the chain of thought has as target math problems that are annotated with the right result. This can be seen as somewhat similar to the chess game finishing with a positive, negative, or draw result. But still... it's text with a problem description.

However the similarity is that in the RL used for R1-Zero, the chain of thought to improve problem solving is learned starting cold, without showing the model any CoT to fine tune on it. However the model could sample from the V3 latent space itself that was full of CoT examples of humans, other LLMs, ...