(no title)
mdda | 1 year ago
One difference from the initial step is that the second time around includes the initial step and the aha comment in the context : It is, after all, just doing LLM token-wise prediction.
OTOH, the RL process means that it has potentially learned the impact of statements that it makes on the success of future generation. This self-direction makes it go somewhat beyond vanilla-LLM pattern mimicry IMHO.
No comments yet.