Calling it now - RL finally "just works" for any domain where answers are easily verifiable. Verifiability was always a prerequisite, but the difference from prior generations (not just AlphaGo, but any nontrivial RL process prior to roughly mid-2024) is that the reasoning traces and/or intermediate steps can be open-ended with potentially infinite branching, no clear notion of "steps" or nodes and edges in the game tree, and a wide range of equally valid solutions. As long as the quality of the end result can be evaluated cleanly, LLM-based RL is good to go. As a corollary, once you add in self-play with random variation, the synthetic data problem is solved for coding, math, and some classes of scientific reasoning. No more mode collapse, no more massive teams of PhDs needed for human labeling, as long as you have a reliable metric for answer quality.
This isn't just neat, it's important - as we run out of useful human-generated data, RL scaling is the best candidate to take over where pretraining left off.
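To make "only the end result is evaluated" concrete, here is a minimal sketch of an outcome-only reward. The task (multiplying two integers), the convention that the answer sits on the last line of the trace, and the name `reward` are all assumptions for illustration, not anything from a specific RL system.

```python
def reward(trace: str, a: int, b: int) -> float:
    """Score a free-form reasoning trace by its final line only.

    Assumption: the model writes its answer as an integer on the last line.
    Everything before that line is ignored, so any reasoning path that
    ends in the right answer earns the same reward.
    """
    lines = trace.strip().splitlines()
    final = lines[-1].strip() if lines else ""
    try:
        return 1.0 if int(final) == a * b else 0.0
    except ValueError:
        # The final line is not an integer: treat it as a wrong answer.
        return 0.0


# Two very different traces, same verifiable outcome, same reward.
short_trace = "6 * 7\n42"
long_trace = "7 + 7 = 14\n14 * 3 = 42\nso the answer is below\n42"
print(reward(short_trace, 6, 7), reward(long_trace, 6, 7))
```

The verifier never looks at the intermediate steps, which is why the traces can be open-ended.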
resiros|9 months ago
I guess that's now becoming true with LLMs.
Faster LLMs -> More intelligence
UncleOxidant|9 months ago
Couldn't you say that, if you squint hard enough, GA looks like a category of RL? There are certainly a lot of similarities, the main difference being how each new population of solutions is generated. I would not be at all surprised if they're using a GA/RL hybrid.
vjerancrnjak|9 months ago
If variety is what you're after, why not use beam search with a good population statistic?
yorwba|9 months ago
skybrian|9 months ago
They are having some success in making it work internally. Maybe only the team that built it can get it to work? But it does seem promising.
unignorant|9 months ago
As far as I can read, the weights of the LLM are not modified. They do some kind of candidate selection via evolutionary algorithms for the LLM prompt, which the LLM then remixes. This process then iterates like a typical evolutionary algorithm.
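A rough sketch of the loop described above, under loud assumptions: `remix` is a hypothetical stand-in for the LLM call (here it just recombines two parents and nudges one value), `evaluate` is a toy scoring function with a made-up target, and candidates are short integer lists rather than prompts or programs. No weights are updated anywhere; only the population changes.

```python
import random


def evaluate(candidate):
    """Toy fitness (higher is better): negative distance to a made-up target."""
    target = [3, 1, 4, 1, 5]
    return -sum(abs(a - b) for a, b in zip(candidate, target))


def remix(parents, rng):
    """Hypothetical stand-in for the LLM: recombine two parents, perturb one slot."""
    a, b = rng.sample(parents, 2)
    child = [rng.choice(pair) for pair in zip(a, b)]
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child


def evolve(generations=200, pop_size=20, keep=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 9) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the top-scoring candidates unchanged (elitism).
        parents = sorted(population, key=evaluate, reverse=True)[:keep]
        # Variation: fill the rest of the population with remixed children.
        children = [remix(parents, rng) for _ in range(pop_size - keep)]
        population = parents + children
    return max(population, key=evaluate)


best = evolve()
print(best, evaluate(best))
```

Because the best candidates are carried over unchanged each generation, the top score can never get worse from one generation to the next.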
modeless|9 months ago
smattiso|9 months ago
vrm|9 months ago
4b11b4|9 months ago
I suppose you could consider that last part (optimizing some metric) "RL".
However, it's missing a key concept of RL which is the exploration/exploitation tradeoff.
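For anyone unfamiliar with the tradeoff, epsilon-greedy action selection is about the simplest way to show it. This is a generic textbook sketch, not a claim about what any particular system does; `estimates` is an assumed list of current value estimates, one per action.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Pick an action index from a list of value estimates.

    With probability epsilon, explore: choose any action uniformly at random.
    Otherwise, exploit: choose the action with the highest current estimate.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])


rng = random.Random(0)
estimates = [0.2, 0.9, 0.5]
# epsilon=0.0 never explores, so it always returns the current best action.
print(epsilon_greedy(estimates, 0.0, rng))
```

A pure "keep the top scorers" loop behaves like `epsilon=0.0` at selection time; whatever exploration it has must come from how new candidates are generated.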
TechDebtDevin|9 months ago
There are monopolies on the coolest sets of data in almost all industries. All the RL in the world won't do us any good if the companies hoarding that data only use it to forecast outcomes that will make them more money, rather than to work out what could be done to better society.
spyckie2|9 months ago
obsolete_wagie|9 months ago