This is impressive, but what prevents the blue agent from generating an incorrect proof of a "true example"? What prevents the red agent from generating a correct disproof of a "false example"? I'm curious how they managed to generate a truly unlimited source of correctly labeled examples.
HanClinto|1 year ago
That's the role of the Verifier. It's not going to be perfect, and I'm sure some incorrect proofs of true examples slip through, but it's good enough to increase the quality of the model overall.
> "What prevents the red agent from generating a correct disproof of a 'false example'?"
And on the other side, it's counterbalanced by the rules engine (math), which can determine with certainty whether the final answer is correct.
The Red and Blue agents are held in check by the tension between the math engine and the verifier, and they are free to fight back and forth within those constraints for as long as they can. Eventually, I think the Red agent loses the ability to attack effectively, and that's the big limit on OpenAI's arrangement. This particular game isn't balanced enough for the training loop to continue indefinitely.
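The check-and-balance structure above can be sketched as a toy loop. Everything here is made up for illustration (the even/odd "math" task, the 5% verifier error rate, all function names); it's not OpenAI's actual setup, just the shape of the argument: an imperfect verifier can be fooled occasionally, but the ground-truth rules engine keeps the scoring honest.

```python
import random

random.seed(0)

def ground_truth(n):
    """Rules engine stand-in: decides absolutely whether n is 'true' (here: even)."""
    return n % 2 == 0

def verifier(claim, n):
    """Imperfect verifier: usually right, but fooled ~5% of the time."""
    correct = (claim == ground_truth(n))
    return correct if random.random() > 0.05 else not correct

def training_round(n):
    # Blue asserts the example is true; Red asserts it is false.
    blue_accepted = verifier(True, n)
    red_accepted = verifier(False, n)
    # A win only counts if the rules engine agrees at the end,
    # so verifier mistakes can't compound into mislabeled data.
    blue_win = blue_accepted and ground_truth(n)
    red_win = red_accepted and not ground_truth(n)
    return blue_win, red_win

wins = [training_round(n) for n in range(1000)]
blue_total = sum(b for b, _ in wins)
red_total = sum(r for _, r in wins)
```

The point of the sketch: even with a fallible verifier, the final `ground_truth` check bounds how much bad data survives, which is why some slip-through is tolerable.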