top | item 45122227

reasonableklout | 5 months ago

One example: when we want to improve performance on a task that can be automatically verified, we can often generate synthetic training data by having the current, imperfect models attempt the task many times, then picking out the first attempt that works. For instance, given a programming problem, we might write a program skeleton and unit tests for the expected behavior. GPT-5 might take 100 attempts to produce a working program; the hope is that GPT-6, trained on the working attempt, would take far fewer attempts to solve similar problems.
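A minimal sketch of that loop, with a deterministic stand-in for the model (every name here — `sample_candidates`, `passes_tests`, `first_working_attempt` — is hypothetical, and the "model" just emits canned buggy/correct drafts to keep the example self-contained):

```python
def sample_candidates(n_attempts):
    # Stand-in for sampling from an imperfect model: mostly buggy drafts,
    # with a correct one appearing deterministically every 7th draw.
    buggy = "def solve(xs):\n    return sum(xs) + 1\n"
    correct = "def solve(xs):\n    return sum(xs)\n"
    for i in range(1, n_attempts + 1):
        yield correct if i % 7 == 0 else buggy

def passes_tests(src):
    # The unit tests play the role of the automatic verifier.
    ns = {}
    try:
        exec(src, ns)
        solve = ns["solve"]
        return solve([1, 2, 3]) == 6 and solve([]) == 0
    except Exception:
        return False

def first_working_attempt(n_attempts=100):
    # Rejection sampling: keep the first candidate that passes the tests.
    # The surviving (problem, src) pair becomes synthetic training data.
    for i, src in enumerate(sample_candidates(n_attempts), start=1):
        if passes_tests(src):
            return i, src
    return None  # all attempts failed; generate no data for this problem

result = first_working_attempt()
```

In a real pipeline the verifier is the expensive-but-trusted part; the point is that you only need it at data-generation time, not at inference time.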

As you suggest, this costs lots of time and compute. But it's produced breakthroughs in the past (see AlphaGo Zero self-play) and is now supposedly a standard part of model post-training at the big labs.
