(no title)
why_only_15 | 12 days ago
The point I was trying to make with "RL is only necessary once" is that you can embark on a single self-play loop getting better and better, and this will get you to something close to the frontier. Once you're at the frontier, the frontier doesn't move very much, so you have quite a while (decade?) where it's totally fine to distill from the RL games.
On correction histories -- imo I correctly described what they do. Cosmo was annoyed by the word "adapt" but what I described was the adaptation.
On SPSA -- you don't have a gradient! you don't do backprop! this is what i was trying to get at.
No comments yet.