markisus | 4 months ago
I don’t get why a trajectory would provide only one bit of information.
Each step of the trajectory is at least giving information about what state transitions are possible.
An infinitely long trajectory can explore the whole state space if there are no absorbing states. Such a trajectory would provide a massive amount of information about the system, even if we ignored the final reward.
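To make the point about transitions concrete, here is a minimal sketch that tallies which state transitions a single trajectory reveals, without looking at any reward. The three states and the path are made up for illustration, and `observed_transitions` is a hypothetical helper, not something from the blog post.

```python
from collections import Counter

def observed_transitions(trajectory):
    """Count each (state, next_state) pair seen along one trajectory."""
    return Counter(zip(trajectory, trajectory[1:]))

# Hypothetical trajectory over three states; no reward is used anywhere.
trajectory = ["A", "B", "C", "A", "B", "A"]
counts = observed_transitions(trajectory)

# Every distinct key is a transition we now know to be possible.
for (state, next_state), n in sorted(counts.items()):
    print(f"{state} -> {next_state}: seen {n} time(s)")
```

Whether a learner actually uses these observations depends on the algorithm. A model-based method could, while a pure outcome-reward policy gradient update would not use them directly.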
navar | 4 months ago
This is in contrast to more "supervised" forms of learning, where you get a loss for each token produced (e.g. cross-entropy loss) and, as a consequence, O(number of tokens) information in your gradients.
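A rough sketch of that contrast, assuming the reward is a single binary success/failure signal per trajectory (my reading of the claim, not something confirmed here). The probabilities are made up and the function names are hypothetical.

```python
import math

def per_token_cross_entropy(probs_of_target):
    """One loss value per token: -log p(target token)."""
    return [-math.log(p) for p in probs_of_target]

def outcome_only_signal(reward):
    """A single scalar for the whole trajectory, however long it is."""
    return [reward]

def binary_entropy_bits(p_success):
    """Entropy in bits of a success/failure outcome with P(success) = p_success."""
    if p_success in (0.0, 1.0):
        return 0.0
    q = 1.0 - p_success
    return -(p_success * math.log2(p_success) + q * math.log2(q))

# Hypothetical probabilities the model assigned to the correct token at each step.
probs = [0.9, 0.5, 0.25, 0.8]
supervised = per_token_cross_entropy(probs)  # one signal per token
rl = outcome_only_signal(1.0)                # one signal per trajectory

print(len(supervised), len(rl))
print(binary_entropy_bits(0.5))  # the maximum for a binary outcome
```

The number of supervised signals grows with sequence length, while the outcome signal stays at one scalar whose entropy is capped at 1 bit when it is binary. This only counts feedback signals; it does not measure how much a given algorithm actually learns from them.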
mountainriver | 4 months ago
I’m still not fully convinced of the 1-bit claim; they made other mistakes in the blog post.