I feel like both this comment and the parent comment highlight how RL has recently been going through a cycle of misunderstanding, brought on by another of its popularity booms now that it is being used to train LLMs.
While collecting data according to a policy is part of RL, 'reductive' is an understatement. It's like saying algebra is all about scalar products. Well yes, that's about 1% of it.
mistercheph|3 months ago
mountainriver|3 months ago
They also force exploration as a part of the algorithm.
They can be used for synthetic data generation once the reward model is good enough.
phyalow|3 months ago
singularity2001|3 months ago