jimfleming | 7 years ago
1. Sample actions from a random policy distribution.
2. Fit an inverse model with supervised learning on this data. An inverse model learns to map the current observation and the next observation to the action that produced the transition: f(s_t, s_{t+1}) -> a_t
3. Use reinforcement learning to fit a policy that steers the next observation toward a goal: p(s_t) -> s_{t+1}
4. Use new data from attempts with the policy and inverse model working together to continue training the inverse model.
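Steps 1 and 2 can be sketched in a few lines. This is a toy illustration under assumptions not in the thread: a made-up 1D environment where the next state is the current state plus the action, and a linear inverse model fit by least squares instead of a neural network.

```python
# Minimal sketch of steps 1-2: collect transitions with a random policy
# ("motor babbling"), then fit an inverse model f(s_t, s_{t+1}) -> a_t.
# The environment and linear model here are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Toy dynamics (assumed): the next state is the current state plus the action.
def step(s, a):
    return s + a

# 1. Sample actions from a random policy and record transitions.
states = rng.normal(size=(1000, 1))
actions = rng.uniform(-1.0, 1.0, size=(1000, 1))
next_states = step(states, actions)

# 2. Fit a linear inverse model a_hat = [s_t, s_{t+1}] @ W by least squares.
X = np.hstack([states, next_states])
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

# In this toy setting the true mapping is a_t = s_{t+1} - s_t,
# so the fitted weights should be close to [-1, 1].
print(W.flatten())
```

In a real setting f would be a neural network trained on pixels, and step 4 would append transitions gathered by the learned policy to this dataset before refitting.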
Motor babbling is a quick way of generating data but it isn't particularly efficient. The problem with taking random actions is that most of your data is going to cover parts of the state space that aren't important for the task. The addition of the policy allows biasing future attempts towards more useful areas of the state space to continue training the inverse model.
For more examples of these ideas, this paper [1] also combines a forward and an inverse model to improve sample efficiency.
chroem- | 7 years ago