jimfleming | 7 years ago
1. Sample actions from a random policy distribution.
2. Fit an inverse model with supervised learning on this data. An inverse model learns to map the current observation and the next observation to the action that produced the transition: f(s_t, s_{t+1}) -> a_t
3. Use reinforcement learning to fit a policy that steers the next observation toward a goal: p(s_t) -> s_{t+1}
4. Use new data from attempts with the policy and inverse model working together to continue training the inverse model.
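Steps 1 and 2 can be sketched in a few lines. This is a toy illustration under assumptions not in the thread: a made-up 1D environment where the next state is the current state plus the action, and a linear inverse model fit by least squares instead of a neural network.

```python
# Minimal sketch of steps 1-2: collect transitions with a random policy
# ("motor babbling"), then fit an inverse model f(s_t, s_{t+1}) -> a_t.
# The environment and linear model here are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Toy dynamics (assumed): the next state is the current state plus the action.
def step(s, a):
    return s + a

# 1. Sample actions from a random policy and record transitions.
states = rng.normal(size=(1000, 1))
actions = rng.uniform(-1.0, 1.0, size=(1000, 1))
next_states = step(states, actions)

# 2. Fit a linear inverse model a_hat = [s_t, s_{t+1}] @ W by least squares.
X = np.hstack([states, next_states])
W, *_ = np.linalg.lstsq(X, actions, rcond=None)

# In this toy setting the true mapping is a_t = s_{t+1} - s_t,
# so the fitted weights should be close to [-1, 1].
print(W.flatten())
```

In a real setting f would be a neural network trained on pixels, and step 4 would append transitions gathered by the learned policy to this dataset before refitting.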
Motor babbling is a quick way of generating data but it isn't particularly efficient. The problem with taking random actions is that most of your data is going to cover parts of the state space that aren't important for the task. The addition of the policy allows biasing future attempts towards more useful areas of the state space to continue training the inverse model.
For more examples of these ideas, this paper [1] also combines a forward and an inverse model to improve sample efficiency.
chroem- | 7 years ago