iofj | 9 years ago

The paper talks about this. Answer: no, it isn't, but it's partly a matter of definition. A reinforcement learning task discounts future actions; this doesn't. Also, the training is what many people would call a supervised learning task:

It works by training two functions, which are very similar to Deep Q-learning functions: f(s, c, a), where s = start image (before the grasp attempt, robot in zero position), c = current image, and a = possible robot command, predicts the odds that a will lead to a successful grasp. And while f could work by itself (e.g. generate 10k random commands, evaluate f on each, pick the one with the highest odds of success), it doesn't: there's also a g(c) which tries to predict which commands f will score highest.
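To make the f/g split concrete, here's a toy sketch (my own illustration, not the paper's code — the real f and g are deep CNNs over images, stubbed here with trivial scoring functions over 2-D commands):

```python
import random

def f(start_image, current_image, command):
    """Stub for f(s, c, a): predicted probability that executing `command`
    leads to a successful grasp (toy: commands near (0, 0) score highest)."""
    dx, dy = command
    return 1.0 / (1.0 + dx * dx + dy * dy)

def g(current_image, n_candidates=16):
    """Stub for g(c): proposes commands that f is likely to score highly
    (toy: samples near the origin instead of uniformly over the space)."""
    return [(random.gauss(0, 0.1), random.gauss(0, 0.1))
            for _ in range(n_candidates)]

def pick_command_brute_force(s, c, n=10_000):
    # f alone: evaluate many random commands, keep the best-scoring one.
    candidates = [(random.uniform(-1, 1), random.uniform(-1, 1))
                  for _ in range(n)]
    return max(candidates, key=lambda a: f(s, c, a))

def pick_command_with_g(s, c):
    # With g, f only has to score a handful of promising candidates.
    return max(g(c), key=lambda a: f(s, c, a))
```

The brute-force version scores thousands of random commands; the g-guided version gets a comparable result from a few dozen f evaluations, which is the whole point of training g.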

You might wonder how it decides it has grasped something. When doing nothing is >90% likely to result in a successful grasp, it considers the grasp attempt finished (I assume there's also a time limit involved). There's also some method to trigger an abort.
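That stopping rule can be sketched as follows (again my own illustration: the 0.9 threshold is the ">90%" above, the step budget is my assumed time limit, and f/propose/execute are passed in as stand-ins for the network, the command picker, and the robot):

```python
def run_grasp_attempt(f, propose, execute, s, max_steps=50, threshold=0.9):
    """Run commands until 'doing nothing' is itself predicted to succeed
    with > threshold probability, or the step budget runs out."""
    c = s  # current image starts as the start image
    for step in range(max_steps):
        # Null command (0, 0) = do nothing; if that already looks like a
        # success, the object is presumably in the gripper -- finish.
        if f(s, c, (0.0, 0.0)) > threshold:
            return "close_gripper", step
        a = propose(s, c)      # pick the next command (e.g. via g then f)
        c = execute(c, a)      # hypothetical robot/simulator step
    return "timeout", max_steps
```

A quick toy run, where the "state" is just a distance that each step shrinks:

```python
def toy_f(s, c, a): return 1.0 / (1.0 + c)   # closer = more confident
def toy_propose(s, c): return 1.0            # always step by 1
def toy_execute(c, a): return max(0.0, c - a)

outcome, steps = run_grasp_attempt(toy_f, toy_propose, toy_execute, s=5.0)
# outcome == "close_gripper" once toy_f's confidence crosses 0.9
```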

What I find particularly weird (although very much as in the Deep Q-learning papers) is that g essentially tries to predict f, and is mostly an optimization. You couldn't easily train g directly without ridiculous amounts of supervision (you'd need example successful grasps, potentially a lot of them). But to train f, you only need a few successful grasps, which you can just make happen by having a dumb policy that works 1 in 1000 times. Evaluating f exhaustively is a huge problem (too many inputs would need to be tried), so you couldn't use that function without g. And training g is something that can happen quickly offline.
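The "dumb policy" bootstrapping step can be sketched like this (my toy illustration: random commands succeed only rarely, but every attempt, success or failure, becomes a labeled training example for f):

```python
import random

def dumb_policy():
    """Purely random command -- no skill required."""
    return (random.uniform(-1, 1), random.uniform(-1, 1))

def simulate_attempt(a):
    """Toy world: the grasp succeeds only when the command lands very close
    to the target, so random actions work roughly 1-in-100 times."""
    dx, dy = a
    return dx * dx + dy * dy < 0.01

def collect_dataset(n_attempts=5000):
    # Every attempt yields a (command, success) pair; the rare successes
    # become the positive labels f is trained on as an ordinary classifier.
    data = []
    for _ in range(n_attempts):
        a = dumb_policy()
        label = simulate_attempt(a)   # the sparse success signal
        data.append((a, label))
    return data
```

The resulting dataset is heavily imbalanced (mostly failures), but it's still plain supervised data, which is exactly the point the next paragraph makes.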

The brilliance is that this method takes a success signal and generates supervised learning samples. Those are the easiest things you could possibly train on, and so very useful.

Combining f and g (presumably on a decent GPU), they can get a 90% grasp success rate on what looks like a motley collection of cutlery, pencils, a stapler, a nightlight, a few pens, (?) some lipstick, and a few small boxes. But this is an "online" method, so it should get better at it over time by itself. (God help the rest of us if Google decided to release their data before training fully converged because it takes too long.)
