top | item 40055127

ewjt | 1 year ago

This is not preprogrammed replay. Replay would not be able to handle even tiny variations in the shirt's starting position.

lyapunova | 1 year ago

So, a couple things here.

It is true that replay in the world frame will not handle changes to the shirt's initial position. But if the commands are expressed in the frame of the end-effector and the data is object-centric, replay will generalize somewhat. (Please also consider that you are watching the videos that survived the "should I upload this?" filter.)
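To make the frame distinction concrete, here's a minimal sketch of object-centric replay in 2D. All the names (`replay_object_centric`, `demo_waypoints`, the pose format) are made up for illustration, not from any real system; the point is just that re-expressing a recorded trajectory relative to the object's pose lets a pure replay track a shifted object, where world-frame replay would not.

```python
import numpy as np

def make_transform(xy, theta):
    """2D homogeneous transform from a translation and a rotation angle."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, xy[0]],
                     [s,  c, xy[1]],
                     [0.0, 0.0, 1.0]])

def replay_object_centric(demo_waypoints, demo_object_pose, new_object_pose):
    """Re-express demo waypoints in the object's frame, then map them into
    the new scene. World-frame replay would ignore new_object_pose entirely."""
    T_demo = make_transform(*demo_object_pose)
    T_new = make_transform(*new_object_pose)
    # world -> demo-object frame, then demo-object frame -> new world pose
    T = T_new @ np.linalg.inv(T_demo)
    pts = np.hstack([demo_waypoints, np.ones((len(demo_waypoints), 1))])
    return (pts @ T.T)[:, :2]

# A demo recorded with the shirt at the origin, unrotated...
demo = np.array([[0.1, 0.0], [0.2, 0.1]])
# ...replayed with the shirt shifted to (0.5, 0.3):
shifted = replay_object_centric(demo, ((0.0, 0.0), 0.0), ((0.5, 0.3), 0.0))
# shifted == [[0.6, 0.3], [0.7, 0.4]]
```

This is why tiny variations in the shirt's position don't by themselves rule out replay, while large variations (folds, tangles, occlusion) still do.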

The second thing is that large-scale behavior cloning (the technique used here) is essentially replay with a little smoothing. Not inherently bad, just a fact.
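One way to see the "replay with smoothing" framing: the simplest possible behavior-cloned policy is a k-nearest-neighbor regressor over the demonstrations, which literally replays the actions from the closest recorded states, averaged. This toy sketch (synthetic data, made-up names) is not what the paper's network does, but it is the degenerate case the argument has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "demonstrations": recorded observations and the actions taken.
obs_demo = rng.uniform(-1, 1, size=(500, 4))
act_demo = obs_demo @ rng.normal(size=(4, 2))

def bc_policy(obs, k=5):
    """Return the mean action of the k demonstrated states closest to obs:
    replay of the nearest demonstrations, smoothed by averaging."""
    dists = np.linalg.norm(obs_demo - obs, axis=1)
    nearest = np.argsort(dists)[:k]
    return act_demo[nearest].mean(axis=0)

# On a state it has seen, k=1 reduces to exact replay of that demo action.
action = bc_policy(np.zeros(4))
```

A neural network trained with an L2 loss on the same data interpolates between demonstrations in a loosely analogous way; the open question is how far beyond the demonstrated states that interpolation usefully extends.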

My point is that there was an academic contribution back when the first ALOHA paper came out and showed that BC on low-cost hardware could work, but this is something like the fourth paper in a row of roughly the same stuff.

Since this is YC, I'll add: as an academic (physics) turned investor, I would like to see more focus on systems engineering and first-principles thinking, and less PR for the sake of PR. I love robotics and really want to see this stuff take off, but for the right reasons.

modeless | 1 year ago

> large-scale behavior cloning (which is the technique used here), is essentially replay with a little smoothing

A definition of "replay" that involves extensive correction based on perception in the loop is really stretching it. But let me take your argument at face value. This is essentially the same argument that people use to dismiss GPT-4 as "just" a stochastic parrot. Two things about this:

One, like GPT-4, replay with generalization based on perception can be exceedingly useful by itself, far more so than strict replay, even if the generalization is limited.

Two, obviously this doesn't generalize as much as GPT-4. But the reason is that it doesn't have enough training data. With GPT-4-scale training data it would generalize amazingly well and be super useful. Collecting human demonstrations may not get us to GPT-4 scale, but it will be enough to bootstrap a robot useful enough to be deployed in the field. Once there is a commercially successful dexterous robot in the field, we will be able to collect orders of magnitude more data, unsupervised data collection should start to work, and robotics will fall to the bitter lesson just as vision, ASR, TTS, translation, and NLP did before.

klowrey | 1 year ago

The detail about the end-effector frame is pretty critical, as doing this BC with joint angles would not be tractable. You can tell there has been a big shift from RL approaches aiming at highly general algorithms to more recent works heavily focused on these arms/manipulators, because end-effector control enables more flashy results.

Another limiting factor is that data collection is a big problem: not only can you never be sure you've collected enough data, but they're collecting data of a human trying to do this work through a janky teleoperation rig. The behavior they're trying to clone is that of a human working poorly, which isn't a great source of data! Furthermore, limiting data collection to (typically) 10 Hz means the scene will always have to be quasi-static, and I'm not sure these huge models will speed up enough to actually learn velocity as a 'sufficient statistic' of the underlying dynamics.

Ultimately, it's been frustrating to see so much money dumped into the recent humanoid push using teleop / BC. It's going to hamper the folks actually pursuing first-principles thinking.