carlosdp|8 months ago
I'm not sure that's necessarily true for a lot of tasks.
A good way to measure this in your head is this:
"If you were given remote control of two robot arms, and just one camera to look through, how many different tasks do you think you could complete successfully?"
When you start thinking about it, you realize there are a lot of things you could do with just the arms and one camera, because you as a human have really good intuition about the world.
It therefore follows that robots should be able to learn with just RGB images too! Counterexamples would be things like grabbing an egg without crushing it, perhaps. Though I suspect even that could be done with vision alone.
moefh|8 months ago
I don't see how that follows. Humans trained by actually manipulating things, not just by looking at them. It's not at all clear that someone who had gained intuition about the world exclusively by looking at it would have any success with mechanical arms.
amelius|8 months ago
1. First create a model that can evaluate how well a task is going; the YouTube-video approach can be used here.
2. Then build a real-world robot and train it by letting it attempt tasks, using the first model to supervise it; at this stage the robot can learn to rely on extra senses such as touch/pressure. (A sketch of this loop follows below.)
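A minimal sketch of what that two-stage loop could look like, assuming hypothetical VideoTrainedEvaluator, ToyEnv, and RobotPolicy classes (none of these are a real library; the evaluator is just a stand-in for a model pretrained on YouTube video):

    import numpy as np

    class VideoTrainedEvaluator:
        """Stage 1: a frozen scoring model. In a real system this would
        be pretrained on video; here a trivial stand-in."""
        def score(self, frame):
            # Pretend brighter frames mean "task going well".
            return float(frame.mean() / 255.0)

    class ToyEnv:
        """Hypothetical robot environment with a camera and touch sensors."""
        def observe(self):
            frame = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
            touch = np.random.rand(2)  # e.g. gripper pressure readings
            return frame, touch
        def step(self, action):
            pass  # move the (imaginary) arms

    class RobotPolicy:
        """Stage 2: the policy may use vision *and* the extra senses."""
        def act(self, frame, touch):
            return np.random.uniform(-1, 1, size=4)  # placeholder action
        def update(self, obs, action, reward):
            pass  # e.g. a policy-gradient step in a real system

    env, policy, evaluator = ToyEnv(), RobotPolicy(), VideoTrainedEvaluator()
    for _ in range(10):
        frame, touch = env.observe()
        action = policy.act(frame, touch)
        env.step(action)
        next_frame, _ = env.observe()
        reward = evaluator.score(next_frame)  # model 1 supervises model 2
        policy.update((frame, touch), action, reward)

The design point is that the supervision signal comes only from vision (which is all the video-trained model can judge), while the policy itself is free to exploit touch and pressure internally.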
jpc0|8 months ago
Simple example: pick up a glass and pour its contents into a vertical hole roughly the size of your mouth. Think of all the failure modes lurking in that trivial action, one you perform many times a day. Doing the same through a single camera feed, with no other senses, would take you hours to master, and you are already a superintelligent being.
jrimbault|8 months ago
I have performed this three-second gesture, and variations of it, basically my whole life, and never noticed I was passing the glass from one hand to the other without any visual feedback.
godelski|8 months ago
You introduced knowledge not obtained through observation. In fact, the knowledge you introduced is the whole chimichanga! It is an easy mistake to make, so don't feel embarrassed.
The claim is that one can learn a world model[0] through vision. The parent countered by saying "vision is not enough." Then you countered by saying "vision is enough if you already have a world model."
[0] I'll be more precise here. You can learn *a* world model, but it isn't the one we really care about, and "a world" doesn't require being self-consistent. We could say the same about "a physics", but let's be real: when we say "physics", we know which one is being discussed...
jaisio|8 months ago
And where does this intuition come from? It was built by feeling other sensations in addition to vision. You learned how gravity pulls things down when you were a kid, how hot/cold feels, how hard/soft feels, how things smell. Your mental model of the world is substantially informed by non-visual cues.
> It therefore follows that robots should be able to learn with just RGB images too!
That does not follow at all! It's not how you learned either.
Nor did you learn to think by consuming the entirety of all text produced on the internet. By that logic, LLMs don't think either; they are just pretty good at faking the appearance of thinking.
corimaith|8 months ago
There are infinitely many 3D scenes that project to the same 2D picture (see the sketch below). And what is a scene, really? Last time I checked, raw RGB was not a good input representation in computer vision; instead, CNNs build up increasing levels of gradients/features to compose a scene. None of that translates particularly well to how an LM works with text.
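A minimal numpy illustration of that ambiguity, under an idealized pinhole camera model (the names are just for illustration): pushing each point further out along its own viewing ray changes the 3D scene but not the image, so a single RGB picture is consistent with infinitely many scenes.

    import numpy as np

    F = 1.0  # focal length of an idealized pinhole camera

    def project(points):
        """Pinhole projection: (X, Y, Z) -> (F*X/Z, F*Y/Z)."""
        return F * points[:, :2] / points[:, 2:3]

    # One 3D scene...
    scene_a = np.array([[ 0.5, 0.2, 2.0],
                        [-0.3, 0.4, 3.0]])

    # ...and a geometrically different one: each point pushed further
    # out along its own viewing ray by a different factor.
    scene_b = scene_a * np.array([[2.0], [5.0]])

    # Both scenes project to exactly the same pixels.
    assert np.allclose(project(scene_a), project(scene_b))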