top | item 20711574

tgog | 6 years ago

This completely neglects the fact that humans can build near perfect 3D representations of the world with 2D images stitched together with the parallax neural nets in our brain. This blogpost briefly mentions it in one line as a throwaway and says you'd need extremely high resolution cameras?? Doesn't make sense at all. Two cameras of any resolution spaced a regular distance apart should be able to build a better parallax 3D model than any one camera alone.
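For context, the stereo geometry being invoked here is plain triangulation: for a rectified pair, depth falls directly out of the disparity between the two images. A minimal sketch (the focal length and baseline values are invented for illustration, not from any real rig):

```python
def stereo_depth(disparity_px, focal_px=1000.0, baseline_m=0.5):
    """Triangulated depth for a rectified stereo pair: Z = f * B / d,
    with f in pixels, B in meters, d (disparity) in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A 10 px disparity at f = 1000 px, B = 0.5 m:
# 1000 * 0.5 / 10 = 50.0 m
print(stereo_depth(10))
```

The catch, and where the resolution argument enters, is that depth precision depends on how finely disparity can be measured, and disparity shrinks as objects get farther away.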

sairahul82|6 years ago

The first thing to remember is that self-driving systems don't work like our brains. If they did, we wouldn't need to train them with billions of images. So the main problem is not just building the 3D models. For example, we don't crash into a car just because we've never seen that model or that kind of vehicle before. Check https://cdn.technologyreview.com/i/images/bikeedgecasepredic... — we would never think there is a bike in front of us.

Humans do a lot more than just identifying an image or doing 3D reconstruction. We have context about the roads, we constantly predict the movement of other cars, we know how to react to the situation, and most importantly we are not fooled by simple image occlusions. Essentially we have a gigantic correlation engine that makes decisions by comprehending the different things happening on the road.

The AI algorithms we train do not work the same way we do. They depend too heavily on identifying the image. Lidar provides another signal to the system. It provides redundancy and allows the system to make the right decision. Take the linked image above as an example.

We may not need lidar once the technology matures, but at this stage it is a pretty important redundant system.

ebg13|6 years ago

> So the main problem is not just building the 3d models

That's not relevant when discussing which technology to use to build the 3D models. Everything you said is accurate until the last few sentences. Lidar provides the same information (line-of-sight depth) as stereo cameras, just in a different way. The person you're responding to is talking about depth from stereo, not cognition.

rkangel|6 years ago

> The first thing to remember is that self-driving systems don't work like our brains. If they did, we wouldn't need to train them with billions of images.

I had always assumed that the first few years of infancy were effectively a period of training a neural net (the brain) against a continuous series of images (everything seen).

ricardobeat|6 years ago

Where is the bike example from? All these instances of recognition error are meaningless when they don’t come from actual production systems by auto makers. They don’t just slap OpenCV into a car.

qgadrian|6 years ago

Having a redundant system is the key here.

It also provides a reliable source of data; if humans had LiDAR in their sensory system, we would use it to improve our decisions.

I don’t see why we should limit the AV.

Complexicate|6 years ago

The human brain is horrible at building truly accurate 3D representations of the world. Our mental maps are constantly missing an enormous amount of detail, tricking us and filling in the blanks with approximations.

Easy examples of this are optical illusions, ghosts, and UFOs. There are also "selective attention tests", where a majority of people miss glaringly obvious events right in front of them when they're focusing on something else. Regular people also tend to bump into things, spill things, and trip, even at 3 miles an hour (walking speed).

taneq|6 years ago

Exactly. We don't build detailed accurate 3D maps. We build fuzzy semantic 2.5-ish-D maps that are 99% metadata. And they work incredibly well.

rdtsc|6 years ago

But at the same time people don't think much about getting in their cars and driving to work or the grocery store.

So it seems that a truly accurate 3D representation of the world is not necessary, at least for driving. Perhaps it's the resolution? Looking at the samples in the article, they are just terribly fuzzy, with a narrow field of view. If I had to drive seeing the world only through that kind of view, I don't think I would do very well.
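For what it's worth, the resolution hunch can be made quantitative. For a stereo camera pair, differentiating the triangulation relation Z = f·B/d shows that a one-pixel disparity error produces a depth error that grows with the square of the range, which is why distant objects are where camera resolution bites. A toy sketch with invented parameters (f = 1000 px, B = 0.5 m):

```python
def depth_error(z_m, focal_px=1000.0, baseline_m=0.5, disp_err_px=1.0):
    """Approximate depth error for a rectified stereo pair,
    from differentiating Z = f*B/d: dZ ~= Z^2 * dd / (f * B)."""
    return z_m ** 2 * disp_err_px / (focal_px * baseline_m)

# Error per 1 px of disparity noise at increasing range:
for z in (10, 50, 100):
    print(z, "m ->", round(depth_error(z), 2), "m error")
```

With these made-up numbers, a one-pixel error costs 0.2 m at 10 m range but 20 m at 100 m range; higher resolution (larger effective f) or a wider baseline shrinks that quadratic blow-up.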

m3at|6 years ago

We don't just have 2D data though.

We learn object representations by interacting with them over years in a multimodal fashion. Take a simple drinking glass, for example: we know its material properties (it is transparent, solid, can hold liquids), its typical position (on a tabletop, upright with the open side on top), its usage (grab it with a hand and bring it to the mouth)...

We also make heavy use of the time dimension, as over a few seconds we see the same objects from different viewpoints and possibly in different states.

Only after learning what a glass is can we easily recover its properties on a still 2D image.

So at least for learning (might be skippable at inference), it makes a lot of sense to me to have more than 2D still images.

ebg13|6 years ago

You're not responding to what they said. The person you're responding to is talking about depth from stereo, not cognition. Lidar _also_ doesn't know what the glass feels like.

joshvm|6 years ago

Others have commented about the human aspect.

> Two cameras of any resolution spaced a regular distance apart should be able to build a better parallax 3D model than any one camera alone.

This is true if the platform isn't moving.

If you have the time dimension and good knowledge of the motion between frames (difficult), you can use the two views as a virtual stereo pair. This is called monocular visual/inertial SLAM. You can supplement it with GPS, 2D lidar, odometry and an IMU to probabilistically fuse everything together. There have been some nice results published over the years.

But in general, yes, you'll always be better off if you have a proper stereo pair with a camera on either side of the car.
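The "probabilistically fuse" step above is typically some Kalman-filter variant; at its core it is just inverse-variance weighting of independent estimates of the same quantity. A toy sketch, with the noise figures invented purely for illustration:

```python
def fuse(z1, var1, z2, var2):
    """Minimum-variance fusion of two independent Gaussian
    estimates of the same quantity (the scalar Kalman update)."""
    w1 = var2 / (var1 + var2)          # weight on estimate 1
    fused = w1 * z1 + (1 - w1) * z2    # weighted mean
    fused_var = (var1 * var2) / (var1 + var2)
    return fused, fused_var

# Say stereo estimates a range of 20 m with variance 4,
# while lidar says 19 m with variance 0.01. The fused estimate
# sits very close to the lidar value, with slightly lower
# variance than either sensor alone.
est, var = fuse(20.0, 4.0, 19.0, 0.01)
print(est, var)
```

The fused variance is always below the smaller input variance, which is the formal version of "redundant sensors help even when one is much better."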

microcolonel|6 years ago

> humans can build near perfect 3D representations of the world

The idea that the human brain has a "near perfect" 3D representation of one's surroundings seems inaccurate to me. There's a difference between near perfection and good enough that people don't often get hurt, when all of their surroundings are deliberately constructed to limit exposure to danger.

LeifCarrotson|6 years ago

I write code for industrial equipment and often get the request to fix a problem with software. The question "Can a computer do X" is too easy to answer in the affirmative - "Yes, but less accurately and only most of the time, and with a lot of time and money" gets condensed to "Yes" quickly.

And it is indeed an impressive and heroic piece of work when you can fix sensor problems with clever filtering, or fix mechanical problems with clever control algorithms. But when designing new equipment or deciding a path to fix a bad design, you never want to hamstring yourself from the start with poor quality input data and output actuators. That approach only leads to pain.

Once you have lots of experience with a particular design - dozens of similar machines running successfully in production for years - then you can start looking for ways to be clever and improve performance over the default or save a little money.

I understand Elon's desire to get lots of data. But there will be a much greater chance of success if you start with lidar + cameras, and a decade down the road you can work on camera-only controls, comparing what they calculated and would have done against what the lidar measured and what the car actually did. Only when these are sufficiently close should you phase out the lidar.

Remember, you're comparing bad input data going into the best neural net known in the universe (the human brain), with millennia of evolution and decades of training data, against sensor inputs going into brand-new programming. Help the computer out with better input data.

Symmetry|6 years ago

For human-level driving, a human-level understanding of the scene from purely visual information is quite good enough. The first problem, though, is that the human brain has far more processing power than any computer that can fit in a car, and probably more than any single computer yet constructed (even estimating to a single order of magnitude is hard). We're also leveraging millions of years of evolution, though I'm not entirely sure how much difference that makes given how different our ancestral environment was from driving a car.

The other thing is that, ideally, we want a computer to drive a car better than a human can. There's a lot to be gained, in terms of driving both safely and efficiently, from having precise rather than approximate notions of other objects' distances and speeds. Now, Tesla also has radar, which when fused with visual data will help somewhat, but I'm not sure how far that can get them.

KaiserPro|6 years ago

Yes, we can. We can do it with one eye too.

But it takes at least 10 years to train.

But most of the time we are not building a 3D map from points; we are building it from object inference.

There are many advantages that we have over machines:

o The eye sees much better in the dark

o It has a massive dynamic range, allowing us to see both light and dark things

o It moves to where the threat is

o If it's occluded, it can move to get a better image

o It has a massive database of objects in context

o Each object has a mass, dimension, speed and location it should be seen in

None of those are 3D maps; they are all inference, where one can derive the threat/advantage based on history.

We can't make machines do that yet.

You are correct that two cameras allow for better 3D point cloud construction in some situations, but a moving single camera is better than a static multi-view camera.

However, even then the 3D map isn't all that great, and it has massive latency compared to lidar.

jsharf|6 years ago

I think most of our ability to judge relative distance is based on our brain's judgment of lighting, texture, inference, and sound. While having two eyes helps a lot, you can still navigate a complex office environment with one eye closed. It just takes a bit more care.

mbrumlow|6 years ago

When I was younger, I remember hearing about how we can do all these things because we have two eyes, and that depth perception is what gives us the ability to not walk into walls and do other things, including driving.

I have thought about this many times and often wondered why when closing one eye I am still able to function.

Since then, I have strongly suspected that depth perception is used to train some other part of our brain, and is then only used to increase the accuracy of our perception of reality.

Further evidence of this is TV: even on varying screen sizes, humans tend to do well at figuring out the actual size of the things displayed.

xiphias2|6 years ago

About 10 years ago I went to an eye doctor with a small object in my eye, and she had to cover the eye after removing the object.

Driving back home with one eye was scary even though I was going much slower. It is possible to drive with one eye, but much, much harder than with two.

mcqueenjordan|6 years ago

There are also depth cues from https://en.wikipedia.org/wiki/Vergence#Convergence, right? As in focusing on the object itself?

adrianmonk|6 years ago

Wikipedia lists 18 different types of depth cues that humans use!

https://en.wikipedia.org/wiki/Depth_perception

This seems like a bit of a double-edged sword. On the one hand, it means there's more than one way to achieve a 3D model of the world with cameras. On the other hand, it means that if what machines can do with cameras is going to match what we humans can do with our eyes, they will need to either advance along 18 different fronts or take some of those cues further than we can.

Fricken|6 years ago

The most rudimentary life forms are little factories that build themselves. I think we should concentrate on making cars that build themselves and maybe then our technology will be sophisticated enough to consider looking into giving our cars human-like optical processing faculties.

Otherwise we'll just have to figure out how to build autonomous vehicles with the technology we have, which is pretty crappy in comparison to biology in a lot of ways still.

nguoi|6 years ago

When a tree falls over a river, it creates a rudimentary bridge, as has happened for longer than humans have existed. Yet, while we can create huge suspension bridges from steel, we can't create wood.

asdf21|6 years ago

This is getting into grey goo territory.

mantap|6 years ago

You cannot have false negatives. Ever. You cannot have a situation where the system doesn't see a pedestrian and runs over them without noticing. So you need to make a very convincing argument that that can't happen.

With cameras and computer vision there's no way to prove it. There is always a chance that it will glitch out for a second and kill someone.

pfundstein|6 years ago

Autonomous vehicles don't need to be perfect drivers -- they just need to be better than humans.

threeseed|6 years ago

> near perfect 3D representations of the world with 2D images

This is ridiculous.

I am sitting in front of a monitor right now. Please explain how I can perfectly determine its depth even though I can't see behind it? I can move my head all around it to capture hundreds of different viewpoints, but a car can't do that.

ebg13|6 years ago

Nobody made a rule that says cars can't have cameras in more than one location.

aeternus|6 years ago

When moving, cars can compare hundreds of different viewpoints. Multiple cameras provide for depth perception when stationary.

davidgould|6 years ago

It’s too bad that cars can’t move to get additional points of view.

sdenton4|6 years ago

Cameras do not perform saccades, for starters... The hardware isn't as analogous as it might seem.