Everything you mention seems possible to replicate with 2 cameras, a few sensors (and X years of AI/software dev). Which is the point of the people who say cameras are good enough because human eyes are good enough.
The human eye is a far better sensor in many ways than the CMOS sensors used in cars. Eyes have variable-focus lenses, finely controlled irises, and sit on gimbaled mounts in the head, which is itself gimbaled and can be moved around by the body. They have integrated shades/shutters and cleaning mechanisms. Attached to the whole eye mount are inertial sensors (the vestibular system) and audio sensors.
A camera or even a stereo pair of cameras mounted in or on the car will provide inferior imagery to the control system than eyes to the brain. They have less dynamic range and no articulation. If you wanted to replicate human style vision you'd need a bunch of fixed cameras and inertial and acceleration sensors all on top of AI that's better than what Tesla's been demonstrating.
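To put rough numbers on that dynamic-range gap (illustrative figures, not measured specs for any particular sensor): one photographic stop doubles the light level, so stop counts translate into contrast ratios like this:

```python
def stops_to_contrast(stops):
    """One photographic stop doubles the light level, so a sensor
    spanning `stops` stops covers a 2**stops : 1 contrast range."""
    return 2 ** stops

# Illustrative assumption: a typical automotive CMOS sensor captures
# roughly 12 stops in a single exposure, while the human eye with
# adaptation spans roughly 20 stops.
camera = stops_to_contrast(12)
eye = stops_to_contrast(20)
print(f"camera {camera}:1, eye {eye}:1, gap {eye // camera}x")
```

Even with those charitable numbers, the eye's usable contrast range comes out a couple of orders of magnitude wider, which is exactly the bright-sky-versus-dark-underpass situation cars hit constantly.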
LIDAR is the most straightforward augmentation for fixed cameras because it builds very accurate depth maps, which in turn make image segmentation much easier. You need fewer fixed cameras if your spatial model is built with LIDAR. You're in even better shape if those systems are augmented with radar.
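As a sketch of what "building the spatial model with LIDAR" means at the lowest level: the standard pinhole projection of LIDAR points into a camera image gives you sparse ground-truth depth to anchor segmentation. The intrinsic matrix and point cloud below are hypothetical; a real pipeline also needs the extrinsic LIDAR-to-camera transform and lens distortion correction.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K):
    """Project LIDAR points (N,3), already in the camera frame, onto
    the image plane with intrinsic matrix K (3,3). Returns pixel
    coordinates (M,2) and the depth of each surviving point."""
    depths = points_xyz[:, 2]
    valid = depths > 0                 # keep only points in front of the camera
    pts = points_xyz[valid]
    uvw = (K @ pts.T).T                # homogeneous image coordinates
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide
    return uv, depths[valid]

# Hypothetical intrinsics: 500 px focal length, principal point (320, 240)
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0,   0.0,   1.0]])
```

Each projected point gives one pixel with exact metric depth; the segmentation network then only has to interpolate between known-good range measurements instead of hallucinating depth from texture.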
While humans don't have LIDAR and such, our visual systems are highly developed and augmented with highly developed proprioception. Trying to replicate it with just cameras and tons of processing power is a fool's errand.
I first worked on autonomous cars in 2007 for the DARPA Grand Challenge, and even back then fused sensing was where it was at. Modern cameras are better, but they're no replacement for the eye. The best thing we can do right now is take high-quality cameras and augment them with things like radar and LIDAR to get close to human-eye-level perception. That makes the AI's job more about the macro driving problems and less about vision. Look at the string of crashes of Teslas into white box trucks on bright days.
I remember the first time we had problems with a matte black surface: our LIDAR couldn't see it, but it would have been easily spotted by our camera. The reverse happened with a shiny white surface in direct sunlight relative to the car, which was easily picked up by the LIDAR but nearly invisible to the cameras.
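That complementary-failure story is why even a trivial fusion rule beats either sensor alone: the failure modes barely overlap. A toy sketch with made-up confidence numbers and a made-up threshold:

```python
def fused_obstacle(camera_conf, lidar_conf, threshold=0.5):
    """Declare an obstacle if EITHER sensor is confident enough.
    Matte black absorbs LIDAR returns; glare washes out cameras;
    taking the max covers both failure modes. Confidences and
    threshold here are illustrative, not from any real system."""
    return max(camera_conf, lidar_conf) >= threshold

# Matte black panel: weak LIDAR return, but the camera sees it fine.
print(fused_obstacle(camera_conf=0.9, lidar_conf=0.1))  # True
# Shiny white surface in direct sun: camera washed out, LIDAR ranges it.
print(fused_obstacle(camera_conf=0.1, lidar_conf=0.9))  # True
```

Real systems fuse probabilistically rather than with a hard max, but the principle is the same: independence of failure modes is the whole value of the second modality.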
Two cameras, plus the as-yet-unreplicated human visual processing that synthesizes depth info correctly for arbitrary scenes, plus the ability to articulate side to side by 8-10 inches, plus a very efficient liquid coating and cleaning mechanism (blinking), plus the ability to deploy anti-glare shades proactively (hands).
Don’t forget the self-maintaining general intelligence with 14-16 years of real-world training (minimum; varies by state) to understand context and environmental factors, go to highly trained specialists when its sensors seem out of calibration, etc.
I think he's saying that the "X years of AI/software dev" are really the important part, much more than the mechanics of the visual sensing. And that the human brain and optical system doing that is what's not so easy to replicate in a machine.
The best cameras are about 80% there, compared to human eyes. At least, in some respects. There are no cameras that are 80% there in all respects. And that’s when compared to low-quality human vision that is attached to low-attention/highly distracted humans.
Now, there’s this little thing called The Pareto Principle.
That last 20% is going to take a long time to achieve, and is going to be very, very expensive.
Are you willing to roll a ten sided die every time you get in the car, and only if you roll a three or higher, do you get to arrive at your destination unhurt, on time, and without major incident?
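For the record on that die-roll analogy: rolling a 3 or higher on a d10 is an 80% per-trip success rate, and independent 80% trips compound brutally. A toy calculation (not a safety model, and real failures aren't independent):

```python
def incident_free(p_trip, trips):
    """Probability that every one of `trips` independent trips
    succeeds, given per-trip success probability p_trip."""
    return p_trip ** trips

# "Roll a 3 or higher on a ten-sided die" = 8 outcomes out of 10.
p = 8 / 10
for n in (1, 10, 100, 500):
    print(f"{n:4d} trips: {incident_free(p, n):.3g}")
```

Ten trips already leaves you with only about a 10% chance of a clean record, which is why "80% there" is nowhere near good enough for a system you ride in twice a day.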
2 cameras with 3-axis rotation and xyz movement to look around obstructions and see things out of reach from a fixed perspective.
Systems-wise, that's going to be a lot more complex than having more cameras. Of course, if you've got enough cameras you can probably get better visibility than a human, and continuously process all viewpoints. But 2 cameras is definitely insufficient.
IIAOPSW|4 years ago
I have an adversarial example.
https://wallpapercave.com/wp/InAcQKW.jpg