Note that many of the errors are much more understandable once you consider that the convolutional net's pooling destroys many of the spatial relations in the pictures.
I imagine I might make similar errors if I only got little jumbled fragments to work from.
Given those conditions, the cat "laying on a couch" or the dog "jumping to catch a frisbee" hardly even seem like errors to me.
This is going to get radically better when someone works out an efficient way to keep the spatial relations.
Geoff Hinton gave a talk last week at Berkeley on exactly this problem: in pixel space, object identities are all tangled up with location/pose information in a very nonlinear way. It would be nice to find a representation that preserves both components while disentangling them ("equivariance") instead of just throwing away all of the spatial information ("invariance", which is what convnets do). He's done some work on this, a lot of which is apparently unpublished, but he gave a reference to one older paper covering some of the ideas:
https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf
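A tiny sketch (plain Python, mine, not from the talk) of why pooling throws the "where" away: two feature maps with the same activations shifted to different positions inside each pooling window produce identical pooled outputs.

```python
def max_pool(grid, k=2):
    """Non-overlapping k x k max pooling over a square 2D list."""
    n = len(grid)
    return [[max(grid[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k))
             for j in range(n // k)]
            for i in range(n // k)]

# Two 4x4 "feature maps": the same activations, shifted within each
# 2x2 pooling window.
a = [[1, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 1]]

# Both pool to [[1, 0], [0, 1]]: the position inside each window is gone.
# That discarded "where" is the invariance Hinton contrasts with equivariance.
print(max_pool(a) == max_pool(b))  # -> True
```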
1. Object recognition (there's a dog and a frisbee in the photo)
2. Object localization (the dog's and the frisbee's ROIs in the photo)
3. Relation estimation (based on X factors, the dog might be chasing the frisbee)
Not sure what you meant by spatial relations (localization?), but recognizing (what) and localizing (where) would be key to drawing relationships between objects.
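A toy sketch of that three-stage decomposition; the class, the function, and the "above means jumping" rule are all my own illustration, not anything from the actual system:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple  # region of interest as (x, y, width, height), y growing downward

def infer_relation(a: Detection, b: Detection) -> str:
    """Toy relation estimator: guess a verb from the boxes' relative positions."""
    _, ay, _, _ = a.box
    _, by, _, bh = b.box
    if by + bh < ay:  # b sits entirely above a in the image
        return f"{a.label} jumping to catch {b.label}"
    return f"{a.label} chasing {b.label}"

dog = Detection("dog", (40, 60, 30, 20))
frisbee = Detection("frisbee", (55, 10, 10, 5))
print(infer_relation(dog, frisbee))  # -> dog jumping to catch frisbee
```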
Much more likely than Xophmeister's explanation, I think, is that the brightness of a color in an image is relative to the lighting conditions. (Remember that pink is just white mixed with red.) See image B:
It may not have a sophisticated enough vocabulary to distinguish 'pink' when 'red' was close enough. This effect is manifest in human languages that classify colours differently: a language may, for example, have no word for 'blue', so the sky is 'green' to its speakers. It's still perceptually different to them, of course, but the lack of fidelity means it can't be communicated better than "sky green" versus "grass green".
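A minimal sketch of that vocabulary effect (the palette and the nearest-name rule are mine, not the system's): with no 'pink' or 'white' available, the nearest label for pink is plain 'red', and dimming the pixel, as in the shaded part of image B, doesn't change that.

```python
import math

# An intentionally impoverished colour vocabulary (illustrative values).
VOCAB = {
    "red":   (255, 0, 0),
    "green": (0, 255, 0),
    "blue":  (0, 0, 255),
    "black": (0, 0, 0),
}

def nearest_name(rgb):
    """Name a pixel with the vocabulary entry nearest in Euclidean RGB distance."""
    return min(VOCAB, key=lambda name: math.dist(VOCAB[name], rgb))

# "Pink is just white mixed with red": a 50/50 blend.
white, red = (255, 255, 255), (255, 0, 0)
pink = tuple((w + r) // 2 for w, r in zip(white, red))  # (255, 127, 127)

print(nearest_name(pink))                               # -> red
print(nearest_name(tuple(int(c * 0.7) for c in pink)))  # in shade: still red
```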
This is interesting. I think that natural language generation is a largely overlooked task outside of machine translation -- perhaps because most tasks that might require it can get away with the much stupider, much easier job of filling in templates like a form letter. It's cool to see Google attempting the real thing, on top of the image recognition.
That said, I don't expect particularly high accuracy from the composition of an image recognition system and natural language generation. The first actual demo of this is going to be a source of utter hilarity. I hope they're okay with that.
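The "form letter" approach being contrasted with real generation is literally just slot-filling; a sketch, with a template and slot names of my own invention:

```python
# Slot-filling "NLG": pick a fixed template and drop detected labels into it.
TEMPLATE = "A photo of a {subject} {verb} a {object}."

def caption(subject, verb, obj):
    return TEMPLATE.format(subject=subject, verb=verb, object=obj)

print(caption("dog", "chasing", "frisbee"))  # -> A photo of a dog chasing a frisbee.
```

Generating fluent, novel sentences conditioned on an image is a much harder problem than this, which is what makes the attempt interesting.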
etiam | 11 years ago
davmre | 11 years ago
iamsalman | 11 years ago
Really impressive work but definitely not a leap.
SammoJ | 11 years ago
Trufa | 11 years ago
jessriedel | 11 years ago
http://www.huevaluechroma.com/pics/3-4.jpg
This is also true to some extent with the actual hue (i.e. red versus green, rather than brightness; see image C) but less so.
Xophmeister | 11 years ago
rspeer | 11 years ago
teddyh | 11 years ago
Bjoern | 11 years ago