Note that many of the errors are much more understandable once you consider that the convolutional net's pooling destroys many of the spatial relations in the pictures.
I imagine I might make similar errors if I only got little jumbled fragments to work from.
Given those conditions, the cat "laying on a couch" or the dog "jumping to catch a frisbee" hardly even seem like errors to me.
This is going to get radically better when someone works out an efficient way to keep the spatial relations.
Geoff Hinton gave a talk last week at Berkeley on exactly this problem: in pixel space, object identities are all tangled up with location/pose information in a very nonlinear way. It would be nice to find a representation that preserves both components while disentangling them ("equivariance") instead of just throwing away all of the spatial information ("invariance", which is what convnets do). He's done some work on this, a lot of which is apparently unpublished, but he gave a reference to one older paper covering some of the ideas:
https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf
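A tiny sketch (plain Python, mine, not from the talk) of why pooling throws the "where" away: two feature maps with the same activations shifted to different positions inside each pooling window produce identical pooled outputs.

```python
def max_pool(grid, k=2):
    """Non-overlapping k x k max pooling over a square 2D list."""
    n = len(grid)
    return [[max(grid[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k))
             for j in range(n // k)]
            for i in range(n // k)]

# Two 4x4 "feature maps": the same activations, shifted within each
# 2x2 pooling window.
a = [[1, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 1]]

# Both pool to [[1, 0], [0, 1]]: the position inside each window is gone.
# That discarded "where" is the invariance Hinton contrasts with equivariance.
print(max_pool(a) == max_pool(b))  # -> True
```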
1. Object recognition (there's a dog and a frisbee in the photo)
2. Object localization (the dog's and the frisbee's ROIs in the photo)
3. Relation estimation (based on X factors, the dog might be chasing the frisbee)
Not sure what you meant by spatial relations (localization?), but recognizing (what) and localizing (where) would be key to drawing relationships between objects.
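A toy sketch of that three-stage decomposition; the class, the function, and the "above means jumping" rule are all my own illustration, not anything from the actual system:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple  # region of interest as (x, y, width, height), y growing downward

def infer_relation(a: Detection, b: Detection) -> str:
    """Toy relation estimator: guess a verb from the boxes' relative positions."""
    _, ay, _, _ = a.box
    _, by, _, bh = b.box
    if by + bh < ay:  # b sits entirely above a in the image
        return f"{a.label} jumping to catch {b.label}"
    return f"{a.label} chasing {b.label}"

dog = Detection("dog", (40, 60, 30, 20))
frisbee = Detection("frisbee", (55, 10, 10, 5))
print(infer_relation(dog, frisbee))  # -> dog jumping to catch frisbee
```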
Much more likely than Xophmeister's explanation, I think, is that the brightness of a color in an image is relative to the lighting conditions. (Remember that pink is just white mixed with red.) See image B:
It may not have a sophisticated enough vocabulary to distinguish 'pink' when 'red' was close enough. This effect is manifest in human languages that classify colours differently: a language may, for example, have no word for 'blue', so the sky is 'green' to its speakers. It's still perceptually different to them, of course, but the lack of fidelity means it can't be communicated better than "sky green" versus "grass green".
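A minimal sketch of that vocabulary effect (the palette and the nearest-name rule are mine, not the system's): with no 'pink' or 'white' available, the nearest label for pink is plain 'red', and dimming the pixel, as in the shaded part of image B, doesn't change that.

```python
import math

# An intentionally impoverished colour vocabulary (illustrative values).
VOCAB = {
    "red":   (255, 0, 0),
    "green": (0, 255, 0),
    "blue":  (0, 0, 255),
    "black": (0, 0, 0),
}

def nearest_name(rgb):
    """Name a pixel with the vocabulary entry nearest in Euclidean RGB distance."""
    return min(VOCAB, key=lambda name: math.dist(VOCAB[name], rgb))

# "Pink is just white mixed with red": a 50/50 blend.
white, red = (255, 255, 255), (255, 0, 0)
pink = tuple((w + r) // 2 for w, r in zip(white, red))  # (255, 127, 127)

print(nearest_name(pink))                               # -> red
print(nearest_name(tuple(int(c * 0.7) for c in pink)))  # in shade: still red
```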
This is interesting. I think that natural language generation is a largely overlooked task outside of machine translation -- perhaps because most tasks that might require it can get away with the much stupider, much easier job of filling in templates like a form letter. It's cool to see Google attempting the real thing, on top of the image recognition.
That said, I don't expect particularly high accuracy from the composition of an image recognition system and natural language generation. The first actual demo of this is going to be a source of utter hilarity. I hope they're okay with that.
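The "form letter" approach being contrasted with real generation is literally just slot-filling; a sketch, with a template and slot names of my own invention:

```python
# Slot-filling "NLG": pick a fixed template and drop detected labels into it.
TEMPLATE = "A photo of a {subject} {verb} a {object}."

def caption(subject, verb, obj):
    return TEMPLATE.format(subject=subject, verb=verb, object=obj)

print(caption("dog", "chasing", "frisbee"))  # -> A photo of a dog chasing a frisbee.
```

Generating fluent, novel sentences conditioned on an image is a much harder problem than this, which is what makes the attempt interesting.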
etiam | 11 years ago
davmre | 11 years ago
iamsalman | 11 years ago
Really impressive work but definitely not a leap.
SammoJ | 11 years ago
Trufa | 11 years ago
jessriedel | 11 years ago
http://www.huevaluechroma.com/pics/3-4.jpg
This is also true to some extent with the actual hue (i.e. red versus green, rather than brightness; see image C) but less so.
Xophmeister | 11 years ago
rspeer | 11 years ago
teddyh | 11 years ago
Bjoern | 11 years ago