top | item 9369121

cheatsheet | 11 years ago

> β€œIt goes beyond image classification β€” the most popular task in computer vision β€” and tries to answer one of the most fundamental questions in computer vision: What is the right representation of visual scenes?

Can someone knowledgeable in graphics research explain the context that this question comes from?

If I am reading the question correctly, I infer that the question suggests that there exists a right way to reproduce the visual experience of reality. To me, this sounds like a question that is equally valid to have no answer (or many answers) in aesthetics, art, and philosophy, etc.

rasz_pl|11 years ago

Think about dreaming. "Seeing" during a dream state works by experiencing a pure data representation of the real world. People fluent in lucid dreaming can tell you something funny happens when you try to thoroughly examine objects while sleeping. Constructed worlds tend to be skin deep, and fall apart when poked. Everything is built from ideas drawn from your experience.

It's Plato's Allegory of the Cave all the way down.

Imagine "watching" a movie compressed using your very own prior knowledge. Every scene could be described in a couple hundred lines of plaintext. Today we do this by reading a book :) What if we could build an algorithm able to render movies from books?

TheGrassyKnoll|11 years ago

Fascinating Captain, machine intelligence meets philosophy.

"The world is such and such or so and so, only because we talk to ourselves about its being such and such and so and so..." Carlos Castaneda

digi_owl|11 years ago

Makes me think of Vocaloid.

tel|11 years ago

The wiggle word here is "right", I suppose. It's easy to ascribe meanings to that word which are very difficult to use---my limited understanding of Philosophy makes me think that this is the realm of ideas like "qualia" and the like.

For a long time statisticians wrangled over this word in a reduced context. The "art" of statistics is to build a model of the world which is sufficiently detailed to capture interesting data but not so detailed as to make it difficult for a human decision-maker to interpret. Statisticians usually solve this problem by building a lot of models, getting lucky, presenting things to people, and seeing what sticks.

For a long time this lack of a notion of "rightness" was so powerful that it precluded advancement of the field in certain ways.

With the advent of computers we discovered a new, even more precise form of "right", however, and this formed the bedrock of Machine Learning. The "right" ML is concerned with is predictive power. A model is "right" when it leads to a training and prediction algorithm which is "probably approximately correct", i.e. you can feed real data in and end up with something useful (with a high degree of probability).
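A minimal sketch of that notion of "right", with entirely made-up toy data: two candidate models are fit on training points, and the "right" one is simply whichever predicts held-out data better.

```python
# "Rightness" as held-out predictive power, on an invented 1-D binary task.
train = [(0.1, 0), (0.2, 0), (0.25, 0), (0.8, 1), (0.9, 1)]
test  = [(0.15, 0), (0.7, 1), (0.85, 1), (0.3, 0)]

def majority_model(train):
    # Ignores the input entirely; always predicts the most common training label.
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def threshold_model(train):
    # Predicts 1 when x exceeds the midpoint between the two class means.
    mean0 = sum(x for x, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    mean1 = sum(x for x, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    mid = (mean0 + mean1) / 2
    return lambda x: int(x > mid)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

models = {"majority": majority_model(train), "threshold": threshold_model(train)}
scores = {name: accuracy(m, test) for name, m in models.items()}
print(scores)  # the threshold model generalizes better on unseen data
```

The point is that no appeal to qualia is needed: the better representation of this toy "world" is the one that wins on data the model never saw.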

So with respect to computer vision we know that it is very difficult to build "efficient" algorithms, ones which work well while using a reasonable amount of training data. CV moved forward when it realized that there were representations of the visual field which led to better predictive power---these were originally generated by studying the visual center of human and animal brains, but more recently have been generated "naively" by computers.

So, there's a reasonably well-defined way that we can find the "right" representation of visual scenes: if we find one which is ultimately best-in-class among all representations for any choice of ML task, then it's "right".

darkmighty|11 years ago

I like this definition; it's almost equivalent to the one given below by me: if you have a good predictor you can compress the information well, but not optimally. To compress optimally, you need more than an optimal (single-outcome) predictor: you need a predictor that outputs probabilities for the various events close to the true probabilities.

So in some sense optimal compression gives the best you could hope for, up to the limitations of the probabilistic models, which is why I like this explanation.
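The prediction/compression link above can be made concrete with Shannon's source-coding identity: an optimal code spends -log2 q(s) bits on symbol s under model q, so the expected length is the cross-entropy between the true distribution p and the model q, minimized exactly when q = p. The distributions below are invented for illustration.

```python
import math

p = {"a": 0.7, "b": 0.2, "c": 0.1}        # true symbol frequencies
q_good = {"a": 0.7, "b": 0.2, "c": 0.1}   # predictor matching the truth
q_poor = {"a": 1/3, "b": 1/3, "c": 1/3}   # uninformed predictor

def expected_bits(p, q):
    # Expected code length (bits/symbol) when coding p-distributed data
    # with a code optimized for model q: the cross-entropy H(p, q).
    return sum(p[s] * -math.log2(q[s]) for s in p)

print(expected_bits(p, q_good))  # entropy of p, about 1.157 bits/symbol
print(expected_bits(p, q_poor))  # log2(3), about 1.585 bits/symbol
```

A merely good single-outcome predictor can beat the uninformed code, but only a predictor whose probabilities match the true ones reaches the entropy floor, which is the claim in the comment above.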

darkmighty|11 years ago

The question is fundamental to all kinds of recognition: recognizing the invariants of the scene, the data that distinguishes it from other scenes, which is very close to the definition of Shannon information.

For example, if you can extract a 'Mesh' from a 2D picture, you can generate many other viewpoints, and that mesh can be considered a good representation. If you are more sophisticated, however (and perhaps have a larger "dictionary"), you can instead extract 'There are two wooden chairs 1m from each other, ...'.

That's the sense in which the representation is fundamental to computer vision -- it distills what the system knows (or what it wants to know) about scenes. The more concise the representation without loss of information, the smarter your system is (and past a point this becomes a general AI problem).

nabla9|11 years ago

You can work out the statistical structure of natural images (or just faces) and derive efficient representations with properties similar to those observed in the visual system of the brain.

See for example:

Natural Image Statistics β€” A probabilistic approach to early computational vision https://www.cs.helsinki.fi/u/ahyvarin/natimgsx/
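A toy, pure-Python sketch of that idea, using synthetic two-pixel "patches" rather than real images: because neighboring pixels are strongly correlated, PCA on the patch covariance recovers an averaging component and a difference component, loosely analogous to the luminance and edge-like responses found in early vision.

```python
import math, random

# Synthetic patches: two neighboring pixels sharing a local brightness,
# plus small independent variation. This mimics (crudely) the pixel
# correlations found in natural images.
random.seed(0)
patches = []
for _ in range(10000):
    base = random.gauss(0, 1)      # shared local brightness
    noise = random.gauss(0, 0.3)   # small independent variation
    patches.append((base, base + noise))

n = len(patches)
mx = sum(x for x, _ in patches) / n
my = sum(y for _, y in patches) / n
cxx = sum((x - mx) ** 2 for x, _ in patches) / n
cyy = sum((y - my) ** 2 for _, y in patches) / n
cxy = sum((x - mx) * (y - my) for x, y in patches) / n

# Principal directions of the symmetric 2x2 covariance [[cxx,cxy],[cxy,cyy]]:
# the top eigenvector lies at angle 0.5*atan2(2*cxy, cxx - cyy).
theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
v1 = (math.cos(theta), math.sin(theta))    # ~ (0.71, 0.71): pixel average
v2 = (-math.sin(theta), math.cos(theta))   # ~ (-0.71, 0.71): pixel difference
print(v1, v2)
```

On real image patches one would use larger windows and methods like ICA or sparse coding (as in the linked book) to obtain Gabor-like filters; this two-pixel version only illustrates why correlated inputs yield those characteristic average/difference components.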