cgel | 6 years ago
The problem we have is that to apply Bayes' rule you NEED a prior distribution over the candidate functions, both at the points in the dataset and at the points outside it. In other words, it is one thing to assume that the dataset is representative of the (unknown) classification task; it is another to assume that you know the distribution over classification tasks.
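A minimal sketch of this point (my own toy example, not from the thread): if two hypotheses agree on every training point, their likelihoods are identical, so Bayes' rule returns exactly the prior ratio between them. The hypothesis names and prior weights below are arbitrary.

```python
# Two hypotheses that fit the training data equally well: Bayes' rule
# alone cannot choose between them off the dataset.
train = [(0, 0), (1, 1)]            # (x, y) pairs the learner sees
h_a = lambda x: x                   # fits train; predicts h_a(2) = 2
h_b = lambda x: x if x < 2 else 0   # fits train; predicts h_b(2) = 0

def likelihood(h):
    # Noise-free setting: 1 if h matches every training point, else 0.
    return 1.0 if all(h(x) == y for x, y in train) else 0.0

prior = {"h_a": 0.75, "h_b": 0.25}  # arbitrary -- which is the whole point
unnorm = {name: prior[name] * likelihood(h)
          for name, h in [("h_a", h_a), ("h_b", h_b)]}
z = sum(unnorm.values())
posterior = {name: p / z for name, p in unnorm.items()}
print(posterior)  # identical to the prior: {'h_a': 0.75, 'h_b': 0.25}
```

So the posterior over the unseen point x = 2 is determined entirely by the prior you put on function space, which is the assumption being questioned.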
YeGoblynQueenne | 6 years ago
But in that case there does exist a well-known and well-understood generalisation prior on function space: the simplest hypothesis (e.g. the one with the smallest minimum description length) is always preferable, because it shrinks the hypothesis search space, with a corresponding reduction in the error on unseen data while keeping the number of examples constant.
See:
Occam's Razor (Blumer and friends):
https://www.sciencedirect.com/science/article/pii/0020019087...
Quoting from the abstract:
We show that a polynomial learning algorithm, as defined by ["A theory of the learnable", Valiant 1984], is obtained whenever there exists a polynomial-time method of producing, for any sequence of observations, a nearly minimum hypothesis that is consistent with these observations.
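The selection rule in that abstract can be sketched in a few lines (a toy illustration of my own, not the paper's construction): among hypotheses consistent with the observations, return one of nearly minimum description length. The hypotheses, labels, and "gate count" lengths below are all made up.

```python
# Toy Occam algorithm: pick the shortest hypothesis consistent with
# the observations (Blumer et al.'s "nearly minimum hypothesis").
observations = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

# Candidates as (function, description length); the lengths are a
# crude stand-in for the size of some encoding (e.g. circuit gates).
hypotheses = {
    "xor":        (lambda x: x[0] ^ x[1], 1),
    "or":         (lambda x: x[0] | x[1], 1),          # wrong on (1, 1)
    "xor_padded": (lambda x: (x[0] ^ x[1]) | 0, 3),    # same function, longer code
}

def consistent(h):
    f, _ = h
    return all(f(x) == y for x, y in observations)

# Among consistent hypotheses, return one of minimum description length.
best_name = min(
    (name for name, h in hypotheses.items() if consistent(h)),
    key=lambda name: hypotheses[name][1],
)
print(best_name)  # "xor": it fits the data and is shorter than "xor_padded"
```

The paper's result is that whenever such a (nearly) minimum consistent hypothesis can be produced in polynomial time, you get a polynomial learning algorithm in Valiant's sense.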
Would that begin to address your concerns?