
ericjang | 1 year ago

Intra-distribution generalization is also not well posed in practical real-world settings. Suppose you learn a mapping f : x -> y. Informally, intra-distribution generalization means that f generalizes to "points from the same data distribution p(x)". Two issues here:

1. In practical scenarios, how do you know whether x' is really drawn from p(x)? Even if you could compute log p(x') under the true data distribution, you could only verify that x' has non-zero support. One sample is not enough to tell you whether x' was drawn from p(x) (see the sketch after these points).

2. In high-dimensional settings, an x' that is not exactly equal to an example in the training set can have arbitrarily high generalization error. Here's a criminally under-cited paper discussing this: "Adversarial Spheres", https://arxiv.org/abs/1801.02774
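
A minimal sketch of point 1, assuming purely for illustration that the true p(x) is a standard normal and that we even have oracle access to its density (a luxury real settings don't offer):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Oracle access to the true density p(x) = N(0, I) -- an assumption
    # made only so the check can be run at all.
    d = 10
    p = multivariate_normal(mean=np.zeros(d), cov=np.eye(d))

    x_in = p.rvs(random_state=0)                              # genuinely drawn from p
    x_out = np.random.default_rng(1).uniform(-1, 1, size=d)   # drawn from a different q

    print(p.logpdf(x_in))   # finite
    print(p.logpdf(x_out))  # also finite

Both log-densities are finite, so the "non-zero support" check passes for the out-of-distribution point too; a single sample can't certify its origin.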

mjburgess | 1 year ago

Worse even than this: there are no distributions.

What we mean by x ~ p(x), y ~ p(y|x) is not a mapping x -> y s.t. y = f(x).

Reality itself has no probability distributions. Reality follows a causal model, where a causal relation is given in terms of necessity and possibility.

E.g., there is no such thing as Photo ~ P(Photo|PhotoOfCat) to be learned, only (All Causes) -> PhotoOfCat. Thus the setup of ML as y = f(x) is incorrect: there is no `f` which satisfies this formula (in almost all cases).

Consider the LLM case: reality has no P("The War in Ukraine" | TheWarIn2022) -- either the speaker meant TheWarIn2022, or they didn't. There's no sense in which reality has it that the utterance is intrinsically ambiguous (necessarily, for communication to be possible, pragmatics+semantics must be able to fully resolve meaning).

So what are LLMs learning? Just an implied empirical distribution which is "smoothed over" the data just enough that it "hangs on to it, without repeating it". And this is vital: if an LLM were to try to generalise in the scientific sense, it would cease to be meaningful, since no algorithm which computes P(y|x) in this manner could capture the necessary relata that fully resolve meaning. Any system capable of modelling meaning would be probabilistic only in the sense of having a prior over such causal models: P("TheWarInUkraine" | TheWarIn2022, CausalModel) = 1, but P(CausalModel) < 1 (see the sketch below).
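
A toy sketch of that last claim, with made-up numbers: within any fixed causal model the utterance is fully resolved (probability 0 or 1), and the apparent ambiguity lives entirely in the prior over models -- which is exactly what a marginal P(y|x) smears together:

    # Made-up prior over candidate causal models; within each model the
    # meaning of the utterance is deterministic.
    causal_models = {
        "speaker_means_2022_invasion": 0.9,    # assumed P(CausalModel)
        "speaker_means_2014_annexation": 0.1,
    }

    def p_utterance_given(model):
        # P("The War in Ukraine" | TheWarIn2022, CausalModel) is 0 or 1:
        # a fixed causal model fully resolves the meaning.
        return 1.0 if model == "speaker_means_2022_invasion" else 0.0

    # The marginal an LLM-style model estimates mixes over models,
    # manufacturing an "ambiguity" that no single causal model contains.
    marginal = sum(prior * p_utterance_given(m)
                   for m, prior in causal_models.items())
    print(marginal)  # 0.9 -- uncertainty about the model, not about reality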

So it's always undefined what it means to "generalise" with respect to an empirical distribution -- there aren't any.

When we say scientific theories generalise, we mean their posited necessary causal relations are maintained across irrelevant interventions. E.g., Newton's theory of gravity generalises in that each term (F, M, m, r) is a valid measure of some property, and it remains a valid measure across a very large number of environments.

It fails to generalise for extreme values of M, m, etc.

In the ML sense, all intra-distributional generalisation fails for trivial perturbations of any causal property, e.g., m + dm -- because this induces an entirely new distribution. The "generalisation error" depends on what m + dm does within our model, but regardless, generalisation fails.

Scientific theories do not fail to generalise in this way; irrelevant causal interventions make no difference to the explanatory adequacy (or predictive power) of the theory.
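
A toy contrast between the two senses of generalisation, under assumed numbers: Newton's F = GMm/r^2 keeps working after an intervention m -> m + dm, while a predictor that only smooths the training distribution (nearest-neighbour here, as a stand-in for any such model) fails, because the intervention induces a new p(x):

    import numpy as np

    G = 6.674e-11

    def newton(M, m, r):
        # The causal law: each term is a valid measure across environments.
        return G * M * m / r**2

    rng = np.random.default_rng(0)
    M, r = 5.0e24, 6.4e6                          # assumed, Earth-like values
    m_train = rng.uniform(1.0, 2.0, size=1000)    # training distribution over m
    f_train = newton(M, m_train, r)

    def empirical_predict(m):
        # Nearest-neighbour lookup: a stand-in for any model that only
        # smooths the empirical distribution without repeating it.
        return f_train[np.argmin(np.abs(m_train - m))]

    m_new = 3.5                        # intervention m + dm, outside p(m)
    print(newton(M, m_new, r))         # ~28.5 N -- the law still holds
    print(empirical_predict(m_new))    # clamps to ~16.3 N -- generalisation fails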

jebarker | 1 year ago

Thanks for the clarification. I understand much better what you mean by "scientific generalization". I can't tell whether you're suggesting that LLMs are a dead end for modeling meaning, or just that thinking of LLMs as estimators of probability distributions is the wrong way to think about them?