top | item 40043535

saeranv | 1 year ago

I think they are accounting for the entire context; they specifically write out:

>> P(next_word|previous_words)

So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.

But I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding.

Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?
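A minimal sketch of what "conditioned on previous_words (plural)" means in practice. The vocabulary and the scoring function are invented for illustration (a deterministic pseudo-random stand-in, not a real transformer); the point is only that the distribution is a function of the entire joint context, not of the last token:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D array of scores.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

vocab = ["the", "sun", "is", "hot", "cold"]

def toy_logits(previous_words):
    # Hypothetical stand-in for a trained model: deterministic pseudo-random
    # scores seeded from the *entire* context, so every earlier word matters.
    seed = sum(ord(c) for w in previous_words for c in w)
    rng = np.random.default_rng(seed)
    return rng.normal(size=len(vocab))

def next_word_distribution(previous_words):
    # P(next_word | previous_words) as a dict over the toy vocabulary.
    probs = softmax(toy_logits(previous_words))
    return dict(zip(vocab, probs))

dist_a = next_word_distribution(["the", "sun", "is"])
dist_b = next_word_distribution(["a", "sun", "is"])
# Changing an earlier word changes the whole distribution, because the
# conditioning is on the joint context, not on the previous token alone.
```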


nerdponx | 1 year ago

It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.
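Some back-of-the-envelope arithmetic on that last point. The figures below are order-of-magnitude assumptions (a typical modern tokenizer vocabulary, a book-length context), but they show why an explicit conditional-probability table over such states is unenumerable, and why the model must instead compress them into a comparatively tiny parameter set:

```python
import math

# Assumed orders of magnitude, not measured values:
vocab_size = 50_000    # roughly a modern tokenizer vocabulary
context_len = 100_000  # ~an entire book of tokens, per the comment above

# The number of distinct contexts is vocab_size ** context_len; we take
# log10 so the result is representable at all.
log10_contexts = context_len * math.log10(vocab_size)

# log10_contexts is ~470,000, i.e. the count of possible "states" is a
# number with roughly 470,000 digits. No lookup table enumerates that.
```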

mjburgess | 1 year ago

It doesn't matter how big it is; its properties don't change. E.g., it never says, "I like what you're wearing" because it likes what I'm wearing.

It seems there's an entire generation of people taken in by this word, "complexity," as if it's just magic sauce that gets sprinkled over ad copy for big tech.

We know what it means to compute P(word|words); we know what it means that P("the sun is hot") > P("the sun is cold") ... and we know that by computing this, you aren't actually modelling the temperature of the sun.
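For concreteness, here is what that comparison amounts to with a toy bigram model. The counts are invented for illustration: the inequality falls out of the chain rule over co-occurrence statistics alone, and nothing about the sun's temperature appears anywhere in the computation:

```python
import math

# Hypothetical counts, standing in for frequencies tallied from a corpus.
bigram_counts = {
    ("the", "sun"): 50, ("sun", "is"): 40,
    ("is", "hot"): 30, ("is", "cold"): 5,
}
unigram_counts = {"the": 100, "sun": 60, "is": 80}

def sentence_logprob(words):
    # log P(sentence) via the chain rule over bigram conditionals:
    # log P(w2|w1) + log P(w3|w2) + ...
    lp = 0.0
    for prev, nxt in zip(words, words[1:]):
        lp += math.log(bigram_counts[(prev, nxt)] / unigram_counts[prev])
    return lp

hot = sentence_logprob(["the", "sun", "is", "hot"])
cold = sentence_logprob(["the", "sun", "is", "cold"])
# hot > cold, purely because "is hot" was counted more often than "is cold".
```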

It's just so disheartening how everyone becomes so anthropomorphically credulous here... can we not even get sun worship out of tech? Is it not possible for people to understand that conditional probability structures do not model mental states?

No model of conditional probabilities over text tokens, no matter how many text tokens it models, ever says, "the weather is nice in August" because it means the weather is nice in August. It has never been in an August, or in weather; nor does it have the mental states for preference or desire; nor has its text generation been caused by the August weather.

This is extremely obvious, as in, simply reflect on why the people who wrote those historical texts did so... and reflect on why an LLM generates this text... and you can see that even if an LLM produced, word for word, MLK's "I Have a Dream" speech, it does not have a dream. It has not suffered any oppression, nor organised any labour, nor made demands on the moral conscience of the public.

This shouldn't need to be said to a crowd who can presumably understand what it means to take a distribution of text tokens and subset it. It doesn't matter how complex the weight structure of an NN is: this tells you only how compressed the conditional probability distribution over many TBs of all of text history is.