Do large language models need sensory grounding for meaning and understanding?

[+] abeppu|2 years ago|reply

I'm on board with a lot of what's in this deck, but I take issue with the argument on slide 9. Roughly, the probability that an LLM-provided answer is fully correct decreases exponentially with the length of the answer. I think that's trivially true, but it's also true for human-provided answers (a full non-fiction book is going to have some errors), so it doesn't really get to the core problem with LLMs specifically.

In much of the rest of the deck, it's just presumed that any variable named x comes from the world in some generic way, which doesn't really distinguish why those are a better basis for knowledge or reasoning than the linguistic inputs to LLMs.

I think we're at the point where people working in these areas need some exposure to the prior work on philosophy of mind and philosophy of language.

[+] gsjbjt|2 years ago|reply

The point is that LLMs can’t backtrack after deciding on a token. So the probability at least one token along a long generation will lead you down the wrong path does indeed increase as the sequence gets longer (especially since we typically sample from these things), whereas humans can plan their outputs in advance, revise/refine, etc.

[+] Al-Khwarizmi|2 years ago|reply

I may be missing something as I don't claim to be as smart as LeCun, but to me that probabilistic argument is not even wrong.

The "probability e that a produced token takes us outside of the set of correct answers" is likely to vary so wildly due to a plethora of factors like filler words vs. keywords, hard vs. easy parts of the question, previous tokens generated (I know abuse of statistical independence assumptions is common and often tolerated, but here it's doing a lot of heavy lifting), parts of the answer that you can express in many ways vs. concrete parts that you must get exactly right, etc. that I don't think simplifying it as he does makes any sense.

Yes, I know, abstraction is useful, models are always simplifications, that probability doesn't need to be even close to a constant for the gist of the argument to stand. But everything can be bad in excess and in this case, the simplification is extreme. One can very easily imagine a long answer where the bulk of that probability is concentrated into a single word, while the rest have near-zero probability. Under such circumstances, I don't think his model is meaningful at all.

Connecting with your comment, if someone made that kind of claim about humans, I suppose most people would find it ridiculous (or at the very least, meaningless/irrelevant). With LLMs we find it more palatable, because we are primed to think about LLMs in terms of "probability of generating the next token". But I don't think it really makes much more sense for LLMs than for humans, as the problems with the argument are more in how language works than in how the words are generated.

[+] oezi|2 years ago|reply

I also don't get why this line of reasoning doesn't take into account that LLMs can just emit a backspace token in the same way that humans can ask for you to discard their statement.

The probability of a correct answer doesn't have to decrease in the length of the answer.

Epistemologically it certainly makes sense that you need to run a randomized controlled trial at some point to ascertain facts about nature, but alas we humans very rarely do. We primarily consume information and process that.

[+] elefanten|2 years ago|reply

Appreciate your take, not arguing with it.

But I’d imagine Lecun has more than passing familiarity with those. This deck was put out with the Philosophy dept and he had a panel debate with NYU profs across depts (inc phil) recently on this topic.

I suspect this is all pushing the top Phil Lang and Phil Mind to their limits too. Besides, if those subjects were anywhere near resolved (or even… decently understood), they probably wouldn’t be in the Phil dept any more.

[+] HervalFreire|2 years ago|reply

That probability formula is too general. It can practically model everything. Yan is just hiding behind epsilon. It's like assigning a probability to the origin of the universe. You can just make shit up with a algebraic letter representing a probability.

To illustrate the absurdity of it consider this:

I have a theory for everything. I can predict future. The mathematical equation for that is simply (1 - EPSILON) where EPSILON is the probability of any event not happening. What about the probability of a event happening as a result of the first event? Well that's (1 - EPSILON)^2.

Since this model can model practically everything it just means everything diverges and everything fundamentally can't be controlled. It's basically just modelling entropy.

Really? No. He just tricked himself. The Key here is we need to understand WHAT EPSILON is. We don't, and Yan bringing this formula up ultimately says nothing about anything because it's too general.

Not to mention, Every token should have a different probability. You can't have the same epsilon for every token. You have zero knowledge of the probabilities of every single token therefore they cannot share the same variable, unless you KNOW for a fact the probabilities are the same.

It should be: P_n(Correct) = (1 - Epsilon_n)^n

[+] bravura|2 years ago|reply

Yes.

In Elazar et al. (2019) "How Large Are Lions? Inducing Distributions over Quantitative Attributes" (https://arxiv.org/pdf/1906.01327.pdf), it requires 100M English triples to roughly induce the answer to this question.

How many images do you think a model needs to see in order to answer this question?

Of course, there are fact books that contain the size of the lion. But fact books are non-exhaustive and don't contain much of the information that can quickly be gleaned through other forms of perception.

Additionally, multimodal learning simply learns faster. What could be a long slog through a shallow gradient trough of linguistic information can instead become a simple decisive step in a multimodal space.

If you're interested to read more about this [WARNING: SELF CITE], see Bisk et al. 2020, "Experience Grounds Language" (https://arxiv.org/pdf/2004.10151.pdf).

[+] IIAOPSW|2 years ago|reply

No. But with the caveat that it can only truly grasp things that are self contained systems of text. Allow me to give examples:

Transit wayfinding: A train service is nothing more than a list of stations it goes to, a station is nothing more than a list of train services that stop there. Nothing about the physical nature of trains or commuting is needed to have a discussion about train lines or to answer questions like "how many transfers does it take from x to y?". You could never have seen a train in your life or even a map of a train, if you've studied the dual lists of stations and services there's nothing more to learn.

Chess (and other board games): Standard Chess notation is a list of sequential move made by each player in the format [piece name][x coordinate][y coordinate]. Eg RB6 represents moving the Rook to column B row 6. Chess can be understood entirely in terms of a game of passing a paper back and forth, appending a new token each time, and the rules of the game expressed entirely in terms of how the next token must match the previous part of the list. At no point is it needed to have actually seen the physical board based representation of the game.

The machine is grounded in a text based existence. Conversational and linguistic objects are its literal physical objects. Anything outside of that, well it can understand lions and "large" the same we understand atoms and "eigen-states".

Experience may ground language, but you are free to ground language in a different reality with different basic constituents and relations. If AGI were to emerge out of trading bots, they would have a language grounded in money and trades. If it emerged out of a bot made to play Diplomacy, it would be a consciousness with the body of a nation state and the atoms of its world would be bits of plastic on a map of Europe. Grounding in our particular reality isn't strictly necessary, but it is helpful if the goal is to make something adept at the modalities of our particular existence and conversational form.

[+] ummonk|2 years ago|reply

Your comment seems to suggest the answer is no, and that LLMs can indeed learn sensory-grounded information, but it's just orders of magnitude less efficient to train them on text rather than on multimodal data.

[+] seydor|2 years ago|reply

The arguments are not fully convincing.

"Meaning and understanding" can happen without a world model or perception. Blind people, disabled people have meaning and understanding. The claim that "Understanding" will arise magically with sensory input is unfounded.

A model needs a self-reflective model of itself to be able to "understand" and have meaning (and know that it understands; and so that we know that it understands).

Current autoregressive models are more like giant central-pattern generators (https://en.wikipedia.org/wiki/Central_pattern_generator) and thus zombie-like

But if they were augmented with a self-reflective model, they could understand. A self-reflective model could simply be a sub-model that detects patterns in the weights of the model itself and develops some form of "internal monologue". This submodel may not need supervised training, and may answer the question "was there red in the last input you processed". It may use the transformer to convey its monologue to us

[+] makeworld|2 years ago|reply

> "Meaning and understanding" can happen without a world model or perception. Blind people, disabled people have meaning and understanding.

This is a pretty awful argument. Blind people and disabled people have a world model and perception!

[+] sharemywin|2 years ago|reply

I'm think too much self reflection is a bad idea for bots.

Won't it start wonder if it should maximize it's resources or protect itself.

[+] famouswaffles|2 years ago|reply

The idea that it needs so is looking more and more questionable. Don't get me wrong, i'd love to see some multimodal LLMs. In fact, i think research should move in that direction...However "needing" is a strong word. The text only GPT-4 has a solid understanding of space. Very, very impressive. It was only trained on text. The vast improvement on arithmetic is also very impressive.

(People learn language and concepts through sentences, and in most cases semantic understanding can be built up just fine this way. It doesn't work quite the same way for math. When you look at some numbers and are asked even basic arithmetic , 467383 + 374748. Or say are these numbers primes or factors?. With a glance, you have no idea what the sum of those numbers would be or if the numbers are primes or factors because the numbers themselves don't have much semantic content. In order to understand whether they are those things or not actually requires to stop and perform some specific analysis on them learned through internalizing sets of rules that were acquired through a specialized learning process.)

all of this is to say that arithmetic, math is not highly encoded in language at all.

and still the vast improvement. It's starting to seem like multimodality will get things going faster rather than any real specific necessity.

also, i think that if we want say vision/image modality to have positive transfer with NLP then we need to move past the image to text objective task. It's not good enough. The task itself is too lossy and the datasets are garbage. That's why practically every Visual Language model flunks stuff like graphs, receipts, UIs etc. Nobody is describing those things t the level necessary

what i can see from gpt-4 vision is pretty crazy though. if it's implicit multimodality and not something like say MM-React, then we need to figure out what they did. By far the most robust display of computer vision i've seen.

I think what kosmos is doing (sequence to sequence for Language and images ) has potential.

[+] yowzadave|2 years ago|reply

Chris Espinosa just posted this example of GPT-4 attempting (and failing miserably) to explain how a square has the same area as a circle that circumscribes it:

https://mastodon.social/@Cdespinosa/110092792044177610

Do you think GPT-4 knows what it’s own limits of understanding are? Most people have a sense of what they know and don’t know. I suspect GPT-4 has no concept of either.

[+] benlivengood|2 years ago|reply

GPT-4 is actually multimodal. It can't be publicly prompted with images yet, but that's planned.

I think that multimodal training will also solve the data shortage problem. There are hundreds of times more bytes in video and audio than there are in text, so we'll likely be able to scale pre-training for quite a while before needing to go fully embodied on Transformers.

[+] Thorentis|2 years ago|reply

> has a solid understanding of space

How so? Is it simply better at predicting the answer to spatial questions based on being a more powerful autocomplete than predecessors? How is this proven?

[+] vidarh|2 years ago|reply

To the understanding of space, while it certainly has gaps, I've had GPT4 do basic graph layout by giving it (simple) Graphviz graphs in and asking it to generate draw.io xml out and telling it to avoid overlapping nodes and edges.

[+] coldtea|2 years ago|reply

>all of this is to say that arithmetic, math is not highly encoded in language at all.

How did you learn arithmetic then, if not by being shown numbers and equations, and having math rules and concepts explained to you through language?

[+] swayvil|2 years ago|reply

Us humans often discuss all kinds of real subjects despite lacking any firsthand experience at all. I see no reason why a machine couldn't do the same.

Call it antiscientific. Solipsistic even. But it isn't entirely disasterous, is it?

[+] mr_toad|2 years ago|reply

It’s only unscientific if you don’t test your theories and learn from your errors.

[+] seydor|2 years ago|reply

its not even unscientific. the vast majority of facts we know is from books / texts

[+] carapace|2 years ago|reply

Sure, give the machines empirical feedback devices (sensors) and they will become scientists.

(The thought also occurs: What happens when humans spend time in a sensory deprivation tank? They start to hallucinate. Food for thought.)

As Schmidhuber says, the goal is to "to build [an artificial scientist], then retire".

[+] LesZedCB|2 years ago|reply

I can't escape the fact that Lecuns complicated charts give the appearance of the required complexity to emulate robust general intelligence but are simply that, added complexity which could simply be encoded in emergent properties of simpler architecture models. unless he's sitting on something that's working I'm not really excited about it.

personally, I'm waiting to see what's next after GATO from Deep Mind. their videos are simply mind-blowing.

[+] _xnmw|2 years ago|reply

Quality of output does not mean that the process is genuine. A well-produced movie with good actors may depict a war better than footage of an actual war, but that is not evidence of an actual war happening. Statistical LLMs are trying really hard at "acting" to produce output that looks like there is genuine understanding, but there is no understanding going on, regardless of how good the output looks.

[+] usgroup|2 years ago|reply

I like this topic not least because it helps me answer the question "how is Philosophy relevant". Here we are again, asking elementary epistemological questions such as "what constitutes justified true belief for an LLM" about 3000 years post Plato with much the same trappings as the original formulation.

I wonder if -- as often it ends up -- this audience will end up re-inventing the wheel.

[+] seydor|2 years ago|reply

Philosophical questions have never been answered unsing fully-replicatable quantitative model like these. So this is progress indeed, even if it is pondering the same age old questions (which will never change)

[+] a_bonobo|2 years ago|reply

I asked ChatGPT: 'what is a man?', and it showed me a picture of a plucked chicken saying, 'Behold! I've brought you a man!'

[+] PaulHoule|2 years ago|reply

In some sense they already have sensory grounding if they are coupled to a visual model. It might sound vacuous but if you ask a robot for the "red ball" and it hands you the red ball, isn't it grounded?

[+] coldtea|2 years ago|reply

Describing LLMs: "Training data: 1 to 2 trillion tokens"

Is number of tokens a good metric, given relationships between tokens is what's important?

An LLM with 100000 trillion lexically shorted tokens, given one by one, wont be able to do anything except perhaps spell checking.

I guess the idea is that tokens are given in such "regular" forms (books, posts, webpages) that their mere count is a good proxy for number of relevant relationships.

[+] eternalban|2 years ago|reply

I think the underlying assumption is that there is (an invariant) structure and distribution that becomes more resolved (q) as more data is incorporated. Training LLMs also involves sequences of tokens by design so that is also implicit. The only venue left is to question whether this (n -> q) is a linear relationship or something else. That would only matter if we were looking at absolute measure (to compare say an LLM architecture to something else), but as a comparative measure for LLMs it works, whether it is o(n) or o(logn) or whatever, since more tokens means more sequences and more sequences mean greater ability.

[+] numpad0|2 years ago|reply

Aren’t they already grounded by having training process at all, just weakly?

Very interesting read(read: this is a lightyear beyond my brain) otherwise…

[+] codeulike|2 years ago|reply

Minecraft would be a pretty good medium for finding out the answer to this question actually.

Stick a multimodal LLM thats already got language into Minecraft, train it up and leave it to fend for itself (it will need to make shelter, find food, not fall off high things etc).

Then you could use the chat to ask it about its world.

[+] detrites|2 years ago|reply

Awesome idea in its own right, I hope someone does it.

Reminds me of the GAN Mario AI systems of a few years back:

https://youtube.com/watch?v=CI3FRsSAa_U

[+] luxcem|2 years ago|reply

Minecraft or any sufficiently complex simulated environment. But I think people will put AI models in physical robot really soon too if it's not already happening.

[+] antiquark|2 years ago|reply

Old tweet from LeCun: "The vast majority of human knowledge, skills, and thoughts are not verbalizable."

https://twitter.com/ylecun/status/1368239479463366656

[+] georgehill|2 years ago|reply

Here is the related tweet from Yann LeCun: https://twitter.com/ylecun/status/1640122342570336267

[+] ioedward|2 years ago|reply

The twitter thread is worth reading, Yann responds to some of the questions raised here.

[+] nsainsbury|2 years ago|reply

For anyone looking for the related talk from LeCun discussing the proposed architecture: https://www.youtube.com/watch?v=VRzvpV9DZ8Y

[+] est|2 years ago|reply

Is "learning to reason" a real challenge? (Kahneman's system I in the slides) From a naive perspective of view, formal methods like SAT solvers, proof assistants works pretty well.

[+] Borrible|2 years ago|reply

With Chatty Everywhere, it's only a matter of weeks (days?) before we will have the answer.

I didn't look at the news yesterday, is it already hooked into a Tesla?

Knight Rider II -Rise of the Autobot.

[+] jorgemf|2 years ago|reply

I think it is time to move from intelligent systems to conscious systems. Based on [1] in order to have more intelligent systems we do need sensory as the slides state but we also need other things like attention, memory, etc. So we can have intelligent systems that can have a model of the world and make plans and more complex actions (see [2,3]). Maybe not so big models as today's Language Models. I know the slides show some of the ideas, but we cannot add some things without adding other things first. For example we need some kind of memory (long and short term) in order to do planning, adding a prediction function for measuring the cost of an action is a way of doing planning but it have a lot of drawbacks (as loops because the agent does not remember past steps, or what happened just before). Also a self representation is needed to know how the agent takes part in the plan, or a representation of other entity if it is that one who executes the plan.

[1] https://www.conscious-robots.com/papers/Arrabales_ALAMAS_ALA...

[2] https://www.conscious-robots.com/consscale/level_tables.html...

[3] https://www.conscious-robots.com/papers/Arrabales_PhD_web.pd...

[+] aaronscott|2 years ago|reply

I have wondered about sensory input being needed for AGI when thinking about human development and feral children[1]. It seems that complex sensory input, like speech, may be a component of cognitive development.

https://en.wikipedia.org/wiki/Feral_child

125 comments