
Playing games with AIs: The limits of GPT-3 and similar large language models

110 points| nigamanth | 3 years ago |link.springer.com | reply

57 comments

[+] visarga|3 years ago|reply
While I agree with the authors that large language models only trained on text lack the ability to distinguish "possible worlds" from reality, I think there is a path ahead.

Large language models might be excellent candidates for evolutionary methods and RL. They need to learn from solving language problems on a massive scale. But problem solving could be the medicine that cures GPT-3's fuzziness, a bit of symbolic exactness injected into the connectionist system.

For example: "Evolution through Large Models" https://arxiv.org/abs/2206.08896

They need learning from validation to complement learning from imitation.

[+] zwaps|3 years ago|reply
This is what ChatGPT does, for instance.
[+] imranq|3 years ago|reply
Interesting how two camps are emerging for LLMs. This one is about how GPT actually learns something; the other, represented by Chomsky and Gary Marcus, holds that GPT has learned nothing (https://news.ycombinator.com/item?id=34278243).

I think the difference is what "learned" means. This paper basically says that any finite amount of knowledge compression is learning, whereas the other camp defines learning as some kind of infinite information compression like being able to add any two numbers no matter how large, which is something no language model will ever be able to do.

Personally, I think both sides are right: GPT3 has compressed language patterns across the internet, but it's undeniable that GPT3 makes up stuff frequently. Overall, the point still stands that these language models are valuable in some specific contexts, but it's not clear how far they can go.

[+] voiper1|3 years ago|reply
>the other camp defines learning as some kind of infinite information compression like being able to add any two numbers no matter how large, which is something no language model will ever be able to do.

I know this isn't strictly an LLM, but couldn't there be an "extension" where the LLM learns a formula and how to plug values into it (it already seems pretty good at explaining code, so it can already do this), plus one new ability: actually performing the calculation, by executing the code/formula or at least "using a calculator" with the values?
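
A minimal sketch of that "use a calculator" idea, assuming a made-up convention in which the model writes `CALC(...)` in its output and a wrapper fills in the result. The `CALC` syntax and the names `evaluate` and `run_tools` are hypothetical, chosen only for illustration; this does not describe how any existing system works.

```python
import ast
import operator
import re

# Hypothetical tool-call syntax: the model writes CALC(<expression>) in its text.
CALL_PATTERN = re.compile(r"CALC\(([^()]*)\)")

# Only plain integer arithmetic is allowed, so nothing else can be executed.
OPERATORS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
}


def evaluate(expression: str) -> int:
    """Evaluate an integer expression that uses only +, - and *."""

    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPERATORS:
            return OPERATORS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression.strip(), mode="eval").body)


def run_tools(model_output: str) -> str:
    """Replace every CALC(...) call in the text with its computed value."""
    return CALL_PATTERN.sub(lambda m: str(evaluate(m.group(1))), model_output)


# Python integers are arbitrary precision, so operand size is not a limit here.
print(run_tools("The sum is CALC(99999999999999999999 + 1)."))
```

The model only has to produce the expression; the exact arithmetic is done outside it, which sidesteps the "add any two numbers" objection for this narrow case.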

[+] TimTheTinker|3 years ago|reply
Perhaps learning is best defined as the acquisition of knowledge in the philosophical sense - an increasing collection of justified true beliefs.
[+] lostmsu|3 years ago|reply
Humans cannot add any two numbers either, because they eventually make a mistake, which makes the second definition trivially useless for comparing human intelligence with AI.
[+] optimalsolver|3 years ago|reply
Would there be any benefit in modeling raw binary sequences rather than tokens?

I think text prediction only gets you so far. But I guess you could use the same principles to predict the next symbol in a binary string. If this binary data represents something like videos of physical phenomena, you might get the AI to profound, novel insights about the Universe just with next-bit prediction.

Hmmm, maybe even I could code something like that.
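
For what it's worth, the core of next-bit prediction fits in a few lines. This is a toy sketch that swaps a transformer for a simple order-k counting model, just to show the prediction setup; the class name `NextBitPredictor` and the default context length are assumptions made for illustration.

```python
from collections import defaultdict


class NextBitPredictor:
    """Order-k context model: counts which bit followed each k-bit context."""

    def __init__(self, order: int = 3) -> None:
        self.order = order
        # counts[context] = [times followed by "0", times followed by "1"]
        self.counts = defaultdict(lambda: [0, 0])

    def train(self, bits: str) -> None:
        """Slide over the bit string and record what followed each context."""
        for i in range(self.order, len(bits)):
            context = bits[i - self.order:i]
            self.counts[context][int(bits[i])] += 1

    def predict(self, context: str) -> str:
        """Return the bit seen most often after the last k bits of context."""
        zeros, ones = self.counts.get(context[-self.order:], [0, 0])
        return "1" if ones > zeros else "0"


model = NextBitPredictor(order=3)
model.train("011" * 20)
print(model.predict("011"), model.predict("110"), model.predict("101"))
```

A model like this only memorizes short local patterns; whether a much larger learned model would find physical structure in raw bits is exactly the open question in the comment above.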

[+] dwaltrip|3 years ago|reply
I’m waiting for someone to make a GPT-style model trained for video and audio prediction (e.g. frame by frame, perhaps) in addition to the existing text prediction. Imagine using a significant percentage of YouTube content, for example.

It would probably be insanely expensive. But I feel like it would be almost guaranteed to acquire a world model far richer and more robust than ChatGPT’s.

Human babies learn by watching the world around them. Video frame prediction feels much closer to that than text prediction, and given the wildly impressive results we are seeing with large text prediction models alone, it seems like an obvious next step.

[+] ttul|3 years ago|reply
Google Research has a character-based transformer [1] that learns to tokenize text rather than relying on hand coded tokenizers. It demonstrates superior performance on a variety of LLM tasks.

If you have the money, you can apply the transformer architecture to many different tasks and people are experimenting all the time. I think one of the big challenges is always to come up with methods for training such enormous models pragmatically without cost exploding.

[1] https://huggingface.co/docs/transformers/model_doc/canine

[+] stared|3 years ago|reply
Tokenization for models like GPT or BERT can be seen as compression. That is, frequent words are separate tokens. Frequent sequences are separate tokens. On the other hand, if a sequence is very uncommon, then it will contain many tokens.

Sure, you can encode bit-by-bit. But that is a fixed-length code, which is even worse than character-by-character.

Maybe you only get worse training and inference time. But I wouldn't be surprised if the encoding also serves as a Bayesian prior, and with a different encoding, you get worse results (for given data).
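
To make the compression view concrete, here is a toy sketch of byte-pair-encoding-style merging: repeatedly fuse the most frequent adjacent pair into one token. It is a simplified illustration working on characters, not the actual tokenizer used by GPT or BERT, and the function name `learn_merges` is made up.

```python
from collections import Counter


def learn_merges(text: str, num_merges: int):
    """Greedily merge the most frequent adjacent pair, up to num_merges times.

    Returns (tokens, merges): the final token list and the pairs merged, in order.
    """
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so merging no longer compresses
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = learn_merges("the the the xq", num_merges=3)
print(tokens)
```

The frequent sequence ends up as one token per occurrence, while the rare "xq" stays split into individual characters, which is the variable-length behaviour a fixed bit-by-bit code gives up.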

[+] hooande|3 years ago|reply
It's important to remember that the "power" of gpt doesn't come from the model, but from the sheer scale of the dataset. It's trained on the entire internet, in text form. You can 100% use a transformer architecture to train on binary data. But what data do you have hundreds of tebibytes of?

Language also follows very common and repeatable patterns. "Hello" is often followed by "How are you?", etc. Just like Zipf's Law says that a few words are used vastly more often than all the others, there are linguistic and conceptual patterns that appear with predictable frequency. If your bits don't follow similar rules, the results might not be as clean.
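
Zipf's law is usually stated as a power law: the frequency of the word at rank r falls off roughly as 1/r. A quick way to check this on any text is to build the rank-frequency table, as in the sketch below. The function name `rank_frequency` and the simple regex word splitter are assumptions made for illustration.

```python
import re
from collections import Counter


def rank_frequency(text: str):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


sample = "the cat and the dog and the bird"
for rank, word, count in rank_frequency(sample):
    # Under a Zipf-like distribution, rank * count stays roughly constant
    # on a large corpus; a sentence this short only shows the mechanics.
    print(rank, word, count, rank * count)
```

Running the same function over raw bytes or video frames instead of words would be one rough way to see whether another data source has a similarly skewed distribution.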

I'm pretty sure you could code a transformer to work on binary or video data. Sounds like a great github project. But it's unlikely you'll have the scale of data to do anything close to ChatGPT.

[+] nodemaker|3 years ago|reply
The reason this works for tokens is that tokens are put in a vector space where similar words are in a similar place. The same effect could not be achieved with characters or bits. If you think about it, our brains also remember words and not characters.
[+] decremental|3 years ago|reply
That might be too low(?) resolution. It would be learning encodings instead of features of the thing that is being encoded. Like training it on terabytes of zip files and expecting it to reproduce the files contained in the archives.
[+] tbalsam|3 years ago|reply
There are a couple of massive intuition leaps here (around tokens and the ease with which predicting one modality extends to another), but if you're interested in diving into the field at the place where they're asking questions like this, you could start by looking at the transition from BPE to the tokenizer we have today for the tokenization front, and Perceiver IO for the multimodal generalization front.
[+] visarga|3 years ago|reply
So you are proposing a massive video model, along the lines of GPT-3? The architecture is simple, but making it train correctly and efficiently is really hard, especially for video.
[+] make3|3 years ago|reply
The step from GPT-2 to GPT-3 showed us that any attempt at predicting the behavior of sufficiently scaled-up models is really futile.
[+] gbmatt|3 years ago|reply
We are just robots in a human simulator, reliving our creation.
[+] andreshb|3 years ago|reply
tl;dr by GPT3

Paper contributes to debate about abilities of large language models like GPT-3

Evaluates how well GPT performs on the Turing Test

Examines limits of such models, including tendency to generate falsehoods

Considers social consequences of problems with truth-telling in these models

Proposes formalization of "reversible questions" as a probabilistic measure

Argues against claims that GPT-3 lacks semantic ability

Offers theory on limits of large language models based on compression, priming, distributional semantics, and semantic webs

Suggests that GPT and similar models prioritize plausibility over truth in order to maximize their objective function

Warns that widespread adoption of language generators as writing tools could result in permanent pollution of informational ecosystem with plausible but untrue texts

[+] JakkTrent|3 years ago|reply
I have reservations about several aspects of this article but what sits least well is the substantial conclusions regarding AI compression loss. In short, I disagree.

The "Hill" example in the text is easily understood through the authors' own concept of "text-based, word placement, associative semantics" vs "semantics by definition": we obviously use semantics in the latter sense, hence its definition; the AI doesn't.

Semantic word relationships identified by GPT3 are based on the frequency of words prior to and following a position in a sentence/text – as presented to the AI in the programmed/learned dataset. Another, easier example is when information is known to be untrue. If I include 1,000 written examples of the words “Jack and Jill ran down the volcano”, then my AI will incorrectly answer prompts to finish the nursery rhyme. How many instances of giving users the wrong answer, or how many analyzed writings with the correct “ran down the HILL” text, will it take before my 1,000 false volcano statements cease to be the most probable and likely accepted answer?
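
In a purely count-based toy model the answer is simple: the correct text wins once its count exceeds the false one. The sketch below is such a toy, not how GPT3 works internally (a real model generalizes across contexts rather than looking up exact phrases); the class name `CompletionCounter` is made up for illustration.

```python
from collections import Counter, defaultdict


class CompletionCounter:
    """Predicts the word that most often followed a given context."""

    def __init__(self) -> None:
        self.followers = defaultdict(Counter)

    def observe(self, context: str, next_word: str, times: int = 1) -> None:
        """Record that next_word followed context the given number of times."""
        self.followers[context][next_word] += times

    def complete(self, context: str) -> str:
        """Return the most frequent follower (raises IndexError if context is unseen)."""
        return self.followers[context].most_common(1)[0][0]


model = CompletionCounter()
prompt = "Jack and Jill ran down the"
model.observe(prompt, "volcano", times=1000)
model.observe(prompt, "hill", times=999)
print(model.complete(prompt))  # still the false completion

model.observe(prompt, "hill", times=2)  # 1,001 correct examples in total
print(model.complete(prompt))
```

Neither completion is "true" or "false" to the counter; it only ever reports which one was seen more often.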

So, like the article example, if asked "Where was John Smith born?" the AI sees that it has to answer definitively bc it's been asked a question, so it's going to make a statement it concludes to be the most probable acceptable answer to the prompt. It doesn't see the prompt as a question, the words are not defined as ideas in themselves, nor does the sum of the words present an idea. Word definitions are not really part of the answer process. The AI checks its dataset and knows all the related word examples it's previously identified through its handy token system. That self-controlled tokenization of memory for storage & retrieval further removes this from a human brain-like function/process we can empathize with.

Anyways, the AI knows that most statements following words arranged like the question in the prompt include words/textual identifiers not used similarly in other places in texts – Names! Imagine how it figured out names without understanding definitions conceptually, by using frequency of word composition and grammar structure alone. Even though it “knows” the definition of the word “name”, that definition is just more words – it has no meaning without context. Prompts provide context, not definitions.

There are 3 names in the example prompt question: first/last name and the name/word required as an answer – the location! None of these words does the AI see as a person's name or the name of a place. It sees our question as "word that requires my reply to be a definitive statement" + primary “name word” + secondary “name word” + "born". It knows a different type of name word (a place) is the most plausible next word in sequence because it has a whole token dedicated just to birthdates, with limitless examples. Upon searching its dataset for John Smith, it identifies "Hill" as the name word most often associated with the words “where” “was” “john” “smith” “born” used consecutively. The incorrect city, Hill, makes sense given his academic career obviously generating more digital information than his birth announcement/obituary in the hometown paper.

Regarding the wrong date – the AI was never actually answering a question and never made any statements with intent to be truthful. The incorrect birthdate is simply the most probable date given the incorrect "John Smith born in Hill" statement. It couldn't present the correct date following the word “Hill” bc no such examples existed with a higher probability of being acceptable than the incorrect date given. In fact, given the incorrect semantic links made early in the answer, an incorrect date was most probable.

None of that is compression loss. It's just an AI being an AI, doing exactly what it does. I think it's obvious, based solely on what the authors presented themselves, that it's in fact recalling everything - only arriving at a failed answer due to differing expectations of what the answer was. The AI delivered the most probable reply to the prompt, given the contextual data available to it – the same way it delivers answers we expect and are factually correct. It didn’t draw incorrect conclusions bc it chunked everything it learned up and consequently “lost” some of its “memories” in the process.

Programming AI with facts, or only factual information, doesn't solve the problem at all - operating with only factual data would help it regurgitate a "born in this town on this day" type answer more correctly but only bc the token words identified as correlated, in a factual text, do in fact have actual correlation. That only increases the probability of an AI arriving at a "correct" reply/answer while still using the flawed “logic” that allowed for these errors to occur. An AI that speaks only true statements will still have no actual concept of truth.

“Prediction leads to compression, compression leads to generalization, generalization leads to computer intelligence.” - quote from the article.

I know when ppl do this, memory chunking, we do lose stuff – why do we assume that is true for an AI also? What exactly are they compressing? Our memories are filled with lots and lots of data beyond the reason a memory is a memory. The background clutter, noises of a crowd, cars driving by or what lunch was that day, are not necessary to recall the memory of your first kiss, for example – unless you were in a crowded cafeteria at a racetrack, that might be all you recall then. A kiss, loud crowd, race cars – from an entire day of activity, those highlights will be all that remain in time. We need to do that - even with that feature we still forget important things.

What background noise is an AI having to “chunk” away? Are the parameters for it too broadly set? Narrow them. If it sees too much, we tell it what not to see. If it lacks the capacity to effectively store and utilize information as encountered, then we have failed to create an effective AI.

If an AI “reads” a 500 page paper and tokenizes the data – what makes you so sure it cannot recall exactly all 500 pages from that token alone?

AI compression loss, in a tokenized type system, would have to derive from the further compression of the tokens themselves or failure with the token system.

Just my quick 1,000+ words

Yeah... sry for the book.

Tl;dr -

I find the idea of AI being wrong due to “compression loss” to be a silly concept.

We should avoid humanizing AI and AI learning – all similarity lies on the surface.

Thanks for reading my rant – have a great day! - Jakksen

[+] galaxyLogic|3 years ago|reply
> GPT can competently engage in various semantic tasks. The real reason GPT’s answers seem senseless being that truth-telling is not amongst them

GPT can (only) mimic speech it is trained on. It can sound or read like real world human speakers it is mimicking. But it can not REASON about whether what it is saying is "true" or logically consistent.

It can not reason. Intelligence requires the ability to reason, logically, right? Therefore I posit GPT is not intelligent. Therefore it can not be AI.

I haven't used GPT but I wonder what happens if you ask it to explain its reasoning to you? What happens if you ask it whether it thinks what it says is true and why it thinks so?

[+] astrange|3 years ago|reply
It can and does actually reason about things. It is merely unpredictable how good it is at it and when it seems to gain the ability to do it.
[+] plutonorm|3 years ago|reply
Why do you offer an opinion without spending any time with the thing you are talking about?
[+] kingkawn|3 years ago|reply
Big assumption that anyone’s reasoning carries any weight if a chat bot trained on billions of our interactions shows no signs of doing it