top | item 40047575


inference-lord | 1 year ago

I recently saw Hinton give a talk where he very, very, very excitedly and confidently gave us an example to demonstrate how incredibly intelligent and creative LLMs are.

He asked an LLM a question, but he didn't give us the answer. He let us have time to answer it ourselves. Personally I knew the answer instantly. He gave us the answer and sort of assumed no one would've known it, and then used it as justification for how smart these systems are. It honestly didn't feel very reassuring to me, and honestly, I'd be surprised if it wasn't a topic covered somewhere on the internet before. With all due respect to Mr Hinton, I felt it showed his age a bit.

What is difficult about Hinton's statements is that he can't really give evidence to back up these sorts of claims. How do you measure how much a person knows, and how do you objectively measure how much an LLM knows? How smart is an LLM? You can't really know. It seems almost rhetorical. How many notes in a saxophone?

We can make observations but that's not a great way to measure anything precisely.

There is a limit to language, and I think this is one of those topics where that limit is touched or even breached. I don't even know if "intelligence" is a sufficient word to describe what's going on with these systems. It's the best word we have, but it doesn't seem to adequately describe what we're observing.



williamcotton|1 year ago

>> How do you measure how much a person knows, and how do you objectively measure how much an LLM knows?

Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can.

It is very easy to measure knowledge: test the subject.

Personally, I can’t ever imagine scoring higher on a general knowledge test than a contemporary LLM.

Also, I don’t know of any humans that can run as fast as a car so I don’t know why any of this is surprising or farfetched.

inference-lord|1 year ago

I think you misinterpreted what I mean.

I'm not saying that they can't be more capable, I'm saying the guy can get a little overly excited about things which are hard to measure or quantify.

We're observing these systems and making up our own interpretations about how good they are at certain tasks, but it's not really easy to measure how much better or worse these things can be overall.

Your example about language translation is a good example of where these things aren't really "better", just different. I speak multiple languages, and while these systems are fantastic, they can fail in ways a professional translator wouldn't, and they don't seem to automatically know they failed and should fix themselves.

The car example is also great because it again proves my point. We can easily measure a car and a person and work out that a car is faster, but we can also see that a car can't walk. So it's faster, but it's also entirely different.

YeGoblynQueenne|1 year ago

>> Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can.

See, translation is exactly the kind of domain where there are no good measures of performance and where performance is open to subjective interpretation, and a lot of it. That's because we don't know what is a "good translation" and, crucially, machine translation systems and language models have not helped us find out.

The way machine translation systems are evaluated is generally by a metric based on similarity to an arbitrarily chosen "gold standard" translation. What that means in practice is that we have some corpus of parallel texts, we train a machine translation system on part of the corpus, and then test it on the held-out test set. The way we test is that we take each e.g. sentence in a text translated by the system and compare it, as a bag-of-words or a set of n-grams, to the corresponding sentence in the reference translation. If there is a high amount of overlap, the system scores highly. That's how BLEU scores work, and similar metrics like ROUGE.
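To make the overlap idea concrete, here's a toy sketch of the clipped n-gram precision at the core of BLEU (this is a simplification, not the full metric: no brevity penalty, no smoothing, no geometric mean over n-gram orders, and the sentences are my own made-up examples):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram earns credit
    at most as many times as it appears in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    if not cand:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

reference = "the cat sat on the mat"
close = "the cat sat on a mat"          # near-verbatim rewording
paraphrase = "a feline rested upon a rug"  # same meaning, no word overlap

print(overlap_precision(close, reference, 1))       # 5/6 ~ 0.83
print(overlap_precision(paraphrase, reference, 1))  # 0.0
```

The paraphrase scores zero despite being a perfectly reasonable translation, which is exactly the arbitrariness being described: the metric rewards surface similarity to one chosen reference, not correctness.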

It is important to note how arbitrary this metric is: out of all possible translations we choose one to be the "reference" translation and compare machine translations to it. The only accepted alternative is eyeballing, where we give the machine translation to a bunch of humans and ask them how they feel about it.

My point is that we don't know how to measure knowledge, and language models are trained to maximise similarity, not knowledge. So there's no way to go from observations of their behaviour to a measure of their knowledge. All you can say about a language model is that it is good, or bad, at generating text that's similar to its training corpus. Everything else is an assumption.