
agucova | 1 year ago

> Now, we can see from this description that nothing about the modeling ensures that the outputs accurately depict anything in the world. There is not much reason to think that the outputs are connected to any sort of internal representation at all.

This is just wrong. Accurate modelling of language at the scale of modern LLMs requires these models to develop rich world models during pretraining, which also requires distinguishing facts from fiction. This is why bullshitting happens less with better, bigger models: the simple answer is that they just know more about the world, and can also fill in the gaps more efficiently.

We have empirical evidence here: it's even possible to peek inside a model and check whether it 'thinks' what it's saying is true or not. From “Discovering Latent Knowledge in Language Models Without Supervision” (Burns et al., 2022) [1]:

> Specifically, we introduce a method for accurately answering yes-no questions given only unlabeled model activations. It works by finding a direction in activation space that satisfies logical consistency properties, such as that a statement and its negation have opposite truth values. (...) We also find that it cuts prompt sensitivity in half and continues to maintain high accuracy even when models are prompted to generate incorrect answers. Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don't have access to explicit ground truth labels.

So when a model is asked to generate an answer it knows is incorrect, its internal state still tracks the truth value of the statements. This doesn't mean the model can't be wrong about what it thinks is true (or that it won't try to fill in the gaps incorrectly, essentially bullshitting), but it does mean that its world model is sensitive to truth.
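For intuition, here's a toy sketch of the unsupervised probe idea from that paper (Contrast-Consistent Search): train a linear probe on paired activations for a statement and its negation, with no labels, using only the constraints that the two probabilities should sum to one (consistency) and be far from 0.5 (confidence). Everything below — the dimensions, the planted "truth direction", the training details — is synthetic and made up for illustration; it is nothing like the paper's actual experiments on LLM hidden states.

```python
import math
import random

random.seed(0)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Toy "activations": a hidden truth direction in 4-D plus Gaussian noise.
# (Real CCS reads hidden states from an LLM; this is a stand-in.)
DIM = 4
truth_dir = [1.0, -0.5, 0.3, 0.8]

def activation(is_true):
    sign = 1.0 if is_true else -1.0
    return [sign * t + random.gauss(0, 0.3) for t in truth_dir]

# Each example: activations for a statement (x_plus) and its negation (x_minus).
labels = [random.random() < 0.5 for _ in range(200)]   # held out, never trained on
pairs = [(activation(y), activation(not y)) for y in labels]

w = [random.gauss(0, 0.1) for _ in range(DIM)]
b = 0.0

def probe(x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Minimize consistency + confidence loss by full-batch gradient descent:
#   L = (p(x+) - (1 - p(x-)))^2 + min(p(x+), p(x-))^2
lr = 0.5
for step in range(1000):
    gw = [0.0] * DIM
    gb = 0.0
    for x_plus, x_minus in pairs:
        p_plus, p_minus = probe(x_plus), probe(x_minus)
        s_plus = p_plus * (1 - p_plus)     # sigmoid derivatives
        s_minus = p_minus * (1 - p_minus)
        # Consistency term: p(x+) + p(x-) should equal 1.
        c = 2 * (p_plus + p_minus - 1)
        # Confidence term: push the smaller probability toward 0.
        if p_plus < p_minus:
            conf_x, conf_s, conf_p = x_plus, s_plus, p_plus
        else:
            conf_x, conf_s, conf_p = x_minus, s_minus, p_minus
        for i in range(DIM):
            gw[i] += c * (s_plus * x_plus[i] + s_minus * x_minus[i])
            gw[i] += 2 * conf_p * conf_s * conf_x[i]
        gb += c * (s_plus + s_minus) + 2 * conf_p * conf_s
    n = len(pairs)
    w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
    b -= lr * gb / n

# Evaluate against the held-out labels; the probe's sign is arbitrary,
# so take the better of the two orientations.
preds = [((probe(xp) + 1 - probe(xm)) / 2) > 0.5 for xp, xm in pairs]
acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
acc = max(acc, 1 - acc)
print(f"unsupervised probe accuracy: {acc:.2f}")
```

The point of the toy: no label ever enters training, yet the only linear direction that is both confident and consistent is the one separating statements from their negations, so the probe recovers "truth" (up to sign) anyway.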

More broadly, we do know these models have rich internal representations, and have started learning how to read them. See for example “Language Models Represent Space and Time” (Gurnee & Tegmark, 2023) [2]:

> We discover that LLMs learn linear representations of space and time across multiple scales. These representations are robust to prompting variations and unified across different entity types (e.g. cities and landmarks). In addition, we identify individual "space neurons" and "time neurons" that reliably encode spatial and temporal coordinates. While further investigation is needed, our results suggest modern LLMs learn rich spatiotemporal representations of the real world and possess basic ingredients of a world model.
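The "linear representations" claim is operationalized with linear probes: fit a linear map from a layer's activations to, say, a landmark's coordinates, and check how well it generalizes. Here's a minimal sketch of that readout on entirely synthetic "activations" with a planted longitude direction — the data, dimensions, and scaling are invented for illustration, not taken from the paper:

```python
import random

random.seed(1)

DIM = 8
N = 300
# Hypothetical setup: each "city" has a longitude, and its synthetic
# activation embeds that longitude linearly along a hidden direction,
# plus noise. (A real probe would use actual LLM hidden states.)
hidden_dir = [random.gauss(0, 1) for _ in range(DIM)]

def fake_activation(y):
    return [y * h + random.gauss(0, 0.3) for h in hidden_dir]

# Targets scaled to [-1, 1] (think longitude / 180) to keep gradients tame.
ys = [random.uniform(-1, 1) for _ in range(N)]
X = [fake_activation(y) for y in ys]

# Fit a linear probe  y_hat = w.x + b  by full-batch least-squares descent.
w = [0.0] * DIM
b = 0.0
lr = 0.1
for step in range(500):
    gw = [0.0] * DIM
    gb = 0.0
    for x, y in zip(X, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
        for i in range(DIM):
            gw[i] += err * x[i]
        gb += err
    w = [wi - lr * gi / N for wi, gi in zip(w, gw)]
    b -= lr * gb / N

# R^2 near 1 means the coordinate is linearly decodable from activations.
preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]
mean_y = sum(ys) / N
ss_res = sum((p - y) ** 2 for p, y in zip(preds, ys))
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
print(f"linear probe R^2: {r2:.3f}")
```

A high R² for a *linear* probe is the interesting part: it means the feature isn't just recoverable in principle but sits in a single direction of activation space, which is what the paper means by a linear representation.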

For anyone curious, I can recommend the Othello-GPT write-up as a good introduction to this problem (“Do Large Language Models learn world models or just surface statistics?”) [3].

[1]: https://arxiv.org/abs/2212.03827

[2]: https://arxiv.org/abs/2310.02207

[3]: https://thegradient.pub/othello/


lukev | 1 year ago

Nuance: they aren't sensitive to "truth", they are sensitive to recurring or consistent input in the training data.

agucova | 1 year ago

This isn't really true. LLMs discriminate actual truth (though perhaps not perfectly), not just frequency in the training data. Other similar studies suggest they can differentiate, say, between commonly held misconceptions and scientific facts, even while repeating the misconception in context. This suggests models are at least sometimes aware that they're bullshitting or spreading a misconception, even if they don't communicate it.

This makes sense: you would expect LLMs to perform better when they can differentiate falsehoods from truths, since that's necessary for some contextual prediction tasks (say, predicting Snopes.com, or predicting what a domain expert would say about topic X).

gaganyaan | 1 year ago

In the context of LLMs, that is truth, as anyone would understand it. Humans are also sensitive to recurring or consistent input in our training data.