top | item 41012296

agucova | 1 year ago

This isn't really true. LLMs do discriminate actual truth (though perhaps not perfectly). Other similar studies suggest that they can differentiate, say, between commonly held misconceptions and scientific facts, even while repeating a misconception in context. This suggests models are at least sometimes aware when they're bullshitting or spreading a misconception, even if they don't communicate it.

This makes sense, since you would expect LLMs to perform better when they can differentiate falsehoods from truths, as it's necessary for some contextual prediction tasks (say, predicting Snopes.com, or predicting what a domain expert would say about topic X).

lukev|1 year ago

> LLMs are discriminating actual truth

No. They are functions of their training data. There is absolutely no part of a LLM that functions as a truth oracle.

If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.

That is, emphatically, a very different thing from "truth."

agucova|1 year ago

> If training data contains multiple conflicting perspectives on a topic, the LLM has a limited ability to recognize that a disagreement is present and what types of entities are more likely to adopt which side. That is what those studies are reflecting.

Again, we have empirical evidence to suggest otherwise. It's not that there's an oracle, but that the LLM does internally differentiate between facts it has stored as simple truth vs. misconceptions vs. fiction.

This becomes obvious when interacting with popular LLMs: they can produce decent essays explaining different perspectives on various issues, and it makes sense that they can, because if you need to predict tokens on the internet, you'd better be able to take on different perspectives.

Hell, we can even intervene on these internal mechanisms to elicit true answers from a model, in contexts where you would otherwise expect the LLM to output a misconception. To quote a recent paper, "Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface" [1], and this matches the rest of the interpretability literature on the topic.

[1]: Li et al., "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model", https://arxiv.org/pdf/2306.03341
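To make the probing idea behind [1] concrete, here's a toy sketch — not the paper's actual method, models, or data. Synthetic vectors stand in for transformer hidden states, with a made-up "truth direction" baked into them, and a simple logistic-regression probe learns to separate activations for true vs. false statements. All names and numbers here (DIM, the 0.8 signal strength, etc.) are invented for illustration.

```python
import math
import random

# --- Synthetic stand-in for hidden states -------------------------------
# Assumption: activations for true statements contain a small component
# along a fixed "truth direction"; false statements point the other way.
random.seed(0)
DIM = 16
truth_dir = [random.gauss(0, 1) for _ in range(DIM)]

def fake_activation(is_true):
    # Gaussian noise plus a signed component along the truth direction.
    sign = 1.0 if is_true else -1.0
    return [random.gauss(0, 1) + 0.8 * sign * t for t in truth_dir]

labels = [bool(i % 2) for i in range(400)]
data = [(fake_activation(y), y) for y in labels]

# --- Linear probe: logistic regression trained with plain SGD -----------
w = [0.0] * DIM
b = 0.0
lr = 0.05
for _ in range(50):
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        z = max(-30.0, min(30.0, z))       # numerical safety for exp()
        p = 1.0 / (1.0 + math.exp(-z))     # predicted P(true)
        g = p - (1.0 if y else 0.0)        # gradient of log-loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

# Training-set accuracy; a real probe would be evaluated held-out.
correct = sum(
    (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == y
    for x, y in data)
accuracy = correct / len(data)
print(f"probe accuracy: {accuracy:.2f}")
```

If an analogous probe on real activations beats chance, the model is representing something truth-correlated internally even when its sampled output repeats a misconception — which is the gist of the claim above.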