declaredapple | 1 year ago
FWIW, asking LLMs about their training data is HEAVILY prone to inaccurate responses. They generally aren't told exactly what they were trained on, so the response is completely made up: they're predicting the next token based on their training data, without knowing what that data was - if that makes any sense.
Let's say a model was trained only on the book 1984. Its response will be based on what text would most likely come next in the book 1984 - and if that book doesn't contain "This text is a fictional book called 1984", and is instead just the story, then the LLM will complete text as if we were still in that book.
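To make that concrete, here's a rough toy sketch (not how a real LLM works internally, just the same "predict the next token" idea shrunk down to a word-level bigram counter). The corpus is a few made-up story-like sentences standing in for the book, and `train_bigrams` / `complete` are hypothetical helper names I picked for illustration. Ask it what it was trained on and it can only keep going with story text, because that's all it has ever seen:

```python
from collections import Counter, defaultdict

# Assumption: a tiny stand-in corpus, NOT the actual book text. Note that
# nothing in it says what the text is or where it came from.
CORPUS = (
    "the clocks were striking thirteen . "
    "winston smith slipped quickly through the glass doors . "
    "the hallway smelt of boiled cabbage ."
)

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def complete(follows, prompt, max_new_words=6):
    """Greedily append the most frequent next word seen in training."""
    # Back off to the last prompt word the model has actually seen.
    known = [w for w in prompt.lower().split() if w in follows]
    if not known:
        return ""  # nothing in the prompt overlaps with the training text
    current = known[-1]
    out = []
    for _ in range(max_new_words):
        if current not in follows:
            break
        current = follows[current].most_common(1)[0][0]
        out.append(current)
    return " ".join(out)

model = train_bigrams(CORPUS)
# The only overlap with the prompt is the word "were", so the "answer"
# is just a continuation of the story from that word onward.
print(complete(model, "What were you trained on?"))
```

Every word it prints comes straight out of the corpus, and it never names its source, since the source never names itself.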
tl;dr - LLMs complete text based on what they're trained on. They don't have actual self-awareness and don't know what they were trained on, so they'll happily make something up.
EDIT: Just to elaborate further - the "innocent" purpose of this could simply be to prevent the model from confidently making up answers about its training data, since it doesn't know what its training data was.
wodenokoto | 1 year ago
Hardly any of the training data appears in the context of the phrase “training data”, unless Databricks is enriching its data with such words.