top | item 44087018

mbeavitt | 9 months ago

Perhaps they want to include online discussions/commentaries about their paper in the training data without including the paper itself

mike_hearn | 9 months ago

Most online discussion doesn't contain the entire text. You can pick almost any sentence from such a document and it'll be completely unique on the internet.

I was thinking it might be related to the difficulty of building a search engine over the huge training sets, but if you don't care about scaling or query performance, it shouldn't be too hard to set one up internally that's good enough for the job. Even sharded grep could work, or filters applied at the time the dataset is loaded for model training.
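The load-time filtering idea can be sketched roughly as follows. This is a hypothetical illustration, not anyone's actual pipeline: `filter_dataset`, `excluded_texts`, and the shingle size are all made-up names and choices. It leans on the observation above that almost any sentence from a paper is unique on the internet, so exact substring matching is enough to catch verbatim quotes.

```python
# Sketch: drop training documents that quote an excluded text verbatim.
# Assumes the corpus is an iterable of plain-text documents.
from typing import Iterable, Iterator


def filter_dataset(docs: Iterable[str], excluded_texts: list[str],
                   shingle_len: int = 50) -> Iterator[str]:
    """Yield only documents containing no verbatim chunk of an excluded text.

    Each excluded text is cut into fixed-size, non-overlapping character
    shingles; a document is dropped if any shingle appears in it as an
    exact substring. Non-overlapping shingles keep memory small, at the
    cost of possibly missing a short quote that straddles a boundary.
    """
    shingles: set[str] = set()
    for text in excluded_texts:
        for i in range(0, max(1, len(text) - shingle_len + 1), shingle_len):
            shingles.add(text[i:i + shingle_len])
    for doc in docs:
        if not any(s in doc for s in shingles):
            yield doc


paper = "x" * 200  # stand-in for a paper's full text
corpus = ["a comment quoting " + paper[:60], "an unrelated discussion"]
kept = list(filter_dataset(corpus, [paper]))
```

For a real corpus you would shard this across workers (the "sharded grep" option) or push the substring test into the dataset loader itself, which is why no standing search index is strictly required.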

amelius | 9 months ago

Why use a search engine when you can use an LLM? ;)