fuzzbazz | 8 months ago

From a quick web search I can find that there are book review sites that allow users to enter and rate verbatim "quotes" from books. This one [1] contains ~2000 [2] portions of a sentence, a paragraph or several paragraphs of Harry Potter and the Sorcerer's Stone.

Is it plausible that an LLM ingested parts of the book by scraping web pages like this, rather than the full copyrighted book, and still produced results similar to those of the linked study?

[1] https://www.goodreads.com/work/quotes/4640799-harry-potter-a...

[2] ~30 portions x 68 pages

paxys|8 months ago

Meta has trained on LibGen so we don't really need to speculate.

https://www.wired.com/story/new-documents-unredacted-meta-co...

aprilthird2021|8 months ago

This is in fact mentioned and addressed in the article. Also, there is pretty clear-cut evidence that Meta knowingly used pirated book datasets to train the earlier Llama models.

aspenmayer|8 months ago

Sure, why not? lol

https://www.reddit.com/r/DataHoarder/comments/1entowq/i_made...

https://github.com/shloop/google-book-scraper

That Meta torrented Books3 and other datasets seems to be established by the admissions of the Meta employees who did the work or oversaw those who did, so it is not really in dispute or ambiguous.

https://torrentfreak.com/meta-admits-use-of-pirated-book-dat...

redox99|8 months ago

Books3 was used in Llama1. We don't know if they used it later on.