profmonocle | 1 year ago

> if LLM training involves merely reading a dataset, but it is not strictly necessary to copy, or even store it verbatim to be useful, then does it even fall under copyright protection at all?

Copyright covers the creation of derivative works, not just literal copying of the source material.

For instance, imagine I read a novel, then I decide to write my own, unauthorized sequel to it. It's not a literal "copy" of the original material - it's my own original text, but obviously a derivative work of the original material. Under copyright law, that would be infringement - I would be sued if I tried to sell that. (Yes, that means fanfiction is infringing, but most rights holders have wisely decided to look the other way on that, as long as it's non-commercial.)

This is what people who claim AI is infringing are worried about. Not that the AI retains a literal copy of the source material from its training data, but that the trained model can be used to produce a derivative work.

I could write a (crappy) fanfic of the Lord of the Rings without directly referencing the books/movies. And that doesn't mean I have a complete copy of the books/movies in my head - that isn't how memory works. Until now, creating a derivative work without directly using the source material was something only humans could do. This is completely uncharted legal territory.

musicale|1 year ago

LLM-generated book clones (as seen on Amazon and elsewhere) could potentially fall afoul of copyright law in many ways, including: rights to reproduction/substantial similarity; derivative works; adaptation (including translation); distribution; performance and public display (including broadcast or transmission); etc.

AStonesThrow|1 year ago

LLMs don't necessarily need to reproduce their source material to make use of it. They could summarize, analyze, condense, paraphrase, extract statistics or factoids. There's also the question of whether, and how, the models actually store the source material. The full training text can't all live verbatim in the model weights, which are far smaller than the training data, so at the very least it's compressed or abstracted. So any copyright claims will need to get beyond a simplistic allegation of copying, for sure.