top | item 43896220

nilsbunger | 10 months ago

There’s something called a substantial transformation test in copyright law. When you write a summary of a book, you don’t infringe on copyright because it’s a “substantial transformation”. This goes along with the idea that you can copyright the text but not the ideas it expresses.

When model training reads the text and creates weights internally, is that a substantial transformation? I think there’s a pretty strong argument that it is.

TheOtherHobbes | 10 months ago

No transformation is needed.

The point here is that book files have to be copied before they can be used for training. Copyright notices typically say something like "No unauthorised copying or transmission in any form (physical, electronic, etc.)"

Individuals who torrented music and video files have been bankrupted for doing exactly this.

The same laws should apply when a corporation downloads torrent files. What happens to them after they're downloaded is irrelevant to the argument.

If this is enforced (still to be seen...) it would be financially catastrophic for Meta, because there are set damages for works that have been registered for copyright protection - which most trad-pubbed books, and many self-pubbed books, are.

leovander | 10 months ago

> have been bankrupted for doing exactly this.

Only if they seeded the data and some other entity downloaded it, i.e. they hosted the data. In a previous article, I believe it was pointed out that Meta was a leecher (not seeding back what it downloaded).

It's the hosting that gets you, not the act of downloading it.

jayd16 | 10 months ago

This is a leap in the argument. We've gone from "the right to use a work" to "unless the result is identical or close to it, we have full rights to all works."

Seems like a big gap there.

spwa4 | 10 months ago

It's COPYright. It has to be very close to the original to be covered by copyright. Hence the name.

mrgoldenbrown | 10 months ago

Even if you argue the LLMs are merely summarizing content, they still had to illegally download that content in the first place. The model can't read and summarize the texts unless the text was illegally downloaded and copied. Piracy isn't suddenly legal just because you promise to delete the movie you downloaded after watching it.

triceratops | 10 months ago

The counterargument to that is that model training is impossible without making copies. That's not true for humans.

Workaccount2 | 10 months ago

That's not really true. Models train (in a greatly simplified way) by being shown an excerpt and being told to guess the next token. They push their weights around until the token they output matches the next token in the excerpt. Then the excerpt is no longer needed. You can think of it as: the article is loaded, the LLM plays this token-guessing game through it, then the article is discarded. On the face of it this is what happens, though it gets hairier depending on how exactly the process is done. But it is seemingly not far removed from how humans consume content (acquire, read, discard), hence the legal blur.
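The load-update-discard loop described above can be sketched in a toy form. This is a hypothetical illustration, not how real LLM training works: a table of bigram counts stands in for neural network weights, and word-splitting stands in for tokenization, but the data flow is the same, in that the excerpt updates internal state and is then thrown away.

```python
from collections import defaultdict

def train_on_excerpt(model, excerpt):
    """Update the model's internal state (bigram counts) from an excerpt."""
    tokens = excerpt.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        # Training nudges the counts toward the token that actually
        # follows `prev` in the excerpt.
        model[prev][nxt] += 1
    # Nothing else happens to the excerpt: only the updated counts remain.

def predict_next(model, token):
    """Guess the next token based on what was seen during training."""
    followers = model.get(token)
    if not followers:
        return None
    return max(followers, key=followers.get)

model = defaultdict(lambda: defaultdict(int))
train_on_excerpt(model, "the cat sat on the mat")
print(predict_next(model, "sat"))  # prints: on
```

After training, `model` holds statistics derived from the text rather than the text itself, which is the crux of the "is this a copy?" debate, though real models are large enough that they can sometimes reproduce training passages verbatim.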

realusername | 10 months ago

It's also true for humans: you memorize only parts of what you read and see, but you still had to view the whole thing first.

The computer model works differently, of course, but functionally it's the same idea.