Can someone with an actual, fundamental understanding of LLMs explain to me why they think it's perfectly legal to train models on copyrighted material? I don't know enough about this. Please don't answer by asking ChatGPT.
Consider how commercial search engines are allowed to show text snippets, thumbnails, and site caches.
Ukv|2 years ago
AI developers will most likely rely on a Fair Use defense. I think this has a reasonable chance of success since, while the use of a given copyrighted work may affect the market for that work (in this case NYT's article), it can be argued to be highly transformative usage. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".
There's also potential for an "implied license", as in Field v. Google Inc for rehosting a snapshot of a site, where "Google reasonably interpreted absence of meta-tags as permission to present 'Cached' links to the pages of Field's site". As far as I can tell in this case, NYT's robots.txt of the time was obeyed, which permitted automated processing of all but one specific article for some reason.
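For context, both signals in that case are machine-readable. A hypothetical robots.txt in the spirit of what's described above, permitting all crawling except one specific article (the path is invented for illustration), would look like:

```
# Hypothetical robots.txt: allow all crawlers everywhere
# except one specific article path (path made up for illustration)
User-agent: *
Disallow: /2006/01/29/some-article.html
```

The "meta-tags" referenced in Field v. Google are page-level directives such as `<meta name="robots" content="noarchive">`, which tell search engines not to serve a cached copy of the page; the court noted Field had omitted them.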
Why do you think it is legal to train students on copyrighted material? Copyright is supposed to protect against unauthorized reproduction, not unauthorized learning. That the NY Times is able to show some verbatim reproduction is a real legal issue, but it should not be extended to training generally.
Students are humans. LLMs are not. Machine "learning" is a metaphor, not what's actually happening. Stop anthropomorphizing, and show some loyalty to your species.
I think your question is incorrect. It's very likely no one thinks it's perfectly legal. There are probably many people who think it's not a big deal, though. Try coming up with a dataset that doesn't have any copyrighted material in it. Like, seriously, try. You can't use pretty much anything newer than a century old. Everything is copyrighted by default. Very few new things are explicitly in the public domain or licensed in a way that would allow usage. Now imagine LLMs trained on early 20th-century newspapers, books and letters. Do you think it would be good at generating code or hip copy for the homepage of your next startup?
everforward|2 years ago
> Now imagine LLMs trained on early 20th-century newspapers, books and letters. Do you think it would be good at generating code or hip copy for the homepage of your next startup?
Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.
That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.
It's right around when women got the right to vote.
Trying to get it to not say anything horrible under modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.
londons_explore|2 years ago
1. Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music, and later to write a book about music which likely includes some of the concepts you earlier learned.
2. Neither the LLM nor the output text contains sufficient elements of the copyrighted work to constitute infringement. Just like if you turned old library books into compost and sold the compost: you wouldn't expect to pay the authors of those books a royalty on the compost sales.
madeofpalk|2 years ago
> Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music
If you learn a little too hard, though, and reproduce the original textbook in its entirety, you'll get in trouble.
My guess is that courts will determine that the training itself is not illegal, but that either the AI companies or their users are liable for reproducing copyrighted work in the output, and no one will want to hold that liability.
Fair use, probably. How many news pieces have you read that amount to "The New York Times reports..." followed by a summary of the Times' article? It's not illegal to use copyrighted works as a source, as inspiration, or to guide style.
michaelmrose|2 years ago
Surely. Remember when the VCR came out, some parties absolutely freaked out, and Jack Valenti said:
"I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone."
Then we invented, from whole cloth, reasons why VCRs were perfectly OK: there was a ton of money to be made, everyone would actually be better off if the VCR was a thing, and everyone knew it, because the question ended up being argued only after millions of VCRs were already in households.
Read about the 'fair use' doctrine and put yourself in the shoes of someone who is training a model, and see if you can argue, from their perspective, why it should be allowed.
HumblyTossed|2 years ago
Probably. The question for the courts to decide, then, is how much use is considered fair use.
tmikaeld|2 years ago
Isn't this what Mistral AI did?
foogazi|2 years ago
Who owns the copyright then?
FpUser|2 years ago
They're just on a hunt for some extra money.