top | item 38817206

grantseltzer | 2 years ago

Can someone with actual fundamental understanding of LLMs explain to me why they think it's perfectly legal to train models on copyrighted material? I don't know enough about this. Please don't answer by asking chatgpt.

Ukv|2 years ago

Consider how commercial search engines are fine to show text snippets, thumbnails and site caches.

AI developers will most likely rely on a Fair Use defense. I think this has a reasonable chance of success since, while the use of a given copyrighted work may affect the market for that work (in this case NYT's article), it can be argued to be highly transformative usage. As in Campbell v. Acuff-Rose Music: "The more transformative the new work, the less will be the significance of other factors", defined as "whether the new work merely 'supersede[s] the objects' of the original creation [...] or instead adds something new".

There's also potential for an "implied license", as in Field v. Google Inc for rehosting a snapshot of a site, where "Google reasonably interpreted absence of meta-tags as permission to present 'Cached' links to the pages of Field's site". As far as I can tell in this case, NYT's robots.txt of the time was obeyed, which permitted automated processing of all but one specific article for some reason.
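To make the robots.txt point concrete: a well-behaved crawler checks each URL against the site's robots.txt rules before fetching, which is the behavior the "implied license" argument leans on. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the rules and URLs are hypothetical, standing in for the situation described above where everything was crawlable except one specific article.

```python
from urllib import robotparser

# Hypothetical robots.txt: everything is allowed for all
# user agents except one specific article path.
robots_txt = """\
User-agent: *
Disallow: /2006/01/29/example-disallowed-article.html
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler consults can_fetch() before every request.
print(rp.can_fetch("MyCrawler", "https://example.com/2023/12/27/some-article.html"))
print(rp.can_fetch("MyCrawler", "https://example.com/2006/01/29/example-disallowed-article.html"))
```

Under these rules the first check returns True and the second False; a crawler that skips disallowed paths is "obeying robots.txt" in the sense the comment describes.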

HumblyTossed|2 years ago

> AI developers will most likely rely on a Fair Use defense.

Probably. The question for the courts to decide, then, is how much use is considered fair use.

jibe|2 years ago

Why do you think it is legal to train students on copyrighted material? Copyright is supposed to protect against unauthorized reproduction, not unauthorized learning. That the NY Times is able to show some verbatim reproduction is a real legal issue, but that should not be extended to training generally.

arduanika|2 years ago

Students are humans. LLMs are not. Machine "learning" is a metaphor, not what's actually happening. Stop anthropomorphizing, and show some loyalty to your species.

pointlessone|2 years ago

I think your question is incorrect. It’s very likely no-one thinks it’s perfectly legal. There probably are many people who think it’s not a big deal, though. Try coming up with a dataset that doesn’t have any copyrighted material in it. Like seriously try. You can’t use pretty much anything newer than a century old. Everything is copyrighted by default. Very few new things are explicitly in the public domain or licensed in a way that would allow usage. Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think it would be good at generating code or hip copy for the homepage of your next startup?

everforward|2 years ago

> Now imagine LLMs trained on early 20th century newspapers, books and letters. Do you think it would be good at generating code or hip copy for the homepage of your next startup?

Not sure about the rest of the world, but at least for US content I don't think any company would publish that LLM.

That's like 40 years before the civil rights movement, and right about the time of the Tulsa massacre.

It's right around when women got the right to vote.

Trying to get it to not say anything horrible under modern standards seems fraught with issues. I don't know if it would even understand something like "don't be racist", given the context it was trained on.

tmikaeld|2 years ago

> Try coming up with a dataset that doesn’t have any copyrighted material in them.

Isn't this what Mistral AI did?

londons_explore|2 years ago

I think the main arguments are:

1. Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music, and later to write a book about music which likely includes some of the concepts you earlier learned.

2. Neither the LLM nor the output text contains sufficient elements of the copyrighted work for copyright protection to apply. Just as if you turned old library books into compost and sold the compost, you wouldn't expect to pay the authors of those books a royalty on the compost sales.

madeofpalk|2 years ago

> Training an LLM is akin to human learning. It is legal to read a textbook about music to learn music

If you learn a little too hard, though, and reproduce the original textbook in its entirety, you'll get in trouble.

My guess is that courts will determine that the training itself is not illegal, but that either the AI companies or the users will be found liable for reproducing copyrighted work in the output, and no one will want to hold liability for that.

HDThoreaun|2 years ago

I feel like there’s no way argument 1 will fly. Very soon AI and humans will explicitly have to follow different laws, because they operate very differently.

formercoder|2 years ago

I, a human, can read a copyrighted work and then write a new work and own the copyright on that new work as long as it is not substantially the same.

foogazi|2 years ago

What if you produce a substantially similar work ?

Who owns the copyright then ?

Manuel_D|2 years ago

Fair use, probably. How many news pieces have you read that amount to, "The New York Times reports..." followed by a summary of the Times' article? It's not illegal to use copyrighted works as a source, as inspiration, or to guide style.

michaelmrose|2 years ago

Surely. Remember when the VCR came out and some parties absolutely freaked out and Jack Valenti said

"I say to you that the VCR is to the American film producer and the American public as the Boston strangler is to the woman home alone."

Then we invented, from whole cloth, reasons why VCRs were perfectly OK: there was a ton of money to be made, and everyone would actually be better off if the VCR was a thing. Everyone knew it, too, because the case ended up being argued after millions of VCRs were already in households.

Vvector|2 years ago

I was thinking of the VCR as well. SCOTUS ruled that "the technology in question had significant non-infringing uses" making VCRs legal.

endisneigh|2 years ago

It’s currently neither explicitly legal nor explicitly illegal. There are lawsuits in progress to determine that very answer.

MPSimmons|2 years ago

Read about the 'fair use' doctrine and put yourself in the shoes of someone who is training a model, and see if you can argue, from their perspective, why it should be allowed.

FpUser|2 years ago

We all "train" ourselves on copyrighted materials and later use (or don't use) the gained knowledge for our own benefit, be it financial or pleasure.

They're just on a hunt for some extra money.

arduanika|2 years ago

Humans aren't computers. Come on, people.

mmh0000|2 years ago

I'm essentially a meat-based LLM and I'm trained almost exclusively on copyrighted material (most of which I pirated).