top | item 44367969

3PS | 8 months ago

Broadly summarizing.

This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.

This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.

Overall seems like a pretty reasonable ruling?

derbOac|8 months ago

But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine. I guess I fail to see how it's any different from me using it in some other way? If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.

"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)

rcxdude|8 months ago

>If I wanted to write a play very loosely inspired by Blood Meridian, it might be transformative, but that doesn't justify me pirating the book.

I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it, so they are likely liable for that. (Exactly how that interpretation works with copyright law I'm not entirely sure: I know that in some places downloading material is less of a problem than distributing it to others, because the latter is the main thing copyright is concerned with. And AFAIK most companies doing large model training maintain that fair use also extends to their gathering the data in the first place.)

(Fair use isn't just for discussion. It covers a broad range of potential use cases, and AFAIK they're not enumerated precisely in copyright law; there's a complicated body of case law that forms the guidelines for it.)

dragonwriter|8 months ago

> I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission."

That's not what the ruling says.

It says that training a generative AI system (one not designed primarily as a direct replacement for a work) on one or more works is fair use, and that print-to-digital destructive scanning for storage and searchability is fair use.

These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved. (There's maybe a weak practical relationship to the number of works involved, since a gen AI tool trained on exactly one work is probably somewhat less likely to have a real use beyond being a replacement for that work.)

fallingknife|8 months ago

But if you did pirate the book, and let's say it cost $50, and then you used it to write a play based on that book and made $1 million selling that, only the $50 loss to the publisher would be relevant to the lawsuit. The fact that you wrote a non-infringing play based on it and made $1 million would be irrelevant to the case. The publisher would have no claim to it.

comex|8 months ago

The judge actually agreed with your first paragraph:

> This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.

(But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)

tantalor|8 months ago

The analogy to training is not writing a play based on the work. It's more like reading (experiencing) the work and forming memories in your brain, which you can access later.

I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.

protocolture|8 months ago

If I buy a book, and use it to prop up the table on which I build a door, I don't owe the author any additional money over what I paid for it.

If I buy a book, then as long as the product the book teaches me to build isn't a competing book, the original author should have no avenue for complaint.

People are really getting hung up on the computer reading the data and computing other data with it. It shouldn't even need to get to fair use. It's so obviously none of the author's business well before fair use.

klabb3|8 months ago

> But those training the LLMs are still using the works, and not just to discuss them, which I think is the point of fair use doctrine.

Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose to copyright at all, it’s to prevent making money from someone else’s intellectual work. The entire thing is based on economic pragmatism: copying obviously does not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how their work can be used.

Coming to the conclusion that you can "fair use" yourself out of paying for the most critical part of your supply chain makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos become the norm and various forms of DRM grow even more draconian.

Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?

ticulatedspline|8 months ago

Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

In this case, if you hired a bunch of artists/writers who had somehow never seen a Disney movie, and to train them to make crappy Disney clones you made them watch all the movies, it certainly would be legal to do so, but only if they had legit copies in the training room. Pirating the movies would be illegal.

Though the downside is that it does create a training moat. If you want to create the super-brain AI that's conversant in the corpus of copyrighted human literature, you're going to need a training library worth millions.

martin-t|8 months ago

> Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.

Human time is inherently valuable, computer time is not.

The issue with LLMs is that they allow doing things at a massive scale which would previously have been prohibitively time consuming. (You could argue, but then, how much electricity is worth one human life?)

If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work, but as you make it less detectable, the time and effort required grows until it becomes uneconomical. An LLM can do it cheaply. It can mix and match parts of many works, but it's all still a derivative of those works combined. After all, if it weren't, it would produce equally good output with a tiny fraction of the training data.
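The cheapest rung of that ladder can be sketched in a few lines. (A toy illustration of naive word-for-word substitution; the `SYNONYMS` table is made up for the example and is not from the thread or the ruling.)

```python
# Toy sketch: mechanical word-for-word synonym substitution, the crudest
# form of "derivative work" described above. A real attempt would use a
# full thesaurus; this made-up table just shows the mechanism.
SYNONYMS = {
    "large": "big",
    "rapid": "quick",
    "canine": "dog",
    "leaps": "jumps",
}

def synonymize(text: str) -> str:
    """Replace each word with a synonym when one is known, else keep it."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())

print(synonymize("the rapid canine leaps"))  # -> "the quick dog jumps"
```

The point is how cheap this is per word: the cost of producing the "new" text is constant per token, regardless of how much effort the original took to write.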

The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.

That is fundamentally exploitative, whether the current laws accounted for that situation or not.

johnnyanmac|8 months ago

That's part of the issue. I'm not sure if this has happened in visual arts, but there is in fact precedent against hiring a sound-alike of the person you actually want. You can't be in talks with Scarlett Johansson, reject her, and then hire a sound-alike and say "talk like Scarlett". It's pretty clear at that point what you want, but you didn't want to pay talent for it.

I see elements of that here: buying copyrighted works not to be exposed and inspired, nor to utilize the author's talents, but to fuel a commercialization of sound-alikes.

tgv|8 months ago

> Definitely seems reasonable to say "you can train on this data but you have to have a legal copy"

How many copies? They're not serving a single client.

Libraries need to have multiple e-book licenses, after all.

alganet|8 months ago

What you are describing happened and they got sued:

https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...

I'm on the Air Pirates side for the case linked, by the way.

However, AI is not a parody. It's not adding to the cultural expression like a parody would.

Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:

Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.

ninetyninenine|8 months ago

Agreed. If I memorize a book and am deployed into the world to talk about what I memorized, that is not a violation of copyright. Which seems logically reasonable, because this is essentially what an LLM is doing.

layer8|8 months ago

It might be different if you are a commercial product which couldn’t have been created without incorporating the contents of all those books.

Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.

martin-t|8 months ago

Except you can't do it at a massive scale. LLMs both memorize at a scale bigger than thousands, probably millions, of humans AND reproduce at an essentially unlimited scale.

And who gets the money? Not the original author.

bonoboTP|8 months ago

You can talk about it, but you can't sell tickets to an event where you recite from memory all the poems written by someone else without their permission.

LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case: it's not the main value proposition that drives LLM company revenues, it's not the use case that's marketed, and it's not the use case people use them for in real life.

simmerup|8 months ago

Depends on whether you actually agree it's transformative

lesuorac|8 months ago

For textual purposes it seems fairly transformative.

If you train an LLM on Harry Potter and ask it to generate a story that isn't Harry Potter, then it's not a replacement.

However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.

thedevilslawyer|8 months ago

What's the steelman case that it's transformative? Because prima facie, it seems to produce only original, "intelligent" output.

almatabata|8 months ago

If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?

jxdxbx|8 months ago

You don't need a license for most of what people do with traditional, physical copies of copyrighted works: read them, play a DVD at home, etc. Those things are outside the scope of copyright. But you do need a license to make copies, and ebooks generally come with licensing agreements, because to read an ebook you must first make a brand-new copy of it. As a result, physical books just don't have "licenses" to begin with, and if they tried, they'd be unenforceable, since you don't need to "agree" to any "terms" to read a book.

dragonwriter|8 months ago

> If a publisher adds a "no AI training" clause to their contracts, does this ruling render it invalid?

This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding precedent (trial court rulings aren't), such clauses would be as valid afterward as they are today. But contracts only affect people who are parties to the contract.

Also, the damages calculations for breach of contract are different than for copyright infringement; infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike with infringer's profits in copyright, requires showing special circumstances.)

protocolture|8 months ago

Fair Use and similar protections are there to protect the end user from predatory IP holders.

First, I don't think publishers of physical books in the US get the right to establish a contract; the book can be resold, for instance, and that right cannot be diminished. Second, adding more cruft to the distribution of something that the end user has a right to transform isn't going to diminish that right.

heavyset_go|8 months ago

Fair use overrides licensing

bananapub|8 months ago

what contract? with who?

Meta at least just downloaded ENGLISH_LANGUAGE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.

doctorpangloss|8 months ago

It’s similar to the Google Books ruling, which Google lost. Anthropic also lost. TechCrunch and others are very aspirational here.

SoKamil|8 months ago

What if I overfit my LLM so it spits out copyrighted work with special prompting? Where to draw the line in training?

bonoboTP|8 months ago

If you do something else, the result may be something else. The line is drawn by the application of subjective common sense by the judge, just as it is every time.

ninetyninenine|8 months ago

I mean the human brain can memorize things as well and it’s not illegal. It’s only illegal if said memorized thing is distributed.

veggieroll|8 months ago

BRB, I'm going to download all the TV shows and movies to train my vision model. Just to be sure it's working properly, I have to watch some for debugging purposes.

ncruces|8 months ago

You need to buy one copy of each for the fair use to apply.