(no title)
3PS | 8 months ago
This is OK and fair use: Training LLMs on copyrighted work, since it's transformative.
This is not OK and not fair use: pirating data, or creating a big repository of pirated data that isn't necessarily for AI training.
Overall seems like a pretty reasonable ruling?
derbOac|8 months ago
I tend to think copyright should be extremely limited compared to what it is now, but to me the logic of this ruling is illogical other than "it's ok for a corporation to use lots of works without permission but not for an individual to use a single work without permission." Maybe if they suddenly loosened copyright enforcement for everyone I might feel differently.
"Kill one man, and you are a murderer. Kill millions of men, and you are a conqueror." (An admittedly hyperbolic comparison, but similar idea.)
rcxdude|8 months ago
I think that's the conclusion of the judge. If Anthropic were to buy the books and train on them, without extra permission from the authors, it would be fair use, much like if you were to be inspired by it (though in that case, it may not even count as a derivative work at all, if the relationship is sufficiently loose). But that doesn't mean they are free to pirate it either, so they are likely to be liable for that (exactly how that interpretation works with copyright law I'm not entirely sure: I know in some places that downloading stuff is less of a problem than distributing it to others because the latter is the main thing that copyright is concerned with. And AFAIK most companies doing large model training are maintaining that fair use also extends to them gathering the data in the first place).
(Fair use isn't just for discussion. It covers a broad range of potential use cases, and they're not enumerated precisely in copyright law AFAIK, there's a complicated range of case law that forms the guidelines for it)
dragonwriter|8 months ago
That's not what the ruling says.
It says that training a generative AI system not designed primarily as a direct replacement for a work on one or more works is fair use, and that print-to-digital destructive scanning for storage and searchability is fair use.
These are both independent of whether one person or a giant company or something in between is doing it, and independent of the number of works involved (there's maybe a weak practical relationship to the number of works involved, since a gen AI tool that is trained on exactly one work is probably somewhat less likely to have a real use beyond a replacement for that work.)
fallingknife|8 months ago
comex|8 months ago
> This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded.
(But the judge continued that "this order need not decide this case on that rule": instead he made a more targeted ruling that Anthropic's specific conduct with respect to pirated copies wasn't fair use.)
tantalor|8 months ago
I'm allowed to hear a copyrighted tune, and even whistle it later for my own enjoyment, but I can't perform it for others without license.
protocolture|8 months ago
If I buy a book, and as long as the product the book teaches me to build isnt a competing book, the original author should have no avenue for complaint.
People are really getting hung up on the computer reading the data and computing other data with it. It shouldnt even need to get to fair use. Its so obviously none of the authors business well before fair use.
klabb3|8 months ago
Worse, they’re using it for massive commercial gain, without paying a dime upstream to the supply chain that made it possible. If there is any purpose of copyright at all, it’s to prevent making money from someone’s else’s intellectual work. The entire thing is based on economic pragmatism, because just copying does obviously not deprive the creator of the work itself, so the only justification in the first place is to protect those who seek to sell immaterial goods, by allowing them to decide how it can be used.
Coming to the conclusion that you can ”fair use” yourself out of paying for the most critical part of your supply makes me upset for the victims of the biggest heist of the century. But in the long term it can have devastating chilling effects, where information silos will become the norm, and various forms of DRM will be even more draconian.
Plus, fair use bypasses any licensing, no? Meaning even if today you clearly specify in the license that your work cannot be used in training commercial AI, it isn’t legally enforceable?
ticulatedspline|8 months ago
Personally I like to frame most AI problems by substituting a human (or humans) for the AI. Works pretty well most of the time.
In this case if you hired a bunch of artists/writers that somehow had never seen a Disney movie and to train them to make crappy Disney clones you made them watch all the movies it certainly would be legal to do so but only if they had legit copies in the training room. Pirating the movies would be illegal.
Though the downside is it does create a training moat. If you want to create the super-brain AI that's conversant on the corpus of copyrighted human literature you're going to need a training library worth millions
martin-t|8 months ago
Human time is inherently valuable, computer time is not.
The issue with LLMs is that they allow doing things at a massive scale which would previously be prohibitively time consuming. (You could argue but them how much electricity is worth one human life?)
If I "write" a book by taking another and replacing every word with a synonym, that's obviously plagiarism and obviously copyright infringement. How about also changing the word order? How about rewording individual paragraphs while keeping the general structure? It's all still derivative work but as you make it less detectable, the time and effort required is growing to become uneconomical. An LLM can do it cheaply. It can mix and match parts of many works but it's all still a derivative of those works combined. After all, if it wasn't, it would produce equally good output with a tiny fraction of the training data.
The outcome is that a small group of people (those making LLMs and selling access to their output) get to make huge amounts of money off of the work of a group that is several orders of magnitude larger (essentially everyone who has written something on the internet) without compensating the larger group.
That is fundamentally exploitative, whether the current laws accounted for that situation or not.
johnnyanmac|8 months ago
I see elements of that here. Buying copyrighted works not to be exposed and be inspired, nor to utilize the aithor's talents, but to fuel a commercialization of sound-a-likes.
tgv|8 months ago
How many copies? They're not serving a single client.
Libraries need to have multiple e-book licenses, after all.
alganet|8 months ago
https://en.wikipedia.org/wiki/Mickey_Mouse#Walt_Disney_Produ...
I'm on the Air Pirates side for the case linked, by the way.
However, AI is not a parody. It's not adding to the cultural expression like a parody would.
Let's forget all the law stuff and these silly hypotheticals. Let's think of humanity instead:
Is AI contributing to education and/or culture _right now_, or is it trying to make money? I think they're trying to make money.
ninetyninenine|8 months ago
layer8|8 months ago
Humans, animals, hardware and software are treated differently by law because they have different constraints and capabilities.
martin-t|8 months ago
And who gets the money? Not the original author.
bonoboTP|8 months ago
LLMs may sometimes reproduce exact copies of chunks of text, but I would say it also matters that this is an irrelevant use case that is not the main value proposition that drives LLM company revenues, it's not the use case that's marketed and it's not the use case that people in real life use it for.
simmerup|8 months ago
lesuorac|8 months ago
If you train a LLM on harry potter and ask it to generate a story that isn't harry potter then it's not a replacement.
However, if you train a model on stock imagery and use it to generate stock imagery then I think you'll run into an issue from the Warhol case.
thedevilslawyer|8 months ago
almatabata|8 months ago
jxdxbx|8 months ago
dragonwriter|8 months ago
This ruling doesn't say anything about the enforceability of a "don't train AI on this" contract, so even if the logic of this ruling became binding prcecednet (trial court rulings aren't), such clauses would be as valid after as they are today. But contracts only affect people who are parties to the contract.
Also, the damages calculations for breach of contract are different than for copyright infringement; infringement allows actual damages and infringer's profits (or statutory damages, if greater than the provable amount of the others), but breach of contract would usually be limited to actual damages ("disgorgement" is possible, but unlike with infringer's profits in copyright, requires showing special circumstances.)
protocolture|8 months ago
First, I dont think publishers of physical books in the US get the right to establish a contract. The book can be resold for instance and that right cannot be diminished. But secondly adding more cruft to the distribution of something that the end user has a right to transform, isn't going to diminish that right.
heavyset_go|8 months ago
bananapub|8 months ago
Meta at least just downloaded ENGLISH_LANGUAGUE_BOOKS_ALL_MEGATORRENT.torrent and trained on that.
doctorpangloss|8 months ago
philipkglass|8 months ago
https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....
Maybe there's another big Google Books lawsuit that Google ultimately lost, but I don't know which one you mean in that case.
SoKamil|8 months ago
bonoboTP|8 months ago
ninetyninenine|8 months ago
veggieroll|8 months ago
ncruces|8 months ago