
Anthropic agrees to pay $1.5B to settle lawsuit with book authors

989 points | acomjean | 6 months ago | nytimes.com

Also https://www.washingtonpost.com/technology/2025/09/05/anthrop..., https://www.reuters.com/sustainability/boards-policy-regulat...

737 comments

[+] aeon_ai|6 months ago|reply
To be very clear on this point - this is not related to model training.

It’s important in the fair use assessment to understand that the court found the training itself to be fair use; the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into when acquiring the training data.

Buying used copies of books, scanning them, and training on them is fine.

Rainbows End was prescient in many ways.

[+] rchaud|6 months ago|reply
> Buying used copies of books, scanning them, and training on it is fine.

But nobody was ever going to do that, not when there are billions in VC dollars at stake for whoever moves fastest. Everybody will simply risk the fine, which tends to be nowhere near large enough to have a deterrent effect in the future.

That is like saying Uber would not have had any problems if they had just entered into licensing contracts with taxi medallion holders. It was faster to put unlicensed taxis on the streets and use investor money to pay fines and lobby for favorable legislation. In the same way, it was faster for Anthropic to load up its models with un-DRM'd PDFs and ePubs from wherever instead of licensing them publisher by publisher.

[+] amradio1989|6 months ago|reply
I think the jury is still out on how fair use applies to AI. Fair use was not designed for what we have now.

I could read a book, but it's highly unlikely I could regurgitate it, much less months or years later. An LLM, however, can. While we can say "training is like reading", it's also not like reading at all, due to permanent, perfect recall.

Not only does an LLM have perfect recall, it also has the ability to distribute plagiarized ideas at a scale no human can. There are a lot of questions to be answered about where fair use starts and ends for these LLM products.

[+] gnabgib|6 months ago|reply
To be even more clear - this is a settlement, it does not establish precedent, nor admit wrongdoing. This does not establish that training is fair use, nor that scanning books is fine. That's somebody else's battle.
[+] mdp2021|6 months ago|reply
> Buying used copies of books

It remains deranged.

Everyone has more than a right to freely read everything that is stored in a library.

(Edit: in fact, I initially wrote 'is supposed to' in place of 'has more than a right to', meaning that "knowledge is there, we made it available: you are supposed to access it, with the fullest encouragement".)

[+] ants_everywhere|6 months ago|reply
I wonder what Aaron Swartz would think if he lived to see the era of libgen.
[+] nickpsecurity|6 months ago|reply
I don't believe that's true. Most work I've read on fair use suggests the use has to be a small amount, selectively used, substantially transformed, and not compete with the content creators. These AIs' training is the opposite of all that. I was surprised by a ruling like this, but Alsup is a unique judge.

Additionally, sharing copyrighted works without permission... the data sets or data lakes... is its own tort. You're liable just for sharing copies, before any training even happens. Some copyrighted works are also commercial, copyrighted with a ban on others' commercial use, or patented. Some are NDA'd but leaked by third parties. Sources like Common Crawl probably have plenty of such content.

Additionally, there are often contractual terms of use on accessing the content. Even Singapore's and other countries' laws allowing training on copyrighted content usually require that you lawfully accessed that content in the first place. The terms of use are the weakest link there.

I'd like to see these two issues turned by law into a copyright exception that no contract can override. It would need to specifically allow sharing scraped, publicly visible content: anything you can just view or download that the copyright owner put up. The law might impose or allow limits on daily scraping quantity, volume, etc., to curb the damage scrapers are doing.

[+] GodelNumbering|6 months ago|reply
Settlement Terms (from the case PDF)

1. A Settlement Fund of at least $1.5 Billion: Anthropic has agreed to pay a minimum of $1.5 billion into a non-reversionary fund for the class members. With an estimated 500,000 copyrighted works in the class, this would amount to an approximate gross payment of $3,000 per work. If the final list of works exceeds 500,000, Anthropic will add $3,000 for each additional work. (A quick check of this math follows the list.)

2. Destruction of Datasets: Anthropic has committed to destroying the datasets it acquired from LibGen and PiLiMi, subject to any legal preservation requirements.

3. Limited Release of Claims: The settlement releases Anthropic only from past claims of infringement related to the works on the official "Works List" up to August 25, 2025. It does not cover any potential future infringements or any claims, past or future, related to infringing outputs generated by Anthropic's AI models.
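
A quick check of the payout math in item 1 (a minimal Python sketch; the function name and the 600,000-work example are illustrative, not from the filing):

    # Fund floor of $1.5B covering ~500,000 works, plus $3,000 per extra work.
    def settlement_fund(num_works: int,
                        base_fund: float = 1.5e9,
                        base_works: int = 500_000,
                        per_extra_work: float = 3_000.0) -> float:
        extra = max(0, num_works - base_works) * per_extra_work
        return base_fund + extra

    print(settlement_fund(500_000) / 500_000)  # ~3000.0: the ~$3,000 gross per work
    print(settlement_fund(600_000))            # 1.8e9 if 100,000 more works qualify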

[+] privatelypublic|6 months ago|reply
Don't forget: NO LEGAL PRECEDENT! Which means anybody suing has to start all over. You only settle at this point if you think you'll lose.

Edit: I'll get ratio'd for this, but it's the exact same thing Google did in its lawsuit with Epic. They delayed while the public and the courts focused on Apple (oohh, EVIL Apple); Apple lost, and Google settled at a disadvantage before there was a legal judgment that couldn't be challenged later.

[+] rendaw|6 months ago|reply
So they can also keep the models trained on those datasets? That seems pretty big too, unless the half-life of models is so short it doesn't matter.
[+] gooosle|6 months ago|reply
So... it would be a lot cheaper to just buy all of the books?
[+] manbash|6 months ago|reply
Thank you. I assumed it would be quicker to find the link to the case PDF here, but your summary is appreciated!

Indeed, it is not only the payout but also the destruction of the datasets. Although the article does quote:

> “Anthropic says it did not even use these pirated works,” he said. “If some other generative A.I. company took data from pirated source and used it to train on and commercialized it, the potential liability is enormous. It will shake the industry — no doubt in my mind.”

Even if true, I wonder how many cases we will see in the near future.

[+] pier25|6 months ago|reply
Only 500,000 copyrighted works?

I was under the impression they had downloaded millions of books.

[+] testing22321|6 months ago|reply
I’m an author, can I get in on this?
[+] Taek|6 months ago|reply
I can't help but feel like this is a huge win for Chinese AI. Western companies are going to be limited in the amount of data they can collect and train on, and Chinese (or any foreign AI) is going to have access to much more and much better data.
[+] arjunchint|6 months ago|reply
Wait so they raised all that money just to give it to publishers?

Can only imagine the pitch: yes, please give us billions of dollars. We are going to make a huge investment, like paying off our lawsuits.

[+] mNovak|6 months ago|reply
Everything talks about a settlement with the 'authors'; is that meant as shorthand for copyright holders? Because there are a lot of academic works in that library where the publisher holds exclusive copyright and the author holds nothing.

By extension, if the big publishers are getting $3000 per article, that could be a fairly significant windfall.

[+] 93po|6 months ago|reply
Very unsurprisingly, the New York Times is going to frame this as a win for "the little guy" when in reality it's just multi-billion-dollar publishers, with a long, rich history of their own exploitative practices, hanging on for dear life against generative AI.
[+] Scoundreller|6 months ago|reply
Dunno if this matters, but I thought the copyright always remains with the creator/author, who then contractually assigns the rights. At least generally for books; movies will be copyrighted by the studio.

Kinda like how patents will state the human "inventor" but Apple or whichever corp is assigned the rights.

[+] petralithic|6 months ago|reply
This is sad for open source AI. Piracy for the purpose of model training should also be fair use, because otherwise only big companies like Anthropic, which can afford to pay off publishers, will be able to train. There is no way to buy billions of books just for model training; it simply can't happen.
[+] bcrosby95|6 months ago|reply
Fair use isn't about how you access the material; it's about what you can do with it after you legally access it. If you don't legally access it, the question of fair use is moot.
[+] dbalatero|6 months ago|reply
This implies training models is some sort of right.
[+] sefrost|6 months ago|reply
I wonder how much it would cost to buy every book you'd want to train a model on.
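
As a rough back-of-envelope, using the class size from the settlement terms above (the $15 average used-book price is my assumption, and this ignores the cost of scanning):

    # Hypothetical purchase cost for the ~500,000 works in the class,
    # versus the $3,000-per-work settlement floor.
    works = 500_000
    avg_price_usd = 15                      # assumed average used-book price
    print(f"${works * avg_price_usd:,}")    # $7,500,000 to buy them outright
    print(f"${works * 3_000:,}")            # $1,500,000,000 settlement floor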
[+] heavyset_go|6 months ago|reply
I don't know if I agree with it, but you could argue that if a model was built for purely academic purposes, and then used for purely academic purposes, it could meet requirements for fair use.
[+] josh-sematic|6 months ago|reply
Setting aside whether or not I think it should be fair use, you’re only going to be training a new foundation model these days if you have billions of dollars to spend on the endeavor anyway. Nobody is training Llama 5 in their garage.
[+] Aurornis|6 months ago|reply
This is a settlement. It does not set a precedent nor even admit to wrongdoing.

> otherwise only the big companies who can afford to pay off publishers like Anthropic will be able to do so

Only well funded companies can afford to hire a lot of expensive engineers and train AI models on hundreds of thousands of expensive GPUs, too.

Something tells me many of the grassroots LLM training people are less concerned about the legality of their source training sets than the big companies are, anyway.

[+] rwmj|6 months ago|reply
(Half joking but) I wonder if musicians need to worry if they learned to play by listening to cassette mixtapes.
[+] t0lo|6 months ago|reply
I wish the hn rules were more flexible because I would write the best comment to you right now.
[+] mbrochh|6 months ago|reply
After their recent change in tune to retain data for longer and to train on our data, I deleted my account.

Try to do that. There is no easy way to delete your account. You need to reach out to their support via email. Incredibly obnoxious dark pattern. I hate OpenAI, but everything with Anthropic also smells fishy.

We need more and better players. I hope that xAI will give them all some good competition, but I have my doubts.

[+] MaxikCZ|6 months ago|reply
See kids? It's okay to steal if you steal more money than the fine costs.
[+] on_meds|6 months ago|reply
It will be interesting to see how this impacts the lawsuits against OpenAI, Meta, and Microsoft. Will they quickly try to settle for billions as well?

It’s not precedent setting but surely it’ll have an impact.

[+] gordian-mind|6 months ago|reply
After the book publishers burned Google Books' Library of Alexandria, they are now making it impossible to train an LLM unless you engage in the medieval process of manually buying paper copies of works just to scan and destroy them...
[+] r_lee|6 months ago|reply
One thing that comes to mind is...

Is there a way to "license" your content on the web so that it is free only for human consumption?

I.e., effectively making AI crawlers' use of it piracy, and thus subject to the same kinds of penalties as here?

[+] gpm|6 months ago|reply
Yes to the first part. Put your site behind a login wall that requires users to sign a contract to that effect before serving them the content... get a lawyer to write that contract. Don't rely on copyright.

I'm not sure to what extent you can specify damages like these in a contract, ask the lawyer who is writing it.
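
A minimal sketch of such a gate, assuming Flask and a session cookie (the routes and wording are illustrative; the actual contract is the lawyer's job, and whether this holds up is a legal question, not a technical one):

    # Serve content only after the user affirmatively accepts a written
    # contract, so access is contractual rather than merely public.
    from flask import Flask, request, session, abort

    app = Flask(__name__)
    app.secret_key = "change-me"  # use a real secret in production

    @app.post("/accept-terms")
    def accept_terms():
        # Hypothetical terms: "content is licensed for human reading only".
        if request.form.get("agree") == "yes":
            session["accepted_terms"] = True
            return "Terms accepted."
        abort(400)

    @app.get("/content")
    def content():
        if not session.get("accepted_terms"):
            abort(403)  # no contract, no content
        return "The gated article body."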

[+] Wowfunhappy|6 months ago|reply
I'd argue you don't actually want this! You're suggesting companies should be able to make web scraping illegal.

That curl script you use to automate some task could become infringing.

[+] 7952|6 months ago|reply
Maybe some kind of captcha-like system could be devised that would count as a security measure under the DMCA, which is not allowed to be circumvented. Make the same content available for a license fee through an API.
[+] shadowgovt|6 months ago|reply
I'm sure one can try, but copyright has all kinds of oddities and carve-outs that make this complicated. IANAL, but I'm fairly certain that, for example, if you tried putting in your content license "Free for all uses public and private, except academia, screw that ivory tower..." that's a sentiment you can express, but universities are under no legal obligation to respect your wish not to have your work included in a course presentation on "wild things people put in licenses."

Similarly, since the court has found that training an LLM on works is transformative, a license that says "You may use this for other things but not to train an LLM" couldn't be any more enforceable than a musician saying "You may listen to my work as a whole unit but God help you if I find out you sampled it into any of that awful 'rap music' I keep hearing about..."

The purpose of the copyright protections is to promote "sciences and useful arts," and the public utility of allowing academia to investigate all works(1) exceeds the benefits of letting authors declare their works unponderable to the academic community.

(1) And yet, textbooks are copyrighted and the copyright is honored; I'm not sure why the academic fair-use exception doesn't allow scholars to just copy around textbooks without paying their authors.

[+] Cheer2171|6 months ago|reply
No. It is neither legally nor technically possible.
[+] codedokode|6 months ago|reply
Is this legal: scan billions of pirated books, train an LLM on them, and generate a billion public-domain books with it, so that nobody ever needs copyrighted books anymore?

Also, if there is a software library with an annoying Stallman-style license, can one use an LLM to generate a compatible library in the public domain or with a commercial license, so that nobody needs to respect software licenses anymore? Can we also generate a free Photoshop, Linux kernel, and Windows this way?

[+] mhh__|6 months ago|reply
Maybe I would think differently if I were a book author, but I can't help thinking that this is ugly yet actually quite good for humanity, in some perverse sense. Presumably I will never, ever read 99.9% of these books, but I will use Claude.
[+] scotty79|6 months ago|reply
That's the worst AI news I've ever read.

Even mighty AI companies with billions must kneel to the copyright industry. We are forever doomed. Human culture will never be free from the grasp of rent-seeking.

[+] novok|6 months ago|reply
I wonder which country will be the first to create an exception to copyright law for model-training libraries to attract tax revenue, like Ireland did for tech companies in the EU. Japan is part of the way there, but you couldn't do a Common Crawl type of thing. You could even make it a Library of Congress type of setup.
[+] dataflow|6 months ago|reply
How do legal penalties and settlements work internationally? Are entities in other countries somehow barred from filing similar suits with more penalties?
[+] KTaffer|6 months ago|reply
This was a very tactical decision by Anthropic. They have just received Series F funding, and they can now afford to settle this lawsuit.

OpenAI and Google will follow soon now that the precedent has been set, and will likely pay more.

It will be a net win for Anthropic.

[+] unvritt|6 months ago|reply
I think one under-discussed effect of settlements like this is the additional tax on experimentation. The largest players can absorb a $1.5B hit or negotiate licensing at scale. Smaller labs and startups, which often drive breakthroughs, may not survive the compliance burden.

That could push the industry toward consolidation: fewer independent experiments, more centralized R&D inside big tech. I feel this might slow the pace of unexpected innovations and increase dependence on incumbents.

This definitely raises the question: how do we balance fair compensation for creators with keeping the door open for innovation?