Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

[+] neonate|8 months ago|reply

[+] dehrmann|8 months ago|reply

The important parts:

> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use

> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"

It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.

[+] 6gvONxR4sf7o|8 months ago|reply

You skipped quotes about the other important side:

> But Alsup drew a firm line when it came to piracy.

> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."

That is, he ruled that

- buying, physically cutting up, physically digitizing books, and using them for training is fair use

- pirating the books for their digital library is not fair use.

[+] jpalawaga|8 months ago|reply

I don't think that's new. Google set precedent for that more than a decade ago. You're allowed to transform a book to digital.

[+] MaxPock|8 months ago|reply

How times change. They wanted to lock up Aaron Swartz for life for essentially doing the same thing Anthropic is doing.

[+] alok-g|8 months ago|reply

AFAIK, Judge Vince Chhabria has countered that Fair Use argument in a later order involving Meta.

https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...

Note: I am not a lawyer.

[+] sershe|8 months ago|reply

I'm not sure how I feel about what Anthropic did on merit as a matter of scale, but from a legalistic standpoint, how is it different from using the book to train the meat model in my head? I could even learn bits by heart and quote them in context.

[+] seuraughty|8 months ago|reply

Feels like information laundering to me.

[+] franczesko|8 months ago|reply

Is fruit of the poisonous tree rule applicable here?

[+] bgwalter|8 months ago|reply

Here is how individuals are treated for massive copyright infringement:

https://investors.autodesk.com/news-releases/news-release-de...

[+] piker|8 months ago|reply

I thought you'd go with this: https://en.wikipedia.org/wiki/United_States_v._Swartz

[+] Aurornis|8 months ago|reply

> Here is how individuals are treated for massive copyright infringement:

When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.

This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.

[+] JimDabell|8 months ago|reply

> illegally copying and selling pirated software

This is very different to what Anthropic did. Nobody was buying copies of books from Anthropic instead of the copyright holder.

[+] stocksinsmocks|8 months ago|reply

Anthropic isn’t selling copies of the material to its users though. I would think you couldn’t lock someone up for reading a book and summarizing or reciting portions of the contents.

Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.

[+] ysofunny|8 months ago|reply

before breaking the law, set up a corporation to absorb the liability!

in other words, provided you have enough spare capital to spin up a corporation, you can break the law!!!!

[+] nh23423fefe|8 months ago|reply

What point are you making? 20 years ago, someone sold pirated copies of software (where's the transformation here?) and that's the same as using books in a training set? The judge already said reading isn't infringement.

This is reaching at best.

[+] farceSpherule|8 months ago|reply

Peterson was copying and selling pirated software.

Come up with a better comparison.

[+] chourobin|8 months ago|reply

copyright is not the same as piracy

[+] marapuru|8 months ago|reply

Apparently it's a common business practice. Spotify (even though I can't find any proof) seems to have built their software and business on pirated music. There is some more in this article [0].

https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...

Funky quote:

> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.

[+] pyman|8 months ago|reply

Anthropic's cofounder, Ben Mann, downloaded millions of copies of books from Library Genesis in 2021, fully aware that the material was pirated.

Stealing is stealing. Let's stop with the double standards.

[+] originalvichy|8 months ago|reply

At least most pirates just consume for personal use. Profiting from piracy is a whole other level beyond just pirating a book.

[+] dathinab|8 months ago|reply

stealing with the intent to gain an unfair market advantage, so that you can effectively kill any company acting ethically and legally, in a way which is very likely going to hurt many authors through the products you create, is far worse than just stealing for personal use

that isn't "just" stealing, it's organized crime

[+] kube-system|8 months ago|reply

> Stealing is stealing. Let's stop with the double standards.

I get the sentiment, but that statement as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.

[+] 1970-01-01|8 months ago|reply

Let's get actual definitions of 'theft' before we leap into double standards.

[+] x3n0ph3n3|8 months ago|reply

Copyright infringement is not stealing.

[+] NoMoreNicksLeft|8 months ago|reply

[deleted]

[+] damnesian|8 months ago|reply

oh well, the product has a cute name and will make someone a billionaire, let's just give it the green light. who cares about copyright in the age of AI?

[+] Der_Einzige|8 months ago|reply

Information wants to be free.

[+] ramon156|8 months ago|reply

Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?

Saying "they have the money" is not an argument. It's about the amount of effort that is needed to individually buy, scan, and process millions of pages. If that's done for you, why redo it all?

[+] pyman|8 months ago|reply

These are the people shaping the future of AI? What happened to all the ethical values they love to preach about?

We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?

[+] guywithahat|8 months ago|reply

If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them. I don't owe copyright to every book I read because I may subconsciously incorporate their ideas into my future work.

[+] hellohihello135|8 months ago|reply

It’s easy to point fingers at others. Meanwhile the top comment in this thread links to stolen content from Business Insider.

[+] nickpsecurity|8 months ago|reply

Buying, scanning, and discarding was in my proposal to train under copyright restrictions.

You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR which could do many things, including improve book scanning for training.

This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.

[+] trinsic2|8 months ago|reply

I'm not seeing how this is fair use in either case.

Someone correct me if I am wrong but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?

It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model is clearly going against what copyright stands for.

[+] platunit10|8 months ago|reply

Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use, but the legal industry disagrees.

Which of the following are true?

(a) the legal industry is susceptible to influence and corruption

(b) engineers don't understand how to legally interpret legal text

(c) AI tech is new, and judges aren't technically qualified to decide these scenarios

Most likely option is C, as we've seen this pattern many times before.

[+] adolph|8 months ago|reply

> Alsup detailed Anthropic's training process with books: The OpenAI rival spent "many millions of dollars" buying used print books, which the company or its vendors then stripped of their bindings, cut the pages, and scanned into digital files.

I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.

[+] tliltocatl|8 months ago|reply

If the AI movement manages to undermine Imaginary Property, it will redeem its externalities threefold.

[+] codedokode|8 months ago|reply

By the way, I wonder if recent advances in protecting YouTube videos from downloaders like yt-d*p are caused by an unwillingness to help rival AI companies gather datasets.

[+] 1970-01-01|8 months ago|reply

The buried lede here is that Anthropic will need to attempt to explain to a judge that it is impossible to de-train 7M books from their models.

[+] nickpsecurity|8 months ago|reply

I'm hoping they fail, to incentivize using legal, open, and/or licensed data. Then, they might have to attempt to train a Claude-class model on legal data. Then, I'll have a great, legal model to use. :)

[+] koolala|8 months ago|reply

Anyone read the 2006 sci-fi book Rainbows End that has this? It was set in 2025.

[+] Kim_Bruning|8 months ago|reply

actual title:

"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."

A not-so-subtle difference.

That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.

[+] carlosjobim|8 months ago|reply

If ingesting books into an AI makes Anthropic criminals, then Google et al are also criminals alike for making search indexes of the Internet. Anything published online is equally copyrighted.

[+] riskable|8 months ago|reply

Exactly! If Anthropic is guilty of copyright infringement for the mere act of downloading copyrighted books then so is Google, Microsoft (Bing), DuckDuckGo, etc. Every search engine that exists downloads pirated material every day. They'd all be guilty.

Not only that but all of us are guilty too because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.

Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.

Intent can guide a judge when they determine damages but that's about it.

[+] kristofferR|8 months ago|reply

Yeah, we can all agree that ingesting books is fair use and transformative, but you gotta own what you ingest, you can't just pirate it.

I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.

[+] m4rtink|8 months ago|reply

Does anyone else think destroying books for any reason is wrong?

Or is it perhaps not a universal cultural/moral aspect?

I guess for example in Europe people could be more sensitive to it.

[+] lawlessone|8 months ago|reply

If they aren't one of a kind and they digitally preserved them in some way, I think I would be OK with it.

That said, there are tools for digitizing books that don't require destroying them.

[+] stackedinserter|8 months ago|reply

There's nothing sacred about books. There are plenty of books that won't be missed if destroyed.

[+] kbelder|8 months ago|reply

I have purposefully destroyed one book in my life, in order to prevent anyone from reading it:

Man of Two Worlds by Brian Herbert.

...and I did the world a favor.

651 comments