
Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

497 points| pyman | 8 months ago |businessinsider.com | reply

651 comments

[+] dehrmann|8 months ago|reply
The important parts:

> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use

> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"

It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.

[+] 6gvONxR4sf7o|8 months ago|reply
You skipped quotes about the other important side:

> But Alsup drew a firm line when it came to piracy.

> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."

That is, he ruled that

- buying, physically cutting up, physically digitizing books, and using them for training is fair use

- pirating the books for their digital library is not fair use.

[+] jpalawaga|8 months ago|reply
I don't think that's new. Google set the precedent for that more than a decade ago. You're allowed to transform a book to digital.
[+] MaxPock|8 months ago|reply
How times change. They wanted to lock up Aaron Swartz for life for essentially doing the same thing Anthropic is doing.
[+] sershe|8 months ago|reply
I'm not sure how I feel about what Anthropic did on merit as a matter of scale, but from a legalistic standpoint how is it different from using the book to train the meat model in my head? I could even learn bits by heart and quote them in context.
[+] seuraughty|8 months ago|reply
Feels like information laundering to me.
[+] franczesko|8 months ago|reply
Is fruit of the poisonous tree rule applicable here?
[+] bgwalter|8 months ago|reply
Here is how individuals are treated for massive copyright infringement:

https://investors.autodesk.com/news-releases/news-release-de...

[+] Aurornis|8 months ago|reply
> Here is how individuals are treated for massive copyright infringement:

When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.

This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.

[+] JimDabell|8 months ago|reply
> illegally copying and selling pirated software

This is very different to what Anthropic did. Nobody was buying copies of books from Anthropic instead of the copyright holder.

[+] stocksinsmocks|8 months ago|reply
Anthropic isn’t selling copies of the material to its users though. I would think you couldn’t lock someone up for reading a book and summarizing or reciting portions of the contents.

Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.

[+] ysofunny|8 months ago|reply
before breaking the law, set up a corporation to absorb the liability!

in other words, provided you have enough spare capital to spin up a corporation, you can break the law!!!!

[+] nh23423fefe|8 months ago|reply
What point are you making? 20 years ago, someone sold pirated copies of software (where's the transformation here?) and that's the same as using books in a training set? The judge already said reading isn't infringement.

This is reaching at best.

[+] farceSpherule|8 months ago|reply
Peterson was copying and selling pirated software.

Come up with a better comparison.

[+] chourobin|8 months ago|reply
copyright is not the same as piracy
[+] marapuru|8 months ago|reply
Apparently it's a common business practice. Spotify (even though I can't find any proof) seems to have built their software and business on pirated music. There is some more in this article [0].

[0] https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...

Funky quote:

> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.

[+] pyman|8 months ago|reply
Anthropic's cofounder, Ben Mann, downloaded millions of copies of books from Library Genesis in 2021, fully aware that the material was pirated.

Stealing is stealing. Let's stop with the double standards.

[+] originalvichy|8 months ago|reply
At least most pirates just consume for personal use. Profiting from piracy is a whole other level beyond just pirating a book.
[+] dathinab|8 months ago|reply
stealing with the intent to gain an unfair market advantage, so that you can effectively kill any company acting ethically and legally, in a way which is very likely going to hurt many authors through the products you create, is far worse than just stealing for personal use

that isn't "just" stealing, it's organized crime

[+] kube-system|8 months ago|reply
> Stealing is stealing. Let's stop with the double standards.

I get the sentiment, but that statement as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.

[+] 1970-01-01|8 months ago|reply
Let's get actual definitions of 'theft' before we leap into double standards.
[+] x3n0ph3n3|8 months ago|reply
Copyright infringement is not stealing.
[+] damnesian|8 months ago|reply
oh well, the product has a cute name and will make someone a billionaire, let's just give it the green light. who cares about copyright in the age of AI?
[+] Der_Einzige|8 months ago|reply
Information wants to be free.
[+] ramon156|8 months ago|reply
Pirating and paying the fine is probably a hell of a lot cheaper than individually buying all these books. I'm not saying this is justified, but what would you have done in their situation?

Saying "they have the money" is not an argument. It's about the amount of effort that is needed to individually buy, scan, and process millions of pages. If that's done for you, why redo it all?

[+] pyman|8 months ago|reply
These are the people shaping the future of AI? What happened to all the ethical values they love to preach about?

We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?

[+] guywithahat|8 months ago|reply
If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them. I don't owe copyright to every book I read because I may subconsciously incorporate their ideas into my future work.
[+] hellohihello135|8 months ago|reply
It’s easy to point fingers at others. Meanwhile the top comment in this thread links to stolen content from Business Insider.
[+] nickpsecurity|8 months ago|reply
Buying, scanning, and discarding was in my proposal to train under copyright restrictions.

You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improve book scanning for training.

This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.

[+] trinsic2|8 months ago|reply
I'm not seeing how this is fair use in either case.

Someone correct me if I am wrong but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?

It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model is clearly going against what copyright stands for.

[+] platunit10|8 months ago|reply
Every time an article like this surfaces, it always seems like the majority of tech folks believe that training AI on copyrighted material is NOT fair use, but the legal industry disagrees.

Which of the following are true?

(a) the legal industry is susceptible to influence and corruption

(b) engineers don't understand how to legally interpret legal text

(c) AI tech is new, and judges aren't technically qualified to decide these scenarios

Most likely option is C, as we've seen this pattern many times before.

[+] adolph|8 months ago|reply

  Alsup detailed Anthropic's training process with books: The OpenAI rival 
  spent "many millions of dollars" buying used print books, which the 
  company or its vendors then stripped of their bindings, cut the pages, 
  and scanned into digital files.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
[+] tliltocatl|8 months ago|reply
If the AI movement manages to undermine Imaginary Property, it would redeem its externalities threefold.
[+] codedokode|8 months ago|reply
By the way, I wonder if recent advancements in protecting YouTube videos from downloaders like yt-d*p are caused by unwillingness to help rival AI companies gather datasets.
[+] 1970-01-01|8 months ago|reply
The buried lede here is that Anthropic will need to attempt to explain to a judge that it is impossible to de-train 7M books from their models.
[+] nickpsecurity|8 months ago|reply
I'm hoping they fail, to incentivize using legal, open, and/or licensed data. Then, they might have to attempt to train a Claude-class model on legal data. Then, I'll have a great, legal model to use. :)
[+] koolala|8 months ago|reply
Anyone read the 2006 sci-fi book Rainbows End that has this? It was set in 2025.
[+] Kim_Bruning|8 months ago|reply
actual title:

"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."

A not-so-subtle difference.

That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.

[+] carlosjobim|8 months ago|reply
If ingesting books into an AI makes Anthropic criminals, then Google et al are also criminals alike for making search indexes of the Internet. Anything published online is equally copyrighted.
[+] riskable|8 months ago|reply
Exactly! If Anthropic is guilty of copyright infringement for the mere act of downloading copyrighted books then so is Google, Microsoft (Bing), DuckDuckGo, etc. Every search engine that exists downloads pirated material every day. They'd all be guilty.

Not only that but all of us are guilty too because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.

Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.

Intent can guide a judge when they determine damages but that's about it.

[+] kristofferR|8 months ago|reply
Yeah, we can all agree that ingesting books is fair use and transformative, but you gotta own what you ingest, you can't just pirate it.

I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.

[+] m4rtink|8 months ago|reply
Does anyone else think destroying books for any reason is wrong?

Or is it perhaps not a universal cultural/moral aspect?

I guess for example in Europe people could be more sensitive to it.

[+] lawlessone|8 months ago|reply
If they aren't one of a kind and they digitally preserved them in some way, I think I would be OK with it.

That said, there are tools for digitizing books that don't require destroying them.

[+] stackedinserter|8 months ago|reply
There's nothing sacred about books. There are plenty of books that won't be missed if destroyed.
[+] kbelder|8 months ago|reply
I have purposefully destroyed one book in my life, in order to prevent anyone from reading it:

Man of Two Worlds by Brian Herbert.

...and I did the world a favor.