
Meta torrented & seeded 81.7 TB dataset containing copyrighted data

1270 points| gameshot911 | 1 year ago |arstechnica.com

938 comments

[+] gizmo|1 year ago|reply
Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same. But I think there is a broader point to make here. YouTube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme also.

Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.

[+] peterbonney|1 year ago|reply
The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.

What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.

Suckers. All of us.

[+] JW_00000|1 year ago|reply
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:

> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.

Following that reference:

> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).

(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)

Furthermore, they state they trained on GitHub, web pages, and ArXiv, which all contain copyrighted content.

Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material. That it was trained on copyrighted material is certain.

[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971

[Gao et al., 2020] https://arxiv.org/pdf/2101.00027

[+] peterclary|1 year ago|reply
I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.

In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.

[+] kshri24|1 year ago|reply
> Thomas Babington Macaulay

The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."

This chap will educate us on copyright?

No thanks!

[+] bbor|1 year ago|reply
I’m a huge IP hater and am sure that happens, but to be fair, letting copyright extend past death also increases the amount the author can sell it for in the first place.
[+] golergka|1 year ago|reply
> in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher

He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether the author sells the rights or not, he and his family would equally be better off with copyright.

[+] arresin|1 year ago|reply
This one example does not make stealing acceptable which is what you’re implying.
[+] mik1998|1 year ago|reply
Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.
[+] yoavm|1 year ago|reply
We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.
[+] palata|1 year ago|reply
You're conflating different problems.

Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.

The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.

[+] lrvick|1 year ago|reply
I truly hope Meta has a serious security issue that burns their company to the ground.

That said, I want them to burn for the right reasons.

Downloading data that should be available to the public is not one of them.

[+] yodsanklai|1 year ago|reply
Big corporations don't have morals or ethics. They'll break any laws as long as it's profitable. There's no point complaining about Meta or Zuck. Meta does what it's designed to do. If people aren't happy, they should vote for more regulations.
[+] Ekaros|1 year ago|reply
First punish them. Then change the laws.
[+] blueboo|1 year ago|reply
We may in retrospect find that the moment has passed, and that "big corporations" have become more powerful and impactful on our lives than the IP laws on the books. After all, we can already plainly see that those laws only come into effect when useful to the powerful.
[+] aprilthird2021|1 year ago|reply
I think most of the public is probably in favor of stronger IP laws now that big corps are threatening to make them jobless with IP-disrespecting AIs
[+] freeAgent|1 year ago|reply
The point is about the hypocrisy and double-standards evinced by this behavior.
[+] jillyboel|1 year ago|reply
First we must prosecute Meta into committing suicide like was done to Aaron Swartz. After justice is served, we should change IP laws.
[+] boesboes|1 year ago|reply
They broke the law and should be punished for that. Whether the law should change is a separate discussion.

Also, change the law so this is legal for poor meta? smh..

[+] miltonlost|1 year ago|reply
Big corporations all like hating their consumers and the law. You love committing crimes, it seems.
[+] fimdomeio|1 year ago|reply
It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict and that restricting humanity to knowledge in such a way was holding us all back for the benefit of a tiny few.
[+] jeroenhd|1 year ago|reply
I'm all for chopping up copyright law. But until we do so, companies like Meta need to be treated just like everyone else.

That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.

Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.

[+] stefan_|1 year ago|reply
The more concerning thing is that the best thing these overpaid people could come up with was.. download the torrent, like everyone else. Here you are, billions of resources, and no one is willing to spend a part of it to at least digitize some new data? Like even Google did?
[+] gameshot911|1 year ago|reply
Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).
[+] malfist|1 year ago|reply
Huh, a big tech CEO lied to us?

Flippant response, I know, but too many people worship at the altar of the job creator and believe these folks are moral, upstanding citizens.

[+] bmsleight_|1 year ago|reply
So if I torrented and seeded, I would be doing it for my own entertainment, not commercially. I expect big copyright holders would come after me. If Meta does it - I guess they have better lawyers?

Could make interesting case law.

[+] unification_fan|1 year ago|reply
> Could make interesting case law.

Yeah, to perpetuate this system where only those who can afford lawyers get to benefit

[+] nyoomboom|1 year ago|reply
Remembering Aaron Swartz in this moment
[+] stingraycharles|1 year ago|reply
Which was arguably more innocent — scientific papers.
[+] qup|1 year ago|reply
Would Aaron have preferred us to download the material and train the AI?
[+] zackmorris|1 year ago|reply
Is there a concept in the legal system of first-come-first-served that could be used as precedent?

What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?

Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.

It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.

[+] Ekaros|1 year ago|reply
Considering prices for single work, this must be multi-billion dollar compensation.

Take for example the 675k paid for 31 songs. So roughly 20k a song. If we estimate a book to be, say, 10 MB, that would be about 8 million works. So I think reasonable compensation is somewhere around 163 billion. Not even 10 years of net income. Which I think is entirely fair punishment.
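
The back-of-envelope estimate above can be reproduced in a few lines. This is only a sketch of the commenter's arithmetic, not a legal damages model: the 10 MB average book size and the rounded 20k per work are the comment's own guesses, the 81.7 TB figure comes from the headline (read here as decimal terabytes, which is an assumption), and the function name `estimate_damages` is made up for illustration.

```python
def estimate_damages(
    dataset_bytes: int = 81_700_000_000_000,  # 81.7 TB from the headline, decimal TB assumed
    book_bytes: int = 10_000_000,             # 10 MB per book, the commenter's guess
    per_work_usd: int = 20_000,               # 675k / 31 songs, rounded down as in the comment
) -> tuple[int, int]:
    """Return (estimated number of works, total damages in USD)."""
    works = dataset_bytes // book_bytes
    total_usd = works * per_work_usd
    return works, total_usd


works, total_usd = estimate_damages()
print(f"{works:,} works")                    # 8,170,000 works
print(f"{total_usd / 1e9:.1f} billion USD")  # 163.4 billion USD
```

Using the unrounded per-song figure (675,000 / 31, about 21.8k) instead of 20k pushes the total to roughly 178 billion, so the result is quite sensitive to both guesses.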

[+] panki27|1 year ago|reply
They could have at the very least seeded some more, to give something back to the, uh, community.
[+] RobotToaster|1 year ago|reply
Before I decide my opinion on this I need to know their ratio.
[+] wnevets|1 year ago|reply
My ISP will shut off my internet if it catches me torrenting copyrighted material, but if you're a massive corporation that steals TBs of data it's barely a blip in the news.
[+] freeAgent|1 year ago|reply
Wouldn't it be amazing if all of Meta's ISPs cut them off for torrenting? One can dream...
[+] gkbrk|1 year ago|reply
You should look into changing your ISP, or at least get a VPN.
[+] lrvick|1 year ago|reply
This should be legal. Copyright law does more harm than good.

The only ethical problem here is that only Meta sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.

[+] belter|1 year ago|reply
"Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in "stealth mode." Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition..."

They will be getting a lot of Frommer Legal letters...

[+] bigmattystyles|1 year ago|reply
The question is, if they could and would have paid for each book, would it be ok to train the LLM on them? I'm talking about prior books, I'm sure new books have language forbidding their use to train LLMs at the point of sale. But legally, how does using a book to train an LLM differ from a teacher learning from a book and teaching its contents to their pupils? Obviously, the LLM can do so at scale, but is there a legal difference?
[+] dragonwriter|1 year ago|reply
> The question is, if they could and would have paid for each book, would it be ok to train the LLM on them?

Whether training an AI model on an array of different works, many of which are copyright protected, is itself a copyright violation, in addition to or distinct from any copyright violation that occurs in gathering the dataset for training (and separate from any copyright violation in the actual or intended use of the LLM), remains to be resolved as a legal question, and may or may not have a simple yes or no answer (or the same answer under every system of copyright laws globally).

My inclination is that it is probably generally not a violation in US law, but that's not something I am very confident in; how the definitions of copy and derivative work apply to determine if it would be without fair use, and how fair use analysis applies, are not clear from the available precedent.

> But legally, how does using a book to train a LLM differ from a teacher learning from a book and teaching its contents to their pupils.

It is very clear, from how US copyright law is written and even more from its history of application, that information stored in the brains of people is, without exception, neither a copy nor a new work that can be a derivative work under US law, and so cannot be infringing, no matter how it was acquired. It’s also very clear in the statute itself and the case law that data in media used by artificial digital computers, on the other hand, can constitute copies or derivative works that can be infringing. Even if the process is arguably similar in legally relevant ways, copyright law is critically focussed on the result and whether it is a particular kind of thing which can be infringing, not just the process.

[+] CryptoBanker|1 year ago|reply
An LLM is not a person. That is the legal difference...until we have Citizens United v2
[+] liendolucas|1 year ago|reply
For some mysterious reason I can't see Zuckerberg in front of a judge facing 50 years imprisonment. Can anyone?

I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.

And if we are going to be fair, let's also not forget about the other usual suspects. Or does anyone think they are falling behind?

[+] Havoc|1 year ago|reply
Really curious what the judges are going to do here.

Horse has functionally bolted on this already

I’m guessing a slap on the wrist, despite courts going after individuals pretty hard for a couple of torrented movies

[+] aprilthird2021|1 year ago|reply
Is there any other possible outcome than a fine? That too one which will not really affect Meta's overall earnings
[+] empath75|1 year ago|reply
The reality of the situation is that the economic value and utility of AI is going to cause the laws to be restructured around them.