Based on the encyclopedic knowledge LLMs have of written works I assume all parties did the same. But I think there is a broader point to make here. YouTube was initially a ghost town (it started as a dating site) and it only got traction once people started uploading copyrighted TV shows to it. Google itself got big by indexing other people's data without compensation. Spotify's music library was also pirated in the early days. The contracts with the music labels came later. GPL violations by commercial products fit the theme too.
Companies aggressively protect their own intellectual property but have no qualms about violating the IP rights of others. Companies. Individuals have no such privilege. If you plug a laptop into a closet at MIT to download some scientific papers you forfeit your life.
The more I learn about how AI companies trained their models, the more obvious it is that the rest of us are just suckers. We're out here assuming that laws matter, that we should never misrepresent or hide what we're doing for our work, that we should honor our own terms of use and the terms of use of other sites/products, that if we register for a website or piece of content we should always use our work email address so that the person or company on the other side of that exchange can make a reasonable decision about whether we can or should have access to it.
What we should have been doing all along is YOLO-ing everything. It's only illegal if you get caught. And if you get big enough before you get caught then the rules never have to apply to you anyway.
I don't understand why it's even a question that Meta trained their LLM on copyrighted material. They say so in their paper! Quoting from their LLaMA paper [Touvron et al., 2023]:
> We include two book corpora in our training dataset: the Gutenberg Project, [...], and the Books3 section of ThePile (Gao et al., 2020), a publicly available dataset for training large language models.
Following that reference:
> Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
Furthermore, they state they trained on GitHub, web pages, and arXiv, which all contain copyrighted content.
Surely the question is: is it legal to train and/or use and/or distribute an AI model (or its weights, or its outputs) that is trained using copyrighted material? That it was trained on copyrighted material is certain.
I strongly urge people to read Thomas Babington Macaulay's speeches on copyright, its aims, terms, and hazards. Very well reasoned and explained.
In particular, people often cited the case of authors who had died leaving a family in destitution, and claimed that copyright extension would be a fair way of preventing this, but in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher who had then sat on the work without publishing it. The author, driven into penury, was then induced to sell the copyright to the publisher outright for a pittance. So in such cases a copyright extension only benefited the publisher, and indeed increased their incentive to extort the copyright.
The one who got Hindu Sanskrit books translated in a horrible manner and then claimed: "I have no knowledge of either Sanskrit or Arabic. But I have done what I could to form a correct estimate of their value. I have read translations of the most celebrated Arabic and Sanskrit works. I have conversed both here and at home with men distinguished by their proficiency in the Eastern tongues. I am quite ready to take the Oriental learning at the valuation of the Orientalists themselves. I have never found one among them who could deny that a single shelf of a good European library was worth the whole native literature of India and Arabia."
I’m a huge IP hater and am sure that happens, but to be fair, letting copyright extend past death also increases the amount the author can sell it for in the first place.
> in most cases the remaining family had never held the copyright; the author had initially sold the reproduction rights to a publisher
He was able to sell it because it is something valuable, exactly because of the copyright protections. Regardless of whether the author sells the rights or not, he and his family would equally be better off with copyright.
Libgen is a civilizational project that should be endorsed, not prosecuted. I hope one day people will look at it and think how stupid we were today to shun the largest collection of literary works in human history.
We all like hating big corporations, especially Meta, and people seem to use this as an opportunity to advocate for punishing them. I think it's wiser to advocate for changing our IP laws.
While Aaron Swartz was bullied to suicide, these corporations will walk free and make billions. I say give every tech CEO the Swartz treatment, then change the law.
Big corporations are too big, they should just not exist. When you have corporations more powerful than the government of the biggest states, it's a bug, not a feature.
The IP laws may need rethinking. Saying that they should disappear because big corporations are above the law doesn't help, though. First kill the big corporations, then think about fair laws. Changing the law now would not change anything since those corporations are already above the law.
Big corporations don't have morals or ethics. They'll break any laws as long as it's profitable. There's no point complaining about Meta or Zuck. Meta does what it's designed to do. If people aren't happy, they should vote for more regulations.
We may find in retrospect that the moment has already passed at which "big corporations" became more powerful and impactful on our lives than the IP laws on the books. After all, we can already plainly see that those laws only come into effect when useful to the powerful.
It really makes you think about those crazy internet folks from back in the day who thought copyright law was too strict and that restricting humanity's access to knowledge in such a way was holding us all back for the benefit of a tiny few.
I'm all for chopping up copyright law. But until we do so, companies like Meta need to be treated just like everyone else.
That means lawsuits, prison sentences, and millions in fines. And that's just the piracy part, there's also the lying/fraud part.
Interestingly, a Dutch LLM project was sent a cease and desist after the local copyright lobby caught wind of it being trained on a bunch of pirated eBooks. The case unfortunately wasn't fought out in court, because I would be very interested to see if this could make that copyright lobby take down ChatGPT and the other AI companies for doing the same.
The more concerning thing is that the best thing these overpaid people could come up with was... download the torrent, like everyone else. Here you are, with billions in resources, and no one is willing to spend a part of it to at least digitize some new data? Like even Google did?
Beyond illegal downloading and distribution of copyrighted content, the article also describes how Meta staff seemingly lied about it in depositions (including, potentially, Mark Zuckerberg himself).
So if I torrented and seeded, I would be doing it for my own entertainment, not commercially. I expect big copyright holders to come after me. If Meta does it, I guess they have better lawyers?
Is there a concept in the legal system of first-come-first-served that could be used as precedent?
What I mean is: when someone is prosecuted for copyright infringement, but Meta isn't, then could the case be put on hold until Meta is found guilty and pays a fine?
Also maybe the fine on the later case would have to be proportional to the prior case. So if Meta pays $1 per infringement, the penalty might be $1 for torrenting something else (which is immaterial and not worth the justice system's time) so pretty much all copyright infringement cases would get thrown out.
It reminds me of how mainstream drug addicts get convicted and spend years in prison, while celebrities get off with a warning or monetary fine.
Considering the prices paid for a single work, this must be a multi-billion-dollar compensation.
Take for example the 675k paid for 31 songs. So roughly 20k a song. If we estimate a book to be, say, 10 MB, that would be 8 million works. So I think reasonable compensation is something along the lines of 163 billion. Not even 10 years of net income. Which I think is an entirely fair punishment.
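For anyone who wants to check the arithmetic, here is a rough sketch of that estimate in Python. The award, the song count, and the 10 MB book size come from the comment. The total amount of data is not stated there, so the 81.7 TB value below is an assumption, picked because it reproduces the "8 million works" figure. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope reproduction of the damages estimate above.
AWARD_USD = 675_000      # from the comment: total paid in the song case
SONGS = 31               # from the comment: number of songs
BOOK_SIZE_MB = 10        # from the comment: guessed average size of one book
DATASET_TB = 81.7        # ASSUMPTION: not stated in the comment, chosen to match "8 million works"


def per_work_award(award_usd: float, works: int) -> float:
    """Average amount paid per infringed work."""
    return award_usd / works


def estimated_works(dataset_tb: float, book_size_mb: float) -> float:
    """Number of books in the dataset, using decimal units (1 TB = 1,000,000 MB)."""
    return dataset_tb * 1_000_000 / book_size_mb


def total_compensation(works: float, per_work_usd: float) -> float:
    """Total compensation if every work is priced at the same per-work amount."""
    return works * per_work_usd


exact_per_song = per_work_award(AWARD_USD, SONGS)      # unrounded per-song figure
works = estimated_works(DATASET_TB, BOOK_SIZE_MB)      # about 8 million books
total_rounded = total_compensation(works, 20_000)      # using the comment's rounded 20k
total_exact = total_compensation(works, exact_per_song)

print(f"per song: ${exact_per_song:,.0f}")
print(f"works: {works:,.0f}")
print(f"total at 20k per work: ${total_rounded / 1e9:.1f} billion")
print(f"total at exact per-song rate: ${total_exact / 1e9:.1f} billion")
```

With the rounded 20k per work this lands at about 163 billion, which matches the comment. Using the unrounded per-song figure pushes it closer to 178 billion, so the estimate is quite sensitive to both the rounding and the guessed book size.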
My ISP will shut off my internet if it catches me torrenting copyrighted material, but if you're a massive corporation that steals TBs of data it's barely a blip in the news.
This should be legal. Copyright law does more harm than good.
The only ethical problem here is that only Meta-sized companies can afford to pay the "damages" for such blatant law violations at worst, or the fees of their lawyers at best.
"Supposedly, Meta tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers, an internal message from Meta researcher Frank Zhang said, while describing the work as in "stealth mode." Meta also allegedly modified settings "so that the smallest amount of seeding possible could occur," a Meta executive in charge of project management, Michael Clark, said in a deposition..."
They will be getting a lot of Frommer Legal letters...
The question is, if they could and would have paid for each book, would it be ok to train the LLM on them? I'm talking about prior books, I'm sure new books have language forbidding their use to train LLMs at the point of sale.
But legally, how does using a book to train an LLM differ from a teacher learning from a book and teaching its contents to their pupils? Obviously, the LLM can do so at scale, but is there a legal difference?
> The question is, if they could and would have paid for each book, would it be ok to train the LLM on them?
Whether training an AI model on an array of different works, many of which are copyright protected, is itself a copyright violation, in addition to or distinct from any copyright violation that goes on gathering the dataset for training (and separate from any copyright violation in the actual or intended use of the LLM), remains to be resolved as a legal question, and may or may not have a simple yes or no answer (or the same answer under every system of copyright laws globally).
My inclination is that it is probably generally not a violation in US law, but that's not something I am very confident in; how the definitions of copy and derivative work apply to determine if it would be without fair use, and how fair use analysis applies, are not clear from the available precedent.
> But legally, how does using a book to train an LLM differ from a teacher learning from a book and teaching its contents to their pupils?
It is very clear, by looking at how US copyright law is written and even more clear in its history of application, that information stored in the brains of people is without exception neither a copy nor a new work that can be a derivative work under US law, and so cannot be infringing, no matter how you gain it. It’s also very clear in the statute itself and the case law that data in media used by artificial digital computers, on the other hand, can constitute copies or derivative works that can be infringing. Even if the process is arguably similar in legally relevant manners, copyright law is critically focussed on the result and whether it is a particular kind of thing which can be infringing, not just the process.
For some mysterious reason I can't see Zuckerberg in front of a judge facing 50 years imprisonment. Can anyone?
I truly hope that whoever takes the case goes after Meta with 1000 times the pressure that was put on Swartz, but honestly I don't expect much, just as the top comment precisely expressed.
And if we are going to be fair, please let's also not forget about the other usual suspects, or does anyone think they are falling behind?
[+] [-] gizmo|1 year ago|reply
[+] [-] peterbonney|1 year ago|reply
Suckers. All of us.
[+] [-] JW_00000|1 year ago|reply
(Presser, 2020) refers to https://twitter.com/theshawwn/status/1320282149329784833. (Which funnily refers to this DMCA policy: https://the-eye.eu/dmca.mp4)
[Touvron et al., 2023] https://arxiv.org/pdf/2302.13971
[Gao et al., 2020] https://arxiv.org/pdf/2101.00027
[+] [-] peterclary|1 year ago|reply
[+] [-] kshri24|1 year ago|reply
This chap will educate us on copyright?
No thanks!
[+] [-] bbor|1 year ago|reply
[+] [-] golergka|1 year ago|reply
[+] [-] arresin|1 year ago|reply
[+] [-] mik1998|1 year ago|reply
[+] [-] yoavm|1 year ago|reply
[+] [-] _Algernon_|1 year ago|reply
https://en.wikipedia.org/wiki/Aaron_Swartz#United_States_v._...
https://en.wikipedia.org/wiki/Aaron_Swartz#Death
[+] [-] palata|1 year ago|reply
[+] [-] lrvick|1 year ago|reply
That said, I want them to burn for the right reasons.
Downloading data that should be available to the public is not one of them.
[+] [-] yodsanklai|1 year ago|reply
[+] [-] Ekaros|1 year ago|reply
[+] [-] blueboo|1 year ago|reply
[+] [-] aprilthird2021|1 year ago|reply
[+] [-] freeAgent|1 year ago|reply
[+] [-] jillyboel|1 year ago|reply
[+] [-] boesboes|1 year ago|reply
Also, change the law so this is legal for poor meta? smh..
[+] [-] miltonlost|1 year ago|reply
[+] [-] fimdomeio|1 year ago|reply
[+] [-] jeroenhd|1 year ago|reply
[+] [-] stefan_|1 year ago|reply
[+] [-] gameshot911|1 year ago|reply
[+] [-] malfist|1 year ago|reply
Flippant response, I know, but too many people worship at the altar of the job creator and believe these folks are moral, upstanding citizens.
[+] [-] bmsleight_|1 year ago|reply
Could make interesting case law.
[+] [-] unification_fan|1 year ago|reply
Yeah, to perpetuate this system where only those who can afford lawyers get to benefit
[+] [-] nyoomboom|1 year ago|reply
[+] [-] stingraycharles|1 year ago|reply
[+] [-] qup|1 year ago|reply
[+] [-] zackmorris|1 year ago|reply
[+] [-] Ekaros|1 year ago|reply
[+] [-] panki27|1 year ago|reply
[+] [-] RobotToaster|1 year ago|reply
[+] [-] wnevets|1 year ago|reply
[+] [-] freeAgent|1 year ago|reply
[+] [-] gkbrk|1 year ago|reply
[+] [-] lrvick|1 year ago|reply
[+] [-] belter|1 year ago|reply
[+] [-] bigmattystyles|1 year ago|reply
[+] [-] dragonwriter|1 year ago|reply
[+] [-] CryptoBanker|1 year ago|reply
[+] [-] liendolucas|1 year ago|reply
[+] [-] Havoc|1 year ago|reply
Horse has functionally bolted on this already
I’m guessing a slap on the wrist, despite courts going pretty hard after individuals for a couple of torrented movies.
[+] [-] aprilthird2021|1 year ago|reply
[+] [-] empath75|1 year ago|reply