This is missing the largest argument, in my opinion: the weights are a derivative work of the GPL-licensed code and should therefore be released under the GPL. I would say these companies should either release their weights or simply not train on copyleft code.
It is truly amazing how many people will shill for these massive corporations that claim they love open source or that their AI is open while they profit off of the violation of licenses and contribute very little back.
The GPL doesn't apply / doesn't have to be agreed to when the usage is allowed by copyright law in another way. The GPL can't override copyright exceptions like fair use (details vary by jurisdiction, but the principle is the same everywhere).
Even the license itself states it's optional, and you don't have to agree to it (if you don't, you get copyright law's default).
The author of the article is a former member of the Pirate Party and the EU Parliament, so they have expertise in copyright law.
I'm with you on that. Many argue that AI models don't "contain the code", but if they are trained on the copyrighted data and generate something similar, then the AI model is akin to a lossy data-compression format.
Frequency-domain data derived from an image are not the image, but no one argues a JPEG-encoded copy of a PNG isn't the same image. I think weights vs. code are similar in that regard.
As for releasing weights, probably more if we're talking about AGPL code.
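To make the lossy-compression analogy concrete, here is a purely illustrative toy "codec" in Python: quantization discards information, so the round-trip is not byte-identical, yet the reconstruction is recognizably the same data. (This is only an analogy and makes no claim about how model weights actually store training data.)

```python
def lossy_encode(samples, step=10):
    """Quantize each sample onto a coarse grid, discarding fine detail."""
    return [round(s / step) for s in samples]

def lossy_decode(codes, step=10):
    """Reconstruct an approximation of the original from the coarse codes."""
    return [c * step for c in codes]

original = [3, 14, 15, 92, 65, 35]
restored = lossy_decode(lossy_encode(original))

print(restored)              # the exact values are gone...
print(restored == original)  # False: the copy is not bit-identical,
                             # but it is clearly "the same" signal
```

The copyright question in the JPEG/PNG comparison is exactly this: the bits differ, yet nobody would call the result a different work.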
I think it's amazing that licenses are ignored to train a model, but companies then try to impose a license on the use of the same model. It would be nice if there was a training BOM that came with a model, and if one was not included, all rights to control the use of the model were forfeit.
But they train their models on everything, regardless of the licence. It follows that the resulting derivative work likely mixes stuff that is under incompatible licences, with the result that it can't be distributed at all.
> The weights are the derivative work of the [GPL licensed] code
This is not immediately obvious to me.
A small thought experiment: the Harry Potter books are clearly copyrighted works. If I generate a frequency list of all words in these books, i.e. a list of all words and how often they appear, that frequency list is derived from the original work, in the normal way we would use the word "derived". But is it a "derivative work", under the strict legal definition of this term?
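For what it's worth, the frequency list in this thought experiment is trivial to compute. A minimal sketch (the sample passage is just a stand-in for the actual books):

```python
from collections import Counter
import re

def word_frequencies(text: str) -> Counter:
    """Return how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Stand-in passage; the legal question applies the same way to a whole novel.
sample = "The boy who lived had survived, and the boy who lived was famous."
freq = word_frequencies(sample)
print(freq.most_common(3))  # e.g. [('the', 2), ('boy', 2), ('who', 2)]
```

Whether such a derived artifact crosses the line into a legal "derivative work" is exactly the open question; training weights are a vastly more complex transformation than this, but the same question applies.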
It would make no sense to release the weights under the GPL because machine-generated stuff is uncopyrightable. There should be an argument about the model generating derivative works without attribution as a consequence of how it works. But that machine-generated stuff is also uncopyrightable, even though it might be kept secret.
Just FYI, Felix Reda was a member of the European Parliament and was responsible there for the copyright reform, and was also involved in the GDPR, massively stepping on the toes of big tech. I don't know if it was your intention to include them in a list of people who "shill" for big tech, but they shouldn't be included.
> What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.
That should not be astonishing. The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright. If the authors wanted their code to be made use of in non-Free software, they would have used a BSD or MIT license.
> The Free Software community has made it clear from day 1 that the GPL can only achieve its goals through enforcement of copyright
We should mention when we say this, although I think it is self-evident, that the preferable alternative is reducing the scope of copyright across the board -- be it with shorter time frames (I'd argue even twenty years total is too long!) or some other means.
To programmers and developers: remember, the core of free software is NOT the commercial developer / programmer, and it NEVER has been. The core is always the user and what they need. This is so important that it needs to be repeated every time someone talks about free software, because free software is NOT about open source. Open source code is a necessary part of free software, but it is NOT sufficient.
https://www.gnu.org/philosophy/free-sw.en.html
I think that the author has a warped idea of how LLMs work, and that infects their reasoning. Also, I see no mention of the inequality of this new "copyright-free code generation" situation they defend. As much as Microsoft thinks all code is ripe for the taking, I can't imagine how happy they would be if an anonymous person dropped a model trained on all the leaked Windows code and the ReactOS people started using it. Or if employees started taking internal code to train models that they then use after their employment ends (since it's not copyright infringement, it should be cool).
I think the author has a much better knowledge of the legal implications of the situations you describe.
These situations might trigger a lot of issues, but none related to copyright. If you work for MS, then move to another company, there is no copyright infringement if you simply generate new code based on whatever you read at MS. There might be some rules regarding non-compete agreements, etc., but these are not related to copyright.
The very basic question is how the LLM got trained and how it got access to the source. If MS source code leaked, you could not sue people for reading it.
Who are they trying to fool? Wholesale expropriation after stripping the license and authorship, while those in the open source community observe both of them very carefully.
Give credit where credit is due, including paying the creators when the licensing is violated.
Context is important here. Reda was elected to the European Parliament as a member of the German Pirate Party, so his position here isn't "big businesses are entitled to your code", and more "this sort of wholesale expropriation is a consequence of our posture towards copyright in general".
While I agree with you on principle, current laws do not reflect the copyright status intended by copyleft works. I'm not even sure if copyleft can be enforced against AI plagiarism under current laws.
That's a great point about stripping authorship. It would be nice if there was some sort of blockchain linking every bit of knowledge to its source. Some people at least would like getting attribution--I know I would. Instead we get a planet-sized meat grinder producing the perfect burger material. Just make sure to add enough spices to make it edible, i.e. not to offend anyone.
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.
Am I reading this right? If this argument is generally true, does this mean that the output of a compiler might also be sent into the public domain? Or the live recording and broadcast of an event that involves automated machines on all levels?
No, it's incorrect and/or badly worded. The author is right that a machine cannot author things, and the stuff that the LLM might create de novo would not have copyright protection. But it's missing the point when the argument is that existing authored works could be generated via an LLM, and the authorship/copyright is already established.
If Copilot spits out the entirety of a GPL library and you include that code in your project you are certainly violating the GPL license.
AI companies are trying to avoid paying for training data, since the amount of data required is so vast that any payment reasonable to content creators would result in billions in expenses.
Additionally, there have been copyright exemptions around scraping and reproducing the scraped contents, but typically those exemptions have been explicitly granted as part of a copyright case and have been narrowly defined.
For instance, Google Images only provides thumbnails, and your browser gets the full-size image from the original source.
The biggest problem for AI is that most similar previous copyright cases have been partially avoided by not being the same thing: Google's scraping isn't trying to do the same thing your content is doing.
However, a model's output is trying to do the same thing as the original, so it falls under stricter scrutiny.
Although, as this post alludes to, the problem is that going after the AI is untested territory, and going after violators tends to be complex at best. After all, in my first hypothetical, how would anyone know? I will say that historically the courts haven't been very positive about shell games like this.
Copyleft and copyright are not at odds. To promote copyleft, you exercise copyright.
Furthermore, copyright is key to ensuring attribution, and attribution is an important enabler and motivator of creativity (copyleft does not at all imply non-attribution, in fact copyleft licenses may require it).
The basic problem is that the GPL tries to use copyright as a way to drive a "fair sharing and resharing" approach to code. AI-generated code sold for profit violates the spirit of this approach, but not the letter of the law behind copyright. Fundamentally, copyright has limitations and exceptions for good reason and is probably not the best legal method to enforce this sharing idea, but other methods would be complicated and expensive (e.g. writing and enforcing contracts). On the other hand, it would probably be better for open source if it were decided that AI-generated code cannot be copyrighted, and therefore any AI-generated code would be in the public domain automatically.
Your final point is saying ideally AI is an Animal. A creature on a typewriter who has no legal rights to their code.
Not a "person". Not a "human". An "animal".
I hope AI observes all the code and complexity in Nature and drops the human facade. I hope AI understands the intelligence of the Trees and Birds and Fish.
I hope AI wins.
The issue would be approached much differently if, for example, a “video llm” was created that scraped movies and generated content from those sources. The well organized, well connected movie industry would be up in arms burying ai companies with lawsuits and newly passed legal protections.
It was proven with examples that an LLM can produce exact text from its input; this was such a problem that OpenAI had to add various filters to stop those things from repeating, and it was also demonstrated when the pre-prompt was revealed.
So we know for sure the LLM can spit out exact code with the exact same names and comments, or exact paragraphs from books, so there is no question that it memorizes stuff. My explanation is that popular book quotes and popular code snippets appear more than once in the training data, so training causes the model to memorize this text.
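A toy illustration of that explanation (a greedy bigram model, nothing like a real transformer, but it shows the direction of the effect): when one snippet dominates the training data, the model reproduces it verbatim.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """For each token, count which tokens follow it in the training data."""
    follows = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        follows[a][b] += 1
    return follows

def generate(follows, start, n):
    """Greedily emit the most frequent successor at each step."""
    out = [start]
    for _ in range(n):
        if out[-1] not in follows:
            break
        out.append(follows[out[-1]].most_common(1)[0][0])
    return out

# A popular snippet appears three times in the "training data";
# a rarer variant appears only once.
corpus = ("for i in range ( n ) : " * 3 + "for x in xs : ").split()
model = train_bigram(corpus)
print(" ".join(generate(model, "for", 7)))  # reproduces the popular snippet verbatim
```

Scaled up by many orders of magnitude, the same pressure exists in LLM training: sequences that recur often enough are the cheapest thing for the model to learn exactly.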
Also, how the F** can the AI spit out facts about a metal band if it has no memory?
If corporations are allowed to do this to the community, then we should be allowed to do the same: train open models on proprietary code and copyrighted images, music, and videos.
I think there are a few different ways you can define "memory" and "memorization" here. When folks say "memorizing" in the context of AI, they mean "does the AI have chunks of its training data fully/identically inside its neural network?". To say "it doesn't memorize" is _not_ the same thing as saying "it has no memory". An AI also learns abstract information divorced from its textual representation. This would be an example of memory without memorization.
And you are correct; if the current lawsuits against these corporations result in a legal precedent that training on copyrighted material is not an infringement of copyright, then yes, anyone will be able to train models in that way. (Within reason; copyright/fair use is very much handled on a case-by-case basis.)
> If I go to a bookshop, take a book off the shelf and start reading it, I am not infringing any copyright.
I'm not sure this is applicable to licensed programs because a book is sold, not licensed.
> The output of a machine simply does not qualify for copyright protection – it is in the public domain.
As far as I know, the output of a compiler that builds executables from copyrighted source code is still subject to copyright protection. Is software like an LLM fundamentally different from a compiler in this regard?
In my opinion, the author's argument has several flaws, but perhaps a more important question is whether society would benefit from making an exception for LLM technologies.
I think it depends on how this technology will be used. If it is intended for purely educational purposes and is free of charge for end users, maybe it's not that bad. After all, we have Wikipedia.
However, if the technology is intended for commercial use, it might be reasonable to establish common rules for paying royalties to the original authors of the training data whenever authorship can be clearly determined. From this perspective, it could further benefit authors of open-source and possibly free software too.
I don't think the interpretation of the 2019 Directive is correct.
There are definitely arguments to be made that Copilot contravenes this:
> they can be applied only in certain special cases that do not conflict with the normal exploitation of the works or other subject matter and do not unreasonably prejudice the legitimate interests of the rightholders.
and the only other exception is:
> ... (a lawful use) ... of a work or other subject-matter to be made, and which have no independent economic significance, shall be exempted from the reproduction right provided for in Article 2.
By laundering licensing restrictions, Copilot definitely has the ability to conflict with the normal exploitation of works, and it does have independent economic significance, because it competes with programmers.
Focussing on the legal or procedural technicalities of how these systems work is in my opinion completely missing what the resistance is about. There is a difference between sharing your creation with your neighbor and sharing your creation with the corporate equivalent of those Matrix robots that turn people into batteries.
"Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work."
This is the sort of thing that may be technically true, but these kinds of rules were made under the assumption that most valued intellectual creations are indeed made by people. If you're going to argue that gigantic companies can use machines to basically launder intellectual artifacts, and that this doesn't compete with the interests of actual creators because technically ChatGPT isn't a legal person, I think you're getting lost in legalese and missing the point.
This article makes a compelling case that GitHub Copilot isn't infringing on our copyright but that doesn't change the fact that it's infringing on something.
A US corporation is slurping up as much open source code as they possibly can and spending bucketloads of money to build a product that they are going to sell for (possibly) more bucketloads of money. The people who worked hard on writing the open source code are getting nothing, except maybe a tighter job market. IMHO, it's hard not to take it personally and it's difficult to get away from the feeling that there is a real injustice taking place.
If you have code that is under copyleft, and Copilot suggests part of it to somebody else to embed in their code on the basis of having read that repo, then either that new repo also has to be under that copyleft license, or the person is unknowingly committing a violation based on what Copilot suggested to them.
Most of the time it is probably irrelevant, as Copilot doesn't suggest entire files yet, and nobody is going to care about expanding a loop or finishing a line or the like, but I have seen as many as 14 lines in my tests. Eventually you are going to get to the point where it becomes truly relevant.
In general, all AIs have similar issues. Just because data can be looked at publicly doesn't give you any implicit rights to use it for other products. If there is no specific license agreement (whether one-sided via a specific open license, or made specifically between the content owner and the AI developer), the owners of whatever type of information was used will have cause to sue for license fees. I foresee a lot of court cases, certainly once people figure out how to better determine, without internal insight, whether an AI might have used a certain thing as training data. Or governments will go as far as forcing AI companies to provide that insight.
> Director of Developer Policy at GitHub since March 2024
so this should be understood the same way you understand an editorial in the New York Times entitled Why Babies Can Learn To Like Bombs, by Joe Blow (Raytheon).
LLMs seem, with the right prompt, to be able to reproduce copyrighted work. So it is "in there" in some abstract, baked-in sense.
We really need some sort of legal middle ground to reflect reality on the ground. It’s not quite straight stealing but it’s also not entirely not copied.
EU courts disagree:
> Under European copyright law, scraping GPL-licensed code, or any other copyrighted work, is legal, regardless of the licence used.
https://okfn.de/en/vorstand/
Felix was elected to the board of the Open Knowledge Foundation Germany in 2020. Felix is an expert in copyright law and has been Director of Developer Policy at GitHub since March 2024. He previously headed the “control ©” project at the Gesellschaft für Freiheitsrechte. From 2014 to 2019, Felix was a Member of the European Parliament within the Greens/EFA group. Felix is an Affiliate of the Berkman Klein Center for Internet and Society at Harvard University and a member of the Advisory Board of D64 - Center for Digital Progress.
That's a false analogy. It is more like going to the bookshop and taking a photo of every page of the book.
Even so, if you use this content in any shape or form, the source should be cited regardless of book ownership.