You bring up a really good point. I'm super curious what the legality and ethics around training machines on licensed or even proprietary code would be. IIRC there are implications around code you can build if you've seen proprietary code (I remember an article from HN about how bash had to be written by someone who hadn't seen the unix shell code or something like that).
How would we classify that legally when it comes to training and generating code? Would you argue the machine is just picking up best practices and patterns, or would you say it has gained specifically-licensed or proprietary knowledge?
I would argue that a trained model falls under the legal category of "compilation of facts".
More generally, keep in mind that the legal world, despite an apparent focus on definitions, is very bad at dealing with novelty, and most of it ends up justifying existing practices a posteriori.
This is a bit tricky, because at least in the U.S., I don't believe it's a settled question in law yet. Some of the other posters on here have said that the resulting model isn't covered by GPL--that's partially true, but the provenance of the data, and the rights to it, definitely do matter. A good example of this was the Everalbum ruling, where the company was forced to delete both the data and the trained models that were generated from it, due to lack of consent from the users from whom the data was taken[1]. Since open source code is, well, open, it's definitely less of a problem for permissively-licensed code.
That said, copyright is typically assigned to the closest human to the activation process (it's unlikely that Github is going to try to claim the copyright to code generated by Copilot over the human/company pair-programming with it). But since copyleft is pretty domain-specific to software, afaik the way that courts will interpret the legality of using code licensed under those terms in the training data for a non-copyleft-producing model is still up in the air.
Obligatory IANAL, and also happy to adjust this info if someone has sources demonstrating updates on the current state.
No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").
But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code that are pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples taking some text from each. That's called "derivative work".
My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL, that's the entire point.
Anyways, GitHub is Microsoft, and Microsoft has really good lawyers, so I guess they did everything necessary to make sure that you can use it the way they tell you to. The most obvious solution would be to filter by LICENSE.txt and only train the model with code under permissive licenses.
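That filtering step could look something like this minimal sketch. It's purely hypothetical (crude substring matching on license files); real license detection would need SPDX-style identifiers or tooling like licensee or scancode:

```python
# Hypothetical sketch: keep only repos whose LICENSE file looks permissive,
# so GPL/copyleft code never enters the training corpus.
from pathlib import Path

PERMISSIVE_MARKERS = (
    "mit license",
    "bsd 2-clause",
    "bsd 3-clause",
    "apache license",
    "the unlicense",
)

def is_permissive(repo_dir: str) -> bool:
    """Return True if the repo's license file matches a permissive marker.

    Crude substring matching for illustration only; a real pipeline
    would need proper license detection.
    """
    for name in ("LICENSE", "LICENSE.txt", "LICENSE.md", "COPYING"):
        path = Path(repo_dir) / name
        if path.is_file():
            text = path.read_text(errors="ignore").lower()
            return any(marker in text for marker in PERMISSIVE_MARKERS)
    return False  # no license file at all: exclude rather than assume

def filter_corpus(repo_dirs):
    """Keep only the repos that pass the permissive-license check."""
    return [d for d in repo_dirs if is_permissive(d)]
```

Of course this punts on the hard cases (dual licensing, vendored third-party code inside a permissive repo, license files in subdirectories), which is presumably where the lawyers come in.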
The trained model is a derivative work that contains copies of the corpus used for training embedded in the model. If any of the training code was GPL the output is now covered by GPL. The music industry has already done most of the heavy lifting here in terms of scope and nature of derived works, and while IANAL I would not suggest that it looks good for anyone using this tool if GPL code was in the training set.
I seem to remember a similar discussion about Intellicode (a similar thing, but more like Intellisense, shipped as a Visual Studio plugin), which is trained on "github projects with more than 100 stars". IIRC they check the LICENSE.txt file in the project and ignore projects with an "incompatible" license. I don't have any links handy which would confirm this though.
My guess would be that the model itself (and the training process) could have different legal requirements compared to the code it generates. The code generated by the model is probably a sufficiently transformative new work that it wouldn't be GPL (it's "fair use").
I suspect there could be issues on the training side, using copyrighted data for training without any form of licensing. Typically ML researchers have a pretty free-for-all attitude towards 'if I can find data, I can train models on it.'
No, the code generated is what copyright law calls a derivative work, and you should go ask Robin Thicke and Pharrell Williams exactly how much slack the courts give for 'sufficiently transformative new work'.
I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human I still own code that I later write. If copyright was enforced on the outputs of machine learning models based on all content they were trained on it would be incredibly stifling to innovation. Requiring obtaining legal access to data for training but full ownership of output seems like a sensible middle ground.
Certainly not. If I memorize a line of copyrighted code and then write it down in a different project, I have copied it. If an ML model does the same thing as my brain - memorizing a line of code and writing it down elsewhere - it has also copied it. In neither case is that "fair use".
2) if I write a program that copies parts of other GPL licensed SW into my proprietary code, does that absolve me of GPL if the copying algorithm is complicated enough?
IMO the closest case is probably the students suing turnitin a number of years ago, which iParadigms (the turnitin maker) won [1].
I think this is definitely a gray area and in some way iParadigms winning (compared to all the cases decided in favour of e.g. the music industry), shows the different yardsticks being used for individuals and companies.
IANAL, but my interpretation of the GitHub TOS section D4 is that it gives GitHub the right to parse your code and/or make copies regardless of what your license states. This is the same reason the GitHub search index isn’t GPL contaminated.
Developers' human brains are also trained on proprietary code bases; when they quit and go elsewhere, they program using knowledge learned previously, yet you do not sue them.
We kinda have to accept that - we don't have to accept this. You can't interface with humans but you can interface with one of the biggest corporate tech giants straight up leeching from explicitly public and free work for their own private benefit.
andrewstuart2|4 years ago
Iv|4 years ago
cyberfart|4 years ago
I believe you're referring to Clean Room Design[1].
[1] https://en.wikipedia.org/wiki/Clean_room_design
qchris|4 years ago
[1] https://techcrunch.com/2021/01/12/ftc-settlement-with-ever-o...
devetec|4 years ago
blibble|4 years ago
imagine if the output was ruled as being GPLv2, then having to go through a proprietary codebase trying to rip out these bits of code
it would be basically impossible
dekhn|4 years ago
not2b|4 years ago
GuB-42|4 years ago
6gvONxR4sf7o|4 years ago
This line of thinking applies to the code generated by the model, but not necessarily to the model itself, or the training of it.
evgen|4 years ago
akersten|4 years ago
f38zf5vdt|4 years ago
flohofwoe|4 years ago
uticus|4 years ago
I was wondering the same thing, especially with MS being behind both.
edited: or this? https://docs.microsoft.com/en-us/visualstudio/intellicode/cu...
6gvONxR4sf7o|4 years ago
evgen|4 years ago
throwawaygh|4 years ago
Almost certainly not everything.
But possibly things that were spit out verbatim from the training set, which the FAQ mentions does happen about .1% of the time [1]. Another comment in this thread indicated that the model outputs something that's verbatim usable about 10% of the time. So, taking those two numbers together, if you're using a whole generated function verbatim, a bit of caveat emptor re: licensing might not be the worst idea. At least until the origin tracker mentioned in the FAQ becomes available.
[1] https://docs.github.com/en/early-access/github/copilot/resea...
[2] "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions."
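In its simplest form, the "origin tracker" described in that FAQ amounts to checking whether a suggestion shares a long verbatim token run with the training corpus. A toy sketch, purely hypothetical and unrelated to GitHub's actual implementation (which would need a real lexer and an index rather than a linear scan):

```python
# Hypothetical sketch of an "origin tracker": flag a suggestion if any
# run of n consecutive tokens appears verbatim in a training-corpus file.
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(suggestion: str, corpus_files, n: int = 12) -> bool:
    """True if the suggestion shares an n-token window with any corpus file.

    Whitespace tokenization only, for illustration; n controls how long
    a match must be before it counts as "verbatim" rather than idiom.
    """
    windows = ngrams(suggestion.split(), n)
    for text in corpus_files:
        if windows & ngrams(text.split(), n):
            return True
    return False
```

The interesting policy question is hidden in `n`: too small and every common idiom gets flagged, too large and short but distinctive copied snippets slip through.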
pjfin123|4 years ago
sanderjd|4 years ago
megous|4 years ago
tomthe|4 years ago
maxhille|4 years ago
Not a critique on your point, which I was just about to bring up myself.
unknown|4 years ago
[deleted]
cycomanic|4 years ago
I'm sure we will see more cases about this.
[1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-wins-t...
teekert|4 years ago
IMTDb|4 years ago
devetec|4 years ago
542458|4 years ago
gutino|4 years ago
wraptile|4 years ago
kp302|4 years ago
iwintermute|4 years ago
There're definitely cases where devs avoid even looking at an implementation before creating their own.