top | item 27676712

iwintermute | 4 years ago

So if it was trained using "source code from publicly available sources, including code in public repositories on GitHub," is the model itself also GPLv2?

So everything generated also GPLv2?


andrewstuart2|4 years ago

You bring up a really good point. I'm super curious what the legality and ethics of training machines on licensed or even proprietary code would be. IIRC there are legal implications around code you can write if you've seen proprietary code (I remember an article on HN about how bash had to be written by someone who hadn't seen the Unix shell source, i.e. a clean-room implementation).

How would we classify that legally when it comes to training and generating code? Would you argue the machine is just picking up best practices and patterns, or would you say it has gained specifically-licensed or proprietary knowledge?

Iv|4 years ago

I would argue that a trained model falls under the legal category of "compilation of facts".

More generally, keep in mind that the legal world, despite its apparent focus on definitions, is very bad at dealing with novelty, and much of it will end up justifying existing practices a posteriori.

qchris|4 years ago

This is a bit tricky, because at least in the U.S., I don't believe it's a settled question in law yet. Some of the other posters here have said that the resulting model isn't covered by the GPL. That's partially true, but the provenance of data, and the rights to it, definitely do matter. A good example of this was the Everalbum ruling, where the company was forced to delete both the data and the trained models it was used to generate, due to lack of consent from the users from whom the data was taken[1]. Since open source code is, well, open, it's definitely less of a problem for permissively-licensed code.

That said, copyright is typically assigned to the closest human to the activation process (it's unlikely that Github is going to try to claim copyright over code generated by Copilot against the human/company pair-programming with it). But since copyleft is pretty specific to software, afaik the way courts will interpret the legality of using code licensed under those terms as training data for a non-copyleft-producing model is still up in the air.

Obligatory IANAL, and also happy to adjust this info if someone has sources demonstrating updates on the current state.

[1] https://techcrunch.com/2021/01/12/ftc-settlement-with-ever-o...

blibble|4 years ago

until the legal position is clear, you'd have to be insane to allow output from this process to be incorporated into your codebases

imagine if the output were ruled to be GPLv2, and then having to go through a proprietary codebase trying to rip out these bits of code

it would be basically impossible

dekhn|4 years ago

No, a model trained on text covered by a license is not itself covered by the license, unless it explicitly copies the text (you cannot copyright a "style").

not2b|4 years ago

But it actually is explicitly copying the text. That's how it works. The training data are massive, and you will get long strings of code pulled directly from that training data. It isn't giving you just the style. It may be mashing together several different code examples, taking some text from each. That's called a "derivative work".
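For illustration, detecting that kind of verbatim reuse is mechanically simple. Here's a minimal hypothetical sketch (not GitHub's actual tooling) that flags generated text sharing a long token run with a training corpus:

```python
# Hypothetical sketch: flag generated code that shares a long verbatim
# token run with a training corpus. Token-based n-gram matching only;
# real provenance tooling would also normalize whitespace, comments, etc.

def ngrams(tokens, n):
    """All contiguous runs of n tokens, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generated, corpus_texts, run_length=10):
    """True if any run of `run_length` tokens from `generated`
    appears verbatim in any corpus text."""
    gen_runs = ngrams(generated.split(), run_length)
    return any(gen_runs & ngrams(text.split(), run_length)
               for text in corpus_texts)

corpus = ["for (int i = 0; i < n; i++) { sum += a[i]; }"]
print(verbatim_overlap("for (int i = 0; i < n; i++) { sum += a[i]; }",
                       corpus, run_length=8))  # True
```

The `run_length` threshold is the interesting knob: short runs match incidentally ("common, perhaps even universal" solutions), long runs suggest actual copying.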

GuB-42|4 years ago

My guess is that it is, if we think of a machine learning framework as a compiler and the model as compiled code. Compiled GPL code is still GPL, that's the entire point.

Anyways, GitHub is Microsoft, and Microsoft has really good lawyers, so I guess they did everything necessary to make sure that you can use it the way they tell you to. The most obvious solution would be to filter by LICENSE.txt and only train the model with code under permissive licenses.
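As a sketch of what that filtering step could look like (purely hypothetical; the marker strings and repo layout are assumptions, and real license detection needs proper tooling such as licensee or scancode):

```python
# Hypothetical sketch of filtering repos to permissively licensed code
# before training. Keyword matching on LICENSE text is only a crude
# approximation of real license detection.

PERMISSIVE_MARKERS = ("MIT License", "Apache License", "BSD")
COPYLEFT_MARKERS = ("GNU GENERAL PUBLIC LICENSE", "GNU LESSER")

def is_permissive(license_text):
    """Crude classification of a LICENSE file's text."""
    if any(m in license_text for m in COPYLEFT_MARKERS):
        return False
    return any(m in license_text for m in PERMISSIVE_MARKERS)

repos = {
    "repo-a": "MIT License\n\nPermission is hereby granted...",
    "repo-b": "GNU GENERAL PUBLIC LICENSE\nVersion 2, June 1991",
}
training_set = [name for name, lic in repos.items() if is_permissive(lic)]
print(training_set)  # ['repo-a']
```

Note the unknown-license case defaults to excluded here, which is the conservative choice for a training pipeline.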

6gvONxR4sf7o|4 years ago

> you cannot copyright a "style"

This line of thinking applies to the code generated by the model, but not necessarily to the model itself, or the training of it.

evgen|4 years ago

The trained model is a derivative work that contains copies of the corpus used for training embedded in the model. If any of the training code was GPL, the output is now covered by the GPL. The music industry has already done most of the heavy lifting here in terms of the scope and nature of derived works, and while IANAL, I would suggest it does not look good for anyone using this tool if GPL code was in the training set.

akersten|4 years ago

Well, it probably is explicitly copying at least some subset of the source text - otherwise the code would be syntactically invalid, no?

f38zf5vdt|4 years ago

There will almost certainly be cases where it copies exact lines. When working with GPT2 I got whole chunks of news articles.

flohofwoe|4 years ago

I seem to remember a similar discussion on Intellicode (a similar thing, but more like Intellisense, delivered as a Visual Studio plugin), which is trained on "github projects with more than 100 stars". IIRC they check the LICENSE.txt file in the project and ignore projects with an "incompatible" license. I don't have any links handy which would confirm this though.

6gvONxR4sf7o|4 years ago

My guess would be that the model itself (and the training process) could have different legal requirements compared to the code it generates. The code generated by the model is probably sufficiently transformative new work that it wouldn't be GPL (it's "fair use").

I suspect there could be issues on the training side, using copyrighted data for training without any form of licensing. Typically ML researchers have a pretty free-for-all attitude towards 'if I can find data, I can train models on it.'

evgen|4 years ago

No, the code generated is what copyright law calls a derivative work, and you should go ask Robin Thicke and Pharrell Williams exactly how much slack the courts give for "sufficiently transformative new work".

throwawaygh|4 years ago

> So everything generated also GPLv2?

Almost certainly not everything.

But possibly things that were spat out verbatim from the training set, which the FAQ mentions happens about 0.1% of the time [1][2]. Another comment in this thread indicated that the model outputs something that's verbatim usable about 10% of the time. So, taking those two numbers together, if you're using a whole generated function verbatim, a bit of caveat emptor re: licensing might not be the worst idea. At least until the origin tracker mentioned in the FAQ becomes available.

[1] https://docs.github.com/en/early-access/github/copilot/resea...

[2] "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set. Here is an in-depth study on the model’s behavior. Many of these cases happen when you don’t provide sufficient context (in particular, when editing an empty file), or when there is a common, perhaps even universal, solution to the problem. We are building an origin tracker to help detect the rare instances of code that is repeated from the training set, to help you make good real-time decisions about GitHub Copilot’s suggestions."
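Taking those two figures at face value, a back-of-envelope combination (assuming, unrealistically, that "contains a verbatim snippet" and "accepted verbatim" are independent events) looks like:

```python
# Back-of-envelope only: combines the FAQ's 0.1% verbatim-snippet rate
# with the thread's ~10% "usable verbatim" figure, assuming the two
# events are independent (they very likely aren't).

p_snippet = 0.001    # suggestion contains a verbatim training-set snippet
p_accepted = 0.10    # suggestion is accepted verbatim by the user

expected_per_10k = round(10_000 * p_snippet * p_accepted, 6)
print(expected_per_10k)  # 1.0, i.e. roughly one risky acceptance per 10k suggestions
```

In practice the events probably correlate (boilerplate is both common in training data and likely to be accepted as-is), so this is a floor, not an estimate.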

pjfin123|4 years ago

I think this would fall under any reasonable definition of fair use. If I read GPL (or proprietary) code as a human, I still own code that I later write. If copyright were enforced on the outputs of machine learning models based on all the content they were trained on, it would be incredibly stifling to innovation. Requiring legal access to data for training, but granting full ownership of the output, seems like a sensible middle ground.

sanderjd|4 years ago

Certainly not. If I memorize a line of copyrighted code and then write it down in a different project, I have copied it. If an ML model does the same thing as my brain - memorizing a line of code and writing it down elsewhere - it has also copied it. In neither case is that "fair use".

megous|4 years ago

1) this is not a human, it's software

2) if I write a program that copies parts of other GPL-licensed software into my proprietary code, does that absolve me of the GPL if the copying algorithm is complicated enough?

tomthe|4 years ago

What if I put a licence on my Github-repositories that explicitly forbids the use of my code for machine-learning models?

maxhille|4 years ago

And so it begins: We start applying human rights to AIs.

Not a critique of your point, which I was just about to bring up myself.

cycomanic|4 years ago

IMO the closest case is probably the students suing Turnitin a number of years ago, which iParadigms (the Turnitin maker) won [1].

I think this is definitely a gray area, and in some ways iParadigms winning (compared to all the cases decided in favour of e.g. the music industry) shows the different yardsticks being used for individuals and companies.

I'm sure we will see more cases about this.

[1] https://www.plagiarismtoday.com/2008/03/25/iparadigms-wins-t...

teekert|4 years ago

Is what a human generates GPLv2 because it learned from GPLv2 code?

IMTDb|4 years ago

What if a human copies GPLv2 code?

542458|4 years ago

IANAL, but my interpretation of the GitHub TOS section D4 is that it would give GitHub the right to parse your code and/or make copies regardless of what your license states. This is the same reason the GitHub search index isn't GPL-contaminated.

gutino|4 years ago

Developers' brains are also trained on proprietary code bases; when they quit and go elsewhere, they program using knowledge learned previously, yet you do not sue them.

wraptile|4 years ago

We kinda have to accept that for humans; we don't have to accept this. You can't litigate against what's inside a human's head, but you can take on one of the biggest corporate tech giants straight up leeching from explicitly public and free work for its own private benefit.