natfriedman | 4 years ago

It shouldn't do that, and we are taking steps to avoid reciting training data in the output: https://copilot.github.com/#faq-does-github-copilot-recite-c... https://docs.github.com/en/early-access/github/copilot/resea...

In terms of the permissibility of training on public code, the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use. We are certain this will be an area of discussion in the US and around the world and we're eager to participate.

SCLeo|4 years ago

> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

To be honest, I doubt that. Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model, which will be used in a closed source software generating code for closed source projects.

zarzavat|4 years ago

The whole point of fair use is that it allows people to copy things even when the copyright holder doesn't want them to.

For example, if I am writing a criticism of an article, I can quote portions of that article in my criticism, or modify images from the article in order to add my own commentary. Fair use protects against authors who try to exert so much control over their works that it harms the public good.

yjftsjthsd-h|4 years ago

Is it any different than training a human? What if a person learned programming by hacking on GPL public code and then went to build proprietary software?

manquer|4 years ago

Perhaps we need GPL v4. I don't think there is any clause in the current v2/v3 that prohibits learning from the code, only clauses about using the code in other places and running a service with the code.

colinbartlett|4 years ago

Would you be okay with a human reading your GPL code and learning how to write closed source software for closed source projects?

dragonwriter|4 years ago

> To be honest, I doubt that.

Okay, but that's...not much of a counterargument (to be fair, the original claim was unsupported, though.)

> Maybe I am special, but if I am releasing some code under GPL, I really don't want it to be used in training a closed source model

That's really not a counterargument. “Fair use” is an exception to exclusive rights under copyright, and renders the copyright holder’s preferences moot to the extent it applies. The copyright holder not being likely to want it based on the circumstances is an argument against it being implicitly licensed use, but not against it being fair use.

__MatrixMan__|4 years ago

> a closed source model

Some of the chatter around this seems to imply that the resultant code might still have some GPL on it. But it seems to me that it's the trained model that Microsoft should have to make available on request.

rowanG077|4 years ago

That's the point of fair use. To do something with a material the original author does not want.

slownews45|4 years ago

This is what is so miserable about the GPL progression. We went from GPLv2 (preserving everyone's rights to use code) to GPLv3 (you have to give up your encryption keys). I think we've lost the GPL as a place where we could answer these types of questions, which are good ones. The GPL tanked a lot of trust with the (A)GPLv3 stuff, especially around prohibiting other developers from specific uses of the code (which is diametrically opposed to earlier versions, which preserved rights).

npteljes|4 years ago

> ...the jurisprudence here – broadly relied upon by the machine learning community – is that training ML models is fair use.

If you train an ML model on GPL code, and then make it output some code, would that not make the result a derivative of the GPL-licensed inputs?

But I guess this could be similar to musical composition. If the output doesn't resemble any of the inputs or contain significant continuous portions of them, then it's not a derivative.

IncRnd|4 years ago

> If the output doesn't resemble any of the inputs or contain significant continuous portions of them, then it's not a derivative.

In this particular case, the output resembles the inputs, or there is no reason to use GitHub Copilot.

jazzyjackson|4 years ago

> It shouldn't do that, and we are taking steps to avoid reciting training data in the output

This just gives me a flashback to copying homework in school, “make sure you change some of the words around so it’s not obvious”

I’m sure you’re right Re: jurisprudence, but it never sat right with me that AI engineers get to produce these big, impressive models but the people who created the training data will never be compensated, let alone asked. So I posted my face on Flickr, how should I know I’m consenting to benefit someone’s killer robot facial recognition?

ramraj07|4 years ago

Wait I thought y'all argued Google didn't copy Java for Android, now that big tech is copying your code you're crying wolf?

Hamuko|4 years ago

>training ML models is fair use

How does that apply to countries where Fair Use is not a thing? As in, if you train a model on a fair use basis in the US and I start using the model somewhere else?

Asmod4n|4 years ago

Fair use doesn’t exist in Germany.

KMnO4|4 years ago

I don’t think it’s fair to ask a US company to comment on legalities outside of the US.

sicromoft|4 years ago

You just shared a URL that says "Please do not share this URL publicly".

jamie_ca|4 years ago

Well, he's also GitHub's CEO so it's probably just fine.

eqtn|4 years ago

Would I be able to use something like this in the near future to produce a proprietary Linux kernel?

CyberRabbi|4 years ago

> training ML models is fair use

In what context? You are planning on commercializing Copilot, and in that case the calculus on whether using copyright-protected material for your own benefit is fair use changes drastically.

josourcing|4 years ago

It isn't. US copyright law says brief excerpts of copyright material may, under certain circumstances, be quoted verbatim

----> for purposes such as criticism, news reporting, teaching, and research <----, without the need for permission from or payment to the copyright holder.

Copilot is not criticizing, reporting, teaching, or researching anything. So claiming fair use is the result of total ignorance or disregard.