Lots of questions:

  - does the AI-generated code belong to me or to GitHub?
  - what license does the generated code fall under?
  - if generated code becomes the reason for infringement, who gets the blame or faces legal action?
  - how can anyone prove the code was actually generated by Copilot and not the project owner?
  - if a project member does not agree with the usage of Copilot, what should we do as a team?
  - can Copilot copy code from other projects and use that excerpt code?
    - if yes, *WHY* ?!
    - who is going to deal with the legalese for something he or she was not responsible for in the first place?
    - what about conflicts of interest?
  - can GitHub guarantee that Copilot won't use proprietary code excerpts in FOSS-ed projects that could lead to new "Google vs Oracle" API cases?

natfriedman|4 years ago

In general: (1) training ML systems on public data is fair use (2) the output belongs to the operator, just like with a compiler.

On the training question specifically, you can find OpenAI's position, as submitted to the USPTO here: https://www.uspto.gov/sites/default/files/documents/OpenAI_R...

We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

breck|4 years ago

You should look into:

https://breckyunits.com/the-intellectual-freedom-amendment.h...

Great achievements like this only hammer home the point more about how illogical copyright and patent laws are.

Ideas are always shared creations, by definition. If you have an “original idea”, all you really have is noise! If your idea means anything to anyone, then by definition it is built on other ideas, it is a shared creation.

We need to ditch the term “IP”, it’s a lie.

Hopefully we can do that before it’s too late.

joepie91_|4 years ago

> training ML systems on public data is fair use

Uh, I very much doubt that. Is there any actual precedent on this?

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

But apparently not eager enough to have this discussion with the community before deciding to train your proprietary for-profit system on billions of lines of code that undoubtedly are not all under CC0 or similar no-attribution-required licenses.

I don't see attribution anywhere. To me, this just looks like yet another case of appropriating the public commons.

king_magic|4 years ago

@Nat, these questions (all of them, not just the 2 you answered) are critical for anyone who is considering using this system. Please answer them?

I for one wouldn't touch this with a 10000' pole until I know the answers to these (very reasonable) questions.

stefano|4 years ago

How do you guarantee it doesn't copy a GPL-ed function line-by-line?

abn120|4 years ago

(1) That may be so, but you are not training the models on public data like sports results. You are training them on copyright-protected creations of humans that often took years to write.

So your point (1) is a distraction, and quite an offensive one to thousands of open source developers, who trusted GitHub with their creations.

qihqi|4 years ago

> (1) training ML systems on public data is fair use

This one is tricky considering that kNN is also an ML system.
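
To illustrate why kNN makes this tricky: a nearest-neighbour model keeps its training examples verbatim and "predicts" by returning one of them. A minimal sketch, assuming a hypothetical two-snippet training set and using `difflib` similarity as a stand-in distance. The snippet contents and the `complete` function are made up for illustration and say nothing about how Copilot actually works:

```python
from difflib import SequenceMatcher

# Hypothetical "training set": a 1-nearest-neighbour model stores it verbatim.
TRAINING_SNIPPETS = [
    "def add(a, b):\n    return a + b",
    "def fast_inv_sqrt(x):\n    # licensed code body here\n    return x ** -0.5",
]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a real distance metric."""
    return SequenceMatcher(None, a, b).ratio()


def complete(prompt: str) -> str:
    """1-NN 'prediction': return the stored snippet most similar to the prompt."""
    return max(TRAINING_SNIPPETS, key=lambda snippet: similarity(prompt, snippet))


if __name__ == "__main__":
    # The "model output" is simply a copy of a training example.
    print(complete("def fast_inv_sqrt(x):"))
```

Here the output is character-for-character a training example, which is the extreme case of the question about whether model output can be separated from the data it was trained on.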

stwrong|4 years ago

What about privacy? Does the AI send code to GitHub? This reminds me of Kite.

croes|4 years ago

Fair use doesn't exist in every country, so it's US only?

stephen82|4 years ago

> We expect that IP and AI will be an interesting policy discussion around the world in the coming years, and we're eager to participate!

Another question is this: let's hypothesize I work solo on a project; I have decided to enable Copilot and, after a period of time, have reached a 50/50 split in development with it. One day the "hit by a bus" factor takes place; who owns the project after this incident?

tlamponi|4 years ago

> the output belongs to the operator, just like with a compiler.

No, it really is not that easy: with compilers, it depends on who owns the source and which license(s) they applied to it.

Or would you say I can compile the Linux kernel and the output belongs to me, as compiler operator, and I can do whatever I want with it without worrying about the GPL at all?

user-the-name|4 years ago

> training ML systems on public data is fair use

So, to be clear, I am allowed to take leaked Windows source code and train an ML model on it?

patrickthebold|4 years ago

What does "public" mean? Do you mean "public domain", or something else?

dylannorthrup|4 years ago

Fair Use is an affirmative defense (i.e. you must be sued and go to court to use it; once you're there, the judge/jury will determine if it applies). But taking in code with any sort of restrictive license (even if it's just attribution) and creating a model using it is definitely creating a derivative work. You should remember, this is why nobody at Ximian was able to look at the (openly viewable, but restrictively licensed) .NET code.

Looking at the four factors for fair use, it looks like Copilot will have these issues:

  - The model developed will be for a proprietary, commercial product
  - Even if it's a small part of the model, all the training data for that model are fully incorporated into the model
  - There is a substantial likelihood of money loss ("I can just use Copilot to recreate what a top tier programmer could generate; why should I pay them?")

I have no doubt that Microsoft has enough lawyers to keep any litigation tied up for years, if not decades. But your contention that this is "okay because it's fair use" based on a position paper by an organization supported by your employer... I find that reasoning dubious at best.

deepnash|4 years ago

It is the end of copyright then. NNs are great at memorizing text. So I just train a large NN to memorize a repository and the code it outputs during "inferencing" is fair use?

You can get past GPL, LGPL and other licenses this way. Microsoft can finally copy the Linux kernel and get around GPL :-).

gpm|4 years ago

> - under what license the generated code falls under?

Is it even copyrighted? Generally my understanding is that to be copyrightable it has to be the output of a human creative process, and this doesn't seem to qualify (I am not a lawyer).

See also, monkeys can't hold copyright: https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

tlamponi|4 years ago

> Is it even copyrighted?

Isn't it subject to the licenses of the code the model was created from? The learning is basically just an automated transformation of the code, which would still be under the original license. Otherwise I could just run some minifier, or some other, more elaborate code transformation, on some FOSS project, for example the Linux kernel, and relicense it under whatever I want.

Does not sound right to me, but IANAL and I also did not really look at how this specific model (or models) is generated.

If I did some AI on existing code I'd be quite cautious and group by compatible licence classes, asking the user what their project's licence is and then only use the compatible parts of the models. Anything else seems not really ethical and rather uncharted territory in law to me, which may not mean much as IANAL and just some random voice on the internet, but FWIW at least I tried to understand quite a few FOSS licences to decide what I can use in projects and what not.
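
The grouping idea above could be sketched roughly like this. It is only an illustration under loud assumptions: the `COMPATIBLE_SOURCES` table is hypothetical and deliberately oversimplified (not legal advice), and `CORPUS` and `usable_snippets` are made-up names, not part of any real tool:

```python
# Hypothetical, oversimplified table: for each target project licence,
# the source licences assumed to be usable in it. Not legal advice.
COMPATIBLE_SOURCES = {
    "MIT": {"MIT", "CC0-1.0"},
    "GPL-3.0": {"MIT", "CC0-1.0", "GPL-3.0"},
}

# Hypothetical training corpus as (licence, code) pairs.
CORPUS = [
    ("MIT", "def a(): pass"),
    ("GPL-3.0", "def b(): pass"),
    ("Proprietary", "def c(): pass"),
]


def usable_snippets(corpus, project_licence):
    """Keep only code whose licence is listed as compatible with the project.

    Unknown project licences get an empty allow-list, so nothing is used.
    """
    allowed = COMPATIBLE_SOURCES.get(project_licence, set())
    return [code for licence, code in corpus if licence in allowed]


if __name__ == "__main__":
    print(usable_snippets(CORPUS, "MIT"))      # only MIT / CC0 sources
    print(usable_snippets(CORPUS, "GPL-3.0"))  # MIT, CC0 and GPL-3.0 sources
```

Defaulting to an empty allow-list for unknown licences is the cautious choice: when in doubt, nothing from the corpus is used.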

Does anybody know of relevant cases about AI and the input data its model was built from, ideally in US or European jurisdictions?

lawtalkinghuman|4 years ago

In the US, yes. Elsewhere, not necessarily.

croes|4 years ago

It is the output of humans' creative processes, just not yours. Like an automated Stack Overflow snippet engine.

agilob|4 years ago

> Generally my understanding is that to be copyrightable it has to be the output of a human creative process

https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput...

natfriedman|4 years ago

You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs

viccuad|4 years ago

> You should read the FAQ at the bottom of the page; I think it answers all of your questions: https://copilot.github.com/#faqs

Read it all, and the questions still stand. Could you, or anyone on your team, point me to where the questions are answered?

In particular, the FAQ doesn't assure that the "training set from publicly available data" doesn't contain license or patent violations, nor whether that code is considered tainted for a particular use.

samtheprogram|4 years ago

The most important question, whether you own the code, is sort of maybe vaguely answered under “How will GitHub Copilot get better over time?”

> You can use the code anywhere, but you do so at your own risk.

Something more explicit than this would be nice. Is there a specific license?

EDIT: also, there’s multiple sections to a FAQ, notice the drop down... under “Do I need to credit GitHub Copilot for helping me write code?”, the answer is also no.

Until a specific license (or explicit lack thereof) is provided, I can’t use this except to mess around.

dvaun|4 years ago

None of the questions and answers in this section hold information about how the generated code affects licensing. None of the links in this section contain information about licensing, either.

netcraft|4 years ago

I don't see the answer to a single one of their questions on that page - did you link to where you intended?

Edit: you have to click the things on the left, I didn't realize they were tabs.

kitsune_|4 years ago

Sorry Nat, but I don't think it really answers anything. I would argue that using GPL code during training makes Copilot a derivative work of said code. I mean, if you look at how a language model works, then it's pretty straightforward. The term "code synthesizer" alone insinuates as much. I think this will probably ultimately be tested in court.

rozab|4 years ago

This page has a looping back button hijack for me

amelius|4 years ago

Does Copilot phone home?

gpm|4 years ago

When you sign up for the waitlist it asks permission for additional telemetry, so yes. Also the "how it works" image seems to show the actual model is on GitHub's servers.

heavyset_go|4 years ago

Yes, and with the code you're writing/generating.

chuinard|4 years ago

Some of your questions aren't easy to answer. Maybe the first two were OK to ask. Others would probably require lawyers and maybe even courts to decide. This is a pretty cool new product just being shared on an online discussion forum. If you are serious about using it for a company, talk to your lawyers, get in touch with GitHub's people, and maybe hash out these very specific details on the side. Your comment came off as super negative to me.

Tainnor|4 years ago

> This is a pretty cool new product just being shared on an online discussion forum.

This is not one lone developer with a passion promoting their cool side-project. It's GitHub, which is an established brand and therefore already has a leg up, promoting their new project for active use.

I think in this case, it's very relevant to post these kinds of questions here, since other people will very probably have similar questions.

peddling-brink|4 years ago

I think these are very important questions.

The commenter isn't interrogating some indie programmer. This is a product of a subsidiary of Microsoft, who I guarantee has already had a lawyer, or several, consider these questions.

king_magic|4 years ago

No, they are all entirely reasonable questions. Yeah, they might require lawyers to answer - tough shit. Understanding the legal landscape that one's product lives in is part of a company's responsibility.

ericbarrett|4 years ago

Regardless of tone, I thought it was chock full of great questions that raised all kinds of important issues, and I’m really curious to hear the answers.