An open source lawyer’s view on the copilot class action lawsuit

belorn|3 years ago

A very interesting interpretation of the github TOS. Kate Downing is saying that users of github are giving a special license to GitHub, one that bypasses the original license. However if that is true then any upload of code that users do not have 100% copyright control of is then a copyright violation since the user would not have the authority to grant github that special license. It would be similar to a user uploading a copyrighted movie to youtube, and google using that as a license to use the movie in an advertisement.

I wonder if a court would think that Microsoft in this case has done their due diligence to verify that the license grants that they got from users are correct and in order.

hyperman1|3 years ago

I also wondered about this when I read the TOS.

e.g. 4. [..] You grant us [..] the right to [..] parse, and display Your Content [..] as necessary to provide the Service, This license includes [...] show it to [...] other users; parse it into a search index or otherwise analyze it

As the Service now includes copilot, publishing anything on Github seems to give them the right to use it in copilot. Maybe even for private repos

Besides the issue we're currently discussing, I also wonder about:

5. [..] you grant each User of GitHub a [..] license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).

So if you find GPLed content on github, you might be allowed to violate the GPL as long as it happens only on github. I don't know how bad this is in practice. Their CI presumably allows you to run code for other people without granting them the rights the GPL should give them, but that might be a violation of the Github TOS as this might be abuse of the CI servers.

This might also mean you violate the GPL when publishing someone else's GPLed code on github, as you now granted Microsoft and others rights not included in the GPL.

Clearly, IANAL, don't know how valid this reading is, but publishing anything you didn't write yourself might not be on a very stable legal basis.

https://docs.github.com/en/site-policy/github-terms/github-t...

dathinab|3 years ago

It also falls under the aspect of "hidden surprises" which could mean that this part of the TOS wrt. this specific aspect might not be legally binding/valid. At least in the EU. Or it might.

TazeTSchnitzel|3 years ago

> if that is true then any upload of code that users do not have 100% copyright control of is then a copyright violation since the user would not have the authority to grant github that special license

That doesn't sound right. Licences can allow sublicensing, and I think all the popular open-source ones do.

lindenksv1|3 years ago

Kate Downing here. This is an excellent question. So, just like YouTube, GitHub would likely argue that they are protected by the DMCA and that so long as they comply with DMCA take-down requests, they are not liable for copyright infringement (direct or indirect) for third party content posted to GitHub by people other than the copyright owners. Remember that the DMCA effectively shifts that due diligence you speak of away from providers of online services and onto copyright holders themselves. Without the DMCA, many businesses that rely on user-generated content just wouldn't exist because that due diligence isn't possible at scale - it's often not even possible for individual pieces of content because the publication of any copyrighted work can be very obscure and because in the US you can hold a copyright without formally registering it.

In practice, I think the entire open source world knows that people post each other's open source code on GitHub. Even projects that have very purposefully chosen to primarily use other services or self-host their source code are well aware that their code gets mirrored on GitHub and/or included in other people's repos on GitHub. Up until now, I don't think this has been controversial and I don't think GitHub gets a lot of takedown requests for this practice. I think most developers see this as a feature, not a bug. Copilot might make people rethink whether or not they want to start sending take-down requests but that'll be a tough call for a lot of people because withholding code from GitHub to avoid its usage in Copilot also effectively means making their code less easily available to the rest of the world. It may be very disruptive to other projects that include the copyright owner's code in their own projects.

Andrew_nenakhov|3 years ago

A hypothetical question: imagine a filmmaker, who had studied a lot of obviously copyrighted movies by famous renowned directors. This means he has trained his neural network using their copyrighted licensed content. Does he breach copyright when he composes and films a scene? Are visual quotes copyright theft? Homages? Did George Lucas infringe copyright when he was borrowing compositions from "Triumph of the Will"?

ssivark|3 years ago

Just because machine learning uses the word “learning” doesn’t mean it “learns” in the same way a human mind does — that analogy is doing a lot of load bearing in your argument, and needs proving why the program’s nature of creative remixing (for lack of a better word) is the same as a human’s. Right now it seems like you’re just reusing the same word for two phenomena we don’t understand, and therefore claiming they’re equivalent.

See Marvin Minsky’s comment regarding “suitcase words”.

dathinab|3 years ago

If his "composing a scene" means copy pasting clips of the movies he studied and smoothing things over, then yes that would be obvious infringement.

And that is what Copilot's AI mostly does.

It doesn't "understand the concepts and reproduce something alike" in the sense a human does. It might understand some concepts here and there but it also does a lot of heavy lifting by verbatim "remembering" (i.e. copy pasting) code.

This is also why some people argue that the cases for copilot and some of the image generation networks are different as some of the image generation networks get much closer to "understanding and reproducing a style". (Though potentially just by it being much easier to blend over copy-pasted snippets in images to a point it's unrecognizable.)

One of the main problems GitHub has IMHO is that anyone who has studied such generative methods knows that:

1) they are prone to copy-pasting

2) you don't know what they remembered (i.e. stored copies of in an obscure, human-unreadable encoding, i.e. just distributing such a network can be a copyright infringement)

3) you don't know when they copy paste

4) the copy pasted code often is a bit obscured, ironically (and coincidentally) often comparable with how someone who knowingly commits copyright theft would obscure the code to avoid automated detection

Which means GitHub knowingly accepted and continued with tricking its Copilot users into committing copyright infringement under the assumption that such infringement is most times obscured enough to evade automatic detection....

jackdaniel|3 years ago

I see this argument over and over again, and it is so flawed that it is hard to bear.

There is no equal sign between a person and a program.

There is also that thing called "scale" that is critical to the interpretation of the action.

Is eating meat fine? - maybe. Is eating all animals OK? - Hmm...

uklgrant|3 years ago

Humans are not neural networks, that's just a thesis.

Even novelists do not sit all day long in a closed room reading other people's work and then do a collage of what they've read. Otherwise no books would have been written in the first place.

Cut the AI off humans' work, let it interact with the real world and see what it produces. It will be nothing.

Once (if ever?) an AI is capable of producing an actual original work, I'm fine with other AIs stealing from the first one. Please leave humans alone.

polaris64|3 years ago

A difference is that I can't just spin up a copy of George Lucas on my GPU in seconds and request it to produce something from a prompt like "a disappointing prequel".

orangesite|3 years ago

Your magic box is not a film maker and the inputs you are encoding with it are verbatim file content. Said content belonging to someone else.

Please study the series of events that unfolded in the music industry after folk began incorporating recordings made by other artists in their own work and proceeded to sell the result.

Spoiler: The deeply nuanced question of feeding a mechanical recording through a series of complex physical and mathematical apparatus and whether that constituted a transformational creative act did not come up during the proceedings or final judgements!

badcppdev|3 years ago

I like the scenario: Imagine I've hired an assistant with an eidetic memory who has read loads of books. I pay them to help me write a book and they reproduce a few paragraphs from a different book into my book.

Am I violating copyright? Yes

Imagine they change the character names in those paragraphs. Am I still violating copyright? Yes

At some point you can change enough of the text to not violate copyright. The grey area involves the courts.

It feels very simple to me so I might be missing something.

jillesvangurp|3 years ago

The beauty of the law is that it does not take such philosophical things into consideration. The only thing that matters is the text of the law and its documented interpretation in various court cases. That's why copyright is excluded from this court case because there are a lot of documented interpretations of fair use. Which also apply here.

The simple layman's version of copyright is that copyright applies to a specific form of a thing and not about the ideas behind that thing.

So, no, George Lucas was not infringing anything. Nor is hip hop music making use of samples infringing anything. Or Andy Warhol integrating photos into his works. Nor is it illegal to paraphrase or refer to other authors. And as Oracle found out by challenging it in court, trying to claim ownership over APIs to prevent third party implementations is also not going to work.

All of that falls under fair use. Fair use is what makes copyright useful. Without it you'd have to live in fear that legal copyright holders might come after you if you apply the ideas that you might have been exposed to via their copyrighted work. Fair use exists such that you can make use of information provided to you via a copyrighted work.

sensanaty|3 years ago

Philosophical bullshitting aside (and it really is philosophical bullshitting), I just genuinely don't care if a human or a machine "think" or "learn" in the same way.

I don't want Github or any other megacorp-backed entity abusing the open source community in the way micro$oft is here, it's as simple as that. If they wish to train it on entirely proprietary Microsoft code, then by all means go nuts, but to take the work of open source projects and to hide behind the pretense of the mathematical model behind the A"I" learning something is simply ridiculous to me.

I find it quite curious that they're not doing that (training it on their own codebase). Perhaps they're afraid of their little intelligence spitting out proprietary code verbatim like it's been shown to do many times with licensed open source code.

jules|3 years ago

I bet he would if his movie scene is pixel for pixel identical to the scene he watched.

6stringmerc|3 years ago

No.

Next hypothetical.

steve_gh|3 years ago

Hmmm. I'm interested in the GitHub ToS, which (if I understand correctly) basically says that GitHub and its affiliates (MS) can use anything you post on GitHub to improve their service.

What if I build an AGPL licenced service, using GitHub to coordinate development? According to the ToS MS could offer a version of my service because I posted the code on GitHub, and they are using it to improve their service to me. According to my AGPL licence, they would need to share their source.

So which takes precedence. The licence or the ToS?

rlpb|3 years ago

Consider that you can post somebody else's code to GitHub, and that may be licensed AGPL (or anything else). In that case, somebody else is the copyright holder so clearly the ToS doesn't magically give GitHub any additional rights and the licence applies.

The most they could do is transfer any liability back to you for posting it in breach of some term in their ToS. But that would be absurd since posting someone else's code, licensed under a common (eg. OSI-approved) license, is an established and normal use case for GitHub. If their ToS really did ban the posting of some AGPL code, they really ought to have pointed it out, and of course it'd render GitHub useless for hosting AGPL code.

This would only apply when posting someone else's code. But of course you could always arrange that.

lindenksv1|3 years ago

OP here. If you own the copyright to a work, you can license it in any way you like. You can offer it to some people under a commercial license and to other people under an open source license. Many entities practice dual (or tri or whatever) licensing. When you post things on GitHub, you are essentially dual licensing your work. You're providing it under a very broad license to GitHub and you are providing it under an OSS license (or whatever you like) to other GitHub users. Neither license takes precedence. One license applies to one group of people and the other license applies to the other group of people.

This is very similar to what happens when you sign a contributor agreement before contributing code to an open source project. When you sign the contributor agreement, you're granting a very broad license to your work to the project maintainers. They can then license your work out under any license they want. But likewise, because you are not granting them an exclusive license, you're free to put your contribution license out into the world under any license of your choosing separate and apart from the project that you contributed it to.

Technically, I think the scenario you're describing with AGPL code may well be possible and legal. But practically, I think people would stop using GitHub if they felt that doing so would lead to GitHub/Microsoft undercutting their projects, stealing their customers, or essentially stripping the project of any AGPL obligations. I think that from a business perspective, they're really gambling on the idea that developers will see Copilot as a big boon rather than a value suck. Time will tell whether their gamble has paid off.

david_allison|3 years ago

As a follow-on, what if you're mirroring code which is under an AGPL license? Are you allowed to post it on GitHub if you can't grant those rights under the ToS due to the license of the code?

VBprogrammer|3 years ago

An interesting thought experiment is how keen Microsoft would be to allow Copilot to be trained on the Office or Windows source code. If the output is truly free of copyright from its training materials then if not, why not?

amarant|3 years ago

Probably the ToS. You've granted GitHub specifically a license to use your code under the terms of the ToS, so they effectively have 2 licenses. They can therefore choose under which licence they want to use your code, and will choose the most permissive one, or the one they have the best understanding of: in this case the ToS.

Other parties are not granted license under the ToS, and so will have to abide by the AGPL.

NicoJuicy|3 years ago

Their service is hosting code, not writing code.

That's why it's GitHub, not CodeScribe ( or something)

visarga|3 years ago

I think copyright itself might be on its way out. What meaning does a copyright have when I can click "Variations" on anything and get 4 suggestions in 10 seconds? Imagine how good they will be by 2030.

hooby|3 years ago

Copyright was originally intended to protect the creators of a work.

Over many years it has now mostly become a tool for large companies to accumulate rights (on works they didn't create themselves) and monetize them.

Maybe a reform is needed, to find a way back to the original purpose.

izacus|3 years ago

There has never been more support for tightening and enforcing copyright than there is today. This is very unlikely to change due to megacorps like Microsoft, Disney, Apple et al. having a massive vested interest to use it to extract maximum profits.

LesZedCB|3 years ago

there's a great youtube doc about everything being remixing which i highly recommend

https://www.everythingisaremix.info/

esalman|3 years ago

Have you tried that on any kind of music?

throwaway290|3 years ago

Copyright becomes especially important and valuable in these circumstances. Remember, original works are how your variation suggestion engine is trained. With remaining incentives taken away there is no more new stuff to train on, networks get trained on their own output, the snake eats its own tail.

mjw1007|3 years ago

I think this is the most interesting part:

> [Github's Terms of Service] specifically identifies “GitHub” to include all of its affiliates (like Microsoft) and users of GitHub grant GitHub the right to use their content to perform and improve the “Service.” Diligent product counsel will not be surprised to learn that “Service” is defined as any services provided by “GitHub,” i.e. including all of GitHub’s affiliates.

tryre|3 years ago

No, the misinterpretation of the ToS is not the most interesting part. The part that clearly shows her colors is:

"It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions."

LesZedCB|3 years ago

out of curiosity, would anybody else cease to have an issue with copilot if it was an open source model?

i'm not paying for copilot right now because i'm waiting for this to shake out. but i'd be happy to pay (even their current asking price) if i knew the model was also open source and could be self hosted.

maybe this is the wrong way to ask the question, but hopefully it makes sense

david_allison|3 years ago

It's not the license of the model, it's the license of the output.

As it stands, Copilot is a black-box which strips copyright from a piece of code.

I'd be fine if it were a level playing field and GitHub also trained it on private repositories - that's a signal that they don't care about copyright at all.

I'd be fine as a developer who releases GPL'ed code if the output was licensed as GPL - obviously no license violation.

I'd be fine (within a reasonable scale) if a developer contacted me and asked to use my code under MIT.

I'm not fine that Copilot allows people to take my code, 'change the variable names' and remove the license. Especially because I have no visibility of the fact that this has occurred.

throwaway290|3 years ago

If it was a true OSS project, first it would not clearly benefit a single near-monopoly by using my code (as in, that wouldn't be its purpose), and second I'm sure its contributors would be well placed to understand the issue and from the start bake in a reliable, transparent mechanism for opting out.

As is, it's EEE applied to open source-- Microsoft's ultimate play against the ethic that brought us Linux among other things. When your brainchild gets gobbled up faster than you can blink, pushed to people who never learn about your existence, and a megacorp that you are ethically opposed to profits from the process, the need for self-actualization is no longer addressed. The fundamental incentive that pushes us to publish in the open, to have other humans acknowledge you and your work and feel pride in it, is being eliminated.

NoboruWataya|3 years ago

I agree - it's problematic enough that licensing information gets lost in the Copilot process, but as is we basically have developers contributing their time and expertise, for free, to the development of Microsoft's new paid proprietary product. Worse still, if Copilot is as revolutionary as some people make it out to be, those same developers are inadvertently helping Microsoft build a monopoly in a new market, with all the disastrous consequences that entails.

runnerup|3 years ago

If it was GPL it could use GPL code and legally there would be no debate.

synapse26|3 years ago

Yes, I’d be one too. I have no legal opinions about this, but morally, Copilot just doesn’t hit me right. One of the purposes of open source is for it to be, well, open. It’s so annoying seeing this tool specifically use only open source code and then have the audacity to close source + paywall access to it.

I used to be a little more agreeable with Copilot with training money and all, but seeing Stable Diffusion is willing to open up hundreds of thousands in training, and more in engineering, and therefore create an active community dedicated to improving it everyday, I just can’t help but be so annoyed when one of the world’s biggest tech companies pulls such a petty move.

unknown|3 years ago

[deleted]

MattPalmer1086|3 years ago

Has anyone produced a legally watertight license or clause for other licenses that prevents code being used for training of copilot-like services?

insanitybit|3 years ago

The article addresses this in a number of ways.

For example,

> That rings a bit like the Facebook memes of yesteryear promising users that if they just copy and paste these magical sentences onto their timelines, then Facebook won’t be able to do something or other with their data or accounts.

rwmj|3 years ago

It would be a Field of Endeavor restriction so the resulting license wouldn't be open source, and I don't think (?) Copilot is trained on proprietary code.

(Section 6 here: https://opensource.org/osd)

unknown|3 years ago

[deleted]

6stringmerc|3 years ago

I have a companion piece talking about music and training AI/ML:

https://medium.com/@6StringMerc/artificial-intelligence-mach...

terminal_d|3 years ago

If this isn't enough incentive to move away from github, then I don't know what is.

hnbad|3 years ago

> It looks a lot more like trolling if an otherwise incredibly useful and productivity-boosting technology is being stymied by people who want to receive payouts for a lack of meaningless attributions.

This one sentence threw off my entire opinion of the article as it demonstrates the author's clear bias in favor of Copilot, not just specifically in this case but in principle.

Legal opinion on Copilot and generative AI in general hinges entirely on metaphors. If the AI is understood to behave like a human being building knowledge and drawing from it for inspiration, Copilot is just another way to write code. But we've already established legal precedent that machines can not hold copyright, which suggests that they can not be deemed to be creative, which could be used to argue that they are therefore just creating an inventory of copyright works and creating mechanical mashups.

The author's dismissal also ignores that this would not JUST result in attribution. If Copilot indexed copyleft code and were required to provide attribution when using this code, the output might also be affected and this could in turn affect the entire code base. Worse yet, Copilot may output code with conflicting licenses. The author considers only the possibility that Copilot itself might have to inherit the license (and the dismissal that it would "help no one" because it runs on a server ignores both the existence of a (presumably self-hosted) enterprise service and the existence of licenses like AGPL, which would still apply) but it seems most people's concerns are with the output instead.

I also fail to understand how the argument that it doesn't reproduce the code exactly 99% of the time is helpful. If I copy code and rename the variables and run an autoformatter on it, it's still a copy of the code. It's odd to see a lawyer use what is essentially obfuscation as a defense against copyright claims. Also 1% is an incredibly large number given how Copilot is supposed to be used and how large the potential customer base is. Given the direction GitHub is heading with "Hello GitHub" (demoed at GitHub Universe yesterday) it's not unlikely that Copilot would in some cases be used to generate hundreds, thousands or tens of thousands of lines of code in a single project.

The question isn't just whether Copilot is violating the law or not, the question is why it is or isn't because that could have wide implications outside GitHub itself. But as the author points out, sadly the lawsuit doesn't try to settle this for copyright, which might be the most impactful question.

iLoveOncall|3 years ago

This lawsuit is open-source developers destroying open-source.

Havoc|3 years ago

What’s the point of licenses if TOS overrides it?

junon|3 years ago

Github's TOS doesn't infringe on any licenses.

https://docs.github.com/en/site-policy/github-terms/github-t...

I'm actually surprised they allowed Copilot to happen, given this section:

> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.

One could make the argument they had no intrinsic right to use the software for Copilot except under the terms laid out under the respective softwares' licenses. This means any GPL code they copied by error is now in violation of the GPL by default. But IANAL.

puffoflogic|3 years ago

Nothing is being overridden.

You have apparently misunderstood copyright licenses as being something that attaches to a copyrighted work and now must be respected by all users of that work. But that is totally incorrect.

Licenses are individual agreements between copyright holders (or licensees who have been granted the right to re-license) and people who want to exercise one of the rights normally withheld under copyright. A LICENSE file is nothing but an offer to grant a license with specified terms to anyone who might want to use the work, without having to nag the licensor to sign an agreement. The existence of that offer doesn't have anything to do with any other agreement the licensor and a (potential) licensee might make.

In the GitHub case, GitHub has negotiated a different license with the uploader. (That negotiation happened to take the form of a ToS, which is another kind of binding offer.) The LICENSE file has nothing to do with it. It hasn't been overridden, it's just irrelevant. It doesn't add or subtract any terms from the separate and distinct license GitHub negotiated.

amarant|3 years ago

The ToS only applies to GitHub(which includes Microsoft, apparently)

Other parties will still have to abide by your license.

hfglanx|3 years ago

[deleted]

muraiki|3 years ago

Or it could be that she is experienced with both software and law, and that her assessment is different than yours.

> Kate’s passion for open source began in law school, under the tutelage of Eben Moglen, long-time attorney for the Free Software Foundation, founder of the Software Freedom Law Center, and author of the GPL 3. She interned at the Electronic Frontier Foundation and helped write the first complaint against the NSA for warrantless wiretapping.

> At VMware and ServiceNow, she dedicated her time to designing, building, and testing internal compliance tools in collaboration with their respective internal tools teams. She is no stranger to writing specs, creating wireframes, and massive amounts of QA. So much so, that Kate and her husband, Steve Downing, co-founded Critterdom LLC, a software company whose Open Sorcerer product substantially cuts down the time it takes to manually review source code for licenses and create a customer-facing disclosure of that source code.

https://katedowninglaw.com/about/

tallanvor|3 years ago

Whether or not other countries give you the right to enforce your copyright even if you haven't registered it with the government is not relevant for a class action lawsuit filed in the US.

baby|3 years ago

This is why we can’t have nice things. Copilot is the future

nomilk|3 years ago

If organic neural networks are allowed to read and learn from open source code, why should an artificial one be any different?

geysersam|3 years ago

1. Humans are not neural networks. 2. Humans are not allowed to directly copy even rather short snippets of licenced code. 3. Humans do not have the capacity to memorize the entirety of GitHub.

unknown|3 years ago

[deleted]

throwaway290|3 years ago

For one, an organic network (for the sake of the argument I'll play along if you want to reduce a human to this) has rights, freedoms and ethical values and is not controlled by a single entity and has not specifically been instantiated to generate profit for such.

insanitybit|3 years ago

HN is so insanely frustrating, so many comments demonstrate that the user didn't read this article at all. Just immediately jumping into a "but what about this argument that I made?".

robocat|3 years ago

  Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."

https://news.ycombinator.com/newsguidelines.html

175 comments