top | item 33226515

GitHub Copilot, with “public code” blocked, emits my copyrighted code

914 points | davidgerard | 3 years ago | twitter.com

775 comments

[+] _ryanjsalva|3 years ago|reply
[+] _ryanjsalva|3 years ago|reply
Howdy, folks. Ryan here from the GitHub Copilot product team. I don’t know how the original poster’s machine was set up, but I’m gonna throw out a few theories about what could be happening.

If similar code is open in your VS Code project, Copilot can draw context from those adjacent files. This can make it appear that the public model was trained on your private code, when in fact the context is drawn from local files. For example, this is how Copilot includes variable and method names relevant to your project in suggestions.

It’s also possible that your code – or very similar code – appears many times over in public repositories. While Copilot doesn’t suggest code from specific repositories, it does repeat patterns. The OpenAI codex model (from which Copilot is derived) works a lot like a translation tool. When you use Google to translate from English to Spanish, it’s not like the service has ever seen that particular sentence before. Instead, the translation service understands language patterns (i.e. syntax, semantics, common phrases). In the same way, Copilot translates from English to Python, Rust, JavaScript, etc. The model learns language patterns based on vast amounts of public data. Especially when a code fragment appears hundreds or thousands of times, the model can interpret it as a pattern. We’ve found this happens in <1% of suggestions. To ensure every suggestion is unique, Copilot offers a filter to block suggestions >150 characters that match public data. If you’re not already using the filter, I recommend turning it on by visiting the Copilot tab in user settings.
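To make the filter concrete, here's a rough sketch of how that kind of suppression could work: block any suggestion whose whitespace-normalized text is 150+ characters and matches an index of public code. The function names and exact matching rules below are assumptions for illustration, not GitHub's actual implementation.

```python
# Hypothetical sketch of a "matches public code" filter: suppress any
# suggestion whose whitespace-normalized text, at 150+ characters,
# appears verbatim in an index of public code. Names and the exact
# matching rules are assumptions, not GitHub's real implementation.
import re

def normalize(code: str) -> str:
    """Collapse all whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", code).strip()

def should_block(suggestion: str, public_index: set[str],
                 min_len: int = 150) -> bool:
    """Block the suggestion if its normalized form is long enough
    and matches an entry in the (normalized) public-code index."""
    norm = normalize(suggestion)
    return len(norm) >= min_len and norm in public_index

# Toy usage: a short snippet passes; a long indexed one is blocked.
snippet = "for (i = 0; i < n; i++) w[i] = 0;"
long_fn = "int f(int x) { return x + 1; } " * 10  # ~310 chars normalized
index = {normalize(long_fn)}
print(should_block(snippet, index))  # short and unindexed -> False
print(should_block(long_fn, index))  # long and indexed -> True
```

Note that a filter like this only catches verbatim (or near-verbatim) matches; lightly paraphrased regurgitation would slip through.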

This is a new area of development, and we’re all learning. I’m personally spending a lot of time chatting with developers, copyright experts, and community stakeholders to understand the most responsible way to leverage LLMs. My biggest take-away: LLM maintainers (like GitHub) must transparently discuss the way models are built and implemented. There’s a lot of reverse-engineering happening in the community which leads to skepticism and the occasional misunderstanding. We’ll be working to improve on that front with more blog posts from our engineers and data scientists over the coming months.

[+] ianbutler|3 years ago|reply
I just tested it myself on a random C file I created in the middle of a Rust project I'm working on. It reproduced his full code verbatim from just the function header, so clearly it does regurgitate proprietary code, contrary to what some people have said. I do not have his source, so Copilot isn't just using existing context.

I've been finding Copilot really useful, but I'll be pausing it for now, and I'm glad I have only been using it on personal projects and not anything for work. This crosses the line in my head from legal ambiguity to legal "yeah, that's gonna have to stop".

[+] enriquto|3 years ago|reply
Just a heads-up that the person who writes this is Tim Davis[0], author of the legendary CHOLMOD solver[1], which hundreds of thousands of people use daily when they solve sparse symmetric linear systems in common numerical environments.

Even if CHOLMOD is easily the best sparse symmetric solver, it is notoriously not used by scipy.linalg.solve, though, because numpy/scipy developers are anti-copyleft fundamentalists and have chosen not to use this excellent code for merely ideological reasons... but this will not last: thanks to the copilot "filtering" described here, we can now recover a version of CHOLMOD unencumbered by the license that the author originally distributed it under! O brave new world, that has such people in it!

[0] https://people.engr.tamu.edu/davis/welcome.html

[1] https://github.com/DrTimothyAldenDavis

[+] jefftk|3 years ago|reply
In case anyone interprets this literally: if Copilot regurgitates literal code it was trained on, that doesn't actually give you an unencumbered version.
[+] mjr00|3 years ago|reply
Same issue with Stable Diffusion/NovelAI and certain people's artwork (eg Greg Rutkowski) being obviously used as part of the training set. More noticeable in Copilot since the output needs to be a lot more precise.

Lawmakers need to jump on this stuff ASAP. Some say that it's no different from a person looking at existing code or art and recreating it from memory or using it as inspiration. But the law already changes when technology gets involved. There's no law against you and me having a conversation, but I may not be able to record it, depending on the jurisdiction. Similarly, there's no law against you looking at artwork that I post online, but it's not out of the question that a law could exist preventing you from using it as part of an ML training dataset.

[+] deepspace|3 years ago|reply
This shows how copyright is all screwed up. Let's say the code in question is based on a published algorithm, maybe Yuster and Zwick, (I did not check).

What exactly gives Davis a better claim to the copyright than the inventors of the algorithm? Yes, I know software is copyrightable while algorithms are not, but it is not at all clear to me why that should be the case. The effort of translating an algorithm into code is trivial compared to designing the algorithm in the first place, no?

[+] clnq|3 years ago|reply
To be honest, it would probably benefit all of humanity if we stopped rewriting the same code to then fix the same bugs in it, and instead just used each other's algorithms to do meaningful work.

I work for a large tech company whose lawyers definitely care that my code doesn't train an AI model somewhere much more than I do. On the contrary, I would really like to open source all of my work - it would make it more impactful and would demonstrate my skills. It makes me a bit sad that my life's work is going to be behind lock and key, visible to relatively few people. Not to mention that the hundreds of thousands of work hours, energy and effort that will be spent to replicate it all over my industry in all other lock-and-key companies makes the industry as a whole tremendously inefficient.

I hope that AI models like Copilot will finally show to the very litigious tech companies that their intellectual property has been all over the public domain from the start. And we can get over a lot of the petty algorithm IP suits that probably hold back all tech in aggregate. We should all be working together, not racing against each other in the pursuit of shareholder value.

Historically, mathematicians in the Middle Ages used to keep their solutions secret to protect their employment. There were mathematicians who could, for example, solve certain quadratic equations, yet it took centuries before all humanity could benefit from this knowledge. I believe this is what is happening with algorithms now, and it is very counter-progress in my opinion.

[+] matheusmoreira|3 years ago|reply
True, copyright is screwed up and completely incompatible with the 21st century. We should abolish it so that these silly questions of data ownership become irrelevant.

However, until that happens, Microsoft and GitHub cannot get away with blatant copyright infringement like this. No one is interested in their poor excuses either. People get sued and DMCA'd out of existence for far lesser offenses, yet Microsoft gets away with violating the license of every free software and open source project out there? That's fucked up.

[+] zarzavat|3 years ago|reply
Algorithms cannot be copyrighted. What is copyrighted is the creative expression of an algorithm. The variable names, the comments, choosing a for loop vs a while loop, or a ternary operator over an “if”, the order of arguments to a function, architectural decisions, etc.

Copyright is formed when a human makes a choice among equivalent ways of implementing an algorithm.
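As a concrete illustration of that distinction: the two functions below implement the exact same algorithm (sum the even numbers), but differ in their expression, i.e. loop construct, variable names, conditional style. The example and names are made up here; copyright-wise, it's the expression that varies, not the algorithm.

```python
# Same algorithm, two different creative expressions.

def sum_even_a(values):
    # for-loop with an explicit accumulator and an if-statement
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v
    return total

def sum_even_b(xs):
    # while-loop over indices, ternary expression instead of if
    i, acc = 0, 0
    while i < len(xs):
        acc += xs[i] if xs[i] % 2 == 0 else 0
        i += 1
    return acc

print(sum_even_a([1, 2, 3, 4]))  # 6
print(sum_even_b([1, 2, 3, 4]))  # 6
```

Both return identical results for any input; only the surface choices differ.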

[+] mdaniel|3 years ago|reply
I didn't feel like weighing in on that Twitter thread, but in the screenshot one will notice that the code generated by Copilot has secretly(?) swapped the order of the interior parameters to "cs_done". Maybe that's fine, but maybe it's not; how in the world would a Copilot consumer know to watch out for that? Double extra good if a separate prompt for "cs_done" commingles multiple implementations where some care about the order and some don't. Partying ensues!

Not to detract from the well founded licensing discussion, but who is it that finds this madlibs approach useful in coding?
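The swapped-arguments hazard is easy to demonstrate: when two parameters have compatible types, a call with swapped interior arguments still runs without any error, it just does the wrong thing. The cs_done-style function below is invented for the example, not the real CSparse API.

```python
# Hypothetical illustration: swapping two same-role arguments is silent.

def finish(result, int_workspace, float_workspace, ok):
    """Pretend cleanup: report which workspace buffers were released."""
    released = {"ints": len(int_workspace), "floats": len(float_workspace)}
    return (result, released) if ok else (None, released)

w = [0, 0, 0]      # int workspace, length 3
x = [0.0, 0.0]     # float workspace, length 2

good = finish("C", w, x, ok=True)
bad = finish("C", x, w, ok=True)   # interior arguments swapped -- no error
print(good[1])  # {'ints': 3, 'floats': 2}
print(bad[1])   # {'ints': 2, 'floats': 3}  <- silently wrong
```

In C, with two pointer parameters of the same type, the compiler wouldn't flag this either, which is exactly why a generated call with a quietly reordered argument list is dangerous.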

[+] crummy|3 years ago|reply
In my opinion Copilot helps with the easy, boring stuff by typing what I likely would have typed anyway. The harder the code to write, the less likely I'd be to lean on Copilot.
[+] Waterluvian|3 years ago|reply
I think people may be drastically over-valuing their code. If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

If the issue is more specifically copyright infringement, then leverage the legal apparatus in place for that. Their lawyers might listen better.

This is not a strongly held opinion and if you disagree I would love to hear your constructive thoughts!

[+] jimlongton|3 years ago|reply
>If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

Take a single C file or even a long function from the leaked Windows NT codebase and include it in your code. See how happy Microsoft will be with it. They spent millions of dollars on their legal teams. Eroding copyright protections will harm the weakest most. How many open source contributors can afford copyright lawyers?

[+] heavyset_go|3 years ago|reply
If they're that trivial and valueless, Microsoft should have no problem coming up with their own training sets instead of stealing them en masse from the public.

If I create something, I get to define the terms of its use, reproduction, distribution, etc. "Value" plays no part in whether someone can appropriate and distribute that creation without permission from the creator.

[+] ironmagma|3 years ago|reply
The OP was a professor of mine, and his library represents the product of thousands of hours of research. Probably every line in there is extremely valuable.
[+] jacooper|3 years ago|reply
I mean it starts like this, but if Copilot gets a pass, companies might just use AI as a way to launder code and avoid complying with Free licenses.
[+] jeppester|3 years ago|reply
Github copilot is a paid product.

It doesn't matter whether I think my code is valuable; the point is that GitHub is using everyone's code for its own profit, without opt-in, attribution, or paying for a license.

[+] bluehatbrit|3 years ago|reply
I suppose on the one hand you are right, people may well over-value their code. However the argument isn't really about the value or any monetary damage done through this. It's about a violation of ownership and trust.

Right or wrong, copyright doesn't care about how valuable something is. Everything is equally (in theory, if not in reality) protected. GitHub is a platform many people have trusted to protect ownership of their copyrighted code through reasonable levels of security.

I think the big discussion point here is around ensuring that this tool is acting correctly and respecting the rights of individuals. It's very easy for a large company to accidentally step on people and not realise it, or to brush it away. People want to make sure that isn't happening, and right now there are some very compelling examples where it looks like it is. The fact that this isn't opt-in, and that there's no way to opt out your public repositories, means the choice has been taken away from people. Previously you were free to license your code as you saw fit; now we have examples where that license may not be respected, as a result of a paid GitHub feature.

I think this is where the conversation is centring. It's not about whether your code is valuable or not. It's about whether a large company is making a profit by stepping on an individual's right of ownership.

On the note of leveraging legal apparatus to figure it out, I think you're right. The problem is: what individual open source maintainer is going to have the funds to bring a reasonably equal legal challenge against such a large organisation? I maintain a relatively well used open source project and I sure as hell don't. Realistically my options are to either spend a lot of personal time and resources challenging it (if I think wrong-doing is happening) or just suck it up. Given that there's no easy way to figure out whether wrong-doing is happening, because it's all in the AI soup, that approach is even harder to consider.

I think the point is a lot less about the value of the code, and much more about a massive organisation playing fast and loose with individuals' rights.

None of this is to say GitHub have actually done anything wrong here. I'm sure we'll figure that out in time, but it would be great if they could figure out a way to provide more concrete explanations.

[+] kitsune_|3 years ago|reply
This is about standards. Laws for thee and not for me? It's just particularly hypocritical that the same companies that will sue anyone for violating their copyright have no issue violating copyright themselves.
[+] summerlight|3 years ago|reply
> I think people may be drastically over-valuing their code. If it was emitting an entire meaningful product, that would be something else. But it’s emitting nuts and bolts.

Please refrain from this kind of blatant gaslighting. You're not the one to assess its value or usefulness, and your point is at most tangential to the issue. The problem is that the model systematically took non-public-domain code without any permission from the authors, not whether the code is useful. This complaint is worth hearing, and the Copilot team should be more accountable for this problem, since it could lead to more serious copyright-infringement fights for its users.

[+] Havoc|3 years ago|reply
Copyright makes no such distinction
[+] chiefalchemist|3 years ago|reply
To some extent I agree with your opening. That is, in plenty of cases Copilot is showing how mundane most code is. It's one commodity stitched to another, stitched to another.

That's not considering any legal / license issues; it's just a simple statement about the data used to train Copilot.

[+] an1sotropy|3 years ago|reply
This is a huge and looming legal problem. I wonder if what should be a big uproar about it is muted by the widespread acceptance/approval of GitHub and related products, in which case it's a nice example of how monopolies damage communities.
[+] jeroenhd|3 years ago|reply
I think it won't become a legal problem until Copilot steals code from a leaked repository (e.g. the Windows XP source code) and that code gets reused in public.

Only then will we see an answer to the question "is making an AI write your stolen code a viable excuse".

I very much approve of the idea of Copilot as long as the copied code is annotated with the right license. I understand this is a difficult challenge but just because this is difficult doesn't mean such a requirement should become optional; rather, it should encourage companies to fix their questionable IP problems before releasing these products into the wild, especially if they do so in exchange for payment.

[+] crazygringo|3 years ago|reply
As some other commenters have noted, it seems like the copyrighted code is being copied and pasted into many other codebases (shadowgovt says they found 32,000 hits), which are then (illegally) representing an incorrect license.

So obviously the source of the error is one or more third parties, not Microsoft, and it's obviously impossible for Microsoft to be responsible in advance for what other people claim to license.

It does make you wonder, however, if Microsoft ought to be responsible for obeying a type of "DMCA takedown" request that should apply to ML models -- not on all 32,000 sources but rather on a specified text snippet -- to be implemented the next time the model is trained (or, if practical, via filters placed on the output of the existing model). I don't know what the law says, but it certainly seems like a takedown model would be a good compromise here.

[+] thorum|3 years ago|reply
What might be going on here is that Copilot pulls code it thinks may be relevant from other files in your VS Code project. You have the original code open in another tab, so Copilot uses it as a reference example. For evidence, see the generated comment under the Copilot completion: "Compare this snippet from Untitled-1.cpp" - that's the AI accidentally repeating the prompt it was given by Copilot.
[+] ianbutler|3 years ago|reply
I just tested it myself, and I most certainly do not have his source open. It reproduced his code verbatim from just the function header, in a random test C file I created in the middle of a Rust project I'm working on.
[+] fencepost|3 years ago|reply
Seems simple enough to start addressing. Don't sue Microsoft, subpoena them as part of your suit against unnamed companies violating the license. Request information on all public and private repositories that were generated in part using Copilot and which contain relevant code from which the licensing info has been stripped.

After all, Microsoft may not itself be infringing so there may not be a cause of action against them by the copyright holders - but there's probably cause against the (unknowing) infringers and they may have cause.

[+] Spivak|3 years ago|reply
> Request information on all public and private repositories that were generated in part using Copilot and which contain relevant code from which the licensing info has been stripped.

No court is ever going to give you that subpoena, nor would it even be possible to comply with if granted. You might get “show me all the repositories used in the training data for Copilot that contain that snippet.”

[+] seanwilson|3 years ago|reply
For DALL-E and Stable Diffusion, isn't the model orders of magnitude smaller than the total size of all the training-set images? So it's not possible for the model to regurgitate every image in the training set exactly?

For Copilot, is there a similar argument? Or is its model large enough to contain the training set verbatim?

[+] fireant|3 years ago|reply
This exact code can be found 1000 times on GitHub, and many of those copies are MIT licensed: https://github.com/search?q=%22cs+*cs_transpose+%28%22&type=.... Copilot, or any other developer or person, has no way of knowing where the original implementation came from or its original license. The cat is out of the bag; get used to it.
[+] foepys|3 years ago|reply
It will not be GitHub that will get sued. It'll be the developers that use the code without attribution.

The copyright infringement might not matter if code from individual developers is being used -- they usually don't sue. But once this happens to, say, Oracle's copyrighted code... well, that is going to be interesting.

[+] vintermann|3 years ago|reply
Yes, they have a way. Even an algorithm given no access to anything but the copilot training data has a way, because it has temporal information: it says where the code appeared first! Github has the data, but doesn't give an easy way to search it, hmmm...

Although we can't rule out a common origin of shared code, including a common origin off github, we can know for sure that old code doesn't copy code from the future.

As to Microsoft and human developers having no clue about a piece of code's origin, that's plainly false: not only do we have timestamps on repositories, we can also easily verify that the code first appeared in the context of the CSparse library, by Tim Davis, a CS professor at Texas A&M who has worked on sparse matrix numerical methods his entire career.
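The temporal argument can be sketched in a few lines: given every (repository, first-appearance date) pair in which a snippet occurs, the earliest one is the best candidate for the original, since later copies cannot be the source of earlier code. The repository names and dates below are made up for illustration.

```python
# Sketch: earliest first appearance bounds who could have copied from whom.
from datetime import date

# Hypothetical sightings of one snippet across repositories.
sightings = [
    ("random-fork/utils", date(2019, 6, 1)),
    ("DrTimothyAldenDavis/SuiteSparse", date(2006, 3, 15)),
    ("some-mit-repo/linalg", date(2017, 11, 2)),
]

def likely_origin(sightings):
    """Pick the sighting with the earliest date. This can't rule out a
    common ancestor outside the dataset, but old code can't copy from
    the future, so later sightings can't be the origin."""
    return min(sightings, key=lambda s: s[1])

print(likely_origin(sightings)[0])  # DrTimothyAldenDavis/SuiteSparse
```

This is exactly the search GitHub could run over its own commit history but doesn't expose.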

[+] mafuy|3 years ago|reply
Strong disagree with your conclusion.

That something is effectively public domain does not make it legal to use. A movie may be found in a thousand torrents, yet one still gets sued for uploading a kilobyte of it.

That it is hard or impossible to know whether something is legal to use does not mean it is OK to use it. You need a licensor who is able to compensate you for the damages you incur in case the license they granted was invalid.

I'm not happy about either of these points, but that's how it is currently and just closing your eyes and hoping it will go away won't work.

[+] choppaface|3 years ago|reply
I think that logic only works for DeCSS.
[+] olliej|3 years ago|reply
There are dozens of companies that ship Linux and other GPL code without providing sources, get used to it!
[+] nightski|3 years ago|reply
Or just don't use Github.
[+] siilats|3 years ago|reply
I think this will be solved like YouTube music. As a Copilot user, I don't mind paying 0.1 cents for a good matrix transform; the code owner doesn't mind receiving 0.1 cents from thousands of users; and GitHub gets 30%.
[+] ben-schaaf|3 years ago|reply
That's not compatible with the licenses used by software projects, especially GPL.
[+] quickthrower2|3 years ago|reply
The small issue of provenance there! With music it is clearer who sung it, and nonfamous music is almost worthless financially anyway.
[+] zaps|3 years ago|reply
Drunk conspiracy theory: Nat knew Copilot would be a complete nightmare and bailed.
[+] eterevsky|3 years ago|reply
I don't think it's fair to say that it emits the same code. The code on the right is definitely implementing the same algorithm and is generally similar to the code on the left, but it's not identical. IANAL, but I think copyright wouldn't apply in this case.

Imagine a person who would want to implement the same function in their project. They could look at the open source implementation to learn how the algorithm is supposed to work, and would write their implementation. They could end up with the implementation on the right.

[+] PAMANOCH|3 years ago|reply
The "AI" that people keep talking about is no different than any other app like MS Word, which is just a piece of software that serve corporation interests. What we are experiencing today is very simple - big players are using people's work for profit without paying one cent or getting any consent, no need to talk about "How". This is a nightmare scenario under today's social and eco system, and even worse at a production level, because in the end it will form a new industry that has nothing to do with experienced people. Take creative work for example, at the current rate most artists will completely decouple from industry in several years while giving all their works for training for free, and those who control the H/W/R&D resources will find ways to profit from model one way or another, resulting in an "AI" companies controlled "creative industry" with few artists left to direct their work. Can't even think of any other examples close to this in modern history, that a small group of people can do whatever they want under the disguise of "Exciting Technology" which in reality is just stealing an entire industry. There's very little to discuss if you ignore the reality of social systems and just focusing on technical details. We don't live in some fairy tale where you can just let computer do your work and enjoy your life.
[+] anujdeshpande|3 years ago|reply
Out of curiosity -- has no one sued OpenAI/GitHub for this? I remember seeing threads like this since Copilot was launched. If there was enough legal pressure, I'd imagine OpenAI/GitHub would train this using opt-in repos instead of the approach they currently use.
[+] bmitc|3 years ago|reply
What does

> with "public code" blocked

mean? Are you able to set a setting in GitHub to tell GitHub that you don't want your code used for Copilot training data? Is this an abuse of the license you sign with GitHub, or did they update it at some point to allow your code to be automatically used in Copilot? I'm not crazy about the idea of paying GitHub for them to make money off of my code/data.

[+] defasdefbe|3 years ago|reply
> We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that matches public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. We plan on continuing to evolve this approach and welcome feedback and comment.

From the FAQ https://github.com/features/copilot/

[+] galleywest200|3 years ago|reply
The option to omit "public code" means it should, in theory, omit code that is licensed under such banners as the GPL. It does not mean "omit private repositories".