If you open-sourced code and allowed it to be used for commercial purposes, I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.
(And charging for a product which adds value to your developer experience and needs money to be run is not a bad thing)
> If you open-sourced code and allowed it to be used for commercial purposes
Uploading it to Github does not transfer ownership or imply allowances for any use. If you upload it without a license it is a copyright violation to copy the code. Even with an MIT license it is a copyright violation to copy the code without attribution.
> I don't see the point of being pissy about Github using it, I'm saying this as someone who's written quite a lot of MIT code.
People are probably angry because this is yet another case of a big multinational corporation abusing unclear or difficult to enforce legislation for profit.
All of my open source licenses require attribution, but Copilot does not give that attribution. So while my code is open source, Copilot is still violating the open source license. Just because it's open source doesn't mean there are not any terms that must be abided by.
I believe that gives me the right to be mad and to demand they fix their violations, one way or another.
If you write MIT code you expect them not to strip your license out in derivative works. This is exactly what licenses are for, and GitHub is blatantly violating them while people applaud.
Sure, but there's still the license at play here. It's not like they trained it only on public domain/CC0 code. What happens when it verbatim outputs a significant amount of code that was originally MIT, or BSD, or GPL licensed, without the appropriate attribution? It can create unintended copyright violations and potentially open people using it up to liability.
Open source code still has a license. That license may or may not require distributing the license text along with the code. MIT may allow distribution without the license text unless the code shared is significant, but reusing GPLv3 is a no-go for commercial companies.
The Apache 2 license allows for commercial use, but has implications for the way you can enforce your software patents. It also requires distributing the license file along with your application.
Complaining that companies use the software you told the world was free to use without restriction is dumb. However, not everyone gives away their software for free without restrictions. The fact that Github isn't respecting those licenses is a much bigger problem.
The tool autocompleting some random guy's personal information because he uploaded his blog to Github is highly problematic. The idea of using permissively licensed code to train an AI is not bad, but some human with knowledge of software licenses would need to pre-select those projects.
If all code came from one of those "do whatever the fuck you want" licenses, then there wouldn't be a problem. I'd consider it to be a great product and have no issue paying a fee. There's a huge market for a Copilot product, but this iteration just.. isn't it.
I think you're missing my point. I have tons of MIT code out there, including a node.js project used by lots of companies. I don't care about people using my code for money because I open sourced it under a permissive license. So I'm not really objecting to that.
But what's bothering me about this is that it's not a small company doing this. It's a company that's got crazy amounts of cash, who has been trying to trade on a "we're nice now and we love open source" image in the last few years, now taking all the open source code and balling it up in a closed-source app they will charge us for.
I'd be fine if I got to use it for free, extend it to whatever editing platform I like through its open API, and it was a part of an open project.
But right now it looks like they'll charge, and that bugs me.
“Free and open source, assuming I approve of the usage” is a common sentiment among people who paste Apache or MIT and don’t think about the ramifications. It’s increasingly common.
I think this situation is slightly more complex but that sentiment is at the heart of a lot of pushback against things like this.
Github/Microsoft is going to take your code, and then cut off your access to it. This is what the GPL was designed to fight, so they're going to try it this way instead. Those who do not learn history, yadda yadda.
Frankly, I think the reason people are upset is that a tool that once revolved around sharing work with others has been bought by a super-giant corporation, and all of that sharing is being turned into a means of putting the people who shared out of work. Or at the very least, cutting their salaries dramatically.
So... I can see that this ML model sometimes generates code exactly the same as in the original dataset, which is definitely a problem. A defective model, sure.
Besides that, I cannot understand why the overall idea of using open-source projects to train an ML model that generates code would ever be a problem. We human beings learn just as the model does: we read others' code, books, articles, design patterns... and it becomes part of us. Even with private code, I mean, like when you join a company, you read their codebase and methodology and it becomes something of yours. Copyright generally does not allow you to "copy" the original, but you can still synthesize your own code: cutting, combining, creating based on whatever you have learnt.
The way an ML model works certainly differs from a human brain, but I cannot see why this would be a problem, or why an organic brain should be considered so superior that what it does counts as creation while what an ML model does counts as scraping your code. What is the difference here?
And recently we also saw GPT generating articles, waifulabs generating... waifus... To be honest, I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
The difference is that it's a judgement call when to include attribution, whom to attribute and with how much, and overall whether something is too close to the original to count as a copyright or other license violation. Intelligent humans sometimes, or even often, have a hard time making this judgement call. An artificial intelligence would too, and a somewhat simple ML model (no offense) certainly does.
I'm really waiting for this to blow up from the open source license angle. Freely combining code with different licenses is a hellish undertaking on its own. But already just re-using some, say, GPL code, even staying under the same license but without proper attribution, is Forbidden with a capital F.
> Besides that, I cannot understand why the overall idea of using open-source projects to train an ML model that generates code would ever be a problem. We human beings learn just as the model does: we read others' code, books, articles, design patterns... and it becomes part of us.
It's an interesting question.
1) When a human being reads code or a CS textbook, we think of them as extracting general principles from the code, so they don't have to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on. That seems to leave them rather permanently dependent on the training corpus.
2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if GitHub were extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate to be, so why should anyone help them without similar compensation? That this is being done by "OpenAI", a company which went from open non-profit to closed for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.
> I cannot see why this would be a problem, or why an organic brain should be considered so superior that what it does counts as creation while what an ML model does counts as scraping your code. What is the difference here?
TL;DR: The AI doesn't know it can't just copy-paste (from perfect memory), and as such it has learned to sometimes just copy-paste things.
The GPT model doesn't "learn to understand the code and reproduce code based on that knowledge".
What it learns includes a bit of understanding, but it is closer to recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (while still knowing which patterns "fit together").
This means that the model will, if it fits, potentially copy-paste code "from memory" instead of writing new code which just happens to be the same or similar. It's like a person with perfect memory sometimes copy-pasting code they have seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if not pre-filtered out before training).
I.e. there is a difference between "having a different kind of understanding" and "largely lacking understanding but compensating for it by copying remembered code snippets from memory".
Theoretically it could be possible to create a GPT model which is forced to (somewhat) understand programming but not memorize text snippets, but practically I think we are still far away from that, as it's really hard to tell whether a model has memorized copyright-protected code.
I have a genuine question about this whole thing with Copilot:
A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it’s trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It’s a really cool product btw and I’ve been using and loving it for years). According to their website they have over 1M active users.
Why is this suddenly a huge big deal and why is everyone suddenly freaking out about Copilot? Is it because it’s GitHub and Microsoft and OpenAI behind Copilot vs some small startup you’ve never heard of? Is it just that the people freaking out weren’t paying attention and didn’t realize this service already existed?
The feature of TabNine that uses the "public" dataset is optional. It can also provide completions only based on local code. That optionality is important.
Also, TabNine has a smaller scope; you type "var " and it suggests a variable name and possibly the rest of the line, like autocomplete has been doing for decades. Perfectly normal.
My understanding of copilot is that you can type "// here's a high-level description of my problem" and it'll fill out entire functions, dozens of lines. The scope is much grander.
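To make the difference in scope concrete, the kind of interaction people describe looks roughly like this. Note that both the prompt and the completion below are invented for illustration, not actual Copilot output:

```python
# Comment typed by the user as a prompt:
# "parse an ISO-8601 date string like '2021-07-05' into (year, month, day)"

# The kind of whole-function completion Copilot is said to offer
# from just that comment:
def parse_iso_date(s):
    year, month, day = s.split("-")
    return int(year), int(month), int(day)

print(parse_iso_date("2021-07-05"))  # (2021, 7, 5)
```

That is a qualitatively different product from line-level autocomplete, which is part of why it is drawing different scrutiny.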
Because the repository trusted by millions is starting to do things we never anticipated. It's growing in ways that are a touch uncomfortable for some.
I think some are also beginning to feel an Amazonification happening. We built all the stuff and made it free, but now a company is going to own it and profit off of it.
Edit: If we want to prevent this, we need a new license that states our code may not be included in deep learning training sets.
Edit 2: if private repository code is in this training set, it may be possible to leak details of private company infrastructure. Models can leak training data.
GitHub has more visibility and yes, more scrutiny. But that doesn’t mean TabNine would’ve survived without scrutiny, especially after an acquisition. The fact is, size matters.
Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped these projects then it's a problem.
Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:
If the code was not encumbered by restrictions, I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta-analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.
If I scraped GitHub code and then wrote a book about common coding patterns & practices I don't think that would be wrong.
I used the Brown corpus and multiple other written-word corpuses (corpora?), along with WordNet and other sources, to write my computational linguistics thesis on word sense disambiguation, later applying it to my job, which earns me money. Is this wrong?
Public datasets have been used extensively for ML already. I don't see this as much different.
> Can't you host code on GitHub that is not "free" for commercial use? If GitHub scraped these projects then it's a problem.
It did. It's spitting out the AGPL into empty files, and AGPL'd code isn't free for commercial use. It requires people who use it to make their changes available under the same license.
If an individual hypothetically painstakingly searched through GitHub to see how others wrote an API call and copy-pasted, almost no one would have a problem with that even if they didn't attribute every little code snippet. But some are bringing out the pitchforks because ML can basically do that painstaking search (yes, I know it's not literally a search) so efficiently that it's actually (maybe) useful as a tool. But it's not fundamentally different from what many programmers do all the time.
The difference appears when copyrighted material is repeated verbatim, and it's obvious GitHub has no control over how much copyrighted material is being repeated verbatim. And that copyrighted material is intended to be used by commercial companies who copyright their own material and don't want to have their copyright challenged.
OpenAI's argument is that this is fair use, in which case the license does not apply at all (though if the court's decision hangs on certain parts of the fair-use test, especially the fourth factor, what was contained in the license may have some relevance).
> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.
Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
> But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it
What you just saw 3 days ago was a hype-driven unveiling of a cherry-picked contraption by GitHub, OpenAI and Microsoft. Open source became the loser once again, got taken advantage of by this clever trick, and the result will soon become a paid service. (With lots of code that is under the copyright of various authors.)
Anyone who critiqued the announcement three days ago was drowned out, downvoted and stamped on by the fanatics.
I wanted those who had access to it (not GitHub or Microsoft fans) to demystify and VERIFY the claims rather than blindly trust them. Those suspicions by the skeptics were right, and lots of questions still remain unanswered.
Well done for re-centralising everything to GitHub. Again.
Here's the brutal and ugly truth: why isn't our personal data treated as private property? It's because those who write the laws governing its status either lack the requisite understanding or else practice a form of, to put it mildly, motivated reasoning.
Well, for the most part my code isn't going to do anyone a ton of good. I don't use much in the way of popular frameworks, but I also guess this means I'm gonna be out of a job for not writing "normal" enough code at some point.
Is there no licence with any sort of model training clause: "If this licence or the source code it covers is used to train a statistical model, then the model and code used to create the model are covered by this licence (which has terms like the AGPL)"?
If not, will anybody quietly slip something like this into Copilot's training data?
I'm a big proponent of open source and I'm usually not nice with bad moves of GitHub. For example, i find stupid to use vscode and believe that it is open source when it is a lie.
But, in that case, I think that the things that are put to charge GitHub are not right.
I think that the idea is nice and it is fair from open source code. Anyone is free of downloading free software and doing something similar, and it is nice.
I just find the product itself is stupid, and it is for users to be smart enough not to use that knowing that their is a risk of them being sued for involuntary violating copyright. And GitHub might be at risk if it is a paid service as the companies could sue them back by pretending that they expected the code generated by GitHub to be safe for commercial use.
Also, I would think that GH would have abused if they used 'private repo' codes to train their model without permission.
Unfortunately, just because code is open source doesn't mean that there aren't terms of use attached with it. One of the simplest and most widely used terms is attribution.
This means that if Copilot does not attribute code when it copies and modifies it, then it is violating most open source licenses. Full stop.
What's hilarious about auto-generating the GPL license is that it's provable Copilot is trained on GPL code, but it's essentially impossible to tell which code it came from. Any legal battle will be strange... Is it enough for Copilot to not regurgitate GPL licensed code exactly? Is it enough for Copilot to create a slightly modified version?
Laughably, as soon as slight variation is added, there is so much code in the world that it'll be impossible to prove wrongdoing for HTML or JavaScript synthesis. A model trained on all permissively licensed code on GitHub looks a lot like your own GPL code? Are you sure your code is so unique?
Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
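Whether a "slightly modified version" still infringes is exactly where detection gets murky. A toy sketch below (not a real code-similarity tool; all names and snippets are invented) shows why: trivial edits like renaming variables are mechanically detectable by canonicalizing identifiers, but anything deeper than renaming quickly is not:

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"def", "for", "in", "return"}  # enough for this toy example

def normalize(code):
    """Replace each identifier with a placeholder numbered by first
    appearance, so consistently renaming variables changes nothing."""
    names, out = {}, []
    for tok in TOKEN.findall(code):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            names.setdefault(tok, f"id{len(names)}")
            out.append(names[tok])
        else:
            out.append(tok)
    return out

original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
renamed  = "def accum(vals):\n    a = 0\n    for v in vals:\n        a += v\n    return a"

# Exact comparison misses the copy; normalized comparison catches it.
assert original != renamed
assert normalize(original) == normalize(renamed)
```

Restructuring the logic itself (reordering statements, inlining a helper) defeats this kind of check, which is why courts fall back on judgment-heavy tests like Abstraction-Filtration-Comparison rather than string matching.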
If you're hosting at the free github service, or even paid, github did not scrape your code. They just accessed the information on the hardware they owned. HTTP wouldn't have to be involved at all. They could just look at the disks.
Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""
The above isn't to say I agree with this but just to highlight the dangers of outsourcing and the cloud.
Believe it or not, there are more countries in the world than the United States.
> "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy."
this is definitely not the case for 100% of the rest of the world
The good news is that Github then also has no reasonable expectation for me to use their service. Most developers can just as easily set up a Gitlab or self-hosted alternative with zero friction.
This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
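GitHub hasn't published how that duplication search works, but a minimal sketch of one common approach (indexing hashed, overlapping token windows of the training set, then checking each suggestion against the index; the function names and the tiny "corpus" here are invented) could look like:

```python
import hashlib

def shingles(code, k=8):
    """All overlapping k-token windows ("shingles") of a piece of code."""
    tokens = code.split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def digest(s):
    return hashlib.sha256(s.encode()).hexdigest()

# Index built once over the training corpus (here: a single tiny "file").
training_file = "for i in range ( 10 ) : total += i * i print ( total )"
index = {digest(s) for s in shingles(training_file)}

def quoted_from_training_set(suggestion, k=8):
    """Return the suggestion's shingles that occur verbatim in the training set."""
    return {s for s in shingles(suggestion, k) if digest(s) in index}

# A suggestion that repeats training code verbatim is flagged...
assert quoted_from_training_set("for i in range ( 10 ) : total += i * i")
# ...while unrelated code is not.
assert not quoted_from_training_set("def unrelated ( x ) : return x + 1")
```

At real scale, systems typically store only a sampled subset of fingerprints (e.g. via winnowing or MinHash) rather than every shingle, trading some recall for a much smaller index.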
The way "AI" works for now, Copilot never comes up with its own ideas, as it is incapable of deductive reasoning. It basically detects patterns from the context and then mixes variations of things it has learned. If there is nothing to mix (that is, if there is a single source), the risk of spitting out verbatim copies is high. But if there are multiple sources, some mixing, and some amount of tiny differences, those differences had better not be too trivial, because I don't see why we would suddenly drop Abstraction-Filtration-Comparison approaches...
So their defense along the lines of "oh it's fine, it very rarely emits verbatim things" is bullshit anyway. That's an answer to the wrong question, at least given that the answer goes in this direction (were there tons of verbatim recitation, they obviously would not try to wave the problem away like that -- however, we cannot conclude anything from verbatim output being rare, despite them stating it as if it were a quite central and strong argument).
Here is the relevant portion of GitHub's terms of service (section D.4) [0]:
"""
4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
...
"""
Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on them; I just haven't read it in detail, so maybe the above covers private repos as well.
To me it seems the whole subject requires additional consideration in licensing. It is a little like applying telephone-era law to the internet: it will not fit 100%.
If the creator's interests are no longer clearly expressed by a license, we need updates to the license texts.
Let's look at MIT:
____________________
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
[...]
____________________
From the license text alone, it would not be clear to me, why anyone could claim that the OpenAI codex or the Github Copilot would require attribution to any of the used MIT source code to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof. It is essentially a mathematical / statistical analysis of it.
Now what about any generated new source? How similar does it need to be to some source to count as a copy? At what size does the generated code qualify as a copy instead of a snippet of industry best practice?
Where does the responsibility for attribution lie? Should we treat the AI code generation models like a copy & paste program? Usually you cannot really say where the copy came from 100% - how do you know what factors influenced it?
> Now what about any generated new source? How similar does it need to be to some source to count as a copy? At what size does the generated code qualify as a copy instead of a snippet of industry best practice?
Let's handle the simplest case first: Copilot can and does regurgitate large pieces of its training dataset verbatim. This is a well-known and trivially demonstrable property of all ML models in this family. Would such exact copy fall under the license of the code being copied? This of course needs to be tested in courts, but my gut says "yes". The problem now is, if you're using Copilot, you may end up with such copied code in your codebase without ever knowing, and this might open you to liability.
I bet there are bad actors already starting to spam GitHub with legitimate-looking projects that have hidden vulnerabilities, in hopes that the next retraining of Copilot will pick them up.
Newsflash: all open source means that you're already doing free work for the largest corporations in the world! It seems like developers, as a group, decided that it would be better to spend their nights writing free code for FAANG, so they would be able to keep their day jobs. Bezos and friends thank you all. #genius
Source Hut[0] is getting more attractive with each passing day, but I'm not sure I can adapt to its weird email-centric pull requests (and I know that this is a standard Git feature, but the UX seems bad).
> It’s truly disappointing to watch people cheer at having their work and time exploited
Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will go like Tesla's Autopilot for years.
A little. But all works output by GPT-3 are provided in “source form” to everyone who uses them – whereas lots of the output of Co-Pilot (trained on copyleft code, among other things) is going into proprietary software projects.
(Also, GPT-3 wasn't trained on nearly as much writing as that. Even if you ignore lost writing, GPT-3 was trained on a small subset of the 'net.)
I don’t understand this mentality. The AI is trained (or at least supposed to be - that’s fixable) on code that was published under open licenses. The “exploited by the man” trope after publishing OSS feels entirely backwards.
Nothing is free, people! ... People are outraged at GitHub, but nobody is going after Facebook or Google for training their AIs on your personal data. Facebook used your face to train some algorithms, Google your personal emails, etc.
A lot of people dislike them and minimize their use.
More importantly, we are seeing a bait-and-switch. People agreed to GitHub storing, showing and indexing their code and issues, not to the code being used for Copilot, regardless of what the fine print in the usage agreement says.
It's not about it being free, it's about GitHub taking something you licensed with conditions (ie. attribution or keep this copyright notice and license file, etc.) and blatantly ignoring your license because they know you probably can't afford to sue them for copyright infringement. Open Source doesn't mean you can reproduce and copy the code freely, licenses exist for a reason. Also: of course people care about Facebook et al. (not enough, I'll grant you). Plenty of people complain about Facebook violating privacy every single day.
I mean.. I am? I care much more about Google or Facebook profiling and profiting off of my data (especially when I don’t consent to giving it to them in any meaningful way) than I do letting GitHub do things with code I knew was freely available and that other entities could use in profitable ways.
Because people accept a EULA when they give their data to FB or Google. GitHub is exploiting a grey area to leverage a big fat chunk of GPL-licensed (or otherwise licensed) code under a reading of these licenses that is perceived as technically probably legal, but morally very ambiguous.
There may be discussions to be made about licenses, but "to watch people cheer at having their work and time exploited by a company worth billions" is a disappointingly myopic take, especially from a developer.
Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.
We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
On their website they say that "GitHub Copilot is a code synthesizer, not a search engine: the vast majority of the code that it suggests is uniquely generated and has never been seen before. We found that about 0.1% of the time, the suggestion may contain some snippets that are verbatim from the training set."
So it won't copy-paste your code. It has just read code from open sources and learned from it, similar to what humans do. So I don't see any problem with this.
GitHub has 56 million users as of September 2020 (according to Wikipedia). Let's assume that only 1 million of them use Copilot at an average of once a week.
That means that every week, there will be 1,000 verbatim copy-pastes of code by Copilot. Then multiply that by a year or more as Copilot gets older.
0.1% may not seem like a lot, but at the scale of Internet companies, it always is.
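The back-of-the-envelope estimate above checks out, taking the commenter's usage figures as assumptions:

```python
users = 1_000_000      # assumed Copilot users (out of ~56M GitHub accounts)
uses_per_week = 1      # assumed average uses per user per week
verbatim_rate = 0.001  # GitHub's reported 0.1% of suggestions

verbatim_per_week = users * uses_per_week * verbatim_rate
print(int(verbatim_per_week))  # 1000
```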
To the extent Copilot is doing something illegal, or making its users inadvertently engage in illegal behavior, it is copyright infringement, as (most) license violations are copyright violations.
Copyright cuts both ways. Free Software and Open Software exist in context, and because of, copyright laws. This means that a person or a company using output from Copilot may be engaging in copyright infringement. In other words, Copilot is enabling software piracy.
I might be sympathetic to it, and even consider it mostly positive, but then if companies can use my code ignoring the license, I want to be able to Torrent their products in peace too.
Github/Microsoft is going to take your code, and then cut off your access to it. This is what the GPL was designed to fight, so they're going to try it this way instead.
Those who do not learn history yadda yadda.
brutal_chaos_|4 years ago
Abishek_Muthian|4 years ago
If so, then what about private repositories with a permissive license that were not made public for whatever reason?
What about projects whose dependencies have a permissive license but whose main repo doesn't? Can GitHub just go "oops"?
I think the fact that so much confusion exists regarding their product, and the possible violation of users' trust, is a valid reason to be pissy.
matsemann|4 years ago
But we didn't.
licenseauth|4 years ago
Where is this MIT-licensed code of yours? It definitely is not on your GitHub.
seph-reed|4 years ago
firebaze|4 years ago
yangff|4 years ago
And also recently we saw GPT, which generates articles, and waifulabs, which generates... waifus... To be honest I cannot perceive the difference, since all of them are "learning" (in a mechanical way) from human-created knowledge.
wildmanx|4 years ago
I'm really waiting for this to blow up from the open source license angle. Freely combining code with different licenses is a hellish undertaking on its own. But already just re-using some, say, GPL code, even staying under the same license, but without proper attribution, is Forbidden with a capital F.
dathinab|4 years ago
More like a defective approach; behavior like that is well known(1) to be basically guaranteed to happen with GPT-3 and similar approaches.
(1): By people involved in the respective science categories (Representation Learning/Deep Learning, NLP, etc.).
joe_the_user|4 years ago
It's an interesting question.
1) When a human being reads code or a CS textbook, we think of them extracting general principles from the code and thus not having to repeat that particular code again. In contrast, what GPT-3 and Copilot seem to do is just extract sequences of little snippets, something that apparently requires them to regurgitate the text they've been trained on. That seems rather permanently dependent on the training corpus.
2) Human beings have a natural urge, a natural ethos, to help people learn. It's understandable. The thing is, when suddenly you're not talking about people but machines, the reason for this urge easily vanishes. Even if GitHub was extracting knowledge from the code, I wouldn't have a reason to help them do so, since that knowledge would be entirely their private property. They expect to charge people whatever they judge the going rate to be - why should anyone help them without similar compensation? That this is being done by "OpenAI", a company which went from open nonprofit to closed for-profit in a matter of a few years, should accent this point. We're nowhere near a system that could digest all the knowledge of humankind. But if we got there, one might argue the result should belong to humankind rather than to one genius entrepreneur. And having the result belong to one genius entrepreneur has some clear downsides.
dathinab|4 years ago
TL;DR: The AI doesn't know it can't just copy-paste (from perfect memory), and as such it learned to sometimes just copy-paste things.
The GPT model doesn't "learn to understand the code and reproduce code based on that knowledge".
What it learns is a bit of understanding, but it is more akin to recombining and tweaking verbatim text snippets it has seen before, without really understanding them or the concept of "not just copy/pasting code" (but while knowing which patterns "fit together").
This means that the model will, "if it fits", potentially copy/paste code "from memory" instead of writing new code which just happens to be the same/similar. It's like a person with perfect memory sometimes copy-pasting code they had seen before while pretending they wrote it based on their "knowledge". Except worse, as it will also copy semantically irrelevant comments or sensitive information (if not pre-filtered out before training).
I.e. there is a difference between "having a different kind of understanding" and "vastly lacking understanding but compensating by copying remembered code snippets from memory".
Theoretically it could be possible to create a GPT model which is forced to (somewhat) understand programming but not memorize text snippets, but practically I think we are still far away from that, as it's really hard to tell whether a model memorized copyright-protected code.
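A crude way to check for this kind of copy-from-memory is to look for long verbatim token runs shared between a model's output and the training corpus. A minimal sketch of the idea (the corpus, strings, and window size below are all invented for illustration; real systems would need tokenization and indexing far beyond this):

```python
def ngrams(tokens, n):
    """Yield every contiguous run of n tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def looks_memorized(generated, corpus_files, window=8):
    """Flag the generation if any `window`-token run appears verbatim
    in any training file. Token-level, so reformatting alone won't hide a copy."""
    corpus_ngrams = set()
    for text in corpus_files:
        corpus_ngrams.update(ngrams(text.split(), window))
    return any(g in corpus_ngrams for g in ngrams(generated.split(), window))

# Hypothetical corpus containing the famous fast-inverse-sqrt line.
corpus = ["i = 0x5f3759df - (i >> 1); // what the fuck?"]
print(looks_memorized("i = 0x5f3759df - (i >> 1); // what the fuck?", corpus))  # True
print(looks_memorized("for x in range(10): print(x)", corpus))                  # False
```

Even a check this simple illustrates why the problem is hard at scale: the index over a corpus of billions of lines is enormous, and near-verbatim copies with renamed variables slip straight through.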
nlh|4 years ago
A similar product, TabNine, has been around for years. It does essentially the exact same thing as Copilot, it’s trained on essentially the same dataset, and it gets mentioned in just about every thread on here that talks about AI code generation. (It’s a really cool product btw and I’ve been using and loving it for years). According to their website they have over 1M active users.
Why is this suddenly such a big deal, and why is everyone suddenly freaking out about Copilot? Is it because it’s GitHub and Microsoft and OpenAI behind Copilot vs some small startup you’ve never heard of? Is it just that the people freaking out weren’t paying attention and didn’t realize this service already existed?
rdw|4 years ago
Also, tabnine has a smaller scope; you type "var " and it suggests a variable name and possibly the rest of the line, like autocomplete has been doing for decades. Perfectly normal.
My understanding of copilot is that you can type "// here's a high-level description of my problem" and it'll fill out entire functions, dozens of lines. The scope is much grander.
Lariscus|4 years ago
echelon|4 years ago
I think some are also beginning to feel an Amazonification happening. We built all the stuff and made it free, but now a company is going to own it and profit off of it.
Edit: If we want to prevent this, we need a new license that states our code may not be included in deep learning training sets.
Edit 2: if private repository code is in this training set, it may be possible to leak details of private company infrastructure. Models can leak training data.
ghoward|4 years ago
jchw|4 years ago
moocowtruck|4 years ago
ineedasername|4 years ago
Otherwise, I'm honestly trying to have a conversation on this to understand the objections, because I haven't made up my mind but struggle to see the problem. So please consider the following:
If the code was not encumbered by restrictions, I don't see an obvious problem with this. Using code or data or anything like that in the public commons for a meta-analysis doesn't strike me as wrong, even if the people doing it make money off of that analysis.
If I scraped GitHub code and then wrote a book about common coding patterns & practices, I don't think that would be wrong.
I used the Brown corpus and multiple other written-word corpora, along with WordNet and other sources, to write my Computational Linguistics thesis on Word Sense Disambiguation, later applying it to my job, which earns me money. Is this wrong?
Public datasets have been used extensively for ML already. I don't see this as much different.
pessimizer|4 years ago
It did. It's spitting up the AGPL in empty files, and AGPL'd code isn't free for commercial use. It requires people who use it to make changes available under the same license.
ghaff|4 years ago
joe_the_user|4 years ago
rcxdude|4 years ago
st_goliath|4 years ago
> ...
> It’s truly disappointing to watch people cheer at having their work and time exploited by a company worth billions.
Huh? Over the last few days that I've watched this "copilot" story unfold on various news aggregator sites, I've first seen people point out copyright and other issues with it, then the fast inverse square root tweet happened, and then more articles and tweets like this one and the discussion that we are currently having. But I somehow don't really recall anyone besides the Microsoft marketing department being overly excited about it. Did I miss something?
Nuzzerino|4 years ago
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
dinglejungle|4 years ago
nomercy400|4 years ago
That would be exciting tech for me.
ghoward|4 years ago
rvz|4 years ago
What you just saw 3 days ago was a hype-driven unveiling of a cherry-picked contraption by GitHub, OpenAI and Microsoft. Open source became the loser once again, got taken advantage of by this clever trick, and will soon fuel a paid service. (With lots of code that is under copyright of various authors.)
Anyone who critiqued the announcement three days ago was drowned out, downvoted and stamped on by the fanatics.
I wanted those who had access to it (not GitHub or Microsoft fans) to demystify and VERIFY the claims rather than blindly trust it. The skeptics' suspicions were right, and lots of questions still remain unanswered.
Well done for re-centralising everything to GitHub. Again.
andrewjl|4 years ago
scrollaway|4 years ago
seph-reed|4 years ago
Time to move on to the carbon age I suppose.
rikroots|4 years ago
I do pity the poor algorithm that has to parse sense into my coding idiosyncrasies.
sfg|4 years ago
If not, will anybody quietly slip something like this into Copilot's training data?
greatgib|4 years ago
But, in that case, I think the accusations leveled against GitHub are not right.
I think the idea is nice and a fair use of open source code. Anyone is free to download free software and do something similar, and that is fine.
I just find the product itself stupid, and it is up to users to be smart enough not to use it, knowing there is a risk of being sued for involuntarily violating copyright. And GitHub might be at risk if it is a paid service, as companies could sue them back, claiming they expected the code generated by GitHub to be safe for commercial use.
Also, I would think that GH would have overstepped if they had used private-repo code to train their model without permission.
ghoward|4 years ago
This means that if Copilot does not attribute code when it copies and modifies it, then it is violating most open source licenses. Full stop.
maxbendick|4 years ago
Microsoft of course will implement compliance standards as necessary (they genuinely do not want to break the law), but what does this mean for smaller companies and individuals training models?
eqtn|4 years ago
superkuh|4 years ago
Additionally, "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy.""
The above isn't to say I agree with this but just to highlight the dangers of outsourcing and the cloud.
blibble|4 years ago
> "The third-party doctrine is a United States legal doctrine that holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs), and e-mail servers—have "no reasonable expectation of privacy."
this is definitely not the case for 100% of the rest of the world
smoldesu|4 years ago
croes|4 years ago
They will do whatever they want with your code.
MS didn't change a bit.
yayr|4 years ago
especially: Conclusion and Next Steps.
This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
But there’s still one big difference between GitHub Copilot reciting code and me reciting a poem: I know when I’m quoting. I would also like to know when Copilot is echoing existing code rather than coming up with its own ideas. That way, I’m able to look up background information about that code, and to include credit where credit is due.
The answer is obvious: sharing the prefiltering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether.
This duplication search is not yet integrated into the technical preview, but we plan to do so. And we will both continue to work on decreasing rates of recitation, and on making its detection more precise.
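Mechanically, the UI feature GitHub describes could boil down to an index from code shingles back to their source files, so a matched suggestion can be attributed. A rough sketch of that idea (the file paths, corpus contents, and shingle size here are all invented; GitHub has not published its actual implementation):

```python
from collections import defaultdict

def shingles(text, n=8):
    """Break text into overlapping runs of n whitespace-separated tokens."""
    toks = text.split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

# Build the index: shingle -> set of training files containing it.
training_set = {
    "quake/q_math.c": "i = 0x5f3759df - (i >> 1); // what the fuck?",
    "hello/main.py": 'def greet(name): return "Hello, " + name + "!" # a friendly greeting',
}
index = defaultdict(set)
for path, text in training_set.items():
    for sh in shingles(text):
        index[sh].add(path)

def attribute(suggestion):
    """Return the training files a suggestion appears to quote verbatim."""
    sources = set()
    for sh in shingles(suggestion):
        sources |= index[sh]
    return sources

print(attribute("i = 0x5f3759df - (i >> 1); // what the fuck?"))  # {'quake/q_math.c'}
```

With such an index, the editor could surface "this suggestion overlaps file X under license Y" at suggestion time, letting the user add attribution or reject the completion, which is exactly the workflow the quoted post promises.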
temac|4 years ago
So their defense along the lines of "oh, it's fine, it very rarely emits verbatim things" is bullshit anyway. That's an answer to the wrong question, at least given that the answer points in this direction (were there tons of verbatim recitation, they obviously would not try to wave away the problem like that -- however, we cannot conclude anything from verbatim output being rare, despite them stating it as if it were a central and strong argument).
bphogan|4 years ago
abetusk|4 years ago
"""
4. License Grant to Us
We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
...
"""
Note that the relevant detail is that this applies to public repositories not covered under some free/libre license. I also assume this excludes private repos, which might have more restrictive terms of use. GitHub has a section on that; I just haven't read it in detail, so maybe the above covers private repos as well.
[0] https://docs.github.com/en/github/site-policy/github-terms-o...
ChrisMarshallNY|4 years ago
> We are obsessed with shiny without considering that it might be sharp.
unknown|4 years ago
[deleted]
yayr|4 years ago
If the creators interests are not clearly expressed anymore with a license, we need updates to the license texts.
Let's look at MIT:
____________________
"Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. [...]
____________________
From the license text alone, it would not be clear to me why anyone could claim that OpenAI Codex or GitHub Copilot would require attribution to any of the used MIT source code to generate the AI model. The AI model is simply not a copy of the source or of a portion thereof. It is essentially a mathematical/statistical analysis of it.
Now what about any generated new source? How similar does it need to be to some source to count as a copy? At what size does the generated code qualify as a copy instead of a snippet of industry best practice?
Where does the responsibility for attribution lie? Should we treat AI code-generation models like a copy & paste program? Usually you cannot really say where the copy came from with 100% certainty - how do you know what factors influenced it?
TeMPOraL|4 years ago
Let's handle the simplest case first: Copilot can and does regurgitate large pieces of its training dataset verbatim. This is a well-known and trivially demonstrable property of all ML models in this family. Would such exact copy fall under the license of the code being copied? This of course needs to be tested in courts, but my gut says "yes". The problem now is, if you're using Copilot, you may end up with such copied code in your codebase without ever knowing, and this might open you to liability.
lokl|4 years ago
TeMPOraL|4 years ago
cush|4 years ago
coliveira|4 years ago
haolez|4 years ago
[0] https://sourcehut.org/
ghoward|4 years ago
lmarcos|4 years ago
It's not that crazy.
unknown|4 years ago
[deleted]
SergeAx|4 years ago
Maybe it's my information bubble, but I don't see anyone cheering. Currently Copilot is churning out rather bad code. I definitely would not use it. And my prediction is that it will go the way of Tesla's Autopilot for years.
mensetmanusman|4 years ago
wizzwizz4|4 years ago
(Also, GPT-3 wasn't trained on nearly as much writing as that. Even if you ignore lost writing, GPT-3 was trained on a small subset of the 'net.)
ricardobeat|4 years ago
gdsdfe|4 years ago
goodpoint|4 years ago
A lot of people dislike them and minimize their use.
More importantly, we are seeing a bait-and-switch. People agreed to GitHub storing, showing and indexing their code and issues, not to using the code for Copilot, regardless of what the fine print in the usage agreement says.
SamWhited|4 years ago
joe_the_user|4 years ago
Maybe people should be mad about what Facebook or Google do but that stuff doesn't involve taking stuff outside their terms of use.
Maybe Github could try attaching a "we can relicense all your code whenever we want" condition to their hosting but they'd lose all their business.
ericmay|4 years ago
void_mint|4 years ago
...what?
shakow|4 years ago
code_duck|4 years ago
macintux|4 years ago
nomercy400|4 years ago
yongjik|4 years ago
Information that is aggregated and organized for easy retrieval is worth more than the sum of individual bits of information. I thought that was common sense.
We might as well complain that billionaire supermarket chains are pocketing all the profit while not growing a single potato by themselves.
pessimizer|4 years ago
Are you making a claim that Netflix shouldn't be required to pay for individual movies because they sell a collection of movies?
hekec|4 years ago
So it won't copy-paste your code. It has just read code from open sources and learned from it - similar to what humans do. So I don't see any problem with this.
rhn_mk1|4 years ago
Second, we can't ignore that if someone deliberately tries to make it spit out copyrighted code, the chances are going to be much greater.
Why would anyone? Plausible deniability: "I didn't copy this GPL procedure, the copilot gave it to me!"
ghoward|4 years ago
That means that every week, there will be 1000 verbatim copy-pastes of code by Copilot. Then multiply that by a year or more as Copilot gets older.
0.1% may not seem like a lot, but at the scale of Internet companies, it always is.
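Back-of-the-envelope, taking the ~0.1% recitation rate at face value and pairing it with a purely hypothetical usage volume (the weekly suggestion count below is invented for illustration):

```python
# ~0.1% of suggestions recite training code, per the rate discussed above.
recitation_rate = 0.001
# Hypothetical volume: one million accepted suggestions per week.
suggestions_per_week = 1_000_000

verbatim_per_week = recitation_rate * suggestions_per_week
print(int(verbatim_per_week))       # 1000 verbatim copies per week
print(int(verbatim_per_week * 52))  # 52000 per year
```

The point is that a small rate against a large denominator still yields a steady stream of potential license violations.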
macintux|4 years ago
You might want to check out this video...
https://twitter.com/mitsuhiko/status/1410886329924194309
einpoklum|4 years ago
Original code in somebody's GitHub repo:
Copilot code: Not copy pasted! Uniquely generated! Never before seen!
skc|4 years ago
unknown|4 years ago
[deleted]
mrkramer|4 years ago
tasubotadas|4 years ago
It's a NET POSITIVE FOR EVERYBODY.
TeMPOraL|4 years ago
rhn_mk1|4 years ago
Traubenfuchs|4 years ago
You just did.
stakkur|4 years ago
unknown|4 years ago
[deleted]
tedunangst|4 years ago
speedgoose|4 years ago
justbored123|4 years ago
[deleted]
ghoward|4 years ago
So I think I have a right to be mad when they do something like this to code I previously stored on GitHub.
throwaway2048|4 years ago
einpoklum|4 years ago