vadiml
|
2 years ago
I'm really baffled by all this discussion on copyrights in the age of AI. Copilot does not
'steal' or reproduce our code - it simply LEARNS from it, as a human coder would. IMHO the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.
spuz|2 years ago
> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual proprety of the authors of the original code.
If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.
Edit: I'm definitely no expert on copyright law for code, but my personal rule is: don't include someone's copyrighted code if it can be unambiguously identified as their original work. For a few small lines of code, it would be hard to identify any single original author. When it comes to whole functions, it gets easier to say "actually, this came from this GPL-licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am, then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.
hutzlibu|2 years ago
But only snippets as far as I can tell.
This is the code example linked by the author:
https://web.archive.org/web/20221017081115/https://nitter.ne...
It is still not trivial code, but are there really lots of different ways to transpose matrices?
(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. It is questionable whether a user would get his code in this shape with a normal prompt.)
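For what it's worth, the core of a sparse transpose really is short. As a hedged illustration - a generic sketch in coordinate (COO) form, not the cs_ code from the linked tweet - transposing amounts to swapping row and column indices:

```python
# A sparse matrix in COO form is a list of (row, col, value) triples.
# Transposing it just swaps the row and column of every nonzero entry.
# Generic sketch for illustration only; not anyone's specific library code.
def transpose_coo(triples):
    """Transpose a sparse matrix given as (row, col, value) triples."""
    return [(c, r, v) for (r, c, v) in triples]

# 2x3 matrix with two nonzeros: A[0][2] = 5 and A[1][0] = 7
A = [(0, 2, 5), (1, 0, 7)]
print(transpose_coo(A))  # [(2, 0, 5), (0, 1, 7)]
```

Compressed formats like CSR need a counting pass as well, but the space of reasonable implementations is still small, which is part of why independent reinvention of similar-looking code is plausible.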
And just slightly changing the code seems trivial - at what point would it become acceptable?
I just don't think spending much energy there is really beneficial for anyone.
I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but GPT-4 is really helpful for generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?
ithkuil|2 years ago
I guess the likelihood decreases as the code length increases, but it also increases with the more constraints you impose on parameters such as code style, code uniformity, etc.
messe|2 years ago
Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.
Copyright is a lot less black and white than most here seem to believe.
ignoramous|2 years ago
We aren't talking verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?
Kiro|2 years ago
It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.
allmadhare|2 years ago
In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something. In those cases, are we really going to say that, even though a human would reasonably arrive at the same code, the AI can't produce it because someone else wrote it first?
welshwelsh|2 years ago
It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.
If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy: they paraphrase it or change the style while keeping the content.
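A crude sketch of such a check - a hypothetical n-gram overlap test, far simpler than real plagiarism detectors like MOSS, but illustrating the idea of flagging long verbatim runs:

```python
# Hypothetical sketch of the check described above: flag generated code
# that shares a long enough run of tokens with a known GPL corpus.
# Real tools normalize whitespace, identifiers, etc.; this does not.
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_verbatim_run(generated, corpus_snippets, n=8):
    """True if `generated` shares an n-token run with any corpus snippet."""
    gen = ngrams(generated.split(), n)
    return any(gen & ngrams(snippet.split(), n) for snippet in corpus_snippets)

gpl_corpus = ["for ( int i = 0 ; i < n ; i ++ ) sum += a [ i ] ;"]
print(shares_verbatim_run(
    "int total = 0 ; for ( int i = 0 ; i < n ; i ++ ) sum += a [ i ] ;",
    gpl_corpus))  # True
```

The awkward part in practice is choosing `n`: too small and every idiomatic loop matches, too large and lightly reworded copies slip through.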
vadiml|2 years ago
sacrosancty|2 years ago
[deleted]
Aachen|2 years ago
People put code on GitHub to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purposes. You can pretend it's human, but the capabilities aren't the same. (Humans generally don't spend a month being trained on all of GitHub's code, remember large chunks of it for regurgitation at superhuman speed, or get horizontally scaled after learning.)
You can still be of the opinion that this is fine, and I may or may not be fine with it as well. I just don't think the stated reason holds up to logic, or that other opinions ought to "baffle" you.
az226|2 years ago
jeroenhd|2 years ago
Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.
Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.
Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.
asimpletune|2 years ago
I don't really want this comment to be perceived as flame bait (AI seems to be as sensitive a topic as cryptocurrency), so instead let me just pose a simple question: if Copilot really learns like a human, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?
spuz|2 years ago
The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte, somehow drawing the information from the weights in its network, and the result is a reproduction of a copyrighted work.
CapsAdmin|2 years ago
I've never studied computer science formally, but I doubt students learn only from the CS curriculum. I don't even know how much knowledge a CS curriculum entails, but I don't, for example, see anything wrong with it including example code written by humans.
Surely students will collectively also learn from millions of code examples online alongside their studies. I'm sure teachers do the same.
A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?
mirekrusin|2 years ago
flumpcakes|2 years ago
If I took two copyrighted pictures and layered them on top of each other at 50% opacity, would that be OK, or copyright infringement?
AI models just use more weights/biases and more images (or any other input).
vadiml|2 years ago
CapsAdmin|2 years ago
In my mind (and I suspect in others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.
The way I see it, there's now a discussion around copyright because people have different fundamental views - which don't really surface - on what learning is and what it means to be human.
lewhoo|2 years ago
datavirtue|2 years ago
bilqis|2 years ago
adlpz|2 years ago
The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.
josefx|2 years ago
Copyright is a thing; AI does not change that.
> does not 'steal' or reproduce our code - it simply LEARNS from it as a human
And here we have the central problem: does it act like a human or does it not? Humans copy things they learn all the time; some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do, then you need to take steps to ensure that the output is properly stripped of any content that might infringe on existing copyrighted works.
ChatGTP|2 years ago
webmobdev|2 years ago
anileated|2 years ago
Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool. Commercially. They crawl original works and, by running ML on them, automatically generate and sell derivative works from those original works, without any consent or compensation. They are the ones who violate copyright, not Copilot.
Merad|2 years ago
As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.
bamboozled|2 years ago
Here is Microsoft doing as Microsoft does…
pull_my_finger|2 years ago
cccbbbaaa|2 years ago
Twirrim|2 years ago
It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.
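A toy sketch of that "predict the next word" framing - a hypothetical bigram counter, vastly simpler than a real LLM, but the same shape of task: given preceding text, emit the statistically most likely continuation, with no notion of meaning involved:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word most often follows each word
# in the training text, then predict by lookup. Real LLMs are enormously
# more sophisticated, but the objective is the same shape: next-token
# prediction from statistics, not understanding.
def train_bigrams(text):
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the word that most often followed `word` in training."""
    return counts[word].most_common(1)[0][0]

model = train_bigrams("the cat sat on the mat the cat sat down")
print(predict_next(model, "cat"))  # sat
print(predict_next(model, "the"))  # cat
```

The model "knows" nothing about cats or mats; it reproduces whatever co-occurrence statistics the training text happened to contain.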
BlueTemplar|2 years ago
sethd|2 years ago
> it simply LEARNS from it as a human coder would learn from it.
The LLM doesn’t learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.
Now, as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.
I’m curious: are you a programmer yourself? Don’t take this the wrong way, but I want to understand the background of people who come to the kind of conclusions you seem to arrive at about how LLMs work.
otikik|2 years ago
What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple; it involves some tricky math.
If the process were simple, then it could be more easily argued that the machine is "just copying" - that is simple.
There's a lot of nuance here.
What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.
> kind of irrational and antithetical to open source ideas
Open source ideas are not the only ideas in town.
eptcyka|2 years ago
ignoramous|2 years ago
Though, GitHub would do well to also bake in appropriate attribution if a significant portion of the generated code is a copypasta.
remix2000|2 years ago
golergka|2 years ago
vadiml|2 years ago
ChatGTP|2 years ago
In the case of ChatGPT-x, OpenAI is a company disguised as a non-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly having no plan to give back to those whose work was used to make it insane amounts of money.
They haven’t even given back any of their research. So it’s OK to take everyone’s open source work and not give back, is it?
This isn’t some cute little robot who wakes up in the morning and decides it wants to be a coder. This is a multinational company that has created the narrative you’re repeating. They know exactly what they’re doing.
oytis|2 years ago
zirgs|2 years ago
unknown|2 years ago
[deleted]
xxs|2 years ago
I thought that was a sarcastic remark, given the capitalization of 'learn', but the 'IMHO' that followed dispelled that idea.
We have no idea how humans learn, and the 'AI' takes a statistical approach, not much more than that.
lawn|2 years ago
The interesting debate should be about what happens in the gray area, where you read a lot of code and learn patterns and ideas.
datavirtue|2 years ago
sureglymop|2 years ago
hnbad|2 years ago
Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).
The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.
lucideer|2 years ago
You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.
The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.
goodpoint|2 years ago
This is really not how LLMs work.
loveparade|2 years ago
If the data could only be used by other open source projects, e.g. open source AI models, I don't think anyone would complain.
You could argue "well, but anyone can use the code on GitHub", and while that's technically true, it's obvious that with GitHub owned by Microsoft and OpenAI closely partnered with it, OpenAI gets a huge competitive advantage from internal partnerships.
toastal|2 years ago
unknown|2 years ago
[deleted]
dingledork69|2 years ago
friendzis|2 years ago
Does it, though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm; the AI "learns" token structure. A human reproducing original code verbatim would be incidental. An AI (a language model, at least) producing algorithm-implementing code would be just as incidental.
pppkkkiii|2 years ago
datavirtue|2 years ago
9991|2 years ago
bombolo|2 years ago
Have you personally ever put out something in public domain?