top | item 35859294

vadiml|2 years ago

I'm really baffled by all this discussion on copyrights in the age of AI. Copilot does not 'steal' or reproduce our code - it simply LEARNS from it as a human coder would learn from it. IMHO the desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas.

spuz|2 years ago

The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim.

> This can lead to some copylefted code being included in proprietary or simply not copylefted projects. And this is a violation of both the license terms and the intellectual proprety of the authors of the original code.

If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

Edit: I'm definitely no expert on copyright law for code, but my personal rule is: don't include someone's copyrighted code if it can be unambiguously identified as their original work. For a few lines of code, it would be hard to identify any single original author. When it comes to whole functions, it gets easier to say "actually, this came from this GPL-licensed project". Since Copilot can produce whole functions verbatim, this is the basis on which I state that it "would be a clear violation" of the licence. If Copilot chooses to be less concerned about violating the law than I am, then that's a problem. But maybe I'm overly cautious and the GPL is more lenient than this in reality.

hutzlibu|2 years ago

"The problem is not that Copilot produces code that is "inspired" by GPL code, it's that it spits out GPL code verbatim."

But only snippets as far as I can tell.

This is the code example linked by the author:

https://web.archive.org/web/20221017081115/https://nitter.ne...

It is still not trivial code, but are there really that many different ways to transpose matrices?

(Also, the input was "sparse matrix transpose, cs_", so his naming convention was specifically included. It is questionable whether a user would get his code in this shape with a normal prompt.)
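For what it's worth, compressed-sparse transposition is close to a textbook one-pass counting algorithm, which is part of why independent implementations end up looking alike. Here is a generic sketch (my own naming and layout, not the CSparse `cs_` code being discussed):

```python
def csr_transpose(n_rows, n_cols, indptr, indices, data):
    """Transpose a CSR matrix via the standard counting pass.

    A generic textbook sketch (hypothetical names), not the
    CSparse cs_transpose routine discussed above.
    """
    # Count entries per column of the input (= per row of the transpose).
    counts = [0] * n_cols
    for j in indices:
        counts[j] += 1

    # Prefix sums give the transposed row pointers.
    t_indptr = [0] * (n_cols + 1)
    for j in range(n_cols):
        t_indptr[j + 1] = t_indptr[j] + counts[j]

    # Scatter each entry into its slot in the transposed structure.
    t_indices = [0] * len(indices)
    t_data = [0] * len(data)
    next_slot = list(t_indptr[:n_cols])
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            p = next_slot[j]
            t_indices[p] = i
            t_data[p] = data[k]
            next_slot[j] += 1
    return t_indptr, t_indices, t_data
```

Beyond cosmetic variation, there isn't much room to deviate from this count / prefix-sum / scatter structure, which is the point above.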

And just slightly changing the code seems trivial - at what point will it be acceptable?

I just don't think spending much energy there is really beneficial for anyone.

I'd rather see the potential benefits of AI for open source. I haven't used Copilot, but ChatGPT-4 is really helpful for generating small chunks of code for me, enabling me to aim higher in my goals. So what's the big harm if some proprietary black box also gets improved, when all the open source devs can also produce with greater efficiency?

ithkuil|2 years ago

Could a human also accidentally spit out the exact code after having merely learned it, in good faith, rather than memorized it?

I guess the likelihood decreases as the code length increases, but it also increases the more constraints you impose on parameters such as code style, code uniformity, etc.

messe|2 years ago

> this would be a clear violation of the licence

Not necessarily. If it's just a small snippet of code, even an entire function taken verbatim, it may not be sufficiently creative for copyright to apply to it.

Copyright is a lot less black and white than most here seem to believe.

ignoramous|2 years ago

> If the author was a human, this would be a clear violation of the licence. The AI case is no different as far as I can tell.

We aren't talking verbatim generation of entire packages of code here, are we? Code snippets are surely covered under fair use?

Kiro|2 years ago

> it's that it spits out GPL code verbatim

It's not a problem in practice. It only does so if you bait it really hard and push it into a corner, at which point you may just as well copy the code directly from the repo. It simply doesn't happen unless you know exactly what code you're trying to reproduce and that's not how you use Copilot.

allmadhare|2 years ago

Just because code exists in a copyrighted project doesn't mean that it is the only instance of that code in the world.

In a lot of scenarios, there is an existing best practice or simply only one real 'good' way to achieve something. In those cases, are we really going to say that, even though a human would reasonably arrive at the same output code, the AI can't produce it because someone else wrote it already?

welshwelsh|2 years ago

This seems like a really, really easy problem to fix.

It should be easy to check Copilot's output to make sure it's not copied verbatim from a GPL project. Colleges already have software that does this to detect plagiarism.

If it's a match, just ask GPT to refactor it. That's what humans do when they want to copy stuff they aren't allowed to copy, they paraphrase it or change the style while keeping the content.
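As a toy illustration of how such a verbatim check could work (a naive n-gram sketch with names I made up, not how Copilot or any real plagiarism detector is implemented):

```python
def ngram_fingerprints(text, n=8):
    """Return the set of n-token shingles of a piece of code."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(candidate, corpus_text, n=8):
    """Fraction of the candidate's n-gram shingles that also appear in
    the corpus text; a value near 1.0 suggests verbatim copying."""
    cand = ngram_fingerprints(candidate, n)
    corp = ngram_fingerprints(corpus_text, n)
    if not cand:
        return 0.0
    return len(cand & corp) / len(cand)
```

A real system would normalize identifiers and whitespace before comparing; this sketch only flags literal token-level copying, but a score above some threshold could trigger the "ask it to refactor" step.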

vadiml|2 years ago

So we should attack the problem of proprietary code. Maybe from Right to Repair angle. I believe there should be no such thing as closed source code.

Aachen|2 years ago

Bit of a false equivalence to act as though a massive computer system is the same as any individual.

People put code on github to be read by anyone (assuming a public repository), but the terms of use are governed by the license. Now you've got a system that ignores the license and scrapes your data for its own purpose. You can pretend it's human but the capabilities aren't the same. (Humans generally don't spend a month being trained on all github code and remember large chunks of it for regurgitation at superhuman speeds, nor can they be horizontally scaled after learning.)

You can still be of the opinion that this is fine, and I may or may not be fine with it as well; I just don't think the stated reason holds up to logic, nor that other opinions ought to "baffle" you.

az226|2 years ago

And GitHub’s EULA gives it the right to train Copilot on public code you host on GitHub.

jeroenhd|2 years ago

Copilot has been caught multiple times reproducing code verbatim. At some point it spat out some guy's complete "about me" blog page. That's not learning, that's copying in a roundabout way.

Also, AI doesn't learn "like a human". Neural networks are an extremely simplistic representation of a biological brain and the details of how learning and human memory works aren't even all that clear yet.

Open source code usually comes with expectations for the people who use it. That expectation can be as simple as requiring a reference back to the authors, adding a license file to clarify what the source was based on, or in more extreme cases putting licensing requirements on the final product.

Unless Microsoft complies with the various project licenses, I don't see why this is antithetical to the idea of open source at all.

asimpletune|2 years ago

No disrespect, but I am baffled by your statement that it learns, let alone that it learns as a human coder would.

I don't really want this comment to be perceived as flame bait (AI seems to be a very sensitive topic, in the same sense as cryptocurrency), so instead let me just pose a simple question. If Copilot really learns as a human does, then why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

spuz|2 years ago

I think the comment was trying to draw a distinction between a database and a language model. The database of code on GitHub is many terabytes large, but the model trained on it is significantly smaller. This should tell us that a language model cannot reproduce copyrighted code byte for byte, because the original data simply doesn't exist. Similarly, when you and I read a block of code, it leaves our memory pretty quickly, and we wouldn't be able to reproduce it byte for byte if we wanted to. We say the model learns like a human because it is able to extract generalised patterns from viewing many examples. That doesn't mean it learns exactly like a human, but it's also definitely not a database.

The problem is that in reality, even though the original data is gone, a language model like Copilot _can_ reproduce some code byte for byte, somehow drawing the information from the weights in its network, and the result is a reproduction of copyrighted work.

CapsAdmin|2 years ago

> why don't we just train it on a CS curriculum instead of millions of examples of code written by humans?

I've never studied computer science formally, but I doubt students learn only from the CS curriculum. I don't even know how much knowledge a CS curriculum entails, but I don't, for example, see anything wrong with it including example code written by humans.

Surely students will collectively also learn from millions of code examples online alongside their studies. I'm sure teachers do the same.

A language model can also only learn from text, so what about all the implicit knowledge and verbal communication?

mirekrusin|2 years ago

We do, but we also simulate it doing homework very well.

flumpcakes|2 years ago

AI doesn't "learn". It's statistical inference if anything.

If I took two copyrighted pictures and layered them on top of each other at 50% opacity, would that be OK, or copyright infringement?

AI models just use more weights/biases and more images (or any input).
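For concreteness, the 50%-opacity layering described above amounts to a per-pixel weighted average. A minimal sketch on grayscale pixel lists (a hypothetical helper, not any particular library's API):

```python
def blend(img_a, img_b, alpha=0.5):
    """Blend two equal-sized images pixel-wise:
    out = (1 - alpha) * a + alpha * b.
    Images are flat lists of 0-255 grayscale values."""
    assert len(img_a) == len(img_b)
    return [round((1 - alpha) * a + alpha * b)
            for a, b in zip(img_a, img_b)]

# blend([0, 100, 200], [200, 100, 0]) -> [100, 100, 100]
```

The analogy in the comment is that a trained model is, loosely, a far higher-dimensional weighted combination of its inputs, with learned rather than fixed weights.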

vadiml|2 years ago

And what is LEARNING in your opinion?

CapsAdmin|2 years ago

You can make out the two original copyrighted pictures in that case, and all you did was apply 50% opacity, which might not be very transformative, so probably?

In my mind (and I suspect others' too), in the machine learning context, statistical inference and learning have become synonymous with all the recent developments.

The way I see it, there's now a discussion around copyright because people have different fundamental views on what learning is and what it means to be human, and those views don't really surface.

lewhoo|2 years ago

If "like a human" is enough to get human rights, then why did I get a parking ticket even when I argued that my car just stands there like a human? This really isn't as good a defense as people portray it to be. There are a lot of rights and privileges granted to humans but not to objects - we can all agree on that, I think.

datavirtue|2 years ago

And if you need a person with supercharged rights and a slippery amount of liability...form a corporation.

bilqis|2 years ago

There is a difference between a person learning and a commercial product learning from someone else’s work, probably ignoring all the licenses.

adlpz|2 years ago

To be fair, when a programmer learns from publicly available but not public-domain code, and then applies the ideas, patterns, idioms and common implementations in their daily job as a software developer, the result is very much a "commercial product" (the dev company, the programmer themselves if a freelancer) learning from someone else's work and ignoring all the licenses.

The only leap here is the fact that the programmer has outsourced the learning to a tool that does it for them, which they then use to do their job, just as before.

josefx|2 years ago

> all this discussion on copyrights in the age of AI.

Copyright is a thing; AI does not change that.

> does not 'steal' or reproduce our code - it simply LEARNS from it as a human

And here we have the central problem: does it act like a human, or does it not? Humans copy things they learn all the time; some of us know various songs by heart, others will even quote entire movies from memory. If AI can learn and reproduce things like humans do, then you need to take steps to ensure that the output is properly stripped of any content that might infringe on existing copyrighted works.

ChatGTP|2 years ago

There is a definite difference between singing a song while walking down the street and writing down the lyrics, putting it in a database, claiming it’s my content and then selling it on, even if it’s slightly rehashed.

webmobdev|2 years ago

I would have no problem if such AI systems were also completely open source, could be run by me on my system, and came with all the models needed to use them easily available (again, under some form of open source license). I genuinely don't see that happening with BigTech. As such, as a proponent of the FSF GPL philosophy, I have no interest in supporting such systems with my hard work, my source code. So yes, I do consider it stealing - my hard labour in any GPL open source work is meant for the public good (for example, to preserve our right to repair by ensuring the source code is always available through the GPL license). Any corporation that uses my work for profit, without paying me, while blocking the public good that I am striving for, is simply exploiting me and the goodwill of others like me.

anileated|2 years ago

Copilot does not steal. Copilot does not learn. If you want to apply these concepts to LLMs, first prove how an LLM is human and then explain why it doesn’t have human rights.

Rather, Copilot is a tool. Microsoft/ClosedAI operate this tool. Commercially. They crawl original works and, by running ML on them, automatically generate and sell derivative works from those originals, without any consent or compensation. They are the ones who violate copyright, not Copilot.

Merad|2 years ago

Whether an LLM actually learns is completely tangential to the topic at hand. A human coder who learned from copyrighted code and then reproduced that code (intentionally or not) would be in violation of the copyright. This is why projects like Wine are so careful about doing clean room implementations.

As an aside, it seems really strange to invoke "open source ideas" as an argument in favor of a for-profit company building a closed source product that relies on millions of lines of open source code.

bamboozled|2 years ago

It's also fair to say that a lot of this carefulness has probably made life difficult for the developers of Wine, but they wanted to avoid Microsoft's legal team, so they respected the copyright laws.

Here is Microsoft doing as Microsoft does…

pull_my_finger|2 years ago

I'm in several communities for smaller/niche languages, and asking questions about things that have few sources makes it much clearer that it's not "learning" but grabbing passages/chunks of source. Maybe with subjects that have more coverage it can produce more "original"-sounding output.

cccbbbaaa|2 years ago

Plenty of people already argued that LLMs don't actually learn like a human. However, you should keep in mind the reason why clean-room reverse engineering exists: humans learn from source material. FLOSS RE projects (e.g. nouveau) typically don't like leaks, because some contributors might be exposed to copyrighted material. Sometimes, the opposite happens: people working on proprietary software are not allowed to see the source of a FLOSS alternative.

Twirrim|2 years ago

> it simply LEARNS from it as a human coder would learn from it.

It doesn't LEARN anything, let alone like a human coder would. It has absolutely zero understanding. It's not actually intelligent. It's a highly tuned mathematical model that predicts what the next word should be.

BlueTemplar|2 years ago

I can also learn things with no understanding (like a foreign word); I doubt that would make me immune to copyright?

sethd|2 years ago

Your comment implies that we’re in some age of AGI, but we’re not there yet. Some argue that we’re not even close, but who knows, that’s all speculation.

> it simply LEARNS from it as a human coder would learn from it.

The LLM doesn't learn; the authors of the LLM are encoding copyright-protected content into a model using gradient descent and other techniques.

Now as far as I understand the law, that’s OK. The problems arise when distribution of the model comes into play.

I'm curious, are you a programmer yourself? Don't take this the wrong way, but I want to understand the background of people who come to the kind of conclusions you seem to have arrived at about how LLMs work.

otikik|2 years ago

> it simply LEARNS from it as a human coder would learn from it

What humans do to learn is intuitive, but it is not simple. What the machine does is also not simple, it involves some tricky math.

If the process were simple, it could more easily be argued that the machine is "just copying" - copying is simple.

There's a lot of nuance here.

What the machine is doing "looks similar to what humans do from the exterior", the same way that a plane flying "looks similar" to a flying bird. But the airplane does not flap its wings.

> kind of irrational and antithetical to open source ideas

Open source ideas are not the only ideas in town.

eptcyka|2 years ago

Humans don't learn an algorithm by memorizing a particular implementation character by character.

ignoramous|2 years ago

That's all the more reason for the utility of solutions like Copilot? Humans are limited in both time and memory.

Though, GitHub would do well to also bake in appropriate attribution if a significant portion of the generated code is a copypasta.

remix2000|2 years ago

Neither does copilot.

golergka|2 years ago

And airplanes don't flap their wings, but we still agree that they're flying, just as birds do.

vadiml|2 years ago

There are people who do it... I personally know a guy with photographic memory.

ChatGTP|2 years ago

Humans are intentionally loading giant sets of curated data, for training purposes, into a supercomputer to produce a model which is a black box, and have provided zero attribution or credit to those who made this work possible. Humans are tuning these models to produce the results you see.

In the case of ChatGPT-x, OpenAI is a company disguised as a not-for-profit, with a goal of producing ever more powerful models that may eventually be capable of replacing you at work, while seemingly not having any plan to give back to those whose work was used to make them insane amounts of money.

They haven’t even given back any of their research. So it’s ok to take everyone’s open source work and not give back is it ?

This isn't some cute little robot who wakes up in the morning and decides it wants to be a coder. This is a multinational company that has created the narrative you're repeating. They know exactly what they're doing.

oytis|2 years ago

"Learning" is a technical term, AI doesn't really learn the same way a human does. There is a huge difference between allowing your fellow human beings to learn from you and allowing corporations to appropriate your knowledge by passing it through a stochastic shuffler.

zirgs|2 years ago

Individuals can train their own LLMs too.

xxs|2 years ago

>it simply LEARNS from it as a human coder would learn from it

I thought that was a sarcastic remark, given the capitalization of 'LEARNS', but the 'IMHO' that followed dispelled that notion.

We have no idea how humans learn, and the 'AI' has a statistical approach, not much more than that.

lawn|2 years ago

A human who learns to copy code letter for letter does just that: copies code. Same with an AI.

The interesting debate should be about what happens in the gray area, when you read a lot of code and learn patterns and ideas.

datavirtue|2 years ago

Code is, at best, a trade secret (it is also data). Keep it close to your chest, or don't.

sureglymop|2 years ago

But... to be clear, what you can and can't do with certain code depends on the license. Imagine code that is "open source" as in openly visible and available, yet the license explicitly forbids using it to train any AI/LLM. Now how could the creator enforce that? Don't get me wrong, I am aware that the enforcement of such licenses is already hard (even for organizations like the FSF)... but now you are going up against something automated, where you might not even know what exactly happens.

hnbad|2 years ago

Potayto potahto. We all know there's a difference between training a machine learning model and learning a skill as a human being. Even if you can trick yourself into believing AI is just kinda like how human brains work maybe, the obvious difference is that you can't just grow yourself a second brain and treat it like a slave whereas having more money means you can build a bigger and better AI and throw more resources at operating it.

Intellectual property is a nebulous concept to begin with, if you really try to understand it. There's a reason copyright claim systems like those at YouTube don't really concern themselves with ownership (that's what DMCA claims are for) but instead with the arbitrary terms of service that don't require you to have a degree in order to determine the boundaries of "fair use" (even if it mimics legal language to dictate these terms and their exemptions).

The problem isn't AI. The problem is property. Ever since Enclosure we've been trying to dull the edges of property rights to make sure people can actually still survive despite them. At some point you have to ask yourself if maybe the problem isn't how sharp the blade you're cutting yourself is but whether you should instead stop cutting. We can have "free culture" but then we can't have private property.

lucideer|2 years ago

> IMHO desire to prevent learning from your open source code seems kind of irrational and antithetical to open source ideas

You may be right that this is antithetical to "open source" ideas, as Tim O'Reilly would've defined it - a la MIT/BSD/&c., but it's very much in line with copyleft ideas as RMS would've defined it - a la GPL/EUPL/&c. - which is what's being explicitly discussed in this article.

The two are not the same: "open source" is about widespread "open" use of source code, copyleft is much more discerning and aims to carefully temper reuse in such a way that prioritises end user liberty.

goodpoint|2 years ago

> it simply LEARNS from it as a human coder would learn from it.

This is really not how LLMs work.

loveparade|2 years ago

A key difference is that a company is making a proprietary paid product out of the learnings from your code. This has nothing to do with open source.

If the data could only be used by other open source projects, e.g. open source AI models, I don't think anyone would complain.

You could argue "well, but anyone can use the code on Github" and while that's technically true, it's obvious that with both Github and OpenAI being owned by Microsoft, OpenAI gets a huge competitive advantage due to internal partnerships.

toastal|2 years ago

Imagine if folks got royalties on commits, or the language model was required to be open as well.

dingledork69|2 years ago

The company that trains/owns the AI steals the content.

friendzis|2 years ago

> it simply LEARNS from it as a human coder would learn from it

Does it, though? It "learns" correlations between tokens/sequences. A human coder would look at a piece of code and learn an algorithm; the AI "learns" token structure. A human reproducing the original code verbatim would be incidental, and an AI (a language model, at least) producing algorithm-implementing code would be equally incidental.

pppkkkiii|2 years ago

If that were true, Copilot would have been trained on the Windows and Office source code too. But we don't see that.

datavirtue|2 years ago

Nobody wants that.

9991|2 years ago

Apes love moralizing and being indignant. This joker wants to share open source code and restrict what other people do with it.

bombolo|2 years ago

So, like any license except public domain?

Have you personally ever put out something in public domain?