Ownership of AI-Generated Code Hotly Disputed

87 points | Chris2048 | 3 years ago | spectrum.ieee.org

200 comments

[+] wrs|3 years ago|reply
“…modify[ing] its AI model so that it traces attribution and gives credit to the original authors of the code, adding the associated copyright notices and license terms in the process…Biderman says is technologically feasible.”

Is it really feasible? What does “traces attribution” even mean here? It’s not emitting “code”, it’s emitting individual tokens that each were found throughout the input corpus. The “code” is the arrangement of those tokens, but that is determined by the weighting of the whole network, so what can be traced?

Can someone who understands generative ML better than me weigh in on this?

[+] dahart|3 years ago|reply
Why wouldn’t it be feasible? (Maybe this depends on what you mean by ‘feasible’.) There’s no technical reason you can’t back-track the weights and make a list of which tokens from which training data were sampled. The list might be long, it could be impractical, but that has little bearing on whether it’s technically possible, right?

The problem here happens when the same source is sampled for many tokens in a row because it’s the only match for the context. It could also happen that many tokens in a row each have a long list of sources, but when put together have a subset of sources that appear in every token’s list. That means that someone’s input is being repeated verbatim, even if the network wasn’t trying to reproduce a single source. We could prune the list of attribution sources at the expense of compute by running largest common subset algorithms, which might be sufficient for attribution tracing?
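A minimal sketch of that pruning idea (the source sets and repo names are invented for illustration): each generated token carries a set of candidate training sources, and a run of tokens whose candidate sets share a common member is flagged as potentially copied from that shared source.

```python
# Hypothetical sketch: greedily scan for runs of consecutive tokens whose
# per-token candidate-source sets have a non-empty intersection. A long
# enough run attributable to a single source suggests verbatim copying.
def attribute_runs(token_sources, min_run=4):
    """token_sources: list of sets of source IDs, one set per generated token.
    Returns (start, end, shared_sources) for runs of length >= min_run."""
    runs = []
    i, n = 0, len(token_sources)
    while i < n:
        shared = set(token_sources[i])
        j = i + 1
        # Extend the run while the intersection stays non-empty.
        while j < n and shared & token_sources[j]:
            shared &= token_sources[j]
            j += 1
        if j - i >= min_run:
            runs.append((i, j, shared))
        i = j if j > i + 1 else i + 1
    return runs

# Toy example: the first five tokens all list "repoA" among their
# candidates, so that run is flagged as attributable to repoA.
sources = [{"repoA", "repoB"}, {"repoA"}, {"repoA", "repoC"},
           {"repoA"}, {"repoA", "repoB"}, {"repoD"}]
print(attribute_runs(sources))
```

The `min_run` threshold is exactly the "how many tokens in a row" question: below it, overlap is treated as coincidental; above it, as copying.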

It feels like this whole question might hinge on Fair Use. The network is copying other people’s code one token at a time. We (society & copyright law) all tend to agree that’s fine when it’s a single token out of context, and we all tend to agree it’s not fine when the whole output program matches any single input source. The question naturally becomes, “where’s the line, how many tokens in a row from a single source should be allowed?”

[+] falcolas|3 years ago|reply
> It’s not emitting “code”, it’s emitting individual tokens ... The “code” is the arrangement of those tokens, but that is determined by the weighting of the whole network

This theory of operation is not borne out in reality. It's been clearly demonstrated that these tools emit verbatim copies of existing code (and its comments) from their training input.

It's even being seen in image generation, where NatGeo cover images are reproduced in their entirety or where stock photo watermarks are emitted on finished images.

And so, what can be traced back to individual sources? Quite a bit it would seem.

[+] ProlificInquiry|3 years ago|reply
This quote seems to fundamentally misunderstand what transformers are doing at all. Technically I suppose you could save all gradient updates from every input token, and do some weighted averaging to show which inputs affected the particular output the most, but saving all those gradient updates would be unimaginably space-consuming. "Feasible" is doing a lot of work there.

It's very hard for people to get away from the idea that GPT is "copying" something, but that's not what it's doing. The reality is, to get the exact artifact which produced the code in question, you need "Call me Ishmael" from Moby Dick just as much as the Linux kernel source.

[+] Filligree|3 years ago|reply
It sounds like nonsense. The most plausible solution would be to provide credit to every single author whose code was used in the original training set; of course, that would run into gigabytes just for the credits.
[+] hackinthebochs|3 years ago|reply
In the ideal case the next token is determined by the local context (the prefix string) and the entire corpus of trained code. In this case the prefix string has not been seen before and so the generator must do some interpretation/extrapolation to determine the likely continuation. But in some cases, perhaps many cases, the prefix string has been seen before, or is similar enough to what has been seen before, that the best continuation is just to spit out the similar string in the training corpus. Presumably such cases can be detected due to specific patterns of activation in the network and attributions can be captured/applied.

One dumb way to do this would be to include self-attributions directly in the stream of training data. So in the cases where the best continuation is to just transcribe the training data, the attribution is included in the data itself.
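A hedged sketch of what that could look like (the `ATTRIB` marker, field names, author, and URL here are all invented for illustration, not any real tool's format): provenance is prepended to each training file, so a model that regurgitates a file verbatim is more likely to emit its notice along with it.

```python
# Hypothetical sketch: embed attribution directly in the training stream
# by prefixing each file with a machine-readable provenance comment.
def annotate(source_code, author, license_id, url):
    """Prepend an invented ATTRIB header to a training sample."""
    header = f"# ATTRIB author={author} license={license_id} url={url}\n"
    return header + source_code

# Invented example file and metadata.
sample = annotate("def add(a, b):\n    return a + b\n",
                  author="jdoe", license_id="MIT",
                  url="https://example.com/jdoe/mathutils")
print(sample.splitlines()[0])
```

Whether a trained model would reliably reproduce such headers alongside memorized code is an open empirical question; this only shows the data-preparation side.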

[+] dec0dedab0de|3 years ago|reply
Honestly, I would be fine with verbatim chunks of code if it could guarantee a compatible license and did the copyright notices properly.

That always seemed like a problem with open source, you should theoretically be able to copy bits of code from hundreds of projects to make a new one, but keeping track of the licenses makes it too much of a pain. So the closest we really see is people vendorizing libraries.

[+] beecafe|3 years ago|reply
You can just find the best-matching snippets in the training data using an embedding model.
[+] visarga|3 years ago|reply
They probably mean using a code search engine to check all snippets. The simplest thing would be an n-gram filter. A more advanced approach would use a code-similarity neural net. It's not principled attribution, just locating the most similar example in the training set.
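A minimal sketch of such an n-gram filter, assuming whitespace tokenization; the corpus contents, repo names, and choice of n are invented for illustration:

```python
# Hypothetical sketch: index every n-gram of the training corpus, then
# flag any generated output that shares an n-gram with a known source.
from collections import defaultdict

def build_ngram_index(corpus, n=5):
    """corpus: dict of source_id -> token list. Maps each n-gram to the
    set of sources containing it."""
    index = defaultdict(set)
    for source_id, tokens in corpus.items():
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(source_id)
    return index

def flag_overlaps(output_tokens, index, n=5):
    """Return the set of training sources sharing any n-gram with the output."""
    hits = set()
    for i in range(len(output_tokens) - n + 1):
        hits |= index.get(tuple(output_tokens[i:i + n]), set())
    return hits

# Invented toy corpus and generated output.
corpus = {"repoA": "for i in range ( 10 ) : print ( i )".split(),
          "repoB": "while True : pass".split()}
index = build_ngram_index(corpus)
out = "x = 1 ; for i in range ( 10 ) : print ( i )".split()
print(flag_overlaps(out, index))
```

As the comment notes, this locates similar training examples rather than giving principled attribution; it also misses near-verbatim copies with renamed identifiers, which is where the similarity-net approach would come in.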
[+] godelski|3 years ago|reply
So there are a few ways this can actually be done, with different levels of accuracy/precision. But it is going to be complicated no matter what.

The easy thing to do is to compare outputs to inputs. This isn't technically hard (e.g. cosine similarity) but it is computationally expensive (cosine similarity of the output against the entire training set). This would give us some weightings that show similarity. But this doesn't really tell us attribution, rather a correlation. These are statistical models, so there's reason to believe that this is okay.
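A toy sketch of that compare-outputs-to-inputs idea, using bag-of-words vectors and cosine similarity; a real system would use a learned code embedding, and the snippets and repo names here are invented:

```python
# Hypothetical sketch: embed each training snippet and the generated
# output as token-count vectors, then rank sources by cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vec(snippet):
    """Naive whitespace bag-of-words embedding."""
    return Counter(snippet.split())

# Invented toy training set and model output.
training = {"repoA": "def add(a, b): return a + b",
            "repoB": "class Tree: pass"}
output = "def add(x, y): return x + y"

scores = {src: cosine(vec(code), vec(output)) for src, code in training.items()}
best = max(scores, key=scores.get)
print(best, scores[best])
```

As the comment says, a high score here is correlation, not proof of copying: the most similar training snippet may simply be the most idiomatic way to write the same function.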

Then there's inversion. This is processing data backwards through the model. This has different complexities depending on what type of model you're working with. GANs aren't great, diffusion models are okay, normalizing flows are trivial (but good luck generating good images from NFs). If we can invert the model it is much easier to investigate and probe for contribution by looking at the distance of the latent generative variable to the location of the latent trained data. Basically you're looking at how different information is contributing to the overall output. This can also be done at every level in the network. Obviously this gets technically challenging as well as computationally expensive.

Another method would be using dataset reconstruction (this is outside my wheelhouse fwiw). This is where you try to recover the dataset from the final trained network. This too is complicated but there's plenty of papers showing progress in this space (lots of interest from privacy groups).

(TLDR-ish) There's other methods too. But basically what is being said is that there are ways to denote what and how much the training data contributed to the output of the model wherein we can then measure how similar the output is to the inputs (i.e. copying).

[+] nonrandomstring|3 years ago|reply
The answer is of course it's possible, so long as you have a couple of gigabytes spare for the acknowledgements page. Any construct worth attributing will have roots in billions of parameters.
[+] skibidibipiti|3 years ago|reply
Theoretically possible, technically challenging. The training data would need to be annotated with traces, so a person would have to figure out where all of the answers came from.
[+] ajsnigrutin|3 years ago|reply
Maybe like a science paper... 10 lines of code, 50 citations :)
[+] jacquesm|3 years ago|reply
To me AI code generators are the equivalent of crypto tumblers or mixers for digital coins. You can pretend all you want that the output is 'clean' but we all know it came from somewhere else and wasn't actually generated by the software, just endless little snippets that other people made.
[+] godelski|3 years ago|reply
This is definitely not true. In theory a generator could reproduce a distribution of data that includes more than the data it was able to sample from. For example, you could create the distribution of all possible faces from a subset of sample faces. That means you could create the faces of people who do not yet exist. Now good luck getting that perfect generator (black swans, bias, etc), but it would be naive to believe that there can be no generator that does more than copy input data and mash it together. At least if you're going to believe that that isn't also happening in the real world, but then what's the difference?
[+] remus|3 years ago|reply
I don't think that's entirely true. When I've tested Copilot, it also knows something about the context in which it's working, so it can use relevant variable names in its suggestions. To me that's a step beyond rote regurgitation of other code.
[+] system2|3 years ago|reply
ChatGPT output is very clean and precise, especially when variables are provided in detail. I doubt you can trace it back if you make the prompt very elaborate.
[+] scotty79|3 years ago|reply
[+] falcolas|3 years ago|reply
I propose that information naturally wants to degrade. Paper decomposes. Bits flip. File formats are replaced and lost. Storage mediums degrade. It's all an extension of the universe trending towards entropy.

It actually takes quite a bit of effort to store, then distribute information precisely and broadly. There's a lot of infrastructure, effort, and money involved, and still information degrades and disappears over time.

Any libre information exists because people have put effort into it. Sometimes a lot of effort.

[+] somrand0|3 years ago|reply
considering that information is something static (which doesn't change) that describes (is about) how something else is changing

I think that indeed, it is the inherent nature of information to radiate itself, i.e. to share (to shine, to spread)

[+] belorn|3 years ago|reply
When Copilot was released the copyright discussion focused exclusively on code, but now we see very similar discussions around images with Stable Diffusion and the sister project Unstable Diffusion. When Copilot does reach the courts there will be some indication of how courts view author consent when it comes to training material. After that we might see court cases for each form of media (images, video, text, sound), each in its own unique contexts (books, online videos, porn, stock databases, and so on).
[+] jacquesm|3 years ago|reply
In my opinion using all of the code on GitHub without respecting the licenses was a capital mistake. It should have been opt-in, maybe with some incentive but to just take it all without so much as a by-your-leave is not going to play well in court.
[+] rboyd|3 years ago|reply
I think it’s silly to pretend that human programmers are emitting a lot of code with a high degree of originality either. We’re all remixing some long-forgotten influential code lying deep in latent memory, just like the models.
[+] marstall|3 years ago|reply
by reducing credit, copilot reduces incentive to create and publish free code. biting the hand that feeds it.

exact same problem exists with GPT3 and others.

big tech slashing and burning, ruthlessly exploiting the least empowered people in the tech economy.

neat hack.

[+] mellosouls|3 years ago|reply
I would have thought that in the vast majority of current AI-generated code we are talking about single blocks and functions, just Intellisense on steroids, that only a rather self-deluding coder would consider original enough, or "theirs", to attribute authorship to.

There are no doubt grey areas and more serious cases as the technology improves and the generated content increases in length and functional value, but I hope we don't throw the productive baby out with the Luddite bathwater...

[+] joshspankit|3 years ago|reply
I’m horrified.

I know we’re “just” talking about code here but decisions will be far-reaching and if we let the powers that be force AI-generated content to be copyright-attributed, two things are going to happen:

1. The biggest benefits of AI are going to be pushed back for decades and possibly indefinitely

2. Artists will be absolutely slaughtered as the same rules will come at them full-force. Almost every artist draws inspiration from other creative work. That’s how creativity works. Can you imagine how stifling it would be for every artist to have to document and “pay for” every piece of art they’ve ever seen??

[+] walnutclosefarm|3 years ago|reply
It seems to me there are some really fundamental questions about copyright and use posed here that have been submerged in the background of the internet rush to digitize all forms of content and expression. Even though these generative tools may simplify the path to verbatim code (or image, or text) copying without attribution, and complicate determining what, if any, attributions are required, they didn't invent it. A browser coupled with a search engine facilitates such copying all the time. Like the generative models, the search engine is a tool that has read, for its own purposes, vast swaths of content, remembered key details about it, and serves as a means to render the content up to a user. It's not the copying per se that causes trouble: creators put their material on the internet with the expectation and hope that it will be copied, by all those tools, into search engines and ultimately onto screens where it can be seen. It's that the chain of attribution is lost (or with open source code, the license imprint, which is really just another form of attribution), and with the attribution, any hope that the creator will benefit from the copying.

But actually carrying attribution forward is going to be hard. Things that come out of a generative model that are substantially identical to a particular input do so for one of two reasons: one particular training input to the model is overwhelmingly the "best" source of response to the prompt, or a training input has been repeated over and over again in the training set so as to become the consensus response to one or more prompts. The first might be reasonably feasible to track down, although it's bound to be computationally expensive. The second ... really tough, since it forces the model to "know" which of those many training sources is the one that should be attributed. The widely circulated example of an image generation model reproducing Steve McCurry's famous photo of Sharbat Gula, a green-eyed Afghan girl that appeared on National Geographic's cover in 1985, shows the problem. Do a Google search for "green-eyed Afghan girl" and you'll find hundreds of copies of varying resolution and definition, and hundreds more derivative versions, of McCurry's photo. A model spitting out yet another derived, but nearly identical, version is likely drawing from hundreds of those images itself, not some original root, copyrighted golden copy. Which should it attribute?

[+] keewee7|3 years ago|reply
Why have so many of these AI models been trained on GPL-licensed code? Almost half the controversy could have been avoided by ignoring GPL code.

I know that even code under non-copyleft licenses like Apache and MIT is copyrighted and requires attribution; however, it would have caused far less controversy than training on GPL-licensed code.

[+] randombits0|3 years ago|reply
Code cannot be owned. A creative expression may be copyrighted. Purely functional expressions may not be copyrighted. The output of a trained AI is insufficiently creative to be copyrighted. Only humans can hold a copyright.

Now with all that, there really isn’t anything here to get worked up over.

[+] hutzlibu|3 years ago|reply
"Purely functional expressions may not be copyrighted."

Is this more than an opinion?

Because you can have whole programs written as one long functional expression (not that I am a fan of such a coding style, but it exists).

[+] visarga|3 years ago|reply
> The output of a trained AI is insufficiently creative to be copyrighted. Only humans can hold a copyright.

That's an overgeneralisation. A language model alone, yes, is just derivative. But a language model trained on solving problems with reinforcement learning can surpass humans. For example, AlphaGo and AlphaTensor are models that learned from running simulations.

[+] bitwrangler|3 years ago|reply
I would think copyright law would have something to say about what is or is not a derivative work. How similar do code snippets have to be (or claim to be) to be considered infringement on the original source?
[+] padolsey|3 years ago|reply
Outside of AI models, copy/pasting snippets from the likes of StackOverflow is already on unsteady ground. The threshold to bother with (and win) legal fights is pretty high. AI is catalyzing some kind of slow revolution in what "ownership" is, but there doesn't seem to be any definition that would _always_ satisfy common sense. Even if GitHub yields and adds attribution or filters on license types, there's still a massive trove of knotted data in all these newer text/language/image/code models. Most of it is not truly public domain, and a lot of it is wholly private and would involve contravention of regulations like the GDPR if used. Data used to feed these huge models is sometimes, technically speaking, stolen, commandeered, swiped from unassuming creators. Many artists the world over are rightfully furious about DALL-E and Stable Diffusion. I wonder where this will all end!

AI needs to be regulated somehow. Precedent needs to be set.

[+] fariszr|3 years ago|reply
Regardless of what side you stand on, this case is sorely needed, so everybody knows what the legalities are.
[+] j16sdiz|3 years ago|reply
This article adds no new information. We knew it is hotly debated, and we have heard both sides.
[+] andybak|3 years ago|reply
Just possibly there are people out there who haven't heard both sides and perhaps they are the intended audience for this?

And maybe some of them even read HN occasionally and are currently enjoying the article.

[+] somrand0|3 years ago|reply
ownership is a concept in dire need of revision
[+] bodyfour|3 years ago|reply
Once we have AI-generated laws I'm sure this will be sorted out.
[+] marc_io|3 years ago|reply
Yes, as long as it doesn't disregard the original creator's intentions in the formulation.
[+] claytongulick|3 years ago|reply
Agreed.

You don't mind if I borrow your car do you?

And your house?

Let's not quibble about the particulars of me ever giving them back.

[+] crawfordcomeaux|3 years ago|reply
Yes, like scientific grounding or abandonment.

If only we held all myths to such high esteem as we societally hold ownership.

[+] cyber_kinetist|3 years ago|reply
The only way to truly own something, is to either share it or destroy it.
[+] wahnfrieden|3 years ago|reply
or abolition

but to do that without any reasonable social support nets is absurd

[+] riedel|3 years ago|reply

[deleted]

[+] breck|3 years ago|reply
Ideas cannot be owned. People can be owned, if you have slavery.

Everyone must unlearn the term "Intellectual Property". These laws are anti-property rights. They are Intellectual Slavery laws (https://breckyunits.com/an-unpopular-phrase.html).

The United States government employs more knowledge workers than all other companies (see NIH, DoD, CDC, NASA, NOAA, NWS, et cetera). Everything they produce is public domain, by law. And yet, the people producing these information products still get paid!

We don't need (c)opywrong laws. We don't need Intellectual Slavery laws. We still have cotton even after the 13th Amendment (we actually have more and better cotton now), and we will still have creative works after the passing of the Intellectual Freedom Amendment (we actually will have more and better creative works) - https://breckyunits.com/the-intellectual-freedom-amendment.h....

[+] claytongulick|3 years ago|reply
People have a strange tendency to enjoy being rewarded for work they've done.

I suspect that the folks who invest hundreds of millions of dollars into production costs for a movie, rather enjoy the ability to recoup those costs by restricting access to only those who are willing to pay for the privilege.