top | item 36525642


jedbrown | 2 years ago

Even MIT licensed code requires you to preserve the copyright and permission notice.

If a human did what these language models are doing (output derivative works with the copyright and license stripped), it would be a license violation. When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec. LM developers could have similar practices, with separately-trained components that create an auditable intermediate representation and independently create new code based on that representation. The tech isn't up to that task and the LM authors think they're going to get away with laundering what would be plagiarism if a human did it.


visarga|2 years ago

Why can't an AI do the same: copyrighted code -> spec -> generated code?

... and then execute copyrighted code -> trace resulting values -> tests for new code.
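The execute-and-trace step could be sketched roughly like this; `reference_impl` is a made-up stand-in for the original code, not anything from a real codebase:

```python
# Hedged sketch of "execute -> trace -> tests": run a reference
# implementation on sample inputs, record its outputs, and turn the
# recorded pairs into assertions any reimplementation must satisfy.
def reference_impl(x):
    # stand-in for the original (copyrighted) code
    return x * x + 1

def make_test_cases(fn, inputs):
    """Trace the reference implementation to pin down its behavior."""
    return [(x, fn(x)) for x in inputs]

def passes(new_impl, cases):
    """Check a candidate reimplementation against the traced behavior."""
    return all(new_impl(x) == expected for x, expected in cases)

cases = make_test_cases(reference_impl, [0, 1, 2, 3])
```

The new code never sees the reference source, only the traced input/output pairs.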

AI could do a clean-room reimplementation of any code to beef up the training set. It could also ensure the new code differs from the old code at the n-gram level, so even by chance the two should not look the same.
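A minimal sketch of the n-gram-level difference check, assuming naive whitespace tokenization (a real check would use a language-aware lexer):

```python
# Compare two code snippets by the fraction of token n-grams they share.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(code_a, code_b, n=3):
    """Fraction of code_a's token n-grams that also appear in code_b."""
    a, b = ngrams(code_a.split(), n), ngrams(code_b.split(), n)
    return len(a & b) / len(a) if a else 0.0

old = "for i in range ( 10 ) : total += i"
new = "acc = sum ( range ( 10 ) )"
overlap = ngram_overlap(old, new)  # low overlap despite equal behavior
```

A threshold on `overlap` would flag generated code that is too close to its source, though low n-gram overlap alone does not prove non-infringement.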

Would that hold up in court? Is it copyright laundering?

jedbrown|2 years ago

Language models don't understand anything; they just manipulate tokens. It is a much harder task to write a spec (one that humans and courts can review, if needed, to determine it is not infringing) and then, with a separately trained tool, implement that spec. The tech just isn't ready, and it's not clear that language models will ever get there.

What language models could easily do is obfuscate better, so the license violation is harder to prove. That's behavior laundering: no amount of human obfuscation (e.g., synonym substitution, renaming variables, swapping out control structures) can turn a plagiarized work into one that isn't. If we (via regulators and courts) let the Altmans of the world pull this stunt, they will end up with a government-protected monopoly on plagiarism laundering.

koolba|2 years ago

Isn’t the language model itself the spec?

Potentially for all of the inputs at once.

Maxion|2 years ago

> When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec.

Maybe at a FAANG or some other MegaCorp, but most companies barely have a single dev team at all, and those that are larger barely have one per project.

jameshart|2 years ago

There’s a clear separation between the training process which looks at code and outputs nothing but weights, and the generation process which takes in weights and prompts and produces code.

The weights are an intermediate representation that contains nothing resembling the original code.

zacmps|2 years ago

But the original content is frequently recoverable.

You can't just take copyrighted code, Base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation.
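The Base64 point is easy to make concrete: the transformed representation looks nothing like the source, yet the work survives intact. The snippet string here is a made-up placeholder, not real copyrighted code.

```python
# Base64 is a reversible transformation: the encoded form resembles
# nothing in the original, but decoding recovers it byte for byte.
import base64

original = b"int add(int a, int b) { return a + b; }"
encoded = base64.b64encode(original)   # looks nothing like the source
decoded = base64.b64decode(encoded)    # ...yet recovers it exactly

assert decoded == original
```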

From my (admittedly vague) understanding, copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could conclude that the lineage doesn't pass through the model.

IANAL

vkou|2 years ago

The neurons in my brain when I plagiarize are just arrangements of atoms that contain nothing resembling the original code/text passages/etc.

snickerbockers|2 years ago

> The weights are an intermediate representation that contains nothing resembling the original code.

So is an ELF binary.

__loam|2 years ago

I think this view is incredibly dangerous to any kind of skills mastery. It has the potential to completely destroy the knowledge economy and eventually degrade AI due to a dearth of training data.

cj|2 years ago

Has anyone been able to create a prompt that GPT4 replies to with copyrighted content (or content extremely similar to the original content)?

I'm curious how easy or difficult it is to get GPT to spit out content (code or text) that could be considered obvious infringement.

Tempted to give it half of some closed-source or restrictively licensed code to see if it auto-completes the other half in a manner that obviously recreates the original work.

jameshart|2 years ago

I can reproduce, when prompted, all the lyrics to Bohemian Rhapsody, but my doing so isn't automatically copyright infringement. Whether reciting them falls outside copyright law, is protected under some copyright exception, is civil infringement, or is criminal infringement depends on where, when, how, in front of what audience, and for what purpose I recite them.

The same applies to GPT. It could reproduce Bohemian Rhapsody lyrics in the course of answering questions and there’s no automatic breach of copyright that’s taking place. It’s okay for GPT to know how a well known song goes.

If copilot ‘knows how some code goes’ and is able to complete it, how is that any different?