jedbrown|2 years ago
If a human did what these language models are doing (output derivative works with the copyright and license stripped), it would be a license violation. When humans want to create a new implementation with clean IP, they have one team study the IP-encumbered code and write a spec, then a different team writes a new implementation according to the spec. LM developers could have similar practices, with separately-trained components that create an auditable intermediate representation and independently create new code based on that representation. The tech isn't up to that task and the LM authors think they're going to get away with laundering what would be plagiarism if a human did it.
visarga|2 years ago
... and then execute the copyrighted code -> trace the resulting values -> derive tests for the new code.
AI could do a clean-room reimplementation of any code to beef up the training set. It could also make sure the new code differs from the old code at the n-gram level, so even by chance it should not look the same.
Would that hold up in court? Is it copyright laundering?
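The "different at the n-gram level" idea above can be sketched mechanically. This is a hypothetical helper, not anything a model actually runs: it measures what fraction of one snippet's token n-grams reappear in another.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap(code_a, code_b, n=5):
    """Fraction of code_a's token n-grams that also appear in code_b.

    Uses a crude whitespace tokenizer; a real check would use a
    language-aware lexer, since renamed identifiers and reflowed
    whitespace would otherwise hide verbatim copying.
    """
    grams_a = ngrams(code_a.split(), n)
    grams_b = ngrams(code_b.split(), n)
    if not grams_a:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a)
```

Of course, as the reply below notes, a low n-gram overlap shows only that the text looks different, not that the lineage is clean.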
jedbrown|2 years ago
What language models could do easily is to obfuscate better so the license violation is harder to prove. That's behavior laundering -- no amount of human obfuscation (e.g., synonym substitution, renaming variables, swapping out control structures) can turn a plagiarized work into one that isn't. If we (via regulators and courts) let the Altmans of the world pull their stunt, they're going to end up with a government-protected monopoly on plagiarism-laundering.
koolba|2 years ago
Potentially for all of the inputs at once.
Maxion|2 years ago
Maybe at a FAANG or some other MegaCorp, but most companies barely have a single dev team at all, and the larger ones barely have one per project.
jameshart|2 years ago
The weights are an intermediate representation that contains nothing resembling the original code.
zacmps|2 years ago
You can't just take copyrighted code, base64-encode it, send it to someone, have them decode it, and claim there was no copyright violation.
From my (admittedly vague) understanding, copyright law cares about the lineage of data, and I don't see how any reasonable interpretation could consider that the lineage doesn't pass through models.
IANAL
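The base64 point above is easy to make concrete: the encoding is a lossless round trip, so the decoded output is byte-for-byte the original work, however unrecognizable the intermediate form looks. A minimal sketch (the snippet being encoded is made up for illustration):

```python
import base64

# A stand-in for some copyrighted source file (hypothetical content).
original = b"/* Copyright (c) Example Corp. */\nint add(int a, int b) { return a + b; }\n"

encoded = base64.b64encode(original)   # unreadable on the wire...
decoded = base64.b64decode(encoded)    # ...but a lossless round trip

assert decoded == original             # the intermediate form changed nothing
```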
vkou|2 years ago
snickerbockers|2 years ago
So is the ELF.
__loam|2 years ago
cj|2 years ago
I'm curious how easy or difficult it is to get GPT to spit out content (code or text) that could be considered obvious infringement.
Tempted to give it half of some closed-source or restrictively licensed code to see if it auto-completes the other half in a manner that obviously recreates the original work.
littlestymaar|2 years ago
Edit: it wasn't ChatGPT but Copilot; see https://twitter.com/mitsuhiko/status/1410886329924194309
jameshart|2 years ago
The same applies to GPT. It could reproduce Bohemian Rhapsody lyrics in the course of answering questions and there’s no automatic breach of copyright that’s taking place. It’s okay for GPT to know how a well known song goes.
If copilot ‘knows how some code goes’ and is able to complete it, how is that any different?