
Relicensing with AI-Assisted Rewrite

401 points | tuananh | 22 days ago | tuananh.net | reply

394 comments

[+] calny|22 days ago|reply
The maintainer's response: https://github.com/chardet/chardet/issues/327#issuecomment-4...

The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." Problem: Claude was almost certainly trained on the original LGPL/GPL code. It knows that code as the way to solve this problem. It's dubious whether Claude can ignore whatever imprints that original code made on its weights. If it COULD do that, it would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace what data influenced the output for a query (see https://iftenney.github.io/projects/tda/), or even fully unlearn a piece of training data.

Is anyone working on this? I'd be very interested to discuss.

Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. Been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI for litigation platform. These are evolving issues in US courts.

BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]

[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")

[+] DrammBA|22 days ago|reply
Also, the maintainer's ground-up rewrite argument is very flimsy when they used chardet's test data and freely admit to:

> I've been the primary maintainer and contributor to this project for >12 years

> I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

> I reviewed, tested, and iterated on every piece of the result using Claude.

> I was deeply involved in designing, reviewing, and iterating on every aspect of it.

[+] Lerc|22 days ago|reply
There was a paper that proposed a content-based hashing mask for training.

The idea is you have some window size, maybe 32 tokens. Hash it into a seed for a pseudo-random number generator. Generate random numbers in the range 0..1 for each token in the window. Compare this number against a threshold. Don't count the loss for any tokens with an RNG value higher than the threshold.

It learns well enough because you get the gist of reading the meaning of something when the occasional word is missing, especially if you are learning the same thing expressed many ways.

It can't learn verbatim, however. Anything it fills in will be semantically similar, but different enough to push any direct quoting onto another path after just a few words.
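The masking scheme described above can be sketched in a few lines of Python (the window size, threshold, and choice of hash are illustrative assumptions, not the paper's exact parameters):

```python
import hashlib
import random

def loss_mask(tokens, window=32, threshold=0.8):
    """Return a keep/drop mask over tokens for the training loss.

    Each window of tokens is hashed into a PRNG seed; every token in the
    window then gets a deterministic pseudo-random value in [0, 1), and
    tokens whose value exceeds the threshold are excluded from the loss.
    Because the mask depends only on the content, any verbatim repetition
    of the same window is masked identically wherever it appears.
    """
    mask = [True] * len(tokens)
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        # Content-based seed: same window text -> same seed -> same mask.
        seed = int.from_bytes(
            hashlib.sha256(" ".join(map(str, chunk)).encode()).digest()[:8],
            "big",
        )
        rng = random.Random(seed)
        for i in range(len(chunk)):
            if rng.random() > threshold:
                mask[start + i] = False  # drop this token from the loss
    return mask
```

Because the mask is a pure function of content, every verbatim occurrence of the same window is masked the same way across the whole corpus, so the model never receives a gradient signal for those particular tokens no matter how often the passage recurs.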

[+] oofbey|22 days ago|reply
The difference in indemnification based on which plan you’re on is super important. Thanks for pointing that out - never would have thought to look.
[+] popalchemist|22 days ago|reply
Copyright does not cover ideas. Only specific executions of ideas. So unless it's a line-by-line copy (unlikely) there is no recourse for someone to sue for a re-execution/reimplementation of an idea.
[+] aeon_ai|22 days ago|reply
You've likely paid attention to the litigation here. Regardless of what remains to be litigated, the training in and of itself has already been deemed fair use (and transformative) by Alsup.

Further, you know that ideas are not protected by copyright. The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

If it were the case that the LLM ingested the code and regurgitated it (as would be the premise of highlighting the training data provenance), that similarity would be much higher. That is not the case.

[+] danlitt|22 days ago|reply
I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers, and result in someone else taking over the project. In reality, probably it will remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.

[+] femto|22 days ago|reply
> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

That's not what the law says [1]. If two people happen to independently create the same thing they each have their own copyright.

If it's highly improbable that two works are independent (eg. the gcc code base), the first author would probably go to court claiming copying, but their case would still fail if the second author could show that their work was independent, no matter how improbable.

[1] https://lawhandbook.sa.gov.au/ch11s13.php?lscsa_prod%5Bpage%...

[+] brians|22 days ago|reply
I do not agree with your interpretation of copyright law. It does ban copies: there has to be information flow from the original to the copy for it to be a "copy." Spontaneous generation of the same content is often taken by the courts to be a sign that it's purely functional, derived from requirements by mathematical laws.

Patent law is different and doesn't rely on information flow in the same way.

[+] petercooper|22 days ago|reply
> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation

I know you were simplifying, and not to take away from your well-made broader point, but an API-derived implementation can still result in problems, as in Google vs Oracle [1]. The Supreme Court found in favor of Google (6-2) along "fair use" lines, but the case dodged setting any precedent on the nature of API copyrightability. I'm unaware if future cases have set any precedent yet, but it just came to mind.

[1]: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

[+] zabzonk|22 days ago|reply
> It does not mean "as long as you never read the original code, whatever you write is yours"

I think there is precedent that says exactly this - for example the BIOS rewrites for the IBM PC from people like Phoenix. And it would be trivial to instruct an LLM to prefer to use (say, in assembler) register C over register B wherever that was possible, resulting in different code.

[+] wareya|22 days ago|reply
> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright. You're going to get sued and lose, but platonically, you're in the clear. If it's merely somewhat similar, then you're probably in the clear in practice too: it gets very easy very fast to argue that the similarities are structural consequences of the uncopyrightable parts of the functionality.

> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly).

This is almost the opposite of correct. A clean room implementation's dirty phase produces a specification that is allowed to include uncopyrightable implementation details. It is NOT defined as producing an API, and if you produce an API spec that matches the original too closely, you might have just dirtied your process by including copyrightable parts of the shape of the API in the spec. Google vs Oracle made this more annoying than it used to be.

> Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

If you follow CRRE (clean-room reverse engineering), it's not a copy, full stop, even if it's somehow 1:1 identical. It's going to be JUDGED as a copy, because substantial similarity for nontrivial amounts of code means you almost certainly stepped outside of the clean-room process and it no longer functions as a defense, but if you did follow CRRE, then it's platonically not a copy.

> What the chardet maintainers have done here is legally very irresponsible.

I agree with this, but it's probably not as dramatic as you think it is. There was an issue with a free Japanese font/typeface a decade or two ago that was accused of mechanically (rather than manually) copying the outlines of a commercial Japanese font. Typeface outlines aren't copyrightable in the US or Japan, but they are in some parts of Europe, and the exact structure of a given font is copyrightable everywhere (e.g. the vector data or bitmap field for a digital typeface, as opposed to the idea of its shape). What was the outcome of this problem? Distros stopped shipping the font and replaced it with something vaguely compatible. Was the font actually infringing? Probably not, but better safe than sorry.

[+] thousand_nights|22 days ago|reply
the whole concept of a "clean room" implementation sounds completely absurd.

a bunch of people get together, rewrite something while making a pinky promise not to look at the original source code

guaranteeing the premise is basically impossible, it sounds like some legal jester dance done to entertain the already absurd existing copyright laws

[+] jen20|22 days ago|reply
> What the chardet maintainers have done here is legally very irresponsible.

Perhaps the maintainer wants to force the issue?

> Any downstream user of the library is at risk of the license switching from underneath them.

Checking the license of the transitive closure of your dependencies is table stakes for using them.
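As a first pass, the declared licenses of installed Python dependencies can be read straight from package metadata with the stdlib `importlib.metadata`. A sketch (note it only reports what packages *claim*, which is exactly what a quiet relicensing undermines):

```python
from importlib import metadata

def declared_licenses():
    """Map each installed distribution to its declared license string.

    Reads the License field and trove classifiers from package metadata.
    This reports only what the package claims; it cannot detect code
    whose true license differs from its declaration.
    """
    report = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        lic = meta.get("License") or ""
        classifiers = [c for c in (meta.get_all("Classifier") or [])
                       if c.startswith("License ::")]
        name = meta.get("Name", "unknown")
        report[name] = lic or "; ".join(classifiers) or "UNKNOWN"
    return report

def flag(report, disallowed=("GPL", "LGPL", "AGPL")):
    """Return packages whose declared license mentions a disallowed term."""
    return {name: lic for name, lic in report.items()
            if any(term in lic for term in disallowed)}
```

For example, `flag(declared_licenses())` gives a quick copyleft shortlist, though a real audit would walk the full transitive dependency tree, not just what happens to be installed.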

[+] pmarreck|22 days ago|reply
> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)

https://github.com/pmarreck?tab=repositories&type=source

I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec

Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.

With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
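Two-way interoperation tests like those described above follow a standard round-trip pattern. A minimal sketch using Python's stdlib `bz2` as a stand-in for both sides (in a real suite, one side would be the clean-room implementation and the other would shell out to the reference binary):

```python
import bz2

def interop(compress_a, decompress_b, samples):
    """Round-trip check: every sample must survive compress-with-A
    followed by decompress-with-B, byte for byte."""
    return all(decompress_b(compress_a(s)) == s for s in samples)

# Stand-ins: here both sides are stdlib bz2 purely for illustration.
samples = [b"", b"hello", b"abc" * 10_000]
assert interop(bz2.compress, bz2.decompress, samples)  # ours -> theirs
assert interop(bz2.compress, bz2.decompress, samples)  # theirs -> ours
```

Running the check in both directions is what makes it a compatibility proof: archives produced by the reimplementation must open in the original tool, and vice versa.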

[+] j45|22 days ago|reply
This reminds me of a full rewrite.

When a developer reimplements a complete new version of code from scratch, working from an understanding alone, the new implementation should generally be an improvement on the original source code, not merely its equal.

In today’s world, letting LLMs replicate anything will produce merely average code as “good enough” and generally create equivalent or even more bloat unless well managed.

[+] dathinab|22 days ago|reply
the author speaks about code which is syntactically completely different but semantically does the same thing

i.e. a re-implementation

which can either

- still be derived work, i.e. seen as merely obfuscating a copyright violation

- be a new work doing the same

nothing prevents an AI from producing a spec based on an API, API documentation, and API usage/fuzzing, and then resetting the AI and using that spec to produce a rewrite

I mean "doing the same thing" is NOT covered by copyright protection; you need patent law for that. And even with patent law, you protect innovations/concepts, not the exact implementation details. Which means that even where software patents exist (theoretically (1)), most things done in software wouldn't be patentable (as they are just implementation details, not inventions).

(1): I say theoretically because there is a very long track record of patents being granted which really should never have been granted. This, combined with the high cost of invalidating patents, has caused a great deal of economic damage.

[+] nairboon|22 days ago|reply
That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.

Also, the mentioned SCOTUS decision is concerned with authorship of generative-AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright through this transformation? Imagine the consequences for the US copyright industry if that were actually possible.

[+] pocksuppet|22 days ago|reply
In the legal system there's no such thing as "code that is LGPL". It's not an xattr attached to the code.

There is an act of copying, and there is whether or not that copying was permitted under copyright law. If the author of the code said you can copy, then you can. If the original author didn't, but the author of a derivative work, who wasn't allowed to create a derivative work, told you you could copy it, then it's complicated.

And none of it's enforced except in lawsuits. If your work was copied without permission, you have to sue the person who did that, or else nothing happens to them.

[+] pavlov|22 days ago|reply
If anything, the SCOTUS decision would seem to imply that generative AI transformations produce no additional creative contribution and therefore the original copyright holder has all rights to any derived AI works.

(IANAL)

[+] dathinab|22 days ago|reply
If it went through a full clean-room rewrite just using AI, then no, it's de facto public domain (but it probably didn't go through one).

If it is a completely new implementation with completely different internals, then it could also still be not-LGPL even if produced by a person with in-depth knowledge. Copyright only cares whether you "copied" something, not whether you had "knowledge" or whether it "behaves the same". So as long as it's distinct enough, it can still be legally fine. The "full clean room" requirement is about "what is guaranteed to hold up in front of a court", not "what might pass as non-derivative but with legal risk".

[+] pornel|22 days ago|reply
Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Even copyright laws with provisions for machine learning were written when that meant tangential things like ranking algorithms or training of task-specific models that couldn't directly compete with all of their source material.

For code it also completely changes where the human-provided value is. Copyright protects specific expressions of an idea, but we can auto-generate the expressions now (and the LLM indirection messes up what "derived work" means). Protecting the ideas that guided the generation process is a much harder problem (we have patents for that and it's a mess).

It's also a strategic problem for GNU. GNU's goal isn't licensing per se, but giving users freedom to control their software. Licensing was just a clever tool that repurposed the copyright law to make the freedoms GNU wanted somewhat legally enforceable. When it's so easy to launder code's license now, it stops being an effective tool.

GNU's licensing strategy also depended on a scarcity of code (contribute to GCC, because writing a whole compiler from scratch is too hard). That hasn't worked well for a while due to permissive OSS already reducing scarcity, but gen AI is the final nail in the coffin.

[+] pocksuppet|22 days ago|reply
It's not a problem. If you give a work to an AI and say "rewrite this", you created a derivative work. If you don't give a work to an AI and say "write a program that does (whatever the original code does)" then you didn't. During discovery the original author will get to see the rewriter's Claude logs and see which one it is. If the rewriter deleted their Claude logs during the lawsuit they go to jail. If the rewriter deleted their Claude logs before the lawsuit the court interprets which is more likely based on the evidence.
[+] empath75|22 days ago|reply
> Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Copyright laws are predicated on the idea that valuable content is expensive and time consuming to create.

Ideas are not protected by copyright, expression of ideas is.

You can't legally copy a creative work, but you can describe the idea of the work to an AI and get a new expression of it in a fraction of the time it took for the original creator to express their idea.

The whole premise of copyright is that ideas aren't the hard part, the work of bringing that idea to fruition is, but that may no longer be true!

[+] leecommamichael|22 days ago|reply
“Changing the equation” by boldly breaking the law.
[+] ajross|22 days ago|reply
> GNU's goal isn't licensing per se, but giving users freedom to control their software.

I think that's maybe misunderstanding. GNU wants everyone to be able to use their computers for the purposes they want, and software is the focus because software was the bottleneck. A world where software is free to create by anyone is a GNU utopia, not a problem.

Obviously the bigger problem for GNU isn't software, which was pretty nicely commoditized already by the FOSS-ate-the-world era of two decades ago; it's restricted hardware, something that AI doesn't (yet?) speak to.

[+] satvikpendem|22 days ago|reply
Honestly, good. Copyright and IP law in general have been so twisted by corporations that only they benefit now, see Mickey Mouse laws by Disney for example, or patenting obvious things like Nintendo or even just patent trolling in general.
[+] kshri24|22 days ago|reply
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

How would that work? We still have no legal conclusion on whether code generated by AI models trained on all publicly available source (irrespective of license) is legal or not. IANAL, but IMHO it is totally illegal, as no permission was sought from the authors of the source code the models were trained on. So there is no way to just release code created by a machine into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered within the scope of "reverse engineering", and that is not specific only to humans; you can extend it to machines as well.

EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model with original authors (all Github users who contributed code in some form) should be setup to be reimbursed by AI companies. In other words, a % of profits must flow back to community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. Also should extend to artists whose art was used for training.

[+] samrus|22 days ago|reply
> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, and so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside of the original code's license. What's the paradox here?

[+] mfabbri77|22 days ago|reply
This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.
[+] jerf|22 days ago|reply
"Accepting AI-rewriting as relicensing could spell the end of Copyleft"

True, but too weak. It ends copyright entirely. If I can do this to a code base, I can do it to a movie, to an album, to a novel, to anything.

As such, we can rest assured that, for better or for worse, this is going to be resolved in favor of AI rewriting not being enough to strip the copyright off of something, and the chardet/chardet project would be well advised not to try to stand in front of the copyright legal behemoth and defeat it in single combat.

[+] Vegenoid|21 days ago|reply
No, because the function of code is distinct from the implementation of the code. With software, something that is functionally identical can be created with a different underlying implementation. This is not the case with media.
[+] emsign|22 days ago|reply
By design you can't know if the LLM doing the rewrite was exposed to the original code base. Unless the AI company is disclosing their training material, which they won't because they don't want to admit breaking the law.
[+] shevy-java|22 days ago|reply
> By design you can't know if the LLM doing the rewrite was exposed to the original code base.

I agree, in theory. In practice, courts will request that the decision-making process be made public. The "we don't know" excuse won't hold; real people have to tell the truth in court, and LLMs may not lie to the court or use the Chewbacca defence either.

Also, I am pretty certain you CAN have AI models that explain how they arrived at their decisions. And they can generate valid code too, so anything can be autogenerated here - in theory.

[+] AyanamiKaine|22 days ago|reply
The worst problem is that an LLM could not only copy the exact code it was trained on but possibly even its comments!

It's one thing to argue over whether the code is a one-to-one copy, but when even the comments are the same, isn't it quite clear it's a copy?
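Whether comments survived verbatim is easy to check mechanically. A sketch comparing the comment text of two source files (the `#` comment syntax and the 0.9 similarity threshold are assumptions for illustration):

```python
import re
from difflib import SequenceMatcher

def comments(source: str):
    """Extract '#'-style comment text from source code, stripped."""
    return [m.group(1).strip() for m in re.finditer(r"#\s*(.+)", source)]

def verbatim_overlap(original: str, rewrite: str, threshold=0.9):
    """Return (rewrite_comment, original_comment) pairs that are nearly
    identical -- strong evidence of copying rather than reimplementation."""
    hits = []
    for c in comments(rewrite):
        for o in comments(original):
            if SequenceMatcher(None, c, o).ratio() >= threshold:
                hits.append((c, o))
                break
    return hits
```

A real audit would use a token-level similarity tool across whole files, but even this crude comment check catches the scenario the comment above describes.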

[+] stuaxo|22 days ago|reply
I don't see how (with current LLMs that have been trained on mixed licensed data) you can use the LLM to rewrite to a less restrictive license.

You could probably use it to output code that is GPL'd though.

[+] listeria|21 days ago|reply
Apparently there was some talk a few years ago about adding the project to the python standard library[1] and the maintainer seems really[2] interested[3] in that.

But I don't think the standard library maintainers would want to incorporate it considering the way in which the relicensing took place, the controversy, and the implications on software licences in general. So his motivations for the license change seem moot. I certainly wouldn't touch it with a 10-foot pole.

[1]: https://github.com/chardet/chardet/issues/36#issuecomment-76... [2]: https://github.com/chardet/chardet/issues/327#issuecomment-4... [3]: https://github.com/chardet/chardet/issues/327#issuecomment-4...

[+] Retr0id|22 days ago|reply
> In traditional software law, a “clean room” rewrite requires two teams

Is the "clean room" process meaningfully backed by legal precedent?

[+] WhiteDawn|22 days ago|reply
I really dislike the precedent this sets.

A silver lining, if this maintainer ends up being in the right, is that any proprietary software can easily be reverse engineered and stripped of its licensing by any hobbyist with enough free time and Claude tokens.

Personally, I'd welcome a post-copyright software era

[+] gspr|22 days ago|reply
> If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.

This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell even proprietary software?

[+] andai|22 days ago|reply
Well how did they rewrite it? If you do it in two phases, then it should be fine right?

Phase 1: extract requirements from original product (ideally not its code).

Phase 2: implement them without referencing the original product or code.

I wrote a simple "clean room" LLM pipeline, but the requirements just ended up being an exact description of the code, which defeated the purpose.

My aim was to reduce bloat, but my system had the opposite effect! Because it replicated all the incidental crap, and then added even more "enterprisey" crap on top of it.

I am not sure if it's possible to solve it with prompting. Maybe telling it to derive the functionality from the code? I haven't tried that, and not sure how well it would work.

I think this requirements phase probably cannot be automated very effectively.
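One way to keep the spec from becoming an exact description of the code is to derive it from observed behavior rather than from the source. A toy sketch of that idea (the probe inputs and placeholder functions are invented for illustration):

```python
def derive_spec(original_fn, probes):
    """Phase 1 ("dirty" side): record input/output pairs by running the
    original as a black box -- behavior only, never its source text."""
    return [(p, original_fn(p)) for p in probes]

def validate(candidate_fn, spec):
    """Phase 2 gate: the fresh implementation, written only from the
    spec, must reproduce every recorded behavior."""
    return all(candidate_fn(p) == expected for p, expected in spec)

# Placeholder "original" whose source we pretend is off-limits.
spec = derive_spec(lambda s: s.strip().lower(), ["  Hello ", "WORLD", ""])

def rewrite(s):
    # Written only from the spec; internals deliberately differ.
    return "".join(ch.lower() for ch in s).strip()

assert validate(rewrite, spec)
```

Because the spec records only behavior, it cannot smuggle in the original's structure or "incidental crap" — though in practice choosing probe inputs that fully pin down the behavior is the hard part.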

[+] christina97|22 days ago|reply
A reminder on this topic that copyright does not protect ideas, inventions, or algorithms. Copyright protects an expression of a creative work. It makes more sense e.g. with books, where of course anyone can read the book and the ideas are “free”, but copying paragraphs must be scrutinized for copyright reasons. It’s always been a bit weird that copyright is the intellectual property concept that protects code.

When you write code, it is the exact sequence of characters, the expression of the code, that is protected. If you copy it and change some lines, of course it’s still protected. Maybe some way of writing an algorithm is protected. But nothing else (under copyright).

[+] xp84|22 days ago|reply
I get the arguments being made here: the second “team,” which is supposed to be in a clean room and never to have read the original source code, does have some essence of that source code in its weights.

However, this is solved if somebody trains a model with only code that does not have restrictive licenses. Then, the maintainers of the package in question here could never claim that the clean room implementation derived from their code because their code is known to not be in the training set.

It would probably be expensive to create this model, but I have to agree that especially if someone does manage this, it’s kind of the end of copyleft.
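Filtering a corpus down to permissively licensed files is mechanically straightforward where licenses are declared via SPDX headers. A sketch (the allow-list is an assumption, and real corpora would need full license detection, not just header matching):

```python
import re

# Assumed allow-list of permissive SPDX identifiers.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0", "ISC"}
SPDX = re.compile(r"SPDX-License-Identifier:\s*([\w.\-]+)")

def is_permissive(source: str) -> bool:
    """Keep a file only if it declares a license on the allow-list.
    Files with no declaration are excluded rather than assumed safe."""
    m = SPDX.search(source)
    return bool(m) and m.group(1) in PERMISSIVE
```

Most real code carries no SPDX header at all, which is why a serious effort would also need repository-level LICENSE detection — and why building such a model would be expensive.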

[+] pu_pe|22 days ago|reply
Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.