jaccola | 18 days ago

I don't see how an LLM company can argue that training its models on other people's content is allowed without that same argument applying equally to distillers who train on an LLM's output.

"The distilled LLM isn't stealing the content from the 'parent' LLM, it is learning from the content just as a human would, surely that can't be illegal!"...

mikehearn|18 days ago

The argument is that converting static text into an LLM is sufficiently transformative to qualify for fair use, while distilling one LLM's output to create another LLM is not. Whether you buy that or not is up to you, but I think that's the fundamental difference.

zozbot234|18 days ago

The whole notion of 'distillation' at a distance is extremely iffy anyway. You're just training on LLM chat logs, but that's nowhere near enough to even loosely copy or replicate the actual model. You need the weights for that.

budududuroiu|18 days ago

> The U.S. Court of Appeals for the D.C. Circuit has affirmed a district court ruling that human authorship is a bedrock requirement to register a copyright, and that an artificial intelligence system cannot be deemed the author of a work for copyright purposes

> The court’s decision in Thaler v. Perlmutter, on March 18, 2025, supports the position adopted by the United States Copyright Office and is the latest chapter in the long-running saga of an attempt by a computer scientist to challenge that fundamental principle.

I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable.

https://www.skadden.com/insights/publications/2025/03/appell...

amenhotep|18 days ago

When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models. When you get tokens from one of these providers, you sort of did.

I think it's a pretty weak distinction. By separating the concerns, with one company that collects a corpus and then "illegally" sells it for training, you can pretty much exactly reproduce the acquire-books-and-train-on-them scenario. But in the simplest case, the EULA does actually make it slightly different.

Like, if a publisher pays an author to write a book, with the contract specifically saying the publisher is not allowed to train on that text, and then the publisher trains on it anyway, that's clearly worse than someone just buying a book and training on it, right?

BeetleB|18 days ago

> When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models.

Nice phrasing, using "pirate".

Violating the TOS of an LLM is the equivalent of pirating a book.

creamyhorror|18 days ago

Contracts can't exclude things that weren't invented when the contracts were written.

Ultimately it's up to legislation to formalize rules, ideally based on principles of fairness. Is it fair in a non-legalistic sense for all old books to be trainable-on, but not LLM outputs?

TZubiri|18 days ago

Because the terms set by each provider are different.

American models train on public data without a "do not use this without permission" clause.

Chinese models train on models that have a "you will not reverse engineer" clause.

WSSP|18 days ago

> American models train on public data without a "do not use this without permission" clause.

this is going through various courts right now, but likely not