top | item 34470064


rubinelli | 3 years ago

De minimis is a longstanding defense in copyright law. If you are copying very little from very many works, as is the case when you turn multiple petabytes into a few gigabytes of neural network weights, you are in the clear. The problem arises when models overfit and spit out almost perfect copies of the training data.


belorn|3 years ago

Copyright doesn't have an explicit size, but rather uses size as one of many indicators.

For example, I could take a massive 8K video and convert it into a very small 144p YouTube video. Am I in the clear simply because the output is tiny compared to the input? Similarly, I could take a huge studio master copy of a song and convert it to a very small and rather compressed (distorted) MP3.

I partially agree that some of the problem arises when perfect copies are spit out by the models, but I think there is a bigger problem. Copyright is a complex concept that can't be defined exclusively by a single metric like size, and any mathematical definition will in the end be killed if large copyright holders feel threatened by it.

sdenton4|3 years ago

Thumbnail images don't violate copyright, and are a very helpful comparison case to consider.

"Transformative Use" is a major consideration in fair use copyright: https://en.wikipedia.org/wiki/Transformative_use

ML models do not supplant the pre-existing work, and provide fundamentally new modalities. Transformative use seems like a slam dunk to me, but I guess we'll see what the Supremes decide in twenty years or so...

Animats|3 years ago

There's a Stable Diffusion example where, having been trained on too many Getty Images pictures stamped with their logo, the system generated new images with Getty Images logos.[1] That's a bit embarrassing. There are code generation examples where copyright notices appeared in the output. A plagiarism detection system to ensure that the output is sufficiently different from any single training input ought to be possible.

[1] https://petapixel.com/2023/01/17/getty-images-is-suing-ai-im...
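A minimal sketch of what such a plagiarism-style check might look like, assuming word-level n-gram overlap as the similarity metric. The function names, the n-gram size, and the threshold are all illustrative assumptions, not a description of any deployed system; a real implementation would need fuzzier matching and an index that scales to petabytes of training data.

```python
def ngrams(text, n=5):
    """Return the set of word-level n-grams in the text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output, source, n=5):
    """Fraction of the output's n-grams that also appear in the source."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(source, n)) / len(out_grams)

def too_similar(output, training_docs, n=5, threshold=0.3):
    """Flag output that copies too heavily from any *single* training doc,
    matching the de minimis framing: a little from many is fine, a lot
    from one is not."""
    return any(overlap_ratio(output, doc, n) > threshold
               for doc in training_docs)
```

An exact copy scores 1.0 against its source and gets flagged, while text sharing no five-word phrase with any training document scores 0.0 and passes.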

phoe-krk|3 years ago

Yes, agreed. I don't think the problem is with networks that mix tons of input data in a way that doesn't heavily derive from one or a few sources. Currently available models have not solved overfitting, though, and this technological imperfection has direct practical (and legal) consequences.