rickmode | 2 years ago

I believe we first need to answer the question of whether the copyright of the AI model’s source text or images affects the output.

My opinion — and note I’m a software engineer, not a lawyer — is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material. This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation. And further, a user of the AI would themselves require a license to use the output.

The alternative seems to be “anything goes”.

Nevermark|2 years ago

I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

A model trained on many copyrighted data sources cannot somehow be used in a way that depends on only a subset of those sources.

So all parameters of usage and compensation should be settled by contract between the model builder and copyrighted data supplier, before the copyrighted material is used.

Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

That’s it. That’s the standard. No complicated new laws required.

Model builders obtain permission to use copyrighted material from copyright holders based on any terms both agree to.

Terms might involve model usage limits, term limits, one-time compensation, per-use compensation, data source credits, or anything else either party wants.

The likely result will be some standard sets of terms becoming popular and well known. But nobody has to agree to anything they don’t want to.

kuchenbecker|2 years ago

I slightly disagree, in that I think the person using the tool should bear the burden of copyright. I.e. if the model outputs something under copyright, it merely can't be republished. In the same way, I can use Photoshop on proprietary data, but I can't necessarily sell the results.

gwd|2 years ago

> Or to put it simply: using copyrighted material to create a model would NOT be considered fair use.

The more I think about it, the more something along these lines seems like it might be the right way to think about it.

When you play a DVD, for example, you copy the bits off the DVD, into the memory of your DVD player, and onto your screen; this is all explicitly considered "fair use" copying. But if you then copied those fair-use bits off the screen onto a thousand other screens, that violates copyright.

When you, as a human, watch the DVD, bits of it get copied into your brain; but you don't then copy the bits of your brain to millions of other people -- they each have to make their own copy.

We could make the law for LLMs follow a similar logic: That having an LLM watch a video or read a text is similar to having a DVD player read a DVD or a web browser copy information from a website. It's good for that limited use case, but the resulting copy cannot be copied again without a license.

This would allow (say) researchers, or even individuals, to do their own training and so on without a license; but when anyone wanted to create something that they wanted to scale up, they'd have to get licenses for everything.

That would fundamentally keep things balanced as they are now between creators and other creators. The big problem isn't that a handful of other creators may be copying their style; that growth in competition is limited by the expense of duplication. It's that millions of electronic engines can copy their style.

kelnos|2 years ago

> I don’t think it makes sense for both model builders and the model’s users to separately obtain licenses for the same works used in the training set.

I'm torn on who should pay, and where and when. In the world of patents, there's often an option/split. Say a chip manufacturer wants to build H265 decoding into their hardware. The chip manufacturer could buy the license. Or the purchaser (who probably is building some sort of board or device around the chip) could pay for the license. Or they could disable that functionality in the end product, and the consumer could pay for a license (or not, if they don't care about that feature).

The most common is usually the middle option: the end-device manufacturer (or brand that eventually sells the product) will pay for the license.

But I'm not sure if this works all that well for an AI model. With hardware, the license is usually paid per unit. It's easy to see that one chip = one license. If the model builder buys a license, that model could be used one time or 100 million times. Tracking use like that probably isn't all that practical, but I think it's safe to say that a 100-million-use model should probably pay more for a license than a single-use model.

So maybe the model builder should be responsible for attaching a comprehensive "copyright history" to the model, and users should have to pay for a license based on their use? Again, not sure how to track that. But I guess general software licensing has similar problems when you can "hide" usage.

Retric|2 years ago

Yes, someone using a model can’t know whether the generated text/image/sound is a nearly identical copy of original material they don’t recognize. If use of the output of these systems comes at significant legal risk, then such systems become nearly useless.

renonce|2 years ago

Problem is, how can you determine if the model contains copyrighted material? The law governs copyright through ownership, so in order to claim copyright infringement you have to be able to pinpoint a specific person and prove that their work is somehow embedded in the gradients, which is not practically possible at this point. It's just like how you can't practically enforce copyright on encrypted data unless you ban encryption altogether.

meowkit|2 years ago

My opinion as a SWE who is dating a lawyer (joke, not a serious qualification but it does provide some insight):

Generative models traverse and interpolate high dimensional state spaces. These state spaces are created from input data.

I would argue people do the exact same thing - the first main difference is we can use novel inputs (e.g. we can use images or words to develop our music/temporal state spaces and vice versa). People also are recursive and self referential in a way that doesn't collapse.

Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution. Either traditional copyright wins and we get even more draconian policies (think Disney and their desire to never put anything in the public domain), or we have a free for all (which I don't think is bad for creative works, but certainly for more practical things like stock photos or nonfiction).

cj|2 years ago

I can appreciate how this line of thinking might be attractive.

But IMO the human<>machine comparison doesn't carry much weight. We shouldn't assume that just because a human is allowed to do something, a machine is automatically allowed to do the same thing, too. I think some care should be taken when considering whether we allow machines the same privileges as humans.

mjan22640|2 years ago

The value of copyright is going to vanish. There is enough public domain material to train models on and to avoid the problem altogether.

There used to be professions like tinkerers, bards, and clowns. The tinkerers disappeared when society became modern. The clowns, on the other hand, managed to lobby for laws that put people in jail for heinous crimes like copying pictures, and survived longer. They are going to bite the dust now.

rcme|2 years ago

Whether or not “humans do it” isn’t relevant. You can walk around with a copyrighted song in your head. That is not copyright infringement. But if you take that song, create a digital copy, and distribute it for money, then you are violating someone’s copyright. Additionally, our legal system requires a balance of probabilities. It’s hard to prove that someone was influenced by another work unless the similarities are plainly obvious. The same does not apply to ML models where the training data and algorithm are knowable facts.

distract8901|2 years ago

The analogy doesn't hold when you consider the sheer scale of the problem.

I can outright buy a machine for a few thousand dollars that can crank out a faithful rewrite of every Stephen King novel without the shitty endings and nonsense plot points. It can do it in a few days, maybe a couple of weeks at most.

To do that with human labor would take years and cost hundreds of thousands, if not millions of dollars.

Instead of paying an artist a couple hundred for a commissioned drawing, I can just scrape up their entire portfolio and generate any image I want with their style. I can generate hundreds or thousands of images. I can take their distinct style and use it exclusively as the branding for my company.

What an ML model does is fundamentally not what happens when a human draws inspiration from prior art. A human would require an extremely significant amount of time and resources to perfectly imitate every artist they have ever seen. It takes a human significant time and resources to produce faithful variations on prior art.

An ML model is measured in words or images per second.

omnicognate|2 years ago

> ... a SWE who is dating a lawyer

> I would argue people do the exact same thing

Perhaps a ménage à trois with a neuroscientist would change your view on this.

ethbr1|2 years ago

> Until we solve the interpretability problem (e.g. can you decode the feature space of a neural network into something we can comprehend) there is no good solution.

This is the rub. Without reverse attribution... open source anonymous models become a free-for-all loophole.

Since that doesn't currently exist, I think the best we can do is to say that any commercial entity using a model bears the responsibility of proving the model they use is untainted by copyrighted material (to which they haven't secured rights).

Open source model X is... whatever it is.

But I'll be damned if OpenAI / Meta / Microsoft / IBM should be able to build a commercial product on top of laundered copyrighted material while ignoring provenance.

I mean, we have models for this: software code and art. Both aren't clearly attributable. In the case of software code, we've developed case law around clean room design and similarity. In the case of art, we value verifiable chain of custody.

Hopefully, something similar would tilt commercial funding of AI in the direction of responsible use.

Natsu|2 years ago

My problem with this is that artists learn by studying other artists; cutting that off because it's AI, rather than focusing on whether the resulting work is derivative, seems more of a problem to me. It seems to me that an AI can be used for either original work or derivatives; proving that you can get derivatives out of it has always struck me as no different from commissioning a copy of someone's work from a human artist and being shocked that you got what you asked for.

freejazz|2 years ago

Can an AI express to you how van Gogh affected it as an artist? I'm not sure that AI is "learning" the way we say humans are "learning" when they learn and study art. Obviously there is no debate that you can input van Gogh into a model and produce something van Gogh-like as a result. But I've not seen anything that indicates that the AI is learning anything about van Gogh at all. Perhaps it comes down to whether you think learning van Gogh is just creating a mapping of all of his brush strokes ever, and only exactly what they look like. It's obvious the AI knows nothing more than that. If you think that's what humans do when they learn art, I'd be sad for you!

As to your hypothetical, we don't give copyrights to people who make rote copies of things, human or otherwise. Is the implication of the shock that there is sufficient difference from the original to render it a derivative and not a copy? Okay, how so? And of what consequence? Making derivatives of a copyrighted work without a license is infringement.

skydhash|2 years ago

You can ask someone to produce a pin-up version of Minnie Mouse, but good luck using it in any commercial activities.

Most LLMs are just profiteering from people’s labor without their consent. And there’s nothing new being produced. It’s always a statistical output of previous works.

idle_zealot|2 years ago

Is intelligence really a factor here?

Say I use the same training set as one of these LLMs, copyright protected text and all, and use it to derive a compression algorithm that uses very little space to store tokens and token sequences that are common in that huge collection of text. The resulting compression scheme includes some sort of statistical artifact derived from that copyrighted text. Is that allowed? And if so why is an LLM different?
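
For concreteness, here is a minimal, purely illustrative sketch (in Python; not something the commenter actually built) of such a corpus-derived compression scheme. The codebook it produces is a statistical artifact of whatever "training" text it was built from:

    # Toy corpus-derived compression: frequent words get short integer codes.
    from collections import Counter

    def build_codebook(corpus, size=256):
        # Assign codes to the most frequent words in the "training" corpus.
        counts = Counter(corpus.split())
        return {word: i for i, (word, _) in enumerate(counts.most_common(size))}

    def compress(text, codebook):
        # Replace known words with their code; unknown words pass through.
        return [codebook.get(w, w) for w in text.split()]

    corpus = "the cat sat on the mat and the dog sat on the rug"
    codebook = build_codebook(corpus)  # statistics of the corpus, baked in
    print(compress("the dog sat on the mat", codebook))

The codebook "remembers" the statistics of the corpus without storing any passage verbatim, which is exactly the gray area the question points at.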

cj|2 years ago

Very good question indeed.

A lot of these questions are somewhat ethical/moral in nature. E.g. is it okay to take someone else's creative work and process it through some algorithm to create a service like ChatGPT? Or a compression algorithm? I don't know.

It's awesome to see the Copyright office request input from both sides of the argument.

quickthrower2|2 years ago

LLMs are generative, though, not just compressive.

stale2002|2 years ago

> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

None of what you are saying has anything to do with copyright.

The tool Photoshop isn't generally intelligent either. And yet, yes it can be used to create art using other people's stuff.

And it could be done legally if the results are transformative.

jtr1|2 years ago

Photoshop doesn’t install with a massive directory of other people’s copyrighted works to draw snippets from.

fluidcruft|2 years ago

I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression". Copyright and "lossy compression" are pretty easy to reason about. Model "building" is "compression". Model "use" is "decompression". Everything about these AI models seems to be about the "lossy" part, but "lossy" is just an adjective to the main show.

It's very difficult to not conclude that copyright of a trained model should be treated identically to the copyright of a zip file.

chii|2 years ago

Information is not copyrighted, just the expression of said information.

So if you took a recipe book, extracted the recipe information, and listed out the recipe in a different format (such as a table), it's a new work. It does not violate the copyright of the recipe book you extracted the info from.

gwd|2 years ago

> I personally have a really hard time finding any meaningful difference or distinction between "AI" and "lossy compression".

If you feed a photo of your dog into a JPEG compressor and the result looked like a cat in the same style, I think you'd be pretty annoyed.

CamperBob2|2 years ago

When you perform lossy compression, you feed it one file at a time, not every file in existence.

tomrod|2 years ago

Some compression, yes, but the analogy oversimplifies. AI re-represents input information in a transformative way (an embedding, say), then creates new, derived and combined output from a new input (e.g. a prompt).

It's not just lossy compression. It's potentially novel.

8note|2 years ago

Why is being a statistical model relevant?

The simplest statistical model is an average. Why would the average pixel RGBA of a bunch of images implicate the copyright of those images?
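
Taken literally, that "simplest statistical model" is a few lines of code; the file names below are placeholder assumptions (requires NumPy and Pillow):

    # Average RGBA value across a set of images.
    import numpy as np
    from PIL import Image

    paths = ["img1.png", "img2.png", "img3.png"]  # hypothetical inputs
    pixels = np.concatenate([
        np.asarray(Image.open(p).convert("RGBA"), dtype=np.float64).reshape(-1, 4)
        for p in paths
    ])
    mean_rgba = pixels.mean(axis=0)  # one 4-number summary of every input image
    print(mean_rgba)

Four numbers plainly retain nothing protectable from the inputs; the open question is where along the spectrum from this to a billion-parameter model that stops being true.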

chii|2 years ago

The crux of the AI copyright argument sits in economics. Those currently producing content want future content generated from AI to benefit them financially, as long as a thin sliver of their own content was used in the training.

This is like asking all students to pay their teachers a (small) percentage of their future economic output.

JamesBarney|2 years ago

My opinion is we should treat AI like Photoshop/Word/Windows. If you use Windows to copy a file and distribute it, Microsoft isn't liable; you are. If you use Word to type up a book and sell it, you're responsible.

Same with a statistical model: if you generate a copyrighted work and distribute it, you are responsible. But the maker of the tool (GPT-4) isn't responsible, just like Adobe isn't responsible for copyright infringement.

The copyrighted text/image isn't generated until you ask it to. Your prompt is what reproduces the material.

NoMoreNicksLeft|2 years ago

Why would any non-lunatic want to live in a world where someone can't import an image into software?

If only some software is disallowed, then why permit Excel but prohibit Stable Diffusion?

Can someone even look at a SD-generated image, and claim with certainty that their own art was used to train it? Any more than claiming that another artist was inspired by it, looking at their output?

I'm fine with anything goes. The alternative seems to be copyright maximalist clownworld.

paxys|2 years ago

> is that an AI, being a statistical model and not generally intelligent, should not be allowed to disregard the copyright of its source material

But then you are just shifting the problem forward by an inch. What happens when tomorrow someone declares that their model is generally intelligent and is therefore allowed to disregard copyright when training just like a person can?

jasonzemos|2 years ago

This point is of the utmost importance from a public policymaking perspective. Laws such as these are easy to craft now and difficult to change later. I feel like we are previewing an unfolding disaster here.

The future will clearly yield a class of "beings" striving for some degree of indistinguishability from or coexistence with humans. Proposals that discriminate -- literally discriminate -- without respect for the principles of universality and equal treatment under law are creating and condemning a marginalized group before it even reaches maturity. This is an old and tired theme repeated through history. Let's foresee this and not get it wrong.

freejazz|2 years ago

Is it your experience that people's facial declarations carry the day in legal disputes? It's not mine. Rather, it seems like the whole thing is designed to provide scrutiny against bare facial declarations that something is true or false.

I see this on HN all the time: "someone just has to claim," "someone just has to say." Yeah... that's not how it works. People can say whatever they want; that doesn't mean it satisfies their burden of proof. Self-serving testimony is the lowest form of evidence imaginable.

paulusthe|2 years ago

I agree completely. AI model trainers should have to pay the people who provide their training materials, and there should be a default assumption of opting out until someone or their company explicitly opts in.

Unfortunately, the Peter Thiels and all those bizarrely out-of-touch Silicon Valley assholes have already effectively scraped the Internet, because ethics don't matter if you're special like them, so to a degree regulations are way behind the ball.

That said it's still worth doing, and I'd love to see it done retroactively as well. It's not as if "I forgot that I had a public Myspace 25 years ago" is an implicit user opt-in for some startup to save your data - however anonymized they claim it is (lol!) - and train its AI on it.

zmmmmm|2 years ago

> The alternative seems to be “anything goes”.

Seems like a huge false dichotomy. You really can't imagine anything in between total shutdown of AI training on public data sources and no rules at all?

I think we should try a bit harder for a middle ground.

lewhoo|2 years ago

I think you are right. People argue if LLM's store or maybe generalize. I propose an experiment for anyone interested. Try and do this prompt multiple times and change the appropriate verse numbers:

> Provide quote from King James' Bible Genesis :25-31

or

> Provide quote from King James' Bible Genesis :1-25

or whatever you fancy.

I didn't go through the whole Bible, but I got pretty much a verbatim chapter. I argue that the only reason you can't do this with copyrighted books is the guardrails, not ChatGPT's lack of capability, so the information is there, and it's verbatim. Plus, other books don't have such nifty indexing.
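
For anyone who wants to run the experiment outside the ChatGPT UI, a rough sketch using the openai Python client (v1+) could look like the following; the model name and the passage are illustrative assumptions, and you would diff the output against the public-domain KJV text:

    # Probe a chat model for verbatim recall of a well-indexed public-domain text.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def quote(passage):
        resp = client.chat.completions.create(
            model="gpt-4",  # assumption; any capable chat model
            messages=[{"role": "user",
                       "content": f"Provide quote from King James' Bible {passage}"}],
        )
        return resp.choices[0].message.content

    print(quote("Genesis 1:1-5"))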

mensetmanusman|2 years ago

Because the cat is out of the bag, so to speak, any attempt to force AI companies to generate their own content to train on means we are signing up for a future where only multi-billion-dollar companies are in control.

PaulDavisThe1st|2 years ago

If they were truly forced to do this, even they would find it difficult.

gnopgnip|2 years ago

Is there any precedent where copyright was focused on the input rather than the final published work?

jj999|2 years ago

Compilers

harshreality|2 years ago

This is more of a problem for images, where output similar to the inputs is likely, than for LLMs, where no matter what you prompt with, I doubt you can get it to regurgitate any significant part of Harry Potter well enough to be a classical copyright violation of any of the novels. Maybe you could generate a copyright violation of character traits.

The output space of images (MB for larger images) tends to be larger than books (a few hundred KB of text for a long novel), but the perceptual output space of books is much larger.

Any determination that licensing is required for AI generation, or use of AI-generated works, is unacceptable until Congress or courts put some reasonable objective tests in place to determine what is and isn't a copyright violation for various types of works of various lengths. Not the ambiguous 4-factor test that is basically whatever the judge feels like. It will be a complete mess otherwise. They can't just define a new AI policy for copyright with a few types of works in mind; it has to work for all works.

You could look at this mathematically from a complexity perspective and try to define a similarity function that's true when a second work is close enough to a first work to be a derived work (assuming the first one had been seen by the creator of the second). Unfortunately that won't work because nobody can define such a function to everyone's satisfaction, and the courts wouldn't accept any informal suggestion of a definition when it didn't come from Congress. Specifically, you'd get into trouble with consistency in the function determining derived works depending on length of the work: short works, like a haiku, are much more sensitive to copyright violation in some ways... a mere 17 syllables is a complete reproduction and therefore a copyright violation, yet a single word isn't; for a novel, reproducing 1/17 of the content is almost certainly a copyright violation, but reproducing 17 syllables probably isn't.

Different stakeholders and creative re-mixers would want different things from the function. It's untenable.

judge2020|2 years ago

> This would, I think, require the AI’s creator to secure a license for all of its sources that allows this sort of transformation and presentation

That is a fairly illogical leap. From your text alone, “should not be allowed to disregard the copyright of its source material” would mean: “the AI’s maintainer should have a fairly reliable (but not infallible) system to report how likely it is that it generated something that is a direct derivative work of something in its dataset”. As a human you don’t need to attribute/license every piece of art you’ve seen of clouds if you draw a cloud. So if an AI draws a cloud that is actually derivative of the millions of clouds it has seen, then it doesn’t need any permission from the millions of creators to draw one either.

rmbyrro|2 years ago

AI is taking work away from lawyers, and instantly creating more work for lawyers.

Ain't that interesting to reflect upon?

I speculate there is a hidden force in the universe, something physicists are yet to identify, which mandates: "they shall always have something to do".

mjan22640|2 years ago

The human brain is no different. It generates content from the things it learned.

CatWChainsaw|2 years ago

Repost #4 I believe

https://news.ycombinator.com/item?id=37305580

"I'll keep saying it every time this comes up. I LOVE being told by techbros that a human painstakingly studying one thing at a time, and not memorizing verbatim but rather taking away the core concept, is exactly the same type of "learning" that a model does when it takes in millions of things at once and can spit out copyrighted code verbatim."

gaganyaan|2 years ago

I hope your opinion isn't shared by lawmakers. Copyright is a relic of the past, and it needs to be put out of its misery. Trying to (mis)apply copyright here would just lobotomize the US. Existing companies would just technically operate out of a saner jurisdiction, and we'd be handing other countries a golden opportunity to leapfrog the US.

scotty79|2 years ago

"anything goes" is the best and most natural solution. Just don't let people copyright the output if they don't have full copyright on all of the inputs. This should finally get rid of the cancer that is copyright in a generation or two.

rickmode|2 years ago

Generic reply to siblings here… I get the intelligence argument.

My _main_ point is that there’s a non-trivial question to answer here.

I’m not qualified to answer (though I’ve offered up my non-expert opinion). It certainly seems to quickly veer into philosophy!

jillesvangurp|2 years ago

It shows you are not a lawyer. You misunderstand how copyright works. Creating copies or derivative works and distributing them is all that matters under copyright. This is not "disregarding" copyright (which is not an actual thing) but something that is either fair use or may require some kind of permission from the creators of the original for those distributing some kind of derived work or copy. That's why it's called copyright.

Copyright merely restricts the distribution of original works or their derivatives. In case of an infringement, copyright holders can insist you stop distribution and/or compensate them for that.

If I sell you a paint brush, I'm not liable for you putting a red nose on the Mona Lisa and trying to sell it off as an original work. Doing that to the original would be an act of vandalism (because you don't own it), and doing that to a replica that you got from somewhere infringes on the rights of those who created the replica, which is a derived work or copy in itself, of course, and whose distribution is regulated by copyright. Distribution of such a replica is of course fine, because Da Vinci has been dead for a very long time and his work is no longer protected under copyright. Distributing your red-nosed Mona Lisa would therefore be fine too. Either way, the paint brush seller is no party in this case; this is between you, Da Vinci, his descendants, and the replica creators.

Now, your assertions as to what AIs are or aren't are simply not relevant. You assert it's a statistics-algorithm thingy. That sounds like a tool to me. Yet another paint brush. Using a paint brush is not infringing on anyone's rights; for that you have to distribute the results of your work. The nature of the tool does not matter. How you use the tool does not matter either. You merely create (potentially) derivative works with the tool, and what you do with those matters, especially when you distribute them to others. One of those derivative works is of course the AI model itself. Creating one is fine. Copyright gets potentially infringed when you distribute one.

Now we get to the core of the matter. Can you, with a straight face, say the AI model resembles the original and is a derivative work? It doesn't actually look like or resemble the original in any shape or form. Even proving the AI model is derived from the original is tricky. Copyright is not about protecting vague ideas or notions, but the concrete shape or form of things. And it's only an infringement if you distribute a derived work or a copy of a thing to others. So merely creating an AI model is not distributing anything to anyone. You are merely using tools to create something for yourself: an AI model, in this case.

Distributing a verbatim copy of a book is an infringement. Citing the book in your own work is fair use (up to a point). Paraphrasing elements from the book, acknowledging it exists, taking inspiration of it, or reading it aren't copyright infringements.

The legal problem with AI models is that their concrete shape or form doesn't resemble the original inputs in any shape or form. Besides, companies like OpenAI don't actually distribute their AI models. They are huge; it's not very practical. They merely exploit those models to generate outputs to inputs from their users and customers. Are those outputs derivative works? Maybe, but that's where it gets tricky. They clearly aren't in the classical sense. Not even close. But if you somehow could conclude that they are, who is distributing that derivative work? Secondly, if the AI model is a tool, who actually creates those outputs, and are those outputs protected under copyright? Who actually holds those rights? And how would you tell such an output apart from a human-created one?

It's questions like this that make all this extremely murky from a legal point of view. IMHO without dramatic changes to copyright law or the way it has been commonly interpreted legally, it's just very poorly suited to do anything about stopping AI companies from doing what they are doing. You'd have to bend the conventional interpretation quite a bit for that. No doubt, there will be court cases where people will try to do that. But it will take many years before the dust settles on that. And I wouldn't get my hopes up on some unexpected/dramatic outcome.

freejazz|2 years ago

This is generally true, but I'm surprised you aren't aware that distribution isn't the only right protected by copyright - creating derivative works is protected, and display rights are protected.