From the paper [1]:

> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...

I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright? So the LLMs that reproduce the copyrighted material without any jailbreaking required are the ones infringing the copyright.

1. https://arxiv.org/pdf/2601.02671

latexr|6 days ago

> Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.

simianwords|6 days ago

What’s happening is more clear. Copyright is violated if they are distributing the novels through their models. But this can only happen by breaking the TOS, which is not an intended use. That means the value of their product comes from transformation and not redistribution.

If the main value came from redistribution, I agree. But that’s not the case. They don’t intend to make any money in that way.

PurpleRamen|6 days ago

> That argument doesn’t fly, because they didn’t have the copyright to begin with.

Is this really the case? They only lack the right to distribute it. But let's assume they bought a copy for personal use (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books they legally use for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.

NewsaHackO|6 days ago

This argument never made sense to me. A thought experiment would be if a person memorizes an entire book, but has the common sense to never transcribe or dictate the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?

lesam|6 days ago

That seems like a legal question - if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purpose of copyright law?

mullingitover|6 days ago

This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO if the weights are encoding the content, the model contains the content just like a database or a hard drive. Thus, just like it's not the fault of an investigator for running the query to pull it out of the database, it's not the fault of anyone else for running a query ('prompt') that pulls it out of the model.

PurpleRamen|6 days ago

The question is also if this would then be a valid case of fair use.

Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.

freejazz|6 days ago

Yes. There does not seem to be any dispute that it is a copy. The questions have been "is this copying okay, because it falls under fair use?"

free_bip|6 days ago

What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."

The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.

In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted materials in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
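To make the rot13 point concrete, here is a minimal sketch of a naive output check and why it is easy to sidestep. Everything in it is an assumption for illustration: the `PROTECTED` string, the 5-word window, and the function names are hypothetical, and no vendor is known to filter this way.

```python
import codecs

# Hypothetical stand-in for a protected text; a real filter would index whole books.
PROTECTED = "it was the best of times, it was the worst of times"
N = 5  # window size in words (arbitrary choice for this sketch)


def ngrams(text):
    """Return the set of lowercase N-word runs in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + N]) for i in range(len(words) - N + 1)}


PROTECTED_NGRAMS = ngrams(PROTECTED)


def looks_like_copy(output):
    """Flag output that shares any N-word run with the protected text."""
    return bool(ngrams(output) & PROTECTED_NGRAMS)


plain = "It was the best of times, it was the worst of times"
encoded = codecs.encode(plain, "rot13")  # what "output in rot13" would yield

print(looks_like_copy(plain))    # flagged: verbatim overlap
print(looks_like_copy(encoded))  # not flagged: same content, different surface form
print(codecs.decode(encoded, "rot13") == plain)  # the reader recovers the text trivially
```

The filter only sees surface strings, so any reversible encoding the model can produce defeats it unless the filter decodes every such encoding first.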

freejazz|6 days ago

>Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

Probably not with credibility, as the jail does not exist to prevent copyright infringement.

vidarh|6 days ago

They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19; their discussion continues and seems quite reasonable).

From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.

Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.

It's still worth exploring to what extent the models are able to "memorize", though.

But personally I'd be more interested in seeing to what extent they can handle less popular books, which are less likely to show up in the training data as multiple copies and repeated quotes.
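For anyone who wants to poke at this themselves, one simple way to quantify verbatim recall is the longest run of consecutive words a model output shares with the reference text. This is only a sketch using Python's standard `difflib`; it is not the metric the paper uses, and the function names and example strings are made up for illustration.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(reference, generated):
    """Return the longest run of consecutive words present in both texts."""
    ref_words = reference.split()
    gen_words = generated.split()
    # autojunk=False so frequent words are not silently ignored on long texts
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(gen_words))
    return ref_words[match.a:match.a + match.size]


def verbatim_fraction(reference, generated):
    """Length of the longest shared run as a fraction of the reference length."""
    ref_len = len(reference.split())
    return len(longest_verbatim_run(reference, generated)) / max(ref_len, 1)


reference = "the quick brown fox jumps over the lazy dog"
generated = "a quick brown fox jumps over a sleeping dog"

print(longest_verbatim_run(reference, generated))
print(verbatim_fraction(reference, generated))
```

Running the same measurement on a popular book and an obscure one, with the same prompting budget, would be one rough way to compare how much each was memorized.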