(no title)
Alifatisk | 6 days ago
> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...
I am just thinking loudly here. Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright? So the llms that reproduce the copyrighted material without any jailbreaking required is infringing the copyright.
latexr|6 days ago
That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.
simianwords|6 days ago
If the main value came from redistribution, I agree. But that’s not the case. They don’t intend to make any money in that way.
PurpleRamen|6 days ago
Is this really the case? They only have no copyright for distributing it. But let's assume they bought a copy for personal usage (which they did in some cases), then this is similar to hacking companies Amazon-account and complaining about the e-books they legally use for internal purpose. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.
NewsaHackO|6 days ago
lesam|6 days ago
mullingitover|6 days ago
PurpleRamen|6 days ago
Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.
freejazz|6 days ago
unknown|5 days ago
[deleted]
free_bip|6 days ago
The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.
In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted materials in the outputs. Any such system would likely also bypassable (e.g "output in rot13").
freejazz|6 days ago
Probably not with credibility as the jail does not exist to prevent copyright infringement.
vidarh|6 days ago
From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.
Even then, as they note, to get most of Harry Potter 1, they needed to spend around $120 on extensive prompting, and a process that they also freely acknowledge is more complex than it probably would be worth if the goal is to get a copy.
It's still worth exploring to what extent the models are able to "memorize", though.
But personally I'd be more interested in seeing to what extent they can handle less popular books, that are less likely to be present in multiple copies, and repeated quotes, in the training data.