porkloin | 7 days ago

I think it's important because there are a bunch of would-be claimants for intellectual property violation. Many people speculate that their work was used in training data, but it can be difficult to produce sufficient proof that their copyrighted work is present in the training data. If you could reliably get an LLM to produce 70% of a copyrighted book that would probably be enough to get a few lawyers salivating.

I didn't read the source paper referenced in the ars technica piece, but this statement about it makes me wonder how useful it actually is:

> But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.

It seems like well-known books come with tons of summaries, film-script adaptations, and other writing about them in the overall corpus, which makes it way less surprising that they're partially reproducible.

So I guess that's a lot of words to say: yeah, until there's something definitive that lets people prompt LLMs into either unlawfully recreating an entire work verbatim or otherwise indisputably proving that a copyrighted work was used in training data, there's probably nothing game-changing in it.

vidarh | 7 days ago

It's well-known books, yes, and even then with significant errors, which presumably means lawyers for the AI companies would argue there is no possible damage. That said, US copyright law has statutory damages for registered works that are not based on real, documented damages. I could totally see it being fought over, but I also agree it's probably not going to end up being game changing.

I suspect very few works will be memorised enough to be an issue, and we'll see the providers tighten up their guardrails a bit for works that are well known enough to actually be a potential issue (an issue in the form of lawsuits, not in the form of real damages to the copyright holders).