top | item 47124908

bena | 7 days ago

This feels like a "no shit" moment.

Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.
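The "valid organization of tokens" point can be sketched numerically (all numbers below are made up for illustration): an autoregressive model with a softmax head assigns nonzero probability to every token at every step, so every token sequence, including a whole novel, has nonzero probability, computed as the product of the per-token conditional probabilities.

```python
import math

def sequence_prob(token_probs):
    # P(sequence) = product of per-token conditional probabilities;
    # since softmax never outputs exactly 0, this is always > 0
    return math.prod(token_probs)

# hypothetical conditional probabilities for a 5-token continuation
probs = [0.1] * 5
print(sequence_prob(probs))  # ≈ 1e-05: tiny, but not zero
```

Whether "nonzero probability" implies "a findable prompt" is exactly what the replies below dispute.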

yathern|7 days ago

Hmmm, I think you're sort of right, but not entirely. It's true that a novel is a valid organization of tokens, and that a model could feasibly be made to output that sequence. But when you say this:

> So there should be a prompt that can cause that sequence to be output

That is where I might disagree. For example, the odds of predicting, verbatim, the next sentence in, say, Harry Potter should be astronomically low for the large majority of the book; if they weren't, it would be a pretty boring book. The fact that a model can do this with relative ease means it has been trained on the material.
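A back-of-envelope for "astronomically low" (every number here is an assumption, not a measurement): even granting a generous 50% chance of guessing each next token correctly, the chance of reproducing one long sentence verbatim by luck alone is vanishingly small.

```python
# assumed average probability of guessing the "right" next token
p_per_token = 0.5
# roughly one long sentence of tokens
n_tokens = 50
# probability of getting the whole sentence exactly right by chance
p_verbatim = p_per_token ** n_tokens
print(p_verbatim)  # ≈ 8.9e-16
```

On those assumptions, a model that reliably completes such sentences is not guessing; it has seen the text.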

The issue at hand is copyright and intellectual property: if the goal of copyright is to protect the author's IP, then LLMs can act like an IP money-laundering scheme, where the black box has consumed the IP and can emit it again. The whole concept of IP is a little philosophical and muddy, with lots of grey area for fair use, parody, inspiration, and adaptation. But it gets very odd when we consider models that can adapt and reuse IP at massive scale.

Sharlin|7 days ago

That's not how it works… They aren't able to literally regurgitate everything they've read, no matter how you prompt them; that would obviously violate the pigeonhole principle. LLMs are, in effect, a lossy compression format, and the degree of lossiness for a given string depends on how often that string appears in the training data. It's clearly worthwhile to investigate exactly how it depends.
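The pigeonhole point can be made with rough arithmetic (every number below is an assumed, illustrative figure, not a measurement of any real model): if the training corpus carries far more bits than the weights can store, most of it cannot be memorized verbatim.

```python
corpus_tokens = 15e12   # assumed training corpus, ~15T tokens
bits_per_token = 16     # rough information content per token
params = 70e9           # assumed 70B-parameter model
bits_per_param = 16     # fp16 weights

corpus_bits = corpus_tokens * bits_per_token
model_bits = params * bits_per_param
ratio = corpus_bits / model_bits
print(ratio)  # ≈ 214: the corpus "weighs" ~214x the model's capacity
```

Under these assumptions the model could at best store a fraction of a percent of its training data losslessly; frequently repeated strings are the ones most likely to survive the compression.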

beder|7 days ago

Yes, this is absolutely right (for some sufficiently complicated prompt). Borges wrote a great short story that explores this idea, "Pierre Menard, Author of the Quixote", where Menard, a fictional 20th century author, "wrote" Don Quixote as an original work.

tsimionescu|7 days ago

This is completely false. The odds of an LLM predicting the text of a novel that is not part of its training set are basically zero - you can experiment with this if you want. It is essentially the infinite-monkeys-on-infinite-typewriters situation (only slightly more constrained).

This is not to say that they couldn't write a novel, even a very good one - that is a completely different discussion.

simianwords|7 days ago

Not if they are aligned not to do it - which is what the labs tried, though the alignment could be bypassed with jailbreaks.