Lockyy | 2 years ago
Of course, whether that was purposeful or inadvertent (as part of the larger training set) could not be determined, but you would know that the text is in there.
Aerroon | 2 years ago
If I create a program that picks random words from a dictionary, and I end up with a seed that generates that text verbatim, does that mean my program contains the copyrighted text?
You might be able to craft an intricate prompt that just happens to recreate that copyrighted text: run it enough times until you get it verbatim, and done.
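A minimal sketch of that thought experiment, assuming a toy five-word dictionary (the word list and function name are made up for illustration):

```python
import random

WORDS = ["the", "cat", "sat", "on", "mat"]  # toy dictionary, not real data

def generate(seed, n):
    # A seeded RNG picks n "random" words, fully determined by the seed.
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n))

# The same seed always reproduces the same text:
assert generate(42, 5) == generate(42, 5)
```

With a large enough dictionary and enough words, some seed exists for any given passage, yet the program itself stores no passage at all: only the word list and the sampling rule.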
constantcrying | 2 years ago
And LLMs do exactly that, except that before picking each word they do complex statistics to compute a probability distribution over the candidate words.
Almost certainly, some combination of input and RNG seed will produce any "small" combination of words.
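The difference from uniform random picking can be sketched as weighted sampling from a per-step distribution; the probabilities below are invented for illustration, not from any real model:

```python
import random

# Toy next-word distribution: hypothetical probabilities, not a real model.
dist = {"cat": 0.5, "sat": 0.3, "mat": 0.2}

def next_word(rng):
    # Weighted random pick, analogous to sampling from an LLM's output distribution.
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

# Fixing the RNG seed makes the "random" output fully reproducible:
rng = random.Random(0)
sequence = [next_word(rng) for _ in range(5)]
```

Two runs with the same seed yield the same sequence, which is why access to the seed matters so much when trying to reproduce a specific output.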
constantcrying | 2 years ago
No, you only know that that passage was likely consumed. You would need to show that it will generate arbitrary passages from the text.
And LLMs are inherently random, so proof that this happens is very difficult to obtain, and showing that it is actual model output is nearly impossible, especially if you only have API access and can't use the model directly (e.g. to fix the RNG seed).
If you have that, you can debate whether it is or isn't fair use.
Lockyy | 2 years ago