Lockyy | 2 years ago
Of course, whether that was purposeful or inadvertent (as part of the larger training set) could not be determined, but you would know that the text is in there.
Aerroon | 2 years ago
If I create a program that picks random words from a dictionary, and I end up with a seed that generates that text verbatim, does that mean my program contains the copyrighted text?
You might be able to craft an intricate prompt that just happens to recreate that copyrighted text: run it enough times until you get it verbatim, and done.
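A minimal sketch of that thought experiment, assuming a toy five-word dictionary (the word list and function name are made up for illustration):

```python
import random

WORDS = ["the", "cat", "sat", "on", "mat"]  # toy dictionary, not real data

def generate(seed, n):
    # A seeded RNG picks n "random" words, fully determined by the seed.
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDS) for _ in range(n))

# The same seed always reproduces the same text:
assert generate(42, 5) == generate(42, 5)
```

With a large enough dictionary and enough words, some seed exists for any given passage, yet the program itself stores no passage at all: only the word list and the sampling rule.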
constantcrying | 2 years ago
And LLMs do exactly that, except that before picking each word they do complex statistics to compute a probability distribution over the candidate words.
Almost certainly, some combination of input and RNG seed will produce any "small" combination of words.
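The difference from uniform random picking can be sketched as weighted sampling from a per-step distribution; the probabilities below are invented for illustration, not from any real model:

```python
import random

# Toy next-word distribution: hypothetical probabilities, not a real model.
dist = {"cat": 0.5, "sat": 0.3, "mat": 0.2}

def next_word(rng):
    # Weighted random pick, analogous to sampling from an LLM's output distribution.
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

# Fixing the RNG seed makes the "random" output fully reproducible:
rng = random.Random(0)
sequence = [next_word(rng) for _ in range(5)]
```

Two runs with the same seed yield the same sequence, which is why access to the seed matters so much when trying to reproduce a specific output.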
constantcrying | 2 years ago
No, you only know that that passage was likely consumed. You would need to show that it will generate arbitrary passages from the text.
And LLMs are inherently random, so proof that this happens is very difficult to obtain, and showing that it is actual model output is nearly impossible, especially if you only have API access and can't use the model directly (e.g. to fix the RNG seed).
If you have that, you can debate whether it is or isn't fair use.
Lockyy | 2 years ago