item 40260370

clay_the_ripper | 1 year ago

It does seem like this could be a way to test whether an LLM has your data in it.

People have found other ways to do that of course, but this is pretty clever.


mvkel | 1 year ago

Not necessarily. This also exposes a weakness in the NYT lawsuit.

Imagine your corpus of training data contains the following:

- bloga.com: "I read in the NYT that 'it rains cats and dogs twice per year'"

- blogb.com: "according to the NYT, 'cats and dogs level rain occurs 2 times per year.'"

- newssite.com: "cats and dogs rain events happen twice per year, according to the New York Times"

Now, you chat with an LLM trained on this data, asking it "how many times per year does it rain cats and dogs?"

"According to the New York Times, it rains cats and dogs twice per year."

NYT content was never in the training data. However, it -is- mentioned widely across commoncrawl-approved sources, so the attribution gets a higher next-token probability association.

Zoom that out to full articles quoted throughout the web, and you get false positives.
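The mechanism described above can be sketched with a toy bigram model. This is a minimal, hypothetical illustration (the corpus strings are the made-up blog snippets from the comment, not real data): a model trained only on secondhand mentions still learns to associate the phrase with the NYT attribution.

```python
from collections import Counter, defaultdict

# Toy corpus of secondhand mentions (the hypothetical blog text above);
# no actual NYT article text is included.
corpus = [
    "I read in the NYT that it rains cats and dogs twice per year",
    "according to the NYT cats and dogs level rain occurs 2 times per year",
    "cats and dogs rain events happen twice per year according to the New York Times",
]

# Count next-word frequencies conditioned on the previous word (a bigram model).
following = defaultdict(Counter)
for doc in corpus:
    words = doc.lower().split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def most_likely_next(word):
    """Return the highest-frequency continuation of `word` seen in the corpus."""
    return following[word.lower()].most_common(1)[0][0]

# Every snippet follows "cats" with "and" and "and" with "dogs", so the
# bigram model reproduces the phrasing purely from secondhand quotes.
print(most_likely_next("cats"))  # -> "and"
print(most_likely_next("and"))   # -> "dogs"
```

A real LLM is vastly more complex than a bigram table, but the failure mode is the same shape: repeated paraphrases in the crawl inflate the probability of the attributed phrasing without the original source ever being in the corpus.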

refulgentis|1 year ago

They were getting huge chunks of NYT articles out, verbatim. I remember being stunned. Then I remember finding out there was some sort of trick to it that made it seem sillier.