DalasNoin | 4 days ago

To be clear, we concede that these people weren't truly anonymous. But we did use an LLM to remove identifying information from their HN comments, making them quasi-anonymous; this is described in more detail in Table 2 of the appendix.
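A minimal sketch of what such an LLM redaction pass could look like, in Python (the client, model, and prompt here are illustrative assumptions, not the paper's actual pipeline):

    # Sketch: ask a chat model to strip directly identifying details
    # from a comment while preserving its style and topic.
    from openai import OpenAI

    client = OpenAI()

    REDACT_PROMPT = (
        "Rewrite the following comment, removing names, employers, "
        "locations, and other directly identifying details. "
        "Preserve the writing style and topic."
    )

    def redact(comment: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o",  # illustrative; any capable chat model would do
            messages=[
                {"role": "system", "content": REDACT_PROMPT},
                {"role": "user", "content": comment},
            ],
        )
        return resp.choices[0].message.content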

We also run a more realistic test in Section 2. There we use the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews our agent identified 9/125 people based on contextual clues.

The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...

dang | 4 days ago

Thanks for that link! I'll put it in the top text.

Edit: actually, I've re-upped your submission of that link and moved the links to the paper into the top text instead. Hopefully this will ground the discussion more in the actual study.

ranger_danger | 4 days ago

But you also relied on people giving away too much personal information about themselves... which won't always be the case.

majorchord | 4 days ago

Yeah, my first thought was "of course an LLM can do that, we didn't need a paper to tell us". I would be more impressed if it could do it without that information, such as by analyzing writing styles and other cues that aren't direct PII.

DalasNoin | 4 days ago

I agree that these accounts probably still contain more information, on average, than the typical pseudonymous account. I think we could try using the LLM to ablate increasingly more information and see how performance decays; to be clear, we already heavily remove such information (see Table 2 in the appendix). But I don't expect that to change the basic conclusions.
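As a rough sketch of that ablation idea (redact and identify here are hypothetical stand-ins for the actual pipeline, not functions from the paper):

    # Hypothetical ablation loop: redact each profile at increasing
    # aggressiveness levels and measure how often identification still
    # succeeds. redact() and identify() stand in for the real pipeline.
    from typing import Callable, List, Tuple

    def ablation_curve(
        profiles: List[List[str]],             # each profile is a list of comments
        truths: List[str],                     # ground-truth identities
        redact: Callable[[str, int], str],     # strips more detail at higher levels
        identify: Callable[[List[str]], str],  # agent's best identity guess
        levels: Tuple[int, ...] = (0, 1, 2, 3),
    ) -> None:
        for level in levels:
            hits = sum(
                identify([redact(c, level) for c in profile]) == truth
                for profile, truth in zip(profiles, truths)
            )
            print(f"redaction level {level}: {hits}/{len(profiles)} identified")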

famouswaffles | 4 days ago

Over a long enough timeframe (often a couple of years at most), almost everyone online gives away too much information about themselves. A seemingly innocuous statement can pin you to an exact city, and so on.