top | item 46888773

(no title)

qoez | 25 days ago

People always claimed this as a data leak vector but I've always been sceptical. Like just writing style and vocabulary is probably extremely shared among too many people to narrow it down much. (How people that you know could have written this reply?) The counter argument is that he had a very specific style in his mail so maybe this is a special case.

discuss

order

Eisenstein|25 days ago

If you have a large enough set to test against and a specific person you are looking for, this is totally doable currently.

fluoridation|25 days ago

Of course it's doable. The question is how reliable the results are.

hansvm|25 days ago

Not just a test set, but enough of a set to search through and compare against. Several pages of in-depth writing isn't anywhere near sufficient, even when limiting the search space to ~10k people.

zxcvasd|25 days ago

this is a well-studied field (stylometry). when combining writing styles, vocabulary, posting times, etc. you absolutely can narrow it down to specific people.

even when people deliberately try to feign some aspects (e.g. switching writing styles for different pseudonyms), they will almost always slip up and revert to their most comfortable style over time. which is great, because if they aren't also regularly changing pseudonyms (which are also subject to limited stylometry, so pseudonym creation should be somewhat randomized in name, location, etc.), you only need to catch them slipping once to get the whole history of that pseudonym (and potentially others, once that one is confirmed).

scythe|25 days ago

Stylometry is okay if you're trying to deanonymize a large enough sample text. A reddit account would be doable. But individual 4chan posts? You barely have enough content within the text limit.

ge96|25 days ago

People do change over time, I used to write "ha" after every sentence for some reason