Author here. This site lets you enter a username and get the users whose writing style is most similar to that user's. It confirmed several users I suspected were alts and, after I informally asked around, identified abandoned accounts of people I knew many years ago. I made this site mostly to show how easy this is and how it can erode online privacy. If some guy with a little bit of Python and $8 to rent a decent dedicated server for a day can make this, imagine what a company with millions of dollars and a couple dozen PhD linguists could do.
Here's Paul Graham:
https://stylometry.net/user?username=pg
Here are some frequent HN commenters: (EDIT: Removed due to privacy concerns)
[+] [-] sillysaurusx|3 years ago|reply
The most interesting thing is that my writing style has changed pretty drastically since a decade ago. Searching for my oldest account matched my earliest usernames, whereas searching for this account matched the rest.
The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.
[+] [-] hnburnerUixoHr5|3 years ago|reply
I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor in social media. Any time my account gathers enough upvotes I destroy it for another.
I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.
Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.
[+] [-] dimmke|3 years ago|reply
[+] [-] costco|3 years ago|reply
[+] [-] lettergram|3 years ago|reply
https://news.ycombinator.com/item?id=17944293
The approach I took was a bit different, but also no ML required.
The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.
It’s a very small space to try to compare so simple methods will work fine.
[+] [-] echelon|3 years ago|reply
I put in my username and found my pre-echelon alt, possibilistic.
(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)
[+] [-] unknown|3 years ago|reply
[deleted]
[+] [-] User23|3 years ago|reply
[+] [-] bb88|3 years ago|reply
[+] [-] FormerBandmate|3 years ago|reply
> sillysaurus2
Tbf a human could have found a bunch of them relatively easily
[+] [-] jll29|3 years ago|reply
Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).
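To make the topic problem concrete, here is a minimal sketch (not the site's code). It compares cosine similarity over all words with cosine similarity restricted to a small function-word list. The `FUNCTION_WORDS` set and the helper names are assumptions made up for this illustration; a real list would be much longer.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = {"the", "a", "of", "and", "to", "in", "is", "it",
                  "that", "but", "i", "for", "on", "with"}

def word_counts(text, vocabulary=None):
    """Count lowercase words, optionally keeping only those in `vocabulary`."""
    words = re.findall(r"[a-z']+", text.lower())
    if vocabulary is not None:
        words = [w for w in words if w in vocabulary]
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two snippets that share a topic but no function-word habits.
text_a = "the compiler and the borrow checker"
text_b = "compiler borrow checker lifetimes"

all_words = cosine(word_counts(text_a), word_counts(text_b))
style_only = cosine(word_counts(text_a, FUNCTION_WORDS),
                    word_counts(text_b, FUNCTION_WORDS))
```

Here `all_words` is driven entirely by the shared topic vocabulary, while `style_only` ignores it. Restricting to function words is one common way to reduce topic sensitivity, not the only one.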
Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).
This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...
[+] [-] sillysaurusx|3 years ago|reply
This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”
Sure, the privacy concerns are valid, but the cat’s out of the bag. Might as well enjoy the benefits.
montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567
Nicely done. One of the best hacks I’ve seen in a long time.
[+] [-] rcarr|3 years ago|reply
Excerpts from wiki:
> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]
> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]
> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]
https://en.m.wikipedia.org/wiki/Ted_Kaczynski
[+] [-] drc500free|3 years ago|reply
I appear to be a well-educated, over-confident know-it-all.
[+] [-] pavlov|3 years ago|reply
[+] [-] highwaylights|3 years ago|reply
This was certainly eye-opening.
Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.
[+] [-] bee_rider|3 years ago|reply
The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.
The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.
[+] [-] bhaney|3 years ago|reply
Don't we all?
[+] [-] reducesuffering|3 years ago|reply
I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.
[+] [-] closeparen|3 years ago|reply
[+] [-] seydor|3 years ago|reply
[+] [-] jsnell|3 years ago|reply
And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).
Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.
[+] [-] gus_massa|3 years ago|reply
I don't think the list of pg's alternate accounts is accurate. I checked a few. They have many one-liners, which are typical of pg, but the topics and style don't look similar.
I searched a few more and got better results. :)
I searched for myself (I know that I have no alternate accounts). I recognize a few users who are interested in similar topics, and I discuss with/upvote them many times. But I didn't recognize most of the users on the list.
[+] [-] costco|3 years ago|reply
It's based purely on the frequency of the 200 most common English 1-word phrases, 2-word phrases, 3-word phrases, 1-character sequences, 2-character sequences, and 3-character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
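For readers who want to see the shape of this, here is a minimal Python sketch of the six feature families listed above. It is not the site's actual code. Assumptions: the top-200 lists are built from whatever corpus you pass in (rather than a fixed English list), frequencies are normalized per feature family, and accounts are ranked by cosine similarity. The names `build_vocabulary`, `featurize`, and `most_similar` are made up for illustration.

```python
import math
import re
from collections import Counter

TOP_K = 200  # most common n-grams kept per feature family

def word_ngrams(text, n):
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Six feature families: word 1/2/3-grams and character 1/2/3-grams.
EXTRACTORS = [(kind, n, fn)
              for kind, fn in (("w", word_ngrams), ("c", char_ngrams))
              for n in (1, 2, 3)]

def build_vocabulary(corpus, top_k=TOP_K):
    """Pick the top_k most common n-grams of each family across the corpus."""
    corpus = list(corpus)
    vocab = []
    for kind, n, fn in EXTRACTORS:
        counts = Counter()
        for text in corpus:
            counts.update(fn(text, n))
        vocab.extend((kind, n, gram) for gram, _ in counts.most_common(top_k))
    return vocab

def featurize(text, vocab):
    """Relative frequency of each vocabulary n-gram within its own family."""
    counts, totals = {}, {}
    for kind, n, fn in EXTRACTORS:
        grams = fn(text, n)
        counts[(kind, n)] = Counter(grams)
        totals[(kind, n)] = max(len(grams), 1)
    return [counts[(kind, n)][gram] / totals[(kind, n)] for kind, n, gram in vocab]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(target, texts_by_user, vocab):
    """Rank every other user by cosine similarity to the target user."""
    target_vec = featurize(texts_by_user[target], vocab)
    scores = [(cosine(target_vec, featurize(text, vocab)), user)
              for user, text in texts_by_user.items() if user != target]
    return sorted(scores, reverse=True)

# Toy data with hypothetical usernames; each value is all of a user's comments joined.
texts = {
    "alice": "I think that it is probably fine, but I could be wrong.",
    "alice_alt": "I think that it is probably fine, but I could be wrong about that.",
    "bob": "LOL no!!! Never gonna happen!!!",
}
vocab = build_vocabulary(texts.values())
ranked = most_similar("alice", texts, vocab)
```

With real data you would concatenate each account's comments first. The ranking only becomes meaningful once each account has a reasonable amount of text.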
[+] [-] Fnoord|3 years ago|reply
I don't do throwaways. I either post or STFU. I also STFU on darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that on a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts or the people behind The Pirate Bay or the people behind Wikileaks (Assange's sockpuppet accounts). It could also have been used to verify afterwards instead of beforehand. Though with TPB, since it was on the clearweb, an advanced adversary could have used a correlation/timing attack to figure out who wrote what.
I'm having fun recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag that the person is native Dutch. I wonder if there are stylometry tools available for figuring out whether someone used a physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).
And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.
[+] [-] dlkf|3 years ago|reply
I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prepended with “>”?
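A minimal sketch of that tweak, assuming each quoted passage sits on its own line starting with ">" and that the raw text may be HTML-escaped (so ">" can show up as `&gt;`). The function name `strip_quoted_lines` is hypothetical.

```python
import html

def strip_quoted_lines(comment_text):
    """Drop lines that quote someone else, i.e. lines starting with '>'."""
    lines = html.unescape(comment_text).splitlines()
    kept = [line for line in lines if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()

cleaned = strip_quoted_lines("&gt; the cosine conflates topics\nI agree with this.")
```

This would run before feature extraction, so that only the commenter's own words are counted.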
[+] [-] bscphil|3 years ago|reply
Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.
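One way to sketch the "unique runs of words" idea: look for word n-grams that two accounts share but that almost nobody else in a background sample uses. This is an illustration under assumed names (`rare_shared_phrases`, `max_background_users`), not something the site is known to do.

```python
import re
from collections import Counter

def word_ngrams(text, n):
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def rare_shared_phrases(text_a, text_b, background_texts, n=3, max_background_users=0):
    """Return n-grams used by both authors and by at most
    `max_background_users` of the background authors."""
    shared = set(word_ngrams(text_a, n)) & set(word_ngrams(text_b, n))
    doc_freq = Counter()
    for text in background_texts:
        # Count each background author at most once per phrase.
        doc_freq.update(set(word_ngrams(text, n)) & shared)
    return sorted(g for g in shared if doc_freq[g] <= max_background_users)

phrases = rare_shared_phrases(
    "the cat's out of the bag and that is fine by me",
    "fine by me, honestly",
    background_texts=["that is fine"],
)
```

Phrases that survive the background filter are weak evidence on their own, but they are the kind of signal a plain frequency count over common words would miss.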
[+] [-] costco|3 years ago|reply
[+] [-] faeriechangling|3 years ago|reply
[+] [-] setr|3 years ago|reply
[+] [-] saurik|3 years ago|reply
[+] [-] dsr_|3 years ago|reply
I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:
"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."
which is theirs, not mine, from about a year ago. I like that.
On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.
[+] [-] costco|3 years ago|reply
This is due to the Firebase API not updating when users ask the admins to move their comments to another account.
[+] [-] lifeisstillgood|3 years ago|reply
[+] [-] DenisM|3 years ago|reply
As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.
[+] [-] davebillyhock|3 years ago|reply
[+] [-] unknown|3 years ago|reply
[deleted]
[+] [-] CharlesW|3 years ago|reply
[+] [-] serhack_|3 years ago|reply
[+] [-] costco|3 years ago|reply
[+] [-] super256|3 years ago|reply
They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]
I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).
Here's an excerpt:
"Attention government sponsors of cyber warfare and those who profit from it !!!!
How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."
[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...
[+] [-] spdustin|3 years ago|reply
(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)
Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
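A rough sketch of TF-IDF over POS-grams, assuming the text has already been run through some part-of-speech tagger, so each author is just a list of tags. The tag sequences below are toy data, the function names are made up, and the weighting is one plain TF-IDF variant (relative frequency times log(N/df)), not necessarily the one used in my model.

```python
import math
from collections import Counter

def pos_ngrams(tags, n):
    """Join consecutive part-of-speech tags into n-gram tokens."""
    return [" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def pos_tfidf(tagged_docs, n=2):
    """Map each author to {POS n-gram: tf * idf}, with idf = log(N / df)."""
    counts_by_author = {name: Counter(pos_ngrams(tags, n))
                        for name, tags in tagged_docs.items()}
    doc_freq = Counter()
    for counts in counts_by_author.values():
        doc_freq.update(counts.keys())
    num_docs = len(tagged_docs)
    weights = {}
    for name, counts in counts_by_author.items():
        total = sum(counts.values()) or 1
        weights[name] = {gram: (count / total) * math.log(num_docs / doc_freq[gram])
                         for gram, count in counts.items()}
    return weights

# Toy tag sequences (assumed output of a tagger, not real data).
tagged = {
    "a": ["DET", "NOUN", "VERB", "DET", "NOUN"],
    "b": ["DET", "NOUN", "VERB", "ADV"],
    "c": ["PRON", "VERB", "ADV"],
}
weights = pos_tfidf(tagged, n=2)
```

Patterns that every author uses get a weight of zero, so what remains are the construction habits that distinguish one writer from another.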
[+] [-] zxcvbn4038|3 years ago|reply
“Find hot single women who write just like you”
[+] [-] nrp|3 years ago|reply
[+] [-] forgotpwd16|3 years ago|reply
[+] [-] interroboink|3 years ago|reply
It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.
[+] [-] costco|3 years ago|reply