Show HN: Using stylometry to find HN users with alternate accounts

[+] sillysaurusx|3 years ago|reply

Wow. This gives a lot of false positives, but it found all ~10 of my old accounts over the years.

The most interesting thing is that my writing style changed pretty drastically since a decade ago. Searching for my oldest account matches my earliest usernames, whereas searching this account matched the rest.

The details of the algorithm are fascinating: https://stylometry.net/about Mostly because of how simple it is. I assumed it would measure word embeddings against a trained ML model, but nothing so fancy.

[+] hnburnerUixoHr5|3 years ago|reply

Woof.

I create new accounts on a semi-regular basis because I think cliques are the most corrosive factor to social media. Any time my account gathers enough upvotes I destroy it for another.

I had four accounts. None are over 50% confidence, but when I look at any one account the others are consistently #2, #3, and #4.

Now I’m thinking very carefully about what words I use to avoid linking this as the 5th account.

[+] dimmke|3 years ago|reply

On the other side of the coin, I have never had an alternate HN account (beyond maybe 1-2 throwaways with only one post or comment) so seeing the list of users that are most similar to me was interesting. I didn't see some stark similarities based on a quick peek at their comments, but it was interesting.

[+] costco|3 years ago|reply

Yeah top 20 is a little excessive because in my own tests I found that top 20 is only marginally more accurate than top 10. You can get a more academic explanation [here](https://www.tandfonline.com/doi/abs/10.1080/09296174.2011.53...). I was amazed too because it seemed too easy!

[+] lettergram|3 years ago|reply

Frankly similar to how I was doing it back in 2018 (when you and I chatted about it on HN lol)

https://news.ycombinator.com/item?id=17944293

The approach I took was a bit different, but also no ML required.

The real trick is pruning and going cross platform. There are around 100k active HN accounts (meaning posts a few times a year), maybe 200k if you count at least one post a year. But <10k that post weekly.

It’s a very small space to try to compare so simple methods will work fine.

[+] echelon|3 years ago|reply

It works like a charm for me too.

I put in my username and found my pre-echelon alt, possibilistic.

(Echelon was taken when I registered possibilistic, but it must have been unused and dropped.)

[+] unknown|3 years ago|reply

[deleted]

[+] User23|3 years ago|reply

I’d figured it would be some kind of n-gram frequency analysis. Would be interesting to code that up and compare.

[+] bb88|3 years ago|reply

sillysaurus3 was in mine. :) Clearly we're not the same.

[+] FormerBandmate|3 years ago|reply

> sillysaurus3

> sillysaurus2

Tbf a human could have found a bunch of them relatively easily

[+] jll29|3 years ago|reply

The method used, i.e. to calculate the cosine of the two authors' word vectors, is poorly suited for stylometric analysis because it is based on a poster's lexicon and the word frequencies of each word, but ignoring stylistically relevant factors like word order.
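For readers who want to see the point about word order concretely, here is a minimal sketch of cosine similarity over raw word counts. The tokenizer and the function names `word_counts` and `cosine` are assumptions made for illustration, not the site's actual code.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Same words, different order: the count vectors are identical,
# so the similarity is (up to floating point) 1.0.
s1 = word_counts("the cat chased the dog")
s2 = word_counts("the dog chased the cat")
print(cosine(s1, s2))
```

Because only counts enter the computation, any reordering of the same words scores as a perfect match, which is the limitation described above.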

Also, the cosine of the vectors of word frequencies conflates author-specific vocabulary and topics; in other words, my account is grouped (with >51% similarity, according to the demo) with someone probably because we wrote about similar things. A strong stylometric matcher ought to be robust against topic shifts (our personal writing style is what stays constant when we move from writing about one topic to writing about another topic, just like our personality is what stays constant about our behavior over time - of course styles do change, but the premise then has to be that such changes happen very slowly).

Stylometrics/authorship identification is interesting and has led to some surprising findings, e.g. in forensic linguistics (Malcolm Coulthard wrote several good books about the topic).

This paper lists some other features that could be used and compares a bunch of techniques: https://research.ijcaonline.org/volume86/number12/pxc3893384...

[+] sillysaurusx|3 years ago|reply

Ha, gruseom shows up for pg, which is dang’s old account. A worthy successor.

This is a fascinating way to find similar HN users who aren’t the same person. It’s a surprisingly great recommendation engine. “If you like pg, you might also like…”

Sure, the privacy concerns are valid, but the cat’s out of the boot. Might as well enjoy the benefits.

montrose is almost definitely pg. Someone who talks about ancient history, Occam’s razor, VCs and startups, uses the phrase “YC cos” (relatively uncommon), etc. https://news.ycombinator.com/item?id=17112567

Nicely done. One of the best hacks I’ve seen in a long time.

[+] rcarr|3 years ago|reply

This is somewhat similar to how they ended up catching the Unabomber. The FBI were literally at a dead end. They ended up posting one of his letters/manifestos in the paper, somebody recognised a turn of phrase the Unabomber used that was unusual and reported it as possibly being their brother, the FBI investigated the lead and it led them straight to him.

Excerpts from wiki:

> Before the publication of Industrial Society and Its Future, Kaczynski's brother, David, was encouraged by his wife to follow up on suspicions that Ted was the Unabomber.[91] David was dismissive at first, but he took the likelihood more seriously after reading the manifesto a week after it was published in September 1995. He searched through old family papers and found letters dating to the 1970s that Ted had sent to newspapers to protest the abuses of technology using phrasing similar to that in the manifesto.[92]

> In early 1996, an investigator working with Bisceglie contacted former FBI hostage negotiator and criminal profiler Clinton R. Van Zandt. Bisceglie asked him to compare the manifesto to typewritten copies of handwritten letters David had received from his brother. Van Zandt's initial analysis determined that there was better than a 60 percent chance that the same person had written the manifesto, which had been in public circulation for half a year. Van Zandt's second analytical team determined a higher likelihood. He recommended Bisceglie's client contact the FBI immediately.[96]

> In February 1996, Bisceglie gave a copy of the 1971 essay written by Ted Kaczynski to Molly Flynn at the FBI.[87] She forwarded the essay to the San Francisco-based task force. FBI profiler James R. Fitzgerald[98][99] recognized similarities in the writings using linguistic analysis and determined that the author of the essays and the manifesto was almost certainly the same person. Combined with facts gleaned from the bombings and Kaczynski's life, the analysis provided the basis for an affidavit signed by Terry Turchie, the head of the entire investigation, in support of the application for a search warrant.[87]

https://en.m.wikipedia.org/wiki/Ted_Kaczynski

[+] drc500free|3 years ago|reply

This is a super interesting tool for self reflection. Looking at the top 10 similar accounts to mine, it gives me an arms-length view of how other people probably interpret my tone.

I appear to be a well-educated, over-confident know-it-all.

[+] pavlov|3 years ago|reply

My #3 match is cstross, and now I’m convinced that my life-long secret dream of being a successful sci-fi novelist is basically a matter of typing. (Ideas? Character development? Ruthless editing? Developing an audience? Having a publisher? What do I need of those when the Computer told me I’m practically a genius…)

[+] highwaylights|3 years ago|reply

Same. Looking through some of the handles on my list tells me that I come across like a not-particularly-well-educated McSmug that needs to take a good long look at myself. Wouldn’t be so bad if I wasn’t reading the posts thinking I definitely could see myself writing this.

This was certainly eye-opening.

Update: It’s actually a little strange that reading through some of the matches it’s not just style that overlaps but perspectives in quite a few cases too. I’m definitely not the unique little snowflake that some others are finding themselves to be.

[+] bee_rider|3 years ago|reply

I also enjoyed reading one of my style-partner’s posts.

The most noticeable similarity is that we both clearly have strong opinions about some things, and like to share information, but also like to be clear about our unknowns or opinions. So, lots of “sounds likes,” “probably,” “could be” and so on.

The downside is, I guess, this could be seen as a bit weasel-word-y or indirect.

[+] bhaney|3 years ago|reply

> I appear to be a well-educated, over-confident know-it-all.

Don't we all?

[+] reducesuffering|3 years ago|reply

> over-confident know-it-all.

I’m pretty sure participation in HN is a 99% sure filter for being called this many times in one’s life.

[+] closeparen|3 years ago|reply

That's what we all come to HN for...

[+] seydor|3 years ago|reply

we must be a good match

[+] jsnell|3 years ago|reply

After a few tries on boring accounts, I thought to try the account of somebody who was notorious for an incident outside of HN, and had a (deservedly) bad time at HN for a couple of years before the account went dark.

And yeah, there's a bunch of high confidence (.6-.8) hits for that account, and from a quick browse of the comments of the recently active ones, they look really likely to be alts. Like, all three that I looked at had comments that made it very clear it was this person writing pseudonymously. (E.g. writing on their signature issue, and saying they couldn't go into more detail due to fear of self-doxxing; or somebody literally saying that the alt's claims reminded them of the public writings of the notorious guy years ago).

Obviously I'm not naming the account, but this functionality turned out way creepier than I thought the moment I tried it on the account of somebody who has a reason to disassociate from an existing public persona, but still wants to participate here.

[+] gus_massa|3 years ago|reply

It would be nice to make the names clickable.

I don't think the list of pg's alternate accounts is accurate. I checked a few. They have many one-liners, which is typical of pg, but the topics and style don't look similar.

I searched a few more and got better results. :)

I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.

[+] costco|3 years ago|reply

> I searched myself (I know that I have no alternate accounts). I recognize a few users that are interested in similar topics, and I discuss/upvote them many times. But I didn't recognize most of the users on the list.

It's based purely off frequency of the 200 most common English 1 word phrases, 2 word phrases, 3 word phrases, 1 character sequences, 2 character sequences, and 3 character sequences. Topic does not really have anything to do with it. If I had more time I probably would've done a smarter model that accounted for things like that.
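A rough sketch of that kind of feature extraction is below. It is an illustration under stated assumptions, not the site's code: the vocabulary is taken from the sample texts themselves rather than from a list of common English phrases, whitespace tokenization is assumed, and the names `ngrams`, `features`, `vocabulary` and `vectorize` are made up here.

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous windows of n items from a sequence, as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def features(text):
    """Count word and character 1-, 2- and 3-grams in one text."""
    words = text.lower().split()
    chars = list(text.lower())
    counts = Counter()
    for n in (1, 2, 3):
        counts.update(("w", g) for g in ngrams(words, n))
        counts.update(("c", g) for g in ngrams(chars, n))
    return counts

def vocabulary(texts, top_k=200):
    """Top-k most common n-grams in each of the six groups (assumed corpus-derived)."""
    total = Counter()
    for t in texts:
        total.update(features(t))
    vocab = []
    for kind in ("w", "c"):
        for n in (1, 2, 3):
            group = Counter({k: v for k, v in total.items()
                             if k[0] == kind and len(k[1]) == n})
            vocab.extend(k for k, _ in group.most_common(top_k))
    return vocab

def vectorize(text, vocab):
    """Relative frequency of each vocabulary n-gram in one text."""
    counts = features(text)
    total = sum(counts.values()) or 1
    return [counts[k] / total for k in vocab]

vocab = vocabulary(["the cat sat", "the dog sat"], top_k=2)
print(len(vocab), vectorize("the cat", vocab)[:3])
```

Each user's comments would be concatenated and passed through `vectorize`, and the resulting vectors compared pairwise.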

[+] Fnoord|3 years ago|reply

Cool stuff, thank you for sharing your findings!

I don't do throwaways. I either post or STFU. I also STFU on darknet. It's why I found it fun to read/lurk on things like I2P back when it was new. And I know that on a pseudonymous account it is only a matter of time until it can be linked to another pseudonymous account. It would not surprise me if stylometry was used on Dread Pirate Roberts or the people behind The Pirate Bay or the people behind Wikileaks (Assange's sockpuppet accounts). It could also have been used to verify afterwards instead of beforehand. Though with TPB, since it was on clearweb, an advanced adversary could have used a correlation/timing attack to figure out who wrote what.

I'm having fun times recognizing other Dutch people through their usage of the English language. For example, a distinctive word I see Dutch people use a lot is 'oke' instead of 'OK' or 'okay'. It's a red flag that the person is native Dutch. I wonder if there are stylometry tools available for figuring out if someone used a physical vs touchscreen keyboard (I used Glider to write this post, spellchecker unavailable).

And yes, organizations like secret service and police should use such tools as well. It is a known tool, why not use it for good? As with any tool, it can be used for good and evil. On HN this could be useful for the mod team (AFAIK nowadays only dang) to find banned people's sockpuppets. Cross-community could also be a fun project: find a HN user's Twitter or Reddit account. And I hope this method is also used to find Russian trolls on social media.

[+] dlkf|3 years ago|reply

The top hit on my list looked familiar. I looked at their recent comments and saw a discussion between that user and me. We were quoting each other directly throughout.

I wonder if this explains our similarity. And if so, could we tweak the algo by e.g. removing text that is prefixed with ">"?
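A minimal sketch of that tweak, assuming quoted passages follow the HN habit of starting a line with ">" (the helper name `strip_quotes` is hypothetical):

```python
def strip_quotes(comment):
    """Drop every line that starts with '>' so quoted text is not
    counted toward the commenter's own style."""
    kept = [line for line in comment.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept).strip()

print(strip_quotes("> I appear to be a know-it-all.\n\nDon't we all?"))
```

This only catches line-level quotes; inline quotations in quotation marks would still be attributed to the person quoting them.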

[+] bscphil|3 years ago|reply

The scary thing is that once you have this data, finding HN matches for individual targeted users on other sites becomes trivial, even if those sites are harder to scrape. I bet most people here have an anonymous Reddit account, for example. If you wanted to know who was behind a particular Reddit account, you could feed it into something like this and compare the results with HN, where accounts are less likely to be anonymous. Or build a database based on blogs, Github comments, etc.

Also, since this uses only word frequency, there are probably relatively easy improvements to make that would make it even more powerful, like looking at particular runs of words that are unique. Some expressions or figurative language only show up in combinations of words, and tend to be highly style specific.

[+] costco|3 years ago|reply

I could have used a part of speech tagger, looked at time of day a user posts, capitalization, spelling errors, etc. From what I understand the state of the art is lightyears ahead of this, there are even companies with actual linguists who will act as expert witnesses in court to say stuff like "we can say with 95% certainty that xyz authored this email." Honestly it's kind of scary. There are papers that talk about cross platform authorship attribution, one I think did it with Twitter, Blogspot, G+ and had pretty good results.

[+] faeriechangling|3 years ago|reply

Thus proving the only actually anonymous community in practice is 4chan, and that’s why it’s so toxic.

[+] setr|3 years ago|reply

Forget the alternate accounts — if two users are close in style, there’s a decent chance they should be friends. This is an HN friendship machine.

[+] saurik|3 years ago|reply

It would be convenient if the usernames linked to the comment pages on Hacker News (to avoid having to copy/paste and URL hack, which is made even slightly more annoying because for some reason when I tap and hold the usernames to copy them your markup--I haven't looked at why yet--is causing an extra space character to get copied on the left).

[+] dsr_|3 years ago|reply

This is interesting.

I'm 0.566 correlated with logfromblammo -- and while we are definitely not the same person, I could easily imagine writing a sentence such as:

"For some bizarre reason, management has not yet assigned a task to their programmer underlings to automated themselves out of existence. I can't imagine why."

which is theirs, not mine, from about a year ago. I like that.

On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.

[+] costco|3 years ago|reply

> On the other hand, I'm nearly as correlated with peterwwillis: 0.5485 -- who has no comments and no submissions.

This is due to the Firebase API not updating when users ask the admins to move their comments to another account.

[+] lifeisstillgood|3 years ago|reply

I had a similar experience finding my most likely alt (.50 suggesting I am a unique snowflake as I have always thought :-), my most likely alt is writing certainly in a style I appreciate and on subjects I often mention.

[+] DenisM|3 years ago|reply

How about this for countermeasure:

As you're typing out a comment the software gives you a list of accounts you're becoming similar to. That way you can adjust your writing as you type.

[+] davebillyhock|3 years ago|reply

This found an alt that I created specifically to see if I could write artificially to defeat this kind of analysis. I have seen other tools like it posted to HN, but none before had found that account. I guess I need to up my game.

[+] unknown|3 years ago|reply

[deleted]

[+] CharlesW|3 years ago|reply

If you don't mind sharing, are you "writing artificially" purely in your head, or are you using techniques like intermediate translations?

[+] serhack_|3 years ago|reply

[+] costco|3 years ago|reply

That post was actually what motivated me to make this. I'm on your email list :)

[+] super256|3 years ago|reply

Ahhh, anyone remember the hacking crew who leaked ETERNALBLUE and other NSA tools and exploits? Shadowbrokers.

They were always communicating in some kind of meme-russian, and their texts were funny to read. [1]

I believe their writing mostly defeated this kind of analysis, at the cost of looking like idiots (which was probably the reason no one sent them crypto-dollars to buy that stuff exclusively).

Here's an excerpt:

"Attention government sponsors of cyber warfare and those who profit from it !!!!

How much you pay for enemies cyber weapons? Not malware you find in networks. Both sides, RAT + LP, full state sponsor tool set? We find cyber weapons made by creators of stuxnet, duqu, flame. Kaspersky calls Equation Group. We follow Equation Group traffic. We find Equation Group source range. We hack Equation Group. We find many many Equation Group cyber weapons. You see pictures. We give you some Equation Group files free, you see. This is good proof no? You enjoy!!! You break many things. You find many intrusions. You write many words. But not all, we are auction the best files."

[1] https://archive.ph/20160815133924/http://pastebin.com/NDTU5k...

[+] spdustin|3 years ago|reply

Have you tried including parts of speech (for example, as bigrams and trigrams) as part of the features considered in your model? I’ve had great success with stylometry that goes beyond TF-IDF with bags of words; including grammar patterns was shockingly good.

(FWIW, it didn’t find my throwaways; my own model didn’t, either, because I knew that word choice wasn’t enough to avoid being outed by stylometry)

Edit: by bigrams and trigrams, I mean reducing words to their part-of-speech labels and using THOSE as word tokens. You’ll find that native English speakers have higher weights on some phrase construction patterns than, say, folks from Romania. TF-IDF is useful for these POS-grams (just made that word up) as well.
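To make the POS-gram idea concrete, here is a toy sketch. The lookup table `TOY_TAGS` is a hypothetical stand-in for a real part-of-speech tagger, and the function names are made up for illustration; none of this is the model described above.

```python
from collections import Counter

# Hypothetical toy lexicon; a real system would use a trained POS tagger.
TOY_TAGS = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "tool": "NOUN",
    "chased": "VERB", "found": "VERB",
    "quickly": "ADV",
}

def pos_tags(text):
    """Replace each word with its part-of-speech label ('X' if unknown)."""
    return [TOY_TAGS.get(w, "X") for w in text.lower().split()]

def pos_ngrams(text, n):
    """Count n-grams over the tag sequence instead of over the words."""
    tags = pos_tags(text)
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Different vocabulary, same grammatical pattern: the POS bigrams match.
a = pos_ngrams("The cat chased the dog", 2)
b = pos_ngrams("A dog found a tool", 2)
print(a == b)
```

Since the words themselves are discarded, two texts on unrelated topics can still produce the same tag n-grams, which is what makes these features less sensitive to subject matter.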

[+] zxcvbn4038|3 years ago|reply

How long until this becomes the algorithm for a dating site?

“Find hot single women who write just like you”

[+] nrp|3 years ago|reply

This seems like a great way to hire freelance copywriters/ghost writers too. I would absolutely hire someone I knew could match my tone well for writing generic unattributed copy.

[+] forgotpwd16|3 years ago|reply

Wouldn't be surprised if dating sites already used similar algorithms.

[+] interroboink|3 years ago|reply

This is one reason why I like legal doctrines such as "beyond a reasonable doubt." Even a 0.9 match in a tool like this could be a coincidence, if there are millions of users. But that won't stop people from casually believing "aha it must be an alt account", based on some anecdata.
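A back-of-the-envelope sketch of that base-rate problem follows. The numbers are purely illustrative assumptions, not measurements of this tool, and the formula is a ratio-of-expectations approximation that assumes exactly one real alt exists among the candidates.

```python
def posterior_true_alt(n_candidates, false_positive_rate, true_positive_rate=1.0):
    """Approximate chance that a flagged account is the real alt,
    assuming exactly one real alt among n_candidates accounts."""
    expected_false = (n_candidates - 1) * false_positive_rate
    return true_positive_rate / (true_positive_rate + expected_false)

# Assumed figures: about a million candidate accounts, and a 1-in-100,000
# chance that an unrelated account clears the match threshold.
print(posterior_true_alt(1_000_001, 1e-5))
```

Under those assumed figures, roughly ten unrelated accounts would clear the threshold alongside the one real alt, so a single high score on its own would be right only about one time in eleven.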

It's so easy for something like this to be turned into a tool for a witch hunt, targeting innocents.

[+] costco|3 years ago|reply

But a 0.8 or 0.9 match and something like Tor usage could be enough to justify a warrant. That's why I'm not sure I want to open source the code because I don't want to normalize this.

511 comments