"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was nearly 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in the technologies that enhance and enable them.
i think the age of (pseudo-)anonymous internet browsing will be over soon, certainly within my lifetime (and i'm not that young!). it might come by regulation, it might come by the nature of dragnet surveillance plus de-anonymization, or by a combination of both. but i think it will be a chilling time.
That's a great background paper on the Netflix attack; we make a pretty direct comparison in section 5, and we try to use similar methods for comparison in sections 4 and 6. In section 5 we transform people's Reddit comments into movie reviews with an LLM and then see whether LLMs beat Narayanan's method purely on movie reviews. LLMs are still much better, getting about 8%, though the average person only had 2.5 movies and 48% only shared one movie, so matching is very difficult.
We don't need everyone to be completely anonymous to state and corporate actors. We just need to make it so that they can't identify and surveil everyone at once, because it would be too expensive.
The US defense budget is about $1T dollars. They can't spend it all on surveillance, but let's say tech companies + gov spends about this amount per year on surveillance in total. If we can raise the cost to surveil the average person to over $10K/yr, they just lose. This is very doable.
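The back-of-envelope math above can be written out in a few lines. The figures are the comment's own rough estimates, not real budget data:

```python
# Back-of-envelope: the comment's rough estimates, not real budget figures.
total_budget = 1_000_000_000_000   # ~$1T/yr assumed total surveillance spend
cost_per_person = 10_000           # target cost to surveil one person for a year
population = 330_000_000           # approximate US population

# At $10K/person/yr, the whole budget covers at most 100M people.
max_surveilled = total_budget // cost_per_person
print(max_surveilled)                      # 100000000
print(max_surveilled / population < 1/3)   # True: under a third of the population
```

So if every person's precautions push the per-person cost to that level, blanket surveillance of the whole population stops being affordable.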
Every little precaution you take will raise the cost, probably more than you think. Every open-source project that aims to anonymize and decentralize is an arrow in their knee. They're hoping that you'll get cynical and stop trying because they don't stand a chance otherwise.
> Does privacy of Netflix ratings matter? The issue is not “Does the average Netflix subscriber care about the privacy of his movie viewing history?,” but “Are there any Netflix subscribers whose privacy can be compromised by analyzing the Netflix Prize dataset?”
Throwaway accounts using "clever" turns of phrase can often be de-anonymized by double-clicking the witty pun, right-clicking to google it, and finding the sole other instance of it elsewhere, on Twitter, Facebook, etc.
If I see a couple of words I don't know in a row, I can infer a poster's real name.
I'd be more specific, but any example is doxxing, literally so.
Many years ago (early 2000s) I worked for a firm that would help identify people who were doing "pump and dump" stock scams on Yahoo Finance message boards.
Step 1 was to scrape all of their posts into a database.
Step 2 was to have a human analyst review all of the posts for clues about who that person was
It was amazing that you could easily figure out:
- if they were at work or home from when they posted (9am to 5pm vs 6pm to 1am)
- what city they were in (based on sports teams, mentioning local landmarks, etc.)
- roughly what career they had
- their age based on cultural references
and mostly b/c they would drop a crumb of information here and there over months. They probably forgot about all of these individual events, but when you read all of the posts in a few hours, the details became pretty evident. Get enough of these details and you can start to Venn-diagram people down to a few hundred likely candidates, then use LexisNexis-style tools to narrow it down even further.
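That Venn-diagram step is just set intersection over leaked attributes. A toy sketch (every name and record here is hypothetical):

```python
# Hypothetical candidate pool, as a LexisNexis-style tool might return it.
candidates = [
    {"name": "A", "city": "Chicago", "age_range": (35, 50), "career": "finance"},
    {"name": "B", "city": "Chicago", "age_range": (20, 30), "career": "finance"},
    {"name": "C", "city": "Denver",  "age_range": (35, 50), "career": "finance"},
]

# Crumbs accumulated over months of posts: landmarks, cultural references,
# posting hours, career hints.
clues = {"city": "Chicago", "age": 42, "career": "finance"}

# Each clue intersects away part of the pool.
matches = [
    c for c in candidates
    if c["city"] == clues["city"]
    and c["age_range"][0] <= clues["age"] <= c["age_range"][1]
    and c["career"] == clues["career"]
]
print([c["name"] for c in matches])  # ['A']
```

Each additional crumb is another filter, which is why details that seem harmless on their own become identifying in aggregate.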
Given the above, it doesn't surprise me that LLMs can do the same but at high speed and across multiple sites etc.
I recently decided to play around with this, given... well, my profile... and I will say that Gemini was good at zeroing in on who I was, but for whatever reason it would refuse to say my name.
This is exactly why local inference matters. Every query you send to a cloud API is another data point. Your prompts contain your code, your logs, your thought process — arguably more identifying than your HN comments.
The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.
Air-gapped local inference isn't paranoia. It's necessary.
Combine this with the fact that even the private mode of any AI provider still keeps logs of the chats and from some past discussion iirc, will keep it indefinitely.
> Air-gapped local inference isn't paranoia. It's necessary.
I definitely agree. I am seeing new models like qwen-3.5-30A3b (iirc) that can be run reasonably on normal hardware (you can buy a Mac mini whose price hasn't been inflated) and get decent tok/s from a decent model overall.
There are some services like Proton Lumo, the service by Signal, and Kagi's AI which seem to try to be better, but long term my plan is to buy a Mac mini for this level of inference for basic queries.
Of course, in the meanwhile, for something like coding it might not make too big a difference whether you use a local model or not, except for the most extremely sensitive work (perhaps govt/bank oriented).
I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it. I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration. I don't know what it will be but I would expect some adversarial stuff. Trying to keep clean is what I'd prefer for myself and my kids.
On the other hand, Neal Stephenson's Fall; or, Dodge in Hell has an interesting idea early in the book, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core. It is cleverly explored in the book, albeit for too short a time before it moves into the virtual reality. I think there are a few people out here right now practicing this.
> I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.
I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.
Do you want culture to be frozen and instant digital communication with anyone else in the world to become a privilege of the few? Because that's where "clean" leads. And all you get is a little bit of temporary safety.
Here's a different vision for the future:
Let information filtering become each individual's own responsibility. We have LLMs now, and they'll get more efficient, so why not use them locally to filter incoming feeds according to each of our own preferences, while removing all of the filtering/moderation on posting info out? Build systems to decentralize and anonymize the Internet so that people can discover anyone and aren't afraid to post anything. Make it so that everyone can get a message out to the world and nobody can be arrested or assassinated for it. This would put an end to most violent conflicts, because they'd be replaced by online discourse.
Let the Internet be flooded with trash and gold at the same time. Let each individual decide what info is/isn't valuable to them. Let those individuals self-organize. Let ideas compete freely, so that the best ones may prevail.
I have lived my life on the web under the assumption that the other Tom Clancy will leave enough chaff in my wake to make things hard. But probably not, because I make the same 5 or 6 jokes over and over.
>I post under my real name here, pretty much the only place I post. It keeps me honest and straight in what I say when I choose to say it.
I do the same thing, and I think I'm a much better person for it. The Internet is not, in my final analysis, some indiscriminate dumping ground for my personal issues and moods. It's a place where I can relax and practice putting forward a more prosocial form of myself, even when what I actually have to say is uncomfortable.
While we can't predict how the adversary will read and respond to our moves, I suspect the easier marks are the people who choose to publicly drench everything they touch in negativity and cynicism. It's a sign of an already compromised social immune system.
I view posting online with a real name like getting a permanent tattoo.
My values or priorities may significantly change over decades, especially as a child, so why would I want to jeopardize the reputation of a potential future identity with something I may post today?
I am similar in that all of my interactions are with my real name and it is unique enough that just putting it into google will instantly identify me. There is one other 'jeff sponaugle' but I think he is far more annoyed with my presence than I would be with him.
On the plus side, someone will sometimes say while talking to me, "oh, you're that Subaru guy," or that YouTube guy, or whatever, and that is a fun connection.
I've come to a similar conclusion. I now almost exclusively post under my real name online, and before writing something, I ask myself whether it's something I'd say to a person's face and whether I'm comfortable being quoted on it. If not, I look for a more neutral, stronger version of the argument I'm trying to make (stronger, as in strong enough to stand without rhetorical devices or fallacies), or, I qualify the statement as an opinion or something I consider to be a possibility.
> I tried talking to my children about leaving as clean of a footprint on the internet as one can in anticipation of future people/systems taking that into consideration.
You don't know what information about you can get you in trouble in the future.
Data poisoning your own online profile is all nice and well. But in a society that goes out of its way to cram AI into about every imaginable system, it may not be smart at all. Already in the early-adopter phase, the average person gives way too much authoritative weight to what LLMs come up with. If complex societal processes become basically AI-driven, you may get into a world of hurt: "I am sorry, we can't give you that passport right now, until we investigate the potentially fraudulent behavior our AI flagged you for."
Yes, it's basically data poisoning. It reminds me of the approach the AdNauseam extension takes. It hides ads from you like traditional adblockers, but under the hood it's actually selectively clicking them to fool advertisers. I don't know if it's smart enough to create a "profile" for you (e.g. "soccer mom from Michigan") but that seems like the logical next step. Instead of just "flooding the zone with shit" you'd be more selectively/consistently misleading.
> Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody understands the core.
I don’t think this is humanly possible against machine learning. After all, it is specifically designed to weed through noisy data and identify patterns. It may delay discovery, but will at some point easily fall apart, by something as simple as a “filter out shitposting and deliberate pollution” prompt. Even more so when you guide it towards specific attributes.
I think as the younger generations come of age they simply will not care about that sort of thing. Like it or not, it's part of the culture and might just be accepted as the norm.
Autonomous Proxies for Execration - spam bots whose entire purpose is flooding the internet with spam so as to make identifying anything true utterly impossible. If you can't differentiate between real and unreal information in online comments, then online comments stop being a significant factor in shaping public opinion. You need to abstract - identify reliable sources of information, individuals or institutions that do the work to collect and curate.
We're already seeing this as a side effect of the mishmash of influence operations on social media. With so many competing interests, mixed in with real trolls, outrage farmers, grifters, and the like, you literally cannot tell without extensive reputation vetting whether or not a source is legitimate. Even then, at any suggestion that an account might be hacked or compromised (say, a significant sudden deviation in style, tone, or subject matter), you have to balance everything against a solid model of what's actually behind probably 80% or more of the "user" posts online.
There are a lot of aligned interests causing APEs to manifest - they're a mix of psyop style influence campaigns, some aimed at demoralization, others at outrage engagement, others at smears and astroturfing and even doing product placement and subtle advertisement. The net effect is chaos, so they might as well be APEs.
Fifteen years or so ago I read an article arguing that by the time Millennials are nearing retirement and have more political power, people will give less of a shit about what you did online in your twenties because we will have, out of necessity, learned that asshattery in your twenties is largely irrelevant to your trustworthiness in your sixties.
When I was that age, you could tell which kids had political ambitions because they self-censored online. But now everyone is buck wild, so you have to ignore that when evaluating people.
For example, a MASSIVE portion of Millennials and younger looking at the Maine election are pretty chill about the leading Democratic candidate having a Nazi tattoo because of this very thing. Basically: "dumb, drunk, deployed Marines will get cool skull-and-crossbones tattoos in their early twenties, and so what if he said a couple of ill-worded, somewhat misogynistic things in his twenties; that was decades ago, and he's obviously a different person."
Contrast with Bill Clinton, where he literally had to explain away university marijuana usage TWENTY YEARS AFTER THE FACT.
Point is, I think we're witnessing this evolution happening right now.
I tried this today with this username and other usernames, on this and other platforms, with Claude Code.
- First it told me it couldn't do this, that this was doxxing
- I said: it's for me, I want to see if I can be deanonymized
- Claude said: oh ok, sure, and proceeded to do it
It analyzed my profile contents and concluded that there were likely only 5 - 10 people in the world that would match this profile (it pulled out every identifying piece of information extremely accurately). Basically saying: I don't have access to LinkedIn but if I did I could find you in like 5 seconds.
Anyway, like others have said: this type of capability has always been around for nation state actors (it's just now frighteningly more effective), but e.g. for your stalker? For a fraudster or con artist? Everyone has a tremendous unprecedented amount of power at their fingertips with very little effort needed.
I'm not sure the practical implications are as dramatic as the paper suggests. Most adversaries who would want to deanonymize people at scale (governments, corporations) already have access to far more direct methods. The people most at risk from this are probably activists and whistleblowers in jurisdictions where those direct methods aren't available, not average users.
People who comment about their boss and workplaces?
People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)
Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.
On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.
I actually think those most at risk are normal people the activists will harass. Soon it will be possible for anybody who works at the “wrong” business or expresses any opinion on any subject to be casus belli for unhinged, terminally online, mentally ill people who are mad about the thing of the day to start making threatening calls to your employer or making false reports to police or sending deep fake porn to your mom.
I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.
Attacks can be chained, and this can all be automated. For example, imagine pig-butchering scams, except they exist, similar to some voice-cloning scams, just to get enough data to stylometrically fingerprint you for future reference. You make sure never to comment too much, or too spicily, under your real name, but someone slides into your DMs with a thoughtful, informative, high-quality comment; you politely strike up an interesting conversation which goes well, think nothing of it, and have forgotten it a week later. And 5 years later you're in jail, or fired, or doxed, or framed. 'Direct methods' can't deliver that kind of capability post hoc, even for actors who do have access to those methods (which is a vanishing percentage of all actors). No one has cheap enough intelligence and skilled labor to do this right now. But they will.
I can imagine a lot of countries that want to control what their citizens say abroad. I know Iraq did it in the UK in Saddam Hussein's time; China does it now.
You're right that it's nothing new given a trail of info, but here they didn't need to do classical feature engineering, just a pure LLM (agentic) flow. Given how much information is self-exposed online, I am not surprised this is made easier with LLMs. The interesting application is identifying users with multiple usernames on HN or Reddit.
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook, and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government as a result. But I will continue to criticize them; I just wish it wasn't so easy for them to retaliate.
Despite being pseudonymous, I don’t take great pains to hide who I am. I am in my 50s and live on the West coast. I don’t have socials and I don’t post anywhere else. Have at it!
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there’s limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
Kind of short-sighted to only consider social cancellation. People in power change, and laws get applied retroactively. History is full of people who got purged over stuff that was fine when it was written.
Unless you're in the nebulous situation of being Hispanic in the US, in which case you might get profiled. Or you might have family with jobs that are subject to pressure -- and right now, that seems like most jobs, because calling employers spineless is an insult to worms. Or if you'd like to travel by air, because watchlists are back, and carriers may just refuse service.
As people will point out, the OSINT techniques described are nothing new; typically, in the past, you could de-anonymize based on writing style or niche topics/interests. Total deanonymization can occur if any of these accounts link to profiles containing pictures of their faces, which can then be web-searched to link to a real identity. It's astounding how many people re-use handles on stuff like porn sites that are very easily linked to their IRL identity.
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that what would take a human investigator a fair bit of time, even using common OSINT tooling, will become trivial.
You should never assume you have total anonymity on the open web.
If LLMs can identify a person across websites, I can ask an LLM to read up on his posts and write like him, impersonating him, and this then feeds back into the tools identifying him. I can probabilistically malign a person this way.
I think the implication is this will become trivial and trivially automated, no human investigator needed. I bet within a year there will be plugins that let you right-click on a post and get a full report on who the author is.
everyone in the comments is talking about stylometry and rewriting your posts with LLMs. the paper barely uses stylometry. the attack surface is semantic: your interests, your city, the conference you mentioned once 2 years ago. you can't rewrite your way out of having said you work in fintech in austin and own a golden retriever.
you can intentionally add false biographical information. what if you had a bot posting responses on your account in subreddits for cities across the world?
I want to use "slower" methods of identification more. Say, for instance, a human within a few blocks of you can verify who you are for any service that wants some kind of proof that you are/have XYZ.
We could designate specific individuals to do this for you and me, just like we do today with trust authorities for website certificates.
No more verified profiles by uploading names, emails, passports, and photographs (gosh!). Just turned 18 and want to access Insta? Go to the local high school teacher to get age-verified. Finished a career path and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically, with no PII exchanged between the person, the community, or the webservice. And you can be anonymous, yet people know you are real.
It can all be maintained on a tree of trust: every individual in the chain needs to be verified, and only designated individuals can perform sensitive/important actions.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
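A minimal sketch of the no-PII attestation idea above. HMAC stands in for the public-key signatures (e.g. Ed25519) a real tree of trust would use, purely to keep this stdlib-only; all names here are hypothetical:

```python
import hashlib
import hmac
import secrets

# Key held by the trusted local verifier (teacher, notary, company officer).
# In a real system this would be a private signing key, and services would
# check signatures against the verifier's public key instead of sharing it.
verifier_key = secrets.token_bytes(32)

def issue_attestation(claim: str) -> tuple[str, bytes]:
    """Verifier meets you in person, then signs a bare claim. No PII inside."""
    nonce = secrets.token_hex(16)  # fresh per token, so tokens are unlinkable
    msg = f"{claim}:{nonce}".encode()
    return nonce, hmac.new(verifier_key, msg, hashlib.sha256).digest()

def service_accepts(claim: str, nonce: str, tag: bytes) -> bool:
    """Webservice checks the tag; it never learns who you are."""
    msg = f"{claim}:{nonce}".encode()
    expected = hmac.new(verifier_key, msg, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

nonce, tag = issue_attestation("over_18")
print(service_accepts("over_18", nonce, tag))   # True: the claim verifies
print(service_accepts("is_adult", nonce, tag))  # False: token covers only its claim
```

The point of the sketch: the only thing that crosses the wire is "some trusted verifier vouches for this one claim", not a name, email, or photograph.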
I bet we're about to see a reduction in online public communication. Count how many times you have felt the desire to share your knowledge or correct someone online (aka somebody is WRONG on the internet). People will stop doing that, just to avoid training some big-corp model on their knowledge. Artists are already unhappy about this, and there are many other types of expertise people will stop sharing.
Doesn’t all this deanonymization stuff depend on one fatal assumption: that people are actually being truthful with what they say about themselves?
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
I feel like this is one of those products OpenAI et al are quietly perfecting. Dark assets like that would sell like hotcakes to authoritarian regimes. That would explain how they eventually plan to reach profitability.
Somebody I know irl has figured out I'm me here on Hackernews, based on the fact that my writing style here matches my verbal style. Fingerprinting people based on their words is one of the things I actually expect LLMs to be really absurdly good at.
Indeed, fears about deanonymization are a reaction to three structural shifts: the cost of analysis has plummeted, the volume of stored data has increased dramatically, and models have become better at identifying patterns that humans miss, making it impossible for interested parties not to take advantage of this. But the conclusion isn't that "anonymity is dead." The conclusion is that anonymity is no longer a guaranteed technical property. It's becoming a behavioral skill that can be developed.
Clearly the CIA or some other gov institution. Its purpose is to create an irresistible honeypot, so that anyone who figures out a working, time-feasible implementation of Shor's algorithm or another prime-factorization technique would reveal their hand.
If LLMs can deanonymize at scale, then on a personal level you should also be able to figure out which of your posts are leading to the deanonymization and remove or modify them.
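One way to sketch that audit is leave-one-out scoring: drop each post and see how much an attacker's identification score falls. The scorer and keyword list below are made up for illustration; a real audit would re-run an actual identification model:

```python
# Stand-in attacker model: counts "leaky" keywords. A real audit would call
# whatever identification pipeline you are defending against.
LEAKY = {"austin", "fintech", "golden", "retriever"}

def identification_score(posts):
    return sum(1 for p in posts for w in p.lower().split() if w in LEAKY)

posts = [
    "i work in fintech in austin",
    "my golden retriever loves the park",
    "what a great game last night",
]

baseline = identification_score(posts)

# Leave one post out at a time: how much does removing it lower the score?
impact = {
    p: baseline - identification_score([q for q in posts if q != p])
    for p in posts
}
worst = max(impact, key=impact.get)
print(impact[worst])  # 2: the most-leaky post contributes two identifying clues
```

Posts with the largest impact are the ones worth editing or deleting first; posts with zero impact can stay.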
What's wild to me is that people worry about writing style fingerprinting while casually uploading their literal DNA to consumer genomics companies. 23andMe went bankrupt and suddenly 15 million people's most identifying data imaginable is an asset in a fire sale.
Your writing style can theoretically be masked with an LLM. Your genome can't. And it doesn't just identify you -- it identifies your relatives, your disease risks, your ancestry, things you might not even know about yourself yet. The deanonymization vector here is permanent and irrevocable in a way that no amount of OPSEC can fix after the fact.
The semantic approach in this paper (interests, clues, behavioral patterns) is scary enough. Now imagine combining that with leaked genetic data. You don't even need to match writing styles when you can match someone's 23andMe profile to their health subreddit posts about conditions they're genetically predisposed to.
Information leaks everywhere, and as the ability to process it increases, I think it will ultimately lead to a world where there are no secrets, provided one has the resources and intention to look for something.
For a few years now I have been telling people how unprepared the world is for this change. Not understanding how this is possible will lead to people outright deifying AI that has the capability to do things like this. It will seem like omniscience.
I think the main protection we have in a world where you cannot effectively hide, is that anyone who abuses this ability will be operating under the same system. You can use it to your advantage, but not without getting caught.
I worked on a de-anonymiser in the 90s for identifying banned users and banning their newly created ban-avoidance accounts. It worked based on triplets of words, and it worked surprisingly well, so this does not surprise me.
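The word-triplet idea can be sketched in a few lines; this is a guess at the general shape (Jaccard overlap of word trigrams), not the original system:

```python
def trigrams(text: str) -> set[tuple[str, ...]]:
    """All consecutive word triplets in a text, case-folded."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word-trigram sets: shared triplets / all triplets."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

# Toy corpus: a banned user's old post, a suspected new account, a bystander.
banned = "you people clearly have no idea what you are talking about here"
new_account = "honestly you people clearly have no idea what you are saying"
unrelated = "the weather in the mountains was lovely this weekend"

print(similarity(banned, new_account) > similarity(banned, unrelated))  # True
```

Word triplets are distinctive enough that people carry their pet phrasings from account to account, which is why even this crude measure catches ban evaders.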
The obvious retort is to just use an AI to rewrite everything you post, but this will open other attack vectors.
Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.
Additionally, you can open up copilot.microsoft.com (or whatever) and ask it to summarize any Reddit user's (and presumably HN user's) posts [0]. Not just the content, but their emotional state (without prompting).
[0] Note: last I tried this was months ago; things may have changed.
Yeah, I have been thinking it's time to scale back online engagement, given that the US both has access to everyone's data and is pivoting to a, ahem, different style of country.
The real-world benchmark approach is the right direction. Most agent evals I've seen test for task completion on clean inputs. That's not how production use looks.
What tends to break agents in the wild: ambiguous instructions that have multiple valid interpretations, state that changes mid-task, and error recovery when a sub-step fails silently rather than loudly.
The hardest thing to benchmark is graceful degradation. A good agent should know when to stop and ask for clarification rather than confidently completing the wrong task.
The best course of action to combat this correlation/profiling seems to be using a local LLM that rewrites your text while keeping the meaning untouched.
We don't use (much) stylometry, so this won't help. It is totally something you could try, but we use interests and clues: semantic information you reveal about yourself.
I don't think it works any more, but there was a stylometric analysis of HN users a few years ago, and it was extremely effective (at least for myself and the people who felt the need to post in the comments): https://news.ycombinator.com/item?id=33755016
There is also a practical issue here: people usually don't write a lot on LinkedIn; most people just have structured biographical information. We use very limited stylometry in section 6 for matching Reddit users whom we synthetically split according to time.
L33tsp34k also accomplishes this. The original anonymising hacker stylometry :)
I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.
Maybe only your close friends hear your real voice?
> The best course of action to combat this correlation/profiling, seems to be usage of a local llm that rewrites the text while keeping meaning untouched.
A problem with that is that your post may then read like LLM slop and get disregarded by readers.
We test different methods; in section 2 we use LLM agents to agentically identify people. We don't share any code here, but you could try various freely available agents on yourself.
I remember there being a previous post about stylometric analysis of HN accounts, and people confirmed the top account correlations. It basically identified all the HN alt accounts.
That's honestly quite terrifying. If you're posting somewhere else under a pseudonym, this technology can get you doxxed. The safest thing to do is to not participate in communities at all. Avoid posting, avoid social interactions, just be a ghost. The future is bleak.
Maybe I missed something, but I see little evidence that there is a concerning ability to deanonymize. Many people post under a pseudonym but then link to their GitHub etc.
In fact by construction the HN dataset _only_ consists of people who are comfortable with their real identity being linked to it.
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
Everyone should really stop posting online unless their job requires it.
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless obnoxious shadows of their helpful and loving self.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
What this tells me is that major social media sites, some of which claim to be developing frontier models, have no excuse for bots waging influence campaigns on their sites.
We do advocate for stricter controls on data access on social platforms because of this. There is a bit of an unfortunate trade-off, but I think allowing mass-scraping or downloads of data from social sites can be misused in increasingly more ways.
Could another mitigation be polluting online identities with fake ones, so that real identities become hard to sift out?
For example if I tell my bot to clone me 100x times on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.
I hate to use this reference, but like the citadel from Rick and Morty.
We use semantic information inferred from comments and submissions. I think stylometry would be a great addition, but it would be hard to google for "guy who writes fancifully using many puns" rather than "indie developer in Switzerland". I think stylometry could be better used for verification: once you have a small set of candidates, stylometry could further narrow them down and be used to make a decision.
so if they put their linkedin account on their HN account, we can figure out who they are.... genius stuff, AI really is changing the landscape all right
To be clear, we are making a concession here that the people weren't truly anonymous. But we did use an LLM to remove any identifying information from HN, making them quasi-anonymous; this is described further in Appendix Table 2.
We also ran a more realistic test in section 2. There we use the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews our agent identified 9/125 people based on clues.
john_strinlai|5 days ago
i like to introduce students to de-anonymization with an old paper "Robust De-anonymization of Large Sparse Datasets" published in the ancient history of 2008 (https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf):
"We apply our de-anonymization methodology to the Netflix Prize dataset, which contains anonymous movie ratings of 500,000 subscribers of Netflix [...]. We demonstrate that an adversary who knows only a little bit about an individual subscriber can easily identify this subscriber’s record in the dataset."
and that was 20 years ago! de-anonymization techniques have improved by leaps and bounds since then, alongside the massive growth in various technology that enhances/enables various techniques.
i think the age of (pseudo-)anonymous internet browsing will be over soon. certainly within my lifetime (and im not that young!). it might be by regulation, it might be by nature of dragnet surveillance + de-anonymization, or a combination of both. but i think it will be a chilling time.
DalasNoin|5 days ago
txrx0000|4 days ago
The US defense budget is about $1T dollars. They can't spend it all on surveillance, but let's say tech companies + gov spends about this amount per year on surveillance in total. If we can raise the cost to surveil the average person to over $10K/yr, they just lose. This is very doable.
Every little precaution you take will raise the cost, probably more than you think. Every open-source project that aims to anonymize and decentralize is an arrow in their knee. They're hoping that you'll get cynical and stop trying because they don't stand a chance otherwise.
mtone|4 days ago
Well said.
c22|4 days ago
Jerrrrrrrry|5 days ago
If I see a couple words I don't know in a row, I can infer a poster's real name.
I'd be more specific but any example is doxxing, literally so
user3939382|4 days ago
alexpotato|4 days ago
Step 1 was to scrape all of their posts into a database.
Step 2 was to have a human analyst review all of the posts for clues about who that person was
It was amazing that you could easily figure out:
- if they were at work or home from when they posted (9am to 5pm vs 6pm to 1am)
- what city they were in (based on sports teams, mentioning local landmarks, etc.)
- roughly what career they had
- their age based on cultural references
and mostly b/c they would drop a crumb of information here and there over months. They probably forgot about all of these individual events, but when reading all of the posts in a few hours, the details became pretty evident. You get enough of these details and you can start to venn-diagram people down to a few hundred likely candidates, and then use LexisNexis-style tools to narrow it down even further.
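The venn-diagram narrowing described above is just repeated set intersection, and it shrinks the pool fast. A toy sketch (the population and the attribute sets are entirely hypothetical):

```python
# Each inferred clue (city, career, age band) maps to a set of matching
# people; intersecting a few such sets collapses a large pool quickly.
population = {f"person_{i}" for i in range(1000)}
people = sorted(population)

# Hypothetical candidate sets implied by each clue (made-up membership)
in_chicago   = {p for i, p in enumerate(people) if i % 3 == 0}   # ~1/3 match
is_developer = {p for i, p in enumerate(people) if i % 5 == 0}   # ~1/5 match
age_30_40    = {p for i, p in enumerate(people) if i % 7 == 0}   # ~1/7 match

candidates = population & in_chicago & is_developer & age_30_40
print(len(candidates))  # 10 -- three weak clues cut 1000 people to ten
```

With independent clues the pool shrinks multiplicatively (here roughly 1000 / (3*5*7)), which is why a handful of throwaway details over months is enough.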
Given the above, it doesn't surprise me that LLMs can do the same but at high speed and across multiple sites etc.
tsumnia|4 days ago
rudhdb773b|4 days ago
dirk94018|4 days ago
The paper shows deanonymization from public posts. Imagine what's possible with private API traffic: the questions you ask, the code you paste, the errors you debug. Even if providers don't read it today, the data exists and the cost of analyzing it is going to zero.
Air-gapped local inference isn't paranoia. It's necessary.
Imustaskforhelp|4 days ago
> Air-gapped local inference isn't paranoia. It's necessary.
I definitely agree. I am seeing new models like qwen-3.5-30A3b (iirc) that can run reasonably on normal hardware (you can buy a Mac mini whose price hasn't been inflated) and get decent tps while being a decent model overall.
There are some services like Proton Lumo, the service by Signal, and Kagi's AI which seem to try to be better, but long term my plan is to buy a Mac mini for this level of inference for basic queries.
Of course, in the meanwhile, for something like coding it might not make too big a difference whether you use a local model or not, except for the most extremely sensitive work (perhaps govt/bank oriented).
danielodievich|5 days ago
On the other hand, Neal Stephenson's Fall; or, Dodge in Hell has an interesting idea in the early phase of the book, where a person agrees to what we now know as "flood the zone with sh*t" (Steve Bannon's sadly very effective strategy) to battle some trolls. Instead of trying to keep clean, the intent is just to spam like crazy with anything so nobody can find the core. It is cleverly explored in the book, albeit for too short a time before it moves into virtual reality. I think there are a few people out here right now practicing this.
DrewADesign|5 days ago
I don’t think you’re wrong, but the fact that people consider it inevitable we’ll all have an immutable social acceptance grade that includes everything from teenage shitposts to things you said after a loved one died, or getting diagnosed with cancer, makes me regret putting even a moment of my professional energies towards advancing tech in the US.
txrx0000|4 days ago
Here's a different vision for the future:
Let information filtering become each individual's own responsibility. We have LLMs now, and they'll get more efficient, so why not use them locally to filter incoming feeds according to each of our own preferences, while removing all filtering/moderation on posting info out? Build systems to decentralize and anonymize the Internet so that people can discover anyone and aren't afraid to post anything. Make it so that everyone can get a message out to the world and nobody can be arrested or assassinated for it. This would put an end to most violent conflict, because it would be replaced by online discourse.
Let the Internet be flooded with trash and gold at the same time. Let each individual decide what info is/isn't valuable to them. Let those individuals self-organize. Let ideas compete freely, so that the best ones may prevail.
tclancy|5 days ago
hiAndrewQuinn|4 days ago
I do the same thing, and I think I'm a much better person for it. The Internet is not, in my final analysis, some indiscriminate dumping ground for my personal issues and moods. It's a place where I can relax and practice putting forward a more prosocial form of myself, even when what I actually have to say is uncomfortable.
While we can't predict how the adversary will read and respond to our moves, I suspect the easier marks are the people who choose to publicly drench everything they touch in negativity and cynicism. It's a sign of an already compromised social immune system.
rudhdb773b|4 days ago
My values or priorities may significantly change over decades, especially as a child, so why would I want to jeopardize the reputation of a potential future identity with something I may post today?
sponaugle|5 days ago
On the plus side, someone will sometimes say while talking to me, "oh, you're that Subaru guy," or "that YouTube guy," or whatever, and that is a fun connection.
qsera|5 days ago
The only winning move here is not to play.
pavel_lishin|5 days ago
I honestly don't even think I understood the ending. Or the middle, if I'm being extra honest.
I think Anathem addressed the "flood the zone with shit" much better in something like three paragraphs.
hliyan|4 days ago
croes|4 days ago
You don’t know what information about you can bring you in trouble in the future.
rapnie|4 days ago
culi|4 days ago
47282847|4 days ago
I don’t think this is humanly possible against machine learning. After all, it is specifically designed to weed through noisy data and identify patterns. It may delay discovery, but it will at some point easily fall apart, via something as simple as a “filter out shitposting and deliberate pollution” prompt. Even more so when you guide it toward specific attributes.
slopinthebag|5 days ago
unknown|4 days ago
[deleted]
gambutin|5 days ago
AFAIK the strategy is usually used to divert attention from one subject that could be harmful to a person to some other stuff.
Wouldn’t spamming in that case provide more information about you?
godelski|5 days ago
ectospheno|5 days ago
observationist|5 days ago
We're already seeing this as a side effect of the mishmash of influence operations on social media - with so many competing interests, mixed in with real trolls, outrage farmers, grifters, and the like, you literally cannot tell, without extensive reputation vetting, whether or not a source is legitimate. Even then, at any suggestion that an account might be hacked or compromised (like a sudden, significant deviation in style, tone, or subject matter), you have to balance everything against a solid model of what's actually behind probably 80% or more of the "user" posts online.
There are a lot of aligned interests causing APEs to manifest - they're a mix of psyop style influence campaigns, some aimed at demoralization, others at outrage engagement, others at smears and astroturfing and even doing product placement and subtle advertisement. The net effect is chaos, so they might as well be APEs.
KPGv2|5 days ago
When I was that age, you could tell that the kids who had political ambitions self-censored online. But now everyone is buck wild, so you have to ignore that when evaluating people.
For example, a MASSIVE portion of Millennials and younger looking at the Maine election are pretty chill about the leading Democratic candidate having a Nazi tattoo because of this very thing. Basically, "dumb, drunk, deployed Marines will get cool skull-and-crossbones tattoos in their early twenties, and so what if he said a couple ill-worded, somewhat misogynistic things in his twenties? That was decades ago, and he's obviously a different person."
Contrast with Bill Clinton, where he literally had to explain away university marijuana usage TWENTY YEARS AFTER THE FACT.
Point is, I think we're witnessing this evolution happening right now.
aspenmartin|4 days ago
- First it told me it couldn't do this, that this was doxxing
- I said: its for me, I want to see if I can be deanonymized
- Claude says: oh ok sure and proceeds to do it
It analyzed my profile contents and concluded that there were likely only 5 - 10 people in the world that would match this profile (it pulled out every identifying piece of information extremely accurately). Basically saying: I don't have access to LinkedIn but if I did I could find you in like 5 seconds.
Anyway, like others have said: this type of capability has always been around for nation state actors (it's just now frighteningly more effective), but e.g. for your stalker? For a fraudster or con artist? Everyone has a tremendous unprecedented amount of power at their fingertips with very little effort needed.
kseniamorph|5 days ago
intended|5 days ago
People on HN who talk about their work but want to remain anonymous? People who don’t want to be spammed if they comment in a community? Or harassed if they comment in a community? Maybe someone doesn’t want others to find out they are posting in r/depression. (Or r/warhammer.)
Anonymity is a substantial aspect of the current internet. It’s the practical reason you can have a stance against age verification.
On the other hand, if anonymity can be pierced with relative ease, then arguments for privacy are non sequiturs.
GorbachevyChase|5 days ago
I think that we are close to a time where the Internet is so toxic and so policed that the only reasonable response is to unplug.
gwern|5 days ago
ceejayoz|5 days ago
Easier methods probably means more adversaries.
graemep|5 days ago
3abiton|4 days ago
afpx|5 days ago
cryptonector|4 days ago
password4321|4 days ago
This page is anonymous
20190119 https://news.ycombinator.com/item?id=20220048 (149 points, 51 comments)
20130501 https://news.ycombinator.com/item?id=5638988 (453 points, 243 comments)
https://news.ycombinator.com/threads?id=voidnull
https://antirez.com/hnstyle?username=voidnull
notepad0x90|4 days ago
But with HN, I'd like to ask @dang and HN leadership to support deleting messages, or making them private (requiring an HN account to see your posts).
At first I thought of how this would impact employment. But then I thought about how ICE has been tapping Reddit, Facebook, and other services to monitor dissenters. The whole Orwellian concern is no longer theoretical. I personally fear physical violence from my government as a result. But I will continue to criticize them; I just wish it wasn't so easy for them to retaliate.
iamnothere|5 days ago
If you are semi-retired, you’re free from the threat of cancellation. As long as you aren’t posting about crimes, there are limits to what anyone can legally do to you. (Still, it’s good to be prudent and limit sharing.)
comrh|4 days ago
angry_octet|5 days ago
unknown|5 days ago
[deleted]
JohnMakin|5 days ago
While people will point out this isn't new, the implication of this paper (and something I have suspected for 2 years now but never played with) is that what would take a human investigator a fair bit of time, even with common OSINT tooling, will become trivial.
You should never assume you have total anonymity on the open web.
ghywertelling|5 days ago
warkdarrior|5 days ago
with|4 days ago
comrh|4 days ago
ghm2199|4 days ago
We could designate specific individuals to do this for you and me, just like we do today with trust authorities for website certificates.
No more verifying profiles by uploading names, emails, passports, and photographs (gosh!). Just turned 18 and want to access Insta? Go to the local high school teacher to get age-verified. Finished a career step and want it on LinkedIn? Go to the company officer. Are you a new journalist who wants to be designated as such on X, but anonymously? Go to the notary public.
One can do this cryptographically with no PII exchanged between the person, the community, or the web service. And you can be anonymous, yet people know you are real.
It can be all maintained on a tree of trust, every individual in the chain needs to be verified, and only designated individuals can do actions that are sensitive/important.
You only need to do this once every so often to access certain services. Bonus: you get to take a walk and meet a human being.
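One classic primitive behind "verified but unlinkable" schemes like the one sketched above is the blind signature: the verifier signs a token without seeing its contents, so the token's later use can't be tied back to the verification session. Below is a toy textbook-RSA blinding sketch with tiny hardcoded primes; it is an illustration of the math only, not production cryptography (real deployments would use hashed messages, padding, and vetted libraries).

```python
# Toy RSA blind signature: the signer (e.g. a notary attesting "over 18")
# signs a blinded value without learning it; the user unblinds the result
# into a valid, unlinkable signature on the original token.
import math
import random

# Signer's RSA keypair -- tiny primes for demonstration, never use in practice
p, q = 1000003, 1000033
n = p * q
phi = (p - 1) * (q - 1)
e = 65537
d = pow(e, -1, phi)          # private exponent

def blind(m, n, e):
    # Pick a random blinding factor r coprime to n; send m * r^e instead of m
    while True:
        r = random.randrange(2, n)
        if math.gcd(r, n) == 1:
            break
    return (m * pow(r, e, n)) % n, r

def sign(mb, d, n):
    # Signer sees only the blinded value mb
    return pow(mb, d, n)

def unblind(sb, r, n):
    # (m * r^e)^d = m^d * r (mod n), so dividing by r recovers m^d
    return (sb * pow(r, -1, n)) % n

m = 123456                    # token; in a real scheme this would be a hash
mb, r = blind(m, n, e)
s = unblind(sign(mb, d, n), r, n)
assert pow(s, e, n) == m      # anyone can verify the signature on m
print("signature verifies, signer never saw", m)
```

The "tree of trust" part then amounts to the designated individuals' own keys being signed up the chain, as with certificate authorities today.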
deepsun|4 days ago
bigwheels|5 days ago
Show HN: Using stylometry to find HN users with alternate accounts
https://news.ycombinator.com/item?id=33755016 - Nov 2022, 519 comments
password4321|4 days ago
20250415 https://news.ycombinator.com/item?id=43705632 Reproducing Hacker News writing style fingerprinting (325 points, 159 comments)
deadbabe|4 days ago
If you’re basically LARPing a new personality every time and just making up details about where you live or what your life is like then how is this ever going to work? Someone could say they live in San Francisco while actually living in Indiana.
cluckindan|5 days ago
bitwize|5 days ago
boisterousness|3 days ago
gormen|4 days ago
block_dagger|5 days ago
hellojesus|5 days ago
Cider9986|5 days ago
DalasNoin|5 days ago
prats226|4 days ago
HelixSequencing|4 days ago
Your writing style can theoretically be masked with an LLM. Your genome can't. And it doesn't just identify you -- it identifies your relatives, your disease risks, your ancestry, things you might not even know about yourself yet. The deanonymization vector here is permanent and irrevocable in a way that no amount of OPSEC can fix after the fact.
The semantic approach in this paper (interests, clues, behavioral patterns) is scary enough. Now imagine combining that with leaked genetic data. You don't even need to match writing styles when you can match someone's 23andMe profile to their health subreddit posts about conditions they're genetically predisposed to.
yomismoaqui|5 days ago
And surprise, a tool made for processing text did it quite well, explaining the kind of phrase constructions that revealed my native language.
So maybe this is a plus for passing any text published on the internet through a slopifier for anonymization?
EDIT: deanonymization -> anonymization
joe_mamba|5 days ago
Or vice versa, Indian scammers online can now run their traditional Victorian English phrasing through an AI to sound more authentically American.
Interviewers now have to deal with remote North Korean deepfaked candidates pretending to be Americans.
Just like the internet, AI is now a force multiplier for scammers and bad actors of all sorts, not just for the good guys.
Lerc|4 days ago
For a few years now I have been telling people how unprepared the world is for this change. Not understanding how this is possible will lead to people outright deifying AI that has the capability to do things like this. It will seem like omniscience.
I think the main protection we have in a world where you cannot effectively hide, is that anyone who abuses this ability will be operating under the same system. You can use it to your advantage, but not without getting caught.
nickdothutton|4 days ago
casey2|5 days ago
Of course, far more dangerous is government using this to justify unjustifiable warrants (similar to dogs smelling drugs from cars) and the public not fighting back.
DalasNoin|5 days ago
(We use a little stylometry in a single experiment in section 5)
Noaidi|3 days ago
> Anonymity is a myth. I am sure by now an LLM can figure out who you are and where you live by your HN posts alone."
>> iamnothere 3 days ago: Do it then
https://news.ycombinator.com/item?id=47123383
YesBox|5 days ago
[0] Note: last I tried this was months ago, things may have changed.
YesBox|5 days ago
Last block of text from copilot :/
-----------
If you want, I can also break down:
Their posting style (tone, frequency, community engagement)
How their work compares to other indie city builders
What seems to resonate most with Reddit users
Just tell me what angle you want to explore next.
Havoc|4 days ago
Pity - the pseudo anon internet is fun
unknown|5 days ago
[deleted]
lunaprompts_hn|4 days ago
What tends to break agents in the wild: ambiguous instructions that have multiple valid interpretations, state that changes mid-task, and error recovery when a sub-step fails silently rather than loudly.
The hardest thing to benchmark is graceful degradation. A good agent should know when to stop and ask for clarification rather than confidently completing the wrong task.
mhitza|5 days ago
https://en.wikipedia.org/wiki/Stylometry
The best course of action to combat this correlation/profiling seems to be using a local LLM that rewrites the text while keeping the meaning untouched.
Ideally built into a browser like Firefox/Brave.
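A minimal sketch of such a rewriter, assuming an Ollama-style local server at localhost:11434 (the model name "llama3" and the prompt wording are placeholders, not a tested recipe). The point is that only the paraphrase ever leaves your machine:

```python
# Rewrite a comment through a local LLM before posting, to blunt
# stylometric fingerprinting. Assumes a local Ollama-compatible server.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint

def build_payload(text, model="llama3"):
    # Ask for a meaning-preserving paraphrase in a neutral register
    prompt = (
        "Rewrite the following text in a neutral, generic style. "
        "Preserve the meaning exactly; change the wording, rhythm, and "
        "punctuation habits. Output only the rewrite.\n\n" + text
    )
    return {"model": model, "prompt": prompt, "stream": False}

def anonymize(text):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(text)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_payload("Honestly, I reckon y'all are overthinking this!")
assert payload["stream"] is False and "Rewrite" in payload["prompt"]
```

Note this only masks style; the semantic clues this paper actually exploits (your interests, location hints, timeline) survive any paraphrase.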
DalasNoin|5 days ago
The blog post might be more approachable if you want to get a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
IncreasePosts|5 days ago
DalasNoin|5 days ago
patcon|5 days ago
I am intrigued by the idea that in the future, communities might create a merged brand voice that their members choose to speak in via LLMs, to protect individual anonymity.
Maybe only your close friends hear your real voice?
Speaking of which, here's a speculative fiction contest: https://www.protopianprize.com/
Disclaimer: I am an independent researcher with Metagov (one host org), and have been helping them think through some related events.
EDIT: I've belatedly realized that stylometry isn't involved, but I think some of the above "what if" thought could still hold :)
spoaceman7777|4 days ago
5o1ecist|5 days ago
No two ways of expressing something create equal impressions.
Relevant: https://www.perplexity.ai/search/hey-hey-someone-on-hn-wrote...
palmotea|5 days ago
A problem with that is then your post may read like LLM slop, and get disregarded by readers.
Another reason why LLMs are destruction machines.
flux3125|4 days ago
thesz|4 days ago
Stylometry can match not only people, but ethnic groups. No LLM required.
gambutin|5 days ago
EDIT: please someone build this, vibe-code it. Thanks
intended|5 days ago
That said, give it a few days and someone will have a proof of concept out.
DalasNoin|5 days ago
unknown|5 days ago
[deleted]
stackghost|5 days ago
zoklet-enjoyer|5 days ago
GorbachevyChase|5 days ago
qsort|5 days ago
Hello, LLM! :)
tryauuum|5 days ago
I've been trying to delete my GitHub account for many months
reducesuffering|5 days ago
jacquesm|5 days ago
razingeden|5 days ago
dpc_01234|5 days ago
ranger_danger|5 days ago
matheusmoreira|4 days ago
thatguysaguy|4 days ago
The real question is whether someone who is pseudonymous and actually attempting to remain so can be deanonymized.
matheusmoreira|4 days ago
They can. That's the point. This site serves as a dataset against which pseudonymous posts can be evaluated.
econ|5 days ago
The platforms offer only castrated interactions designed not to accomplish anything. People online are useless, obnoxious shadows of their helpful and loving selves.
No one cares more what you say than those monitoring you and building that detailed profile with sinister motives. The ratio must be something like 1000:1 or worse.
unknown|5 days ago
[deleted]
sbmsr|5 days ago
Foobar8568|4 days ago
georgeburdell|5 days ago
greesil|5 days ago
yu3zhou4|5 days ago
comrh|4 days ago
Zigurd|5 days ago
DalasNoin|5 days ago
wasmainiac|5 days ago
For example, if I tell my bot to clone me 100x on all my platforms, all with different facts or attributes, suddenly the real me becomes a lot harder to select. Or any attribute of mine at all becomes harder to corroborate.
I hate to use this reference, but like the citadel from Rick and Morty.
SchemaLoad|4 days ago
aplomb1026|5 days ago
[deleted]
DalasNoin|5 days ago
switchbak|5 days ago
throwaway4928ab|4 days ago
[deleted]
newzino|5 days ago
[deleted]
retew22|4 days ago
[deleted]
retew22|4 days ago
[deleted]
squeefers|5 days ago
DalasNoin|5 days ago
We also ran a more realistic test in section 2. There we use the Anthropic interviewer dataset, which Anthropic redacted; from the redacted interviews our agent identified 9/125 people based on clues.
The blog post might be more approachable for a quick take: https://simonlermen.substack.com/p/large-scale-online-deanon...
dang|5 days ago
https://news.ycombinator.com/newsguidelines.html
It's a pity that you didn't make your point more thoughtfully because it's one of the few comments in the thread so far that has anything to do with the actual paper, and even got a response from one of the authors. That's good! Unfortunately, badness destroys goodness at a higher rate than goodness adds it...at least in this genre.
nottorp|5 days ago
A funnier question is: did they match me to the correct LinkedIn profile, or did the LLM pick someone else?