
Ask HN: GPT-3 reveals my full name – can I do anything?

710 points| BoppreH | 3 years ago | reply

Alternatively: What's the current status of Personally Identifying Information and language models?

I try to hide my real name whenever possible, out of an abundance of caution. You can still find it if you search carefully, but in today's hostile internet I see this kind of soft pseudonymity as my digital personal space, and expect to have it respected.

When playing around in GPT-3 I tried making sentences with my username. Imagine my surprise when I saw it spitting out my (globally unique, unusual) full name!

Looking around, I found a paper that says language models spitting out personal information is a problem[1], a Google blog post that says there's not much that can be done[2], and an article that says OpenAI might automatically replace phone numbers in the future but other types of PII are harder to remove[3]. But nothing on what is actually being done.

If I had found my personal information on Google search results, or Facebook, I could ask for the information to be removed, but GPT-3 seems to have no such support. Are we supposed to accept that large language models may reveal private information, with no recourse?

I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year old me?). In the age of GDPR this feels like an enormous regression in privacy.

EDIT: a small thank you to everybody commenting so far for not directly linking to specific results or actually writing my name, however easy it might be.

If my request for pseudonymity sounds strange given my lax infosec:

- I'm more worried about the consequences of language models in general than my own case, and

- people have done a lot more for a lot less name information[4].

[1]: https://arxiv.org/abs/2012.07805

[2]: https://ai.googleblog.com/2020/12/privacy-considerations-in-...

[3]: https://www.theregister.com/2021/03/18/openai_gpt3_data/

[4]: https://en.wikipedia.org/wiki/Slate_Star_Codex#New_York_Time...

346 comments

[+] jmillikin|3 years ago|reply

  > I try to hide my real name whenever possible, out of an
  > abundance of caution. You can still find it if you search
  > carefully, but in today's hostile internet I see this kind
  > of soft pseudonymity as my digital personal space, and expect
  > to have it respected.
Without judging whether the goal is good or not, I will gently point out that your current approach doesn't seem to be effective. A Google search for "BoppreH" turned up several results on the first page with what appears to be your full name, along with other results linking to various emails that have been associated with that name. Results include Github commits, mailing list archives, and third-party code that cited your Github account as "work by $NAME".

As a purely practical matter -- again, not going into whether this is how things should be, merely how they do be -- it is futile to want the internet as a whole to have a concept of privacy, or to respect the concept of a "digital personal space". If your phone number or other PII has ever been associated with your identity, that association will be in place indefinitely and is probably available on multiple data broker sites.

The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.

[+] ChrisMarshallNY|3 years ago|reply
I gave up anonymity. I just learned to lean into taking control of my ID. Some time ago, I realized that there's no way for me to participate online, without things being attributed to me.

I learned this, by setting up a Disqus ID. I wanted to comment on a blog post, and started to set up an account.

After I started the process, it came back, with a list of random posts, from around the Internet (and some, very old), and said "Are these yours? If so, would you like to associate them with your account?"

I freaked. Many of them were outright troll comments (I was not always the haloed saint that you see before you) that I had sworn were done anonymously. They came from many different places (including DejaNews). I have no idea how Disqus found them.

Every single one of them was mine. Many, were ones that I had sworn were dead and buried in a deep grave in the mountains.

Needless to say, I do not have a Disqus ID.

Being non-anonymous means that I need to behave myself, online. I come across as a bit of a stuffy bore, but I suspect my IRL persona is that way, as well.

That's OK.

[+] mpeg|3 years ago|reply
Right? This whole thread feels like a joke when the author just removed their full name from their public, open source code 3 hours ago (and only from one of their repos, their name is fully visible in all the other LICENSE.txt files)
[+] bebrws|3 years ago|reply
I believe in the following sentences very much. However, I believe the value of the internet for any person could be directly correlated with the amount of PII they are willing to share, which to me makes this, if it is a question of morality at all, a personal decision.

The sentences that stuck out to me are: “If your phone number or other PII has ever been associated with your identity, that association will be in place indefinitely and is probably available on multiple data broker sites.

The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.”

[+] BoppreH|3 years ago|reply
> A Google search for "BoppreH" turned up several results on the first page

Not for me. It took until page 3 for just my first name to appear. If somebody is looking at past Github commits, that's already a high enough barrier for me.

I only partially agree with your conclusion. Asking people to maintain total anonymity always, with any slips punishable by permanent publication of that PII, might be the current status quo, but is not where we as society want to head.

[+] araneae|3 years ago|reply
> The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.

A third approach is using a word that means something and thus is not unique at all.

Unique strings for usernames means lots of accurate hits. If you google mine, there will be lots of hits but none are me.

[+] brysonreece|3 years ago|reply
My general belief is that I, and others, should often treat the internet as a public forum like the local town square. Of course people can show up in a physical space, hiding their identities and screaming obscenities at bystanders, but I know I’m not that type of person. As a result, the principle I usually post things under is “conduct myself online as I would in person.”

Of course this doesn’t account for “the crazies” that could potentially harass me into my physical life at an easier rate simply because they’re mad I won an online game or the like. Thankfully I haven’t had to deal with such a situation, but I also believe that may be a consequence of avoiding inflammatory back-and-forths or highly-political discussions since anonymity is reduced, which may invite those attacks.

[+] xwolfi|3 years ago|reply
Yes, one of his mistakes is using the same username everywhere. He just needs a few links and he's burned.

It's also better to use a username you copied from someone else; that way, if people find links, they find someone else entirely.

[+] gtirloni|3 years ago|reply
> merely how they do be

Going on a tangent here but I've started seeing more "do be" used lately. However, it doesn't seem right for some reason I can't pinpoint (English is not my first language).

Is it from a dialect?

[+] jrm4|3 years ago|reply
The only way to fix this now is through collective, not individual, action. Policy, for example.
[+] bluepuma77|3 years ago|reply
Interesting how everyone says „But I can google you“ instead of thinking about the issue.

Companies are building and selling GPT-3 with 175 billion parameters, and one of those „parameters“ seems to be OP's username and his „strange“ two-word last name.

If models grow bigger, they will potentially contain personal information about every one of us.

If you can get yourself removed from search indices, shouldn’t there be a way for AI models, too?

Another thought: do we need new licenses (GPL, MIT, etc.) which disallow the use for (for-profit) AI training?

[+] ravel-bar-foo|3 years ago|reply
The FTC has a method for dealing with this: they have in the past year or two ordered companies with ML models built from the personal information of minors to completely delete their models.
[+] jonbwhite|3 years ago|reply
Is it really that different than a search engine? Take away the AI specific language and you have two products that when given his username return results with his real name.
[+] karussell|3 years ago|reply
> Another thought: do we need new licenses (GPL, MIT, etc.) which disallow the use for (for-profit) AI training?

I don't think that we need new licenses, but probably open source projects need a better way to enforce them.

E.g. Copilot just ignores the licensing issues although I can imagine that there could be a solution with a few different models that return code for different purposes. (Like one model returns everything and the code can be used safely only for learning or hobby projects. Another model returns code for GPL code. And a third model returns code compatible with commercial or permissive open source projects.)

Or the model could also output the license(s) of the code, though I'm not sure this is technically possible.

[+] mr_toad|3 years ago|reply
The information is embedded in the weights of various layers in the network. Trying to remove that information by editing weights would be like trying to alter someone’s memory by tinkering with synapses.

The only way to be completely sure of removing information would be to re-train the model without that data.

[+] gjvnq|3 years ago|reply
> If you can get yourself removed from search indices, shouldn’t there be a way for AI models, too?

Absolutely yes!

[+] diamondage|3 years ago|reply
There is a legitimate question here. A lot of comments are trashing this post because his/her name is already all over the internet. But European laws have the 'right to be forgotten'. Aka you can write to Google and have your personal information removed, should you so wish. How might we address this with a GPT3 like model?
[+] remram|3 years ago|reply
I feel like if OP had actually made an effort to hide this information from search engines and GPT-3 remained the last place from which it was available, this point would be a lot more compelling. Right now it's an "everybody has my name and that's fine, but that includes GPT-3 and that makes GPT-3 bad".

I would expect that it would take considerable effort to get this information removed from Google (you would have to write to them with a request under GDPR or similar and have them add a content filter) and I don't see why the same effort wouldn't allow you to get removed from GPT-3 (which is only accessible via a web API, so a similar filter could be added).

[+] cortesoft|3 years ago|reply
I can never understand the ‘right to be forgotten’. How does that not conflict with another right, my ‘right to remember’?
[+] nonameiguess|3 years ago|reply
There are two things you can do in cases like this.

The first is asking a website owner to delete data they collected on you. That doesn't really apply here. The places this person's name is published are his own website that has this username as its url, his own Github repos, and published papers of his that were also on his website. No GDPR request is necessary to remove his name from these places because he already owns that data. As seen, he has already started to delete it himself.

The second is asking search engines to delist a result. As far as I understand, this usually has to involve information that is otherwise meant to be scrubbed from public record, like a newspaper article about a conviction that was eventually sealed. You can't ask Google to not index a scientific journal you published to or your public Github repos.

There are, of course, limits to this thanks to public interest exceptions. I don't believe Prince Andrew can ask Google to de-index anything associating him with Jeffrey Epstein. The public has a right to know, too.

In this guy's case, he really seems to be straddling a line. He contributed to open source projects under his real name linking to a Github repo with the same username he seems to reuse everywhere, including here, and also has a website where the url is that username, and it contained his CV with his real name on it along with a publication history with every publication using his real name. Is it reasonable to do those things and then ask Google and OpenAI not to associate the username with your real name?

At what point are you some regular Joe with a real grievance and at what point are you Ian Murdock complaining that GPT knows you're the Ian associated with debian?

[+] yreg|3 years ago|reply
GDPR is rather vague, and perhaps that is an intended feature.

They could:

1. Set up a content filter that filters op's name from the output. OpenAI would still need to keep record of the name, exposing it to leaks.

2. Remove the name from the dataset and retrain the model, which is obviously infeasible with each GDPR request.
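
A minimal sketch of what option 1 could look like, assuming a hypothetical post-processing step that sees the generated text before it is returned to the user. Nothing here reflects how OpenAI actually filters output; `build_redactor` and the sample name are made up for illustration. Note that the blocklist itself has to hold the name in plaintext, which is the leak risk mentioned in option 1.

```python
import re


def build_redactor(blocked_names, placeholder="[REDACTED]"):
    """Return a function that masks blocklisted names in generated text.

    Hypothetical helper: the blocklist would be filled from removal requests.
    """
    # Try longer names first so a full name wins over a shorter name inside it.
    ordered = sorted((n for n in blocked_names if n), key=len, reverse=True)
    if not ordered:
        # An empty pattern would match everywhere, so return the text unchanged.
        return lambda text: text

    # Lookarounds keep "Ann" from matching inside "Annual".
    pattern = re.compile(
        "|".join(r"(?<!\w)" + re.escape(name) + r"(?!\w)" for name in ordered),
        re.IGNORECASE,
    )

    def redact(text):
        return pattern.sub(placeholder, text)

    return redact


redact = build_redactor(["Jane Q. Example"])
print(redact("This library was written by Jane Q. Example in 2019."))
```

A filter like this only catches exact spellings, so misspellings, initials, or a name split across two responses would slip through. That is one reason it is weaker than retraining without the data.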

I expect there are other instances where it is impractical or impossible to completely forget someone's data upon a request. Does Google send people spelunking into cold storage archives and actually destroy tapes (while migrating the data that is not supposed to be erased) every time they receive a request?

[+] thatjoeoverthr|3 years ago|reply
I’m playing with it. After giving it my name, it correctly stated that I moved to Poland in Summer ‘08, but then described how I became some kind of techno musician. I run it again and it says wildly different stuff.

I have to say playing with GPT3 has been a mind blowing experience this week and you should all try it.

The most striking point was discovering that if I give it texts from my own chats, or copy paste in RFPs, and ask it to write lines for me, it’s better at sounding like a normal person than I am.

[+] ReactiveJelly|3 years ago|reply
> Posts from 13-year old me?

Right, this is why opsec is something that you must always be doing.

Anything you say can be preserved forever.

Better to use short-lived throwaway identities, and leave yourself the power of combining them later, than to start with one long-lived identity and find yourself unable to split it up.

It's inconvenient in real life that I'm expected to use my legal identity for everything. If I go to group therapy for an embarrassing personal problem, someone there can look me up because everyone is using real names. I don't like it.

[+] can16358p|3 years ago|reply
I agree. However most of us (understandably) don't think this when we are 13.

If we created an identity that is completely different than our real identity when we're 13, great.

If not, that becomes a problem without an actual solution especially in the age of Internet archives.

[+] nicbou|3 years ago|reply
It's crazy that everyone is blaming OP when exactly what you describe affects most people in their 30s.
[+] criddell|3 years ago|reply
From the TOS:

> Exercising Your Rights: California residents can exercise the above privacy rights by emailing us at: [email protected].

If you happen to be in California (or even if you are not) it might be worth trying to go through their support channel.

[+] kixiQu|3 years ago|reply
The comments do not seem to be addressing something very important:

> I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year old me?).

Combine this with

https://news.ycombinator.com/item?id=28216733 https://news.ycombinator.com/item?id=27622100

Google fuck-ups are much, much more impactful than you'd expect because people have come to trust the information Google provides so automatically. This example is being invoked as comedy, but I see people do it regularly:

https://youtu.be/iO8la7BZUlA?t=178

So a bigger problem isn't what GPT-3 can memorize, but what associations it may decide to toss out there that people will treat as true facts.

Now think about the amount of work it takes to find problems. It's wild that you have to Google your own name every once in a while to see what's being turned up to make sure you're not being misrepresented, but that's not too much work. GPT-3 output, on the other hand, is elicited very contextually. It's not hard to imagine that <There is a Hristo Georgiev who sold Centroida and moved to Zurich> and <There is a Hristo Georgiev who murdered five women> pop up as <Hristo Georgiev, who sold Centroida and moved to Zurich, had murdered five women.> only under certain circumstances that you can't hope to be able to exhaustively discover.

From a personal angle: My birth name is also the pen name of an erotic fiction author. Hazy associations popping up in generated text could go quite poorly for me.

[+] mensetmanusman|3 years ago|reply
Fascinating!

I didn’t anticipate the use case of GPT being used by debt collection agencies to tirelessly track down targets.

It will be a new type of debtors' prison where any leaks of enough personally identifying facets to the internet will string together a mosaic of the target such that the AI sends them calls, SMS, Tinder DMs, etc. until they pay and are released from the digital targeting system.

[+] sitkack|3 years ago|reply
I am sorry for so many comments showing a lack of empathy, basically saying, "what do you expect and do better!". I think you are raising real concerns: these language models will get more and more sophisticated and will basically turn into all-knowing oracles. Not just in who you are but what it thinks would be effective in manipulating you.
[+] lolinder|3 years ago|reply
I don't think this is a reasonable fear. It's reasonable to be on guard for some sensitive memorization, but it's not reasonable to fear that a language model will be able to reliably produce information on any given individual. For every person with enough of an online presence to have actually been memorized by GPT-3 or its successors, there are many more that GPT-3 will just produce good-looking nonsense for. It's not possible to distinguish between the two, so creepy surveillance capitalist firms will do better by developing their own specialized models (as they're already doing).
[+] remram|3 years ago|reply
More so than search engines?
[+] bufferoverflow|3 years ago|reply
You have no expectation of privacy while being in public. The Supreme Court ruled that anything a person knowingly exposes to the public, regardless of location, is not protected by the Fourth Amendment.

Same idea works for information. If you expose private information publicly online, it's unreasonable to expect it to remain private.

By creating this post he ensured even less privacy. He attracted even more attention, guaranteeing his public "secret" is widely known.

[+] eterevsky|3 years ago|reply
I just asked GPT-3 a few times who you are and here are its answers:

> BoppreH is an AI that helps me with my songwriting.

> I'm sorry, I don't know who that is.

> I'm sorry, I don't understand your question.

> BoppreH is an artificial intelligence bot that helps people with their daily tasks.

I have a feeling that I'll have better chances just googling you than asking GPT-3.

[+] mikequinlan|3 years ago|reply
If you hadn't just announced that the result returned by GPT-3 is your full name, nobody would have known for certain that it was correct.
[+] trollied|3 years ago|reply
> I try to hide my real name whenever possible, out of an abundance of caution

A quick google suggests that you don't.

[+] browningstreet|3 years ago|reply
Just flew back from Europe. Still traveling actually.

It used to be that when you hit border control you present your passport.

They don’t ask for that anymore: border control waved a webcam at my face, called out my name, told me I could go through. Never once looked at my passport.

I think we’ve lost.

[+] fennecfoxy|3 years ago|reply
Claiming that technology is the problem is naive; it's misuse of technology and lack of appropriate legal frameworks that is the problem.

Being able to walk through an e-passport gate is awesome, we _should_ be using technology to make our lives easier.

But it needs to go hand in hand with legal protections; imagine the past world where car manufacturers were not held to any safety standards or regulations, cars would not be such a boon to us as they are now.

[+] haunter|3 years ago|reply
Am I missing something? You had your full CV on your public homepage with your full name
[+] gordaco|3 years ago|reply
Obligatory xkcd: https://xkcd.com/2169/ .

I'm afraid that we are going to see these kinds of issues proliferate rapidly. It's a consequence of the usage of machine learning with extensive amounts of data coming from "data lakes" and similar non-curated sources.

[+] SnowHill9902|3 years ago|reply
Rotate your usernames every 2 months. Use different usernames on every website. Rotate your full name every 10 years (as suggested by Eric Schmidt).
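
One hedged way to make per-site, rotating usernames manageable without keeping a list: derive each one from a single secret. This is only a sketch under stated assumptions; `derive_username`, the period label format, and the `u` prefix are all hypothetical choices, and it does nothing about writing style, email addresses, or other things that can link accounts.

```python
import hashlib
import hmac


def derive_username(secret: bytes, site: str, period: str, length: int = 10) -> str:
    """Derive a stable pseudonym for one site and one rotation period.

    The same inputs always give the same name, so nothing needs to be stored.
    Without the secret, names for different sites or periods look unrelated.
    """
    message = f"{site.lower()}|{period}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    # Prefix with a letter because some sites reject names starting with a digit.
    return "u" + digest[:length]


# Hypothetical usage: one secret kept offline, period label bumped every 2 months.
secret = b"replace-with-a-long-random-secret"
print(derive_username(secret, "news.ycombinator.com", "2022-07"))
print(derive_username(secret, "github.com", "2022-07"))
```

The design choice is that the only thing to protect is the secret. If it leaks, every derived username can be recomputed and linked.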
[+] matheusmoreira|3 years ago|reply
> Rotate your usernames every 2 months. Use different usernames on every website.

How to manage all these identities though? How to make sure they don't leak into each other?

[+] jsmith45|3 years ago|reply
> Rotate your full name every 10 years (as suggested by Eric Schmidt).

This is not always possible if one means not just daily use name but also legal name.

There is at least one state where the name change law allows residents to only change name once (except for marriage related last name changes).

[+] junon|3 years ago|reply
As someone who has tried to do this before, this is very very difficult to do correctly and completely.
[+] m3047|3 years ago|reply
What I find missing in the comments is any examination of the following sequence of hypothetical events:

1) Adversarial input conditioning is utilized to associate an artifact with others, or a behavior.

2) Oblivious victim users of the AI are manipulated into a specific behavior which achieves the adversary's objective.

Imagine a code bot wrongfully attributing you with ownership of pwned code, or misstating the license terms.

Imagine you ask a bot to fill in something like "[email protected]" and instead of filling in (literal) [email protected] it fills in real addresses. Or imagine it's a network diagnostic tool... ooh that's exciting to me.

Past examples of successful campaigns to replace default search results via creative SEO are offered as precedent.

[+] WhiteNoiz3|3 years ago|reply
Sadly, I think the only way to protect against this is with another AI whose job it is to recognize what data is appropriate to reveal and what is private - basically what humans do. But, even then it will probably still be susceptible to tricks. Of course the ideal thing is just to not include it in the training data but I think we know how much effort that would take when the training data is basically the entire internet. I wonder if as AI systems become more efficient and they learn to "forget" information which isn't important and generalize more, that this will become less of an issue.
[+] permo-w|3 years ago|reply
if you want to stay anonymous online, don't try and hide, don't go for this magical, extremist, non-existent "full anonymity". spray out false information at random. overload the machine. give nothing real, then when you do want to be real, it's impossible to tell
[+] bruce343434|3 years ago|reply
Impossible to tell until you have someone with the time to dig through everything and find the real identity from the fake ones.