top | item 20402070

Google employees are listening to Google Home conversations

872 points | bartkappenburg | 6 years ago | translate.google.com

441 comments

[+] gringoDan|6 years ago|reply
I think the responses to this can be broken down into a 2x2 matrix: level of concern vs. understanding of technology.

1) Don't understand ML; not concerned - "I have nothing to hide."

2) Don't understand ML; concerned - "I bought this device and now people are spying on me!"

3) Understand ML; not concerned - "Of course, Google needs to label its training data."

4) Understand ML; concerned - "How can we train models/collect data in an ethical way?"

To me, category 3 is the most dangerous. Tech workers have a responsibility not just to understand the technologies that they work with, but also to educate themselves on the societal implications of those technologies. And as others have pointed out, this extends beyond home speakers to voice-enabled devices in general.

In conversations about this with engineers the response I've gotten is essentially: "Just trust that we [Google/Amazon/etc.] handle the data correctly." This is worrying.

[+] cronix|6 years ago|reply
I'm in the 5th category. 5) Understand ML; concerned - won't allow any of these things in my house, period, because they will always use them for things behind the scenes that they won't state. I don't care how well trained they are, or how "ethical." Ethical according to whom, and at what time period in the future? Ethics change. The data they have on you won't. Look at all of the politicians and other people getting in trouble for things they said 15 years ago, which were generally more acceptable at the time but we've "progressed" since then. Who will be making decisions about you in the future based on last year's data? Just don't give it to them.
[+] ovi256|6 years ago|reply
This classification is very useful for discussing this issue.

The difference between 3 and 4, noble as it is, can be caused by feasibility concerns that push people into 3, not just ignorance of the privacy impact. Human labelling of training data sets is a big part of supervised learning. Methods that dispense with it would be valuable for purely economic reasons beyond privacy - the cost of human labelling of data samples. Yet we don't have them!

Techniques like federated learning or differential privacy can train models on opaque (encrypted or unavailable) data. This is nice, but they assume too much: that the data is already validated and analyzed. In real-life modelling problems, one starts with an exploratory data analysis, the first step being looking at data samples. Opaque, encrypted datasets also stop ML engineers from doing error analysis (looking at your errors to better target model/dataset improvements), which is an even bigger issue, IMO, as error analysis is crucial when iterating on a model.

Even for an already productionized model, one has to do maintenance work like checking for concept drift, which I can't see how to do on an opaque dataset.
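
For intuition, the differential-privacy idea mentioned above can be sketched with the classic Laplace mechanism: answer an aggregate query with calibrated noise so that no individual record is exposed. This is a minimal illustration of the technique, not a description of how Google (or anyone) trains their assistants; all names in it are mine:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling from Laplace(0, scale)
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(values, predicate, epsilon: float) -> float:
    """Epsilon-differentially-private counting query.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: report roughly how many utterances were misrecognized
# without revealing which individual samples were flagged.
random.seed(0)
misrecognized = [True, False, True, True, False]
noisy_count = dp_count(misrecognized, lambda flagged: flagged, epsilon=0.5)
```

Note that this only covers aggregate queries; it does nothing for exploratory data analysis or per-sample error analysis, which is exactly the gap the comment describes.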

[+] mejari|6 years ago|reply
>To me, category 3 is the most dangerous. Tech workers have a responsibility not just to understand the technologies that they work with, but also to educate themselves on the societal implications of those technologies.

Do you think it's possible to be educated on the societal implications of these technologies and still not be concerned? It seems like you've written your own viewpoint in as the only "logical" one here.

[+] okmokmz|6 years ago|reply
>To me, category 3 is the most dangerous. Tech workers have a responsibility not just to understand the technologies that they work with, but also to educate themselves on the societal implications of those technologies. And as others have pointed out, this extends beyond home speakers to voice-enabled devices in general.

Yes, I'm frequently amazed at how many coworkers I have who are still completely plugged into Google, Facebook, and Amazon services/spyware, fill their homes with internet-enabled "smart devices", have Alexa/Google Assistant, etc., and yet act like I'm paranoid when I try to discuss security concerns, or just flat-out don't care.

As much as I hate to say it, I think there needs to be a massive breach or abuse of power from one of these organizations/services that has severe real world consequences for those that utilize/support them. Until then nothing will change.

[+] snowwrestler|6 years ago|reply
> Tech workers have a responsibility not just to understand the technologies that they work with, but also to educate themselves on the societal implications of those technologies.

I think this goes well beyond tech workers. I think it's time for society to legally recognize the balance between the value of ML systems and the privacy concerns of customers of ML.

Doctors and lawyers obviously should understand the value of privacy, but we, as a society, have also created legal rights and duties for them. Conversations with lawyers and doctors are legally privileged; at the same time, there are specific consequences for medical companies or lawyers who do not protect that information.

Companies like Google, Apple, Amazon, etc. certainly have the resources, intelligence, and sophistication to comply with a similar regulatory regime. IMO it should be possible to construct a law that allows companies to collect, store, and tag customer data for purposes of training ML systems, but sets serious duties, with consequences, on them to do it right.

Right now, what is to keep employees at these companies from abusing these systems to stalk, to surveil, to harass, or even just to feed their own curiosity? These data systems are core trade secrets for these companies, which means they are opaque to any kind of oversight from outside the company.

The free market can't create the necessary balance because customers need information to make decisions--information that they don't have. The result will be an increasingly chaotic "hero/shithead rollercoaster" as customers make snap judgments based on scanty or wrong information about what these companies are actually doing.

This is a classic case for regulation, which prevents a "race to the bottom" of sketchy practices for short term gain, while also protecting the ability of people and companies to use this technology to create value.

Doing this right will help data-leveraging companies in the long run, just like attorney-client privilege and HIPAA have helped lawyers and doctors build trust (and therefore value) in their customer relationships.

[+] ethbro|6 years ago|reply
I like that matrix!

One thing that I think gets lost in engineers (and humans) is scale.

Googazon doing {thing} might be "meh" for 10 people. But the implications look very different when it's doing {thing} for 10%+ of a country's population.

At 10 people, I may find out Ted likes to eat Italian. At 10%, I may find out an Italian chain has a sudden health issue and short their stock.

Which is in essence their original playbook: do things that only work at a scale that only we can play at.

[+] Nelson69|6 years ago|reply
Anyone remember the 3D printed gun stuff from a few years back? I think this isn't very different. You can take these raw pieces, explain how each is simple and good, and draw simple ethical conclusions from them, but then you add it all up and the bigger picture doesn't feel quite the same. 3D printers are good, sharing 3D printing plans is good, it's good to help your neighbor, there are no regulations and we're experiencing tremendous growth in the 3D space, people are inventing new stuff, starting new businesses, etc. All good stuff. But letting any jackass off the street print a working gun when we have how many mass shootings a year? People don't feel the same way. All the pieces are totally okay until you get a more questionable global intention, and how can you regulate intention?

Google using the data to train models is just a tool, a baby step. They aren't doing it to sell the models, or as an end in itself; they're doing it so they can generate data from your voice data that they might consider theirs and not yours, and then feed that into other systems which generate tremendous profits for them in ways you don't even know. They have intended uses already. Is it remotely fair to talk about ethical training in this context without some idea of the intended use and distribution of the metadata?

[+] isostatic|6 years ago|reply
5) Understand ML; concerned - "Why do other people in the ML industry think it's OK to use and store people's data without informed consent? (Only group 3 can give it; groups 1 and 2 don't understand enough to.)"
[+] iamnothere|6 years ago|reply
The Mycroft project has a better approach to this:

"Mycroft uses opt-in privacy. This means we will only record what you say to Mycroft with your explicit permission. Don’t want us to record your voice? No problem! If you’d like us to help Mycroft become more accurate, you can opt in to have your voice anonymously recorded."

(project is open source, at https://mycroft.ai/)

Let people participate in R&D if they want to, but don't force it.

[+] munchbunny|6 years ago|reply
I'm perhaps in a subcategory of (3) that falls under "Understand ML; concerned".

Knowing what I know about how people I have worked with have come close to or have actually mishandled data despite the best of intentions, I do not trust any of these teams without an explicit accountability mechanism that is observable by an outside entity. I'm not looking to punish slip-ups, because mistakes happen, but I am looking for external enforcement to keep people honest.

It's not that I think the engineers using this data are mustache-twirling villains; it's that I think mishandling is inevitable due to inattention (yes, even you make mistakes!), and we have to design our data pipelines with that in mind.

[+] prepend|6 years ago|reply
There's a different dimension: people who may or may not understand ML, but are cognizant that any data created will be viewed at least by the company that creates it.

I fall into that category, as I have neither the time nor the trust in any evaluation method to determine whether a company is using my data ethically. If I create data and store it somewhere that's not mine, I only do so in situations where I'm comfortable with the owner doing anything they want with it.

I understand ML and know that Google has to use this data for training, at the very least. I've also worked in IT long enough to know that even in super tightly controlled environments, data are misused by administrators.

[+] munificent|6 years ago|reply
> In conversations about this with engineers the response I've gotten is essentially: "Just trust that we [Google/Amazon/etc.] handle the data correctly."

No one is afraid of power when it's in their own hands. A common failure mode is assuming that a power in your hands today will always remain there.

[+] m3at|6 years ago|reply
I'm in both 3 and 4.

4 because not being explicit about the practice is misleading at best, because outsourcing the difficult task of keeping the analysis private shows how unimportant it's considered, and because big tech companies have a tendency to decrease privacy over time. Using clients who paid for the product as a dataset generator is also wrong.

But 3 at the same time, because it's important to evaluate the performance of the product in the field, not just in the lab. There have been so many cases of catastrophic failures of ML models (e.g. classifying black people as gorillas) that having a tight feedback loop is important.

It has to be done right, but evaluating a product that was primarily developed for (or at least by) English speakers and transferred to other domains seems like the right thing to do.

All in all, I don't and wouldn't use one of those assistants, because 4 outweighs 3, but it's not binary.

[+] Balgair|6 years ago|reply
>Tech workers have a responsibility not just (to) understand the technologies that they work with

Ok, I agree with you completely, 100%. However, based on my limited worldview, tech workers barely understand the tech they work with at all [0]. Asking for the ethical implications to be mulled over is unlikely to happen, considering the near-weekly HN threads on "interviewing sucks, here's how to fix it, lol". We can't even figure out how to hire someone, let alone how to impedance-match with them on deep issues like the ethical implications of ML/AI.

[0] https://stackoverflow.com/

[+] novok|6 years ago|reply
Get real, obvious, informed consent by asking whether people would like their voice prompts to be improved on / heard by real live humans, as an opt-in. I bet 1/500 of the population would opt in to it.

And the first one to do it should be Apple itself.

[+] tgsovlerkhgsel|6 years ago|reply
Assuming categories 1 and 3 are sufficiently large (and I assume that is the case), this is easily resolved by allowing users to choose whether to donate their data for training or not.

If the training already only happens on a 1/500 sample, skewing the sample towards "people who don't care about their privacy" will probably not significantly impact the quality of the data.

I'm surprised this wasn't already the case, but hopefully the article will help the people responsible make better decisions in the trade-off between minimizing onboarding friction and respecting users' privacy in the future.

[+] Rapzid|6 years ago|reply
> societal implications of those technologies

Asserting your point of view as "educated" and "correct" while labeling people who don't share it as dangerous doesn't sound like a great way to start a discussion.

[+] neilpointer|6 years ago|reply
I'm between 3 and 4: I just want proof that they remove PII from the audio files. If it's a bunch of audio files with unique IDs and metadata like time of day, count me as a member of group 3.
[+] wybiral|6 years ago|reply
Even if I trust them to do what they say they're doing with the data I may not trust every party who comes to possess that data. And I may not trust their possession/use of it in all future contexts - as their privacy policy slowly drifts into the unknown year after year.

If they're collecting it in a way that can be requested by governments (for instance) or could be leaked by hackers that's another layer of valid "concern" not related to my understanding of the ML aspect of this.

[+] Spooky23|6 years ago|reply
The meta-issue in the United States is that once your data is accessible to a third party, you have no sovereignty over it, and abuse by private actors is "agreed to" by click-wrap and access by government actors is a simple subpoena.

The law needs to catch up. Sharing should require specific informed consent and legislation needs to establish a scope where data stored as a "tenant" on a third party server is given 4th amendment protection.

[+] raghava|6 years ago|reply
Essentially, a larger grid, involving

agent( tech, management ) # assuming management has power over tech worker

understanding-of-ML( yes, no )

concerned-about-ethics-and-privacy( yes, no )

The combinations below are the worst in terms of ethics.

{ agent[tech], understanding-of-ML[yes], concerned-about-ethics-and-privacy[no] }

{ agent[management], understanding-of-ML[no], concerned-about-ethics-and-privacy[no] }

{ agent[management], understanding-of-ML[yes], concerned-about-ethics-and-privacy[no] }
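
The expanded grid can be enumerated mechanically. A minimal sketch, where the "worst combination" rule (unconcerned actors who either understand ML or hold management power) is my reading of the comment, not the commenter's own code:

```python
from itertools import product

AGENTS = ("tech", "management")   # assuming management has power over tech workers
UNDERSTANDS_ML = (True, False)
CONCERNED = (True, False)

def ethically_worst(agent: str, understands: bool, concerned: bool) -> bool:
    # Unconcerned actors are worst when they either understand the tech
    # (informed indifference) or wield power over those who do.
    return not concerned and (understands or agent == "management")

worst = [combo for combo in product(AGENTS, UNDERSTANDS_ML, CONCERNED)
         if ethically_worst(*combo)]
```

Under this reading, three of the eight cells are flagged, matching the three combinations listed above.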

[+] cblades|6 years ago|reply
I agree, but I think this issue is incredibly mishandled by reporting. The title in the linked article being a great example.

There is absolutely no proof of number 2 in your list, but that is by far the widest-held belief.

It's infuriating, because we can't have a useful societal dialog about the issue if the largest chunk of concerned people are, essentially, conspiracy theorists.

[+] SmirkingRevenge|6 years ago|reply
One thing about these stories that keep coming out about home assistants: they kind of create the impression that this is an issue specific to home speakers, and that you can avoid it by simply not buying them.

That's misleading.

Any voice command you use to operate any internet connected tech gadget, from phones to smart TV's, is potentially stored and flagged for human review.

You really have to avoid using voice commands at all, on all of your devices. Even that is probably insufficient: you probably have to go further and actively disable voice command features on all of your devices, assuming they actually support such a setting. Otherwise there's still the possibility of an accidental recording taking a journey through the clouds to a stranger's ears.

[+] electrograv|6 years ago|reply
So Google’s response is (paraphrased as fairly as I can while removing the sugar-coating):

’Yes, we hire people to listen in on and transcribe some conversations from the private homes of our customers (so as to improve our speech recognition engines); but the recordings aren’t linked to personally identifiable information.’

Even assuming they have only the purest intentions here, I still don’t understand how they can possibly guarantee that these recorded conversations are not linked to personally identifiable information!

For example, what’s to stop me from saying “Hey Google, I am <full legal name / ID> and my most embarrassing and private secret is <...>”?

One might argue that they could detect this in the recognized text and omit those samples, but presumably the whole purpose of hiring people to create transcripts is because the existing speech-to-text engine isn’t perfect, and they need more training data.

[+] TheAdamist|6 years ago|reply
"The man, who wants to remain anonymous, works for an international company hired by Google. "

So not a Google employee at all, but probably a low-paid contractor who is in possession of thousands of audio files. Your privacy matters, except when the bottom line is involved.

[+] numbsafari|6 years ago|reply
What is doubly concerning here is that the contractor was in a position to demonstrate how the system worked to the reporters. That would seem to indicate they have access to that data in a non-secured environment.

I'm not familiar with EU law around these things, but I would imagine there is some kind of whistleblower mechanism available, and a right for authorities to audit/inspect such activities?

[+] astrea|6 years ago|reply
Sounds like he was a Turker: "For each fragment that he listens to, he will receive a few cents."
[+] hknd|6 years ago|reply
The person is probably a temp/vendor from a consulting company (think Accenture or Cognizant), who should've signed the same NDAs as anyone working on that stuff.
[+] d1zzy|6 years ago|reply
Does it matter how much they're paid? They're probably paid the right amount relative to the work they are doing.

Also how is having access to small samples of audio a privacy issue? Are they also receiving enough information to attach an identity to the audio clips? How long are the clips? Are they randomly assigned to humans? Do those humans get to listen to multiple clips from the same Home device and can they tell that's the case?

[+] inerte|6 years ago|reply
Home, Siri, Alexa, M - they all do. I have friends who work in this field, transcribing the audio and measuring its accuracy. Sometimes it's multiple layers of contractors: an employee hands the task to a contractor, another contractor verifies the speech-to-text, and they're all managed by a contractor.

Search for languages like Portuguese, Swedish, Chinese, etc on LinkedIn and you'll find the jobs posts https://www.linkedin.com/jobs/search/?keywords=portuguese&lo...

[+] paganel|6 years ago|reply
I grew up as a kid in a country ruled by the Securitate [1], one of the few institutions that rivaled the East German Stasi when it came to spying on its own citizens, and as such I'm very, very perplexed as to why anyone would bring a listening device into his/her own house of his/her own volition. And those people even pay for the privilege of having their home lives actively monitored and listened to almost all the time; it's crazy.

[1] https://en.wikipedia.org/wiki/Securitate

[+] d1zzy|6 years ago|reply
I would imagine that people who didn't grow up in such a country, ruled by a Securitate, do not have the experience to make them fear being listened in on. Not saying that they are wrong (they may turn out to be right), just that we are all products of our experiences.
[+] viklove|6 years ago|reply
Do you have a smartphone? Why would you bring that listening device (the smartphone) into your own house out of your own volition? Please explain, because I am very perplexed.
[+] chance_state|6 years ago|reply
"What Orwell failed to predict is that we'd buy the cameras ourselves, and that our biggest fear would be that nobody was watching."
[+] duxup|6 years ago|reply
This falls into the category of:

I bugged my house... NOW MY HOUSE IS BUGGED!

Not to dismiss the value of the news here, it is important for folks to know, but the overall situation is both concerning, and amusing.

[+] lovetocode|6 years ago|reply
I own 4 Home Minis, 1 Home, and 2 Home Hubs. I honestly don't care, so long as my data is used to improve the functionality and stability of my investment. It is quite another thing if they are selling my conversations to third-party vendors.
[+] RosanaAnaDana|6 years ago|reply
I mean. Of course they are. Do you expect to be able to do any meaningful level of training on data that hasn't been properly labeled? At some point, a human has to go in and correct the software when the software gets it wrong. If you want services that do what Google Home does, you have to have this.

Even with that, I'm sure the engineers are flagging voice requests that happen more than once, or where someone has to manually change or correct what the software thought the request was.

This is only creepy if you don't understand how the software works.

[+] gtirloni|6 years ago|reply
From a computer science perspective, what should Google do to train its models in a privacy conscious way?
[+] JorgeGT|6 years ago|reply
The biggest issue IMHO is how the average consumer has been deceived into the belief that current AI is pure AI, when in reality a lot of humans are looking at your pictures, listening to your recordings, crawling through your inbox and analyzing your browsing/purchasing/streaming history, right now: https://imgs.xkcd.com/comics/trained_a_neural_net.png
[+] rev12|6 years ago|reply
I think a lot of people here are under the assumption that voice commands, on any device, have the potential to be human reviewed. I am not sure whether or not the general public has that same assumption.

That being said, my biggest concern is the fact that many of these devices don't have a hardware microphone kill switch. I feel better when I know I can control when a device is listening. I've read reports that some Alexa devices have them, but I don't own any, so I am unable to verify that.

I want all of my devices with microphones to have a hardware-based kill switch for the mic; that's my phone, laptop, tablet, everything.

[+] groovybits|6 years ago|reply
Assuming $0.30/audio clip and a base wage of $10/hr, that equates to 33.3 audio clips/hr = 266.4 audio clips/day (over an 8-hour day) being monitored by any one 'language expert'.

However, Google does not specify how long a 'conversation' is. How many sentences make up a conversation? When is the cutoff point?

Google also says '1 in 500' conversations are monitored. That means for any one 'language expert', there are approx. 133,200 conversations/day that have a chance of being monitored.

So basically, you have a 0.2% chance that your conversation is being picked up by any particular 'language expert' per day.
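
The back-of-the-envelope arithmetic above can be reproduced directly; the per-clip pay, the $10/hr wage, and the 8-hour day are the commenter's assumptions, not published figures:

```python
pay_per_clip = 0.30     # dollars per clip (assumed)
hourly_wage = 10.0      # dollars per hour (assumed)
hours_per_day = 8       # assumed working day
sample_rate = 1 / 500   # Google's stated review rate

clips_per_hour = hourly_wage / pay_per_clip         # ~33.3
clips_per_day = clips_per_hour * hours_per_day      # ~266.7
# Conversations "behind" one reviewer's daily quota at a 1-in-500 sample.
# (The comment rounds to 33.3 clips/hr first, giving 266.4 and 133,200.)
conversations_per_day = clips_per_day / sample_rate  # ~133,333
```

The 0.2% figure is just the sampling rate itself: any given conversation has a 1/500 chance of entering the review pool at all.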

[+] Ensorceled|6 years ago|reply
The number of people in this thread who believe this is OK because 1) it's obviously the only way Google could train their voice system, and thus 2) people clearly knew what they were getting into, is horrifying.
[+] rchaud|6 years ago|reply
It's no coincidence that companies like Amazon market their Echos as "stocking stuffers" for the holiday season. I've wondered how Google Home and these "smart home" devices were always able to be priced as low as they are. Goes to show that paying for the product doesn't exempt you from still being part of the product.
[+] amacneil|6 years ago|reply
Serious question: how do people think the ML models for Home, Alexa, Siri, etc are trained, if not with human labeling?
[+] rosszurowski|6 years ago|reply
A bit tangential, but I tried sharing this link with a few friends on Facebook Messenger, and noticed it's blocked because it "violates Community Standards" [1]. Even shortened bit.ly links are blocked.

Anyone know why that would be the case? I'm trying to not assume malice (eg. maybe it got misflagged?) but it certainly feels like censoring and is yet another push for me to drop Messenger too.

[1]: https://i.imgur.com/9n1Hyqb.png

[+] bisRepetita|6 years ago|reply
What I'm interested in is not just knowing that employees are sometimes listening, and why.

I want to know what instructions both humans and computers are given if they hear illegal actions, such as violence, illicit trade, etc.

If you are an employee and hear a rape scene or a blackmailing dialog, do you have a duty to report, or to remain silent?

I also want to know how much access law enforcement has on this data. And whether they can re-identify the info, with or without a warrant.