Facebook engineers: we have no idea where we keep all your personal data

[+] ctur|3 years ago|reply

In many ways you can think of large, long-living tech companies not unlike old cities like, say, London or Paris. The buildings and roads you see are built on top of older buildings and ruins. The streets are weirdly shaped and intersect at odd angles because they were made hundreds of years before and adapted over time as needs evolved. There are catacombs underneath sidewalks and no one genuinely understands it all nor does a single reference exist that explains it.

Literally everything predates everyone who lives there. Generations and generations of original designers, architects, and laborers have arrived, plied their trade, and moved away. There are people who are experts in certain parts, and who can build a new skyscraper at any given spot, but it is just layering and organic growth.

The emergent complexity of centuries of being lived in and adapted belies easy understanding.

Large tech companies are similar. You just can't understand how "it" all works. If you were to build it from scratch, perhaps you could, because it would be simpler and clearer, but nothing was made with the current state in mind. It evolved and adapted over time.

So reading this, I am not surprised. I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.

[+] kixiQu|3 years ago|reply

I work for a large long-living tech company. Where legal compliance and security are concerned, "it is just layering and organic growth" is not something you get to say. If we were speculating about such a thing on a coffee break, maybe we'd get the kind of answer given here, but if a single reference doesn't exist for something that is necessary for compliance, people get paged until it does.

(There may be other things taken as seriously I'm not thinking of, but compliance and security are the two I've seen drag people out of their beds by their ankles)

[+] adrianh|3 years ago|reply

This is a seductive notion, but it's indeed possible for a tech company to understand how a single piece of data goes through its systems.

It might take a while, and it might involve weeks of code spelunking and dozens of conversations with engineers — but it is indeed possible.

A codebase is not encumbered with the physical constaints of an old city. It's not like trying to figure out the physical state of a certain area of London 30 meters deep — which might be impossible without digging and therefore disrupting other structures in the area. A codebase is simply instructions, which are readable and understandable (with the exception of black-box machine-learning models, but those at least have defined inputs and outputs).

[+] markstos|3 years ago|reply

You can hire one or more data privacy people whose whole job it is to track this. They get an inventory of all the systems and talk to the managers of every system, find the answers and document them.

Meta is choosing not to know or pretending not to know. They have the resources to know where your data goes.

[+] CaptainTaboo|3 years ago|reply

It's not just the tech giants. I work for a company that is fairly dominant in its particular industry (If you're in the US, you've probably done business with one of its customers). Much of our growth has come through the acquisition of other companies, each of which gathers its own user and customer data. From my perspective in IT operations, I don't have any visibility into how we capture user and customer data, but I know we're big enough to track a user across the various products and websites, but still siloed enough that internally we still refer to the various business units as if they were separate organizations, each holding its own subset of the data. To sum up, this description of Facebook's internals doesn't surprise me at all.

[+] jollyllama|3 years ago|reply

To build on your metaphor though, there are cabbies with "the knowledge" in London, and probably similar folks in Paris. The way I have seen this work in tech is, you have to keep the people who know where the bodies are buried around for a long time, and others will use them as a resource to inform their efforts. The alternative, of letting them walk over the years, is that large sections of your stack will have "here be dragons" written over them and become no-go zones where your only option is to route not only engineering but business efforts around them.

[+] BiteCode_dev|3 years ago|reply

Maybe, but one can find out. You can follow the chain of calls of any software, and they have access to everything, as well as the whole history.

It's like when in an interview Buffets says it's very hard to know where all the money is in the financial system.

Sure, it's a complex system, but it's more of a matter of incentive.

[+] borbulon|3 years ago|reply

yeah "favela" is usually the metaphor I use, but it's the same general gist, if a little messier.

[+] rockinghigh|3 years ago|reply

> I think you'd get the same answer about many other aspects of data, code, system history, etc at any other 10+ yr old tech giant.

Processes to access personal data vary wildly among tech companies. At Apple, when a machine learning engineer wants to store and use data on the server side, they need to go through layers of approval from lawyers. Even when approved, it often comes with serious constraints about what can be done with the data. Meta is a lot more lax.

[+] weare138|3 years ago|reply

That's just a truism at any tech company. The issue is Facebook/Meta was misleading about what data is being gathered and the "Download Your Information" didn't actually show you what data FB had gathered about you. By their own admission they don't even know yet users and regulatory agencies were knowingly misled into believing that information was accurate.

[+] Jenk|3 years ago|reply

I agree with the point you are making, but I disagree with the conclusion.

A company like Meta that has been building and advancing the field of ML, AI, and even falling foul of the ethics/morality of large scale public manipulation and political campaigning has already demonstrated they have the wherewithal to find where their data exists. Period.

[+] slenk|3 years ago|reply

Not tech giants that care about compliance...it's not undoable.

They are just poor stewards.

[+] Rackedup|3 years ago|reply

> The buildings and roads you see are built on top of older buildings and ruins

It's easy to delete old data that is never accessed, so I don't get your point...

[+] TomSwirly|3 years ago|reply

What you appear to be saying is that large technology companies are designed from the ground up around flouting the laws about data retention.

Why is this OK?

[+] mistrial9|3 years ago|reply

this useful excuse-insight will be provided to all compliance personnel at the next staff meeting, in writing.

[+] dvfjsdhgfv|3 years ago|reply

As far as code goes, it's true for many companies. As for data, it was similar in many large European companies in the pre-GDPR era. Today, one crucial question is always asked: when you work with data, is this personal data? If it is, you need to deal it with a special way. Personal data is both an asset and a liability.

Most companies found a way do do it. It was a long and often very painful process, involving everything from data entry to backup processes and procedures, but somehow we managed to do it.

[+] liampulles|3 years ago|reply

To extend this wonderful metaphor, perhaps the least terrible solution is for the "city workers" to constantly log any issues as they come across them in their normal day-to-day work. It is the responsibility of the city then to proactively go and fix those issues as they come up.

[+] hkgjjgjfjfjfjf|3 years ago|reply

[deleted]

[+] hkgjjgjfjfjfjf|3 years ago|reply

[deleted]

[+] handity|3 years ago|reply

After reading some more of the transcript, I think the article does such a bad job of describing what was being asked for that it makes the court seem incompetent and Facebook actually rather reasonable. My original comment is below:

Government contracted construction workers: We Have No Idea Where Your Tax Money Goes

When tasked with answering the simple question "Which specific bricks did my tax money buy this year", the two veteran construction workers looked confused and tried to explain that that's not really how taxes work.

The special master at times seemed in disbelief, as when he questioned the engineers over whether any invoices existed for a particular road building contract. “Someone must have a receipt that says this is who the money came from that bought these bricks”

I'm not sure the analogy fits 100%, but it's the closest I could think of. This reads like the author thinks the Facebook algorithm is a human-readable decision tree that only takes into account the data of a single user at a time.

As usual, the problem is not data "collection" or "retention" or "privacy", but "creation". Regulation will always remain woefully inadequate to control such organizations, and the only solution is to adopt systems that don't spew user data everywhere in the first place.

[+] jtbayly|3 years ago|reply

That’s an absurd analogy that doesn’t fit at all.

FB uses the data it can’t find or tell you anything about to successfully sell ad space to third parties targeted back at you.

Let’s make it concrete.

We know FB keeps track of which webpages you visit. We also know they use that data as a way to help target ads. That data isn’t in the data export they give you, as far as I can tell from a brief search.

Is there data coming in about which Instagram profiles you follow and what pics you clicked and liked? Is there other data they keep track of? Undoubtedly.

I’m not surprised nobody knows the answer to what all it might be. But pretending that data is fungible like money is just misdirection.

[+] twanvl|3 years ago|reply

Money is fungible, personal data is not.

[+] anigbrowl|3 years ago|reply

Government contracted construction workers: We Have No Idea Where Your Tax Money Goes

I think it would be a more meaningful analogy if you had posited two government accountants.

[+] jerf|3 years ago|reply

Kind of serves as a good example of "single central conspiracy" versus "many actors given common cause to act in a certain way" ways of thought. It's easy to imagine Zuckerberg going to work every day and laughing maniacally as he personally shepherds your stolen conversation with friends yesterday about your experiences with laundry detergents into The Facebook Info Vault and then personally uses that information to send you ads about Tide laundry detergent, but what Facebook really is by sheer necessity is a whole bunch of agents operating with their own goals. The end result is a massive machine that turns privacy violations into money, but there isn't necessarily a single place where the bad thing happens. In fact you could conceivably be taken on a tour of the system and agree that every individual component is acceptable, or that the vast bulk of them are OK and there's only a single-digit number out of thousands that are problematic.

Nobody could possibly manage Facebook as a centralized single entity, but it's hard to imagine it any other way from the outside.

[+] jkingsbery|3 years ago|reply

While I appreciate this distinction and is an interesting way to look at things, I don't think it's inevitable. I've never worked at Facebook, but I work for another large tech company. Every project at our company, particularly the ones that deal with personal data, must produce a threat model [1] before going into production. In another thread, someone claims "Well that's because the question is meaningless," but from a threat modeling standpoint an engineer needs to understand all aspects of "where" data is for the system under design. Where physically is it stored? What database is it stored in? And knowing that, one must look at what threats are available. Who has access to the data? What other systems have access to the data? How are those users/systems authenticated? Is that access logged?

Now, sometimes systems diverge from the design, and sometimes threat models are incomplete. But the exercise of generating a threat model makes understanding who manages data more manageable.

[1]: https://en.wikipedia.org/wiki/Threat_model

[+] lupire|3 years ago|reply

Not even conspiracy. Governments, free markert, natural world all works the same way. Many semi-independent actors making local optimizations, with varying levels of sophistication in organization and hierarchy. Many emergent properties. IT in the large is like physics, chemistry, biology, psychology anthropology, and sociology. It's not building one precise bridge over and over again.

[+] personjerry|3 years ago|reply

Well that's because the question is meaningless - What does it mean "all your personal data"? What does "where" mean? Physically? Which tables? Which data is relevant? Friends? Friends of friends? Ad data? Behavioural data? ANY inferences, models built on user datasets that might include that user?

They don't even know how to ask the right questions to get to what they want to know.

Context: I worked at Facebook

[+] PeterisP|3 years ago|reply

Yes. All of that. That is the right question, the very first question as a starting point. A basic requirement from a company is to provide an exhaustive, true, up-to-date list of all of these - and if they can't, they should not be permitted to handle private data.

[+] zo1|3 years ago|reply

I just started reading the transcript. They asked a lot of questions trying to figure out details about various products.

It's 850 pages long, and the few spots I jumped to the guy kept answering with "I don't know", "I'm not familiar with this", etc. Even the parts he says "I'm familiar with this service", the guy goes on to answer "I don't know more than what's written there". The guy couldn't even answer if certain databases were accessible by third parties (not implying they were).

This must have been painful to be a part of. 100 variations of "I can neither confirm nor deny".

[+] piva00|3 years ago|reply

All your personal data in the article references data generated by inference from personal data, it was pretty clear in some paragraphs and quotes.

The where part is also alluded to as to mean "systems" from what I gathered.

[+] gthrone|3 years ago|reply

Ya the way he responds saying a full team would be needed to uncover this sounds like he was treating "where" holistically. Every bit of information in every server, application, etc. Feels like a trained response.

[+] kwertyoowiyop|3 years ago|reply

Sure, and what does “is” mean exactly? And what really is “data”? Their answers seem like passive-aggressive BS. If their lives or careers depended on it, they’d come up with a lot of answers PDQ.

[+] seanhunter|3 years ago|reply

Companies which deal with personal data do so in the context of regulatory frameworks (CCPA, GDPR etc) which define specifically what the definition is and what’s covered. They also have to record what they collect and process and for what purpose.

If you’re Facebook you have to know this in order to process that data. You can’t pretend not to understand the question.

[+] snowwrestler|3 years ago|reply

There is a difference between what is known and what is knowable.

For example if you asked me now if I know where all the receipts are for my tax-deductible donations, I would say no, I don’t know. Why? There is no reason for me to know right now.

But if the IRS told me I was being audited and must produce those records or go to jail, I would find them. And I have confidence that I could, because I know that in general, those are the sorts of records I keep (somewhere).

Most Facebook employees do not need to know “where personal data is” to operate their business, so they don’t know that. But at the same time, Facebook could not operate if the data was “nowhere” (destroyed), or if it was randomly distributed with no queryable structure.

So the question is not really what they know now, it is: what incentive do they have to go find out.

[+] ridgered4|3 years ago|reply

I recall when Windows 10 came out some company had demanded to know all of the things it sent back to Microsoft over the wire. Microsoft had some guy basically run wireshark and a bunch of network scans while using it and put that in a report. It blew my mind because it implied nobody at Microsoft actually knew what it was collecting and when anymore than your average security researcher, possibly because there are so many sub-entities in the company hoovering it up for different purposes.

[+] kornhole|3 years ago|reply

Because I suspected this mess, I took a less expedient path to deleting my account. I found a picture of somebody with my same name and made it my picture. I unfriended everybody I actually knew and only kept randos. I liked a bunch of random stuff. I did not actually delete my account until a year later to let the chaotic data propagate into all the subsystems, backups, and partner networks.

[+] LinuxBender|3 years ago|reply

This article reminds me of past audit experiences. It is important to know who to send into the conference room to talk to the auditor. The wrong people can answer too many questions or volunteer the wrong information, leading to more questions and going deeper down the rabbit holes. I suppose the same could apply in this case. It's just a gut feeling.

[+] Delphiza|3 years ago|reply

Many here on HN are saying that such a record of what tech companies do with data is a) impossible because complex systems have evolved over time and the (historical) flows of data are unknown and b) unnecessary because the need to know what happens to data is not relevant to the user or the questioner. The old models of processing data at scale to extract value, as FAANG has been doing, are eventually going to come to an end. Maybe not where there is a power imbalance, such as between Facebook and their users, but at least where users of services have more clout and are paying for the product. I see this in B2B IoT solutions where big customers are very picky about how telemetry is collected by product vendors and, if not pushing back hard, are at least choosing not to use services that are not clear on how data is processed and handled.

The amount of data that we all see, every day, that is grossly mishandled may signal the end of the ML and AI goldrush. You can only build models, and run data through models, with the consent of the owner of that data. Large producers of data (think vehicle fleet operators) are beginning to take ownership of their data, and are only _licensing_ it to processors for very specific purposes. In the example of vehicle fleet operators, they may only want route planning, and not have their data used to sell them tyres based on mileage. Also, while governments may be busy with other stuff currently, at some point they may decide to turn on the regulatory screws.

[+] 0xbadcafebee|3 years ago|reply

"The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript."

Oh you sweet summer child.

[+] mathgladiator|3 years ago|reply

I believe we need a privacy centered view of technology. Even a traditional database creates a spiderweb of 'where is your personal data spread across so many tables', and I hope there is a reckoning against this approach.

I have considered pivoting my service into a privacy respecting data store ( https://www.adama-platform.com/ ), but I've yet to meet anyone that actually cares about user privacy to rethink their data layer.

[+] citizenpaul|3 years ago|reply

This is misdirection plain and simple. In the best most convincing way. Get tunnel vision (sorry) nerds to make an honest display of their ignorance of the bigger picture at Meta. Aww shucks we don't know where that data comes from....

Of course senior engineers don't know where the data comes from exactly. Does your mechanic know where each tire comes from or care? Does your fav restaurant chef know each field that a produce came from? The suppliers know that info.

I know beyond any doubt that there are most definitively some people that know at least roughly where most data about most subjects are at Meta. I'm sure the data is a vastly beyond what a human can process. However some people there know where to go looking. The truth is in the Piles of money they generate by selling that data.

Awww gee wizz your honor. We have no idea how we do it. People just keep throwing piles of money at us so we keep taking it. I don't know why they do it. We definitely don't have mountains of highly specialized data on our users that we tell them we can get exactly what they want then provide that info to them. Somehow though when the court asks without piles of money we just can't make that magic happen...........

[+] tru3_power|3 years ago|reply

There seems to be a pretty concerted effort to paint Meta as worse than the other big ad-tech companies. Maybe it’s just the contrarian in me, but I’ve noticed a way big uptick in these stories over the year.

[+] pavlov|3 years ago|reply

> The systemic fogginess of Facebook’s data storage made answering even the most basic question futile. At another point, the special master asked how one could find out which systems actually contain user data that was created through machine inference.

“I don’t know,” answered Zarashaw. “It’s a rather difficult conundrum.”

This is not a basic question because of how ML works. Can we say a system contains my data if my data was used as a training input to a neural network at some point in the past?

In that case, can I sue anyone using Stable Diffusion for stealing my data because the billions of images in its training set included something I created?

[+] salawat|3 years ago|reply

I would not be surprised at all. I can count on one hand the number of Engineering candidates that given a system description involving three visible networked laptops in view of the camera have answered "where is the data, right now?" , or can at least give a decent stab at "trace the datapath through the machine".

I can't explain somehow this is the average candidate, yet somehow... Life goes on.

[+] notacoward|3 years ago|reply

The basic problem is one of tracking (or not tracking) copies of data. Your information at Facebook is stored primarily on several storage systems, but extra copies might exist in backups, data-analysis pipelines, logging systems, etc. At the time I left (two years ago), there was a cutely-named project well under way to ensure deletion of those primary copies and (IIRC) backups. Fragments still in data-analysis systems might still exist, but they're also less personally identifying. I don't recall the state of anonymization and provenance tracking that would allow even these remnants to be found and purged for certain. So we're basically talking about two different questions.

(1) Is it possible to be sure that the primary copies and backups are gone, so that finding anything that's left would require some very specialized knowledge and/or an infeasibly massive scan? I believe the answer to this is probably yes at this point.

(2) Is it possible to be sure that absolutely every last vestige of the person's time on Facebook is gone? I believe the answer to this is still probably no, and likely to remain so for some time. At the very last, some artifacts will remain in those opaque AI models.

I suspect the same two answers exist at many companies. What is HN's data retention policy? Oh, oops, none of my data would be deleted if I left. I suspect that Google and/or Apple, maybe Microsoft as well, are a lot closer to "yes" on that second question, but even then I suspect gaps appear from time to time.

I say this not to condemn or defend anyone, and I know companies under stricter regulatory regimes can give more definite answers. It's just the state of the art at Big Tech companies as I understand it.

[+] canoebuilder|3 years ago|reply

What are we talking about here? Data associated with specific users?

Well in order for your grandma to log into Facebook her user account must have a primary key associated with it so she sees her info when she logs in and not someone else’s.

We are talking about computers and databases. When did using a computer to search a database become a difficult, nigh impossible thing to do?

Even if design documents and flow charts or whatever don’t exist could they not fairly straightforwardly be reverse engineered by taking a sample of users and searching all databases for associated information?

This seems like a transparent ploy on the part of Facebook to avoid regulation by casting the perfectly doable, searching a database, into some sort of incomprehensible impossible task. The credulous author of the article and many comments here seem to be strangely buying into it.

FB investors don’t seem to have lost faith in the company’s ability to search databases. When it comes to making the company money from those searches, no problem at all. But regarding lawsuits or potential regulation, we can only throw our hands into the air and simply wonder and gasp at this incomprehensible mess we have made.

[+] ctide|3 years ago|reply

It became difficult when your 'database' turned into exabytes of random files in s3.

[+] unknown|3 years ago|reply

[deleted]

[+] ElijahLynn|3 years ago|reply

The special master at times seemed in disbelief, as when he questioned the engineers over whether any documentation existed for a particular Facebook subsystem. “Someone must have a diagram that says this is where this data is stored,” he said, according to the transcript. Zarashaw responded: “We have a somewhat strange engineering culture compared to most where we don’t generate a lot of artifacts during the engineering process. Effectively the code is its own design document often.” He quickly added, “For what it’s worth, this is terrifying to me when I first joined as well.”

“We do not have an adequate level of control and explainability over how our systems use data, and thus we can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose,’” the 2021 document read.

[+] egberts1|3 years ago|reply

Not surprising that the same can be said for a certain US government department having hosted a personal mail server in a bathroom closet and trafficking classified materials and caught much later by OIG.

Or that one time I found a dialup modem under the raised floor and attached to the Equifax (then TRW) mainframe.

Or catching someone inserting a USB stick into a PC inside a secured white-lab area.

Absolutely crazy times for large entities.

128 comments