Ask HN: How to run analytics on data without access to the data?
42 points| michealr | 5 years ago
The solution I envision: a trusted third party takes my analysis script, runs it, and returns the report, and that is it. I never see the underlying data and receive only a one-time token to access it.
I know it will never be a hundred percent leak-proof, and there is still a level of user trust involved. I realise that. But thinking conceptually, is there any existing service out there that does such a thing or attempts to offer something similar? Or what would an alternative approach look like?
BelenusMordred|5 years ago
A slow leaking ship will still sink. Attempts so far to anonymise public datasets have been terrible and turned into a garbage fire by attackers every time with minimal effort. Don't hand out false promises.
Guess you are looking for fully homomorphic encryption. A long-outstanding problem with lots of smart people working on it, some are doing ok at getting there.
https://github.com/ibm/fhe-toolkit-linux
dataewan|5 years ago
https://en.wikipedia.org/wiki/Differential_privacy
Agree that strong guarantees about privacy aren't achievable.
michealr|5 years ago
Very cool, I had read about homomorphic systems. Has there been a successful SaaS-like offering built on fully homomorphic systems? Or do you think it's still in the research-oriented phase?
cipherboy|5 years ago
The benefit being that while you can run any computation with an FHE, PHEs are generally faster.
IIRC Microsoft was also doing research on PHEs.
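For anyone wondering what a PHE looks like in practice, here is a toy sketch of Paillier, a well-known additively homomorphic scheme. The choice of Paillier, the tiny hard-coded primes, and all function names are assumptions for illustration only. Nothing here comes from the thread, and it is nowhere near secure; a real deployment would use a vetted library and large keys.

```python
import math
import secrets


def keygen(p=1789, q=1861):
    """Toy Paillier key generation. p and q are tiny demo primes (NOT secure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1                  # standard simple choice of generator
    mu = pow(lam, -1, n)       # modular inverse (Python 3.8+); valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(pub, m):
    """Encrypt an integer m with 0 <= m < n."""
    n, g = pub
    n2 = n * n
    while True:                # pick random r in [1, n) coprime to n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pub, priv, c):
    n, _ = pub
    lam, mu = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(pub, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts (mod n)."""
    n, _ = pub
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    pub, priv = keygen()
    total = add_encrypted(pub, encrypt(pub, 1200), encrypt(pub, 345))
    print(decrypt(pub, priv, total))
```

The party doing the addition only ever handles ciphertexts and the public key; only the key holder can read the sum. That is the whole trick, and also the limit: this scheme gives you addition, not arbitrary computation.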
meowface|5 years ago
The homomorphic encryption approach probably isn't worth the effort. There's always going to be a trade-off between doing something useful and sufficiently/securely obfuscating/anonymizing the data. So I'd recommend the local approach, with a prominent explanation of how you don't and can't see any of the data.
hunter2_|5 years ago
The problem is, why would end users trust the third party more than the analytics developer? Are there companies that specialize in being this third party and have amassed mutual trust of the general public (akin to a notary public) for handling data and code without leaking either?
satyrnein|5 years ago
franky47|5 years ago
[1] https://chiffre.io
dumbfounder|5 years ago
I am imagining you download the "container", put the data in, encrypt the container with the data inside, and have that run anywhere.
But I have no idea if that is possible.
jhoechtl|5 years ago
stelfer|5 years ago
[1] https://github.com/Google/private-join-and-compute
rjmunro|5 years ago
gopty|5 years ago
Syzygies|5 years ago
The stakes are lower when money, not privacy, is at risk. I have attempted to argue for years that the MathSciNet catalog of the mathematical literature should be open to all forms of machine learning and mind mapping software experiments. It remains a cash cow for the American Mathematical Society, and they're fiercely proud of its human curation by 19th century methods. Meanwhile, mathematicians continue to believe that math remains separated into tribes, with number theorists lobbying to hire their own at departmental meetings. The true connections between ideas defy these ancient categories. I see a generation of potential advances squandered by not letting third-party tools in to study MathSciNet.
The right ideas could help here. One isn't protecting individual privacy, just a cash cow. The bar is lower.
syats|5 years ago
One idea would be:
1. distribute to the data owners a base system (something that can "run" stuff on their premises). People here have mentioned browsers, but for more intensive processing this might not be enough, so think of a docker daemon, keys for some docker registries, etc.
2. have a trusted "app store" (e.g. a docker registry where images are built in a reproducible manner from code which is inspected and certified, and then are cryptographically signed)
3. make a well described interface to the apps to consume the data (thinking of the general use case here.. if you just want to analyze fb info then you can make an adhoc parser...)
4. Have the data owner download, check the signature of, configure and run the app on their premises.
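A rough sketch of the check in step 4, as it might look on the data owner's side. This is a simplification: it pins SHA-256 digests from a trusted list instead of verifying a real public-key signature, and the names `digest`, `is_certified` and `certified_digests` are hypothetical, not part of any registry's API.

```python
import hashlib
import hmac


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of a downloaded app artifact."""
    return hashlib.sha256(artifact).hexdigest()


def is_certified(artifact: bytes, certified_digests) -> bool:
    """True only if the artifact's digest is on the trusted list.

    certified_digests is assumed to come from the trusted "app store"
    over an authenticated channel (hypothetical; a real setup would
    verify a cryptographic signature over this list or over the image).
    """
    actual = digest(artifact)
    return any(hmac.compare_digest(actual, d) for d in certified_digests)


if __name__ == "__main__":
    app = b"print('aggregate report only')"
    trusted = {digest(app)}
    print(is_certified(app, trusted))              # the untouched app
    print(is_certified(app + b" # evil", trusted)) # a tampered copy
```

The data owner would refuse to run anything for which `is_certified` returns False. The reproducible-build requirement in step 2 is what makes a pinned digest meaningful: anyone can rebuild from the inspected source and confirm they get the same bytes.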
Things get even more interesting when the analytics need data from different non-trusting partners, so that homomorphic encryption becomes necessary.
There is at least one specification that aims at supporting all of this: https://www.internationaldataspaces.org/wp-content/uploads/2... although implementation is, so far, lagging behind.
alfl|5 years ago
Shoot us a note -- would love to hear more details.
[0]: https://proofzero.io
amai|5 years ago
https://federated.withgoogle.com/ https://en.wikipedia.org/wiki/Federated_learning https://github.com/poga/awesome-federated-learning
cedricd|5 years ago
Assuming data is in a standard format then you can share your script for people to run themselves. Obviously this is fairly difficult in practice unless you can bundle everything into a client-side script on a website.
For reference Narrator [1] does this -- it puts data into a standard format so that analyses written for one company can be run for another. I'm not suggesting you build your stuff on that platform, but it's an interesting approach that does exist.
[1] https://www.narrator.ai
jedimastert|5 years ago
I'm sure there's some sort of homomorphic encryption[0] magic scheme that might let you process the data on other servers or something, but I could not even begin to tell you how. Really, it's just trust.
brian_spiering|5 years ago
lmkg|5 years ago
Quick summary of important results: You will always leak a small amount of information. But it is possible to bound this leak to whatever level you consider "acceptable." The trade-off is statistical validity of the results (the usual approach adds "noise" to the data and/or analysis).
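To make the noise idea concrete, here is a minimal sketch of the Laplace mechanism applied to a count query. It assumes a single query with sensitivity 1 (adding or removing one person changes the count by at most 1); the function name `private_count` and the sample data are made up for illustration.

```python
import random


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of items matching predicate.

    Adds Laplace noise with scale 1/epsilon, which is the standard
    calibration for a query whose sensitivity is 1.
    Smaller epsilon = more privacy = noisier answer.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 67, 29, 38]  # made-up data
    for eps in (0.1, 1.0, 10.0):
        print(eps, private_count(ages, lambda a: a >= 40, eps))
```

Running it shows the trade-off directly: at epsilon 0.1 the answer is often off by several units, at epsilon 10 it is close to the true count. Note that every additional query spends more of the privacy budget, which this sketch does not track.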
JosephRedfern|5 years ago
michealr|5 years ago
gostsamo|5 years ago
tjanez|5 years ago
jhoechtl|5 years ago
michealr|5 years ago
sgt101|5 years ago