top | item 32047535

Replibyte – Seed your database with real data

222 points | evoxmusic | 3 years ago | github.com

22 comments


ff7c11|3 years ago

Trying to think how to anonymise datetimes hurts my head. You might want to randomise the date of an event. But you also need this random date to be consistent with respect to both the current time and the order of other related rows in the database.
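One way to square those constraints is a deterministic per-entity time shift: every row belonging to the same entity moves by the same amount, so intra-entity ordering and the gaps between related rows survive, and shifting only backwards keeps timestamps consistent with the current time. A minimal sketch (all names here are illustrative, not from Replibyte):

```python
# Sketch: anonymise event timestamps while preserving per-entity ordering
# and "never in the future". Assumes rows like (entity_id, event_time)
# with original timestamps that are not in the future.
import hashlib
from datetime import datetime, timedelta, timezone
from typing import Optional

SECRET = b"rotate-me"  # kept out of the anonymised dataset


def entity_offset(entity_id: str, max_days: int = 30) -> timedelta:
    """Deterministic pseudo-random shift per entity: every row for the
    same entity moves by the same amount, so ordering and gaps between
    related rows are preserved exactly."""
    digest = hashlib.sha256(SECRET + entity_id.encode()).digest()
    seconds = int.from_bytes(digest[:8], "big") % (max_days * 86400)
    return timedelta(seconds=seconds)


def anonymise_time(entity_id: str, ts: datetime,
                   now: Optional[datetime] = None) -> datetime:
    now = now or datetime.now(timezone.utc)
    shifted = ts - entity_offset(entity_id)  # shift backwards only
    return min(shifted, now)                 # stay consistent with "now"
```

The catch is cross-entity relationships: rows for different entities get different offsets, so ordering between unrelated entities is scrambled, which may or may not be acceptable.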

lstamour|3 years ago

The answer is always “it depends,” but I think if a datetime is a UTC timestamp, such as a record of when an event happened, then with random sampling it shouldn’t matter? It’s just a timestamp. The information it contains might include location, might include timing relative to other events, could be correlated, but… on its own? It doesn’t need anonymization. Likewise, the sequence of events should be safe to use.

I get that you can look up or de-anonymize an event by its timestamp and the same is true of ID numbers. But it’s worse for ID numbers because these are often permanent and re-used for multiple events.

But yeah, the risk in anonymized data is that it’s never truly both anonymous and useful. Truly anonymous data might be considered junk or random data.

Anonymized data exists to fulfil some purpose. Perhaps “realistic” analytics are required, or you want to troubleshoot a production issue without revealing to engineers who did what. So you anonymize the fields they shouldn’t see, and create a subset of data that reproduces the issue…?

Anonymized data is almost always a bad approach compared to generating data from algorithmic or random sources, but sometimes we need anonymized or restricted data to start that process.

bennyp101|3 years ago

How does it keep personal data safe? I had a look at “how it works” and “faqs” but they don’t answer how you keep stuff safe? It also gets uploaded to S3?

I might have missed it, but I need to know exactly where our PII is stored (so not on a dev laptop), how do you know what to replace and what do you do with any info you do replace?

Edit: To answer my own question, via transformers. But that seems to suggest each dev has to keep it up to date with any schema changes, etc.

(Also some links are broken on GitHub)

crummy|3 years ago

The user tells it what fields need replacing with the yaml config.
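For reference, the config is a YAML file mapping tables and columns to built-in transformers; roughly along these lines (adapted from the README at the time — field names may have drifted, so check the docs):

```yaml
source:
  connection_uri: $DATABASE_URL
  transformers:
    - database: public
      table: customers
      columns:
        - name: first_name
          transformer_name: first-name
        - name: email
          transformer_name: email
datastore:
  aws:
    bucket: $BUCKET_NAME
    region: us-east-2
```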

ev0xmusic|3 years ago

Hi, author of Replibyte here :)

Yes, transformers are the way to go. I plan to add a way to detect schema changes and, at the very least, avoid creating a dump when the schema has changed. I don't think it can be done safely without a human admin check.

(Thank you for your PR)

pistoriusp|3 years ago

You may want to check out Snaplet at https://docs.snaplet.dev. I'm the co-founder, but we're not open-source (yet). Our goal is to give developers a database, and data, that they can code against.

We identify PII by introspecting your database, suggest fields to transform, and provide a JavaScript runtime for writing transformations.

Besides transforming data, you can reduce, and generate data. We are most excited about data-generation!

The configuration lives in your repository, and you can capture the snapshots in GitHub Actions. So you get "gitops workflow" for data.

A typical git-ops workflow:

  1. Add a schema migration for a new column. 
  2. Add a JS function to generate new data for that column.
  3. Add code to use the new column.
  4. Later, once you have data, use the same function to transform the original value. (Or just keep generating it.)
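The trick in steps 2 and 4 — one function serving as both generator and transformer — could be sketched like this (hypothetical names, and in Python for brevity; Snaplet itself uses a JavaScript runtime):

```python
# Sketch: one function that generates a value for a brand-new column and,
# later, deterministically transforms captured real values for it.
import hashlib
from typing import Optional


def nickname(row_id: str, original: Optional[str] = None) -> str:
    """Derive a fake nickname for a row. When an original value exists it
    is folded into the hash, so transforming real data and generating
    fresh data share one code path."""
    seed = f"{row_id}:{original or ''}".encode()
    n = int.from_bytes(hashlib.sha256(seed).digest()[:4], "big")
    return f"user_{n % 100000}"
```

Because the output depends only on its inputs, rerunning a snapshot capture yields the same transformed values, which keeps snapshots diffable.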

roskilli|3 years ago

One feature I’d love to see is a transformer that instead of providing a random value provides a cryptographic one way hash of the data (ie sha2) - that way key uniqueness stays the same (to avoid unique constraints on columns) and also the same value used in one place will match another value in another table after transformation which more accurately reflects the “shape” of the data.
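The transformer described above could be sketched as a keyed one-way hash (HMAC rather than bare SHA-2, so the key blocks simple dictionary attacks on low-entropy fields like phone numbers — names here are illustrative):

```python
# Sketch: deterministic pseudonymisation. Equal inputs map to equal
# outputs, so unique constraints and cross-table joins survive the
# transformation; the keyed hash is still one-way.
import hashlib
import hmac

KEY = b"per-environment secret, never shipped with the dump"


def pseudonymise(value: str, length: int = 16) -> str:
    digest = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()
    return digest[:length]
```

Truncating the digest keeps columns short while leaving collisions vanishingly unlikely at 16 hex characters.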

MadsRC|3 years ago

This will not work, at least not if we’re talking about PII as defined by Somewhat Sane (TM) privacy legislation.

Sure, passwords and credit card info are obscured with your methodology, but names, dates of birth, sexual orientation, telephone numbers, email addresses, and IPs will remain unique. This uniqueness is what allows you to potentially identify a person, given enough data.

BobbyJo|3 years ago

I hate to be so self-promoting (I swear I'm just trying to be helpful), but Gretel has that as a transformer you can use[0]. You can test out a lot of our stuff without payment info through our console[1] if you just want to mess around and see whether tools like it (and Replibyte, of course :) ) would fit your use case. That said, you can run into issues using direct transforms like this, depending on the correlated data, because of various known deanonymization attacks. There are some pretty gnarly examples out there if you Google around.

[0]https://docs.gretel.ai/gretel.ai/transforms/transforms-model...

[1]https://console.gretel.cloud/login

ev0xmusic|3 years ago

Hi, author of Replibyte here. Feel free to open an issue and explain your use case. I'd be happy to work out a solution with the community.

dopidopHN|3 years ago

The default seems to be to store the sanitized dump on S3.

That’s not always available in a professional context, or it might be considered data extraction.

Keeping everything local and detailing exactly what goes where and how would be helpful.

Svarto|3 years ago

Also, it would be good to be able to run everything without uploading to S3. As a smaller-time dev with projects in production, I would find this really interesting for debugging production database data in development. Uploading it to S3 would needlessly complicate things for me (even though I can understand that enterprise customers might prefer it that way).

CSSer|3 years ago

I think the description in the man entry is better than the one in the README. Other than that, cool tool!