(no title)
nathankleyn | 7 years ago
What you see today in this project is really a means to scratch an itch we had - mainly to quickly and easily sample/obfuscate some delimited data in a way that is "good enough" for demonstrating a visualisation tool without using the original dataset. It's important to note that we still intend to use this data within a secure environment.
This tool is absolutely not up to the task of anonymising a dataset well enough for it to be made public. For us, it's about risk management vs effort: from a security perspective there are scenarios where we can use samples of data that have gone through this process and substantially decrease the risk of holding data in multiple places, without significant effort. If we were to go on to make any of these datasets public, we'd be looking for a better suited tool.
As a result, tools like ARX are not something we really want to compete with - they're aiming for a complete solution whereby the results are good enough to potentially make public. It perhaps goes without saying that whether this goal is achievable is debatable given the research you linked, but some people might be comfortable with those risks.
One thing we've done to try and bridge the gap a bit is to make it really easy to add new functions as we need them, and I think we can get to a point whereby for a good portion of use-cases this tool is good enough (for example, making datasets you can use in a development environment that are representative, but a manageable size and anonymised to a reasonable degree).
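To make the "easy to add new functions" idea concrete, here is a minimal sketch in Python. It is not this project's actual code or API: the `FUNCTIONS` registry, the `register` decorator, the `hash` and `redact` functions and the `process` helper are all hypothetical names chosen for illustration. It assumes the input is comma-delimited text with a header row.

```python
import csv
import hashlib
import io
import random

# Hypothetical registry mapping a function name to a per-value transform.
FUNCTIONS = {}


def register(name):
    """Decorator that adds a transform to the registry under `name`."""
    def decorator(fn):
        FUNCTIONS[name] = fn
        return fn
    return decorator


@register("hash")
def hash_value(value):
    # Truncated SHA-256: obscures the value but is NOT strong anonymisation.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


@register("redact")
def redact(value):
    # Replace every character with an asterisk, keeping only the length.
    return "*" * len(value)


def process(text, column_functions, sample_rate=1.0, seed=0):
    """Sample rows at roughly `sample_rate` and apply the named function
    to each configured column. Returns the result as delimited text."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if rng.random() >= sample_rate:
            continue  # row not selected for the sample
        for column, fn_name in column_functions.items():
            row[column] = FUNCTIONS[fn_name](row[column])
        writer.writerow(row)
    return out.getvalue()
```

With this shape, adding a new obfuscation is just one more decorated function, and a call such as `process(data, {"name": "redact"}, sample_rate=0.1)` picks it up by name. As above, output like this is only "good enough" for lower-risk internal use, not for public release.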
We'll also try to add something to the README addressing this exact question, as it's one I anticipate we're going to get asked a lot. Thanks for the constructive line of questioning - it really will help us, and the people who choose to use this tool, make a decision that's right for them and their use-cases.
kevin_nisbet | 7 years ago