Show HN: Scan your code to see where user data is going

2 points | jag729 | 3 years ago | github.com

Hey everyone! We’ve been working on a static code analysis tool that maps where user data flows at the code level and catches potential privacy violations. You can check it out here: https://github.com/monoid-privacy/monoid/tree/master/monoid-...

To run it from the CLI, use the Docker command in the README and point it at a local directory; the tool scans that directory and prints the detected user data sources, sinks, and paths.

In short, the tool converts code to a code property graph (CPG), extracts the sources and sinks from the CPG, and uses the variable/function names to determine whether the sources could contain user data and whether the sinks could be sensitive outputs (e.g. logs, DBs, analytics/marketing tools). The output is a list of potential user data variables (the scan covers everything from standalone variables to class attributes) and the outputs they eventually flow to (e.g. a "first_name" variable that makes its way to Segment).
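To make the idea concrete, here is a minimal sketch of name-based source/sink matching over a toy data-flow graph. It is not the tool's actual implementation: the keyword lists, the `matches` and `find_flows` helpers, and the edge dictionary are all hypothetical, and a real CPG carries far more structure than a dict of edges.

```python
import re
from collections import deque

# Hypothetical keyword lists; the post does not say which names the tool looks for.
USER_DATA_HINTS = {"first_name", "last_name", "email", "phone", "address"}
SINK_HINTS = {"log", "logger", "segment", "analytics", "db"}


def normalize(name):
    """Turn 'user.firstName' into '_user_first_name_' for whole-token matching."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)  # split camelCase
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", spaced.lower()) if p]
    return "_" + "_".join(parts) + "_"


def matches(name, hints):
    """True if any hint appears as a whole token sequence in the identifier."""
    norm = normalize(name)
    return any(f"_{hint}_" in norm for hint in hints)


def find_flows(edges):
    """Breadth-first search from each likely user-data source to likely sinks.

    `edges` maps an identifier to the identifiers its value flows into.
    Returns one path (a list of identifiers) per source/sink pair reached.
    """
    nodes = set(edges) | {dst for dsts in edges.values() for dst in dsts}
    flows = []
    for src in sorted(n for n in nodes if matches(n, USER_DATA_HINTS)):
        queue = deque([[src]])
        seen = {src}
        while queue:
            path = queue.popleft()
            for nxt in edges.get(path[-1], []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                if matches(nxt, SINK_HINTS):
                    flows.append(path + [nxt])  # stop at the first sink on this path
                else:
                    queue.append(path + [nxt])
    return flows


# Toy graph (made up for illustration): firstName ends up in a Segment call.
EXAMPLE_EDGES = {
    "user.firstName": ["payload"],
    "payload": ["segment.track", "catalog"],
    "order_total": ["logger.info"],
}

for flow in find_flows(EXAMPLE_EDGES):
    print(" -> ".join(flow))
```

Matching on whole tokens rather than substrings keeps a name like "catalog" from being flagged as a log sink, and "order_total" reaching the logger is ignored because nothing in its name suggests user data.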

The goal here is to “shift privacy left” and make it easier to find potential privacy headaches, like user data leaking into logs, earlier in the software lifecycle. The tool slots easily into CI/CD for privacy checks on every commit, and can also be run ad-hoc via the CLI.

This was also an exciting build from a technical perspective; OSS tooling around code graph generation and static analysis is sparse (though https://github.com/Fraunhofer-AISEC/cpg offers a great foundation), so we built out a lot of code property graph generation + manipulation logic from the ground up.

Feedback would be much appreciated!
