
Process, git and S3.

We trained the whole team to:

- version the analysis/code with git
- save the data to the bucket s3://<project_name>/<commit_id>
- use a small script we wrote that gets the commit ID to build this path, and uses boto3 to both read from it and write to it

We normally work with zipped parquet files and model binaries, and we try to keep them together under the path mentioned above.

It's super easy and simple, has very few dependencies, and allows for rerunning the code with the exact data it was run on. If someone deviates from this standard, we will always request a change to keep it tidy.

Keeping track of data is much like keeping a clean git tree: it requires practice, a standard, and constant supervision from everyone.

This has saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.


simonw|2 years ago

I really like the idea of using the commit ID as the bucket prefix for the associated files.