thegginthesky | 2 years ago
We trained the whole team to:
- version the analysis/code with git
- save the data to the bucket s3://<project_name>/<commit_id>
- use a small helper we wrote that gets the commit id to build this path and uses boto3 to both read from it and write to it
We normally work with zipped parquet files and model binaries, and we try to keep them together under the path mentioned above.
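For anyone wanting to try this, here is a rough sketch of what such a helper could look like. It is not our actual code: the function names (`current_commit_id`, `object_key`, `save_artifact`, `load_artifact`) are made up for illustration, and it assumes the bucket is named after the project and that AWS credentials are already configured for boto3.

```python
import subprocess
from pathlib import Path


def current_commit_id(repo_dir: str = ".") -> str:
    """Return the full hash of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if repo_dir is not a git repository
    )
    return result.stdout.strip()


def object_key(commit_id: str, filename: str) -> str:
    """Build the key inside the project bucket: <commit_id>/<file name>."""
    return f"{commit_id}/{Path(filename).name}"


def save_artifact(bucket: str, local_path: str, repo_dir: str = ".") -> str:
    """Upload a local file (parquet, model binary, ...) under the current commit."""
    import boto3  # imported lazily so the path helpers work without boto3

    key = object_key(current_commit_id(repo_dir), local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def load_artifact(bucket: str, commit_id: str, filename: str, dest_dir: str = ".") -> str:
    """Download the file that was saved for a given commit."""
    import boto3

    dest = Path(dest_dir) / Path(filename).name
    boto3.client("s3").download_file(bucket, object_key(commit_id, filename), str(dest))
    return str(dest)
```

Rerunning an old analysis then comes down to checking out the commit and calling `load_artifact` with that same commit id. Note that the commit id only describes committed code, so uncommitted changes in the working tree would not be reflected in the path.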
It's super simple, has very few dependencies, and allows for rerunning the code with the data. If someone deviates from this standard, we always request a change to keep it tidy.
Keeping track of data is much like keeping a clean git tree: it requires practice, a standard, and constant supervision from everyone.
This has saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.
simonw|2 years ago