Ask HN: How do your ML teams version datasets and models?
69 points| skadamat | 2 years ago
The team found DVC very clunky (now I need git AND s3 AND dvc).
What best practices and patterns have you seen work or have you implemented yourself?
[+] [-] gschoeni|2 years ago|reply
Website: https://oxen.ai
Dev Docs: https://docs.oxen.ai
GitHub: https://github.com/Oxen-AI/oxen-release
Feel free to reach out on the repo issues if you run into anything!
[+] [-] axpy906|2 years ago|reply
[+] [-] zxexz|2 years ago|reply
But perhaps the best solution is to just use something like MLflow or WandB that handles this for you, if you use the API correctly!
[+] [-] axpy906|2 years ago|reply
At this point it’s a build or buy type deal for models using a model registry service.
Data versioning still feels unsolved to me.
[+] [-] plonk|2 years ago|reply
Models are then stored in an S3 bucket. But since the IDs are unique, they can be exchanged and cached and copied with next to no risk of confusion.
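The comment does not say how the unique IDs are generated. As one possible illustration only, here is a minimal sketch that assumes the ID is a SHA-256 hash of the model file's bytes; the function names `model_id` and `s3_key` and the `models/` prefix are hypothetical. With a content hash, two copies with the same ID are byte-identical, which is what makes caching and copying safe.

```python
import hashlib
from pathlib import Path


def model_id(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the file's bytes (assumed ID scheme)."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large model files are never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def s3_key(path: Path, prefix: str = "models") -> str:
    """Build an object key of the form <prefix>/<id>/<file name> (hypothetical layout)."""
    return f"{prefix}/{model_id(path)}/{path.name}"
```

Any other scheme that guarantees uniqueness (for example UUIDs assigned at training time) would support the same exchange-and-cache workflow, but would not let you verify a copy against its ID.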
[+] [-] axpy906|2 years ago|reply
[+] [-] janalsncm|2 years ago|reply
Don’t store your data in git, store your training code there and your data in s3. And you can add metadata to the bucket so you know what’s in there/how it was generated.
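A rough sketch of the metadata idea, assuming boto3 is installed and credentials are configured. Note that S3 user-defined metadata attaches to individual objects, not to the bucket itself. The metadata keys and the helper names `build_metadata` and `upload_with_metadata` are hypothetical, chosen for illustration.

```python
from datetime import datetime, timezone


def build_metadata(commit_id: str, command: str) -> dict:
    """Describe how a dataset was generated. S3 metadata values must be strings."""
    return {
        "git-commit": commit_id,      # commit of the code that produced the data
        "generated-by": command,      # command line that was run
        "created-at": datetime.now(timezone.utc).isoformat(),
    }


def upload_with_metadata(local_path: str, bucket: str, key: str, metadata: dict) -> None:
    """Upload a file and attach the metadata to the resulting S3 object."""
    import boto3  # third-party; imported here so the helper above works without it

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, key, ExtraArgs={"Metadata": metadata})
```

The metadata can later be read back with `head_object` on the same bucket and key, without downloading the data.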
[+] [-] thegginthesky|2 years ago|reply
We trained the whole team to:
- version the analysis/code with git
- save the data to the bucket s3://<project_name>/<commit_id>
- we wrote a small script that gets the commit id to build this path and uses boto3 to both access it and save to it
We normally work with zipped parquet files and model binaries, and we try to keep them together in the path mentioned above.
It's super easy and simple, has very few dependencies, and allows for rerunning the code with the data. If someone deviates from this standard, we always request a change to keep it tidy.
Keeping track of data is the same as keeping a clean git tree: it requires practice, a standard, and constant supervision from all.
This has saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.
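The team's actual script is not shown in the thread. A minimal sketch of what such a helper might look like, assuming the project name is the bucket name, the commit id is the key prefix, and boto3 is installed with credentials configured; all function names here are hypothetical.

```python
import subprocess


def current_commit_id(repo_dir: str = ".") -> str:
    """Return the full hash of the commit currently checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def data_key(commit_id: str, filename: str) -> str:
    """Object key inside the project bucket: <commit_id>/<filename>."""
    return f"{commit_id}/{filename}"


def save(local_path: str, project_name: str, filename: str, repo_dir: str = ".") -> str:
    """Upload a file to s3://<project_name>/<commit_id>/<filename> and return that URI."""
    import boto3  # third-party; imported lazily so the path helpers work without it

    key = data_key(current_commit_id(repo_dir), filename)
    boto3.client("s3").upload_file(local_path, project_name, key)
    return f"s3://{project_name}/{key}"


def load(local_path: str, project_name: str, commit_id: str, filename: str) -> None:
    """Download the file that was saved for a given commit."""
    import boto3

    boto3.client("s3").download_file(project_name, data_key(commit_id, filename), local_path)
```

`git rev-parse HEAD` only reflects committed work, so a scheme like this depends on committing before saving data; uncommitted changes would not be captured by the path.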
[+] [-] simonw|2 years ago|reply
[+] [-] herodoturtle|2 years ago|reply
https://mlflow.org/docs/latest/model-registry.html
[+] [-] john-shaffer|2 years ago|reply
Look for something with good algorithms. Xethub worked very well for me, and oxen looks like a good alternative. git-xet has a very nice feature that allows you to mount a repo over the network [0]
[0] https://about.xethub.com/blog/mount-part-1
[+] [-] john-shaffer|2 years ago|reply
[0] https://dvc.org/doc/user-guide/project-structure/configurati...
[+] [-] wingman-jr|2 years ago|reply
[+] [-] kvnhn|2 years ago|reply
In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].
[0]: https://github.com/kevin-hanselman/dud
[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...
[+] [-] vinni2|2 years ago|reply
[+] [-] smfjaw|2 years ago|reply
[+] [-] simon_acca|2 years ago|reply
[+] [-] m_niedoba|2 years ago|reply
https://www.anchorpoint.app/blog/version-control-using-git-a...
[+] [-] snovv_crash|2 years ago|reply
[+] [-] hhh|2 years ago|reply
[+] [-] mjhea0|2 years ago|reply
[+] [-] wendyshu|2 years ago|reply
https://github.com/dolthub/dolt
https://www.pachyderm.com/
[+] [-] prashp|2 years ago|reply
[+] [-] pjfin123|2 years ago|reply
[+] [-] speedgoose|2 years ago|reply
[+] [-] m_niedoba|2 years ago|reply
[+] [-] cuteboy19|2 years ago|reply