top | item 37694701

Ask HN: How do your ML teams version datasets and models?

69 points | skadamat | 2 years ago

Git worked until we hit a few gigabytes. S3 scales super well, but version control, documentation, and change management aren't great (we just used lots of "v1" or "vsep28_2023" names).

The team felt DVC was very clunky (now I need git AND s3 AND dvc).

What best practices and patterns have you seen work or have you implemented yourself?

33 comments

[+] gschoeni|2 years ago|reply
We have been working on an open source tool called "Oxen" that aims to tackle this problem! Would love for you to kick the tires and see if it works for your use case. We have a free version of the CLI, python library, and server on github, and a free hosted version you can kick around at Oxen.ai.

Website: https://oxen.ai

Dev Docs: https://docs.oxen.ai

GitHub: https://github.com/Oxen-AI/oxen-release

Feel free to reach out on the repo issues if you run into anything!

[+] axpy906|2 years ago|reply
I really like your README.
[+] zxexz|2 years ago|reply
I think a decent solution is coming up with a system for storing the models, datasets, checkpoints, etc. in S3, and storing the metadata, references, etc. in a well-structured Postgres table (schema versioning, audit logs, etc. with snapshots). Also embed the metadata in the model/dataset itself, in a way that lets you always reconstruct the database from the artifacts (Arrow and Parquet files let you embed arbitrary metadata at both the file level and the field level).

But perhaps the best solution is to just use something like MLflow or WandB that handles this for you, if you use the API correctly!

[+] axpy906|2 years ago|reply
You included data lineage tracking in the first part which probably needs to be piped from an orchestrator.

At this point it’s a build or buy type deal for models using a model registry service.

Data versioning still feels unsolved to me.

[+] plonk|2 years ago|reply
Models that actually get deployed get a random GUID. Our docs tell us which is which (release date, intended use, etc.)

Models are then stored in an S3 bucket. But since the IDs are unique, they can be exchanged and cached and copied with next to no risk of confusion.
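A minimal sketch of that GUID-per-deployment scheme; the docs-entry fields and file layout here are made up for illustration (the actual upload to S3 is left as a comment):

```python
import json
import uuid

# The ID itself carries no meaning; a docs entry maps it to
# release date, intended use, etc.
model_id = str(uuid.uuid4())

docs_entry = {
    "model_id": model_id,
    "release_date": "2023-09-28",
    "intended_use": "example classifier",
}
with open(f"{model_id}.json", "w") as f:
    json.dump(docs_entry, f)

# The artifact would then go to S3 under the same ID, e.g.
#   aws s3 cp model.pt s3://models-bucket/<model_id>.pt
```

Because the ID is globally unique, caching and copying the artifact anywhere can't collide with another model.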

[+] axpy906|2 years ago|reply
Is the bucket versioned?
[+] janalsncm|2 years ago|reply
We have a task name, major version, description and commit hash. So the model name will be something like my_task_v852_pairwise_refactor_0123ab. Ugly but it works.

Don’t store your data in git, store your training code there and your data in s3. And you can add metadata to the bucket so you know what’s in there/how it was generated.
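The naming convention above is trivial to codify; this tiny sketch just concatenates the four fields (values are the example from the comment):

```python
# Build a model name from task, major version, description, commit hash.
def model_name(task: str, version: int, desc: str, commit: str) -> str:
    return f"{task}_v{version}_{desc}_{commit}"

print(model_name("my_task", 852, "pairwise_refactor", "0123ab"))
# -> my_task_v852_pairwise_refactor_0123ab
```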

[+] thegginthesky|2 years ago|reply
Process, git and S3.

We trained the whole team to:

- version the analysis/code with git

- save the data to the bucket s3://<project_name>/<commit_id>

- we wrote a small helper to get the commit id, build this path, and use boto3 to both access it and save to it
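That commit-id path helper might look roughly like this (function names are illustrative, and the boto3 call is only hinted at in a comment):

```python
import subprocess

def current_commit() -> str:
    """Short commit id of HEAD (assumes you run inside the repo)."""
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()

def artifact_prefix(project: str, commit: str) -> str:
    """Build the s3://<project_name>/<commit_id> path convention."""
    return f"s3://{project}/{commit}"

# Saving/loading then uses boto3 against that prefix, e.g.
#   boto3.client("s3").upload_file("model.bin", project, f"{commit}/model.bin")
print(artifact_prefix("my-project", "0123ab"))  # -> s3://my-project/0123ab
```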

We normally work with zipped parquet files and model binaries, and we try to keep them together under the path mentioned.

It's super easy and simple, has very few dependencies, and allows for rerunning the code with the data. If someone deviates from this standard, we always request a change to keep it tidy.

Keeping track of data is the same as keeping a clean git tree: it requires practice, a standard, and constant supervision from everyone.

This saved my butt many times, such as when I had to rerun an analysis done over a year ago, or take over for a colleague who got sick.

[+] simonw|2 years ago|reply
I really like the idea of using the commit ID as the bucket prefix for the associated files.
[+] john-shaffer|2 years ago|reply
DVC is slow because it stores and writes data twice, and the default of dozens of concurrent downloads causes resource starvation. They finally improved uploads in 3.0, but downloads and storage are still much worse than a simple "aws s3 cp". You can improve pull performance somewhat by passing a reasonable value for --jobs. Storage can be improved by nuking .dvc/cache. There's no way to skip writing all data twice though.

Look for something with good algorithms. Xethub worked very well for me, and oxen looks like a good alternative. git-xet has a very nice feature that allows you to mount a repo over the network [0]

[0] https://about.xethub.com/blog/mount-part-1

[+] wingman-jr|2 years ago|reply
For a side project of image classification, I use a simple folder system where the images and metadata are both files, with a hash of the image acting as a key/filename - e.g. 123.img and 123.metadata. This gives file independence. Then, as needed, I compile a CSV of all the image-to-metadata mappings and version that. This works because I view the images as immutable, which is not true for some datasets. On a local SSD, it has scaled to >300K images.

Professionally, I've used something similar but with S3 storage for images and a Postgres database for the metadata. This scales better beyond a single physical machine for team interaction, of course.

I'd be curious how others have handled data costs as the datasets grow. The professional dataset got into the terabytes of S3 storage, and it gets a bit more frustrating when you want to move data but are looking at thousands of dollars of projected egress costs... and that's with S3, let alone a more expensive service. In many ways S3 is so much better than a hard drive, but it's hard not to compare to the relative cost of local storage when the gap gets big enough.
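The hash-as-filename layout is easy to sketch; the `.img`/`.metadata` extensions follow the comment, while the `ingest` helper and the 12-char truncation are my own assumptions:

```python
import hashlib
import json
from pathlib import Path

def ingest(image_bytes: bytes, metadata: dict, root: Path) -> str:
    """Store an image and its metadata side by side, keyed by content hash."""
    key = hashlib.sha256(image_bytes).hexdigest()[:12]
    (root / f"{key}.img").write_bytes(image_bytes)
    (root / f"{key}.metadata").write_text(json.dumps(metadata))
    return key

# Because the key is derived from the content, re-ingesting the same
# image is idempotent: it just overwrites identical files.
```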
[+] kvnhn|2 years ago|reply
I've used DVC in the past and generally liked its approach. That said, I wholeheartedly agree that it's clunky. It does a lot of things implicitly, which can make it hard to reason about. It was also extremely slow for medium-sized datasets (low 10s of GBs).

In response, I created a command-line tool that addresses these issues[0]. To reduce the comparison to an analogy: Dud : DVC :: Flask : Django. I have a longer comparison in the README[1].

[0]: https://github.com/kevin-hanselman/dud

[1]: https://github.com/kevin-hanselman/dud/blob/main/README.md#m...

[+] smfjaw|2 years ago|reply
MLflow solves most of these issues for models. I haven't used it in relation to data versioning, but it solves most model versioning and deployment management things I can think of.
[+] snovv_crash|2 years ago|reply
CSV file in git with paths to all of the files, all the training settings, and the path to the training artifacts (snapshots, loss stats, etc). The training artifacts get filled in by CI when you commit. Files can be anywhere; for us it was a NAS, due to PII in the data we were training on, so "someone else's computer" AKA cloud wasn't an option.
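A toy version of that committed manifest; the column names and NAS paths are invented for illustration, and in the real setup CI would fill the `artifacts` column after training:

```python
import csv

# One row per training file, plus settings; "artifacts" starts empty
# and is filled in by CI with snapshot/loss-stat paths.
rows = [
    {"file": "/nas/data/img_0001.png", "lr": "0.001", "artifacts": ""},
    {"file": "/nas/data/img_0002.png", "lr": "0.001", "artifacts": ""},
]
with open("manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "lr", "artifacts"])
    writer.writeheader()
    writer.writerows(rows)
```

The manifest itself is small text, so plain git diffs show exactly which files and settings changed between runs.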
[+] hhh|2 years ago|reply
Why would having PII rule out cloud?
[+] mjhea0|2 years ago|reply
I'm guessing you're looking more for a dev tool, but I co-founded a company that deals with this very thing (among others) from a governance perspective. https://www.monitaur.ai/
[+] pjfin123|2 years ago|reply
I put the metadata in a JSON file and then store the datasets as a zip archive on an Nginx server.
[+] speedgoose|2 years ago|reply
Have you used git or git lfs to store the large files?
[+] m_niedoba|2 years ago|reply
Yes, Git LFS works better than most people think. You can also use Azure DevOps, because they don't charge for storage. We use Anchorpoint as a Git client, because it's optimized for LFS.
[+] cuteboy19|2 years ago|reply
Haphazardly, with commit# + timestamp of training