TACIXAT | 2 years ago

I am working on this for a different definition of the term "dataset". I started learning deep learning, which led me to start building datasets.

Wanting to store versions of the datasets efficiently, I started building a version control system for them. It tracks objects and annotations and can roll back to any point in time. It helps answer questions like "what has changed since the last release?" and "which user made which changes?"
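To make the idea concrete, here is a minimal sketch of one way such tracking could work: an append-only change log that is replayed to rebuild any past state. This is an illustration only, not the actual library; `Change`, `DatasetLog`, and every method name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    seq: int                   # position in the log, starting at 1
    user: str                  # who made the change
    object_id: str             # which dataset object (e.g. an image) was touched
    annotation: Optional[str]  # new annotation; None means it was deleted


class DatasetLog:
    """Hypothetical append-only log; past states are rebuilt by replaying a prefix."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def commit(self, user: str, object_id: str, annotation: Optional[str]) -> int:
        """Record one change and return its sequence number."""
        seq = len(self._changes) + 1
        self._changes.append(Change(seq, user, object_id, annotation))
        return seq

    def state_at(self, seq: int) -> Dict[str, str]:
        """Roll back: replay only the first `seq` changes."""
        state: Dict[str, str] = {}
        for change in self._changes[:seq]:
            if change.annotation is None:
                state.pop(change.object_id, None)
            else:
                state[change.object_id] = change.annotation
        return state

    def changes_since(self, seq: int) -> List[Change]:
        """Everything recorded after the given point, with attribution."""
        return self._changes[seq:]


log = DatasetLog()
log.commit("alice", "img_001", "cat")
release = log.commit("alice", "img_002", "dog")  # treat this point as a release
log.commit("bob", "img_002", "wolf")
log.commit("bob", "img_001", None)

print(log.state_at(release))
print([(c.user, c.object_id) for c in log.changes_since(release)])
```

Replaying from the start is the simplest choice but gets slow on long histories; a real system would probably add periodic snapshots.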

Still working on the core library but I'm excited for it.

angrais|2 years ago

Have you looked into existing version control systems for data, such as DVC?

TACIXAT|2 years ago

Thanks for the suggestion. I have glanced through the docs in the past but haven't tried it. I am trying to do a bit more than what git can offer.

First, the good. Git LFS solves the issue of having to check out a massive repository in its entirety.

Git can work pretty well if your annotations are in a text-based format and stored one annotation per file. That makes it easy to track and attribute annotation changes.

What I'm building can serve as a backend to labeling. There is a built-in workflow for reviewing changes, objects have different statuses (in annotation, included in release, etc.), releases are reproducible, things like that.

It is really designed for collaboration with untrusted third parties. Imagine someone making a pull request for a binary annotation format. To review it you would have to clone it, load it in an annotation tool, then go and tie what you saw to what is in the pull request. What do you do if, like, 90% of the annotations are correct? Reject everything? Very tough, and it also assumes your annotator can make a pull request.

Mine will still require you to bring your own annotation tool, but it makes it much easier to integrate the review process.
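As a rough illustration of the "90% correct" case, review can happen per annotation instead of per pull request, so the good ones are kept and only the bad ones bounce. This is a sketch under my own assumptions, not the real design: the `Status` values, `review`, and `accepted_only` are hypothetical, and the `is_correct` callback stands in for a human reviewer.

```python
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    # Hypothetical review statuses, one per submitted annotation.
    ACCEPTED = "accepted"
    REJECTED = "rejected"


def review(
    submission: Dict[str, str],
    is_correct: Callable[[str, str], bool],
) -> Dict[str, Status]:
    """Decide on each annotation individually rather than all-or-nothing."""
    return {
        object_id: Status.ACCEPTED if is_correct(object_id, label) else Status.REJECTED
        for object_id, label in submission.items()
    }


def accepted_only(
    submission: Dict[str, str],
    decisions: Dict[str, Status],
) -> Dict[str, str]:
    """Keep just the annotations the reviewer accepted."""
    return {
        object_id: label
        for object_id, label in submission.items()
        if decisions[object_id] is Status.ACCEPTED
    }


# A third-party submission where one of three labels is wrong.
submission = {"img_001": "cat", "img_002": "dog", "img_003": "cat"}
reviewer_view = {"img_001": "cat", "img_002": "dog", "img_003": "bird"}

decisions = review(submission, lambda obj, label: reviewer_view[obj] == label)
accepted = accepted_only(submission, decisions)
print(accepted)
```

The rejected entries can go back to the annotator without blocking the rest.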