item 19470064

Qri: A global dataset version control system built on the distributed web

204 points | anewhnaccount2 | 7 years ago | github.com

42 comments

marknadal | 7 years ago
I really love the design and style of qri! It is fun!

Can I ask why, for a git-style system, IPFS was chosen instead of GUN or SSB?

Certainly, images/files/etc. are better in IPFS than GUN or SSB.

But, you're gonna have a nightmare doing any git-style index/patch/object/etc. operations with it - both GUN & SSB's algorithms are meant to handle this type of stuff.

Did you guys do any analysis?

b_fiive | 7 years ago
hey, qri dev here. Delighted you like the design, we're hoping to make data a little more "approachable" :)

We did look into SSB. I'll admit to not hearing about it until a few months ago, but the main reason we chose IPFS was its single-swarm behaviour, which allows natural deduplication of content (a really nice property for dataset versioning).
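Content addressing is what makes that deduplication fall out naturally: chunks with identical bytes hash to the same address, so two dataset versions that share data share storage. A minimal sketch in Go (toy code illustrating the idea, not qri's or IPFS's actual block store):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashChunk returns a hex digest that serves as the chunk's address,
// so identical bytes always map to the same address.
func hashChunk(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// store is a toy content-addressed block store: putting the same
// bytes twice costs no extra blocks.
type store map[string][]byte

func (s store) put(data []byte) string {
	addr := hashChunk(data)
	s[addr] = data // writing identical bytes to the same address is a no-op
	return addr
}

func main() {
	s := store{}
	// Two dataset versions that share a chunk ("header") are
	// deduplicated automatically: only two unique blocks are stored.
	a := s.put([]byte("header"))
	b := s.put([]byte("rows-v1"))
	c := s.put([]byte("header")) // same content, same address
	fmt.Println(a == c, a == b, len(s)) // true false 2
}
```

A second version of a dataset that changes one chunk only adds that one new block; everything unchanged is referenced, not copied.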

The majority of our work has been in the exact area you mentioned, building up a dataset document model that will version, branch, and convert to different formats. We've gone so far as to write our own structured data differ (https://github.com/qri-io/deepdiff). I'm very happy with the progress we've made on this frontier so far.
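The core idea of a structured differ is to compare values key-by-key rather than line-by-line, so a change to one field is reported as one operation. A toy sketch in Go of that idea (flat string maps only; deepdiff itself handles nested structures and this is not its actual algorithm):

```go
package main

import (
	"fmt"
	"sort"
)

// diffMaps reports per-key changes between two flat JSON-like
// objects as insert/update/delete operations, sorted for
// deterministic output.
func diffMaps(old, cur map[string]string) []string {
	var ops []string
	for k, v := range old {
		nv, ok := cur[k]
		switch {
		case !ok:
			ops = append(ops, "delete "+k)
		case nv != v:
			ops = append(ops, "update "+k+"="+nv)
		}
	}
	for k, v := range cur {
		if _, ok := old[k]; !ok {
			ops = append(ops, "insert "+k+"="+v)
		}
	}
	sort.Strings(ops)
	return ops
}

func main() {
	v1 := map[string]string{"title": "pop", "rows": "100"}
	v2 := map[string]string{"title": "population", "rows": "100", "license": "ODC"}
	fmt.Println(diffMaps(v1, v2))
	// [insert license=ODC update title=population]
}
```

Unchanged keys produce no output, which is exactly the property that makes structured diffs readable for wide datasets where a line-based diff would mark the whole row as changed.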

I'm a huge fan of SSB, but don't think it's well suited for making datasets globally discoverable across the network. In the end the libp2p project tipped the scales for us, providing a nice set of primitives to build on.

DocSavage | 7 years ago
Interesting project, particularly with the choice of IPFS and DCAT -- something I'll have to look into. There have been other efforts to handle mostly file-based scientific data with versioning in both distributed (Dat https://blog.datproject.org/tag/science/) and centralized ways (DataHub https://datahub.csail.mit.edu/www/). Juan Benet visited our research center to give a talk about IPFS a few years ago. Really fantastic stuff.

I'm the creator of DVID (http://dvid.io), which has an entirely different approach to how we might handle distributed versioning of scientific data primarily at a larger scale (100 GB to petabytes). Like Qri and IPFS, DVID is written in Go. Our research group works in Connectomics. We start with massive 3D brain image volumes and apply automated and manual segmentation to mine the neurons and synapses of all that data. There's also a lot of associated data to manage the production of connectomes.

One of our requirements, though, is having low-latency reads and writes to the data. We decided to create a Science API that shields clients from how the data is actually represented, and for now, have used an ordered key-value store for the backend. Pluggable "datatypes" provide the Science API and also translate requests into the underlying key-value pairs, which are the units for versioning. It's worked out pretty well for us and I'm now working on overhauling the store interface and improving the movement of versions between servers. At our scale, it's useful to be able to mail a hard drive to a collaborator to establish the base DAG data and then let them eventually do a "pull request" for their relatively small modifications.
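One way to picture the versioned key-value design described above (a hedged sketch, not DVID's actual storage layout): record each write under its version, and resolve reads by walking the version's ancestry, so unchanged data is shared rather than copied between versions.

```go
package main

import "fmt"

// versionedStore is a toy versioned key-value store: each write is
// recorded under (version, key), and a read falls back through the
// version's ancestors until it finds the newest value for the key.
type versionedStore struct {
	parent map[string]string            // version -> parent version
	data   map[string]map[string][]byte // version -> key -> value
}

func newVersionedStore() *versionedStore {
	return &versionedStore{
		parent: map[string]string{},
		data:   map[string]map[string][]byte{},
	}
}

func (s *versionedStore) commit(version, parent string) {
	s.parent[version] = parent
	s.data[version] = map[string][]byte{}
}

func (s *versionedStore) put(version, key string, val []byte) {
	s.data[version][key] = val
}

// get resolves key at version, walking up the ancestry, so a child
// version inherits everything it did not overwrite.
func (s *versionedStore) get(version, key string) ([]byte, bool) {
	for v := version; v != ""; v = s.parent[v] {
		if val, ok := s.data[v][key]; ok {
			return val, true
		}
	}
	return nil, false
}

func main() {
	s := newVersionedStore()
	s.commit("v1", "")
	s.put("v1", "neuron/7", []byte("soma"))
	s.commit("v2", "v1")
	s.put("v2", "neuron/8", []byte("axon"))
	// v2 sees its own write and inherits v1's unchanged data.
	a, _ := s.get("v2", "neuron/7")
	b, _ := s.get("v2", "neuron/8")
	fmt.Println(string(a), string(b)) // soma axon
}
```

This also shows why shipping a hard drive works: the base versions are immutable, so a collaborator's later commits only need to transfer their own (version, key) writes.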

We've published some of our data online (http://emdata.janelia.org) and visitors can actually browse through the 3D images using a Google-developed web app, Neuroglancer. It's running on a relatively small VM so I imagine any significant HN traffic might crush it :/ We are still figuring out the best way to handle the public-facing side.

I think a lot of people are coming up with their own ideas about how to version scientific data, so maybe we should establish a meeting or workshop to discuss how some of these systems might interoperate? The RDA (https://rd-alliance.org/) has been trying to establish working groups and standards, although they weren't really looking at distributed versioning a few years ago. We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.

amirouche | 7 years ago
> We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.

Exactly my thought. Do you know of any working group that is working toward that goal?

ktpsns | 7 years ago
> scientific data primarily at a larger scale (100 GB to petabytes)

Buying hard discs (100 TB for a few tens of kEUR a few years ago) is a real investment at our institute. As far as I understood, with distributed storage each participant volunteers to share their disc to store their own (and others') data. Here's the devil's advocate: why should I share my expensively bought disc space with you?

guywhocodes | 7 years ago
What are the benefits of using qri over IPFS? At a glance it seems very similar, just narrower.
ekianjo | 7 years ago
In IPFS you can't search from within the protocol, as far as I understand. Qri focuses on datasets and provides a search layer directly from its tools.
mewwts | 7 years ago
I love how the distributed web is seemingly built more and more in golang these days.

- https://github.com/ethereum/go-ethereum

- https://github.com/ipfs/go-ipfs

- https://github.com/textileio/go-textile

- https://github.com/lightningnetwork/lnd

to name a few other projects.

rolleiflex | 7 years ago
Mine is also (Aether - https://getaether.net). I’ve also gotten comments reflecting on this same thing. I love Go. It is boring: it makes sure that I focus on doing interesting things, not on writing interesting code.
Protostome | 7 years ago
Why do you love that it's Go in particular? (Seriously asking, out of curiosity: why Go over all other languages, e.g. Rust and such?)
sjapkee | 7 years ago
It only means that all this will soon die. It's the Ruby of 2017-2019.