DeepDive – System that enables developers to analyze data on a deeper level

[+] nl|11 years ago|reply

It does probabilistic inference![1]

So many open source "Knowledge Graph"-y type projects concentrate on building them like databases, with a query language that assumes the data in them is correct. You see this in things like Freebase, DBPedia and Wikidata, where they typically end up in a triple store and you query using SPARQL.

This isn't how the real world works, and there isn't a lot of publicly available software that takes this into account. There aren't even that many papers about it (the Microsoft Probase paper is one, and there is work from Florida University(?) about using Markov chains to reason while taking probabilities into account).

I'm excited to take a look at this.

[1] http://deepdive.stanford.edu/doc/general/inference.html

[+] anonetal|11 years ago|reply

Aside from the work on probabilistic inference, there are also many papers on "probabilistic databases" from the last 10 years (Chris did his PhD on that topic). That work has looked at SQL-style query processing over "uncertain"/"probabilistic" data.

These were some of the major projects: https://homes.cs.washington.edu/~suciu/project-mystiq.html, http://maybms.sourceforge.net/, http://infolab.stanford.edu/trio/, http://www.cs.umd.edu/~amol/PrDB/, http://dl.acm.org/citation.cfm?id=1376686.
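To make "SQL-style query processing over probabilistic data" concrete, here is a minimal sketch, not taken from any of the projects above. It assumes the simplest model, a tuple-independent table, where each row carries its own probability of being true and rows are independent. The table, names, and numbers are made up for illustration. The query is the equivalent of SELECT DISTINCT city, and the answer is a probability per city rather than a plain list.

```python
# Hypothetical extracted rows: (person, city, probability that the row is true).
# Assumption: rows are independent of one another (tuple-independent model).
LIVES_IN = [
    ("alice", "paris", 0.9),
    ("bob", "paris", 0.5),
    ("carol", "rome", 0.4),
]


def distinct_city_probs(table):
    """Probability that each city appears in SELECT DISTINCT city.

    A city appears if at least one of its rows is true, so
    P(city) = 1 - product over its rows of (1 - p).
    """
    absent = {}  # city -> probability that none of its rows is true
    for _person, city, p in table:
        absent[city] = absent.get(city, 1.0) * (1.0 - p)
    return {city: 1.0 - q for city, q in absent.items()}


if __name__ == "__main__":
    for city, prob in distinct_city_probs(LIVES_IN).items():
        print(f"{city}: {prob:.2f}")
```

The independence assumption is what keeps this query cheap. With correlated rows, or with joins that reuse the same row more than once, the arithmetic stops being a simple product, which is a large part of why this became a research area.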

[+] tensor|11 years ago|reply

There is actually a huge body of work on probabilistic reasoning. Just do a Google Scholar search for "probabilistic reasoning", "probabilistic inference", or "probabilistic logic". You might have to dig around in the results, but there are definitely a lot of papers on the topic.

I hadn't heard about DeepDive until now, but I did previously come across another project that does probabilistic reasoning: http://alchemy.cs.washington.edu. I cannot speak to how they compare since I haven't looked into either in great depth yet.

[+] pvnick|11 years ago|reply

> there is work from Florida University(?) about using Markov chains to reason while taking probabilities into account.

You're talking about the work that Prof. Daisy Zhe Wang and her students are doing over at the DSR lab[1]. Go Gators!

[1] http://dsr.cise.ufl.edu/?page_id=250

[+] chapulin|11 years ago|reply

It's also being used to aid paleontology research: http://fusion.net/story/30751/paleo-deep-dive-machine-learni...

[+] phreeza|11 years ago|reply

I wonder what a ballpark figure would be for how long it takes to set up an instance of this for, say, a given scientific field. Days? Months? Years? I fear it is probably the last.

[+] batbomb|11 years ago|reply

I've sat in on Chris' class at Stanford.

I think the answer is probably closer to weeks to months if working with field experts, depending on how deep you want to go.

The core of it is open source.

I think the most exciting thing about it is it brings more sophisticated computation to the more qualitative sciences.

[+] rstoner|11 years ago|reply

This could be a huge value-add to the groups that have invested heavily in human-directed knowledge graph construction (e.g. Project Halo/Aristo at the Allen Institute for AI).

[+] polskibus|11 years ago|reply

I'm mostly interested in how much it differs from what IBM Watson does. Does IBM rely only on probabilistic inference, or does it do other data mining as well?

[+] nl|11 years ago|reply

It's (very) roughly comparable to parts of it.

Firstly: IBM is increasingly using the Watson brand for things that don't appear directly related to the Jeopardy-winning system (e.g., Watson Analytics). When I talk about Watson here I mean the Question Answering (QA) system.

At a very high level DeepDive consists of a Knowledge Graph construction tool, and a probabilistic querying tool. Compared to Watson it is missing a natural language question parsing tool, and any way of dealing with questions that aren't in the KG.

Watson has (very strong) natural language understanding for multi-clause questions, and the Jeopardy version can do things like understand puns. DeepDive doesn't have anything comparable. In the open source space, the closest thing I'm aware of is SEMPRE[1][2].

Watson also has an evidence-scoring module, and my understanding is that this can work against unstructured data. DeepDive doesn't have this, and instead relies on probabilistic inference. This is an excellent approach, but it relies on doing content extraction first (i.e., extracting entities and relationships from text and/or other sources). The Microsoft Probase[3] group has published lots in this area.
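For a feel of what "probabilistic inference" over extracted relationships means, here is a toy sketch. It is not DeepDive's code or API. It assumes two candidate facts that an extractor has already produced, three made-up weighted factors over them, and it computes exact marginal probabilities by enumerating every possible world. Brute-force enumeration only works at toy scale; a real system would use an approximate method such as sampling. All variable names and weights are hypothetical.

```python
from itertools import product
from math import exp

# Candidate facts produced by a (hypothetical) extraction step.
VARIABLES = ["spouse(a,b)", "spouse(b,a)"]

# Each factor is (weight, predicate over a world). A world is a dict that
# maps every variable name to True or False. Weights are made up.
FACTORS = [
    (1.0, lambda w: w["spouse(a,b)"]),                      # a text feature supports spouse(a,b)
    (-0.5, lambda w: w["spouse(b,a)"]),                     # weak evidence against spouse(b,a)
    (2.0, lambda w: w["spouse(a,b)"] == w["spouse(b,a)"]),  # symmetry rule: both or neither
]


def marginals(variables, factors):
    """Exact P(variable is True) by summing over all 2^n possible worlds."""
    totals = {name: 0.0 for name in variables}
    z = 0.0  # normalizing constant (partition function)
    for values in product([False, True], repeat=len(variables)):
        world = dict(zip(variables, values))
        # Unnormalized score: exp of the summed weights of satisfied factors.
        score = exp(sum(weight for weight, holds in factors if holds(world)))
        z += score
        for name in variables:
            if world[name]:
                totals[name] += score
    return {name: totals[name] / z for name in variables}


if __name__ == "__main__":
    for name, prob in marginals(VARIABLES, FACTORS).items():
        print(f"{name}: {prob:.3f}")
```

With these weights the symmetry rule pulls the two facts toward agreeing, and spouse(a,b) ends up more probable than spouse(b,a) because its own evidence is positive. The point is that each fact comes back with a probability instead of being simply present or absent.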

[1] http://www-nlp.stanford.edu/joberant/homepage_files/publicat...

[2] https://github.com/percyliang/sempre

[3] http://research.microsoft.com/en-us/projects/probase/

[+] signa11|11 years ago|reply

kllj

19 comments