A couple of years ago I spent a lot of time figuring out how to deal with a large graph. My conclusion: there will never be such a thing as a "graph database". There are many efforts in this area - someone here already mentioned SPARQL and RDF, and you can google for "triple stores", etc. There are also large-scale graph processing tools on top of Hadoop, such as Giraph, or GraphX for Spark.
For the particular project we ended up using Redis and storing the graph as an adjacency list in a machine with 128GB of RAM.
The reason I don't think there will ever be a "graph database" is that there are so many different ways you can store a graph, and so many things you might want to do with one. It's trivial to build a "graph database" in a few lines of any programming language - graph traversal is (hopefully) taught in any decent CS course.
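To illustrate the point, here is the "few lines of any programming language" version: an adjacency list plus breadth-first traversal. (This is also essentially the Redis approach mentioned above, with one Redis set per node instead of a dict entry; the graph data here is made up.)

```python
from collections import deque

# A toy "graph database": adjacency lists in a dict.
graph = {
    "alice": ["bob", "carol"],
    "bob":   ["dave"],
    "carol": ["dave"],
    "dave":  [],
}

def bfs(graph, start):
    """Return nodes reachable from `start` in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return order

print(bfs(graph, "alice"))  # ['alice', 'bob', 'carol', 'dave']
```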
Also - the latest versions of PostgreSQL have all the features to support graph storage. It's ironic how PostgreSQL is becoming a SQL database that is gradually taking over the "NoSQL" problem space.
From my point of view, the fact that you can very easily add a custom index to a graph database written in a modern language (i.e. not C/C++) makes it even easier to customize an existing graph database to suit your exact needs. In turn, storage and runtime can be tuned more easily, making it simple to get the performance you need. But at the end of the day, not having to deal with relational algebra is the best part.
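A minimal sketch of what "adding a custom index" can look like in an in-memory graph store. All names here (`GraphStore`, `by_label`, the example nodes) are made up for illustration; the point is only that the index is a few lines when you control the store.

```python
class GraphStore:
    """Naive in-memory graph store with a hand-rolled secondary index."""

    def __init__(self):
        self.props = {}     # node id -> property dict
        self.edges = {}     # node id -> list of neighbor ids
        self.by_label = {}  # custom index: label -> set of node ids

    def add_node(self, node_id, **props):
        self.props[node_id] = props
        self.edges.setdefault(node_id, [])
        # Maintaining the index is one line here -- the kind of
        # customization that is painful inside a C/C++ black box.
        if "label" in props:
            self.by_label.setdefault(props["label"], set()).add(node_id)

    def add_edge(self, src, dst):
        self.edges.setdefault(src, []).append(dst)

    def find_by_label(self, label):
        # O(1) lookup instead of scanning every node.
        return self.by_label.get(label, set())

g = GraphStore()
g.add_node("n1", label="person", name="Ada")
g.add_node("n2", label="person", name="Alan")
g.add_node("n3", label="city", name="London")
g.add_edge("n1", "n3")
print(g.find_by_label("person"))  # {'n1', 'n2'}
```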
PostgreSQL has supported recursive queries for years now, and in Oracle you have CONNECT BY. I have only used a recursive WITH once, and it was just a quick demo, but my understanding is that updates are extremely expensive.
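For anyone who hasn't seen one, a recursive CTE doing graph traversal looks like this. SQLite is used here only because it ships with Python's standard library (Postgres accepts the same query); the table and the edge data are made up.

```python
import sqlite3

# Edge table for a small directed graph: a -> b -> c -> d, a -> e.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edge (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edge VALUES (?, ?)",
                [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e")])

# All nodes reachable from 'a', via a recursive common table expression.
rows = con.execute("""
    WITH RECURSIVE reachable(node) AS (
        VALUES ('a')
        UNION
        SELECT edge.dst FROM edge
        JOIN reachable ON edge.src = reachable.node
    )
    SELECT node FROM reachable ORDER BY node
""").fetchall()
print([r[0] for r in rows])  # ['a', 'b', 'c', 'd', 'e']
```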
We're using TitanDB. One of the main benefits for us is that AWS has provided backend integration with DynamoDB. This affords you practically infinite and painless scaling on a pay-as-you-go model. Love it.
Depends on what kind of data and graph you are going to store/use. Neo4j is quite popular, Cypher isn't very hard to learn, and it has lots of examples. It might be a good choice for a beginner.
There are multiple systems out there; however, I have my doubts. It is important that your data does not get corrupted and that your transactions do not get lost. Furthermore, speedups are possible with certain indices. That is why I personally would want to see some more safety/speed analysis and comparisons between the different systems.
(Full disclosure: I'm the author, and we are VC-backed.) https://github.com/amark/gun is an open-source graph database with Firebase-like realtime synchronization.
Everybody's focused on graph databases here, but let's talk about Cray! One of the most forward-thinking computer technology companies ever to exist is starting to get out there again. If they got a few hundred million dollars from an outside investor, they could do friggin' incredible things. They already do incredible things, just not as visibly as they so easily could.
Cray is a brand name that has been passed around between half a dozen companies (including Sun and SGI), dotted with various kinds of product reboots and commercial failures. Cool stuff, but supercomputing isn't the most financially sound business, it seems. The current name holder is the company previously called Tera, originally famous for making an aggressively multithreaded HPC computer.
I am a huge fan of graph-y stuff. I've done several iterations of a graph database written in Python, using files, then bsddb, and right now WiredTiger. I also use Gremlin for querying. Have a look at the code: https://github.com/amirouche/ajgudb
I've seen people using graph databases as a general-purpose backing store for webapps/microservices. What are people's opinions about this?
My feeling is that graph databases are not suitable/ready for — for lack of a better term — the kind of document-like entity relationship graphs we typically use in webapps. Typical data models don't represent data as vertices and edges, but as entities with relationships ("foreign keys" in RDBMS nomenclature) embedded in the entities themselves.
This coincidentally applies to the relational model, in its most pure, formal, normal form, but the web development community has long established conventions of ORMing their way around this. The thing is, you shouldn't need an ORM with a graph database.
It introduces a false dichotomy: "graph vs relational".
In fact, most (if not all) graph algorithms can be expressed using linear algebra (with specific addition and multiplication operations). And matrix multiplication is a select from two matrices, related with "where i = j", followed by aggregation over identical result coordinates.
The selection of multiplication and addition operations can account for different "data stored in links and nodes".
So there is no such dichotomy "graph vs relational".
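The correspondence can be sketched in a few lines. Below, boolean-semiring matrix multiplication (two-hop reachability) is computed once as linear algebra and once as the equivalent relational self-join plus aggregation; the 4-node graph is made up for the demo.

```python
# Adjacency matrix of a 4-node path graph: A[i][j] = 1 iff edge i -> j.
A = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
n = len(A)

# Linear-algebra view: (A*A)[i][j] with (+, *) replaced by (or, and)
# answers "is there a path of length 2 from i to j?".
two_hop_matmul = [
    [int(any(A[i][k] and A[k][j] for k in range(n))) for j in range(n)]
    for i in range(n)
]

# Relational view: edges as a table of (src, dst) rows; self-join on
# e1.dst = e2.src (the "where i = j" above), then aggregate duplicate
# (src, dst) results into a set.
edges = [(i, j) for i in range(n) for j in range(n) if A[i][j]]
joined = {(s1, d2) for (s1, d1) in edges for (s2, d2) in edges if d1 == s2}
two_hop_join = [[int((i, j) in joined) for j in range(n)] for i in range(n)]

assert two_hop_matmul == two_hop_join
print(two_hop_matmul[0])  # [0, 0, 1, 0]: the only 2-hop path from 0 is 0 -> 1 -> 2
```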
Strictly speaking, yeah. Practically speaking: not really true.
Just because something can be done doesn't mean it can be done easily or well. I've done a lot of work with relational databases, and I love them for a lot of data sets. But I have also done a lot of work with graph databases - and they make working with graph-shaped data a pleasure. I could do a graph in SQL - it's even moderately straightforward in Postgres these days using WITH RECURSIVE - but it's still not as simple as just loading Orient or Arango for those tasks.
It's the same reason I keep multiple knives in my kitchen. Sure I could do everything with an 8" chef's knife, but the paring knife and the boning knife just make some tasks easier.
Of course you can express anything on top of a relational model. But for graphs, such a representation would be awfully inefficient. For this reason, CADs never even tried to switch to relational data storage once those fancy new relational databases appeared; most of the professional CADs are still using good old graph databases.
Anybody know dgraph.io?
It's a scalable, distributed, low-latency, high-throughput graph database over terabytes of structured data.
Dgraph supports Facebook's GraphQL as its query language, responds in JSON, and its storage engine is Facebook's RocksDB, a very fast key-value store.
See more at https://github.com/dgraph-io/dgraph
One of the biggest challenges in databases is handling concurrency and sharding; I wish this had talked a bit more about how that changes between a graph database and a relational database.
valhalla | 10 years ago:
http://barabasilab.neu.edu/networksciencebook/downlPDF.html
GFK_of_xmaspast | 10 years ago:
(FWIW, I had previously read some Barabasi papers and had come away seriously unimpressed, see also https://news.ycombinator.com/item?id=9555547)
rail2rail | 10 years ago:
https://aws.amazon.com/blogs/aws/new-store-and-process-graph...
kinow | 10 years ago:
https://en.wikipedia.org/wiki/Graph_database#List_of_graph_d...
kawera | 10 years ago:
https://github.com/google/cayley
iod | 10 years ago:
https://www.arangodb.com
¹ https://www.arangodb.com/2015/10/benchmark-postgresql-mongod...
espeed | 10 years ago:
See previous discussion: https://news.ycombinator.com/item?id=11197880
jerven | 10 years ago:
There are more, but these are open source and I know them. And many more commercial ones.
amirouche | 10 years ago:
Also, I made a hypergraphdb in Scheme, atom-centered instead of hyperedge-focused: https://github.com/amirouche/Culturia/blob/master/culturia/c....
Did you know that Gremlin is just SRFI-41, aka the stream API, with a few graph-centric helpers?
edit: it's srfi 41, http://srfi.schemers.org/srfi-41/srfi-41.html
SloopJon | 10 years ago:
http://www.cray.com/blog/how-cray-graph-engine-manages-graph...
TimPrice | 10 years ago:
2. Instead, do graph DB engines try to break through bottlenecks for big data and analytics scenarios?
cbsmith | 10 years ago:
Your argument is effectively: because Haskell can be implemented in C, there is a false dichotomy between the two languages.