This looks like a very nice implementation of a graph database. However, a 6-machine cluster barely qualifies as "distributed" for the purposes of a graph database. You will experience almost no sub-linearity at this scale no matter how poorly the algorithms distribute. I am not intimately familiar with the transactional implementation, but the graph processing algorithms described in Titan are not significantly distributable.
For graph queries, linearity can definitely be achieved on parallel systems with tens of thousands of compute nodes using state-of-the-art algorithms. However, Titan does not use any of those algorithms and will be very lucky to exhibit linearity at a couple dozen nodes. Not to take away from the implementation, but people looking for a scalable graph database will not be satisfied with Titan.
As an aside, the best algorithms and data structures for doing massively parallel graph operations are not published but obviously exist. The fastest system on the Graph500 benchmark uses a half million cores on a trillion-edge graph. That is a gap of several orders of magnitude between what the best open source systems can do and what systems developed by closed research organizations can do, as published in public benchmarks.
(Disclosure: I invented one of the aforementioned families of massively parallel graph processing algorithms in 2009, and mine was not the first. The published literature has not even caught up with the first generation of such algorithms. A significant portion of advanced computer science research is no longer published.)
I think you are looking at a very different use case here. The systems that I think you are referring to analyze a static graph representation. The Graph500 benchmark in particular loads one big static, unlabeled, undirected, property-free graph and then runs extensive breadth-first search (BFS) analysis on it. The fact that the graph is not changing allows significant investment in building locality-optimizing data structures (which is essentially what space decomposition is all about).
Titan on the other hand is a transactional database system to handle large, multi-relational (labeled) graphs with heterogeneous properties. A Titan graph is constantly evolving (as in the posted benchmark).
For graphs (unlike geo-spatial domains), applying space decomposition techniques first requires a metric space embedding, which is a non-trivial and computationally expensive process. For changing graphs, this embedding will change as well, making it very difficult to use in practice. The best approaches I know of for achieving locality therefore use adaptive graph partitioning techniques instead. However, for the types of OLTP workloads that Titan is optimized for, this would be overkill in the sense that the time spent on partitioning will likely exceed the time saved at runtime. At very large scale, it is most important for OLTP systems to focus on access path optimization based on the actual query load experienced by the system and not some perceived sense of locality based on connectedness. I published a paper a while ago suggesting one approach to doing so:
http://www.knowledgefrominformation.com/2010/08/01/cosi-clou...
The Graph500 benchmark explicitly prohibits this optimization ("The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels").
I'd like to see a few other details that aren't mentioned:
- What does the following distribution end up looking like? Does it have a similar fraction of 'celebrity' users with huge follower counts? Or, more technically, does the Russian roulette against the recommendation sampling end up producing a network similar to a scale-free graph grown via preferential attachment (the Barabási–Albert model)? It looks like your mean fan-out is about 20, which is smaller than what Twitter has published, but I'd be more interested in knowing how many 10k+ follower users are in the graph.
- What's the write amplification like? ~1.6 billion tweet-to-follower stream edges stored daily seems like it could burn a lot of capacity quickly, though most of it will grow cold quickly and could be pushed to archive. Making a rough guess from your disk-write monitoring line graph, it looks like you'd be putting down about 16GB a day? It'd be interesting to see a comparison between this run and one where streams are built indirectly via follower links alone.
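Whether a recommendation-driven generator produces a scale-free follower distribution can be checked against a reference model. A minimal Barabási–Albert-style sketch (parameters are illustrative, not the benchmark's actual generator):

```python
import random

def preferential_attachment(n_users, m_follows):
    """Grow a follow graph where each new user follows m_follows existing
    users, chosen roughly in proportion to their current follower count."""
    followers = [0] * n_users            # follower count per user
    # Sampling pool: one entry per follow received (seeds get weight 1),
    # so random.choice picks users proportionally to in-degree.
    pool = list(range(m_follows))
    for u in range(m_follows, n_users):
        chosen = set()
        while len(chosen) < m_follows:
            chosen.add(random.choice(pool))
        for v in chosen:
            followers[v] += 1
            pool.append(v)               # v is now more likely to be chosen again
        pool.append(u)                   # the new user can be followed too
    return followers

random.seed(42)
counts = preferential_attachment(20_000, 20)  # mean fan-out ~20, as in the post
print(max(counts))  # heavy tail: a few 'celebrity' users far above the mean
```

Comparing the count of 10k+ follower users here against the benchmark's generated graph would answer the question directly.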
- The data we used was crawled by Kwak et al. in 2009. We wanted to use a real social network dataset for the experiment, and that was the largest/most useful one we could find. Other than de-duplication, we did not make any modifications to the dataset, so the statistics reported in their paper still hold:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153....
- You mean, what is the overhead induced by pre-computing the stream edges rather than collecting the relevant streams at query time? You are right that this requires a significant amount of storage space; however, as you also pointed out, it will get cold quickly and sit on disk only (i.e., not taking up space in the valuable cache). The reason this is very efficient is the time-based, vertex-centric index we build for the stream edges. It allows us to quickly pull out the most recent tweets for any user. If we had to compute those at query time, we would have to traverse to each person followed, get their 10 most recent tweets, and then merge those in memory. That would be significantly more expensive, and since stream reading is probably the most frequent activity on Twitter, pre-computing it saves a lot of time at the expense of inexpensive disk storage.
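The trade-off between the two strategies can be sketched in a toy model (names and structures here are invented for illustration, not Titan's actual storage layout): fanning each tweet out to followers' precomputed streams versus merging followees' recent tweets at read time.

```python
import heapq
from collections import defaultdict

follows = defaultdict(set)   # user -> users they follow
tweets = defaultdict(list)   # user -> [(timestamp, text)], oldest first
stream = defaultdict(list)   # user -> precomputed stream edges

def post(user, ts, text):
    tweets[user].append((ts, text))
    # "Pre-compute the stream edge": fan the tweet out to every follower.
    # (Toy fan-out scan; a real system would index followers directly.)
    for u, followees in follows.items():
        if user in followees:
            stream[u].append((ts, text))

def read_stream_precomputed(user, k=10):
    # One scan of the user's own time-ordered stream edges
    return sorted(stream[user], reverse=True)[:k]

def read_stream_query_time(user, k=10):
    # Traverse to each followee, pull recent tweets, merge in memory
    candidates = []
    for f in follows[user]:
        candidates.extend(tweets[f][-k:])
    return heapq.nlargest(k, candidates)

follows["alice"] = {"bob", "carol"}
for ts, (who, txt) in enumerate([("bob", "hi"), ("carol", "yo"), ("bob", "bye")]):
    post(who, ts, txt)
print(read_stream_precomputed("alice", 2) == read_stream_query_time("alice", 2))  # True
```

Both return the same stream; the precomputed path touches one vertex's index, while the query-time path touches every followee, which is what makes the former cheap for read-heavy workloads.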
Right now Titan traversals run locally and communicate with the back-end Cassandra or HBase datastore via Thrift, but they are working on moving the traversal code inside the Cassandra/HBase JVM so you can run Gremlin traversals inside the distributed cluster rather than pulling the elements over the wire.
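The win from moving traversals into the cluster is mostly round trips. A toy accounting sketch (illustrative only, not Titan's actual wire protocol):

```python
# Two-hop neighborhood query over a tiny adjacency map.
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [], 5: []}

def two_hop_client_side(v):
    """Client pulls each adjacency list over the wire:
    one round trip per vertex visited."""
    trips = 1                      # fetch v's adjacency list
    result = set()
    for n in graph[v]:
        trips += 1                 # fetch n's adjacency list
        result.update(graph[n])
    return result, trips

def two_hop_server_side(v):
    """Traversal runs next to the data: the whole query is one round trip."""
    result = set()
    for n in graph[v]:
        result.update(graph[n])
    return result, 1

print(two_hop_client_side(1))   # ({4, 5}, 3)
print(two_hop_server_side(1))   # ({4, 5}, 1)
```

Client-side round trips grow with the frontier size at each hop, so pushing the traversal to where the data lives pays off quickly on dense graphs.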
Cassandra is turning out to be one killer data backend. It's really exciting to see what's being built on top of it.
Hadoop, Solr/Lucene, and now Blueprints/Graph DB operations are all available on the same Cassandra cluster, in addition to the stuff Cassandra does quote-unquote natively. Add Zookeeper for the few times you need an honest-to-goodness transaction, and it's just crazy how good the tech has gotten on the backend in the last 10 years. :-)
Absolutely: without NoSQL solutions like Cassandra, Titan would not be possible.
Regarding Zookeeper: we actually built a locking system into Titan that uses quorum reads/writes with time-outs and cleanup to ensure consistency for certain edge/property types as defined inside Titan. This gives you consistency guarantees out of the box without having to introduce another component (like Zookeeper) into your deployment. For infrequent lock usage (which I strongly encourage ;-), this should be sufficient. For frequent locking, something like Zookeeper is far superior.
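The general technique described above, not Titan's actual code, can be sketched as: write a lock claim to a quorum of replicas, read back at quorum to see whether a majority agrees the claim is yours, and treat claims older than a timeout as stale so they get cleaned up.

```python
import time

# Replicas are plain dicts standing in for rows in the datastore.
REPLICAS = [dict() for _ in range(3)]
QUORUM = 2
LOCK_TTL = 5.0  # seconds before a claim is considered stale

def acquire(key, owner, now=None):
    now = time.time() if now is None else now
    # Quorum write: record our claim on a majority of replicas,
    # overwriting only free or expired (stale) claims.
    for r in REPLICAS[:QUORUM]:
        claim = r.get(key)
        if claim is None or now - claim[1] > LOCK_TTL:
            r[key] = (owner, now)
    # Quorum read: we hold the lock only if a majority says it is ours.
    votes = sum(
        1 for r in REPLICAS
        if r.get(key, (None, 0.0))[0] == owner
        and now - r.get(key, (None, 0.0))[1] <= LOCK_TTL
    )
    return votes >= QUORUM

def release(key, owner):
    for r in REPLICAS:
        if r.get(key, (None,))[0] == owner:
            del r[key]

print(acquire("edge:42", "txn-A"))   # True: lock was free
print(acquire("edge:42", "txn-B"))   # False: txn-A holds it
release("edge:42", "txn-A")
print(acquire("edge:42", "txn-B"))   # True after release
```

The TTL is what makes cleanup work without coordination: a crashed holder's claim simply expires, at the cost of having to pick a timeout longer than any legitimate critical section.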
Titan is distributed and has much higher write performance (5000 transactions/sec). In the words of Matthias Broecheler, the creator of Titan, you would use Titan to implement a social network (https://twitter.com/MBroecheler/statuses/213350753031569409).
Titan has pluggable storage options: Cassandra, HBase, Berkeley DB, and adapters for DynamoDB and App Engine Datastore are in the works. However, it's not clear how well App Engine and DynamoDB will work with the upcoming Titan changes, which will move the traversal engine into the cluster's JVM.
continuations | 13 years ago:
Are there any open source graph databases that are scalable in your opinion? Which one would you recommend?
espeed | 13 years ago:
See https://groups.google.com/d/msg/gremlin-users/6GYiHn3QR8g/81...
espeed | 13 years ago:
Both Neo4j and Titan support Blueprints, and you can run Gremlin (https://github.com/tinkerpop/gremlin/wiki) on both. Marko Rodriguez (http://markorodriguez.com/), the creator of Gremlin, is one of the principals behind Titan.
See "The Benefits of Titan": https://github.com/thinkaurelius/titan/wiki/The-Benefits-of-...