I tried Neo4j a while back for recommendations and calculating similarities between users, but when running against our full dataset I got too many OutOfMemory exceptions. Ended up with a Mahout / Spark solution. It's an awesome graph db though - I can find many other uses for it.
levbrie|9 years ago
weego|9 years ago
At that point a more pragmatic solution has always won.
tummybug|9 years ago
iamtherhino|9 years ago
e.g.,
//no labels or relationship type here, just variable names, so this grabs literally EVERY connected node in your database
MATCH (Person)-[KNOWS]-(Friend)
//only the people who have a KNOWS relationship between them
MATCH (person:Person)-[:KNOWS]-(friend:Person)
sandGorgon|9 years ago
so we use neo4j for oltp and spark for the olap part.
whenwillitstop|9 years ago
iamtherhino|9 years ago
It's a single-image database (no partitioning except in memory), so all nodes in the cluster will have the complete dataset (thus each node must be large enough to store it). However, because Neo4j doesn't rely on joins / table scans to operate, traversals are O(1) per hop, not O(n). So there's an advantage to doing OLTP work on really, really large datasets that have a specific starting point. Neo4j will do pointer arithmetic instead of scans / joins, such that regardless of dataset size a query will only access a fixed amount of data. The reason for this strategy has been that scale-up hardware pricing has come down incredibly quickly in the last decade, and having a trio of 64+++ GB memory boxes isn't out of the question for most mid-size and enterprise companies. Secondly, distributed systems are non-trivial problems to manage, from both a development and a devops perspective.
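To make the scan vs. pointer-chasing point concrete, here's a toy sketch in Python. It is not Neo4j's actual storage engine: the function names are made up for illustration, and a dict lookup stands in for following a stored pointer. It just counts how many records each approach touches to find one person's neighbors.

```python
# Toy comparison only; names are hypothetical and this is not how Neo4j stores data.

def friends_by_scan(edges, person):
    """Join-table style: look at every edge, so work grows with the whole dataset."""
    checked = 0
    found = []
    for src, dst in edges:
        checked += 1
        if src == person:
            found.append(dst)
    return found, checked


def build_adjacency(edges):
    """Store each node's neighbors next to the node itself."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    return adjacency


def friends_by_adjacency(adjacency, person):
    """Adjacency style: start at the node and read only its own neighbor list."""
    found = list(adjacency.get(person, []))
    return found, len(found)


edges = [("alice", "bob"), ("alice", "carol")]
# Pad with unrelated edges to simulate a much larger dataset.
edges += [(f"user{i}", f"user{i + 1}") for i in range(1000)]
adjacency = build_adjacency(edges)

scan_result, scan_checked = friends_by_scan(edges, "alice")
adj_result, adj_checked = friends_by_adjacency(adjacency, "alice")
print(scan_checked, adj_checked)
```

Adding more unrelated edges increases `scan_checked` but leaves `adj_checked` unchanged, which is the "specific starting point" advantage described above.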
The philosophy of the Neo4j team is to conquer the world slowly. In order of priority Neo4j is designed around:
1.) data integrity and availability (ACID transactions, master-slave replication)
2.) rapid reads for graph traversals
3.) ability to store web-scale datasets (trillions++ of nodes)
4.) parallel operations (multi-master, map-reduce, global analytics, etc.)
The product has firmly completed 1 and 2, and is starting to work on 3 and 4 (4 mostly with a databricks / spark partnership).
It fights the same CAP problem that all databases do. We've chosen Consistency and Availability. Partition tolerance just isn't something inherent to graph databases. We can do some really smart math and duplicate nodes with high betweenness centrality (data nodes, not servers) or shuffle data based on access patterns to prevent introducing network latency into query plans that access nodes on multiple partitions. But doing that while maintaining 1 and 2 of the above is very not easy.
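A rough illustration of the partitioning problem, again as a toy Python sketch with made-up names and a hand-picked placement (nothing here reflects a real Neo4j feature). Each data node is assigned to a set of partitions, and an edge counts as "remote" when its two endpoints share no partition, meaning a traversal over it would need a network hop. Duplicating a well-connected node onto both partitions cuts the number of remote edges.

```python
# Toy model: partitions_of maps each data node to the set of partitions holding a copy.

def cross_partition_edges(edges, partitions_of):
    """Return edges whose endpoints have no partition in common."""
    return [(a, b) for a, b in edges if not (partitions_of[a] & partitions_of[b])]


edges = [
    ("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"),
    ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"),
]

# Hypothetical placement: "hub" lives only on partition 0.
single_copy = {"a": {0}, "b": {0}, "c": {1}, "d": {1}, "hub": {0}}

# Same placement, but "hub" is duplicated onto both partitions.
hub_replicated = dict(single_copy, hub={0, 1})

remote_before = cross_partition_edges(edges, single_copy)
remote_after = cross_partition_edges(edges, hub_replicated)
print(len(remote_before), len(remote_after))
```

The catch is that every duplicated node now has to be kept in sync on writes, which is presumably where this collides with priorities 1 and 2.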
Disclaimer:
MATCH (rhino)-[:WORKS_AT]->(neo4j)
WHERE NOT rhino.opinions = neo4j.opinions