top | item 12352190

flaviotsf | 9 years ago

I tried Neo4j a while back for recommendations and calculating similarities between users, but when running it against our full dataset I got too many OutOfMemory exceptions. I ended up with a Mahout / Spark solution. It's an awesome graph DB though, and I can find many other uses for it.

levbrie|9 years ago

Yeah, I'm surprised the Neo4j team hasn't made more of an effort on this. I've run into lots of memory issues with it as well, and although there are reliable, fairly straightforward solutions to most of these problems, the team doesn't seem to be particularly interested in making sure that the defaults are robust enough to handle a reasonable workload. When your database fails on you for making a reasonable query request on a light workload, you can't help but feel troubled. There's a lot to love about Neo4j, but they've got a lot of work to do if they want to win over the developer community as a whole. There may be enterprises that get reassured by a huge price tag and a whole bunch of salespeople at their beck and call, but I don't know any of them. Every engineer I know who is willing to pay for software is either expecting a completely new kind of product or expecting to have an awesome experience with a free version of the tool before being willing to commit even a few bucks a month.

weego|9 years ago

Yeah, I've tried a couple of times to get Neo4j into stacks, but the outcome has always been that it's pretty much limited to baking relationship data in advance or on demand, which is then saved elsewhere and cleared out. Otherwise you get into prohibitively expensive licensing / infrastructure territory very quickly.

At that point a more pragmatic solution has always won.

tummybug|9 years ago

Exactly the same as you. I was just trying out Neo4j today with a small dataset (30 MB) and was getting memory exceptions trying to add a relationship.

iamtherhino|9 years ago

Would you mind sharing the query? If you're hitting OOM exceptions with a dataset of that size there may be a typo in the query that's doing some sort of traveling salesman operation.

e.g.,

//no labels or relationship type here, only variable names: grabs literally EVERY pair of connected nodes in your database

MATCH (Person)-[KNOWS]-(Friend)

//only the people who have a KNOWS relationship between them

MATCH (person:Person)-[:KNOWS]-(Friend:Person)
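
To make the difference concrete, here is a minimal Python sketch of the two patterns run against a toy in-memory graph. It is not how Neo4j evaluates a query; the node ids, the extra labels (`Company`, `Product`), the extra relationship types and the `match_undirected` helper are all made up for illustration. The point is only that a pattern with bare names filters nothing, while labels and a relationship type do.

```python
# Toy in-memory graph (hypothetical data, for illustration only).
# nodes: id -> set of labels; rels: (start id, relationship type, end id)
nodes = {
    1: {"Person"},
    2: {"Person"},
    3: {"Company"},
    4: {"Product"},
}
rels = [
    (1, "KNOWS", 2),
    (1, "WORKS_AT", 3),
    (3, "SELLS", 4),
]


def match_undirected(start_label=None, rel_type=None, end_label=None):
    """Return (a, b) pairs for an undirected pattern (a)-[r]-(b).

    A filter left as None behaves like a bare variable name in a
    pattern: it constrains nothing.
    """
    rows = []
    for start, rtype, end in rels:
        if rel_type is not None and rtype != rel_type:
            continue
        # Undirected pattern: each relationship is tried in both directions.
        for a, b in ((start, end), (end, start)):
            if start_label is not None and start_label not in nodes[a]:
                continue
            if end_label is not None and end_label not in nodes[b]:
                continue
            rows.append((a, b))
    return rows


# Like MATCH (Person)-[KNOWS]-(Friend): nothing is filtered.
unfiltered = match_undirected()

# Like MATCH (person:Person)-[:KNOWS]-(Friend:Person): labels and type apply.
filtered = match_undirected("Person", "KNOWS", "Person")

print(len(unfiltered), len(filtered))
```

On a real dataset the unfiltered row count grows with the total number of relationships, which is the kind of blow-up that can exhaust memory once further clauses are applied to every row.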

sandGorgon|9 years ago

The solution we are moving to is to use Spark to compute similarities, etc., and load the results into a Neo4j graph.

So we use Neo4j for OLTP and Spark for the OLAP part.
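
A rough sketch of that split, under stated assumptions: plain Python stands in for the Spark job, Jaccard similarity stands in for whatever similarity measure is actually used, and the `User` label, `SIMILAR_TO` relationship type, `likes` data and `LOAD_QUERY` text are all hypothetical. The batch step produces rows offline; the graph database only has to merge them.

```python
from itertools import combinations

# Hypothetical input: user -> set of item ids the user interacted with.
likes = {
    "alice": {"a", "b", "c"},
    "bob": {"b", "c", "d"},
    "carol": {"x"},
}


def jaccard(left, right):
    """Jaccard similarity: |intersection| / |union| (0.0 for two empty sets)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def similarity_rows(likes, threshold=0.0):
    """Batch (OLAP-style) step: score every user pair, keep those above threshold."""
    rows = []
    for u, v in combinations(sorted(likes), 2):
        score = jaccard(likes[u], likes[v])
        if score > threshold:
            rows.append({"a": u, "b": v, "score": score})
    return rows


# Assumed load statement: the rows would be passed as the $rows parameter
# through whichever Neo4j client is in use. Label and type names are made up.
LOAD_QUERY = """
UNWIND $rows AS row
MATCH (a:User {id: row.a}), (b:User {id: row.b})
MERGE (a)-[s:SIMILAR_TO]-(b)
SET s.score = row.score
"""

rows = similarity_rows(likes)
print(rows)
```

Scoring all pairs is quadratic in the number of users, which is why that part sits in the batch system; the graph then answers cheap lookups such as "who is similar to this user".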

whenwillitstop|9 years ago

Can you though? My impression is that it doesn't scale to large data sets. The use cases for true graph databases (over shaky implementations on HBase/Cassandra) are sparse, in my opinion.

iamtherhino|9 years ago

Six of one, half a dozen of the other.

It's a single image database (no partitioning except in memory), so all nodes in the cluster will have the complete dataset (thus each node must be large enough to store it). However, because Neo4j doesn't rely on joins / table scans to operate, traversals are O(1), not O(n). So there's an advantage to doing OLTP work on really, really large datasets when the query has a specific starting point. Neo4j will do pointer arithmetic instead of scans / joins, such that regardless of dataset size a query will only access a fixed amount of data. The reason for this strategy has been that scale-up hardware pricing has come down incredibly quickly in the last decade, and having a trio of 64+ GB memory boxes isn't out of the question for most mid-size and enterprise companies. Secondly, distributed systems are nontrivial problems to manage from both a development and a devops perspective.
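
A minimal sketch of the scan-versus-pointer idea, not Neo4j's actual storage format: `edges` plays the role of an unindexed join table, and a Python dict of adjacency lists stands in for records that point directly at their neighbours. All names and the chain-shaped data are made up for illustration.

```python
# Relationship "table": one row per edge, here a long chain 0 -> 1 -> 2 -> ...
edges = [(i, i + 1) for i in range(100_000)]


def neighbors_by_scan(node):
    """Join-style lookup with no index: every row is inspected."""
    touched = 0
    found = []
    for src, dst in edges:
        touched += 1
        if src == node:
            found.append(dst)
    return found, touched


# Adjacency lists built once up front; each node maps straight to its neighbours.
adjacency = {}
for src, dst in edges:
    adjacency.setdefault(src, []).append(dst)


def neighbors_by_adjacency(node):
    """Traversal-style lookup: only the start node's own neighbours are touched."""
    found = adjacency.get(node, [])
    return found, len(found)


scan_result = neighbors_by_scan(500)
hop_result = neighbors_by_adjacency(500)
print(scan_result[1], hop_result[1])
```

Both calls find the same neighbour, but the scan touches every row while the hop touches one record. In this sketch the cost of a hop tracks the degree of the start node rather than the size of the dataset. A relational engine with an index on the source column would sit somewhere in between, so the full scan here is the worst case, not a fair benchmark.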

The philosophy of the Neo4j team is to conquer the world slowly. In order of priority Neo4j is designed around:

1.) data integrity and availability (ACID transactions, master-slave replication)

2.) rapid reads for graph traversals

3.) ability to store web-scale datasets (trillions++ of nodes)

4.) parallel operations (multi-master, map-reduce, global analytics, etc.)

The product has firmly completed 1 and 2, and is starting to work on 3 and 4 (4 mostly with a Databricks / Spark partnership).

It fights the same CAP problem that all databases do. We've chosen Consistency and Availability. Partition tolerance just isn't something inherent to graph databases. We can do some really smart math and duplicate nodes with high betweenness centrality (data nodes, not servers) or shuffle data based on access patterns to prevent introducing network latency into query plans that access nodes on multiple partitions. But doing that while maintaining 1 and 2 of the above is very not easy.

Disclaimer:

MATCH (rhino)-[:WORKS_AT]->(neo4j)

WHERE NOT rhino.opinions = neo4j.opinions