I did a lot of research on graph database technologies recently and read a lot of these "let's compare X to Y" articles. What I found is that most benchmarks, especially those done by people affiliated with a given product, tend to paint a distorted and sometimes plainly wrong picture.
For example, concerning the performance and scalability of graph databases, the main argument of proponents of this technology is the "join bomb" argument, which states that you can't efficiently store a graph in a relational database since it will require O(log(n)) time to look up neighboring nodes from the index when crawling the graph. However, this is only true for B-tree indexes; hash-based indexing would give you basically the same O(1) performance on a graph implemented in a relational database.
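To make that point concrete, here is a minimal sketch of a graph stored as an edge table in a relational database, with an index on the source column so a neighbor lookup is a single indexed probe. SQLite is used purely for illustration (it only offers B-tree indexes; the hash indexing mentioned above would be something like PostgreSQL's `CREATE INDEX ... USING HASH`), and the schema and data are invented:

```python
import sqlite3

# Hypothetical minimal schema: a graph as an edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.execute("CREATE INDEX idx_edges_src ON edges (src)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (1, 3), (2, 3), (3, 1)])

def neighbors(node):
    # One index probe per node: O(log n) on a B-tree,
    # and essentially O(1) with a hash index.
    return [dst for (dst,) in conn.execute(
        "SELECT dst FROM edges WHERE src = ? ORDER BY dst", (node,))]

print(neighbors(1))  # -> [2, 3]
```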
Additional features like documents and deep indexes are nice of course but can be (and often are) implemented using relational databases as well, so in the end there really isn't such a large advantage to be gained from using a graph database, especially when taking into account the immaturity of many solutions in that space.
>Additional features like documents and deep indexes are nice of course but can be (and often are) implemented using relational databases as well, so in the end there really isn't such a large advantage to be gained from using a graph database, especially when taking into account the immaturity of many solutions in that space.
I've worked with graph data stored in an RDBMS in the medical informatics space. As you say, there are ways to correctly handle complex graph data in an RDBMS.
I've also used Neo4j as the backend for a Wall Street analytics app that's in production. Could it have been done in an RDBMS? Sure, but the ad hoc queries that needed to be run against the data were much easier to express as graph traversals than as SQL.
There are some obvious downsides to using a graph database, mainly that it's practically impossible to find programmers with non-trivial production experience, but it's been a great fit at the two startups I used it at, since I got to implement it from the ground up and didn't need a large team.
That being said, database pragmatism is the main lesson to be learned here. Use the right tool(s) for the right jobs.
The purpose of this benchmark series was not to provide a comprehensive test of all these databases. We only wanted to demonstrate that a multi-model database can successfully compete with specialised solutions like document stores and specialised graph databases.
I agree with your comment about graph databases. The crucial things are that the neighbors of a vertex can be found in time proportional to their number, and that queries involving an a priori unknown number of steps (graph traversals, path matching, shortest path, etc.) run efficiently in the database server and can be accessed conveniently from the query language.
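As a sketch of what a traversal of a priori unknown depth "running in the database server" can look like, even in a relational engine, here is a recursive CTE computing reachability in SQLite. The graph is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4), (2, 5)])

# Reachability from node 1 with an unknown number of steps, computed
# server-side in one recursive query instead of one round trip per hop.
rows = conn.execute("""
    WITH RECURSIVE reach(node) AS (
        VALUES (1)
        UNION
        SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
    )
    SELECT node FROM reach ORDER BY node
""").fetchall()
print([n for (n,) in rows])  # -> [1, 2, 3, 4, 5]
```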
After working with Neo4j for about six months I would definitely NOT recommend it to anyone. In my experience it has been the least reliable and most buggy database solution I've ever worked with.
From the engineering point of view it has a host of core issues that make it really hard to write code against, such as constant deadlock exceptions. These are caused by the fact that the database is largely incapable of handling two simultaneous upserts that touch the same node. Where most mature, decent DBs handle this completely transparently, Neo4j just panics and returns an error. This means writing Neo4j queries ends up requiring tons of boilerplate to wait for exclusive locks on nodes and/or retry upserts until they succeed.
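The retry boilerplate described here might look roughly like the following sketch. `DeadlockError` and `flaky_upsert` are hypothetical stand-ins for illustration, not actual Neo4j driver APIs:

```python
import random
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for a driver's deadlock exception."""

def with_deadlock_retry(operation, max_attempts=5):
    # Retry an upsert that may fail when two writers touch the same
    # node, with exponential backoff plus jitter between attempts.
    for attempt in range(max_attempts):
        try:
            return operation()
        except DeadlockError:
            if attempt == max_attempts - 1:
                raise
            time.sleep((2 ** attempt) * 0.01 + random.random() * 0.01)

# Example: a made-up operation that deadlocks twice, then succeeds.
attempts = {"n": 0}
def flaky_upsert():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise DeadlockError()
    return "ok"

print(with_deadlock_retry(flaky_upsert))  # retries twice, then prints "ok"
```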
From the devops perspective managing a cluster is also extremely painful, as there are frequent issues with replicas getting behind on syncing due to the server's pathetically slow write performance even with SSD storage volumes. We tried everything but the bottleneck was the server processes themselves, not the storage volumes, network connection, CPU, etc. We threw some really nice hardware at our Neo4j cluster but it still struggled to keep up with write loads in the range of 500-2000 writes per minute.
The final straw was their latest version 2.2 which they were advertising as a massive improvement in speed and reliability. When we upgraded it turned out to be the exact opposite. A few of our queries got faster but overall most of them got an order of magnitude slower. Their support basically told us that we'd need to rewrite many of our queries or manually set a flag to use their older query engine (and therefore miss out on the speed of the new query engine). Needless to say we decided if we needed to rewrite queries we were going to rewrite them to use a different storage engine entirely.
In my experience Neo4j was little more than a six month waste of time and dev resources.
I'm Claudius, the author of the blog post. The intent of the blog was not to show that a particular product performs badly. There are thousands of different use cases, and each database has its strengths and weaknesses; for a different scenario the results might be different. Neo4j is a solid product and does a good job. The aim of the blog was to show that a multi-model database can compete with specialized solutions, i.e. that a multi-model approach per se does not carry a performance penalty.
I had a pretty similar experience, in fact. I lost months of work to Neo4J back in 2011 or so.
The purpose was to be the storage backend for ConceptNet, a semantic network that is largish but is far from "big data". The write speed was awful, the stability was awful, the mechanisms for loading in non-toy amounts of data were nearly nonexistent, and I learned what it means to "run out of PermGen".
I hastily aborted the Neo4J plan. I was also burned by RDF triple-stores (they can't cope with small data either) and MongoDB (which seemed to work at first and eventually fell over, for obvious reasons in retrospect).
The graph is now in SQLite plus flat files, which works great. I've concluded by now that the step on the scale beyond "lies" and "damn lies" is "claims about next-generation databases".
Good to hear, because I started experimenting with it a couple of months ago to solve a specific problem. I abandoned the embedded version because it required an older version of Lucene than I was using. Yes, I could have worked around that, but it's a lot of extra work to try something that may or may not solve my problem. Then I tried the server version, which required being able to call web services (something my code had no other use for). I worked through that and found the performance pretty poor compared to embedded. I ended up just solving the problem in Lucene. Glad I didn't spend more time on it.
I would love to have a good graph database later on though
Haha, damn... That sounds pretty harsh. Neo4J seems really interesting to me. Who doesn't want to get better with graphs? I still want to learn the query language as it is applicable to some of the problems I'm trying to solve now.
This isn't the first horror story I've heard about Neo4J so I've actually stayed away from using it professionally.
IMHO, the community needs a set of specific tasks that can be achieved with all databases (just like http://benchmarksgame.alioth.debian.org/ has a series of algorithms for testing different memory/CPU strengths of languages). Then, proponents of each database (e.g., their sponsors, evangelists) can create code and config for running the tests on their database. This could all be open source, and the tests could all be run on the same host (or hosts) for comparison.
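A minimal version of such a harness could look like the following sketch. The contender/task structure is an assumption, and a toy dict-backed "database" stands in for real drivers; the point is only that every contender runs the same named tasks on the same host:

```python
import time

def run_suite(contenders, tasks):
    # Time each named task against each contender's client and collect
    # the elapsed seconds into a comparable results table.
    results = {}
    for name, make_client in contenders.items():
        client = make_client()
        results[name] = {}
        for task_name, task in tasks.items():
            start = time.perf_counter()
            task(client)
            results[name][task_name] = time.perf_counter() - start
    return results

# Toy stand-ins: a plain dict as the "database", one insert task.
contenders = {"toydb": dict}
tasks = {"insert_1000": lambda db: [db.__setitem__(i, i) for i in range(1000)]}
print(run_suite(contenders, tasks))
```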
When I compare databases, I also seek out the performance comparisons created or sponsored by my preferred database provider; then I know I can trust the results to be complete and unbiased. /sarcasm
Seriously, why is this one of the top stories on HN? These types of tests are so easy to tweak in favor of a preferred database that they are completely unreliable. Even neutral comparisons by third parties are rife with errors, like not adding proper indices to all the DBs, using query formulations that avoid the indices on some DBs, or other configuration issues (DBs are unfortunately tricky).
I think the only way to do this objectively is to have a test and then give each DB vendor an opportunity to tweak the DB and queries to optimize performance. Seeing how the vendors optimized performance would actually be very informative to potential users. Everything else is just a comedy of errors (or worse), as people usually have real expertise in only one of the DBs in question, if that.
It looks like the source of the test is out there for others to compare and review, which is far better than most of these types of tests manage. I'm not an ArangoDB person, I've done a bit with other NoSQL variants, but I'll say their approach is fine. They don't even come out on top in all the tests...
It's mainly about showing how they compare performance-wise, so that they can concentrate on selling based on features. Which is a pretty fair approach, and I wish them luck.
Also, in looking, it seems that they have pretty broad platform support as well.
> I think the only way to do this objectively is to have a test and then give each DB vendor an opportunity to tweak the DB and queries to optimize performance.
AFAIK they are: the test is open source, the raw results are there, and contributions welcomed. Hopefully the OrientDB team will step up and show how theirs can perform.
Likewise. I started some stuff at work recently comparing Mongo, Orient and PG (using a JSON column). Alas, all the test suite does so far is insert small "documents" (5000 docs with 3 name/value pairs each, excluding PK/ID) and time that. No read-back tests of any kind yet, so no indices in place either.
For this little test (on my macbook), Mongo was the fastest. PG took 1 1/2 times as long, and Orient took 4 times as long as Mongo. All 3 were driven by a Java client connected via a socket to the DB on localhost. (Orient could have been "in process", but I wanted it external as if on a server)
Of course, the main use-case for Orient is reading back graph chains, so it's a horrible test. However, what we need is a supplemental store to dump some flat junk as the app runs.
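The shape of that insert test can be sketched as follows, with SQLite standing in for the stores under test (the actual suite drove Mongo, Orient and PG from a Java client over a socket; the document shape here is an assumption):

```python
import json
import sqlite3
import time

# Sketch of the insert test described above: 5000 small documents with
# three name/value pairs each, inserted and timed end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

docs = [json.dumps({"a": i, "b": i * 2, "c": f"name-{i}"})
        for i in range(5000)]

start = time.perf_counter()
with conn:  # one transaction for the whole batch
    conn.executemany("INSERT INTO docs (body) VALUES (?)",
                     [(d,) for d in docs])
elapsed = time.perf_counter() - start

count = conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
print(f"inserted {count} docs in {elapsed:.4f}s")
```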
Worked with Orient back in 2012, had some issues regarding performance and switched. Following the news about them, I thought they had made great progress. This benchmark kind of shows the exact opposite.
Are you sure that you optimized/tuned Neo4j or MongoDB as much as you did with ArangoDB?
Also, I don't like it when a company posts a comparison between its own product and others. Although some of these are arguably informative and objective, I consider such posts marketing/ad posts.
(Disclaimer: Max from ArangoDB)
We have invested considerable effort to optimize each database. Obviously, we know our own product better than the others.
However, we have asked people who know the other products better, and we keep this investigation open for everybody to contribute and to suggest improvements. As you can see from last week's post, there have been very good contributions, we have tried them out and have published the improved results.
Hey all, we sent a Pull Request 2 days ago to the author of the Benchmark, as they used OrientDB incorrectly. Now OrientDB is the fastest in all the benchmarks, except for "singleRead" and "neighbors2", but we know why we're slower there.
We are still waiting for the Arango team to update the results...
However, anyone interested in running the tests themselves can just clone this repository: https://github.com/maggiolo00/nosql-tests
Anyone using ArangoDB in production who can speak about it? It looks interesting, but like many of the newer databases coming out (Aerospike, Blazegraph, Hyperdex etc.) there is precious little public information from third parties.
Great, now tell the graph-based-DB guys how you do on multi-doc "join" queries :-)
Disclaimer: I'm more of a Postgres fan-boy.
Mongo's single-record insert/fetch times are impressive, though. And it's pretty easy to set up a single node. So it could sometimes be the right choice.
A shared benchmark suite like that seems to make sense, and is more akin to what https://www.techempower.com/benchmarks/ has done, IIRC.