PaulHoule|3 years ago

I don't think a single product is going to satisfy everyone, and that's a problem with the category. If by "graph database" you mean completely random-access workloads, you are doomed to bad performance.
There was a time when I was regularly handling around 2³² (about four billion) triples with OpenLink Virtuoso in cloud environments, in an unchanging knowledge base. I was building that knowledge base with specialized tools that involved:
* map/reduce processing
* lots of data compression
* specialized in-memory computation steps
* approximation algorithms that dramatically speed things up (there was a calculation that would have taken a century to do exactly, but we could get very close to the answer in 20 minutes)
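The kind of trade the last bullet describes can be sketched with a K-minimum-values (KMV) distinct-count estimator — my own illustrative example, not the actual algorithm that was used:

```python
import hashlib

def kmv_estimate(items, k=256):
    """Estimate the number of distinct items from the k smallest
    normalized hash values -- one pass, O(k) memory."""
    mins = set()  # the k smallest distinct hash values seen so far
    for item in items:
        h = int.from_bytes(hashlib.sha1(repr(item).encode()).digest()[:8], "big")
        x = h / 2.0**64  # map the hash uniformly into [0, 1)
        mins.add(x)
        if len(mins) > k:
            mins.remove(max(mins))
    if len(mins) < k:
        return len(mins)             # fewer than k distinct values: exact
    return int((k - 1) / max(mins))  # standard KMV estimator

# 100,000 items but only 20,000 distinct values:
est = kmv_estimate(i % 20000 for i in range(100000))
```

With k = 256 the standard error is roughly 1/√k ≈ 6%, which is the flavor of the trade-off: a near-exact answer in a tiny fraction of the exact algorithm's time and memory.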
Another product I've had a huge amount of fun with is ArangoDB, particularly for applications work on my own account. When I flip a Sengled switch in my house and it lights up a Hue lamp, ArangoDB is part of it. I am working on a smart RSS reader right now which puts a real UI in front of something like https://ontology2.com/essays/ClassifyingHackerNewsArticles/ and uses ArangoDB for that. I haven't built apps for customers with it, but I did do some research projects where we used it to work with big biomedical ontologies like MeSH, and it held up pretty well.
I came to the conclusion that it wasn't scalable to throw everything into one big graph, particularly if you were interested in inference, and I went through many variations of what to do about it. One concept was a "graph database construction set" that would help build multi-paradigm data pipelines like the ones described above. Since one big graph didn't make sense, I got interested in systems that work with lots of little graphs.
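A minimal sketch of the "lots of little graphs" idea, with plain Python sets standing in for named RDF graphs (a real system would use something like named graphs in a quad store):

```python
# Each record lives in its own small named graph rather than one big one,
# so a lookup only ever scans the triples of a single record.
store = {}  # graph_id -> set of (subject, predicate, object) triples

def add(graph_id, s, p, o):
    store.setdefault(graph_id, set()).add((s, p, o))

def match(graph_id, s=None, p=None, o=None):
    """Match a triple pattern (None = wildcard) inside one named graph."""
    return {t for t in store.get(graph_id, ())
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

add("doc:1", "doc:1", "dc:title", "Graph databases")
add("doc:1", "doc:1", "dc:creator", "alice")
add("doc:2", "doc:2", "dc:creator", "bob")
```

The point is the access pattern: queries are scoped to one small graph, so nothing ever has to traverse (or index) the union of everything.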
I got serious and paired up with a "non-technical cofounder," and we tried to pitch something that works like one of those "boxes-and-lines" data analysis tools like Alteryx. Tools like that ordinarily pass relational rows along the lines, but that makes the data pipelines a bear to maintain: people have to set up joins in such a way that what seems like a local operation, doable in one part of the pipeline, forces you to scatter boxes and lines all across a big computation.
I built a prototype that passed small RDF graphs around like little JSON documents, defined a lot of the algebra over those graphs, and used stream processing methods to do batch jobs. It wasn't super high performance, but coding for it was straightforward, it was reliable, and it always got the right answers.
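That pipeline idea can be sketched as an algebra of graph-to-graph operators lifted over a stream, where a "graph" is just a frozenset of triples (an illustration of the style, not the actual prototype):

```python
from functools import reduce

def mapg(fn):
    """Lift a graph -> graph function to a stream operator."""
    return lambda graphs: (fn(g) for g in graphs)

def filterg(pred):
    """Keep only the graphs satisfying a predicate."""
    return lambda graphs: (g for g in graphs if pred(g))

def pipeline(*ops):
    """Compose stream operators left to right."""
    return lambda graphs: reduce(lambda gs, op: op(gs), ops, graphs)

# keep graphs describing a Person, then project out just the name triples
keep_people = filterg(lambda g: any(p == "rdf:type" and o == "Person"
                                    for _, p, o in g))
names_only = mapg(lambda g: frozenset(t for t in g if t[1] == "foaf:name"))

docs = [
    frozenset({("a", "rdf:type", "Person"), ("a", "foaf:name", "Alice")}),
    frozenset({("b", "rdf:type", "Place"), ("b", "foaf:name", "Berlin")}),
]
result = list(pipeline(keep_people, names_only)(docs))
```

Because each operator works on one small graph at a time, a "local" operation stays local — no joins have to be threaded across the whole pipeline the way they do with relational rows.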
I had a falling out with my co-founder, but we talked to a lot of people and found that database and processing-pipeline people were skeptical about what we were doing in two ways. One was that the industry was giving up even on row-oriented processing and moving towards column-oriented processing, and people in the know didn't want to fund anything different. (I learned a lot about that; I sometimes drive people crazy with "you could reorganize that calculation and speed it up more than 10x," and they are like "no way"...) I also found out that database people really don't like the idea of unioning a large number of systems with separate indexes; they kind of tune out and don't listen until the conversation moves on.
(There is a "disruptive technology" situation in that vendors think their customers demand the utmost performance possible but I think there are people out there who would be more productive with a slower product that is easier and more flexible to code for.)
I reached the end of my rope and went back to working ordinary jobs. I wound up at a place working on something similar to what I had worked on, but I spent most of my time on a machine-learning training system that sat alongside the "stream processing engine." I think I was the only person other than the CEO and CTO who claimed to understand the vision of the company in all-hands meetings. We did a pivot, they put me on the stream processing engine, and I found out that they didn't know what algebra it worked on and that it didn't get the right answers all the time.
Back in those days I got on a standards committee involved with the semantics of financial messaging, and I have been working on that for years. Over time I've gotten close to a complete theory for how to turn messages (say XML Schema, JSON, ...) and other data structures into RDF structures. After I'd given up, I met somebody who actually knows how to do interesting things with OWL, and I got schooled pretty intensively. Now we are thinking about how to model messages as messages (e.g., "this is an element, that is an attribute, these are in this exact order...") and how to model the content of messages ("this is a price, that is a security"), and I'm expecting to open source some of this in the next few months.
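Those two modeling levels can be illustrated on a toy XML message; the `msg:` and `ex:` predicate names below are invented for illustration, not taken from any actual vocabulary in the project:

```python
import xml.etree.ElementTree as ET

XML = '<trade security="ACME"><qty>100</qty><price>101.5</price></trade>'

def structural_triples(xml_text):
    """Model the message *as a message*: elements, attributes, order."""
    root = ET.fromstring(xml_text)
    triples = [("trade", "rdf:type", "msg:Element")]
    for attr in root.attrib:
        triples.append(("trade", "msg:hasAttribute", attr))
    for i, child in enumerate(root):
        triples.append(("trade", "msg:child", child.tag))
        triples.append((child.tag, "msg:position", i))
    return triples

def content_triples(xml_text):
    """Model what the message *says*: domain facts."""
    root = ET.fromstring(xml_text)
    trade = root.attrib["security"]
    return [(trade, "ex:hasPrice", float(root.find("price").text)),
            (trade, "ex:hasQuantity", int(root.find("qty").text))]
```

The structural triples let you round-trip the exact wire format; the content triples are what you would actually reason over with OWL.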
These days I am thinking about what a useful OWL-like product would look like with the advantage that after my time in the wilderness I understand the problem.
Fun read above - very descriptive and interesting... thanks for sharing!
Jupe|3 years ago
OWL? RDF? Were you an RPI graduate perhaps? (I wasn't but did visit them once as part of research project).
At the end of the day, triple stores (or quad stores with provenance) never quite worked as well as simple property graphs (at least for the problems I was solving). I was never really looking for inference, more like extracting specific sub-graphs, or "colored" graphs, so property attribution was much simpler. I ended up fitting it into a simple relational database, and performance was quite good; even better than the "best" NoSQL solutions out there at the time.
And, triple stores just seem to require SO MANY relations! RDF vs Property Graphs feels like XML vs JSON. They both can get the job done, but one just feels "easier" than the other.
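The relational approach Jupe describes can be sketched with three tables — nodes, edges, and properties — where extracting a sub-graph "colored" by one edge label is a plain join (a sketch of the idea, not the actual schema used):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE edges (src INTEGER, dst INTEGER, label TEXT);
    CREATE TABLE props (node INTEGER, key TEXT, value TEXT);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [(1, "Gene"), (2, "Disease"), (3, "Drug")])
con.executemany("INSERT INTO edges VALUES (?, ?, ?)",
                [(1, 2, "ASSOCIATED_WITH"), (3, 2, "TREATS")])
con.execute("INSERT INTO props VALUES (3, 'name', 'aspirin')")

def subgraph(edge_label):
    """Extract the sub-graph 'colored' by a single edge label."""
    return con.execute(
        "SELECT n1.label, e.label, n2.label "
        "FROM edges e "
        "JOIN nodes n1 ON n1.id = e.src "
        "JOIN nodes n2 ON n2.id = e.dst "
        "WHERE e.label = ?", (edge_label,)).fetchall()
```

Node and edge properties hang off a separate key-value table instead of being reified as extra triples, which is much of why this feels lighter than a triple store for this kind of workload.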