
What Every Competent Graph DBMS Should Do

85 points| semihsalihoglu | 3 years ago |kuzudb.com | reply

35 comments

[+] mdaniel|3 years ago|reply
I've now been conditioned to say "ok, when can we expect the Jepsen?" when I see a new database. Although in this case the phrase "in-process" (and it apparently being built on top of Apache Arrow: https://github.com/kuzudb/kuzu/blob/master/external/arrow/ap... ) may make that nonsense, but the readme does also say "Serializable ACID transactions"
[+] semihsalihoglu|3 years ago|reply
I don't quite follow the question here but to clarify one thing: Kùzu is currently integrating Arrow not as a core storage structure but as a file format from which we can ingest data. We are writing Kùzu's entire storage, so it's our own design. It has three components: Vanilla columns for node properties, columnar compressed sparse row (https://tinyurl.com/2r2s4wpe) join indices and relationship properties, and a hash index for primary keys of node records. We don't use Arrow to store db files.

For serializability: yes, we support serializable transactions. So when you insert, delete, or update node or rel records, you get all-or-nothing behavior (e.g., if you roll back, none of your updates will be visible).

That said, supporting ACID transactions is a completely separate design decision in DBMSs, so our (or other systems') mechanisms for supporting transactions (for example, whether they are based on write-ahead logging) and our storage design are largely independent decisions.

[+] jeremyjh|3 years ago|reply
I don't think Jepsen has any relevance to embedded (in-process) databases, nor to any single-node databases. The properties it tests for are related to network partition tolerance.
[+] CharlieDigital|3 years ago|reply
I love seeing Cypher growing.

IMO, the best database query language I've used (various SQL's, document DBs, graph DBs).

[+] pantsforbirds|3 years ago|reply
Hard agree. Cypher is the easiest way to read complex graph operations.
[+] zozbot234|3 years ago|reply
SQL is getting a Property Graph Query extension that should finally match the ease of querying of a typical graph DB. That said, an RDBMS should cover the basic use cases quite well unless you are doing very exotic things with network analytics and such.
[+] semihsalihoglu|3 years ago|reply
Yes, this will be interesting to see. This is the SQL/PGQ extension that's coming in 2023. On the one hand, as I articulate in my post, GDBMSs are based on relational principles, so building a system that can seamlessly query in both the relational and graph models is a good idea. On the other hand: (1) no SQL extension into another data model has been successful historically (e.g., into RDF, XML, or even JSON). That seems to confuse users, so sticking to a single model seems like a good idea. (2) RDBMSs that implement SQL/PGQ will not be very performant unless they change their cores with several techniques my research group and I have been advocating for (e.g., how to do predefined joins, worst-case optimal joins, factorization, etc.).

One of the core developers of Kùzu, Guodong Jin, actually wrote his PhD thesis on a project called GRainDB, which showed how to integrate predefined joins into DuckDB, so an RDBMS can be made efficient on "graph workloads". He did this excellent work: https://www.vldb.org/pvldb/vol15/p1011-jin.pdf, https://www.cidrdb.org/cidr2022/papers/p57-jin.pdf, which articulates a good plan for how RDBMSs can be efficient in a mixed relational-graph model.

Well, it'll be very interesting to see where SQL/PGQ goes...

[+] Rich_Morin|3 years ago|reply
Nicely written article, but it seems to conflate "what it should do" with "how it should do it". So, for example, it talks a lot about optimizations (which feel like implementation decisions).

I'm not saying they aren't important in this DB, but they might not be as relevant in another. For example, Datomic doesn't go this direction, but its ability to avoid read locks on long-running queries might dance around many optimization issues.

FWIW, I like the use of Cypher, which I first encountered in Neo4j. It has always felt pretty intuitive to me, though I don't know all of its limitations and strengths.

[+] semihsalihoglu|3 years ago|reply
I tried to talk about "what it should do" at a higher-level. As a comparison, if you want to read something that is similar in spirit but discusses "how GDBMSs should do" certain things, you can look at the link I have in footnote 4. This is what I call the CWI list of how GDBMSs can be architected to be competent. The CWI list's items are specific implementation and optimization "techniques", like columnar storage, vectorized processing etc.

I can happily say that this list comes from Peter Boncz, who was an early pioneer of analytical RDBMSs, and Peter and I have similar opinions on these core architectural decisions here. And Kùzu currently adopts (or we are implementing) 11 of 12 techniques there.

FYI, in my next posts, I will talk about "how a GDBMS should do" features 1 and 2 in my list, using specific query processing techniques we have implemented in Kùzu.

[+] contravariant|3 years ago|reply
What I still want is a neat way to solve data deduplication. Say you've got a whole lot of tuples saying x = y, then it would be nice to be able to reason about all the equivalence classes (returning the minimum id or w/e).

This is kind of the opposite of a join (where the join is the pullback and the operation described above is more like a pushout).

Symmetric + Transitive closure gets you some of the way there, but aggregation is messy and the paths between nodes get complicated if your tuples aren't ordered nicely (say you have x -> y <- z -> w <- q -> r <- etc.)
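Concretely, the "pushout" I have in mind is exactly what a union-find (disjoint-set) structure computes outside the database. A rough Python sketch (names are illustrative), returning the minimum id per class as suggested above:

```python
def find(parent, x):
    # Path-compressing find: walks to the root and flattens the chain.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def equivalence_classes(pairs):
    """Given (x, y) tuples meaning x = y, return {member: min id of its class}."""
    parent = {}
    for x, y in pairs:
        parent.setdefault(x, x)
        parent.setdefault(y, y)
        rx, ry = find(parent, x), find(parent, y)
        if rx != ry:
            parent[ry] = rx  # union the two classes
    roots = {x: find(parent, x) for x in parent}
    # Representative = minimum id in each class.
    min_of_root = {}
    for x, r in roots.items():
        min_of_root[r] = min(min_of_root.get(r, x), x)
    return {x: min_of_root[r] for x, r in roots.items()}
```

Note this handles the zig-zag case (x -> y <- z -> w <- q -> r) fine, since union-find ignores edge direction entirely.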

[+] semihsalihoglu|3 years ago|reply
Interesting. I think the required computation is different for the two cases in your post.

Case 1: If you have = relationships (let's call the relationship type eq), then what you need is to identify weak connectivity. If you want to find the equivalence class of a single node, say with primary key "foo", you can do this with a Cypher query using the Kleene star (note the undirected pattern, since = is symmetric):

MATCH (a)-[:eq*]-(b) WHERE a.pk = 'foo' RETURN DISTINCT b

This finds the "weakly connected component" of node "foo". If you want to find all equivalence classes, that's equivalent to finding all weakly connected components and you should probably call that algorithm from the algorithms library built over the GDBMS (e.g., https://tinyurl.com/mtx98c5s). You can do it in Cypher but it will be very slow.

Case 2: You have <= relationships, so the relationships are not symmetric, and you want to find strong connectivity of nodes. I won't type it here, but you could do this with a Cypher query for a single node; for all nodes, again you can call the strongly connected components algorithm from the algorithms library of the GDBMS.
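For reference, the strongly-connected-components computation can be sketched in plain Python with Kosaraju's two-pass algorithm (purely illustrative, outside any DBMS):

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm: two DFS passes, the second on the reversed graph."""
    graph, rev = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))

    # Pass 1: record nodes in order of DFS finish time (iterative DFS).
    seen, order = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)  # node is finished
                stack.pop()

    # Pass 2: DFS the reversed graph in decreasing finish-time order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        comp, stack = [], [start]
        assigned.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(comp)
    return components
```

Each returned component is one "strong" equivalence class; nodes 1, 2, 3 in a cycle form one class even though no single = tuple connects them directly.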

You can express some of these computations natively in Datalog, but if you implement them yourself in a general query language, the computation is likely to be quite slow. So it might still be better to call a specialized algorithm implemented in a library above the DBMS you are using.

Hope this helps.

[+] __afk__|3 years ago|reply
Great read, thank you. Also nice that you brought in RDF databases to help expand and refine the problem space. When I come across a graph database topic here at HN and it leads with an LPG approach, 99% of the time, there is hardly a mention of RDF. That leaves everyone a little worse off.
[+] semihsalihoglu|3 years ago|reply
I agree. As I say in the blog, there is no perfect data model. RDF is a great fit for certain things, e.g., storing dbpedia-like datasets and doing "automatic" advanced OWL reasoning over them by implementing OWL rules. For example, if a record t is of type X, and there is a rule that says every X is also a Y, then by OWL reasoning rules, t is also of type Y. So if a user scans all Y's, a system that does automatic OWL reasoning should return record t. RDFox is an interesting system that attempts this and has a niche space. Overall, the semantic web community has thought deeply about these types of problems and pioneered these types of computations.
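The subclass inference described here is just a transitive walk up the type hierarchy. A toy Python sketch (all names illustrative; a single-parent hierarchy is assumed for brevity):

```python
def infer_types(instance_of, subclass_of):
    """instance_of: {record: declared type}; subclass_of: {type: its supertype}.
    Returns {record: all types it belongs to after reasoning}."""
    result = {}
    for record, t in instance_of.items():
        types = {t}
        while t in subclass_of:  # walk up the hierarchy transitively
            t = subclass_of[t]
            types.add(t)
        result[record] = types
    return result

# A scan of all "Agent"s should return t, even though t was only
# declared a "Person".
types = infer_types({"t": "Person"}, {"Person": "Agent", "Agent": "Thing"})
```

A system doing this automatically evaluates such rules at query time (or materializes them), which is what makes scans return the inferred records.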

Property GDBMSs can do manual reasoning if you ask questions with the Kleene star, but to do it automatically they would have to implement OWL rules, an "OWL rule" data type, a URI data type, etc. There is no other way I'm aware of. I am convinced that with the right architecture this should be possible, and if we see interest in this, we might work on it in Kùzu. But we do currently plan to support URIs as first-class-citizen data types, so at least common queries over URI-heavy datasets and manual reasoning can be done efficiently, and that should cover most use cases for applications that want to model their data as triples of URIs.

[+] GenerocUsername|3 years ago|reply
I made the mistake of learning Neo4j with Cypher before learning SQL, and every interaction with SQL since feels like I'm using some outdated monster
[+] blep_|3 years ago|reply
If it makes you feel any better, I don't know Cypher and SQL still feels like an outdated monster.
[+] HyperSane|3 years ago|reply
Not just SQL but MongoDB and Elasticsearch also seem horrible by comparison. The only other query language that IMHO is as intuitive as Cypher is Splunk.
[+] CharlieDigital|3 years ago|reply
Every time I need to do a nested sub-select, a little part of me dies.
[+] xkcd99|3 years ago|reply
Just one question for the author: why does this system feel more like a relational system than a graph database system? You ask users to define a schema (which I think Neo4j doesn't), and also the concept of this rel table, which I found in your blog, is not present in any other graph DB.
[+] semihsalihoglu|3 years ago|reply
I think it's a mistake for any DBMS not to support a schema. I bet this hurts Neo4j a lot, and there is no question that it is a mistake. In fact, some GDBMSs, including Kùzu and TigerGraph, support a schema. I think Memgraph does too, though I might be wrong. A schema allows systems to do many core computations efficiently, most importantly scans of data, which are the backbone operation of DBMSs. In fact, because of this, every semi-structured graph model has historically been extended with a schema.

In practice, if you want DBMSs to be performant, you need to structure your data. It's one thing to optionally support a semi-structured model, which is for example great when building initial proof of concepts when you want to develop something quickly. It's another thing to not support putting a structure on the data, which you'll want when you finally take your application to production and will care about performance.

[+] maxdemarzi|3 years ago|reply
The lack of a Schema does hurt Neo4j performance. Properties are stored "willy nilly" on a linked list of bytes per node/relationship. No order, an "age" property can be: 45, 38.5, "adult", [18,19], false... and that makes a terrible mess when aggregating, sorting, filtering, searching, etc.
[+] revskill|3 years ago|reply
In the match clause, I don't see the foreign key. How do those tables get to know each other to join?
[+] semihsalihoglu|3 years ago|reply
When you insert the Transfer records in my example, you indicate that they are "edges/relationships" between "Account" node records. The system interprets the values in those records implicitly as foreign keys to the Account records. This is what I mean by "predefined joins" in my example. When you ingest your relationship records, you predefine a join to the system and the GDBMS uses those Transfer records to join node records.
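Conceptually (this is an illustration, not Kùzu's actual storage code), a predefined join means the system can resolve the relationship records' implicit foreign keys into adjacency lists at load time, so a MATCH becomes a pointer chase rather than a value-based join:

```python
# Node records keyed by primary key; relationship records reference
# those keys, which the system treats as implicit foreign keys.
accounts = {"a1": {"owner": "Alice"}, "a2": {"owner": "Bob"}}
transfers = [("a1", "a2", {"amount": 100}), ("a2", "a1", {"amount": 30})]

# "Predefining" the join: build an adjacency index once at load time.
outgoing = {}
for src, dst, props in transfers:
    outgoing.setdefault(src, []).append((dst, props))

# Roughly: MATCH (a:Account)-[t:Transfer]->(b:Account)
#          WHERE a.owner = 'Alice' RETURN a, t, b
results = [
    (accounts[src], props, accounts[dst])
    for src, edges in outgoing.items()
    if accounts[src]["owner"] == "Alice"
    for dst, props in edges
]
```

The join partners were fixed when the Transfer records were ingested, so no join predicate on key values needs to be evaluated at query time.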

Hope this helps.

[+] xkcd99|3 years ago|reply
In GDBMSs, there is the concept of "predefined joins", where you define a relationship edge that directly connects two nodes. I don't know how these guys do it, but at least in Neo4j that's the concept.

edit: just saw their blog, they ask you to define a "rel table" and that has to define all the joins and they load it from there

[+] canadiantim|3 years ago|reply
What are your thoughts on EdgeDB?
[+] semihsalihoglu|3 years ago|reply
I can't say anything very insightful as I don't know it closely. My shallow understanding is that it is a modern JSON/XML-like DBMS supporting a nested-object (tree-like) model and the path-expression-based query languages of such systems. If that is accurate, such models are graph-based as well, so the feature set in my blog would largely apply to them too.