semihsalihoglu | 2 years ago
These blogs have two goals:
(i) give an overview of what I learned as an outsider looking for technical depth; and (ii) discuss some avenues of work I ran into that looked important.
This first post is on "Retrieval Augmented Generation using structured data", i.e., private records stored in relational or graph DBMSs. The post is long and full of links to some of the important material I read (given my academic background, many of these are papers), but it should be an easy read, especially if you are an outsider intimidated by this fast-moving space.
tl;dr for this post:
- I provide an overview of RAG.
- Compared to pre-LLM work, the simplicity and effectiveness of developing a natural language interface over your database using LLMs is impressive.
- There is little work that studies LLMs' ability to generate Cypher or SPARQL. I also hope to see more work on nested, recursive, and union-of-join queries.
- Everyone is studying how to prompt LLMs so they generate correct DBMS queries. Here, I hope to see work studying the effects of data modeling (normalization, views, graph modeling) on the accuracy of LLM-generated queries.
Hope some find this interesting.
emmanueloga_ | 2 years ago
I wonder if you could comment on other areas of AI+Graphs (I think this is mostly Graph Neural Networks; not sure if there's anything else?).
For instance, I found PyG and Deep Graph Library, but the use cases are so jargon-heavy [1], [2] that I'm not sure what the real-world applications are, in layman's terms.
--
1: https://pytorch-geometric.readthedocs.io/en/latest/tutorial/...
2: https://docs.dgl.ai/tutorials/blitz/index.html
emmanueloga_ | 2 years ago
GNNs are probabilistic and can be trained to learn representations of graph-structured data and handle complex relationships, while classical graph algorithms are specialized for specific graph analysis tasks and operate on predefined rules/steps.
* Why is PyG called "Geometric" and not "Topologic"?
Properties like connectivity, neighborhoods, and even geodesic distances can all be considered topological features of a graph. These features remain unchanged under continuous deformations like stretching or bending, which is the defining characteristic of topological equivalence. In this sense, "PyTorch Topologic" might be a more accurate reflection of the library's focus on analyzing the intrinsic structure and connections within graphs.
However, the term "geometric" still has some merit in the context of PyG. While most GNN operations rely on topological principles, some do incorporate notions of Euclidean geometry, such as:
- Node embeddings: Many GNNs learn low-dimensional vectors for each node, which can be interpreted as points in a vector space, allowing geometric operations like distances and angles to be applied.
- Spectral GNNs: These models leverage the eigenvalues and eigenvectors of the graph Laplacian, which encodes information about the geometric structure and distances between nodes.
- Manifold learning: Certain types of graphs can be seen as low-dimensional representations of high-dimensional manifolds. Applying GNNs in this context involves learning geometric properties on the manifold itself.
Therefore, although topology plays a primary role in understanding and analyzing graphs, geometry can still be relevant in certain contexts and GNN operations.
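The topology-vs-geometry distinction above can be made concrete with a tiny sketch. Below is an illustrative, plain-Python version of one round of mean-aggregation message passing (the core GNN operation); the toy graph and feature values are invented for the example and this is not PyG's actual API. Note that the aggregation step consumes only the adjacency structure (topology), while the feature vectors being averaged live in a vector space where geometric notions like distance and angle apply.

```python
# Toy sketch: one round of mean-aggregation message passing on a 3-node
# star graph. Real GNN layers (e.g. PyG's GCNConv) add learned weight
# matrices and nonlinearities on top of this skeleton.

adjacency = {0: [1, 2], 1: [0], 2: [0]}                    # topology only
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}  # vectors (geometry)

def aggregate(node):
    """Mean of the neighbors' feature vectors (the 'message' step)."""
    nbrs = adjacency[node]
    return [sum(features[n][d] for n in nbrs) / len(nbrs)
            for d in range(len(features[node]))]

def propagate():
    """One round: each node averages its own features with the
    aggregated neighbor features (the 'update' step)."""
    return {v: [(features[v][d] + aggregate(v)[d]) / 2
                for d in range(len(features[v]))]
            for v in adjacency}

new_features = propagate()
# e.g. node 0 pulls in the average of nodes 1 and 2: [0.75, 0.5]
```

Stacking several such rounds is what lets a GNN mix information from multi-hop neighborhoods.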
* Real world applications:
- HuggingFace has a few models [0] around things like computational chemistry [1] or weather forecasting.
- PyGod [2] can be used for Outlier Detection (Anomaly Detection).
- Apparently ULTRA [3] can "infer" (in the knowledge graph sense), that Michael Jackson released some disco music :-p (see the paper).
- RGCN [4] can be used for knowledge graph link prediction (recovery of missing facts, i.e. subject-predicate-object triples) and entity classification (recovery of missing entity attributes).
- GreatX [5] tackles removing inherent noise, "Distribution Shift" and "Adversarial Attacks" (ex: noise purposely introduced to hide a node presence) from networks. Apparently this is a thing and the field is called "Graph Reliability" or "Reliable Deep Graph Learning". The author even has a bunch of "awesome" style lists of links! [6]
- Finally, this repo has a nice explanation of how/why to run machine learning algorithms "outside of the DB":
"Pytorch Geometric (PyG) has a whole arsenal of neural network layers and techniques to approach machine learning on graphs (aka graph representation learning, graph machine learning, deep graph learning) and has been used in this repo [7] to learn link patterns, also known as link or edge predictions."
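To make the link-prediction idea in that quote concrete: one common scoring scheme (a hedged sketch, not tied to that repo; the names and embedding values below are invented for illustration) takes the dot product of two node embeddings and squashes it through a sigmoid to get an edge probability. In practice the embeddings would come from a trained encoder such as a GCN.

```python
import math

# Hard-coded toy embeddings; a real pipeline would learn these.
embeddings = {
    "alice": [0.9, 0.1],
    "bob":   [0.8, 0.2],   # similar to alice -> high link score
    "carol": [-0.7, 0.9],  # dissimilar to alice -> low link score
}

def link_score(u, v):
    """Probability-like score for a candidate edge (u, v):
    dot product of embeddings, passed through a sigmoid."""
    dot = sum(a * b for a, b in zip(embeddings[u], embeddings[v]))
    return 1.0 / (1.0 + math.exp(-dot))

p_ab = link_score("alice", "bob")    # > 0.5: edge likely
p_ac = link_score("alice", "carol")  # < 0.5: edge unlikely
```

Training then amounts to nudging the embeddings so that observed edges score high and sampled non-edges score low.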
--
0: https://huggingface.co/models?pipeline_tag=graph-ml&sort=tre...
1: https://github.com/Microsoft/Graphormer
2: https://github.com/pygod-team/pygod
3: https://github.com/DeepGraphLearning/ULTRA
4: https://huggingface.co/riship-nv/RGCN
5: https://github.com/EdisonLeeeee/GreatX
6: https://edisonleeeee.github.io/projects.html
7: https://github.com/Orbifold/pyg-link-prediction