item 39000242

semihsalihoglu | 2 years ago

This is a post that summarizes some reading I had done in the space of LLMs + knowledge graphs, with the goal of identifying technically deep and interesting directions. The post covers retrieval-augmented generation (RAG) systems that use unstructured data (RAG-U) and the role folks envision knowledge graphs playing in them. Briefly, the design spectrum of RAG-U systems has two dimensions: 1) What additional data to put into LLM prompts: such as documents, or triples extracted from documents. 2) How to store and fetch that data: such as with a vector index, a GDBMS, or both.

The standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. An envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea, but one that needs to be subjected to rigorous evaluation, as done in prominent IR publications, e.g., at SIGIR.
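To make the entity-linking idea concrete, here is a minimal sketch: do the standard vector-similarity retrieval, then expand the result set with chunks that mention the same entities as the retrieved ones. The toy corpus, the two-dimensional embeddings, and the function names are all invented for illustration; a real system would use an embedding model, a vector index, and an entity linker.

```python
from math import sqrt

# Toy corpus: each chunk carries its text, a toy 2-d embedding, and the
# entities it mentions (in practice produced by an entity linker).
chunks = [
    {"id": 0, "text": "Marie Curie won the Nobel Prize in Physics.",
     "vec": [1.0, 0.0], "entities": {"Marie Curie", "Nobel Prize"}},
    {"id": 1, "text": "Curie later also won the Nobel Prize in Chemistry.",
     "vec": [0.9, 0.1], "entities": {"Marie Curie", "Nobel Prize"}},
    {"id": 2, "text": "The Eiffel Tower is in Paris.",
     "vec": [0.0, 1.0], "entities": {"Eiffel Tower", "Paris"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec, k=1):
    """Standard RAG-U step: top-k chunks by vector similarity."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

def expand_via_entities(seed_chunks):
    """KG step: add chunks that share an entity with a retrieved chunk."""
    seed_ids = {c["id"] for c in seed_chunks}
    seed_entities = set().union(*(c["entities"] for c in seed_chunks))
    extra = [c for c in chunks
             if c["id"] not in seed_ids and c["entities"] & seed_entities]
    return seed_chunks + extra

seeds = retrieve([1.0, 0.0], k=1)     # vector search alone finds chunk 0
context = expand_via_entities(seeds)  # the shared entities pull in chunk 1
print([c["id"] for c in context])     # → [0, 1]
```

The point of the sketch is that chunk 1 would be missed by pure top-1 vector search but is reachable through the "Marie Curie" entity link.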

The post then discusses the scenario where an enterprise does not have a knowledge graph, and the ideal of automatically extracting one from unstructured PDFs and text documents. It covers recent work that uses LLMs for this task (they're not yet competitive with specialized models) and highlights many interesting open questions.
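To pin down what "extracting a knowledge graph" means here, the target is typically a set of (subject, predicate, object) triples. A toy rule-based extractor makes the output format concrete; the regex, the predicate list, and the example sentences are all invented for illustration, and stand in for the LLM-based or specialized extraction models the post actually discusses.

```python
import re

# Crude stand-in for an LLM-based extractor: pull (subject, predicate,
# object) triples from sentences matching a tiny fixed predicate list.
PATTERN = re.compile(
    r"^(?P<subj>[A-Z][\w ]*?) (?P<pred>is a|works at|founded) (?P<obj>[\w ]+)\.$"
)

def extract_triples(text):
    """Return (subject, predicate, object) triples, one sentence per line."""
    triples = []
    for sentence in text.split("\n"):
        m = PATTERN.match(sentence.strip())
        if m:
            triples.append((m["subj"], m["pred"], m["obj"]))
    return triples

doc = "Ada Lovelace is a mathematician.\nAda Lovelace works at the Analytical Engine project."
for triple in extract_triples(doc):
    print(triple)
```

A real extractor has to handle coreference, entity disambiguation, and an open predicate vocabulary, which is exactly where the open questions the post highlights come in.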

Hope this is interesting to people who are curious about the area but intimidated by the flood of activity (don't be; I think the area is easier to digest than it may look).

kordlessagain|2 years ago

Knowledge graphs improve vector search by providing a "back of the book" index for the content. This can be done using knowledge extraction with an LLM during indexing, such as pulling out the keyterms of a given chunk before embedding it, or asking a question of the content and then answering it using the keyterms in addition to the embeddings. One challenge I found with this is determining which keyterms to use with prompts that have little context. Using a time window helps with this, as does hitting the vector store for related content and then finding the keyterms for THAT content to use with the current query.
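The index-time keyterm idea can be sketched as follows: extract keyterms per chunk when indexing, then score each chunk by blending embedding similarity with keyterm overlap against the query. The frequency-based `keyterms` function is a cheap stand-in for asking an LLM, and the stopword list, toy embeddings, and `alpha` weight are all invented for illustration.

```python
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "how", "do"}

def keyterms(text, k=3):
    """Index-time stand-in for an LLM keyterm prompt: the k most frequent
    non-stopword tokens of the chunk."""
    words = [w for w in text.lower().replace(".", "").split() if w not in STOPWORDS]
    return {w for w, _ in Counter(words).most_common(k)}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Each indexed chunk stores a toy embedding plus its extracted keyterms.
index = [
    {"text": "graph databases store nodes and edges", "vec": [1.0, 0.0]},
    {"text": "vector indexes store embeddings of chunks", "vec": [0.0, 1.0]},
]
for chunk in index:
    chunk["terms"] = keyterms(chunk["text"])

def score(query_text, query_vec, chunk, alpha=0.5):
    """Blend embedding similarity with the 'back of the book' keyterm overlap."""
    overlap = len(keyterms(query_text) & chunk["terms"]) / max(len(chunk["terms"]), 1)
    return alpha * cosine(query_vec, chunk["vec"]) + (1 - alpha) * overlap

best = max(index, key=lambda c: score("how do graph databases work", [0.9, 0.1], c))
print(best["text"])
```

The keyterm overlap acts as the exact-match signal that pure embedding similarity can blur over, which is the "back of the book" role described above.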

sroussey|2 years ago

What open source model is good at pulling keyterms?

daxfohl|2 years ago

Having just started from zero, I agree on the easy to digest point. You can get a pretty good understanding of how most things work in a couple days, and the field is moving so fast that a lot of papers are just exploring different iterative improvements on basic concepts.

mark_l_watson|2 years ago

I really liked the idea of creating linked data to connect chunks. That is an idea that deserves some play time (I just added it to my TODO list). Thanks for the good ideas!