top | item 42434491


philipodonnell | 1 year ago

I’ve been doing a lot of work on semantic data architecture that better supports LLM analytics. Did you use any framework or methodology to decide how exactly to present the data/metadata in the LLM context to allow it to make decisions?


isoprophlex | 1 year ago

A pre-processing phase does a lot of the heavy lifting: we stuff the table and column comments, additional metadata, and some hand-tuned heuristics into a graph-like structure. Basically, we use an LLM itself to preprocess the schema metadata.
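The comment doesn't share code, so here is a minimal sketch of what stuffing table and column comments into a graph-like structure might look like. All names (`Node`, `build_schema_graph`, the input dict shape) are hypothetical, not the commenter's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str        # e.g. "table", "column", "example_query"
    name: str
    text: str = ""   # comment / description; later embedded for RAG search
    neighbors: list = field(default_factory=list)

def build_schema_graph(tables):
    """tables: {table_name: {"comment": str, "columns": {col_name: comment}}}.
    Returns a flat list of nodes; columns and their table are linked both
    ways so metadata is reachable within one hop of a searchable field."""
    nodes = []
    for tname, tinfo in tables.items():
        t = Node("table", tname, tinfo.get("comment", ""))
        nodes.append(t)
        for cname, ccomment in tinfo.get("columns", {}).items():
            c = Node("column", f"{tname}.{cname}", ccomment)
            c.neighbors.append(t)
            t.neighbors.append(c)
            nodes.append(c)
    return nodes
```

In the setup described, the `text` on each node would then be enriched by an LLM pass (better descriptions, example queries) before being embedded.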

Everything is very boring tech-wise: vanilla postgres/pgvector and a few hundred lines of python. Every RAG-searchable text field (mostly column descriptions and a list of LLM-generated example queries) is linked to nodes holding metadata, at most 2 hops out. The tool is available to 10,000 users, but load is only a few queries per minute at peak, so performance-wise it's fine.
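The retrieval side of that setup can be sketched as: rank the RAG-searchable fields by vector similarity, then walk out to linked metadata nodes within 2 hops. This is an assumed reading of the comment, not the actual code; in the real system the similarity search would be a pgvector query rather than in-process cosine:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / (denom or 1.0)

def retrieve_context(query_vec, docs, edges, k=3, max_hops=2):
    """docs: {id: (embedding, text)} for the searchable fields.
    edges: {id: [neighbor ids]} linking fields to metadata nodes.
    Returns the ids whose text/metadata would go into the LLM context."""
    # rank searchable fields by similarity, keep top k
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d][0]),
                    reverse=True)[:k]
    # breadth-first expansion: pull in metadata at most max_hops out
    context, frontier = set(ranked), set(ranked)
    for _ in range(max_hops):
        frontier = {n for d in frontier for n in edges.get(d, [])} - context
        context |= frontier
    return context
```

The 2-hop cap keeps the assembled context bounded: a matched column description drags in, say, its table node and that table's comment, but not the whole schema.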

philipodonnell | 1 year ago

Enhancing the comments on the existing data model seems to be the most common approach, for sure. I'm implementing this as a data architecture at several clients, and I've found that creating a whole new logical structure designed for the LLM is really effective. Not being bound by the original data model lets you solve several problems: the "n-hops" question, avoiding the need for comments, and the semantics of how data engineers define columns. Some more details here [1], but obviously you can implement this yourself by hand.

[1] (https://github.com/eloquentanalytics/pyeloquent/blob/main/RE...)