ocolegro | 1 year ago
For this we have a target dataset (the YC co directory) over which we have around 100 questions. We have found that when we feed an entire company listing in along with a single question, we can get an accurate single answer (the needle-in-a-haystack problem).
So to build our evaluation dataset, we feed each question, paired with each sample, into the cheapest LLM we can find that reliably handles the job. We then aggregate the results.
This is not perfect, but it gives us a way to benchmark our knowledge graph construction and querying strategy so that we can tune the system ourselves.
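The labeling loop described above can be sketched roughly as follows. The `ask_llm` function here is a hypothetical stand-in for whatever cheap LLM endpoint is actually used; it is stubbed so the sketch runs offline, and the field names are my own assumptions, not the commenter's schema.

```python
def ask_llm(document: str, question: str) -> str:
    # Hypothetical stand-in for a call to a cheap hosted LLM.
    # A real version would send the full listing plus one question
    # and return the single extracted answer.
    return f"stub answer to: {question}"

def build_eval_dataset(documents: dict, questions: list) -> list:
    """Feed each (document, question) pair to the LLM and collect answers."""
    dataset = []
    for doc_id, doc_text in documents.items():
        for q in questions:
            dataset.append({
                "doc": doc_id,
                "question": q,
                "answer": ask_llm(doc_text, q),
            })
    return dataset

# Toy example: one company listing, two of the ~100 questions.
docs = {"acme": "Acme Corp (YC W20). Builds rockets. 12 employees."}
questions = ["When was the company founded?", "What does it build?"]
eval_set = build_eval_dataset(docs, questions)
```

The aggregated `eval_set` then serves as the reference labels against which the knowledge-graph pipeline's answers are scored.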