ocolegro | 1 year ago
For this we have a target dataset (the YC co directory) over which we have around 100 questions. We have found that when we feed an entire company listing in along with a single question, we can get an accurate single answer (the needle-in-a-haystack problem).
So to build our evaluation dataset, we feed each question, paired with each sample, into the cheapest LLM we can find that reliably handles the job. We then aggregate the results.
This is not perfect, but it gives us a way to benchmark our knowledge graph construction and querying strategy so that we can tune the system ourselves.
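The labeling loop described above can be sketched roughly as follows. The `ask_llm` function here is a hypothetical stand-in for whatever cheap LLM endpoint is actually used; it is stubbed so the sketch runs offline, and the field names are my own assumptions, not the commenter's schema.

```python
def ask_llm(document: str, question: str) -> str:
    # Hypothetical stand-in for a call to a cheap hosted LLM.
    # A real version would send the full listing plus one question
    # and return the single extracted answer.
    return f"stub answer to: {question}"

def build_eval_dataset(documents: dict, questions: list) -> list:
    """Feed each (document, question) pair to the LLM and collect answers."""
    dataset = []
    for doc_id, doc_text in documents.items():
        for q in questions:
            dataset.append({
                "doc": doc_id,
                "question": q,
                "answer": ask_llm(doc_text, q),
            })
    return dataset

# Toy example: one company listing, two of the ~100 questions.
docs = {"acme": "Acme Corp (YC W20). Builds rockets. 12 employees."}
questions = ["When was the company founded?", "What does it build?"]
eval_set = build_eval_dataset(docs, questions)
```

The aggregated `eval_set` then serves as the reference labels against which the knowledge-graph pipeline's answers are scored.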