top | item 39787979

alandu | 1 year ago

We have not come across any benchmark dataset that's actually worth evaluating on, because the questions are not representative of real-world enterprise problems. They don't reflect the degree of context needed to answer domain- or business-specific questions accurately.


HanClinto | 1 year ago

Can you give me an example of the sort of thing you're talking about? I've been using Defog's sql-eval a little bit, but I'd be interested in knowing more about its shortcomings when evaluating these systems.

https://github.com/defog-ai/sql-eval

alandu | 1 year ago

An example question in that eval set is "How many publications were published between 2019 and 2021?". That's something GPT can work out how to answer from the schema alone, without any extra context (assuming the schema has a publications column). An example question I'd get in my previous role at an ecommerce fraud detection company could be something like "what's the chargeback rate on the ATO segment". Neither "chargeback rate" nor "ATO segment" is defined in the database schema. Not only did they have different definitions depending on the context (e.g. which customer), the definitions also changed over time within the same context.
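To make the gap concrete, here's a minimal sketch in Python/sqlite3. The table and column names (`transactions`, `disputed`, `risk_label`) are entirely hypothetical, as is the metric definition; the point is that nothing in the schema itself spells out "chargeback rate" or "ATO segment" — that logic lives in business context a text-to-SQL model never sees.

```python
import sqlite3

# Hypothetical schema: note there is no column named "chargeback_rate"
# and no table named "ato_segment" anywhere.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    amount REAL,
    disputed INTEGER,   -- 1 if a chargeback was filed
    risk_label TEXT     -- e.g. 'ato', 'friendly_fraud', NULL
);
INSERT INTO transactions VALUES
    (1, 10, 50.0, 1, 'ato'),
    (2, 10, 20.0, 0, 'ato'),
    (3, 11, 75.0, 0, NULL),
    (4, 12, 30.0, 1, 'ato'),
    (5, 12, 15.0, 0, 'friendly_fraud');
""")

# One possible business definition (of many): chargeback rate on the ATO
# segment = disputed ATO transactions / all ATO transactions. Another
# customer might weight by amount, or scope "ATO" differently — and that
# definition can drift over time without the schema ever changing.
rate = conn.execute("""
    SELECT CAST(SUM(disputed) AS REAL) / COUNT(*)
    FROM transactions
    WHERE risk_label = 'ato'
""").fetchone()[0]
print(rate)
```

A benchmark built from questions answerable straight off the schema never exercises this kind of context dependence, which is the crux of the complaint above.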