(no title)
tdoehmen | 2 years ago
1. Definitely training data (for me); we explored about 10 different directions before settling on the current approach. It's easy to underestimate the effect of training data on the quality of the model. The starting point was the benchmark dataset, though, which we assembled manually (to avoid data pollution, and also because there was simply no text2sql benchmark that covered anything other than plain old SQL SELECT statements with a handful of aggregate functions). Training is also not a one-off thing: with large datasets it is hard to judge the quality of a dataset without actually training a few epochs on it and running the benchmark (see the first sketch after point 3).
2. I shared my view on where such models are effective in an earlier comment: https://news.ycombinator.com/item?id=39133155
3. No way - I see a common stack emerging (take a look at companies like https://vanna.ai/, https://www.dataherald.com/, or https://www.waii.ai) that is mainly centered around foundation models like GPT-4 with strong in-context learning capabilities (that's kind of a must to make these approaches work, and it comes with long inference times and higher costs). These solutions wrap things around the model like embedding-based schema filtering, options for users to enrich metadata about tables and columns, and the inclusion of previous related queries in the context (see the second sketch below). I'd say it's a somewhat different problem from the one we aimed to solve.
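For point 1, here is a rough, hypothetical sketch of what "running the benchmark" could look like for a text2sql model evaluated by execution accuracy against DuckDB. The comment doesn't describe the actual harness; the file layout, the `model.generate_sql` interface, and the metric are assumptions for illustration only.

```python
# Hypothetical benchmark harness: compare result sets of predicted vs. gold SQL
# on the same DuckDB database (execution accuracy). Not the authors' actual code.
import json
import duckdb


def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted query yields the same rows as the gold query."""
    con = duckdb.connect(db_path, read_only=True)
    try:
        pred = con.execute(predicted_sql).fetchall()
        gold = con.execute(gold_sql).fetchall()
    except Exception:
        return False  # predicted query failed to parse or execute
    finally:
        con.close()
    # Order-insensitive comparison; stringify keys to avoid None-comparison issues.
    return sorted(pred, key=str) == sorted(gold, key=str)


def run_benchmark(model, benchmark_path: str) -> float:
    """Score a fine-tuned checkpoint on a JSONL file of benchmark examples."""
    examples = [json.loads(line) for line in open(benchmark_path)]
    correct = 0
    for ex in examples:
        predicted = model.generate_sql(ex["question"], ex["schema"])  # assumed interface
        if execution_match(ex["db_path"], predicted, ex["gold_sql"]):
            correct += 1
    return correct / len(examples)
```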
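For point 3, a minimal sketch of what "embedding-based schema filtering" generally means: embed each table's DDL, embed the user question, and keep only the most similar tables in the prompt. The encoder choice and top-k cutoff below are illustrative assumptions, not what any of the mentioned products actually use.

```python
# Illustrative schema filtering: rank tables by embedding similarity to the question
# and include only the top-k table definitions in the model's prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice


def filter_schema(question: str, table_ddls: dict, top_k: int = 5) -> list:
    """Return the DDL strings of the top_k tables most similar to the question."""
    names = list(table_ddls)
    table_vecs = encoder.encode([table_ddls[n] for n in names], normalize_embeddings=True)
    question_vec = encoder.encode([question], normalize_embeddings=True)[0]
    scores = table_vecs @ question_vec  # cosine similarity on normalized vectors
    ranked = np.argsort(scores)[::-1][:top_k]
    return [table_ddls[names[i]] for i in ranked]
```

The filtered DDL is then placed into the prompt, typically together with user-provided column descriptions and a few similar past question/SQL pairs.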
swimwiththebeat | 2 years ago
theboat | 2 years ago
If trained from scratch, it's quite impressive that the model is capable of understanding natural-language prompts (presumably English) from such a small, targeted training set.
zainhoda | 2 years ago