top | item 39131871

An open source DuckDB text to SQL LLM

131 points| vgt | 2 years ago |motherduck.com

30 comments

I see so many business leaders touting the promise of LLMs allowing business to "talk" to their data. The promise does sound enticing, but it's actually kind of hard to get working in practice.

A lot of our databases at work have columns with custom types and enums, and getting the LLM (Llama2) to write SQL queries to robustly answer natural language questions about the data is tough. It requires a lot of instruction prompting, context, and question-SQL examples (few-shot learning), and it still fails in unexpected ways. It's a tough ask for people to use a tool like this if they can't trust the results all the time. It's also a bit infeasible to scale this to tens or hundreds of tables across our data warehouse.

It's great that a lot of people are trying to crack this problem, I'm curious to try this model out. I'd also love to see if other people have tried solving this problem and made any headway.

tdoehmen|2 years ago

I've been working on the DuckDB-NSQL model on MotherDuck side. I fully agree that general text-2-sql, in the sense of "give me a question, and I'll produce you an arbitrary complex query", is a very tough problem and I actually believe that that's not the right problem to solve. Not necessarily because models are not (going to be) cabable enough, but rather because it's way too hard for humans to express in a single prompt what they actually want. Furthermore, it's simply not the right UX for most SQL users. A much better approach IMO is to keep the SQL analysts in the driver seat, and provide nuanced support wherever/whenever they need it most. The FixIt feature we recently launched goes into the same direction: https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...

In that sense I emphasized in our Blogpost that users should think of it as a documentation oracle that always gives you the exact DuckDB SQL query snippet you are looking for, which is a tremendoues time-saver if you have an abstrat idea of the query you want to write, but you're just not sure about the syntax, expecially with DuckDB having so many functions and SQL extensions.

Here are a few exammples:

- create tmp table from test.csv

- load aws credentials from 'test' profile

- get max of all columns in rideshare table

- show query plan with runtimes for 'SELECT * FROM rideshare'

- cast hvfhs_license_num column to int

- get all columns ending with _amount from taxi table

- show summary statistics of rideshare table

- get a 10% reservoir sample of rideshare table

- get length of drivers array in taxi table

- get violation_type field from other_violations json column in taxi table

- get passenger count, trip distance and fare amount from taxi table and oder by all of them

- list all tables in current database

- get all databases starting with test_

[edit: fixed list formatting]

vgt|2 years ago

I agree. I think full text-to-results (even bypassing text-to-SQL) is akin to L5 self-driving cars. It's easy to get to some reasonable level, say 90%. But to get to the point, where folks can fully trust the system and don't need a steering wheel (or know SQL) may take decades.

We at MotherDuck took an incremental approach. We launched something more akin to lane-assist. We're calling it FixIt - an in-flow visual aid to help you identify and fix errors in your SQL [0].

I think there's gobs of opportunities to improve the analytics experiences without resorting to "L5 self-driving" (e.g. full text-to-results)

[0] https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...

qsort|2 years ago

I've worked on a similar problem and we have pretty much the same issues as you. An idea that makes things better is having an intermediate representation that exposes the key tables of your dwh and gets compiled to SQL. This allows you to have somewhat better security (even if the model outputs garbage it can't do too much damage because nothing except exactly what you want is even representable), and somewhat better explainability (you can't guarantee the model will get it right, but you can see what SQL gets executed rather than "magic 8-ball says no").

But as you say custom types and encoded domain knowledge is extremely tough and as a result it's very tough to "transfer" the system to different databases.

cognaitiv|2 years ago

Lots of folks taking this approach and feels like the wrong entry point, e.g., similar to asking LLMs to generate bytecode when compilers exist or 3d printing a machine vs. building a machine from 3d printed components.

1. Business users aren’t prepared to talk to their data in meaningful ways and this is an opportunity for LLMs to enhance the way users ask questions.

2. SQL modeling languages exist (although I’m not sure there are well maintained open source good ones and this is the biggest obstacle to what I’m working on now) and LLMs can extend these effectively by adding components such as dimensions, metrics, relationships, filters, etc. with less chance of hallucination

3. Deterministic SQL generation from a modeling repository is easier to troubleshoot and provide guarantees than end-to-end generation.

4. Existing SQL can be parsed to generate modeling components that can be committed to the model repository with LLM assistance

5. Much of the richness of going to data to answer questions is context, e.g., how does this dept compare to others, this month to same month last year, etc. Modeling languages are good at expressing these transformations, but business users and often analysts aren’t good at capturing all the permutations of these types of comparisons. Another area where LLMs can help apply tooling.

IMO, LLMs are more effective at using tools than generating complex outputs.

PheonixPharts|2 years ago

> but it's actually kind of hard to get working in practice

One of the biggest challenges I've personally seen in this space is business "leaders" pushing teams to ship products asap lest they loose face among their fellow CEOs for not pushing out "AI" products before everyone else.

I'm fairly optimistic about LLMs being able to truly be transformative, but it's not going to be through forcing the bread-dead UX of hoisting yet another slightly re-imagined chat interface on users.

The idea of "talking to your data" is a promising one, and anyone who has worked for a large data driven org will quickly agree that organizing and searching in-house data is not a solved problem from the UX end of things. But to truly solving these problems, even/especially with LLMs, is going to require thought and experimentation. Something few "business leaders" have patience for.

pacbard|2 years ago

My biggest concern is that this will allow people to type a question and get a number back from the database, without being able to tell if the query is right or if the LLM just made up something.

It can work to support business analysts to crank out more reporters, but I wouldn’t roll it out to all my staff.

ipsum2|2 years ago

Llama2 wasn't trained on code, so it's no surprise that you're having trouble generating SQL.

DebtDeflation|2 years ago

I assume that "Talk to your data" presupposes the existence of some simple view that has been created and that embeds 99% of the required business logic.

"What was the average order size per customer for product XYZ in the West region?"

Imagine turning that one loose against the typical legacy system.

I wouldn't trust an LLM to figure out the joins or aggregate calculation, LET ALONE the definition of a customer, a product, or a region.

But it would almost certainly generate AN answer.

zainhoda|2 years ago

I’m trying to solve for this with my project using RAG and (at least based on what people say in Discord), it’s working really well for them: https://github.com/vanna-ai/vanna

vgt|2 years ago

Co-founder and Head of Produck at MotherDuck here, happy to answer any questions or go nag the amazing engineers [0] who worked on this :)

[0]https://news.ycombinator.com/user?id=tdoehmen

swimwiththebeat|2 years ago

1. First of all, thanks for outlining how you trained the model here in the repo: https://github.com/NumbersStationAI/DuckDB-NSQL?tab=readme-o...! I did not know about `sqlglot`, that's a pretty cool lib. Which part of the project was the most challenging or time-consuming: generating the training data, the actual training, or testing? How did you iterate, improve, and test the model?

2. How would you suggest using this model effectively if we have custom data in our DBs? For example, we might have a column called `purpose` that's a custom defined enum (i.e. not a very well-known concept outside of our business). Currently, we've fed it in as context by defining all the possible values it can have. Do you have any other recs on how to tune our prompts so that this model is just as effective with our own custom data?

3. Similar to above, do you know you can use the same model to work effectively on tens or even hundreds of tables? I've used multiple question-SQL example pairs as context, but I've found that I need 15-20 for it to be effective for even one table, let alone tens of tables.

jakozaur|2 years ago

Congratulations! The FixIt seems a pragmatic feature.

Any other ideas about where you plan to utilize LLM? E.g.

- migrate from different SQL dialects (e.g. date functions)

- make my queries nicer (e.g., rewrite them in shorter form or use better variable names)

- warn about potential bugs (e.g. SQL fanout)

whalesalad|2 years ago

Produck. Amazing.

datadrivenangel|2 years ago

The core issue of text to SQL is that your data has to be good for the generated queries to be correct. The queries may run and return good looking results, but if the data requires domain knowledge ("Don't count people in the customer table without filtering out records with the test flag in the customer attributes table and at least one order in the orders table") you'll get results that don't actually answer your question.

b_mc2|2 years ago

This is awesome, congratulations. I'm glad to see some text-to-sql models being created. Shameless plug: I also just realized you used NSText2SQL[1] which itself contains my text-to-sql dataset, sql-create-context[2], so I'm honored. I used sqlglot pretty heavily on it as well.

Do you think a 3B model might also be in the future, or something small enough that can be loaded up in Transformers.js?

[1] https://huggingface.co/datasets/NumbersStation/NSText2SQL

[2] https://huggingface.co/datasets/b-mc2/sql-create-context

CastFX|2 years ago

I'd love to see how it performs in some benchmarks, specifically against Spider (https://yale-lily.github.io/spider) and BIRD (https://bird-bench.github.io/)

aldarisbm|2 years ago

looks great, most text-to-sql attempts i’ve tried fall short, hoping this is different

terhechte|2 years ago

I do this a lot with GPT4 and I can't recall it ever generating an invalid query, and I did have some complicated ones.