top | item 39261486

Show HN: Natural-SQL-7B, a strong text-to-SQL model

362 points | thecalebf | 2 years ago | github.com | reply

Would love thoughts!

Here is the HF page: https://huggingface.co/chatdb/natural-sql-7b

168 comments

[+] rgbrgb|2 years ago|reply
So it looks like it scores 76.5% on SQL-Eval [0], a bit behind GPT-4 at 83% and sqlcoder-15b at 78%.

What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?

As a programmer who always has to look stuff up when I SQL, I could definitely see asking something like this for a first draft of a query but it seems like I'm slightly better off asking the bigger models in these one-off cases (and I can run a 15b easily on my 64GB m1). If I'm in a corporate setting I'm not going to leak my schema into OpenAI's training data and there are definitely times when I'd want to run queries offline. Small/local models are great when you want to do a ton of queries (save $$).

A mini data scientist that could be queried by non-technical folks would be awesome, but I wonder if there's a way to determine whether a query falls in the 25% "incorrect" case... maybe there's a RAID-like consensus scheme where you have multiple models interrogate each other's answers to get a higher overall success rate.
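A minimal sketch of that consensus idea, with stub functions standing in for real models (everything here, including `consensus_answer`, is hypothetical):

```python
from collections import Counter

def consensus_answer(question, models, run_query):
    # Ask each model for a query, execute it, and keep the result.
    # Queries that fail to run count as abstentions.
    results = []
    for model in models:
        try:
            sql = model(question)
            results.append(run_query(sql))
        except Exception:
            continue
    if not results:
        return None
    best, votes = Counter(results).most_common(1)[0]
    # Only trust an answer backed by a strict majority.
    return best if votes > len(results) / 2 else None

# Toy demo: three stub "models", two of which agree.
stubs = [lambda q: "SELECT 1", lambda q: "SELECT 1", lambda q: "SELECT 2"]
winner = consensus_answer("how many?", stubs, run_query=lambda sql: sql)
```

In practice `run_query` would return a hashable snapshot of the result set (e.g. a tuple of rows), so two queries with different SQL text but identical results still count as agreeing.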

Mostly thinking out loud :) but maybe y'all have more ideas. Congrats on the release, OP!

[0]: https://github.com/defog-ai/sql-eval

[+] fifilura|2 years ago|reply
1. If you are a programmer, I think you should learn SQL. It will give you a different perspective on programming that I think is invaluable (i.e. programming without for loops).

2. Combining and slicing data is a craft, and doing it subtly wrong in one step can lead to fatal errors in the outcome.

And most importantly, it can be very difficult to notice. Numbers don't smell.

That is why I would be very hesitant to give a slightly more than trivial task to an engine that fails 25% of the time.

But I guess that is the same as any other programming task. Just that other programming tasks require a lot of boilerplate where an AI can help; SQL is much more direct.

Maybe it could be useful to ask questions similar to writing test cases: "how can I verify that my query is doing the right thing?"

[+] beefield|2 years ago|reply
> What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?

I have written a bunch of more or less complicated SQL during my career. And I am pretty sure that if I need to write a SQL statement that's anything but select * from table, my output won't work 75% of the time.

I may be a special case, but typically if I work on a hard problem, it is not a single hard problem but a sh*tload of connected simple problems. If I can get someone to solve the simple problems correctly 75% of the time so that I can spend my time figuring out how those simple problems are to be connected, I'm more than happy. And that's exactly how I use ChatGPT. I have learned not to ask it overly complex questions. But the simple ones it mostly aces, and when it does not, the mistakes are easy to spot; it's not that I could not have solved them myself, I just did not want to spend the time on it. Now, if only ChatGPT were not almost as lazy as me about producing long, simple stuff, that would be awesome.

[+] internet101010|2 years ago|reply
> What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?

Yeah, this is the issue I have with all of the SQL generation stuff. Not only should the SQL be valid; a prompt like "generate a query that pulls sales for the last quarter" should also generate the same output for everyone, without fail. Vanna's business-logic embedding is a good first step, but even then it is only correct about 90% of the time with GPT-4.

Even then, it will only work if there are strong standards and data governance structures in place that everyone within an organization is aligned on. For example, "sales" can mean different things to different people and all of that needs to be buttoned up as well.

[+] wg0|2 years ago|reply
As an end consumer, you should be well versed in SQL (or whatever the subject is) yourself, and should proofread all the generated queries twice before you make a decision based on the data they pull.

That's the story of the LLMs in general.

The hype is free. The startup ecosystem is a bonus.

[+] Closi|2 years ago|reply
> What kind of applications would this be useful for? What can you build with an AI data science intern that's right 75% of the time?

The kind of stuff that it is very easy to validate if it works or not :)

I am building a warehouse management system at the moment, and it's great to quickly churn out lots of SQL views (particularly as the schema is changing/evolving slightly as I am writing it, so being able to go back to GPT4 to churn through the changes to some of the 'views' of my pages helps, even if it requires a little testing/validation).

[+] brian_reardon|2 years ago|reply
I would wonder how many of the 25% are actually dangerous to run, considering the target audience might be someone who is rusty at SQL.
[+] whimsicalism|2 years ago|reply
do you not trust the setting OAI provides to exclude your conversation from training data?
[+] dimask|2 years ago|reply
Imo stuff like this can have applications in a narrower set of problems, where you can also fine-tune or simply prompt-engineer a model to reach much higher accuracy. I do not think they can become the basis on which database work takes place in general, but they could be used to automate certain tasks that would be too hard/complicated to automate in a classical programmatic way, yet not so hard/complicated that real intelligence is needed to solve them.
[+] bagels|2 years ago|reply
I know sql pretty well, but every once in a while doing things like percentiles come up where I don't remember the exact incantations to get the window syntax right. Things like this can save some googling and documentation reading. I can read it to see if it makes sense, and execute it. I can create tests for the output to verify. Can save a lot of time. If it's wrong, I use the fallback of googling for examples.
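For what it's worth, the window-function incantation in question is short once you see it; a sketch using SQLite's NTILE to bucket rows into quartiles (the table and values are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")
# NTILE(4) assigns each row to one of four equal buckets in amount
# order, so bucket 4 is the top quartile.
rows = conn.execute("""
    SELECT amount, NTILE(4) OVER (ORDER BY amount) AS quartile
    FROM orders
    ORDER BY amount
""").fetchall()
```

(Window functions need SQLite 3.25 or later; other databases spell percentiles differently, e.g. PERCENTILE_CONT in Postgres.)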
[+] bongodongobob|2 years ago|reply
I can definitely appreciate security concerns but I don't see how providing your schema is of any concern whatsoever.
[+] fijiaarone|2 years ago|reply
The 25% of SQL that covers everything in 99% of applications is a subset of the 76.5% of SQL that the LLM got right.
[+] davidy123|2 years ago|reply
This kind of approach could be a way to make these models reliable. Encode your business restrictions in the database, if they are rigorous enough, if the query passes then it's valid in your business world.
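A small illustration of that guardrail idea: with business rules encoded as CHECK constraints, a hallucinated write that violates them is rejected outright (schema invented for the demo; note this guards writes, not wrong-but-valid SELECTs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales (
    id INTEGER PRIMARY KEY,
    amount REAL NOT NULL CHECK (amount >= 0),
    quarter TEXT NOT NULL CHECK (quarter IN ('Q1', 'Q2', 'Q3', 'Q4'))
)
""")
conn.execute("INSERT INTO sales (amount, quarter) VALUES (100.0, 'Q1')")

# A generated statement with nonsense values never makes it in.
try:
    conn.execute("INSERT INTO sales (amount, quarter) VALUES (-5.0, 'Q9')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```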
[+] lolpanda|2 years ago|reply
I don't think any of these text-to-SQL models are solving the right problems. The hard part is not syntax, or not knowing how to write a GROUP BY query. Most data scientists and engineers spend more time on understanding the meaning of the data. One cannot simply look at a 50-column table in Snowflake and guess what the columns are from their names. For example, we have 10 columns in one table, all named ...price. We have to go to the wiki to find the actual meaning, or read the DBT definitions. I cannot trust any queries these models produce because they don't understand the data; they only understand the query syntax.
[+] satvikpendem|2 years ago|reply
This is not open source, because it has use-based restrictions. Call it what it is: source available.
[+] thecalebf|2 years ago|reply
Updated the title since it may have been confusing, appreciate the feedback!
[+] CharlesW|2 years ago|reply
> Call it what it is, source available.

Also, it's only weights AFAICT — no source training data/code is available.

[+] itsoktocry|2 years ago|reply
>complex questions like above.

This is cool, and up my alley. But that's not a complex question, it's a basic analytics question. Most analysts will be able to write something like that in their sleep.

I've been using ChatGPT for writing SQL, and it's mediocre. But it'll get better, I'm sure.

[+] int_19h|2 years ago|reply
Even GPT-4 often writes queries with redundant subqueries, excessive nesting or joins etc. I don't think that's particularly valuable.

On the other hand, telling GPT to generate SQL to query a data store as part of solving some task that requires inference from facts captured in that data store works surprisingly well - better than "function calls" with JSON, in my opinion. While such generated queries are also suboptimal, they still capture the intent correctly, and GPT is surprisingly adept at using nested subqueries to get the answer it needs in a single query. And when such generated SQL is wrong, it usually fails to parse (e.g. due to typos in field names), at which point you can just feed the error message back to the model and have it correct that.
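That feed-the-error-back loop is easy to sketch; a hypothetical `generate_with_repair`, with stubs standing in for the model and the database:

```python
def generate_with_repair(model, question, execute, max_attempts=3):
    # Ask the model for SQL; if the database rejects it, feed the
    # error message back and let the model try again.
    prompt = question
    for _ in range(max_attempts):
        sql = model(prompt)
        try:
            return execute(sql)
        except Exception as err:
            prompt = f"{question}\nYour last query failed with: {err}\nPlease fix it."
    return None

# Toy demo: the stub "model" emits a broken query first, then a valid one.
attempts = []
def stub_model(prompt):
    attempts.append(prompt)
    return "SELCT 1" if len(attempts) == 1 else "SELECT 1"

def stub_execute(sql):
    if sql != "SELECT 1":
        raise ValueError(f"syntax error near {sql!r}")
    return [(1,)]

result = generate_with_repair(stub_model, "return the number one", stub_execute)
```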

[+] buzzm|2 years ago|reply
Like many uses of AI, it's very good as a "seed", especially if it comes up with nuggets like grouping on a range instead of a single value. But as with almost any database, the devil is in the details. Different products have different interpretations of "quantity" (e.g. box vs. unit), coupons and discounts are modeled in weird ways, weights are assumed to be pounds or kg and are mixed without assigning units, etc.
[+] owlstuffing|2 years ago|reply
Good point. However, an AI may also train on the database metadata, including DDL comments, locale, etc., and perform data sampling to glean the nuances that are "understood".
[+] jimmytucson|2 years ago|reply
In the data world I've worked with tons of folks whose responsibilities include getting questions from execs, knowing their way around the data warehouse enough to write SQL to answer those questions, and delivering the answers back (sometimes formatted nicely). Sometimes they have to predict what followup questions the exec will ask, like "why is that number so low, it obviously shouldn't be that low" so they can press the data engineers for bugs.

Like all things LLM, I don't know if this is about to make those responsibilities a lot easier, or just eliminate them altogether.

[+] fwip|2 years ago|reply
My money is on "overpaid execs will use this to get wrong information, and get mad at their subordinates for correcting them."
[+] zurfer|2 years ago|reply
really cool. The license is not really standard, but it seems open source. The actual model can be found here: https://huggingface.co/cfahlgren1/NaturalSQL-6.7B-v0

This seems like a great base model, although I wonder if text-to-SQL is a good use case for small models. We are also building a tool in this space, and I regularly wish GPT-4 were even more knowledgeable when answering. Even GPT-3.5 is not good enough for production.

[+] zeroq|2 years ago|reply
For anyone who says it's useless because it's only 75% correct, please consider these two points:

(1) this is the first instalment, and it's already close to being a thousand times more useful for product owners and analysts than any Airtable you can imagine.

(2) as much as I love being on point for every challenge, we've been living in a "good enough" economy for quite some time, and if this is close enough, that will be good enough for business.

[+] Tycho|2 years ago|reply
At university I studied 'natural language interfaces to databases' (NLIDBs). I recall that very early on, I think in the '60s, NASA and other organisations were trying to build things like this for their scientists. Of course, there was no SQL in those days. Once SQL rose to prominence, my impression is that NLIDB interest faded, especially with the lack of NLP breakthroughs. My project was to create an analog to SQL in more plain, naturalistic English (a bit like what AppleScript is to other programming languages), and then let the user construct the queries using a GUI rather than free-typing. The GUI would constrain the query text to permissible syntax, letting the user click to add new clauses and select columns, etc., while maintaining a 'layman-intelligible' sentence. Anyway, maybe now we can get NLIDBs with true NLP.
[+] moltar|2 years ago|reply
The card states:

> This model was evaluated on SQL-Eval, a PostgreSQL-based evaluation framework developed by Defog for testing and alignment of model capabilities.

But this only explains the testing part.

However, does it mean that only the PostgreSQL-flavour of SQL is supported?

Would it work for Trino flavour?

[+] aussieguy1234|2 years ago|reply
I've been using GPT-4 to write SQL to get a bunch of insights. It's very good at writing queries for all kinds of metrics. These queries answer questions that would previously have required an actual data squad.
[+] xfalcox|2 years ago|reply
Context is 4096? My app's DB DDL is 19,877 tokens long (using the Llama 2 tokenizer), so that means we need RAG to handle injecting the DDL into the prompt.

A model like this with a 32k long seq_len, like Mixtral, would be a killer for me.
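A rough sketch of what that RAG step might look like: rank tables by keyword overlap with the question and greedily pack DDL snippets until the context budget runs out (the names, the toy tokenizer, and `select_tables` are all invented):

```python
def select_tables(question, tables, budget_tokens, count_tokens):
    # Rank tables by word overlap with the question, then greedily
    # pack their DDL into the available context window.
    words = set(question.lower().split())
    scored = sorted(
        tables.items(),
        key=lambda kv: -len(words & set(kv[1].lower().split())),
    )
    picked, used = [], 0
    for name, ddl in scored:
        cost = count_tokens(ddl)
        if used + cost > budget_tokens:
            break
        picked.append(name)
        used += cost
    return picked

def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

tables = {
    "orders": "CREATE TABLE orders (id INTEGER, revenue REAL)",
    "users": "CREATE TABLE users (id INTEGER, email TEXT)",
}
picked = select_tables("total revenue by orders", tables,
                       budget_tokens=10, count_tokens=count_tokens)
```

A real version would use embeddings rather than word overlap, and the actual model tokenizer for the budget, but the packing logic is the same.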

[+] thecalebf|2 years ago|reply
Great call out. Will definitely focus on that in the next iteration!
[+] cuuupid|2 years ago|reply
This is a big improvement, but I'm not a believer that SQL is the most appropriate query language here. Personally I'm more bullish on language models being trained with ORMs, as those normally capture much more information about the fields.

e.g. Passing in some of my more complex table schemas related to flight data and asking about overflights, the model struggles to resolve information related to aviation. However, GitHub Copilot writes me a perfect call to Prisma from the same single-line instruction plus information spanning the rest of my codebase.

[+] tillvz|2 years ago|reply
Agree that an approach that more semantically models the data is better, especially when you want to eventually let the non-technical users ask questions.

When you're on a higher abstraction level, it also allows you to make clear definitions (e.g. for certain KPIs) and define business logic that always needs to be applied to get the correct results.

There you don't want to leave it up to chance that a filter gets hallucinated in or out when you ask e.g. about your company's revenue.

At Veezoo (https://www.veezoo.com) we have taken the approach of not going directly to SQL. When a user asks a question, Veezoo first translates it into a query against the Knowledge Graph (which represents the business objects, their relationships, etc.). From there we compile it into a SQL query for the target database (they all have slight differences) without any AI involvement. In this compilation step we also make sure the business logic is properly applied.

[+] thecalebf|2 years ago|reply
Neat, would you ever use a local model for that if it could work with ORMs?
[+] K0IN|2 years ago|reply
One problem I always see in such apps is that the AI can't see into the database or all of its entries, so queries that don't state data EXACTLY as it appears in the database run into issues.

example: give me the revenue for all logistics firms

but in the database these might not be called "logistics" and may be called "transport" (or anything)

maybe there are some counters to this, like finding the unique values per column, or even better a grammar-based approach, which will select only valid entries.

but plain text-to-SQL is, at this point, not the "hard thing to solve"
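The "find unique values per column" counter is cheap to implement; a sketch (`candidate_values` and the toy schema are invented) that surfaces the literals actually present, so the prompt can say "transport" rather than the user's "logistics":

```python
import sqlite3

def candidate_values(conn, table, column, limit=20):
    # Surface the distinct values actually stored in a column so they
    # can be included in the model's prompt as the valid vocabulary.
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT ?', (limit,)
    ).fetchall()
    return [r[0] for r in rows]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE firms (name TEXT, sector TEXT);
INSERT INTO firms VALUES ('Acme', 'transport'), ('Globex', 'retail');
""")
sectors = candidate_values(conn, "firms", "sector")
```

For high-cardinality columns you would cap the list or fall back to fuzzy matching, but for category-like columns this alone fixes the "logistics" vs. "transport" mismatch.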

[+] bm-rf|2 years ago|reply
Usually you include the database schema in the context, typically by showing the CREATE statements for the tables you want to query. I've also found that including comments in the CREATE SQL can guide the model somewhat. The best approach is probably to fine-tune one of these models using curated questions for your database.
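With SQLite, for example, pulling those CREATE statements into a context string (comments included, since SQLite stores the original statement text verbatim in its schema table) might look like this; `schema_context` is a hypothetical helper:

```python
import sqlite3

def schema_context(conn):
    # Collect the CREATE statements for every user table so they can
    # be pasted into the model's context as-is, comments and all.
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    return "\n\n".join(r[0] for r in rows)

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    total REAL  -- gross amount in USD, discounts already applied
)
""")
context = schema_context(conn)
```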
[+] moltar|2 years ago|reply
I have a use case where there's no DDL, just a description of the tables, with provided data types, and descriptions of all of the columns, in JSON.

I could generate DDL statements, of course. But I'm wondering whether that's the best way to hint the model about the database structure.

Also, how would you go about supplying the very verbose descriptions of all of the data types? Would SQL comments be best? Postgres-style column comments?

Thanks!
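One workable answer, hedged: most of these models saw DDL during training, so rendering the JSON descriptions as a CREATE statement with the column descriptions as inline SQL comments is a reasonable first attempt. A sketch with an invented `json_to_ddl` and spec:

```python
import json

def json_to_ddl(table):
    # Turn a JSON table description into a CREATE statement with each
    # column's description inlined as a SQL comment.
    lines = []
    n = len(table["columns"])
    for i, c in enumerate(table["columns"]):
        comma = "," if i < n - 1 else ""
        lines.append(f'    {c["name"]} {c["type"]}{comma}  -- {c["description"]}')
    body = "\n".join(lines)
    return f'CREATE TABLE {table["name"]} (\n{body}\n);'

spec = json.loads("""
{"name": "orders",
 "columns": [
   {"name": "id", "type": "INTEGER", "description": "primary key"},
   {"name": "total", "type": "REAL", "description": "gross amount in USD"}
 ]}
""")
ddl = json_to_ddl(spec)
```

For very verbose descriptions, a block comment above each column (or Postgres-style COMMENT ON statements appended after the CREATE) keeps the DDL readable; which form the model follows best is something to measure, not assume.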

[+] delichon|2 years ago|reply
So you feed it the whole DDL with the prompt. I wonder how it performs on a task like schema normalization or optimization? Say, also include the slow queries log and ask it for the SQL commands to modify the indexes to speed up those queries. Allow it to normalize/denormalize/materialize as options. Give it a way to run the resulting DDL against a test suite and iterate toward some fitness goal.

This would save gobs of compute.

[+] thecalebf|2 years ago|reply
That is really interesting. I took note of this. That would be really cool!
[+] bottlepalm|2 years ago|reply
Is this the data that was used for fine tuning?

https://github.com/defog-ai/sql-eval/blob/main/data/question...