top | item 37240363

Show HN: Dataherald AI – Natural Language to SQL Engine

215 points| aazo11 | 2 years ago |github.com | reply

Hi HN community. We are excited to open source Dataherald’s natural-language-to-SQL engine today (https://github.com/Dataherald/dataherald). This engine allows you to set up an API from your structured database that can answer questions in plain English.

GPT-4 class LLMs have gotten remarkably good at writing SQL. However, out-of-the-box LLMs and existing frameworks would not work with our own structured data at a necessary quality level. For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the question asker in a follow up. However, an LLM translates this to:

select price from rent_prices where city=”Los Angeles” AND month=”05” AND year=”2023”

This pulls data for Los Angeles, CA and Los Angeles, TX without getting columns to differentiate between the two. You can read more about the challenges of enterprise-level text-to-SQL in this blog post I wrote on the topic: https://medium.com/dataherald/why-enterprise-natural-languag...

Dataherald comes with “batteries-included.” It has best-in-class implementations of core components, including, but not limited to: a state of the art NL-to-SQL agent, an LLM-based SQL-accuracy evaluator. The architecture is modular, allowing these components to be easily replaced. It’s easy to set up and use with major data warehouses.

There is a “Context Store” where information (NL2SQL examples, schemas and table descriptions) is used for the LLM prompts to make the engine get better with usage. And we even made it fast!

This version allows you to easily connect to PG, Databricks, BigQuery or Snowflake and set up an API for semantic interactions with your structured data. You can then add business and data context that are used for few-shot prompting by the engine.

The NL-to-SQL agent in this open source release was developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm is currently top of the Spider (https://yale-lily.github.io/spider) and Bird (https://bird-bench.github.io/) NL 2 SQL benchmarks. This agent has outperformed the Langchain SQLAgent anywhere from 12%-250%.5x (depending on the provided context) in our own internal benchmarking while being only ~15s slower on average.

Needless to say, this is an early release and the codebase is under swift development. We would love for you to try it out and give us your feedback! And if you are interested in contributing, we’d love to hear from you!

103 comments

[+] mcqueenjordan|2 years ago|reply

I want to preface this comment so that I don't detract from the idea. This is a really cool product, and obviously has value.

There's just one thing I worry about. It's losing expertise in your data model and gaining organizational false confidence in bad data. Let's consider Bob. Bob is a Product Manager.

Bob always used to bother his software engineers to write SQL queries, but now he just uses this tool. Bob didn't write the tables or the data structures, so Bob doesn't know the nuances of the data model. Bob just types English and gets result sets back. Bob doesn't know that field order_status can also be in "pending_legal", and neither does the "sql compiler" know when it's appropriate to add or elide that field. Bob then presents his data to leadership to make changes to the Pending Order Logic, based on bad data.

[+] kizum|2 years ago|reply

Very nice! I added a similar feature to https://sitespeak.ai recently that allows my users to connect their virtual assistant to a Sqlite database to retrieve data to answer visitor questions. Initially this was just a way around the difficulty of "chatting to a CSV" so the solution was to convert any CSV to a Sqlite database and using that, but it's proved very useful.

Will definitely look at using Dataherald for adding proper SQL support!

[+] aazo11|2 years ago|reply

Thanks. Let us know if we can help out in any way. We just set up a Discord server you can join as well https://discord.gg/A59Uxyy2k9

[+] saigal|2 years ago|reply

Very cool! is this mostly for unstructured data search? Any interest in adding tabular / structured data NL search?

[+] cpursley|2 years ago|reply

Neat! How difficult would it be to extend this to SQL like query languages like Hasura and PostgREST (via Supabase)?

- https://hasura.io/docs/latest/api-reference/graphql-api/quer...

- https://postgrest.org/en/stable/references/api/tables_views....

^ The advantage would be, these tools handle authorization.

[+] jcjc712|2 years ago|reply

It shouldn't be difficult, if we provide many examples and the db/table schema the agent can provide an accurate response

[+] aazo11|2 years ago|reply

Interesting. So far we have been focused on allowing non-technical users to self-serve from enterprise data warehouses. It might be possible to enable this scenario with a simple addition to the NL-2-SQL engine to translate the generated SQL to API calls. If that does not work, a custom imeplementation of the NL-2-SQL module would def solve this.

Do you have a specific use-case in mind? Would love to collaborate on adding it to the open-source library if you are open to it.

[+] panarky|2 years ago|reply

From the prompt context:

DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.

I've found that spelling out exactly what I don't want an LLM to do dramatically increases the probability that it will do exactly that.

[+] YeGoblynQueenne|2 years ago|reply

"Don't think of a pink elephant".

[+] p1necone|2 years ago|reply

Yeah, I hope this is paired with something that actually stops those statements from running otherwise this is a pretty naive way to do this.

[+] addisonj|2 years ago|reply

Congrats on the initial launch!

Here a few thoughts, feedback, and questions:

* you do a good job in this post of describing why you need more than just ChatGPT to get acceptable quality, but much less of that is in the readme. I wouldn't be afraid to sell the project a bit harder, even in early days

* likewise, I think a small visual of the architecture would be helpful, just to make clear what the relationship is to chatGPT, and the additional features of a context store, etc

* Between the notes here and your product pages, it seems a strategy for commercialization with this repo being a lower level tool, with your commercial service having UI, simplified integration, and a hosted version to make this easier for semi-techical teams? If that is the case, I wouldn't be afraid to make that more explicit. GitHub is more and more a place for discovery, even to semi-techical people, but to do that well, I think focusing on a readme that makes it clear who the open source is for is important

* How are you planning on solving the data access problem? Is this a full SaaS service that will need access to the customer data directly somehow? Do you deploy this in the customer's environment? In thinking beyond this open source release and to the commercial side, that would be what I would want to know

* This point has been hit by others in this thread, but I would be curious to know what your plans are to help protect against valid but incorrect queries? I am not as convinced that this problem is insurmountable, as it does seem like you could build features to remove ambiguity by asking questions, show alternatives, etc that most semi-techical people could reason through, but ultimately it seems like thinking about how to involve the teams that own the data might be an important part of the problem.

Anyways, congrats again! This is an area I am really excited to see how it evolves (and will be exploring more!). My email is in my profile if you are interested in chatting more :)

[+] aazo11|2 years ago|reply

Thanks for the feedback regarding the Readme. We have some additional detail in the docs here https://dataherald.readthedocs.io/en/latest/index.html (including an architecture doc), but will definitely be incorporating some of the feedback as we iterate on the documentation which.

You can connect the hosted version to your data warehouse using the SQLAlchemy connectors which are in the open source engine, and the connection info will be stored after being encrypted. An on-prem version of the enterprise product is also on the roadmap.

For protecting against errors the approach is to allow for a human to be in the loop as you alluded to. One of the core pieces of the engine is the Evaluation module which combines white box and black box techniques. You can block the response from going back to the question asker if the confidence level is below a certain threshold until a human verifies the response manually (or always put the human from the data team in the loop).

[+] saigal|2 years ago|reply

I want to double-down on what AAzo11 replied. There is a human in the loop via an admin dashboard that allows quick approval / editing in the event of incorrect queries. This significantly decreases time for a member of the data team to field requests since they typically block out an hour to approve/edit these queries before going back to the question-asker.

[+] neural_thing|2 years ago|reply

When I look at all the LLM SQL tools, I think: what a cheap and accessible way to get the wrong answers.

SQL is easy. Knowledge management is hard. Does the LLM know that there was a bug in June that changed the data? Does it know that this one column is improperly named and confusing? Does it know that you recently released a mobile app the data from which is in a different table?

No, of course not, those things are never explicitly documented and so are invisible to an LLM.

[+] mellosouls|2 years ago|reply

SQL is easy

If it was that easy you wouldn't have a tradition of devs trying for any alternative.

There's obviously a use case for this sort of product, objections appealing to the ease of use of any technical language or toolset are unlikely to be convincing to the majority who are not comfortable with it.

[+] Nick_Molter|2 years ago|reply

Love this! I don't think you could be more right about the practical challenges of implementing something like this. In my experience, this same problem is what makes it so challenging to onboard new data scientists/analysts.

It takes a lot of training to get a team member up to speed - with the same amount of training, do you think an LLM can compete?

[+] vekker|2 years ago|reply

SQL is easy, the problem is that some enterprise database schemas have gotten incredibly complex over time. I think LLMs might help the maintainers navigate such a complex landscape. Especially if comments are added to each table with clarifications. The only serious limit here is the LLM's context length...

[+] itsoktocry|2 years ago|reply

>Does the LLM know that there was a bug in June that changed the data?

No. Does "the SQL"? Also no.

That requirement falls on the person using the tool, doesn't it?

[+] neural_thing|2 years ago|reply

A less sarcastic take:

This only works if you have perfect data engineering. But if you have perfect data engineering, point and click tools like Looker will work too.

[+] l5870uoo9y|2 years ago|reply

> select price from rent_prices where city=”Los Angeles” AND month=”05” AND year=”2023”

I have to say the best implementations out there are already far ahead. Naturally there are some context limitations with large database schema but those can be handled:

- having an easy way to include/exclude tables ad hoc

- importing database schema in multiple different data sources that you can switch between depending on usage

And of course if AI isn't using the correct tables or columns, simple prime the prompt using autosuggest feature containing the tables and columns.

Generating JOINS isn't really a hurdle if you have a good prompt template:

- https://aihelperbot.com/snippets/cllovmlat000kmj0fedssj8c5 (notice it even uses the correct "film" tables even-thought it was instructed to find "movies" and naturaly category and category bridges table)

- https://aihelperbot.com/snippets/cllnztzky0002mj0fa78hw0o2 (it can also find lat/lng for Boston for AWS Athena, was demonstrating this for a Boston company)

- https://aihelperbot.com/snippets/cllo1vwl6000wlb0fj5669zt4 (It can generate in Mandarin as well, but less capable than English. Same goes for German, French, Spanish and so)

You can also have AI generate advanced SWOT analysis based on your database schema: https://i.imgur.com/JTv4QmX.png

[+] xfalcox|2 years ago|reply

We are experimenting in this area, as a way to provide a way for community managers to query their Discourse PG database using natural language.

Your product is very interesting. Maybe we can collaborate here, as Discourse is a standard Rails database schema on PostgreSQL and could be a good way to show your tool working in a real database.

[+] aazo11|2 years ago|reply

Hi -- would love to connect and discuss more. Shoot me an email amir at dataherald.com.

[+] klft|2 years ago|reply

Could you provide more details about your approach? Is the context store based in vector-embeddings?

[+] MrezaPourreza|2 years ago|reply

I'd be delighted to assist you with your inquiries. Indeed, the context store interacts with vector databases to retrieve samples based on vector embeddings and cosine similarity.

[+] BenderV|2 years ago|reply

Congrats for your launch !

I also built NL2SQL solution (relying on OpenAI).

My first version was a direct NL2SQL version, named Olympe - https://github.com/BenderV/olympe I used it quite a while but trying to plug it to real database (100+ tables, unprepared) was unsuccesful.

I switched to chat version, named Ada - https://github.com/BenderV/ada

IMHO, it's the way to go. The AI explore the database, it's connection, the data format & co. Plus, it help with ambiguity and feels more "natural".

[+] Nick_Molter|2 years ago|reply

Nice! how's the switch to Ada been? Any unexpected hurdles?

[+] screye|2 years ago|reply

Hey ! This is awesome. I have been playing around with the dirty DIN-SQL repo for a couple of weeks, so this is a delight.

A couple of questions:

* How easy is it to swap out the LLM ? (to Azure GPT4)

* Do you have 1 end2end custom user scenario ? It would really help to have 1 golden workflow (the fastest way to get things up and running) and 1 custom workflow (how to maximally configure the package to make it work the way you want) outlined in a demo/blog/tutorial.

Great work! I especially like how neat and transferrable the ideas in DIN-SQL are. Should naturally improve itself as new LLMs come out.

[+] imrehg|2 years ago|reply

This looks very interesting and timely for me, thanks for sharing!

I was looking at the project to spin it up with a BigQuery connection, but I couldn't find an example in the docs (expected it on this page: https://dataherald.readthedocs.io/en/latest/api.database.htm...), nor can I find it in the repo. Any hints on this? Otherwise I'm fine going source code diving. :)

[+] ainesh93|2 years ago|reply

Hi, you can find updated documentation on connecting to BigQuery here: https://dataherald.readthedocs.io/en/latest/api.database.htm.... We have also updated the ReadMe.

[+] zainhoda|2 years ago|reply

Nice job! We're building something relatively similar at Vanna AI: https://vanna.ai/

[+] aazo11|2 years ago|reply

Steve from Hegel-AI had showed us Vanna. Looks super cool. Let us know if you think there is room to collaborate.

[+] saigal|2 years ago|reply

awesome, we'll check it out!

[+] cloudking|2 years ago|reply

Any solutions like this that work with self hosted models?

[+] aazo11|2 years ago|reply

Depends what you mean by 'work.' You can definitely use them, but GPT-4 is the only model that generates good SQL without lots of training + fine-tuning (and it is very difficult/impossible to come up with Nl <> SQL training sets). So GPT-4 is the only model that works well in terms of accuracy for now.

That being said, I am pretty sure the self hosted models will get there very soon.

Are you looking to deploy a solution with a self-hosted model?

[+] avree|2 years ago|reply

Looks really cool. Congrats on the launch. Typo in the readme, btw, "enteprise-level" instead of "enterprise-level"

[+] sandGorgon|2 years ago|reply

this is very interesting. just curious - is the context store specific to the table structure of the target database ? or is it a very generic one. im asking this cos the DIN-SQL algorithm seems to be a chain-of-thought kind of algorithm (with query decomposition). I'm wondering where do u plug-in the context store ? in DIN-SQL the schema linking module is random samples from Spider. Is that whats in the context store ?

[+] elietoubi|2 years ago|reply

Reminds me of https://www.parse.dev Used it and it was pretty cool!

[+] Nick_Molter|2 years ago|reply

Do you still use this? Thoughts?

[+] vekker|2 years ago|reply

This is interesting. I have an enterprise customer who's looking for exactly this. Any plans to support DB2?

[+] saigal|2 years ago|reply

happy to connect and describe more to see whether we could collaborate. please feel free to email be at anuj at dataherald dot com or sign up here and we will get in touch https://www.dataherald.com/contact

[+] o_____________o|2 years ago|reply

How are you dealing with stored procedures?

[+] aazo11|2 years ago|reply

TLDR -- We currently do not automatically load stored procedures as context for in context learning/prompting but it is on the roadmap.

Additional color: Once you configure a connection to a DB, you can trigger a 'scan' of the DB through the API which detects various tables/columns, identifies low cardinality (categorical) columns and their values and also loads in VIEWS stored on tables. In our default NL2SQL agent, these are then used together with 'verified' NL2SQL pairs and with other metadata you can manually add to generate the SQL.

We are looking to automatically add context from Stored procedures and other sources like dbt and various data dictionaries as well.

[+] DylanDmitri|2 years ago|reply

I’ve seen impressive demos with just LLM direct to SQL. Can’t wait to see what people build with these new tools.

[+] aazo11|2 years ago|reply

Hi Dylan -- new GPT-4 class LLMs have gotten good at writing correct SQL, but while the SQL they generate almost always executes correctly they often write SQL that generates incorrect answers to the question. Some reasons for this can be the business context or definition is not in the table schema, or if the correct query is complex and requires multiple joins.