Show HN: Dataherald AI – Natural Language to SQL Engine
215 points | aazo11 | 2 years ago | github.com | reply
GPT-4 class LLMs have gotten remarkably good at writing SQL. However, out-of-the-box LLMs and existing frameworks would not work with our own structured data at a necessary quality level. For example, given the question “what was the average rent in Los Angeles in May 2023?” a reasonable human would either assume the question is about Los Angeles, CA or would confirm the state with the question asker in a follow up. However, an LLM translates this to:
select price from rent_prices where city='Los Angeles' AND month='05' AND year='2023'
This pulls data for both Los Angeles, CA and Los Angeles, TX, with no columns to differentiate between the two. You can read more about the challenges of enterprise-level text-to-SQL in this blog post I wrote on the topic: https://medium.com/dataherald/why-enterprise-natural-languag...
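To make the ambiguity concrete, here is a minimal sketch (using Python's sqlite3 with a made-up rent_prices table; the column names and prices are assumptions for illustration) showing how the naive query silently blends the two cities:

```python
import sqlite3

# Hypothetical rent_prices table to illustrate the ambiguity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rent_prices (city TEXT, state TEXT, month TEXT, year TEXT, price REAL)")
conn.executemany(
    "INSERT INTO rent_prices VALUES (?, ?, ?, ?, ?)",
    [("Los Angeles", "CA", "05", "2023", 2800.0),
     ("Los Angeles", "TX", "05", "2023", 1100.0)],
)

# The naive LLM-generated query silently averages both cities.
naive = conn.execute(
    "SELECT AVG(price) FROM rent_prices WHERE city='Los Angeles' AND month='05' AND year='2023'"
).fetchone()[0]

# A disambiguated query pins down the state.
correct = conn.execute(
    "SELECT AVG(price) FROM rent_prices WHERE city='Los Angeles' AND state='CA' "
    "AND month='05' AND year='2023'"
).fetchone()[0]

print(naive)    # 1950.0 -- a meaningless blend of CA and TX
print(correct)  # 2800.0
```

The naive result looks perfectly plausible, which is exactly what makes it dangerous.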
Dataherald comes "batteries included." It has best-in-class implementations of core components, including, but not limited to, a state-of-the-art NL-to-SQL agent and an LLM-based SQL-accuracy evaluator. The architecture is modular, allowing these components to be easily replaced. It's easy to set up and use with major data warehouses.
There is a "Context Store" where information (NL2SQL examples, schemas, and table descriptions) is pulled into the LLM prompts, so the engine gets better with usage. And we even made it fast!
This version allows you to easily connect to PG, Databricks, BigQuery or Snowflake and set up an API for semantic interactions with your structured data. You can then add business and data context that is used for few-shot prompting by the engine.
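A rough sketch of what few-shot prompting from a context store can look like (the prompt wording, function name, and example-pair format here are illustrative assumptions, not the engine's actual internals):

```python
# Hypothetical sketch: assemble a prompt from the schema plus verified
# NL2SQL pairs retrieved from a context store.
def build_prompt(question, schema, examples):
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in examples)
    return (
        f"Given the schema:\n{schema}\n\n"
        f"Here are verified examples:\n{shots}\n\n"
        f"Question: {question}\nSQL:"
    )

examples = [
    ("What was the average rent in Los Angeles, CA in May 2023?",
     "SELECT AVG(price) FROM rent_prices "
     "WHERE city='Los Angeles' AND state='CA' AND month='05' AND year='2023'"),
]
prompt = build_prompt(
    "What was the average rent in San Diego, CA in June 2023?",
    "CREATE TABLE rent_prices (city, state, month, year, price);",
    examples,
)
print(prompt)
```

The verified pairs show the model house conventions (e.g., always filtering on state), which is how usage makes the engine better over time.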
The NL-to-SQL agent in this open source release was developed by our own Mohammadreza Pourreza, whose DIN-SQL algorithm currently tops the Spider (https://yale-lily.github.io/spider) and Bird (https://bird-bench.github.io/) NL2SQL benchmarks. In our own internal benchmarking, this agent has outperformed the LangChain SQLAgent by anywhere from 12% to 250% (depending on the provided context) while being only ~15s slower on average.
Needless to say, this is an early release and the codebase is under swift development. We would love for you to try it out and give us your feedback! And if you are interested in contributing, we’d love to hear from you!
[+] [-] mcqueenjordan|2 years ago|reply
There's just one thing I worry about: losing expertise in your data model and gaining organizational false confidence in bad data. Let's consider Bob. Bob is a Product Manager.
Bob always used to bother his software engineers to write SQL queries, but now he just uses this tool. Bob didn't write the tables or the data structures, so Bob doesn't know the nuances of the data model. Bob just types English and gets result sets back. Bob doesn't know that the field order_status can also be "pending_legal", and neither does the "SQL compiler" know when it's appropriate to add or elide that field. Bob then presents his data to leadership to make changes to the Pending Order Logic, based on bad data.
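Bob's failure mode is easy to reproduce. A minimal sketch (hypothetical orders table; status values assumed for illustration) of how the generated query and the data model's actual meaning diverge:

```python
import sqlite3

# Hypothetical orders table where order_status can also be 'pending_legal'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "pending"), (2, "pending"),
                  (3, "pending_legal"), (4, "shipped")])

# What Bob's English question compiles to: only 'pending'.
naive = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_status = 'pending'"
).fetchone()[0]

# What the data model actually means by 'pending orders'.
informed = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_status IN ('pending', 'pending_legal')"
).fetchone()[0]

print(naive, informed)  # 2 3
```

Both queries are valid SQL; only someone who knows the data model can tell which one answers Bob's question.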
[+] [-] kizum|2 years ago|reply
Will definitely look at using Dataherald for adding proper SQL support!
[+] [-] aazo11|2 years ago|reply
[+] [-] saigal|2 years ago|reply
[+] [-] cpursley|2 years ago|reply
- https://hasura.io/docs/latest/api-reference/graphql-api/quer...
- https://postgrest.org/en/stable/references/api/tables_views....
^ The advantage would be that these tools handle authorization.
[+] [-] jcjc712|2 years ago|reply
[+] [-] aazo11|2 years ago|reply
Do you have a specific use-case in mind? Would love to collaborate on adding it to the open-source library if you are open to it.
[+] [-] panarky|2 years ago|reply
DO NOT make any DML statements (INSERT, UPDATE, DELETE, DROP etc.) to the database.
I've found that spelling out exactly what I don't want an LLM to do dramatically increases the probability that it will do exactly that.
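That's why a hard guardrail outside the prompt tends to work better than asking nicely. A sketch of one (illustrative only; a production guard should use a real SQL parser, and note that a forbidden keyword inside a string literal would be falsely rejected by this conservative check):

```python
import re

# Hypothetical guardrail: reject anything that isn't a single read-only
# statement before it ever reaches the database.
FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|truncate|create|grant)\b", re.I)

def is_safe_select(sql):
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # refuse multi-statement payloads
        return False
    if not re.match(r"(?i)(select|with)\b", stripped):
        return False
    return not FORBIDDEN.search(stripped)

print(is_safe_select("SELECT * FROM rents WHERE city = 'Los Angeles'"))  # True
print(is_safe_select("DROP TABLE rents"))                                # False
print(is_safe_select("SELECT 1; DELETE FROM rents"))                     # False
```

Running the connection as a read-only database user is the other belt-and-suspenders layer.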
[+] [-] YeGoblynQueenne|2 years ago|reply
[+] [-] p1necone|2 years ago|reply
[+] [-] addisonj|2 years ago|reply
Here are a few thoughts, feedback, and questions:
* you do a good job in this post of describing why you need more than just ChatGPT to get acceptable quality, but much less of that is in the readme. I wouldn't be afraid to sell the project a bit harder, even in early days
* likewise, I think a small visual of the architecture would be helpful, just to make clear what the relationship is to ChatGPT, and the additional features of the context store, etc
* Between the notes here and your product pages, it seems the commercialization strategy is this repo being a lower-level tool, with your commercial service adding a UI, simplified integration, and a hosted version to make this easier for semi-technical teams? If that is the case, I wouldn't be afraid to make that more explicit. GitHub is more and more a place for discovery, even for semi-technical people, but to do that well, I think focusing on a readme that makes it clear who the open source is for is important
* How are you planning on solving the data access problem? Is this a full SaaS service that will need access to the customer data directly somehow? Do you deploy this in the customer's environment? In thinking beyond this open source release and to the commercial side, that would be what I would want to know
* This point has been hit by others in this thread, but I would be curious to know what your plans are to help protect against valid but incorrect queries. I am not convinced that this problem is insurmountable, as it seems like you could build features to remove ambiguity by asking questions, showing alternatives, etc., that most semi-technical people could reason through, but ultimately it seems like thinking about how to involve the teams that own the data might be an important part of the problem.
Anyways, congrats again! This is an area I am really excited to see how it evolves (and will be exploring more!). My email is in my profile if you are interested in chatting more :)
[+] [-] aazo11|2 years ago|reply
You can connect the hosted version to your data warehouse using the SQLAlchemy connectors which are in the open source engine, and the connection info will be stored after being encrypted. An on-prem version of the enterprise product is also on the roadmap.
For protecting against errors the approach is to allow for a human to be in the loop as you alluded to. One of the core pieces of the engine is the Evaluation module which combines white box and black box techniques. You can block the response from going back to the question asker if the confidence level is below a certain threshold until a human verifies the response manually (or always put the human from the data team in the loop).
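The gating described above can be sketched in a few lines (names and the threshold are illustrative assumptions, not Dataherald's actual API):

```python
# Hypothetical sketch of confidence-gated human-in-the-loop routing.
from dataclasses import dataclass

@dataclass
class EvaluatedResponse:
    sql: str
    confidence: float  # 0.0-1.0, produced by an evaluator module

CONFIDENCE_THRESHOLD = 0.8  # assumed; would be configurable per deployment

def route(response):
    """Return the SQL directly, or hold it for review by the data team."""
    if response.confidence >= CONFIDENCE_THRESHOLD:
        return "auto_approved"
    return "needs_human_review"

print(route(EvaluatedResponse("SELECT ...", 0.92)))  # auto_approved
print(route(EvaluatedResponse("SELECT ...", 0.41)))  # needs_human_review
```

Setting the threshold to 1.0 degenerates to the "always keep a human in the loop" mode.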
[+] [-] saigal|2 years ago|reply
[+] [-] neural_thing|2 years ago|reply
SQL is easy. Knowledge management is hard. Does the LLM know that there was a bug in June that changed the data? Does it know that this one column is improperly named and confusing? Does it know that you recently released a mobile app the data from which is in a different table?
No, of course not, those things are never explicitly documented and so are invisible to an LLM.
[+] [-] mellosouls|2 years ago|reply
If it were that easy, you wouldn't have a tradition of devs trying for any alternative.
There's obviously a use case for this sort of product; objections appealing to the ease of use of any technical language or toolset are unlikely to convince the majority who are not comfortable with it.
[+] [-] Nick_Molter|2 years ago|reply
It takes a lot of training to get a team member up to speed - with the same amount of training, do you think an LLM can compete?
[+] [-] vekker|2 years ago|reply
[+] [-] itsoktocry|2 years ago|reply
No. Does "the SQL"? Also no.
That requirement falls on the person using the tool, doesn't it?
[+] [-] neural_thing|2 years ago|reply
This only works if you have perfect data engineering. But if you have perfect data engineering, point and click tools like Looker will work too.
[+] [-] l5870uoo9y|2 years ago|reply
I have to say the best implementations out there are already far ahead. Naturally there are some context limitations with large database schemas, but those can be handled by:
- having an easy way to include/exclude tables ad hoc
- importing database schema in multiple different data sources that you can switch between depending on usage
And of course, if the AI isn't using the correct tables or columns, simply prime the prompt using an autosuggest feature containing the tables and columns.
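The include/exclude idea above is simple to sketch: prune the schema before it goes into the prompt so a large database still fits in the context window (function name, schema format, and table names here are made up for illustration):

```python
# Hypothetical sketch: filter which tables' DDL is sent to the LLM.
def build_schema_context(schema, include=None, exclude=frozenset()):
    lines = []
    for table, columns in schema.items():
        if include is not None and table not in include:
            continue
        if table in exclude:
            continue
        lines.append(f"CREATE TABLE {table} ({', '.join(columns)});")
    return "\n".join(lines)

schema = {
    "film": ["film_id", "title"],
    "category": ["category_id", "name"],
    "film_category": ["film_id", "category_id"],
    "staff_audit_log": ["id", "payload"],  # noise we want to keep out
}
context = build_schema_context(schema, exclude={"staff_audit_log"})
print(context)
```

Swapping the `schema` dict per data source gives the "multiple data sources you can switch between" behavior.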
Generating JOINs isn't really a hurdle if you have a good prompt template:
- https://aihelperbot.com/snippets/cllovmlat000kmj0fedssj8c5 (notice it even uses the correct "film" tables even though it was instructed to find "movies", and naturally the category and category bridge tables)
- https://aihelperbot.com/snippets/cllnztzky0002mj0fa78hw0o2 (it can also find lat/lng for Boston for AWS Athena, was demonstrating this for a Boston company)
- https://aihelperbot.com/snippets/cllo1vwl6000wlb0fj5669zt4 (It can generate in Mandarin as well, though less capably than in English. Same goes for German, French, Spanish, and so on.)
You can also have AI generate advanced SWOT analysis based on your database schema: https://i.imgur.com/JTv4QmX.png
[+] [-] xfalcox|2 years ago|reply
Your product is very interesting. Maybe we can collaborate here, as Discourse is a standard Rails database schema on PostgreSQL and could be a good way to show your tool working in a real database.
[+] [-] aazo11|2 years ago|reply
[+] [-] klft|2 years ago|reply
[+] [-] MrezaPourreza|2 years ago|reply
[+] [-] BenderV|2 years ago|reply
I also built an NL2SQL solution (relying on OpenAI).
My first version was a direct NL2SQL tool, named Olympe - https://github.com/BenderV/olympe I used it for quite a while, but trying to plug it into a real database (100+ tables, unprepared) was unsuccessful.
I switched to a chat version, named Ada - https://github.com/BenderV/ada
IMHO, it's the way to go. The AI explores the database, its connections, the data format, and so on. Plus, it helps with ambiguity and feels more "natural".
[+] [-] Nick_Molter|2 years ago|reply
[+] [-] screye|2 years ago|reply
A couple of questions:
* How easy is it to swap out the LLM? (e.g., to Azure GPT-4)
* Do you have one end-to-end custom user scenario? It would really help to have one golden workflow (the fastest way to get things up and running) and one custom workflow (how to maximally configure the package to make it work the way you want) outlined in a demo/blog/tutorial.
Great work! I especially like how neat and transferable the ideas in DIN-SQL are. It should naturally improve as new LLMs come out.
[+] [-] imrehg|2 years ago|reply
I was looking at the project to spin it up with a BigQuery connection, but I couldn't find an example in the docs (expected it on this page: https://dataherald.readthedocs.io/en/latest/api.database.htm...), nor can I find it in the repo. Any hints on this? Otherwise I'm fine going source code diving. :)
[+] [-] ainesh93|2 years ago|reply
[+] [-] zainhoda|2 years ago|reply
[+] [-] aazo11|2 years ago|reply
[+] [-] saigal|2 years ago|reply
[+] [-] cloudking|2 years ago|reply
[+] [-] aazo11|2 years ago|reply
That being said, I am pretty sure the self hosted models will get there very soon.
Are you looking to deploy a solution with a self-hosted model?
[+] [-] avree|2 years ago|reply
[+] [-] sandGorgon|2 years ago|reply
[+] [-] elietoubi|2 years ago|reply
[+] [-] Nick_Molter|2 years ago|reply
[+] [-] vekker|2 years ago|reply
[+] [-] saigal|2 years ago|reply
[+] [-] o_____________o|2 years ago|reply
[+] [-] aazo11|2 years ago|reply
Additional color: once you configure a connection to a DB, you can trigger a 'scan' of the DB through the API, which detects the tables/columns, identifies low-cardinality (categorical) columns and their values, and also loads in views defined on the tables. In our default NL2SQL agent, these are then used together with 'verified' NL2SQL pairs and other metadata you can manually add to generate the SQL.
We are looking to automatically add context from stored procedures and other sources like dbt and various data dictionaries as well.
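The low-cardinality detection in that scan can be sketched like this (threshold, function name, and the sample table are illustrative assumptions, not the engine's actual implementation):

```python
import sqlite3

# Hypothetical sketch: flag columns whose distinct-value count is below a
# threshold as categorical, and record their values for the prompt context.
def scan_categorical(conn, table, columns, max_cardinality=10):
    result = {}
    for col in columns:
        values = [r[0] for r in conn.execute(f"SELECT DISTINCT {col} FROM {table}")]
        if len(values) <= max_cardinality:
            result[col] = sorted(v for v in values if v is not None)
    return result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, order_status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, s) for i, s in enumerate(["pending", "shipped", "pending_legal"] * 5)])

cats = scan_categorical(conn, "orders", ["id", "order_status"])
print(cats)  # {'order_status': ['pending', 'pending_legal', 'shipped']}
```

Knowing the exact legal values of a status column is precisely what lets the agent filter on "pending_legal" when a user just says "pending orders."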
[+] [-] DylanDmitri|2 years ago|reply
[+] [-] aazo11|2 years ago|reply