Show HN: Hashquery, a Python library for defining reusable analysis
67 points| cpimhoff | 1 year ago |hashquery.dev
We were originally rendering SQL directly for all our queries, but that spiraled out of control with more complex analysis; the SQL needed to be changed sporadically for each new database dialect (BigQuery, Redshift, Aethena, Postgres, etc etc) and the SQL fragments were very challenging to reuse (and so fragments were copy-pasted all over).
~~~
Advantages we think it has over writing SQL by hand:
- Queries are fully compossible, so any analysis can be chained into further analysis without refactoring.
- All analysis executes within the data warehouse itself, making them super fast for big data.
- It's just Python, so you can extend or parameterize query logic with a simple def function.
- You can run it anywhere, like inside of unit tests, ETL nodes, or Jupyter notebooks.
- Developer experience is pretty good since editors already know how to autocomplete Python. ChatGPT knows Python already too :)
- The library handles normalizing between database dialects. Write once and use in any database.
- Hashquery content is fully serializable, which makes it a good fit for exposing a flexible API for consumers to efficiently query datamarts or internal analytics. We use it as the endpoint for headless BI, as opposed to having to define and teach a DSL or GraphQL specification.
~~~
We've built native funnel analysis on top of Hashquery and have been thrilled with it so far, and we thought others might want to use it too. It's pretty early days so we're still trying to explain it, and the docs aren't perfectly clear, but the examples on the dev site are editable and you can download the pip package to play around with it!
code_biologist|1 year ago
- Break the dependency on your product. I need to be able to use the library even if your company goes under.
- Add a dbt library that makes it easy to use hashquery within dbt models. It gets you materialization for free and will answer a lot of questions you will get about dbt integration.
To comment more broadly, if you want to be a broad solution, the going trend in data integration seems to be at the warehouse level so you need to have SQL answers.
A bunch of tools all consume from our warehouse (existing BI, reverse ETL, data science systems). A BI definition tool won't work if I can't define segments in a way that all of those can access, even as just tables or views.
Our programmers and data science people know Python and are often very good at SQL, but their time is short and BI projects depending on them have been delayed. Our analysts know SQL, and have the dedicated time to make these projects happen.
This kind of code snippet isn't crazy to put into dbt, and if someone wants to do Python magic in the background they can:
cpimhoff|1 year ago
100% on decoupling it from the product. Our customers needed a headless BI solution and we needed a better internal framework, so the stars aligned for this first version being a little coupled. This kind of feedback is helpful, because it helps me advocate for capacity to decouple it fully!
Hashquery does have APIs to get the rendered SQL without executing it, so I think integration with any in-warehouse processing tool is possible, dbt likely being the most valuable.
Full materialization/interoperability with SQL is hard for any tool trying to encapsulate the semantics of the SQL. The intent of those tools is to encode concepts that are not possible to represent with static tables, so folding them back to a SQL-surface area will always be lossy. Having said that, we could certainly have a better story around it — materialization can still be useful even if it has some caveats.
grantjpowell|1 year ago
anentropic|1 year ago
But https://hashquery.dev/#faq says:
> the Hashquery SQL compiler is not available to run locally, so you do need to define your data connections inside of Hashboard and use its API to execute your queries.
> We do plan on making the full Hashquery stack available to run locally in the near future
I'm not quite sure what the use case for this library is at present
If I'm not a Hashboard customer and don't want to pay $60/mo for a nicer way to query my existing db, what am I going to do with it?
Hashboard seems roughly similar to Superset and/or Cube?
cpimhoff|1 year ago
For folks wanting to hack around more with Hashquery in the meantime, Hashboard accounts can be created for free and we don't have plans to charge folks just using the Hashquery stuff (but we probably will bother you for feedback ). You can sign up here https://hashboard.com/getAccess
totalhack|1 year ago
https://github.com/totalhack/zillion
guy4261|1 year ago
1. How does this compare to dbt? If we're already using dbt, why migrate?
2. Will you consider making a tool that tries to transpile SQL back to Hashquery models? This way I can work against my database, then merge the changes back to the model.
Good luck!
cpimhoff|1 year ago
I actually think dbt and Hashquery work very well alongside one another!
dbt can help you normalize, clean, and materialize your data upstream, then Hashquery can be used to associate semantics to those output tables (measurements, synthesized attributes, common views) and query them.
So dbt can be the build/ETL part, and Hashquery can be the semantic layer/frontend for analytical queries.
> 2. Will you consider making a tool that tries to transpile SQL back to Hashquery models?
This is a really interesting idea!
I'm pretty skeptical we could make this technically feasible. Compilation from a higher level abstraction (Hashquery semantics) to a lower level abstraction is inherently lossy and can't really be done in reverse without a lot of noise.
Hashquery has a lot of escape hatches for raw SQL though if you need to access some functionality not yet implemented as part of the project. There's API to inline Hashquery structures inside of SQL fragments and visa versa.
peterhadlaw|1 year ago
0: https://ibis-project.org/
mjirv|1 year ago
Unless I'm mistaken there's no native way in Hashquery right now to materialize the results back to your warehouse like you would with dbt.
demaga|1 year ago
Maybe some examples would help
> AI and LLMs (coming soon)
Why?
cpimhoff|1 year ago
Maybe the more precise language would be "way beyond the _expressive_ capabilities of standard SQL". Ultimately Hashquery compiles into SQL for execution, in that way it's a bit of a transpiler.
One concrete example is funnel analysis. In SQL, an efficient funnel analysis on large data sets will span several hundred lines of pipelined queries; writing, reading, and re-parameterizing this is very challenging. In Hashquery, it's abstracted into a single function call: `model.funnel(*str[])`.
> AI and LLMs (coming soon)
Frankly? SEO just told us to drop "AI" _somewhere_ on the page for new `.dev` website. I agree it's a little silly to make it a top-line bullet point.
We do think Hashquery is better suited for coding co-pilot tools, as it's a known language, and you can establish API boundaries that an AI tool won't break into and futz with (good for writing queries on top of already defined business logic).
mpeg|1 year ago
tomrod|1 year ago
cpimhoff|1 year ago
Reflecting an existing database isn't part of the library, but with the magic of "it's just Python" it's pretty easy to write a function that does so yourself:
vndrewlee|1 year ago
https://hashquery.dev/docs/pattern_guides/joins/
zsdev|1 year ago