top | item 40132424

Show HN: Hashquery, a Python library for defining reusable analysis

67 points| cpimhoff | 1 year ago |hashquery.dev

Hi all, we recently open sourced the first version of Hashquery, a Python library which lets you model analytics, DRY up common logic, and execute it against a database or warehouse.

We were originally rendering SQL directly for all our queries, but that spiraled out of control with more complex analysis; the SQL needed to be changed sporadically for each new database dialect (BigQuery, Redshift, Aethena, Postgres, etc etc) and the SQL fragments were very challenging to reuse (and so fragments were copy-pasted all over).

~~~

Advantages we think it has over writing SQL by hand:

- Queries are fully compossible, so any analysis can be chained into further analysis without refactoring.

- All analysis executes within the data warehouse itself, making them super fast for big data.

- It's just Python, so you can extend or parameterize query logic with a simple def function.

- You can run it anywhere, like inside of unit tests, ETL nodes, or Jupyter notebooks.

- Developer experience is pretty good since editors already know how to autocomplete Python. ChatGPT knows Python already too :)

- The library handles normalizing between database dialects. Write once and use in any database.

- Hashquery content is fully serializable, which makes it a good fit for exposing a flexible API for consumers to efficiently query datamarts or internal analytics. We use it as the endpoint for headless BI, as opposed to having to define and teach a DSL or GraphQL specification.

~~~

We've built native funnel analysis on top of Hashquery and have been thrilled with it so far, and we thought others might want to use it too. It's pretty early days so we're still trying to explain it, and the docs aren't perfectly clear, but the examples on the dev site are editable and you can download the pip package to play around with it!

18 comments

code_biologist|1 year ago

I'm exactly the target audience for this type of tool, a tech leader that has implemented a data warehouse and BI strategy. Some concrete tips for adoption:

- Break the dependency on your product. I need to be able to use the library even if your company goes under.

- Add a dbt library that makes it easy to use hashquery within dbt models. It gets you materialization for free and will answer a lot of questions you will get about dbt integration.

To comment more broadly, if you want to be a broad solution, the going trend in data integration seems to be at the warehouse level so you need to have SQL answers.

A bunch of tools all consume from our warehouse (existing BI, reverse ETL, data science systems). A BI definition tool won't work if I can't define segments in a way that all of those can access, even as just tables or views.

Our programmers and data science people know Python and are often very good at SQL, but their time is short and BI projects depending on them have been delayed. Our analysts know SQL, and have the dedicated time to make these projects happen.

This kind of code snippet isn't crazy to put into dbt, and if someone wants to do Python magic in the background they can:

    hq_project.models.events_model.with_activity_schema(
        group='user_id', timestamp='timestamp', event_key='event_type'
    ).funnel("ad_impression", "add_to_cart", "buy").as_sql()

cpimhoff|1 year ago

All great advice. All this resonated with the team.

100% on decoupling it from the product. Our customers needed a headless BI solution and we needed a better internal framework, so the stars aligned for this first version being a little coupled. This kind of feedback is helpful, because it helps me advocate for capacity to decouple it fully!

Hashquery does have APIs to get the rendered SQL without executing it, so I think integration with any in-warehouse processing tool is possible, dbt likely being the most valuable.

Full materialization/interoperability with SQL is hard for any tool trying to encapsulate the semantics of the SQL. The intent of those tools is to encode concepts that are not possible to represent with static tables, so folding them back to a SQL-surface area will always be lossy. Having said that, we could certainly have a better story around it — materialization can still be useful even if it has some caveats.

grantjpowell|1 year ago

This looks neat. I'm the author of a similar project in typescript we use at Cotera called Era [0]. Y'all might be implement something similar to our caching layer [1] which we think is super useful. Once you have a decent cross warehouse representation it's pretty easy to "split" queries across the real warehouse and something like duckdb. The other thing that we find useful in Era that y'all might like are "Invariants"[2]. Invariants work by compiling lazily evaluated invalid casts into the query that only trigger under failing conditions. We use "invariants" to fail a query at _runtime_, which eliminates TOUTOC problems that come from a DBT tests style solution.

    [0] https://newera.dev/
    [1] https://cotera.co/blog/how-era-brings-last-mile-analytics-to-any-data-warehouse-via-duckdb
    [2] https://newera.dev/docs/invariants

anentropic|1 year ago

I'm potentially super interested in this as am building this kind of feature for my job at the moment

But https://hashquery.dev/#faq says:

> the Hashquery SQL compiler is not available to run locally, so you do need to define your data connections inside of Hashboard and use its API to execute your queries.

> We do plan on making the full Hashquery stack available to run locally in the near future

I'm not quite sure what the use case for this library is at present

If I'm not a Hashboard customer and don't want to pay $60/mo for a nicer way to query my existing db, what am I going to do with it?

Hashboard seems roughly similar to Superset and/or Cube?

cpimhoff|1 year ago

Bit of context here is that Hashquery started as an internal tool and as a way to power our headless BI offering, so it still has a few bits tied to our infrastructure we're working to fully separate.

For folks wanting to hack around more with Hashquery in the meantime, Hashboard accounts can be created for free and we don't have plans to charge folks just using the Hashquery stuff (but we probably will bother you for feedback ). You can sign up here https://hashboard.com/getAccess

totalhack|1 year ago

This looks cool. I built a similar open source semantic data / warehousing tool called Zillion. I use it to power my company's BI but haven't put as much time into the polish as you guys.

https://github.com/totalhack/zillion

guy4261|1 year ago

Looks pretty darn cool! Two questions please:

1. How does this compare to dbt? If we're already using dbt, why migrate?

2. Will you consider making a tool that tries to transpile SQL back to Hashquery models? This way I can work against my database, then merge the changes back to the model.

Good luck!

cpimhoff|1 year ago

> 1. How does this compare to dbt? If we're already using dbt, why migrate?

I actually think dbt and Hashquery work very well alongside one another!

dbt can help you normalize, clean, and materialize your data upstream, then Hashquery can be used to associate semantics to those output tables (measurements, synthesized attributes, common views) and query them.

So dbt can be the build/ETL part, and Hashquery can be the semantic layer/frontend for analytical queries.

> 2. Will you consider making a tool that tries to transpile SQL back to Hashquery models?

This is a really interesting idea!

I'm pretty skeptical we could make this technically feasible. Compilation from a higher level abstraction (Hashquery semantics) to a lower level abstraction is inherently lossy and can't really be done in reverse without a lot of noise.

Hashquery has a lot of escape hatches for raw SQL though if you need to access some functionality not yet implemented as part of the project. There's API to inline Hashquery structures inside of SQL fragments and visa versa.

peterhadlaw|1 year ago

I really don't understand the appeal of dbt vs a proper programming language. The templating approach leads to massive spaghetti. I look forward to trying out something like Ibis [0]

0: https://ibis-project.org/

mjirv|1 year ago

re: dbt, it looks like Hashquery is more for running analyses than doing in-data-warehouse transformations.

Unless I'm mistaken there's no native way in Hashquery right now to materialize the results back to your warehouse like you would with dbt.

demaga|1 year ago

> way beyond the capabilities of standard SQL

Maybe some examples would help

> AI and LLMs (coming soon)

Why?

cpimhoff|1 year ago

> way beyond the capabilities of standard SQL

Maybe the more precise language would be "way beyond the _expressive_ capabilities of standard SQL". Ultimately Hashquery compiles into SQL for execution, in that way it's a bit of a transpiler.

One concrete example is funnel analysis. In SQL, an efficient funnel analysis on large data sets will span several hundred lines of pipelined queries; writing, reading, and re-parameterizing this is very challenging. In Hashquery, it's abstracted into a single function call: `model.funnel(*str[])`.

> AI and LLMs (coming soon)

Frankly? SEO just told us to drop "AI" _somewhere_ on the page for new `.dev` website. I agree it's a little silly to make it a top-line bullet point.

We do think Hashquery is better suited for coding co-pilot tools, as it's a known language, and you can establish API boundaries that an AI tool won't break into and futz with (good for writing queries on top of already defined business logic).

mpeg|1 year ago

This is pretty cool, although DSLs can bring their own set of challenges vs writing SQL – how does it compare to dbt? (apart from it being python instead of SQL)

tomrod|1 year ago

How aware is the library regarding existing structure, e.g foreign key relationships?

cpimhoff|1 year ago

The library itself provides tools for declaring data tables, columns, metrics, and relationships, then performing transformations (eg. queries) on top of them.

Reflecting an existing database isn't part of the library, but with the magic of "it's just Python" it's pretty easy to write a function that does so yourself:

    import hashquery as hq
    def reflect_db(url: str) -> list[hq.Model]:
      """
      Given a connection string,
      returns a list of Hashquery models
      for all the physical tables in
      the database.
      """
      models: list[hq.Model] = []

      reflection = some_reflection_lib.reflect_db(url)
      for table in reflection.tables:
        models.append(
          hq.Model()
          .with_source(url, table.name, table.schema)
          .with_attributes(*table.column_names)
        )
        # ...more logic for importing foreign keys or whatever else

      return models

vndrewlee|1 year ago

Joining logic is defined in our data modeling layer, either from Hashboard (our BI tool) or can also be defined in Hashquery as well instead of replying on your database schema for fk relationships. We went this route since a lot of people are using dbt generated tables.

https://hashquery.dev/docs/pattern_guides/joins/

zsdev|1 year ago

Interested to see a comparison between hashquery and ibis. https://ibis-project.org/