mildbyte's comments

mildbyte | 1 year ago | on: Show HN: Pg_analytica – Speed up queries by exporting tables to columnar format

Another difference is that this solution uses parquet_fdw, which handles fast scans through Parquet files and filter pushdown via row group pruning, but doesn't vectorize the groupby / join operations above the table scan in the query tree (so you're still using the row-by-row PG query executor in the end).
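To make "row group pruning" concrete, here's a toy Python sketch (the row-group statistics and the filter are invented for illustration; real Parquet readers get min/max stats from the file footer): the reader skips any row group whose value range can't possibly satisfy the filter.

```python
# Toy sketch of row group pruning: each "row group" carries min/max
# statistics for a column, and a filter like `value > 90` lets the
# reader skip groups whose range can't possibly match.
# (Illustrative only -- the numbers here are made up.)

row_groups = [
    {"id": 0, "min": 0,   "max": 50},
    {"id": 1, "min": 40,  "max": 120},
    {"id": 2, "min": 130, "max": 200},
]

def prune_for_greater_than(groups, threshold):
    """Keep only row groups that might contain rows with value > threshold."""
    return [g for g in groups if g["max"] > threshold]

survivors = prune_for_greater_than(row_groups, 90)
print([g["id"] for g in survivors])  # [1, 2] -- only these need scanning
```

The payoff is that entire chunks of the file are never read or decoded, which is where most of the scan speedup comes from.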

pg_analytics uses DataFusion (a dedicated analytical query engine) to run the entire query, which can achieve orders-of-magnitude speedups over vanilla PG with indexes on analytical benchmarks like TPC-H. We use the same approach at EDB for our Postgres Lakehouse (I'm part of the team that works on it).

mildbyte | 3 years ago | on: Databases in 2022: A Year in Review

I mentioned it recently[0], but this looks like a very good topic to plug our new database, Seafowl, that we released last year [1]. It also uses Apache DataFusion (like IOx) and separates storage and compute (like Neon, Snowflake etc) but is designed for client-side Web apps to run analytical SQL queries over HTTP (using semantics that make the query results cacheable by browser caches and CDNs). This makes it really useful for things like interactive visualizations or dashboards.

We're currently doing a lot of work at Splitgraph to reposition the product around this "analytics at the edge" use case, with usage-based billing, eventually moving our query execution from PostgreSQL to Seafowl.

[0] https://news.ycombinator.com/item?id=34175545

[1] https://seafowl.io

mildbyte | 3 years ago | on: PostgREST – Serve a RESTful API from any Postgres database

> why not just accept SQL and cut out all the unnecessary mapping?

You might be interested in what we're building: Seafowl, a database designed for running analytical SQL queries straight from the user's browser, with HTTP CDN-friendly caching [0]. It's a second iteration of the Splitgraph DDN [1] which we built on top of PostgreSQL (Seafowl is much faster for this use case, since it's based on Apache DataFusion + Parquet).

The tradeoff between letting the client run arbitrary SQL and exposing a limited API is that PostgREST-style queries have fairly predictable, low overhead, but they aren't as powerful as fully-fledged SQL with aggregations, joins, window functions and CTEs, which are useful in interactive dashboards for reducing the amount of data that has to be processed on the client.

There's also ROAPI [2], a read-only SQL API that you can deploy in front of a database or other data source (though when the source is a database, it only supports tables that fit in memory).

[0] https://seafowl.io/

[1] https://www.splitgraph.com/connect

[2] https://github.com/roapi/roapi

mildbyte | 3 years ago | on: Show HN: Socrata Roulette – run random SQL on a random government dataset

It's possible! Currently this is running GROUP BY queries using Socrata's query API on the original government data portal. We're adding the ability to import data from these sources into a columnar format in the future, either into Splitgraph itself or syncing the data out into Seafowl (https://seafowl.io/) which uses Parquet and is much faster.

Technically, the ability is already there (you can add a dataset to Splitgraph and select Socrata as a source if you know the dataset ID), but it's not as turnkey as landing on a dataset page and clicking a button. More to come!

mildbyte | 3 years ago | on: Command-line data analytics

It could be the NDJSON parser (DF source: [0]) or a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the standalone `columnq` binary from your comment (EDIT: it does, I was looking in the wrong place! https://github.com/roapi/roapi/releases/tag/columnq-cli-v0.3...), so it could also have something to do with compile-time flags.

FWIW, we use the Parquet format with DataFusion and get very good speeds, comparable to DuckDB [2]: e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC` on a 55M-row subset of NY Taxi trip data.
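The shape of that aggregation can be sketched on toy data with Python's built-in sqlite3 (SQLite's `strftime` stands in for `date_trunc('month', ...)`; the rows here are invented, not the taxi data):

```python
import sqlite3

# Toy version of the trip aggregation: group by month, count trips,
# sum the fares. SQLite has no date_trunc, so strftime('%Y-%m', ...)
# plays that role. The data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tripdata (tpep_pickup_datetime TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO tripdata VALUES (?, ?)",
    [
        ("2023-01-05 09:30:00", 12.5),
        ("2023-01-18 17:00:00", 8.0),
        ("2023-02-02 11:15:00", 20.0),
    ],
)
rows = conn.execute("""
    SELECT strftime('%Y-%m', tpep_pickup_datetime) AS month,
           COUNT(*) AS total_trips,
           SUM(total_amount) AS total_fares
    FROM tripdata
    GROUP BY 1
    ORDER BY 1 ASC
""").fetchall()
print(rows)  # [('2023-01', 2, 20.5), ('2023-02', 1, 20.0)]
```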

[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...

[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0

[2]: https://observablehq.com/@seafowl/benchmarks

mildbyte | 3 years ago | on: IOx: InfluxData’s New Storage Engine

Great question! With Seafowl, the idea is different from what the modern data stack addresses. It's trying to simplify public-facing Web-based visualizations: apps that need to run analytical queries on large datasets and can be accessed by users all around the world. This is why we made the query API easily cacheable by CDNs and Seafowl itself easy to deploy at the edge, e.g. with Fly.io.

It's a fairly different use case from DuckDB (query execution for Web applications vs a fast embedded analytical database for notebooks) and from the rest of the modern data stack (which is mostly about analytics internal to a company). Just to clarify, we're not related to IOx directly (only through both of us using Apache DataFusion).

If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.

We don't currently ship a Web front-end and so can't serve as a replacement to Superset: it's exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of a middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime. Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or use a different viz library like d3/Vega).

Re: the second question about Splitgraph for a data lake, the intention behind Splitgraph is to orchestrate all those tools, and there the use case is indeed the modern data stack in a box. It's somewhat similar to dbt Labs' Sinter [1], which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion and dbt transformations, act as a data warehouse (using PostgreSQL and a columnar store extension), and let users organize and discover data, all in one place.

There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.

Hope this helps!

[0] https://observablehq.com/@seafowl/interactive-visualization-...

[1] https://www.getdbt.com/blog/whats-in-a-name/

[2] https://fauna.com/

mildbyte | 3 years ago | on: IOx: InfluxData’s New Storage Engine

Just wanted to also give a shout out to Apache DataFusion[0] that IOx relies on a lot (and contributes to as well!).

It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
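To make "pluggable optimizer rules" concrete, here's a toy Python sketch (this has nothing to do with DataFusion's actual API, which is in Rust and far richer): a rewrite rule that pushes a Filter node below a Projection in a small plan tree, so filtering happens closer to the scan.

```python
# Toy illustration of an optimizer rewrite rule: turn
# Filter(Projection(child)) into Projection(Filter(child)).
# Plans are plain dicts; a real engine would also check that the
# predicate only references columns the projection keeps.

def push_filter_below_projection(plan):
    """Rewrite Filter(Projection(x)) into Projection(Filter(x))."""
    if plan["op"] == "Filter" and plan["input"]["op"] == "Projection":
        projection = plan["input"]
        return {
            "op": "Projection",
            "columns": projection["columns"],
            "input": {"op": "Filter", "pred": plan["pred"],
                      "input": projection["input"]},
        }
    return plan  # rule doesn't apply; leave the plan unchanged

plan = {
    "op": "Filter", "pred": "amount > 100",
    "input": {"op": "Projection", "columns": ["amount"],
              "input": {"op": "Scan", "table": "trips"}},
}
optimized = push_filter_below_projection(plan)
print(optimized["op"], "->", optimized["input"]["op"])  # Projection -> Filter
```

In DataFusion you register rules like this (plus custom data sources and query nodes) with the engine, and the optimizer applies them while transforming the plan.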

It has very good single-node performance (there's even a way to compile it with SIMD support), and Ballista [1] builds on it to extend that into a distributed query engine.

Plenty of other projects besides IOx use it, including VegaFusion, ROAPI, and Cube.js's pre-aggregation store. We're using it heavily to build Seafowl [2], an analytical database optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).

[0] https://github.com/apache/arrow-datafusion

[1] https://github.com/apache/arrow-ballista

[2] https://github.com/splitgraph/seafowl

mildbyte | 3 years ago | on: Show HN: Seafowl – CDN-friendly analytical database

Hey HN,

A new project from us at Splitgraph: Seafowl, a database that's optimized for Web applications running analytical SQL queries straight from the user's browser. It can be used to power interactive visualizations and dashboards. Features:

- Fast: written in Rust and uses Apache DataFusion. About 5-10x faster than PostgreSQL (some benchmarks available at [1])

- Light: single 50MB binary that starts in 10ms

- Extensible: write user-defined functions in anything that compiles to WASM

- Cache-friendly: REST API designed to work well with CDNs like Cloudflare or caches like Varnish (as well as the user's browser cache).

- Demo of Seafowl providing data to an Observable notebook here [2] (press F12 and refresh the page to see caching in action)
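One generic way to make a query API CDN-cacheable (a sketch of the idea, not Seafowl's exact wire format) is to derive a stable URL from a hash of the normalized SQL text, so identical queries map to the same cache key via a plain GET:

```python
import hashlib

# Generic sketch of a CDN-cacheable query URL (NOT Seafowl's actual
# wire format): normalize the SQL, hash it, and use the hash as a
# stable GET path so caches can key on the URL alone.

def cache_key_url(base_url, sql):
    normalized = " ".join(sql.split())  # collapse whitespace differences
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"{base_url}/q/{digest}"

url_a = cache_key_url("https://demo.example.com", "SELECT COUNT(*) FROM trips")
url_b = cache_key_url("https://demo.example.com", "SELECT  COUNT(*)\nFROM trips")
print(url_a == url_b)  # True: same normalized query, same cacheable URL
```

Because the URL is a pure function of the query, Cloudflare, Varnish, or the browser cache can serve repeated queries without ever reaching the database.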

Happy to answer any questions!

[1] https://observablehq.com/@seafowl/benchmarks

[2] https://observablehq.com/@seafowl/interactive-visualization-...

mildbyte | 3 years ago | on: Litestream live replication has been moved to the LiteFS project

The live replication (as it used to work in Litestream before the LiteFS move, without Consul) would have been perfect for our use case with Seafowl (I played around with Litestream before that but had to settle on PostgreSQL for the sample multi-node deployment [0]):

- rare writes that get directed to a single instance (e.g. using Fly.io's replay header), frequent reads (potentially at edge locations)

- no need to deploy a PostgreSQL cluster and set up logical replication

- SQLite database stored in object storage, reader replicas can boot up using the object storage copy and then get kept in sync by pulling data from the writer

- delay in replication is fine

LiteFS is probably going to be a great solution here, since we're mainly using Fly.io and it has built-in support for LiteFS [1], but are there any alternatives that don't require Consul, still look like an SQLite database to the client, and can work off of an HTTP connection to the primary, so that we don't have to require our users to deploy to Fly?
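The bootstrap step described above can be sketched with Python's built-in sqlite3 (a local file copy stands in for the object-storage snapshot; real replication via Litestream/LiteFS additionally ships the WAL to keep replicas in sync, which is omitted here):

```python
import os
import shutil
import sqlite3
import tempfile

# Sketch of a snapshot-based read replica: a single writer maintains
# the primary SQLite file; a "replica" boots from a copied snapshot
# (standing in for an object-storage download) and opens it read-only.

workdir = tempfile.mkdtemp()
primary_path = os.path.join(workdir, "primary.db")
replica_path = os.path.join(workdir, "replica.db")

# Writer: the single instance that accepts the (rare) writes.
writer = sqlite3.connect(primary_path)
writer.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
writer.execute("INSERT INTO kv VALUES ('region', 'edge-1')")
writer.commit()
writer.close()

# Reader: boots from the snapshot copy and opens it read-only,
# so it still looks like a normal SQLite database to the client.
shutil.copy(primary_path, replica_path)
reader = sqlite3.connect(f"file:{replica_path}?mode=ro", uri=True)
print(reader.execute("SELECT v FROM kv WHERE k = 'region'").fetchone()[0])
```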

[0] https://seafowl.io/docs/guides/scaling-multiple-nodes

[1] https://fly.io/docs/litefs/getting-started/

mildbyte | 3 years ago | on: Hosting SQLite databases on any static file hoster (2021)

Hey, I co-built Seafowl, thanks for the plug!

To clarify, Seafowl itself can't be hosted statically (it's a Rust server-side application), but it works well for statically hosted pages. It's basically designed to run analytical SQL queries over HTTP, with CDN/Varnish caching of SQL query results. The Web page's user downloads just the query result rather than the required fragments of the database (which, for aggregation queries, might mean scanning a large part of it).
